Over at the SAS Discussion Forums, someone asked how to use SAS to fit a Poisson distribution to data. The questioner asked how to fit the distribution but also how to overlay the fitted density on the data and to create a quantile-quantile (Q-Q) plot.
The questioner mentioned that the UNIVARIATE procedure does not fit the Poisson distribution. That is correct: the UNIVARIATE procedure fits continuous distributions, whereas the Poisson distribution is a discrete distribution. Nevertheless, you can fit Poisson data and visualize the results by combining several SAS procedures. This article shows one way to accomplish this. The method also works for other discrete distributions such as the negative binomial and the geometric distribution.
Do I receive emails at a constant rate?
For data I will use the number of emails that I received one day for each of 19 half-hour periods from 8:00 am to 5:30 pm. If I receive emails at a constant rate during the day, the number of emails in each 30-minute period follows a Poisson distribution. The following DATA step defines the data; PROC FREQ tabulates and plots the sample distribution:
/* number of emails received in each half-hour period 8:00am - 5:30pm on a weekday. */ data MyData; input N @@; /* 8a 9a 10a 11a 12p 1p 2p 3p 4p 5p */ datalines; 7 7 13 9 8 8 9 9 5 6 6 9 5 10 4 5 3 8 4 ; /* Tabulate counts and plot data */ proc freq data=MyData; tables N / out=FreqOut plots=FreqPlot(scale=percent); run;
The mean of the data is about 7. A Poisson(7) distribution looks approximately normal—which these data do not. On the other hand, there are less than 20 observations in the data, so let's proceed with the fit. (I actually looked at several days of email before I found a day that I could model as Poisson, so these data are NOT a random sample!)
Fit the data
The first step is to fit the Poisson parameter to the data. You can do this in PROC GENMOD by by using the DIST= option to specify a Poisson distribution. Notice that I do not specify any explanatory variables, which means that I am fitting the mean of the data.
/* 1. Estimate the rate parameter with PROC GENMOD: http://support.sas.com/kb/24/166.html */ proc genmod data=MyData; model N = / dist=poisson; output out=PoissonFit p=lambda; run;
At this point you should look at the goodness-of-fit and parameter estimates tables that PROC GENMOD creates to see how well the model fits the data. I will skip these steps.
Compute the fitted density
The P= option on the OUTPUT statement outputs the mean, which is also the parameter estimate for the fitted Poisson distribution. The mean is about 7.1. The following statements set a macro variable to that value and create a data set (PMF) that contains the Poisson(7.1) density for various x values. In a subsequent step, I'll overlay this fitted density on the empirical density.
/* 2. Compute Poisson density for estimated parameter value */ /* 2.1 Create macro variable with parameter estimate */ data _null_; set PoissonFit; call symputx("Lambda", Lambda); stop; run; /* 2.2 Use PDF function for range of x values */ data PMF; do t = 0 to 13; /* 0 to max(x) */ Y = pdf("Poisson", t, &Lambda); output; end; run;
Overlay the empirical and fitted densities
I want to overlay the discrete density on a bar chart of the data. One way to visualize the discrete density is as a scatter plot of (x, pdf(x)) values that represent the fitted density at x=0, 1,...,13. Unfortunately, you cannot use the VBAR and the SCATTER statements in the same SGPLOT call to overlay a bar chart and a scatter plot. However, in SAS 9.3 you can use the VBARPARM statement together with the SCATTER statement. (Thanks to "PGStats" for this suggestion.) The VBARPARM statement requires that you compute the heights of the bars yourself, but the heights are easily constructed from the PROC FREQ output that was created earlier:
/* 3. Use bar chart to plot data. To overlay a bar chart and scatter plot, use the VBARPARM stmt instead of VBAR. */ data Discrete; merge FreqOut PMF; Prop = Percent / 100; /* convert to same scale as PDF */ run; /* 3.2 Overlay VBARPARM and scatter plot of (x, pdf(x)) */ proc sgplot data=Discrete; /* VBARPARM is SAS 9.3 stmt */ vbarparm category=N response=Prop / legendlabel='Sample'; scatter x=T y=Y / legendlabel='PMF' markerattrs=GraphDataDefault(symbol=CIRCLEFILLED size=10); title "Emails per 30-Minute Period and Poisson Distribution"; run;
Create a discrete Q-Q plot
On the Discussion Forum, the questioner asked for a quantile-quantile plot. I don't know whether I've ever seen a Q-Q plot for a discrete distribution before; usually they are shown for continuous distributions. However, you can create a discrete Q-Q plot by following exactly the same steps that I described in my previous article on how to compute a Q-Q plot:
/* 4. Create a Q-Q plot */ /* 4.1 Compute theoretical quantiles */ proc sort data=MyData; by N; run; /* 1 */ data QQ; set MyData nobs=nobs; v = (_N_ - 0.375) / (nobs + 0.25); /* 2 */ q = quantile("Poisson", v, &Lambda); /* 3 */ run; proc sgplot data=QQ noautolegend; /* 4 */ scatter x=q y=N; lineparm x=0 y=0 slope=1; /* SAS 9.3 statement */ xaxis label="Poisson Quantiles" grid; yaxis label="Observed Data" grid; title "Poisson Q-Q Plot of Emails"; run;
I've created a discrete Q-Q plot, but is it useful? A drawback appears to be that the discrete Q-Q plot suffers from overplotting, whereas a continuous Q-Q plot does not. A continuous CDF function is one-to-one so the quantiles of the ranks of the data are unique. In contrast, the CDF function for a discrete distribution is a step function, which leads to duplicated quantiles and overplotting.
For example, in the discrete Poisson Q-Q plot for my email, there are 19 observations, but only 13 points are visible in the Q-Q plot due to overplotting. If I analyze 10 days of my email traffic, I could get 190 observations, but the Q-Q plot might show only a fraction of those points. (In simulated data, there were only 25 unique values in 190 observations drawn from a Poisson(7) distribution.)
The fact that I don't often see discrete Q-Q plots bothered me, so I did a little research. I found a reference to discrete Q-Q plots on p. 126 of Computational Statistics Handbook with MATLAB where it says:
Quantile plots...are primarily used for continuous data. We would like to have a similar technique for graphically comparing the shapes of discrete distributions. Hoaglin and Tukey  developed several plots to accomplish this [including] the Poissonness plot.
That sounds interesting! A future blog post will present an alternative way to visualize the fit of a Poisson model.