Modeling the distribution of data? Create a Q-Q plot

14

"I think that my data are exponentially distributed, but how can I check?"

I get asked that question a lot. Well, not specifically that question. Sometimes the question is about the normal, lognormal, or gamma distribution. A related question is "Which distribution does my data have," which was recently discussed by John D. Cook on his blog.

Regardless of the exact phrasing, the questioner wants to know "What methods are available for checking whether a given distribution fits the data?" In SAS, I recommend the UNIVARIATE procedure. It supports three techniques that are useful for comparing the distribution of data to some common distributions: goodness-of-fit tests, overlaying a curve on a histogram of the data, and the quantile-quantile (Q-Q) plot. (Some people drop the hyphen and write "the QQ plot.")

Constructing a Q-Q Plot for any distribution

The UNIVARIATE procedure supports many common distributions, such as the normal, exponential, and gamma distributions. In SAS 9.3, the UNIVARIATE procedure supports five new distributions. They are the Gumbel distribution, the inverse Gaussian (Wald) distribution, the generalized Pareto distribution, the power function distribution, and the Rayleigh distribution.

But what if you want to check whether your data fits some distribution that is not supported by PROC UNIVARIATE? No worries, creating a Q-Q plot is easy, provided you can compute the quantile function of the theoretical distribution. The steps are as follows:

  1. Sort the data.
  2. Compute n evenly spaced points in the interval (0,1), where n is the number of data points in your sample.
  3. Compute the quantiles (inverse CDF) of the evenly spaced points.
  4. Create a scatter plot of the sorted data versus the quantiles computed in Step 3.

If the data are in a SAS/IML vector, the following statements carry out these steps:

proc iml;
y = {1.7, 1.0, 0.5, 3.5, 1.9, 0.7, 0.4, 
     5.1, 0.2, 5.6, 4.6, 2.8, 3.8, 1.4, 
     1.6, 0.9, 0.3, 0.4, 1.9, 0.5};
 
n = nrow(y);
call sort(y, 1); /* 1 */
v = ((1:n) - 0.375) / (n + 0.25);  /* 2 (Blom, 1958) */
q = quantile("Exponential", v, 2); /* 3 */

If you plot the data (y) against the quantiles of the exponential distribution (q), you get the following plot:

"But, Rick," you might argue, "the plotted points fall neatly along the diagonal line only because you somehow knew to use a scale parameter of 2 in Step 3. What if I don't know what parameter to use?!"

Ahh, but that is the beauty of the Q-Q plot! If you plot the data against the standardized distribution (that is, use a unit scale parameter), then the slope of the line in a Q-Q plot is an estimate of the unknown scale parameter for your data! For example, modify the previous SAS/IML statements so that the quantiles of the exponential distribution are computed as follows: q = quantile("Exponential", v); /* 3 */

The resulting Q-Q plot shows points that lie along a line with slope 2, which implies that the distribution of the data is approximately exponentially distributed with a shape parameter close to 2.

Choice of quantiles for the theoretical distribution

The Wikipedia article on Q-Q plots states, "The choice of quantiles from a theoretical distribution has occasioned much discussion." Wow, is that an understatement! Literally dozens of papers have been written on this topic. SAS uses a formula suggested by Blom (1958): (i - 3/8) / (n + 1/4), i=1,2,...,n. Another popular choice is (i-0.5)/n, or even i/(n+1). For large n, the choices are practically equivalent. See O. Thas (2010), Comparing Distributions, p. 57–59 for a discussion of various choices. In PROC UNIVARIATE, the QQPLOT statement supports the RANKADJ= and NADJ= options to accomodate different offsets for the nummerator and denominator in the formula.

Repeating the construction by using the DATA step

These computations are simple enough to perform by using the DATA step and PROC SORT. For completeness, here is the SAS code:

data A;
input y @@;
datalines;
1.7 1.0 0.5 3.5 1.9 0.7 0.4 
5.1 0.2 5.6 4.6 2.8 3.8 1.4 
1.6 0.9 0.3 0.4 1.9 0.5
;
 
proc sort data=A; by y; run; /* 1 */
data Exp;
set A nobs=nobs;
v = (_N_ - 0.375) / (nobs + 0.25); /* 2 */
q = quantile("Exponential", v, 2); /* 3 */
run;
 
proc sgplot data=Exp noautolegend; /* 4 */
scatter x=q y=y;
lineparm x=0 y=0 slope=1; /* SAS 9.3 statement */
xaxis label="Exponential Quantiles" grid; 
yaxis label="Observed Data" grid;
run;

Use PROC RANK to generate normal quantiles

For the special case of a normal Q-Q plot, you can use PROC RANK to generate the normal quantiles. The Blom transformation of the data is accomplished by using the NORMAL=BLOM option, as described in this SAS Usage note on creating a Q-Q plot.

Use PROC UNIVARIATE for Simple Q-Q Plots

Of course, for this example, I don't need to do any computations at all, since PROC UNIVARIATE supports the exponential distribution and other common distributions. The following statements compute goodness-of-fit tests, overlay a curve on the histogram, and display a Q-Q plot:

proc univariate data=A;
var y;
histogram y / exp(sigma=2); 
QQplot y / exp(theta=0 sigma=2);
run;

However, if you think your data are distributed according to some distribution that is not built into PROC UNIVARIATE, the techniques in this article show how to construct a Q-Q plot to help you assess whether some "named" distribution might model your data.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

14 Comments

  1. Pingback: Fitting a Poisson distribution to data in SAS - The DO Loop

  2. Pingback: The Poissonness plot: A goodness-of-fit diagnostic - The DO Loop

  3. I came across you explanation of Q-Q plots when i was researching capacity planning ideas. I am interested in learning about the results of these Q-Q plots. Why do I try to find the dsn here ? Is it for predicting future dsn's ?

    Any book on such statistics accessible to a software dev. ?

    • Rick Wicklin

      What is a 'dsn'? For learning more about Q-Q plots, click on the links in this article. I'd start with the PROC UNIVARIATE documentation, then the Wikipedia links. If you want to know even more, try to get a copy of Chambers, Cleveland, Kleiner, and Tukey (1983), Graphical Methods for Data Analysis.

  4. Pingback: How to get data values out of ODS graphics - The DO Loop

  5. Pingback: A three-panel visualization of a distribution - The DO Loop

  6. Pingback: Exploratory Data Analysis: Quantile-Quantile Plots for New York’s Ozone Pollution Data | The Chemical Statistician

  7. How does proc Univariate calculate x-axis? I tried using proc Univariate using same data above, the range of X-axis is different then what I'm getting.

    With my own data I was trying to fit Weibull distribution. Above method gave me an x-axis range of 0-225 whereas proc Univariate gave me a range of 0-15. When I rescaled my x-axis (quantiles) that I got from using above method to range of 0-15, the shapes from proc uni and the custom method are different. What am I missing over here?

      • Hi Rick,

        I was also quite confused about the range of x-axis of the qqplot from univariate and the ones obtained from the Data Step described above. Then I realized that the univariate gives quantile of a standard distribution. So, by correctly specifying the second parameter of the quantile function, we can get the same result as the proc univariate.

        However, I have one question. For distributions such as gamma, the shape parameter needs to be specified. I have used the shape parameter estimated by the proc univariate in the quantile function. Then I get a similar qqplot from univariate and the above method. If I use the estimated scale parameter in the quantile function then I get quantiles of the non-standardized distribution.. Am I interpreting this corrctly?

        Thanks a lot in advance,

        • Rick Wicklin

          Ahh. Now I understand the original question. The doc for the UNIVARIATE procedure has some examples of interpreting Q-Q plots. Standardizing the distribution can be a little tricky. If f(x) is a standardized PDF, then (1/sigma)*f( (x-theta)/sigma ) is the PDF with location theta and scale sigma. Similarly, if Q(p) is the standardized quantile function, then theta + sigma*Q(p) is the quantile function with location theta and scale sigma. HTH.

          • Thanks much Rick.

            I am comparing several qq plots so , I used proc univariate for most, but for igauss I used the above step via calculating quantile. To make sure I am comparing the same qqplots, for iguass I calculated quantile as:
            q=quantile("igauss",v,shape parm) ; where I calculated the shape parameter via proc univariate.

            Is that correct? Or some other transformation is necessary?

            Thanks.

  8. Pingback: Sampling variation in small random samples - The DO Loop

  9. Pingback: An overview of regression diagnostic plots in SAS - The DO Loop

Leave A Reply

Back to Top