I've previously described how to overlay two or more density curves on a single plot. I've also written about how to use PROC SGPLOT to overlay custom curves on a graph. This article describes how to overlay a density curve on a histogram. For common distributions, you can overlay a density by using SAS procedures. However, this article shows how to use the Graphics Template Language (GTL) to overlay a custom density estimate. An example is shown in the figure to the left. The information in this article is adapted from Chapter 3 of my book Simulating Data with SAS.
Overlaying common densities
The UNIVARIATE procedure supports fitting about a dozen distributions to data, including the beta, exponential, gamma, lognormal, and Weibull distributions. There are several examples in the PROC UNIVARIATE documentation that demonstrate how to fit parametric densities to data.
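As a minimal sketch of this approach, the following call overlays two fitted parametric densities on a histogram. The Sashelp.Cars data set and the MPG_City variable are assumptions chosen only for illustration; they are not used elsewhere in this article.

```sas
/* Illustration only: Sashelp.Cars and MPG_City are assumed example data */
proc univariate data=Sashelp.Cars;
   var MPG_City;
   /* Overlay fitted lognormal and gamma density curves on the histogram */
   histogram MPG_City / lognormal gamma;
run;
```

Each distribution keyword on the HISTOGRAM statement requests a fitted curve, and the procedure also reports parameter estimates and goodness-of-fit statistics for each fit.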
For more complicated models, other SAS procedures are available. You can use the FMM procedure to fit finite mixtures of distributions. In SAS/ETS software, the SEVERITY procedure can fit many distributional models to data. For nonparametric models, the KDE procedure can fit one- and two-dimensional kernel density estimates.
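For the nonparametric case, a kernel density estimate can be overlaid on a histogram in a single call. This is a sketch that again assumes the Sashelp.Cars data set and the MPG_City variable purely for illustration.

```sas
/* Illustration only: Sashelp.Cars and MPG_City are assumed example data */
ods graphics on;
proc kde data=Sashelp.Cars;
   /* PLOTS=HISTDENSITY requests a histogram with the kernel density overlaid */
   univar MPG_City / plots=histdensity;
run;
```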
These procedures enable you to overlay density curves for common distributions. For me, these procedures suffice 99% of the time. However, there are situations where you might want to do the following:
- Compute a probability density function (PDF) outside of any procedure. The density curve might come from a computation or from evaluating a formula.
- Add features to the graph, such as reference lines, custom tick-mark placement, legends, and titles.
The SGPLOT procedure does not enable you to use a HISTOGRAM statement and a SERIES statement in the same call. However, you can overlay a histogram and a curve by using the GTL.
Overlaying custom densities
The following template is a simplified version of a template that I used in my book. You can use it as is, or modify it to include additional features such as grid lines.
proc template;
define statgraph ContPDF;
   dynamic _X _T _Y _Title _HAlign _binstart _binstop _binwidth;
   begingraph;
      entrytitle halign=center _Title;
      layout overlay / xaxisopts=(linearopts=(viewmax=_binstop));
         histogram _X / name='hist' SCALE=DENSITY binaxis=true
            endlabels=true xvalues=leftpoints
            binstart=_binstart binwidth=_binwidth;
         seriesplot x=_T y=_Y / name='PDF' legendlabel="PDF"
            lineattrs=(thickness=2);
         discretelegend 'PDF' / opaque=true border=true
            halign=_HAlign valign=top across=1 location=inside;
      endlayout;
   endgraph;
end;
run;
The template, which is called ContPDF, uses the LAYOUT OVERLAY statement to overlay a histogram and a series (a continuous curve). The name of the histogram variable is provided in the dynamic variable _X. The histogram is plotted on the density scale. The dynamic variables _binstart, _binstop, and _binwidth are used to set the histogram bins to convenient values. The variables for the series plot are provided by the dynamic variables _T and _Y. The other two dynamic variables are _HAlign, which determines the horizontal position of the legend inside the plot area, and _Title, which specifies the title of the graph.
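One possible way to extend the template with grid lines and a vertical reference line is sketched below. This variant is not the template from the book: the template name ContPDFGrid and the dynamic variable _RefX are hypothetical names introduced here for illustration.

```sas
/* Hypothetical variant of ContPDF: adds grid lines and a reference line */
proc template;
define statgraph ContPDFGrid;
   dynamic _X _T _Y _Title _HAlign _binstart _binstop _binwidth _RefX;
   begingraph;
      entrytitle halign=center _Title;
      layout overlay /
            xaxisopts=(griddisplay=on linearopts=(viewmax=_binstop))
            yaxisopts=(griddisplay=on);
         histogram _X / name='hist' scale=density binaxis=true
            endlabels=true xvalues=leftpoints
            binstart=_binstart binwidth=_binwidth;
         seriesplot x=_T y=_Y / name='PDF' legendlabel="PDF"
            lineattrs=(thickness=2);
         /* _RefX (hypothetical) gives the location of a dashed vertical line */
         referenceline x=_RefX / lineattrs=(pattern=dash);
         discretelegend 'PDF' / opaque=true border=true
            halign=_HAlign valign=top across=1 location=inside;
      endlayout;
   endgraph;
end;
run;
```

This template would be rendered in the same way as ContPDF, with one extra dynamic value passed to PROC SGRENDER, such as _RefX=0.8 to mark a location near the mean of the folded normal distribution, which is sqrt(2/pi), or about 0.798.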
To see how this template might be used, recall that I have previously shown how to generate random values from the folded normal distribution. The following DATA step generates 1,000 values from the folded normal distribution:
data MyData(drop=i);
   call streaminit(1);
   do i = 1 to 1000;
      x = abs( rand("Normal") );
      output;
   end;
run;
The PDF for a folded distribution is easy to compute: simply fold the distribution across zero and add the densities of each tail:
data PDFFolded;
   do t = 0 to 4 by 0.1;
      y = pdf("Normal", t) + pdf("Normal", -t);  /* =2*PDF for the normal dist */
      output;
   end;
run;
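A quick sanity check on a hand-computed density is to confirm that it integrates to approximately 1. The following sketch is not part of the original workflow; it applies the trapezoid rule to the folded normal PDF on [0, 4] using the same 0.1 step size. The variable names are chosen for illustration.

```sas
/* Sanity check: trapezoid-rule integral of the folded normal PDF on [0, 4] */
data _null_;
   dt = 0.1;
   area = 0;
   do i = 0 to 39;                      /* 40 subintervals of width 0.1 */
      t0 = i*dt;
      t1 = t0 + dt;
      f0 = 2*pdf("Normal", t0);         /* density at the left endpoint  */
      f1 = 2*pdf("Normal", t1);         /* density at the right endpoint */
      area = area + dt*(f0 + f1)/2;     /* add the area of one trapezoid */
   end;
   put area=;                           /* written to the log; should be close to 1 */
run;
```

The result is slightly less than 1 because the interval stops at t=4 and omits the small amount of probability in the upper tail.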
The goal is to construct a histogram of the MyData data set and overlay the curve in the PDFFolded data set. There is a standard technique that you can use to overlay a curve on data: concatenate the two data sets and call an SG procedure on the combined data.
data All;
   set MyData PDFFolded;
run;

proc sgrender data=All template=ContPDF;
   dynamic _X="X" _T="T" _Y="Y" _HAlign="right"
           _binstart=0 _binstop=4 _binwidth=0.25
           _Title="Sample from Folded Normal Distribution";
run;
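To see why the concatenation technique works, it can help to look at the rows where the two data sets meet. The following sketch prints the last three observations that come from MyData and the first three that come from PDFFolded. It relies on MyData containing exactly 1,000 observations, as generated above.

```sas
/* Observations 998-1000 come from MyData; 1001-1003 come from PDFFolded */
proc print data=All(firstobs=998 obs=1003);
   var x t y;
run;
```

In the rows from MyData, t and y are missing. In the rows from PDFFolded, x is missing. Each plot statement in the template drops observations that have missing values for its own variables, so the histogram uses only the x values and the series plot uses only the (t, y) pairs.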
The graph that results is shown at the beginning of this article. Notice that this graph cannot be produced without using the GTL, because the folded normal distribution is not one of the distributions that is supported by PROC UNIVARIATE or PROC SEVERITY.
I try to avoid using the GTL, but it is necessary in this case. Perhaps one day the SGPLOT procedure will enable overlaying a custom curve on a histogram, but until then the GTL provides a way to overlay a custom density estimate on a histogram.