You can use histograms to visualize the distribution of data. A comparative histogram enables you to compare two or more distributions, which usually represent subpopulations in the data. Common subpopulations include males versus females or a control group versus an experimental group. There are two common ways to construct a comparative histogram: you can create a panel of histograms, or you can overlay histograms in a single graph. This article shows how to create comparative histograms in SAS.
Sanjay Matange and I have each written multiple previous articles on this topic. This article collects many of the ideas in one place. In the SAS 9.2 and SAS 9.3 releases, the graph template language (GTL) was required to construct some of these graphs. However, thanks to recent features added to PROC SGPLOT, PROC SGPANEL, and PROC UNIVARIATE, you can now create comparative histograms in SAS without writing any GTL.
Panel of histograms
A panel of histograms enables you to compare the data distributions of different groups. You can create the histograms in a column (stacked vertically) or in a row. I usually prefer a column layout because it enables you to visualize the relative locations of modes and medians in the data.
In SAS, you can create a panel of histograms by using PROC UNIVARIATE or by using PROC SGPANEL. Both procedures require that the data be in "long form": one continuous variable that specifies the measurements and another categorical variable that indicates the group to which each measurement belongs. If your data are in "wide form," you can convert the data from wide form to long form.
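As a minimal sketch of the wide-to-long conversion, the following DATA step stacks two columns into long form. It treats the SepalLength and PetalLength columns of Sashelp.Iris as if they were wide-form data. The data set name Long and the variable names Part and Measurement are hypothetical and chosen only for illustration.

```sas
/* Sketch: stack two "wide" columns into long form.
   Long, Part, and Measurement are hypothetical names. */
data Long;
   set sashelp.iris(keep=SepalLength PetalLength);
   length Part $ 5;                                    /* group indicator */
   Part = "Sepal"; Measurement = SepalLength; output;  /* one row per measurement */
   Part = "Petal"; Measurement = PetalLength; output;
   keep Part Measurement;
run;
```

In the Long data set, Part can play the role of the categorical variable and Measurement the role of the continuous variable.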
To use PROC UNIVARIATE, specify the categorical variable on the CLASS statement and the continuous variable on the HISTOGRAM statement. For example, the following statements compare the distribution of the SepalLength variable for each of the three values of the Species variable in the Sashelp.Iris data:
proc univariate data=sashelp.iris;
   class Species;
   var SepalLength;      /* computes descriptive statistics */
   histogram SepalLength / nrows=3 odstitle="PROC UNIVARIATE with CLASS statement";
   ods select histogram; /* display only the histograms */
run;
The result is shown at the beginning of this section. The graph suggests that the median value of the SepalLength variable differs between levels of the Species variable. Furthermore, the variance of the "Virginica" group appears to be larger than the variance of the other groups.
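To back up the visual impression with numbers, one option is to compute the median and variance for each group. The following PROC MEANS call is a sketch that is not part of the original example; it uses only the N, MEDIAN, and VAR statistic keywords.

```sas
/* Sketch: per-species sample size, median, and variance of SepalLength */
proc means data=sashelp.iris n median var;
   class Species;
   var SepalLength;
run;
```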
You can create similar graphs by using the SGPANEL procedure, which supports a wide range of options that control the layout. Specify the Species variable in the PANELBY statement and the SepalLength variable in the HISTOGRAM statement. The following call to PROC SGPANEL creates a comparative histogram:
title "PROC SGPANEL with PANELBY statement";
proc sgpanel data=sashelp.iris;
   panelby Species / rows=3 layout=rowlattice;
   histogram SepalLength;
run;
The graph produced by PROC SGPANEL is similar to the previous graph.
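If you prefer to arrange the histograms in a row rather than in a column, a variation such as the following should work. This is a sketch that assumes the LAYOUT=COLUMNLATTICE and COLUMNS= options of the PANELBY statement; the title text is made up for illustration.

```sas
/* Sketch: same panel, but with the three histograms side by side */
title "PROC SGPANEL: histograms in a single row";
proc sgpanel data=sashelp.iris;
   panelby Species / columns=3 layout=columnlattice;
   histogram SepalLength;
run;
```

With this layout the panels share a vertical axis, so it is easier to compare bar heights than to compare the locations of modes and medians.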
With the GTL, you can create more complicated panel displays than are shown here. For example, Sanjay shows how to create mirrored histograms, which are sometimes used for population pyramids.
Overlay histograms
For comparing the distributions of three or more groups, I recommend a panel of histograms. However, for two groups you might want to overlay the histograms. You can use the TRANSPARENCY= option in PROC SGPLOT statements so that both histograms are visible, even when the bars overlap. The portions of the bars that overlap are shown in a blended color.
In the HISTOGRAM statement of PROC SGPLOT, you can use the GROUP= option to specify the variable that indicates group membership. The GROUP= option overlays the histograms for each group, as the following example shows:
proc sgplot data=sashelp.iris;
   where Species in ("Setosa", "Versicolor");                  /* restrict to two groups */
   histogram SepalLength / group=Species transparency=0.5;     /* SAS 9.4m2 */
   density SepalLength / type=kernel group=Species;            /* overlay density estimates */
run;
In this graph I added density estimates to help the eye visualize the basic shape of the two histograms. The purple region shows the overlap between the two distributions. For more than two categories, you might want to omit the histograms and just overlay the density estimates. The graph combines the first two rows of the panel in the previous section. The overlay enables you to compare the two subpopulations without your eye bouncing back and forth between rows of a panel.
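For the case of more than two categories, a density-only overlay can be sketched as follows. The call simply drops the WHERE and HISTOGRAM statements from the previous example, so all three species are shown; the title text is made up for illustration.

```sas
/* Sketch: overlay kernel density estimates for all groups, no histograms */
title "Kernel density estimates by species";
proc sgplot data=sashelp.iris;
   density SepalLength / type=kernel group=Species;
run;
```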
The GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4m2. You can create the same graph in PROC UNIVARIATE by using the OVERLAY option in the HISTOGRAM statement. The OVERLAY option requires SAS 9.4m3.
proc univariate data=sashelp.iris;
   class Species;
   var SepalLength;
   histogram SepalLength / kernel overlay;   /* SAS 9.4m3 */
run;
Overlay histograms of different variables
Because PROC SGPLOT enables you to use more than one HISTOGRAM statement, you can also overlay the histograms of different variables.
When comparing histograms, it is best that both histograms use the same bin width and anchor locations. Prior to SAS 9.3, you could overlay histograms with common bins by using the graph template language (GTL). However, SAS 9.3 introduced support for the BINWIDTH= and BINSTART= options in the HISTOGRAM statement in PROC SGPLOT. Therefore, you can force the histograms to have a common bin width, as shown in the following example:
title "Overlay Histograms with PROC SGPLOT";
proc sgplot data=Sashelp.Iris;
   histogram PetalLength / binwidth=5 transparency=0.5
             name="petal" legendlabel="Petal Length";
   histogram SepalLength / binwidth=5 transparency=0.5
             name="sepal" legendlabel="Sepal Length";
   density PetalLength / type=kernel lineattrs=GraphData1;   /* optional */
   density SepalLength / type=kernel lineattrs=GraphData2;   /* optional */
   xaxis label="Length (mm)" min=0;
   keylegend "petal" "sepal" / across=1 position=TopRight location=Inside;
run;
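The example above sets only the BINWIDTH= option. As a hedged sketch of adding the BINSTART= option as well, the following variation anchors both histograms on the same grid. The value 10 is an arbitrary choice for illustration. The sketch assumes that BINSTART= specifies the center of the first bin, which you should confirm in the PROC SGPLOT documentation for your release.

```sas
/* Sketch: use the same BINWIDTH= and BINSTART= in both statements
   so that the bins of the two histograms line up */
title "Overlay Histograms with Common Bin Width and Anchor";
proc sgplot data=Sashelp.Iris;
   histogram PetalLength / binwidth=5 binstart=10 transparency=0.5;
   histogram SepalLength / binwidth=5 binstart=10 transparency=0.5;
   xaxis label="Length (mm)" min=0;
run;
```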
In summary, SAS provides multiple ways to use histograms to compare the distributions of data. To obtain a panel of histograms, the data must be in the "long" format. You can then:
- Use the CLASS statement in PROC UNIVARIATE to specify the grouping variable. This is a good choice if you also want to compute descriptive statistics or fit a distribution to the data.
- Use PROC SGPANEL, which provides you with complete control over the layout of the panel, axes, and other graphical options.
If you only have two groups and you want to overlay partially transparent histograms, you can do the following:
- Use the GROUP= option in the HISTOGRAM statement of PROC SGPLOT (requires SAS 9.4m2).
- Use the OVERLAY option in the HISTOGRAM statement of PROC UNIVARIATE (requires SAS 9.4m3).
Lastly, if you have two variables to compare, you can use two HISTOGRAM statements. Be sure to use the BINWIDTH= option (and optionally the BINSTART= option), which requires SAS 9.3.
The comparative histogram is not a perfect tool. You can also use spread plots and other techniques. However, for many situations a panel of histograms or an overlay of histograms provides an effective way to visually compare the distributions of data in several groups.