At a recent conference, I talked with a SAS customer who told me that he was using an R package to create a three-panel visualization of a distribution. Unfortunately, he couldn't remember the name of the package, and he has not returned my e-mails, so the purpose of today's article is to discuss some ideas related to this visualization and to solicit critiques of my implementation.
The customer wanted to create a paneled display in SAS that includes three graphs: a histogram, a box plot, and a normal quantile-quantile (Q-Q) plot. We sketched out an idea, and the plot at the left is my implementation of our sketch.
One question I asked was, "Do you want the usual Q-Q plot, or should we flip it?" The usual Q-Q plot is a scatter plot of the ordered data values (on the vertical axis) plotted against the corresponding quantiles of a normal distribution (on the horizontal axis). The purpose of a Q-Q plot is to see whether the points fall along a straight line, which would indicate that the data are normally distributed. I remarked that if we flip the plot so that the data values are displayed horizontally, then the histogram, box plot, and Q-Q plot can all share a common horizontal axis. The customer said that this seemed like a good idea.
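To make the construction concrete, here is a minimal sketch of how the points of a flipped Q-Q plot could be computed by hand. It is not the code behind the display in this article. The data set names _Sorted and _QQ and the variables p and Quantile are hypothetical, and the plotting position (i - 0.375)/(n + 0.25) is one common choice that I am assuming here.

```sas
/* Sketch only: build the points of a flipped normal Q-Q plot by hand.   */
/* _Sorted, _QQ, p, and Quantile are hypothetical names.                 */
proc sort data=Sashelp.Cars(keep=MPG_City where=(MPG_City is not missing))
          out=_Sorted;
   by MPG_City;                        /* order the data values */
run;

data _QQ;
   set _Sorted nobs=n;                 /* n = number of nonmissing values */
   p = (_N_ - 0.375) / (n + 0.25);     /* assumed plotting position for the i_th value */
   Quantile = quantile("Normal", p);   /* matching standard normal quantile */
run;

proc sgplot data=_QQ;
   scatter x=MPG_City y=Quantile;      /* flipped: data on the horizontal axis */
   xaxis label="MPG_City";
   yaxis label="Normal Quantile";
run;
```

Because the data values are on the horizontal axis, this scatter plot can share that axis with a histogram and a box plot of the same variable.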
In SAS, you can create this kind of paneled layout by using the Graph Template Language (GTL) and the SGRENDER procedure.
A GTL template for a three-panel display
When I returned from the conference, I created a GTL template that defines a three-panel display. The top panel, which occupies 40% of the height of the display, is a histogram of the data overlaid with a normal curve and a kernel density estimate. The second panel, which occupies 10% of the height, is a horizontal box plot. The third panel, which occupies the remaining 50%, is a normal Q-Q plot, but it is flipped so that the normal quantiles are plotted on the vertical axis. A diagonal reference line is added to the Q-Q plot. Normally distributed data should fall near the reference line.
The threepanel template takes five dynamic variables. The data and the normal quantiles are referenced by the dynamic variables _X and _QUANTILE, respectively. The title is supplied by using the _Title variable. Lastly, the parameter estimates for the normal curve that best fits the data are supplied by using the _mu and _sigma dynamic variables. The template definition follows:
/* define 'threepanel' template that displays a histogram, box plot, and Q-Q plot */
proc template;
define statgraph threepanel;
   dynamic _X _QUANTILE _Title _mu _sigma;
   begingraph;
      entrytitle halign=center _Title;
      layout lattice / rowdatarange=data columndatarange=union
                       columns=1 rowgutter=5 rowweights=(0.4 0.10 0.5);
         /* top panel: histogram with normal and kernel density overlays */
         layout overlay;
            histogram   _X / name='histogram' binaxis=false;
            densityplot _X / name='Normal' normal();
            densityplot _X / name='Kernel' kernel() lineattrs=GraphData2(thickness=2);
            discretelegend 'Normal' 'Kernel' / border=true halign=right valign=top
                                               location=inside across=1;
         endlayout;
         /* middle panel: horizontal box plot */
         layout overlay;
            boxplot y=_X / boxwidth=0.8 orient=horizontal;
         endlayout;
         /* bottom panel: flipped Q-Q plot with reference line */
         layout overlay;
            scatterplot x=_X y=_QUANTILE;
            lineparm x=_mu y=0.0 slope=eval(1./_sigma) / extend=true clip=true;
         endlayout;
         columnaxes;
            columnaxis;
         endcolumnaxes;
      endlayout;
   endgraph;
end;
run;
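The point and slope in the LINEPARM statement follow from a short calculation, sketched here under the assumption that the data are approximately normal with mean \mu and standard deviation \sigma. In that case the i_th ordered data value x_{(i)} is close to a linear function of the corresponding standard normal quantile q_i, and solving for the quantile gives the flipped relationship:

```latex
x_{(i)} \approx \mu + \sigma\, q_i
\quad\Longleftrightarrow\quad
q_i \approx \frac{x_{(i)} - \mu}{\sigma}
```

With the data on the horizontal axis, the reference line is therefore q = (x - \mu)/\sigma, which has slope 1/\sigma and passes through the point (\mu, 0). These are the values that the template receives through the _mu and _sigma dynamic variables.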
You can download the %ThreePanel macro, which creates a three-panel display for any variable in any data set. If you want to learn more about how to write GTL templates, I recommend the book Statistical Graphics in SAS: An Introduction to the Graph Template Language and the Statistical Graphics Procedures by my colleague, Warren Kuhfeld.
The macro calls PROC UNIVARIATE to compute the normal parameter estimates and the quantiles. That information is then used by PROC SGRENDER to create the plot according to the specifications in the threepanel template. The macro uses a cool trick: I get the data for the Q-Q plot by using an ODS OUTPUT statement on a graph that is created by PROC UNIVARIATE.
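If you want to experiment with the template without the macro, the following sketch shows one way to supply the five dynamic variables. It is not the %ThreePanel macro and it does not use the ODS OUTPUT trick. As a stand-in, I am assuming that PROC RANK normal scores are acceptable quantiles and that the sample mean and standard deviation are acceptable estimates. The data set name _QQ, the variable name Quantile, and the macro variables mu and sigma are hypothetical, and the threepanel template must already be compiled.

```sas
/* Sketch only: render the 'threepanel' template without %ThreePanel.    */
/* _QQ, Quantile, mu, and sigma are hypothetical names.                  */

/* 1. Normal scores for each observation (Blom's formula) */
proc rank data=Sashelp.Cars normal=blom out=_QQ;
   var MPG_City;
   ranks Quantile;                  /* new variable that holds the normal scores */
run;

/* 2. Sample mean and standard deviation, stored in macro variables */
proc sql noprint;
   select mean(MPG_City) format=best32.,
          std(MPG_City)  format=best32.
      into :mu trimmed, :sigma trimmed
      from Sashelp.Cars;
quit;

/* 3. Pass column names as quoted strings and estimates as numbers */
proc sgrender data=_QQ template=threepanel;
   dynamic _X="MPG_City" _QUANTILE="Quantile"
           _mu=&mu _sigma=&sigma
           _Title="Distribution of MPG_City";
run;
```

The column names are passed as quoted strings, whereas _mu and _sigma are passed as unquoted numbers so that the EVAL expression in the template can compute the slope of the reference line.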
The image at the top of this post shows how the template renders the MPG_City variable in the Sashelp.Cars data set. The image was created as follows:
ods graphics on;
%ThreePanel(Sashelp.Cars, MPG_City)
The MPG_City variable is not normally distributed, as is evident from the poor fit of the data to the reference line in the Q-Q plot (lower panel). In contrast, the distribution of the SepalLength variable in the Sashelp.Iris data set appears to be more normal, as shown below:
What do you think? Try it out on your own data and let me know if you have suggestions to improve it. Is this a useful display? Leave a comment.