Last week a question was posted on the communities page about creating box plots where the width of each box is proportional to the frequency for the category. The comment was that PROC BOXPLOT can create such a graph, but there seems to be no way to do this using the SGPLOT procedure.
The user is right. The SGPLOT procedure does not provide a way to create box plots where the width of each box is proportional to frequency. However, there is a way to create such a graph using SGPLOT and GTL and a bit of coding.
We know by now that SGPLOT scripts out a GTL template along with the data needed to render the graph. For a box plot, SGPLOT scripts a template that uses the BoxPlotParm statement. This statement can render a box plot from a data set with three columns: X, Statistic and Y.
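One way to look at the template that SGPLOT generates is the TMPLOUT= option on the PROC SGPLOT statement, which writes the generated GTL to a file. The sketch below is only an illustration; the file name 'box_template.sas' is a placeholder chosen here and is not part of the original code.

```sas
/* Sketch: write the GTL template generated by SGPLOT to a file.   */
/* The file name is a placeholder; use any writable path.          */
proc sgplot data=sashelp.cars tmplout='box_template.sas';
  vbox mpg_city / category=type;
  xaxis display=(nolabel);
run;
```

Opening the resulting file shows the template statements and column names that SGPLOT generated for the plot.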
For this article, we will create a box plot of Mileage by Type where each box width is proportional to the frequency for each category. First, we run the SGPLOT procedure for a basic box plot of the same variables as follows. We add the ODS OUTPUT statement to save the processed data into the 'BoxData' data set.
/* Save the data computed by SGPLOT, renaming the long generated column names */
ods output sgplot=boxdata
    (rename=(BOX_MPG_CITY_X_TYPE_SORTORDER__X=X
             BOX_MPG_CITY_X_TYPE_SORTORDER__Y=Y
             BOX_MPG_CITY_X_TYPE_SORTORDER_ST=Stat));

ods graphics / reset width=5in height=3in imagename='Box';
title 'Mileage by Type';
proc sgplot data=sashelp.cars;
  vbox mpg_city / category=type;
  xaxis display=(nolabel);
run;
The graph of Mileage by Type is displayed on the right. Note that all boxes are of the same width. A BOXWIDTH option is available, but it takes a single scalar value that is applied to all boxes.
The SGPLOT procedure also generates the data necessary to render the graph using the BoxPlotParm GTL statement. The generated data set contains three computed columns that have long names ending with "_X", "_Y" and "_ST". I have renamed these above to "X", "Y" and "Stat". For each category value "X", the "Stat" variable contains the names of various box plot statistics such as "MEAN", "MEDIAN", "Q1", "Q3" and so on. The "Y" variable contains the corresponding y value used to draw each box.
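To see this structure directly, one could print the first few rows of the renamed data set. This is only an inspection sketch. It assumes the ODS OUTPUT step above has already created 'boxdata' with the renamed X, Stat and Y columns, and the choice of 15 rows is arbitrary.

```sas
/* Sketch: list the first rows of the data generated by SGPLOT */
proc print data=boxdata(obs=15);
  var x stat y;
run;
```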
Starting with SAS 9.3, the BoxPlotParm statement also supports the BOXWIDTH statistic, a fractional value between 0.0 and 1.0 that determines the width of each box. What we need to do is compute this statistic from N and insert it into the data set. Then, we can use a GTL template with the BoxPlotParm statement to render this graph.
I have used some DATA step code (see the linked code) to first compute the maximum value of N for the data. Then, I have computed the appropriate fractional value for BoxWidth for each category, and inserted a new observation into the data set as shown in the figure on the right. Whenever I encounter "N" in the data set, I write out a new observation with Stat='BOXWIDTH', as seen in obs #9 on the right.
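The actual DATA step code is in the linked file. As a rough sketch of the steps just described, and not the code from that file, the computation might look like the following. The macro variables maxFrac and maxN, the 0.8 width given to the largest category, and the length of 16 for Stat are all assumptions made for this illustration.

```sas
/* Sketch only: not the code from the linked file.                       */
%let maxFrac = 0.8;   /* assumed width fraction for the largest category */

/* Step 1: find the largest N across all categories */
proc sql noprint;
  select max(y) into :maxN trimmed
  from boxdata
  where upcase(stat) = 'N';
quit;

/* Step 2: copy every row and, after each N row, add a BOXWIDTH row */
data boxdataWidth;
  length stat $16;              /* assumption: ensure 'BOXWIDTH' fits    */
  set boxdata;
  output;                       /* keep the original statistic           */
  if upcase(stat) = 'N' then do;
    n    = y;                   /* N value, set only on the new row      */
    y    = &maxFrac * y / &maxN;
    stat = 'BOXWIDTH';
    output;                     /* new observation for the box width     */
  end;
run;
```

Because N is a new variable that is assigned only inside the DO block, it is missing on every other row, so a label drawn from it appears once per category.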
Now, we create a simple GTL template using the BoxPlotParm statement to render this data. The template is shown below. Then, we run the SGRENDER procedure using the new data set and the template to create the graph.
proc template;
  define statgraph BoxWidth;
    begingraph;
      entrytitle 'Mileage by Type';
      entrytitle 'Box Width is Proportional to N';
      layout overlay;
        boxplotparm x=x y=y stat=stat;
        /* Display the value of N for each category at y=5 */
        scatterplot x=x y=eval(y*0+5) /
                    markercharacter=n;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data=boxdataWidth template=BoxWidth;
  format n 3.0;
run;
In the graph above, the box plot shows the various statistics as before, but the width of each box is now proportional to N, using the new BOXWIDTH statistic value. Just for verification, I have also displayed the value of N at the bottom of each box.
Full SAS 9.3 Code: Box_With_Variable_Width_93
6 Comments
Not sure if I'm a fan of the proportional width box plot. Does make it hard to see when they are narrow. Is this style of boxplot becoming popular?
There's always a way (and usually many ways), to do something using SAS!
Thanks Sanjay!
The BOXPLOT procedure provides this option. I recall Bob Rodriguez opining this feature was important for the Survey procedures, and the feature in GTL BOXPLOTPARM was added specifically for such use cases. We did not add the feature to GTL BoxPlot or SGPLOT as it was not clear how much demand was out there.
Yes, the boxes are harder to see when the box widths are narrow, but they provide an important clue to the relative frequency distribution between the categories. It is very easy to miss the fact that (in this graph) Sedans have a very high frequency compared to the other types.
That's a very good point Sanjay... Thanks for elaborating.
The descriptive statistics that are in the boxplot are very useful for visual comparison, e.g. seeing the mean, maximum etc., and as you point out N isn't visualized. I can see how a boxplot with width proportional to N would be useful for surveys when looking at lower response answers and to potentially provide quick visual insight into patterns/trends etc. Mmmmm nice. 🙂
Thanks for this very useful information. It makes a box plot an even richer summary of the data. To avoid boxes that are too narrow, one could set a minimum width or use a logarithmic scale. PROC BOXPLOT includes logarithm and power scales, where 0.5 is square root.
An obvious question: do you plan to add this function to PROC SGPLOT? I would vote for it.
Yes, we would like to add this to SGPLOT. The timeframe is not clear.
Thank you Sanjay, I will share this with my team tomorrow! 🙂