At the 2012 SAS Global Forum, a user asked about showing the original data behind a box plot. While you can use the outliers in conjunction with the box features to get a feel for the data, in some situations you may need to see exactly what the data looks like in relation to the boxes.
Since there could be many data points for a given categorical value, some sort of jittering would be needed to “un-clump” the points. Shown below is what you get by overlaying the raw data and the box plot without any jitter:
One solution for this problem is given in “The Graph Template Language: Beyond the SAS/GRAPH® Procedures” by J. M. Pratt. This approach uses a categorical X axis along with a numeric X2 axis that is not displayed.
Starting with SAS 9.3, there is another way to solve this problem. We now support box plots on an interval axis! Here is what this solution looks like:
The full program is available here. Here are the main points in this solution:
- Map your categories to a numeric variable.
- Turn off the display of box outliers – we will be showing all points including the outliers.
- Introduce a small displacement in the X coordinates of the scatter points to reduce collisions. Ideally, this would be a true jitter that takes the degree of collision of the points into account. Here, we make do with a simpler method: adding random noise to every X coordinate.
- Explicitly specify the X axis tick values and map them to the original category values using a format.
- Make the scatter points slightly transparent so that they do not overpower the box features (the mean, for example). This also helps us notice any overlapping points in spite of the added random noise.
- The above output has the box overlaid on the scatter points. If you choose to overlay the scatter points on top of the box, you also need to force the X axis to be of type linear.
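The data preparation behind the first, third, and fourth points can be sketched outside of SAS. The snippet below is only an illustration in Python, not the program linked above: the function name `jitter_positions`, the noise width of 0.3, and the sample category values are all assumptions made for the example. It maps each category to an integer position, adds uniform random noise to every X coordinate, and builds the tick-value-to-label lookup that a format would provide.

```python
import random


def jitter_positions(categories, width=0.3, seed=0):
    """Hypothetical helper: numeric positions, jittered positions, tick labels."""
    # Map each distinct category to an integer position 1, 2, 3, ...
    levels = sorted(set(categories))
    position = {level: i + 1 for i, level in enumerate(levels)}

    # Seeded generator so the displacement is reproducible.
    rng = random.Random(seed)

    # The boxes are drawn at the exact numeric positions.
    x_numeric = [position[c] for c in categories]

    # The scatter points get uniform noise in [-width/2, +width/2).
    # This is plain random noise, not a collision-aware jitter.
    x_jittered = [x + width * (rng.random() - 0.5) for x in x_numeric]

    # Tick value -> original category label (the role a format plays).
    tick_labels = {pos: level for level, pos in position.items()}
    return x_numeric, x_jittered, tick_labels


# Assumed sample data, for illustration only.
types = ["SUV", "Sedan", "SUV", "Truck", "Sedan", "Sedan"]
x_numeric, x_jittered, tick_labels = jitter_positions(types)
print(tick_labels)
```

Keeping the noise width well below 1 keeps every jittered point closer to its own box than to a neighboring one, which is why the scatter points still read as belonging to their category.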
So when you need to explode your boxes, remember this trick!