'Unbox' Your Box Plots


At the 2012 SAS Global Forum, a user asked how to show the original data behind a box plot. Outliers, together with the box features, give a feel for the data, but in some situations you need to see exactly where the data points lie in relation to the boxes.

Since there could be many data points for a given categorical value, some sort of jittering would be needed to “un-clump” the points. Shown below is what you get by overlaying the raw data and the box plot without any jitter:

Raw Data with Box overlays (No Jitter)

One solution for this problem is given in “The Graph Template Language: Beyond the SAS/GRAPH® Procedures” by J. M. Pratt. This approach uses a categorical X axis along with a numeric X2 axis that is not displayed.

Starting with SAS 9.3, there is another way to solve this problem. We now support box plots on an interval axis! Here is what this solution looks like:

Raw data (with jitter) overlaid with Box

The full program is available here.  Here are the main points in this solution:

  • Map your categories to a numeric variable.
  • Turn off the display of box outliers – we will be showing all points including the outliers.
  • Introduce a small displacement in the X coordinates of the scatter points to reduce collisions.  Ideally, this would be a true jitter which takes the degree of collision of the points into account. Here, we make do with a simpler method of adding random noise to every X coordinate.
  • Explicitly specify the X axis tick values and map them to the original category values using a format.
  • Make the scatter points slightly transparent so that they do not overpower the box features (the mean, for example). This also helps us notice any overlapping points in spite of the added random noise.
  • The above output has the box overlaid on the scatter points. If you choose to overlay the scatter points on top of the box, you also need to force the X axis to be of type linear.
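
The data preparation in the first and third bullets is not specific to SAS. As a minimal sketch, written in Python rather than SAS and not taken from the program linked above, the function below maps each category to an integer and adds uniform random noise to produce a jittered X coordinate. The function name `jitter_categories`, the noise width of 0.3, the seed, and the sample rows are all assumptions made for illustration.

```python
import random


def jitter_categories(rows, width=0.3, seed=12345):
    """Map categories to integer codes and add random noise to X.

    rows  -- iterable of (category, y) pairs
    width -- total width of the uniform noise band (hypothetical default)
    seed  -- fixed seed so the jitter is reproducible

    Returns (points, tick_labels), where each point is (x, xj, y):
    x is the integer code used to place the box, xj is the jittered
    coordinate used to place the scatter marker.
    """
    rows = list(rows)
    rng = random.Random(seed)

    # Step 1: map each category to a numeric value (1, 2, 3, ...).
    categories = sorted({cat for cat, _ in rows})
    code = {cat: i + 1 for i, cat in enumerate(categories)}

    # Step 3: simple noise on every X coordinate. This is not a "true"
    # jitter: points are displaced whether or not they collide.
    points = []
    for cat, y in rows:
        x = code[cat]
        xj = x + rng.uniform(-width / 2, width / 2)
        points.append((x, xj, y))

    # Step 4: tick value -> original category label, playing the role
    # of the format applied to the axis tick values.
    tick_labels = {value: cat for cat, value in code.items()}
    return points, tick_labels


# Made-up sample data with coincident points in two categories.
rows = [("Asia", 24), ("Asia", 24), ("Europe", 18), ("USA", 20), ("USA", 20)]
points, tick_labels = jitter_categories(rows)
```

Here the unjittered `x` would position each box, the jittered `xj` would position the scatter markers, and `tick_labels` supplies the category names for the explicitly specified tick values. Keeping the noise band well under 1 keeps the points from drifting into the neighboring category.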

So when you need to explode your boxes, remember this trick!


About Author

Prashant Hebbar

Principal Software Developer

Prashant Hebbar is a software developer in Data Visualization at SAS. He began his career at SAS in 1997. His areas of expertise are the Graph Template Language (GTL), the ODS Graphics Designer and image formats.


11 Comments

  1. David Schmitt on

    While I think this is an interesting and innovative way to display a dataset, I'm uncomfortable with the horizontal distance between the point and the wire of the boxplot. Doesn't this falsely imply that there's information related to that distance? For example, in the Asia boxplot there are several points that appear to be separated out to the right of the rest of the data. But if I understand your approach, the x axis variation was added simply to help make each individual circle more visible.

    Am I missing something?

    • Prashant Hebbar on

      Thank you for the comment, David.

      The intent, as you surmised, is to spread out the colliding points to get a 'feel' for how many data values there may be at any Y coordinate. Another approach is to use transparency alone to show overlap, which works to a degree but fails when a lot of points coincide.

      As I noted in the post, a true jitter which only displaces the points that overlap would be a better option. We hope to support that in some form in a future release.

      It is possible that a viewer could interpret the X displacement as significant, but hopefully, a clear legend will indicate what the intent was. I did put in 'jittered' in the legend label, but perhaps it needs to be more conspicuous...

  2. Powerful stuff, Prashant. Great to see the ODS team's continued focus on responding to user requests.

    I think John Tukey would be proud.

    The functionality to show the individual data points along with the summary of the data that the box plot provides is very powerful. It might help to downplay the data points a bit by making them semi-transparent. What do you think? Can you do that?

    • Prashant Hebbar on

      Glad to know you liked it, John!

      I can't take credit for the idea, of course! The 2012 SAS Global Forum paper I referenced in the post also does this, but our approach shows off the interval axis support for box plots!

      The graph as shown has a data transparency of 0.5 (0=opaque .. 1=transparent). This can be easily dialed up or down, as the need may be.

  3. excellent to have this. jitter points with overlaid boxplots are one of my favorite ways to display data - so clear, yet shows so much

    this should've been part of SAS 9.1 or something, it's long overdue -.-

  4. I've used this code frequently in my work and it's been tremendously helpful so thank you very much for sharing it! I was wondering if you know of any way to label the outliers (and ONLY the outliers)? My dilemma, as I'm sure you can guess, is that I either end up with a label on EVERY point (which looks really cluttered), or I get the box plot portion of the code to generate a new data point for the outliers with the label, but these wind up clashing with the scatter plot data point for that outlier. Any idea how to label only the outliers without creating an awkward duplicate data point for them?

    • Prashant Hebbar on

      Interesting problem! After some discussions, we have *a* solution that has been posted as Part Deux. Hope that works for you!

  5. Pingback: 'Unbox' Your Box Plots - part deux - Graphically Speaking

  6. First, thank you very much for this blog of tremendous interest and for liaising with the SAS Users!
    To produce a similar graph comparing different substances within two groups, I used another approach with SAS 9.3 (SAS 9.4 is not available at my working place): statements 'highlow' (to construct the boxplot) and 'scatter' (to display jittered points) within a Proc sgpanel with a layout= 'columnlattice', as option of the 'panelby' statement.
    I am nearly happy with the result. My issue is that by default each box is displayed in two different colours (instead of one per box), one colour for the bottom part and another one for the upper part of the box.
    Unfortunately I cannot paste the graph picture here. Do you see any solution to the issue?
    Many thanks in advance.
