10 tips for creating effective statistical graphics

These are a few of my favorite things.
          —Maria in The Sound of Music

For my annual Christmas-themed post, I decided to forgo fractal Christmas trees and animated greeting cards and instead present a compilation of some of my favorite data visualization tips for advanced SAS users. Hopefully, this is a gift that you will enjoy all year long.

There are many guides to making clear and effective data visualizations. Some articles describe the basic chart types (histograms, scatter plots, line plots, ...) and when to use them. Other articles focus on design principles such as labeling axes, adding titles, choosing colors, and overlaying grid lines. For an excellent introduction, I recommend the books by Tufte (The Visual Display of Quantitative Information, 1983), Cleveland (Visualizing Data, 1993), and Wainer (Graphic Discovery, 2005).

These articles and books are essential reading, especially if you are new to data visualization. However, practical tips for the advanced practitioner are harder to find. Throughout my professional career, I have created many advanced graphics and visualizations. My goal is always to use best practices and create a compelling graph that enables the reader to understand the underlying data or analysis. This article lists some of my favorite tips for creating effective statistical graphics. Each tip links to a full article that discusses the issue in more detail and includes a SAS program that shows how to create the graph.

Data visualization versus statistical graphics

Although data visualization and statistical graphics are closely related, they are not the same. A data visualization focuses on presenting the data so that the reader can draw his or her own conclusions. In effect, a good data visualization allows the data to tell their own story. Histograms and scatter plots are examples of data visualizations.

A statistical graphic typically shows the result of a statistical model or analysis. If you overlay a kernel density estimate on a histogram, it becomes a statistical graphic. Similarly, adding a regression curve and confidence bands on a scatter plot turns it into a statistical graphic.

Most of the following tips apply both to data visualizations and to statistical graphics.

Tip 1: Simpler is better

The old Shaker hymn reminds us, "'Tis a gift to be simple." Long-time readers might have noticed that I do not use esoteric plots such as Chernoff faces. Of course, "simple" depends on your audience, so another name for this Tip might be "Consider your audience." Here are a few articles in which I advocate simplicity:

Tip 2: Use well-designed color palettes

You can use colors to visualize a continuous response variable in a scatter plot, a heat map, or a choropleth map. Depending on the response, you might want to use a sequential (unidirectional) or a diverging (bidirectional) color ramp. SAS provides the COLORMODEL= option in many graphical statements, which you can use to specify several two-color and three-color color ramps. You can also create your own color ramp, and this is where a potential problem lurks. I'm sure you've seen someone present a poorly designed graph that uses bright garish colors. Don't be that person!

Human perception of color is complex, so it is important to choose a palette whose colors have similar visual impact. To avoid an unbalanced palette, choose from a set of scientifically constructed palettes such as the ColorBrewer system of palettes. You should also consider whether the palette you choose is colorblind-safe, since about 8% of men have a color-vision deficiency.
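As a small illustration of assigning colors from a predefined palette, here is a minimal Python sketch (not the SAS code from the linked articles). It assumes a five-class sequential "Blues" palette in the ColorBrewer style and equal-width bins; the function name color_for and the binning rule are hypothetical choices made for this example.

```python
# Five-class sequential "Blues" palette, light to dark. The hex values are
# assumed to match the ColorBrewer scheme; verify them before relying on them.
BLUES_5 = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]

def color_for(value, lo, hi, palette=BLUES_5):
    """Map a value in [lo, hi] to a palette color by using equal-width bins."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range values
    idx = min(int(frac * len(palette)), len(palette) - 1)
    return palette[idx]

# Small responses get the lightest color; large responses get the darkest.
for v in (0, 50, 100):
    print(v, color_for(v, 0, 100))
```

A diverging palette would follow the same pattern, except that the midpoint of the range should map to the neutral middle color.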

Tip 3: Use small multiples

[Figure: Comparison of 9 sample quantile methods]

Tufte (1983) championed the use of "small multiples," which means displaying a panel of graphs in which each cell shows data from a subset of a larger data set. Panels are useful for comparing differences and similarities among states, countries, demographic groups, and more. In SAS, you can use several tools to create panels of graphs:

Tip 4: Use a horizontal display for a categorical variable

Tips 4–6 often go together. The first of the three is that horizontal is better than vertical when you are plotting data for a categorical variable that has many levels or long labels. The advantages are clear when you compare a horizontal bar chart to a vertical bar chart. The same ideas apply to box plots and dot plots.

Tip 5: Order categories by using a statistic

Howard Wainer often reminded his readers that "we are almost never interested in seeing Alabama first" (Wainer (2005), Graphic Discovery, p. 72). His comment is a reminder that when we plot data for a large number of categories (states, countries, school districts, ...), it is usually better to order the categories according to some statistic of interest. Often the statistic is a count, a mean, or a rate. A graph that orders the categories by the statistic is more informative than a graph that relies on the default alphabetical ordering.
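The ordering step can be done before any plotting. The following is a minimal Python sketch (not the SAS code from the linked articles) that sorts categories by their mean. The scores are made-up values, and the helper name order_by_statistic is hypothetical.

```python
from statistics import mean

# Hypothetical scores for a few categories (made-up values for illustration).
scores = {
    "Alabama": [61, 64, 58],
    "Colorado": [72, 75, 70],
    "Maine": [68, 66, 71],
    "Texas": [65, 63, 67],
}

def order_by_statistic(groups, stat=mean, descending=True):
    """Return (category, statistic) pairs sorted by the statistic."""
    summary = [(name, stat(values)) for name, values in groups.items()]
    return sorted(summary, key=lambda pair: pair[1], reverse=descending)

ordered = order_by_statistic(scores)
for name, value in ordered:
    print(f"{name:<10} {value:6.2f}")
```

With these values, Alabama moves from first place (alphabetical order) to last place (ordered by mean), which is the kind of information that the default ordering hides.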

Tip 6: Plot rates, not counts

As a continuation of Tip 5, remember that often categories differ in size. If you are visualizing quantities such as population change, mortality, or students who pass standardized tests, you should standardize the quantities and visualize a rate rather than a raw count. This enables the viewer to see how all categories compare on a relative basis. If you plot the raw counts, your graph is almost always dominated by the largest units (states, countries, school districts, ...) and it is impossible to understand the trends for the smallest units.
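To see why the standardization matters, consider this minimal Python sketch (not the SAS code from the linked articles). The regions, event counts, and populations are made-up values, and the choice of "per 100,000" as the scale is an assumption for the example.

```python
# Hypothetical (events, population) pairs; made-up values for illustration.
regions = {
    "Large": (5000, 10_000_000),
    "Medium": (900, 1_000_000),
    "Small": (150, 100_000),
}

def rate_per_100k(events, population):
    """Standardize a raw count as a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return events * 100_000 / population

rates = {name: rate_per_100k(e, p) for name, (e, p) in regions.items()}

for name, (events, _) in regions.items():
    print(f"{name:<7} count={events:>5}  rate per 100k={rates[name]:6.1f}")
```

In this example the largest region has the most events but the lowest rate, so a graph of raw counts and a graph of rates would tell opposite stories.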

Tip 7: Visualize missing values

It may sound like an oxymoron, but you can visualize missing values in your data. There are a few ways to do this:

Tip 8: Prefer lasagna charts over spaghetti charts

If you plot a time series for many subjects, you obtain a spaghetti plot. A spaghetti plot is aptly named because it often resembles a tangled mess that is hard to digest. You can use a lasagna plot to obtain an informative visualization of the time series that are associated with many subjects. A lasagna plot is a heat map for which each row represents a subject, each column represents a time, and the color represents a response variable. One reason that a lasagna plot is so powerful is that it enables you to use Tips 2-5 to make the visualization maximally effective.
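The data layout behind a lasagna plot is a subject-by-time matrix. Here is a minimal Python sketch (not the SAS code from the linked articles) that reshapes long-format records into that matrix. The records are made-up values, and the function name lasagna_matrix is hypothetical. Missing measurements are stored as None so that a plotting routine can give them a distinct color.

```python
# Hypothetical long-format records: (subject, time, response). Made-up values.
records = [
    ("A", 1, 0.2), ("A", 2, 0.5), ("A", 3, 0.9),
    ("B", 1, 0.1), ("B", 3, 0.4),  # subject B has no measurement at time 2
    ("C", 1, 0.7), ("C", 2, 0.8), ("C", 3, 0.6),
]

def lasagna_matrix(records):
    """Return (subjects, times, matrix), with one row per subject and one
    column per time. Cells without a measurement are None."""
    subjects = sorted({s for s, _, _ in records})
    times = sorted({t for _, t, _ in records})
    lookup = {(s, t): y for s, t, y in records}
    matrix = [[lookup.get((s, t)) for t in times] for s in subjects]
    return subjects, times, matrix

subjects, times, matrix = lasagna_matrix(records)
for subject, row in zip(subjects, matrix):
    print(subject, row)
```

Each cell of the matrix is then colored according to its value. The rows here are sorted by subject name; sorting them by a statistic such as the row mean is often more informative.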

Tip 9: Prefer dot plots over dynamite plots

Most experts agree that the baseline value for a bar chart must be zero, and the baseline value must be shown. Sometimes business analysts use a bar chart to display a mean and add a line segment at the top to visualize the uncertainty in the estimate. This is called a dynamite plot. If you want to display a mean and uncertainty estimates, then use a dot plot. A dot plot does not have to show the zero baseline, so the graph can display more details about how the mean varies among groups. And if you display an error bar, please be clear about which measure of uncertainty you are using!
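One way to stay clear about the uncertainty measure is to compute it explicitly and name it in the label. The following minimal Python sketch (not the SAS code from the linked articles) computes the mean and the standard error of the mean (SEM) for each group. The group names and values are made up, and the choice of SEM rather than a standard deviation or a confidence interval is an assumption for the example.

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_sem(values):
    """Return (mean, standard error of the mean) for a sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    return mean(values), stdev(values) / sqrt(n)

# Hypothetical measurements for two groups (made-up values).
groups = {
    "Control": [10, 12, 11, 13],
    "Treated": [15, 14, 17, 16],
}

for name, values in groups.items():
    m, sem = mean_with_sem(values)
    # State the uncertainty measure in the label, not only in the caption.
    print(f"{name:<8} mean = {m:.2f}, error bar = mean +/- 1 SEM ({sem:.3f})")
```

The dot for each group is placed at the mean and the error bar extends one SEM in each direction, so the axis can zoom in on the range of the means instead of starting at zero.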

Tip 10: Use a log scale if the data span several orders of magnitude

Does your intended audience know how to interpret a graph that uses a log scale? If so, you can use a log scale to display data that range over several orders of magnitude. The same transformation can be useful when you are using a response variable to add color to a graph. If the response variable varies over several orders of magnitude, you can use a log-transform of the response to assign colors.

A log transformation applies only to positive values, but you can use a related transformation (called the log-modulus transformation) to transform positive and negative values that span several orders of magnitude. However, only use this for audiences that are mathematically sophisticated.
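The text above does not give a formula, but a common definition of the log-modulus transformation is sign(x) * log(1 + |x|). Here is a minimal Python sketch under that assumption (not the SAS code from the linked articles). The use of base 10 and the function name log_modulus are choices made for this example.

```python
from math import copysign, log10

def log_modulus(x):
    """Log-modulus transform: sign(x) * log10(1 + |x|).

    The transform is defined for all real x, maps 0 to 0, and preserves sign.
    """
    return copysign(log10(1 + abs(x)), x)

# Values that span several orders of magnitude in both directions.
for x in (-99999, -99, -9, 0, 9, 99, 99999):
    print(f"{x:>7} -> {log_modulus(x):6.2f}")
```

Because the transform is symmetric about zero, positive and negative values of the same magnitude land at equal distances from the origin.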

Summary

Many introductory articles describe basic design principles that apply to all graphs. This article shares 10 intermediate and advanced tips that apply to statistical graphics. By following these tips, you can create effective statistical graphics that clearly visualize data and the statistical measures that analyze them. Each tip links to articles that use SAS software to analyze the data and to create statistical graphics.

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of SAS/IML software. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.
