There has been a spate of recent high-profile airline crashes (Malaysia Airlines, TransAsia Airways, Germanwings, ...), so I was surprised when I saw a time series plot of the number of airline crashes by year, which indicates that the annual number of airline crashes has been decreasing since 1993. The data show that the annual number of crashes in recent years is about half of the number from 20 years ago.
A flow diagram that visualizes airline crash data
The time series plot appeared in Significance magazine as part of an interview with designer David McCandless, who produces infographics about data. McCandless has recently published a new book titled Knowledge is Beautiful, which includes the following infographic. The time series plot appears in the upper left. The main portion of the infographic presents information about the causes of commercial air crashes from 1993–2013. In addition to the cause of the crash, McCandless shows the "phase of flight" (take-off, en route, landing) during which the disaster occurred.
In general, infographics are designed to appeal as well as to inform. In the article, McCandless says that he is trying to show "the links, connections and relations between things." His goal is to "mediate between the data and the audience" so that people "pay attention" to the data. Many statisticians share that goal, with a possible difference being that statistical graphics attempt to objectively represent the data so that the data speak for themselves.
McCandless's goal is to create beautiful infographics (and he has succeeded!), but I would like to propose an alternative visualization of these data that trades beauty for compactness.
I have two main issues with McCandless's "flow diagram," which is reminiscent of a Sankey diagram. The first is that it uses a lot of ink to display the relationship between "Cause" and "Phase," which are each five-level categorical variables. A standard frequency analysis would display these data in a 5x5 table or visualize them with a tile-based plot with 25 tiles. I don't think that Edward Tufte would approve of McCandless's plot. Unlike Tufte's favorite plot, Minard's "Napoleon's March to Moscow," which uses a similar design, there is no temporal or geographic change in these data. The "flow" through the various "pipes" merely represents conditional frequencies.
My second objection is simply that some of the blocks in the diagram are the wrong sizes. The area of the blocks is supposed to be proportional to the number of crashes in each category. Notice that the bottom row is properly scaled: If you stack the "take-off" and "en route" boxes, which account for 50% of the cases, they approximately equal the area of the "landing" box. However, some boxes are not proportional to the frequency that they represent:
- The top block represents 100% of the crashes, but its area is too big compared to the cumulative areas of the blocks in the "Causes" or "Phase" row.
- The "Human Error" block (48%) is too big within its row. The other blocks on the "Causes" row account for 51% of the cases, so the sum of their areas should be greater than the area of the "Human Error" block.
- The "Human Error" block (48%) is too big across rows. It is much larger than the "Landing" block in the bottom row, which represents 49% of the cases.
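As a quick check of the last two bullets, here is a minimal sketch that uses only the percentages quoted above. The variable names are illustrative and are not part of the downloadable program. If block areas are proportional to frequencies, then the ratio of two areas should equal the ratio of the corresponding percentages.

```sas
/* Uses only the percentages quoted in the text; no other data are assumed */
data _null_;
   HumanError  = 48;   /* "Human Error" block on the "Causes" row          */
   OtherCauses = 51;   /* all other blocks on the "Causes" row, combined   */
   Landing     = 49;   /* "Landing" block on the "Phase" row               */

   /* Proportional areas imply that area ratios equal percentage ratios */
   RatioWithinRow  = HumanError / OtherCauses;
   RatioAcrossRows = HumanError / Landing;

   /* Write both ratios to the SAS log */
   put RatioWithinRow= 5.2 RatioAcrossRows= 5.2;
run;
```

Both ratios are slightly less than 1, so a correctly scaled "Human Error" block would be a little smaller than the other "Causes" blocks combined and a little smaller than the "Landing" block.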
Visualizing airline crashes with a mosaic plot
At the risk of building a glass house after hurling those stones, I suggest that a mosaic plot is a standard statistical graphic that can display the same air crash data. You can download the data from McCandless's web site. The data need a little cleansing before you can analyze them, and I have provided my SAS program that creates the mosaic plot and other related plots. McCandless's diagram uses data through mid-2013, whereas the mosaic plot uses data through March 2015.
Many SAS procedures use ODS statistical graphics to display graphs as readily as they display tables. For example, the following call to PROC FREQ creates a 5x5 frequency table and a mosaic plot:

proc freq data=Crash order=data;
   tables Cause*Phase / plots=(mosaic(square));
run;
In a mosaic plot, the area of each tile is proportional to the number of cases. I have ordered both variables according to frequency, whereas McCandless ordered the "Phase" variable temporally (standing, take-off, en route, landing). I did this so that the most important causes and phases appear in the first row (bottom) and first column (left). The relative widths of the columns indicate the one-way relative frequencies of the "Phase" variable. Colors connect the categories for the "Cause" variable.
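To make the geometry concrete, here is a minimal sketch of how the tile dimensions follow from a two-way table. The Toy data set and its counts are hypothetical and are NOT the crash data; the names Toy, Cells, Tiles, Width, Height, and Area are illustrative. The width of a column is the marginal proportion of that phase, and the height of a tile is the conditional proportion of the cause within that phase, so the area is the cell's share of all cases.

```sas
/* Hypothetical counts for illustration only; these are NOT the crash data */
data Toy;
   input Cause $ Phase $ Count;
   datalines;
Human  Landing 30
Human  EnRoute 10
Mech   Landing 15
Mech   EnRoute 20
Other  Landing  5
Other  EnRoute 20
;

/* OUT= saves COUNT and PERCENT for each cell; OUTPCT adds PCT_ROW and PCT_COL */
proc freq data=Toy order=data noprint;
   weight Count;
   tables Cause*Phase / out=Cells outpct;
run;

data Tiles;
   set Cells;
   Height = Pct_Col / 100;       /* share of this Cause within its Phase column */
   Width  = Percent / Pct_Col;   /* share of this Phase among all cases         */
   Area   = Width * Height;      /* equals Count divided by the total count     */
run;

proc print data=Tiles noobs;
   var Phase Cause Count Width Height Area;
run;
```

In this toy table the total count is 100, so the Area column should equal Count/100 for every tile, which is the sense in which the tile areas are proportional to the number of cases.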
The mosaic plot is far from perfect. It is hard to visually trace the categories for the "Cause" variable. For example, finding the "mechanical" category for each flight phase requires some eyeball gymnastics. Also, although the mosaic plot makes it possible to visualize the proportions of crashes for each cause and phase, the absolute frequencies are not available on the graph by default. The frequency counts are available in the 5x5 table that PROC FREQ produces. Alternatively, with additional effort you can create a mosaic plot that includes frequency counts for the largest mosaic tiles, which I think is far superior to the default display.
What do you think? Do you prefer the mosaic plot that is produced automatically by PROC FREQ, the one with frequency counts, or do you prefer McCandless's flow diagram? Which helps you to understand the causes of airline crashes? Feel free to adapt my SAS program and create your own alternative visualization!
(April 13, 2015) Several visualization experts wrote in with alternate suggestions and comments.
Antony Unwin says that he prefers to visualize these data by using multiple bar charts. I also like multiple bar charts, which show the conditional distribution of cause given the flight phase. In SAS, you can use the SGPANEL procedure to produce a panel of bar charts, as follows, although I might choose to combine the "unknown" and "standing" phases to save room:

ods graphics / reset width=600px;
proc sort data=Crash;
   by descending CauseN PhaseN;
run;

proc sgpanel data=Crash;
   panelby Phase / layout=rowlattice columns=1 novarname
                   sort=data rowheaderpos=right onepanel;
   vbar Cause / stat=percent;
   rowaxis discreteorder=data grid;
run;
Michael Friendly noted that although mosaic plots are useful for visualizing cross-tabulations, they also are useful for evaluating various statistical models. For example, a statistician might want to assess whether there is an association between flight phase and the cause of an airline crash. As I have written before, you can color the cells in a mosaic plot according to the residuals in a model, thereby visualizing how the observed data deviate from the model. For more about using mosaic plots in categorical models, see Friendly (1999).
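As a rough sketch of the quantities that such a coloring would use, the following statements compute Pearson residuals under the independence model, in which the expected count for a cell is (row total)(column total)/n. The sketch assumes the Crash data set and the Cause and Phase variables from the PROC FREQ call earlier in this post; the data set names Cells and Resid and the variable PearsonResid are illustrative. It computes the residuals but does not draw the colored mosaic plot.

```sas
/* Assumes the Crash data set with the Cause and Phase variables used above */
/* OUTEXPECT adds the EXPECTED variable to the OUT= data set;               */
/* SPARSE keeps cells that have zero observed count                         */
proc freq data=Crash order=data noprint;
   tables Cause*Phase / out=Cells outexpect sparse;
run;

data Resid;
   set Cells;
   /* Pearson residual for the independence model: (observed - expected)/sqrt(expected) */
   if Expected > 0 then
      PearsonResid = (Count - Expected) / sqrt(Expected);
run;

proc print data=Resid noobs;
   var Cause Phase Count Expected PearsonResid;
run;
```

Cells with large positive residuals have more crashes than independence predicts, and cells with large negative residuals have fewer.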
Xan Gregg pointed me to an article by Stephen Few in which Few argues that an innovative design is not necessarily a better design. (Be sure to read the comments!) Few is more critical of McCandless than I am, but I agree that a different design is not necessarily a better design. If a standard statistical graphic can display the data well, then I prefer to use it rather than to design something new and more complex. My main goal is to communicate as simply as possible. However, I have no problem with artists who choose to create data-inspired art, especially if they convince the public that data are beautiful.