Let's start with Measles ... here's a screen-capture of WSJ's measles graph:
In general, their graph is eye-catching, and I learned a lot (in general) about the data by looking at it. But upon studying the graph a little deeper, I noticed several problems:
- It is difficult to distinguish zero values (very light blue) from missing-data (light gray).
- There was not enough room for all the state values along the left edge, so they just left out about 1/2 of the state labels.
- When you hover your mouse over the colored blocks to see the hover-text, the box turns light blue - which could mislead you to think that is the data-color of the box.
- Although it is explained in the introductory paragraph, the graphs themselves don't mention that the unit of measurement is cases per 100k people per year.
- They don't keep their graphical unit polygons square, but rather stretch them out to fill the entire page - this makes it difficult to know how many years the graph spans.
- It is sometimes difficult to quickly determine which color represents more/less disease cases, using the semi ~rainbow color scale (especially in the yellow/green/blue end of the scale).
Therefore I set about creating my own SAS version of the graphic, to see if I could do a better job. I located the data on the Tycho website, downloaded the csv files, and imported them into SAS datasets. I then created a rectangular polygon for each block in the graphic, so I could plot them using Proc Gmap. I assigned custom color bins, so I could control exactly which ranges of values were mapped to which gradient shade of my color (I used shades of a single color, rather than multiple/rainbow colors), and used a hash-pattern for the missing values. Here's my measles graph:
In addition to the visual aspects of the graph, I also made a slight change to the way the data is summarized. The Tycho data was provided as weekly number of cases per 100k people, and (it appears) WSJ summed those weekly numbers to get the annual number they plotted. But the data contains a lot of 'missing' values, and the Tycho faq page specifically mentions that 'missing' values are different from a value of zero ...
"The '-' value indicates that there is no data for that particular week, disease, and location. The '0' value indicates a report of zero cases or deaths for that particular week, disease, and location."
If you have 11 weeks with 'missing' data (such as Alabama in 1932), and you simply sum the other 41 weeks that do have data, and call that the annual rate ... I'm thinking that probably under-reports the true annual value somewhat(?) Therefore, in hopes of getting a more valid value to plot, I calculate the weekly average (rather than the yearly sum).
So, now for the big question ... have you ever had any of these diseases?