Now that COVID-19 is spreading in the US, I thought it might be helpful to view the data at a more granular level. Follow along as I plot the county data on a map and discuss how the color-binning can influence people's perception of the data.
Maps like this can help you track where the virus has spread - and perhaps more importantly, where it hasn't spread.
First I needed a source for US county-level coronavirus data. After a bit of web searching, I found that usafacts.org has a page dedicated to this topic (see screen-capture below, with download link circled). And their terms and conditions page "encourages you to use this information for education, analysis and discussion regarding government activities" (which is always a welcome thing!).
Their download link goes to a csv file, which is a very convenient/flexible way to provide the data (thankfully it wasn't a table in a pdf file!). With SAS software, I can either download the csv and import it, or import the data directly from the URL, using the following code:
/* point a fileref at the csv on the web, then import it into a dataset */
filename confdata url "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv";
proc import datafile=confdata out=confirmed_data dbms=csv replace;
run;
In this first visualization, I create a map with 5 color levels, and take the default color/legend binning strategy - that's quintile binning, with approximately 1/5 of the counties in each color.
This map (above) looks somewhat dramatic, with lots of dark red. But if you look closely at the legend, you might notice that the dark red is assigned to counties with 11 to 142 confirmed cases. Wow, that's quite a large range! Does it make sense for a county with 11 cases to be dark red (the same as a county with 142 cases)? Although quantile binning is a good choice in general and makes a good default, perhaps this particular data can be better represented by a different legend binning strategy.
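To see why the top bin can span such a wide range, here is a minimal sketch of quantile binning. It is written in Python rather than SAS, and it is not the code behind the map. The county counts are made-up illustrative values (only the 11 and 142 endpoints echo the legend above), and the helper name quantile_bins is hypothetical.

```python
def quantile_bins(values, n_bins=5):
    """Split values into n_bins groups holding (nearly) equal numbers of items.

    Simplification: this chunks the sorted list by position, so tied values
    could land in different bins. Real legend code would keep ties together.
    """
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        bins.append(ordered[start:stop])
    return bins


# Hypothetical confirmed-case counts for 10 counties (not real data)
counts = [0, 1, 2, 3, 4, 5, 7, 9, 11, 142]

for i, group in enumerate(quantile_bins(counts), start=1):
    print(f"bin {i}: {min(group)} to {max(group)} ({len(group)} counties)")
```

Every bin holds the same number of counties, so when the data has a long tail, the top bin has to stretch to cover it.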
Nelder Legend Binning
For my next map, I have the color legend use the Nelder binning algorithm (Applied Statistics 25:94–7, 1976). Rather than placing 1/5 of the counties in each bin, this algorithm creates 5 bins that span approximately equal ranges of values. Using this algorithm, Mecklenburg (Charlotte's county), Durham, and Wake (Raleigh's county) are the only counties darker than the lightest (yellow-ish) shade. Wow, you almost wouldn't think this is the exact same data as the previous map!
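The effect of equal-range bins can be sketched with plain equal-width binning. This Python snippet is only an approximation of the idea: it does not reproduce the Nelder algorithm itself, the counts are made-up illustrative values, and the helper name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width ranges."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bin
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would compute to index n_bins, so clamp it
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical confirmed-case counts for 10 counties (not real data)
counts = [0, 1, 2, 3, 4, 5, 7, 9, 11, 142]

print(equal_width_bins(counts))
```

With one very large value in the data, nearly every county ends up in the lowest bin, which is why so few counties get the darker colors.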
Cases per 100,000 Residents
Although the 2nd map above is probably a better representation of the data than the first map, it's still a bit biased - counties with larger populations will probably have more coronavirus cases, and show up in the darker colors. Therefore in my next map, I grabbed the county population data and calculated the number of coronavirus cases per 100,000 residents in each county. I think this is a much better number to compare.
This new map shows that Durham county has the highest number of cases per 100,000 residents, and now Wake county is in the 2nd lowest (light orange) color range. And now Cherokee county (at the far western tip of the state) is in the 2nd highest color range (they don't have a high number of cases, but it's high for a county with such a small population).
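The per-capita calculation is simple arithmetic: cases divided by population, times 100,000. Here is a small Python sketch of it. The two counties and their numbers are hypothetical (not the North Carolina data), and the function name cases_per_100k is an assumption for illustration.

```python
def cases_per_100k(cases, population):
    """Return confirmed cases per 100,000 residents, or None if population is unusable."""
    if not population or population <= 0:
        return None
    return 100_000 * cases / population


# Hypothetical counties: (name, confirmed cases, population)
counties = [
    ("Big County", 100, 1_000_000),
    ("Small County", 5, 25_000),
]

for name, cases, population in counties:
    rate = cases_per_100k(cases, population)
    print(f"{name}: {cases} cases, {rate:.1f} per 100,000 residents")
```

In this made-up example the small county has far fewer cases but twice the rate, which is the same kind of shift seen when a small county moves into a darker color range.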
The maps above provide a nice snapshot of the March 25th data - but how has the data been changing over time? Has the number of cases leveled off, or is it still increasing? It would be nice to see a trend line of the data, eh? Here's a simple trend line graph:
Which of the above maps do you think is the most useful? How long do you think the trend line will continue to rise? I always suggest looking at the same data in several different ways - therefore rather than picking one favorite graph, I recommend looking at them all!
For these examples, I used North Carolina to demonstrate (because that's the state I live in), but you can easily modify my SAS code to create similar maps for any US state. Also, here is a link to my 'live' version of the graphs, which I will try to keep updated daily.
Some of my colleagues at SAS have created a Novel Coronavirus Report using SAS Visual Analytics that depicts the status, locations, spread and trend analysis of the coronavirus. Data is updated nightly. The ability to visualize the COVID-19 outbreak can help raise awareness, improve understanding of its impact, and ultimately assist in prevention efforts. View the public SAS Coronavirus dashboard to see maps based in ESRI, coronavirus statistics, and an animated timeline of worldwide spread.