Here in the US, there's a lot of talk about the flu each year. First, people discuss whether or not to get the flu shot. Then there are discussions about whether you or your friends have the flu (or something else). Then come the discussions about which strain of flu is going around: is it the strain the shots were designed to protect against, or some other strain? It's difficult to get definitive data about cases of the flu, since not every case is reported. But whereas not every illness is reported, all deaths are reported, so the number of deaths attributed to the flu is probably a reasonably comparable metric from year to year. So let's plot the data.
CDC's Flu Deaths graph
First I did a bit of searching for possible data sources, and looked to see if there might already be some graphs. I found some plots on the Centers for Disease Control and Prevention (CDC) website that were close to what I was looking for. But their graph combines influenza (flu) and pneumonia, whereas I was looking for just the flu. Also, it showed the percent of deaths, whereas I was more interested in the total number of deaths. And since my brain doesn't really think in "week numbers" (52 weeks per year), the bottom axis of the graph didn't make much sense to me. Finally, their graph only went through the end of 2018.
My Graph (2019)
Since that graph wasn't quite what I was looking for, I decided to download the raw data and create my own graph ... I wanted to include the latest data (to see what the numbers were doing in 2019), and I wanted to make my graph easier to understand. The data is reported by year and week, so it's a pretty simple matter to create a plot for one year - here's how to plot the current year (2019). Note that rather than using a plain line plot like the original graph, I start my y-axis at zero and shade the area under the line. Also, I'm plotting the number of deaths attributed to the flu, rather than the percent of deaths attributed to the flu & pneumonia.
title1 h=14pt c=gray33 "Influenza (Flu) Deaths Per Week in the US in 2019";
proc sgplot data=my_data (where=(year=2019));
band x=week lower=0 upper=flu_deaths / fill fillattrs=(color=red);
yaxis values=(0 to 400 by 100);
xaxis values=(1 to 52 by 1) valueattrs=(size=6pt);
run;
Well, that's the 2019 data! ... But is ~300 deaths per week better or worse than usual? Let's plot some other years of data, so we have something to compare the 2019 values to!
My Graph (10 years)
When you have your time stored in two separate variables (year & week), it's a bit tricky to plot more than one year on the same plot. One way you could do it would be to create a new variable that combines the year and week (for example, year + week/52), and plot that new variable on a continuous axis. Another way would be to create a separate plot for each year (using a 'by year' statement) - then you would have a bunch of small multiples to compare. But I chose a slightly different approach ... I used Proc SGPanel to create a separate graph for each year, but 'panel' them together so that they appear to be one continuous graph, sharing a response axis.
title1 h=14pt c=gray33 "Influenza (Flu) Deaths Per Week in the US";
proc sgpanel data=my_data noautolegend;
format flu_deaths latest comma10.0;
panelby year / onepanel columns=10 novarname
headerattrs=(size=12pt color=gray33) noborder;
band x=week lower=0 upper=flu_deaths / fill fillattrs=(color=red)
tip=(flu_deaths year week);
rowaxis labelpos=top values=(0 to 1750 by 250)
label='Deaths' labelattrs=(size=11pt color=gray33);
colaxis values=(1 to 52 by 1) display=(nolabel noticks novalues);
refline 52 / axis=x lineattrs=(color=graycc thickness=1px);
refline 0 to 1750 by 250 / axis=y lineattrs=(color=graycc thickness=1px);
run;
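For comparison, the first alternative mentioned earlier (a single variable that combines year and week) might be sketched roughly as follows. This is not the approach used for the graph in this post. The dataset names my_sorted and my_data_cont and the variable year_week are made up for illustration, and using (week - 1)/52 rather than week/52 is just one choice, so that week 1 lands at the start of its year.

```sas
/* Sketch only: assumes my_data has numeric year, week, and flu_deaths. */
/* my_sorted, my_data_cont, and year_week are hypothetical names.       */
proc sort data=my_data out=my_sorted;
  by year week;
run;

data my_data_cont;
  set my_sorted;
  /* e.g. 2019 week 1 -> 2019.0, 2019 week 27 -> 2019.5 */
  year_week = year + (week - 1) / 52;
run;

proc sgplot data=my_data_cont;
  band x=year_week lower=0 upper=flu_deaths / fill fillattrs=(color=red);
  xaxis label="Year";
  yaxis label="Deaths";
run;
```

One caveat with this sketch: if the data ever contains a week 53, that point would land on the same x value as week 1 of the following year, so the divisor would need adjusting.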
Now that we have ~10 years of data to compare, we can see that last year (2018) had a relatively high number of flu deaths, but this year seems to be much lower (keep your fingers crossed, because this year's flu season isn't over yet!). Speaking of this year's flu season not being over, it's a little difficult to tell exactly how far into this year the graph & data go. And was the final 2019 data point at the 'peak' of the graph, or has the number of deaths started to drop (below the peak)? Let's customize the graph to make these things a little more evident.
My Graph (customized)
First, I determined the most recent year & week in the data, and added an extra variable to the dataset such that only that particular week had a value. I then added a scatter plot marker (blue circle) at that value, and labeled it as 'latest'.
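proc sort data=my_data out=my_data;
by year week;
run;
data my_data; set my_data end=last;
if last then do;
latest=flu_deaths; /* only the final (most recent) week gets a value */
end;
run;
/* Added to the Proc SGPanel step (the options that label the marker are not shown here) */
scatter x=week y=latest / markerattrs=(color=blue symbol=circle);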
And then I took those values for the most recent year & week, and stored them as macro variables, so I could easily show them in the title.
proc sql noprint;
select year into :maxyear separated by ' ' from my_data where latest^=.;
select week into :maxweek separated by ' ' from my_data where latest^=.;
quit;
title2 h=12pt c=gray99 ls=0.5 "Data source: cdc.gov (&maxyear, week &maxweek snapshot)";
But there was one other user-friendly thing I wanted to do. I wanted the user to be able to click the 'Data source:' text, and have it link to the actual data. The link= option for titles in ODS Graphics hasn't been implemented yet, so I had to annotate the 'Data source:' text, rather than using a title statement.
length label $100 anchor x1space y1space function $50 textcolor $12;
textcolor="gray33"; textsize=11; textweight='normal';
label="Data source: cdc.gov (as of &maxyear, week &maxweek)"; output;
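The annotation fragment above might be fleshed out along these lines. This is a sketch under assumptions, not the author's actual code: the dataset name anno_source, the x1/y1 position, and the URL are placeholders, and the Proc SGPanel step is pared down to the essentials. It relies on the url column of an SG annotation dataset and the sganno= option, and assumes the graph is written to an HTML destination with image maps turned on.

```sas
/* Sketch only: anno_source, the position, and the URL are placeholders. */
data anno_source;
  length label $100 anchor x1space y1space function $50 textcolor $12 url $200;
  function = "text";
  x1space = "graphpercent"; y1space = "graphpercent";
  x1 = 50; y1 = 93;                 /* a guess: roughly where a second title line sits */
  anchor = "center";
  width = 100; widthunit = "percent";
  textcolor = "gray33"; textsize = 11; textweight = 'normal';
  label = "Data source: cdc.gov (as of &maxyear, week &maxweek)";
  url = "https://www.example.com/flu_data.csv";   /* placeholder, not the real data link */
  output;
run;

/* Clickable text (and the tip= tooltips) need an image map in the HTML output */
ods graphics / imagemap=on;

proc sgpanel data=my_data noautolegend sganno=anno_source;
  panelby year / onepanel columns=10 novarname noborder;
  band x=week lower=0 upper=flu_deaths / fill fillattrs=(color=red);
run;
```

Because the label is in double quotes, &maxyear and &maxweek resolve when the data step runs, so the Proc SQL step that creates them has to run first.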
Now we have a graph that allows the user to easily compare the current year to previous years, see multiple visual cues that show which data point is the last one on the graph, and click the 'Data source:' text in the title to download the actual data. (Click the image below to see the interactive version with the drill-down to the data.)
What is your flu prediction for the rest of this season - are we past the peak, or will the number of deaths per week continue to rise? What features do you like, or not like about this graph? What suggestions do you have for improving it?
(For the programmers out there, here's a link to the SAS code.)