With the US census coming in 2020, I've decided to sharpen my skills at graphing census data. And today I'm working on creating a population pyramid chart to analyze the age and gender distribution. Follow along if you'd like to see how to create such a chart ... or jump to the end to see the final chart!
First, you'll need some data. I'm using the 2010 data, and hopefully the 2020 data will be structured very similarly. I went to the Census data page, selected the datasets link, and downloaded the CSV file for the "Single Year of Age and Sex Population Estimates: April 1, 2010 to July 1, 2017 - Civilian". I opened it in Excel and saved it as an xls spreadsheet. Here's a sample of what the original data looked like:
I imported the data using the code below, and then subset it to just the columns I wanted. Next, I limited the data to the desired state (in this case North Carolina) and transposed the data so that each age became a separate column. I then used SQL to create new variables that grouped those columns into the desired age groups, and transposed the data again to get it structured such that I could create a stacked bar chart. There are probably a dozen other ways you could transform the data to achieve similar results, but here's a link to my SAS code if you'd like to see the details (perhaps you can even recommend some improvements!).
proc import out=pop_data datafile="sc-est2017-agesex-civ.xls" dbms=xls replace;
run;
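As one of those "dozen other ways", here is a rough sketch that skips the two transposes and builds the age groups directly in SQL. It is not the code behind the link, and it rests on some assumptions: that the imported columns are named name, sex, age, and popest2010_civ (as in the Census file layout), that sex=0 and age=999 mark the total rows, and that five-year groups ending in 85+ are what you want.

```sas
/* Sketch only: the column names below are assumed from the Census file layout */
proc sql;
create table both as
select sex,
       /* ages 0-4 -> g00, 5-9 -> g01, ... 85+ -> g17 */
       cats('g', put(floor(age/5), z2.)) as age_group,
       sum(popest2010_civ) as population
from pop_data
where name = 'North Carolina'
  and sex in (1, 2)      /* drop the both-sexes total rows */
  and age ne 999         /* drop the all-ages total rows */
group by sex, calculated age_group;
quit;
```

The result has one row per sex and age group, which is the shape a grouped bar chart needs.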
I can now create a simple stacked bar chart, where each bar represents an age group, and the colored bar segments represent the gender (male or female), using the following minimal code:
proc sgplot data=both;
hbarparm category=age_group response=population /
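The statement above is cut off after the slash. A minimal complete version might look like the following, assuming the gender variable is named sex (a guess, based on the Census column name) and that the segments should be stacked:

```sas
proc sgplot data=both;
/* group=sex is an assumed variable name; stack the male and female segments */
hbarparm category=age_group response=population / group=sex groupdisplay=stack;
run;
```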
But a pyramid plot needs a line down the middle with the bars for each of the two groups going out left & right from that center line. To accomplish this, I multiply the male population values by -1.
data left; set left;
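Spelled out in full, that step might look like the sketch below. The left dataset and the population variable come from the code above; the right dataset (the female rows) is an assumed name used only to show how the two halves could be recombined.

```sas
/* Flip the male values so their bars extend to the left of zero */
data left;
set left;
population = -1 * population;
run;

/* Recombine with the female rows; "right" is a hypothetical dataset name */
data both;
set left right;
run;
```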
Now let's work on the values shown along the axes. We had to make the male values in the data negative to have them plot to the left of the axis, but we don't want the values to show as 'negative' in the graph. Therefore we can set up a user-defined format to make the negative values print as positive. A picture format works well here because it formats the absolute value of a number and drops the minus sign unless you explicitly add one as a prefix.
proc format;
picture posval low-high='000,009';
run;
I set up the age group category names as g00 - g17 so they will plot in the desired order, but I want those values to show as something more meaningful when they are printed along the axis. Therefore I again use a user-defined format.
The original data used 1 for Male and 2 for Female, but I would rather have the words show up in the legend than the numbers, so I also create a user-defined format for that:
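A character format for the age groups could be sketched like this. The format name $agefmt and the label text are my own placeholders, and the five-year ranges assume g00 through g17 cover ages 0-4 up to 85+.

```sas
/* Hypothetical format name and labels, assuming 18 five-year age groups */
proc format;
value $agefmt
  'g00'='0-4'   'g01'='5-9'   'g02'='10-14' 'g03'='15-19'
  'g04'='20-24' 'g05'='25-29' 'g06'='30-34' 'g07'='35-39'
  'g08'='40-44' 'g09'='45-49' 'g10'='50-54' 'g11'='55-59'
  'g12'='60-64' 'g13'='65-69' 'g14'='70-74' 'g15'='75-79'
  'g16'='80-84' 'g17'='85+';
run;
```

It would then be attached inside the plot step with a statement such as format age_group $agefmt.;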
proc format;
value sexfmt 1="Male" 2="Female";
run;
Here's what the graph looks like with these user-defined formats applied:
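One way to apply the formats is a FORMAT statement inside the plot step. This is a sketch that assumes the posval and sexfmt formats have already been created with PROC FORMAT and that the gender variable is named sex.

```sas
proc sgplot data=both;
/* posval hides the minus sign on the axis; sexfmt labels the legend */
format population posval. sex sexfmt.;
hbarparm category=age_group response=population / group=sex;
run;
```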
And for a finishing touch, I specify mnemonic colors (blue and pink) for male and female, hard-code the population axis so it extends to the same number on the left and right of the zero line, and annotate Male and Female labels rather than using the color legend.
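Those finishing touches might be sketched as follows. The color codes and the axis limit of 400,000 are placeholder values I picked for illustration, the variable name sex is assumed, and INSET statements stand in here for the annotation approach used in the actual code.

```sas
proc sgplot data=both noautolegend;
format population posval. sex sexfmt.;
/* Colors are assigned in the order the groups appear in the data,
   so this assumes the male rows come first */
styleattrs datacolors=(cx6699cc cxff99cc);
hbarparm category=age_group response=population / group=sex;
/* Same extent on both sides of zero; the limits are placeholders */
xaxis values=(-400000 to 400000 by 100000);
/* Simple text labels in place of a legend (a swapped-in technique) */
inset "Male" / position=topleft;
inset "Female" / position=topright;
run;
```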
That's a pretty decent pyramid chart, and similar to most of them I've seen online ... but it still seems a bit lacking. For example, I find it difficult to locate a bar on the right side (pink), and visually follow from that bar segment to figure out which label (age group) along the left side of the graph goes with it. Therefore I added a y2axis along the right-hand side, and also added alternating white and gray colorbands to the yaxis. Now it is much easier to visually follow along from left-to-right and right-to-left. I'm pretty happy with the final result!
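The color bands can be requested on the YAXIS statement. The sketch below covers only the banding, not the right-hand axis, and the gray shade and the variable name sex are assumptions.

```sas
proc sgplot data=both noautolegend;
format population posval. sex sexfmt.;
hbarparm category=age_group response=population / group=sex;
/* Shade every other age group row to guide the eye across the chart */
yaxis colorbands=odd colorbandsattrs=(color=grayee);
run;
```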
Here's a link to the code, if you'd like to try to re-create this graph for your own state!