A graph for Trump - our oldest president

Recently, my fellow SAS blogger Rick Wicklin wrote a post showing how to graph the ages of all the US presidents. And Chris Hemedinger showed how to create a bar chart showing the number of presidents having each of the 12 zodiac signs. Both are interesting graphs, but I wanted to take a slightly different (more traditional) graphical look at the data - something more along the lines of a histogram. This would let me view the age distribution of all the presidents, and confirm whether or not Donald Trump is the oldest.

For me, creating a graph is an iterative process. I start simple, and then add little changes & enhancements (wash, rinse, repeat), until I have a visualization I'm happy with. Rather than just showing you the final graph, I'm going to walk you through my entire though process this time ...

I started with a simple Proc Gchart bar chart, using all the defaults. It produced a fine histogram, showing the number of presidents grouped by age, and I could see that 3 presidents were in the '70' age group. I knew Trump was 70, and I had heard he is going to be our oldest president ... but there appear to be two other presidents in the 'age 70' bar in this graph. I think I need to dig a little deeper...


I used the subgroup option, to create a separate colored bar segment for each president, to determine who the other two were. In theory, I could match the colors in the bars with the colors in the legend and determine who the other two presidents' names ... but it was really a chore to try and distinguish all those colors. To save you a bit of time, I've used a drawing package and circled the legend items that correspond to the colors in the '70' bar (Trump, Reagan, and Harrison).


44 is a few too many colors for a legend - no, actually it's a ridiculous number of colors. Therefore I used a pattern statement to repeat a single color for all the bar segments. Then I used Gchart's 'html=' option to enable html mouse-over text. With that enhancement, I could hover my mouse over each bar segment and easily see not only the name of each president, but also their age at inauguration.


With the mouse-over text, I found that the other two presidents in the '70' bar were both slightly less than 70. Gchart bars of numeric values show the midpoint of each bar in the label, by default. In order to show the range of values in each bar (min-to-max), you must specify the 'range' option. With that change, it was much more clear that the last bar didn't show the presidents who were 70 when they were inaugurated, but rather the ones who were >=68.


These automatically chosen ranges were nice and evenly-spaced, but they weren't really the ranges that I would have picked ... therefore I decided to go totally custom, and do my own binning. I used a data step and 'if' statements to check each president's age, and assign them to 7 bins, representing age ranges that start and end on even multiples of 5. I then created a user-defined-format so that those ranges (1-7) would show up as the text representation of the ranges (such as '>=70'). The resulting graph clearly showed that Trump was the only president >=70!


Now that I had the layout the way I wanted, I shifted my focus to the cosmetic details. I made the bars a prettier shade of blue, cleaned up the axes, added some reference lines, and annotated the presidents' names on each bar segment (here's the final code). The results might not be perfect, but it's what I would call a blue-ribbon graph! Click the image below to see the interactive version, with html mouse-over text and drill-downs.


Quick quiz:

Which last names are repeated multiple times in the graph?

What's the youngest and oldest ages allowed to run for president in the US?


Post a Comment

Send your SAS graphs to Excel, directly to Excel ...

Are you the lone-wolf SAS programmer in your company - managing the data, performing the analyses, and graphing the output for everyone else? And what's the only format they all know how to work with, and wish they had your output in? Let's face it ... that's probably an Excel spreadsheet!

Previously you had to jump through some hoops to send SAS graphs to an Excel spreadsheet, but now you're in luck! In SAS 9.4 (TS1M3), there's an easy new way to send your output (including graphs) to Excel. Directly to Excel! (Do not pass go, do not collect $200!)


Chris Hemedinger gave you a little preview of the experimental version of ODS Excel in SAS 9.4M1 and 9.4M2. And we now have the production version in 9.4M3! An astute user (Frank Tokic) actually brought this new feature to my attention, and provided an example of using it with SAS/Graph (thanks Frank!) I modified Frank's example to also demonstrate a few additional features, and I think it makes a great starting place for anyone interested in sending SAS output to Excel.

The example demonstrates:

  • Sending a traditional SAS/Graph Gchart to an Excel spreadsheet.
  • Sending a SAS Proc Print table to an Excel spreadsheet.
  • Starting the table at a specific row & column.
  • Including the title text above the table.
  • Using date-formatted data in the table.
  • Writing the graph and table to separate sheets.
  • And controlling the names of the sheets.

Here are screen-captures showing the graph and table in the spreadsheet:





I'm including the SAS code below, and also including links to the code and xlsx output files.

data my_data;
input item $ 1-6 Amount Purchase_date date9.;
ITEM A 11.8 14nov2016
ITEM B 10.5 01jul2016
ITEM C 8.8 22feb2015
ITEM D 6.8 01apr2012
ITEM E 3.9 03aug2016
ITEM F 2.3 02mar2016
ods listing close;
ods excel file="excel_sas_graph.xlsx" style=htmlblue;
goptions gunit=pct htitle=6 ftitle="albany amt/bold"
htext=3.5 ftext="albany amt/bold" ctext=gray33;
axis1 label=none value=(justify=right);
axis2 label=('AMOUNT') order=(0 to 12 by 2) minor=none offset=(0,0);
pattern v=solid color=dodgerblue;
ods excel options(SHEET_NAME='Graph');
title1 ls=1.5 "Simple Bar Chart";
proc gchart data=my_data;
format amount dollar5.0;
hbar item / discrete type=sum sumvar=amount nostats
maxis=axis1 raxis=axis2 noframe
autoref clipref cref=graycc;
ods excel options(SHEET_NAME='Data' EMBEDDED_TITLES='yes' START_AT='3,2');
title c=gray33 "Raw Data Values";
proc print data=my_data noobs label;
label purchase_date='Purchase date';
format amount dollar5.2;
format purchase_date date9.;
ods excel close;

Hopefully this example will get you started with the basics. ODS Excel is pretty new, and therefore I'm not an expert (yet), but hopefully the documentation will answer most of your questions. And if that doesn't satisfy all your curiosity, I'd recommend posting your question in the SAS community pages.




Post a Comment

Where did all my daylight go?

Does it seem like there just aren't enough daylight hours to get things done these days? It might not be that you've got more to do - it might be that we actually have fewer daylight hours in the winter (here in the northern hemisphere). And since "seeing is believing," let's graph the data to see what's actually going on!

First, I looked around and found a website where I could get the sunrise & sunset times for various cities, and generated a URL to get the data for Raleigh, NC. Here's what the text data looks like:


Next I wrote some SAS code to read the data (directly from the website, by the way!), and manipulate it into a form that I could easily use to create my graph - for example, rather than having a separate variable/column for each month, I wanted all the dates to be values of a single variable. What I wanted to do was sort of like transposing the data, but rather than using Proc Transpose, I used a data step with multiple output statements for each line of data.

Once I had my data ready, I was able to create a simple stacked bar chart with a separate (very narrow) bar for each day of the year, showing the hours before sunrise, the daylight hours, and the hours after sunset. It's not a pretty graph (yet), but it showed me that I was on the right track:


Now, let's make the graph pretty... I started by cleaning up the axes. Bar charts typically label each bar, but in this case 365 bar labels showing the date just won't fit. So I use an axis statement to suppress those bar value labels (using the value=none option). The left/time axis auto-scaled and picked reasonable times to show along the axis, but I thought something a little simpler might be better. Therefore I used the order= option to specify which tick marks I wanted, and I used the value= option to specify some custom text for those tick marks (for example, '6 am' rather than '6:00:00 AM'). This graph is cleaning-up nicely, eh!


By default, each bar has an outline drawn around it. Outlines are usually desirable, but with 365 tiny bars the dark outlines are a bit overpowering - therefore I get rid of them using the coutline=same option. Also, rather than using the default colors (blue, red, and green) I specify colors that will be more intuitive and meaningful (black and yellow). Now, the time periods before sunrise and after sunset will be dark, and the daylight hours will be yellow (sunshine-colored). This graph is really starting to look a lot nicer - you can clearly see that the summer months have more daylight hours than the winter months!


And for a few finishing touches, I annotated some custom month labels along the bottom of the graph, added some dotted reference lines, and placed a title at the top to describe what's in the graph. I think this is a fine representation of the data, and much easier to understand than the text table!


But when analyzing data, sometimes "the devil is in the details" ... Did you notice the footnote at the bottom of the text data file (I've circled it in red below)? This means we're not quite finished. We need to figure out how to represent the time shift of daylight savings time in the graph!


Here in the Eastern time zone of the US, we use daylight savings time (dst). This is where we all turn our clocks forward an hour in the summer months, to trick everyone into waking up an hour earlier and take advantage of the earlier sunrise. In order to represent that in the graph, I did a Web search and found the dates that dst started & stopped in 2016, and added a 1-hour offset to the sunrise and sunset times for all the days between those dates. The following graph shows what we actually experience, with the artificial dst shift.


I hope you had fun and maybe even learned a little something!

Now for a side-question ... what's your opinion on daylight savings time? Do you like it or dislike it? Did you prefer the old dst we used to have, or the new dst we've had since 2007 (where they added a few extra weeks)?

And what better way to end a blog about sunrise/sunset, than a picture of someone waving goodbye in a beautiful sunset in Naples, Florida. (This picture is from my friend Eva, and this is her man Todd waving.)



Post a Comment

You need a custom map, for US presidential election results!

In the US presidential election, each of the 50 states has a certain number of electoral votes, based on the population. Typically, most states cast all their electoral votes for the candidate who wins in their state (all or nothing). But states can split their electoral votes if they want - and that's exactly what Maine did this year ... which apparently caught some of the election map-making software by surprise!

For example, here's the election map I saw on our local news website this morning, attributed to the AP. Notice that Maine (in the top/right corner) is all blue, which visually makes you think that Hillary Clinton won the entire state. But actually, she only won 3 of the 4 electoral votes there.


Of course, I was ecstatic to see such a mapping challenge come along, so I could show off how a SAS programmer can customize a map, and represent this type of non-standard data! (the ability to customize the graphs is one of SAS' strong points)

Here's what I came up with ... I added a rectangular map area overlapping the northeast section of Maine (ME), and called it ME2. And rather than Maine in the hover-text, I set that to "Maine, district 2". I won't bore you with the coding details, but the end result is a map that's visually more correct than the official AP map. :)

Feel free to click my map snapshot below, to see the full-size version, with html mouse-over text for the states and the segments of the horizontal bar (each bar segments represents a state, with the size of the bar segment representing the number of electoral votes).


And I'll leave you with a cool picture of my friend Hannah. She's an artist, and it even shows in her selfies! :)




Post a Comment

Your chance to vote ... for your favorite election graph!

Elections in the US are a 'target rich environment' for data analysts. There are surveys and forecasts before the election, and the presentation of results during and after the voting. What's your favorite election-related graph of all time?

For the current (2016) presidential election, my favorite graphs are on the fivethirtyeight.com website. They have a nice map, and a really nice line plot (see partial screen-captures below, and click on the images to see the real thing). They put a lot of thought into these graphs, and they contain a lot of information. The best part is that they use several data sources, and update their graphs any time new data is available (often several times a day!). They also allow you to plot the data in several different ways (such as popular vote, electoral vote, chance of winning), via buttons above the graph, and along the left side of the page. (Based on their articles, I think they might be slightly biased towards Hillary, so keep that in mind when you view their graphs - aside from that little caveat, these are some great graphs.)



So, what other types of graphs might be helpful in understanding the flood of data around election time? Over the years, I've created quite a few election-related graphs, and I've now pulled them together into one location for your easy viewing. Click the screen-capture below to go to my samples page, where you can drill into each graph thumbnail to see all the details. Note that when you click the text-link below the thumbnails, there's a link to the SAS code, and I've also added links to previous blogs about the graphs. (Are you overwhelmed yet!?! Hahaha!)



Did any of my graphs catch your attention? Which one is your favorite? And which election-related graph (other than mine) is your favorite? Feel free to leave your reply in a comment, and elaborate on what you do and don't like in election graphs!


Post a Comment

The difference between the Subsetting IF and the IF—THEN—ELSE—IF statement

Subsetting IF and the IF—THEN—ELSE—IF statementMy river walk last week turned into a spectacular fall show. But if it rains this week in San Antonio, like the weatherman predicts, what will I do?

In the coming days, I’ll be presenting at two user groups,  one in eastern Canada in Halifax, and the other all the way south west in San Antonio, Texas. I’m really praying for great weather. I’ll carry an umbrella if I have to, and wear a scarf if I have to. But I don’t’ really want to, at least not in San Antonio in Texas.

Regardless of what the weather turns out to be, at least I can use the weather forecast to provide the data I need to explain IF and IF-THEN-ELSE statements to my Programming 2 students. This was their question:

Tell me the difference between the Subsetting IF and the IF—THEN—ELSE—IF statement?

Yes, at first glance they do appear similar.

However the Subsetting IF is a filter. If you are familiar with the WHERE clause, its role is similar to the WHERE, it filters or subsets rows based on a condition. It’s a bit different from the WHERE because it has access to even new variables that you are constructing in the data step.

The IF—THEN—ELSE statement is familiar to most programmers, regardless of their background. It performs an action based on a condition, but its role is not to subset data.

To really see the difference here’s some code I wrote up:

1 - First, I construct my weather dataset by pulling in data from a Google search

data weather;
infile datalines dlm=',';
input city :	$12.	day $	date: date9.	high	low	weather :$9.;
San Antonio,Sat,5-Nov-16,25,14,sun cloud
San Antonio,Sun,6-Nov-16,23,16,rain
San Antonio,Mon,7-Nov-16,24,18,rain
San Antonio,Tue,8-Nov-16,24,17,rain
NOTE: The data set WORK.WEATHER has 8 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00

2 - Next, I'll do some conditional processing. Notice my log note, how all rows are written to the output dataset:

data what;
set weather;
if  city ='Halifax' and weather='rain'
then whattotake="Sigh! I knew it, I'll take an umbrella";
else if city ='San Antonio' and weather='rain'
then whattotake="No way! ok I'll take an umbrella";
flag='SAS user group conferences 3-8 november 2016';
NOTE: There were 8 observations read from the data set WEATHER.
NOTE: The data set WORK.WHAT has 8 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

3 - Finally, I’ll use a Subsetting If.

If the condition is true, only then the next action takes place, (i.e. flag is set to 'SAS etc.’), same as in the conditional processing scenario.

But the key driving difference is this. Only if the subsetting IF condition is true, (i.e. only if city is either Halifax or San Antonio) will SAS hit the RUN statement.

Implicit in the RUN statement are 2 actions, IMPLICIT output, and IMPLICIT return.

data halifaxSanAnt;
set weather;
if city in ('Halifax', 'San Antonio');
flag='SAS user group conferences 3-8 november 2016';
NOTE: There were 8 observations read from the data set WEATHER.
NOTE: The data set WORK.HALIFAXSANANT has 6 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

There you go, the big differentiator between the IF..THEN..ELSE and Subsetting IF statements.

What do you think? Will it rain or will it shine?

Hope you enjoyed this blog post. I can’t wait to share more SAS tips at the two user groups in Halifax and San Antonio. If you are in the neighbourhood, stop by. And wish me fair weather.

Post a Comment

Building a better HIV/AIDS map

Being an Eagle Scout, the data for good movement caught my attention. I wondered if I could apply my computer skills in a way that might help. How about showing people better ways to visualize HIV/AIDS data - that might help doctors better understand the data, and therefore better treat and prevent the disease! Let's start simple, and then step-by-step build a better map...

I did some Google searches and soon found a report with lots of data for my state, North Carolina. The report had data at the county level, but the values were in tables (see partial screen-capture below) - no geographical maps at all. It was very difficult to look at the tables and try to identify the geographical relationships and trends (I doubt many people even know where all 100 NC counties are located). This data was begging to be plotted on a map!


So I copy-n-pasted the data from the pdf report into a text file, and read it into a SAS dataset. I then used Proc GMap to create a map. I started with the most basic color binning for the legend, specifying the midpoints=old option so GMap would use the simple Nelder algorithm (this was the default prior to SAS 8.2). I used the default color gradient that comes with style=htmlblue. With this map, you can easily see that some parts of the state have a much higher rate of new HIV cases - it's so much easier to see this in a map, than looking at 100 values in a table!


Now, instead of the simple Nelder algorithm, let's use quantile binning to assign the colors to the counties. By specifying levels=5 in Proc GMap, you get 1/5 of the counties placed into each color bin. This makes it easy to identify the best or worst fifth (or 20%) of the counties.


But using the default light-to-dark blue color gradient didn't seem quite right for this data. Using this color gradient (to me, at least) seems to only convey the "level of badness" of the HIV rate. I wondered if it might be better to use a divergent color scheme, which might provide a way to intuitively convey that low values are good, and high values are bad. Here's what a divergent color scheme (from the colorbrewer website) looks like on the map.


The divergent color scheme was a step in the right direction ... but it didn't quite fit the data. Are the counties with the lowest 1/5 of the HIV rates all good? They're all green in the map. A rate of 0 is definitely good, but is a rate of 3.3 also good? I think making them both green might be misleading.

So I decided to go with custom legend/color binning. In my custom map, the only good value is zero (which I make green), and all non-zero values are levels of bad (shades of red). And rather than using quantile binning, or simple Nelder binning, I looked for natural breaks in the distribution of values in the data, and assigned legend bins manually using 'if' statements and a data step. I'm starting to like this map!


And now for one last finishing touch ... If you click the map and view the interactive version, you can hover your mouse over each county and see the county names. But many people don't really think in terms of counties. Quick quiz - where's Jackson county NC? (... see what I mean!) Therefore I annotated labels on some of the major cities to provide additional reference points. And now I really like this map!


I'm not saying that I've created the perfect map, but I think it's certainly a step in the right direction. I'm happy to share the SAS code, if you'd like to improve it and send me some suggestions.

And now let's go on a little visualization tangent ...

Sometimes when I create a graph with reds & greens, someone comments that colorblind people might not be able to distinguish between the colors. Well, with colors "the devil is in the details" - especially when dealing with mixtures of colors (you can see in the RGB codes in the pattern statements that these aren't pure reds and greens, but rather mixtures of RGB codes). Therefore you can't just conclude "This map looks like it has reds and greens in it, therefore someone who is colorblind can't distinguish these colors." You either have to ask a colorblind person, or simulate a colorblind person.

I uploaded my HIV/AIDS map to one of the websites that lets you simulate what it would look like to people with various forms of colorblindness, and all the colors and shades used in this map are very easily distinguishable in all of the Anomalous Trichromacy and Dichromatic colorblind simulations. It's actually pretty cool & fun, and I encourage you to try it with some of the graphs you create.

Here's a screen-capture of one of the simulations for my map:




Post a Comment

How to scare up a few good graphs for Halloween

Halloween appears to be my favorite holiday, because I seem to have more graphs related to it than any of the others. And since Halloween is just a few days away, I thought you might like an easy way to see all those graphs.

Here's are links to my previous Halloween-related blog posts, containing lots of clever graphs and maps:

Or, if you're more of a visual person, here's a pictorial interface (click each image to go to the blog posts). The same person is in two of these photos - can you guess who? :-)








Have you seen any interesting Halloween-related data lurking out there?  Perhaps you have an idea for some future graphs and analyses I can do!?!  (feel free to leave a comment with your suggestions)


Post a Comment

What comes shipped free with Base SAS 9.4?

what-comes-shipped-free-with-base-sas-9-4_01Like unexpectedly seeing this beautiful bird in nature, SAS has tons of free goodies you might be surprised to encounter as you explore your software.

A user asked me how to find products licensed at their workplace and that's how this informative blog got started. While individual organizations may have separate permutations and combinations of SAS software, I’d like to share the list of freebies that comes with a simple BASE SAS license. My colleagues in Education helped finalize this list.

Base SAS is what's shipped free with SAS software. BASE is the mother of all - the language of SAS. BASE SAS gives users the DATA step, PROCs such as CONTENTS, PRINT, REPORT, MEANS, FREQ, SORT, IMPORT/EXPORT for text files, and SQL, ODS including ODS Graphics, and the macro facility.

What are the products that you get access to with Base SAS?

Here's a short list of products you have access to with Base SAS. And did I mention these are all part of your Base SAS licence?

1  -  Languages to extract, manipulate, manage, and report on your data
i.   Data Step - for data manipulation
ii.  PROC SQL - for data management and joins
iii. SAS Macro - for text substitutions
iv.  PERL - for seeking patterns in data
v.   DS2 - for working with big data

2  -  Interfaces of SAS Enterprise Guide and SAS Studio, if on an appropriate operating environment such as Windows, are automatically provided with Base SAS. Not all functionality of these interfaces will work without additional products.

3  -  ODS graphics procedures (SGPLOT, SGSCATTER, and others that start with "SG") *are* part of Base SAS. SAS/GRAPH procedures (GPLOT, GCHART, GMAP and others that start with “G”) do not come with Base SAS.

4  -  PROC IMPORT/PROC EXPORT which allow you to read raw data files such as TXT and CSV are part of Base SAS.

5  -  PROC MEANS which prints summary statistics.

6  -  PROC FREQ which prints frequency tables.

7  -  ODS, or the Output Delivery System, which allows you to create HTML, PDF, RTF, Excel, and PowerPoint files with your SAS results.

How do you know what other products you have licensed?

You are probably familiar with typing these lines of code in your SAS session:


Read More »

Post a Comment

The perfect storm for State Fair attendance!

The State Fair in North Carolina is just a few miles from SAS headquarters, and therefore it's virtually impossible for it to slip by without me noticing it. There are two aspects of the fair that usually get lots of news coverage - what's the latest fair-food, and did we set any attendance records? There's not a lot of data about the food available (although I did create this fun graph a few years ago), but thankfully attendance numbers are published on the NC State Fair website daily!

Before we get started, here's an awesome photo my friend John took at this year's State Fair (be sure and click to see the full-size image, so you can take in all the detail!)


Now, let's analyze that attendance data ... Here's a screen-capture showing the top of the attendance table from the official webpage (click it to see the full table):


I decided to start my analysis by making a similar table, but with a few enhancements. First, I got rid of the colors, used a '.' rather than 'n/a', and sorted the table with the most recent years at the top. I think all these things make it easier to read.


Next, I used some tricky coding techniques to mark the highest attendance for each day in bright green. This involved using user-defined formats with color names, in combination with a somewhat obscure style option in Proc Print. I think this adds important information to the table, and makes it more of an analytic tool rather than just a table. Wow - looks like 2010 had several record-setting days!


Now that we know which days had the record attendance, let's find out which year had the highest total. And what better way to show that than a simple bar chart! Looks like 2010 was the record setter.


Now, how about some more detailed plots that allow us to visualize all the individual values in the table? Here's a simple plot of the data by day, with the latest (2016) markers in bright red. Note that the other markers are transparent blue, so you can see where multiple markers 'stack up' on top of each other (multiple/overlapping markers become darker, as the transparent colors combine).


I like that plot, but I'm sure the analysts and statisticians are already salivating for a box plot. I know it's not traditional to show all the markers when using a box plot, but I like to be able to see the spread of the actual data, so I like to include them.


But even the box plot didn't seem to show all the secrets that I knew were hidden in this data. Therefore I created another plot showing each year of data as a separate line - and with this graph, you can see an oddity in the data for the second Thursday (some of the values were high, and some of them were much lower).


And finally, I decided to color the lines by decade. It's not a beautiful graph (some would even disparagingly call it a spaghetti graph), but in this particular situation I think it provides some important insight that the other graphs did not!

With this graph, I was able to determine that the Thursdays with the lower attendance were from the 1980s and 1990s. And then I remembered that in more recent years there has been a big canned food drive on Thursdays where if you bring 5 cans of food to donate to the Food Bank of Central and Eastern North Carolina, you get into the fair for free. According to the fair website, since 1993, more than 4.4 million pounds of food have been donated by fairgoers. This transformed the traditionally lower-attendance Thursday into one of the higher-attendance days.



Although some portions of NC are still recovering from the flooding caused by Hurricane Matthew, we had nice weather during the fair week this year. This probably helped produce the good attendance numbers. I wonder if it would be possible to correlate fair attendance to weather data? Hmm ... maybe a topic for a future bog!

Another factor which might have helped lure in people this year was the awesome new attraction called the Flyer Sky Ride. It's a chair lift that carries passengers above the fairgrounds, from one end to the other. Here's a link to a cool video my friend David made from this ride.


I hope you had fun exploring & analyzing this data with me, and hopefully you have learned some tricks and techniques to use on your own data!


Post a Comment