Where to find the cheapest gas

"Over the river and through the woods, to grandmother's house we go" ... and how much should we expect to pay for gasoline during the trip? Read this blog to find out!

I recently wrote a blog post about some of the factors that impact the price of gasoline. The blog had some interesting information, but when it comes down to it what we really want to know is how much will we be paying for our next tank of gas!

You've probably got your favorite low-price gas station in your local city, but did you know that the prices vary widely from state to state? According to AAA, the state averages for regular grade gasoline range from $2.25 to $3.70 per gallon today! Therefore, if you're going to be driving on a long trip, you might be able to save some money by planning your route, or the timing of your fill-ups, so you're buying gas in one of the lower-priced states.

To analyze & visualize the data in SAS, I wrote a little program that reads in the HTML code of the AAA Web page, and parses out the state names and gasoline prices, and then plots the prices on a map, color coded by quintiles (each 1/5 of the states are a different color, based on their gasoline price):


The above map is great for quickly determining which states have higher and lower gasoline prices. But what about the 'impact' of those prices? I thought it would also be interesting to know how many people the high & low prices are affecting. For example, although Wyoming is a fairly large state geographically (and therefore has a lot of 'red' in the map above), the population is low, and therefore the high gasoline price in Wyoming doesn't affect many people.

In order to visually 'weight' the map by population, I plotted the data using a rectangular tree map, where the size of each rectangle represents the number of people in each state. Now it's easier to see, for example, that Texas has a low gasoline price (bright green), and also has a large population (large rectangle size).



Now that you've seen the data, what state do you plan to fill up in, while traveling over the holidays?

Post a Comment

10 most annoying airline seatmate behaviors

I recently saw an interesting infographic about the top 10 most annoying airline seatmate behaviors. I decided to try to create a more efficient SAS graph to visualize the same data, and share it with you in this blog.

Not everyone is as fortunate as my buddy David (pictured below), who has a pilot license, and does not have to worry about annoying behaviors of random seatmates.


The rest of us non-pilots have to fly on commercial airlines, and the people sitting around us have a variety of behaviors - some good, and some annoying. Expedia's infographic about annoying seatmates was 'cute,' but I had to do so much scrolling to view the 1000x4515 pixel image that by the time I viewed the last few items, I had forgotten what the first few items were!

Therefore I decided to create a no-nonsense SAS graph that allows me to see and compare all the data, in a graph that easily fits on one page without scrolling. I first input the data into a SAS dataset, and then used the following minimal code to produce the default graph. Here's the code and the output:

proc gchart data=my_data;
hbar category / type=sum sumvar=percent;


That's a decent graph, and allows me to easily see and compare the 10 items ... but it could be a lot better with a little more work. Using several built-in options, and specifying characteristics in axis statements, I was able to get a graph that was much better looking, and also easier to compare the 10 items:

pattern1 v=s c=cx00B2EE; /* blue color */
axis1 label=none value=(j=right) offset=(4,4);
axis2 label=none major=none minor=none style=0;
proc gchart data=my_data;
hbar category / type=sum sumvar=percent descending nostats
maxis=axis1 raxis=axis2
noframe space=0 width=3.7 coutline=white
autoref clipref cref=graydd;


That's a fine graph, but there is always room for improvement! I found myself looking from left-to-right quite a bit, to see the bar text and then the bar height. So I thought, why not print the text inside the bar, so I don't have to look all the way on the left side of the graph? There's no built-in option to do that, but I suppressed the default text labels by using value=none on the axis statement, and then used data-driven annotate to write the text on the bars. Depending on your preference, either of these two graphs might suit your needs.

data my_anno; set my_data;
length text $100;
xsys='2'; ysys='2'; hsys='3'; when='a';
midpoint=category; x=percent;
function='label'; position='<'; style='albany amt/bold';
axis1 label=none value=none offset=(4,4);
axis2 label=none major=none minor=none style=0;
proc gchart data=my_data anno=my_anno;
hbar category / type=sum sumvar=percent descending nostats
maxis=axis1 raxis=axis2
noframe space=0 width=3.7 coutline=white
autoref clipref cref=graydd;


Do you agree with Expedia's "Top 10" list?  What's the most annoying behavior you've personally seen on a plane?

Post a Comment

Top 10 SAS Training Post blogs

December is all about traditions. Some of mine include holiday shopping, baking (I really mean eating) Christmas cookies and putting together my annual list of most read blogs on the SAS Training Post.

So as traditions go… here’s my list of the top 10 most read blogs in 2014.

  1. How to land a job as a SAS professional by Charu Shankar
  2. How to handle percent (%) values in SAS by Robert Allison
  3. Zero to SAS Certified Base Programmer in 3 months (Part 1) by Mark Stevens
  4. Top 6 recommendations for SAS Certification preparation by Michele Reister
  5. The Bayes theorem, explained to an above-average squirrel by Michele Reister
  6. Airplanes that have disappeared without a trace by Robert Allison
  7. Free SAS Software for students by Robert Allison
  8. Top 10 reasons to get SAS certified by Terry Barham
  9. Behind the scenes: Statistical Business Analysis certification exam by Mark Stevens
  10. Zero to SAS Certified Base Programmer in 3 months (Part II) by Mark Stevens

We hope you've been able to learn something new, and maybe even laugh a little with us here at the SAS Training Post over the last year. Please let us know if you have any suggestions for 2015 topics.

And since we’re talking traditions. Do you have any holiday traditions you look forward to each year? Leave me a message in the comment section below.

Post a Comment

Plotting the Corruption Perceptions Index using SAS

I recently found some very interesting data - a numeric Corruption Perceptions Index for each country. And of course I just had to plot that data on a SAS map! Here's my map (click the image below to see the interactive map, with html hover-text over each country, and buttons/links to select the desired year):



Technical Details:

Some of you might be satisfied with just looking at the output, but I know that many of you are, like myself, data meisters ... and you feel compelled to know all the details of how I created this map (so you can create similar maps using your own data). OK - here goes...

First, let's look at the original map on the transparency.org site. They mapped all the values using a yellow-orange-red gradient, the map had hover-text for each country, and you could zoom and pan the map. I didn't really like the color gradient, because it was difficult trying to visually match a country's shade to the continuous legend. Also, all the colors in the gradient seemed to represent bad/caution (ie, the countries with a 'good' score still looked 'bad' on the map). The pan & zoom capability seemed nice at first, but in practice the features seemed to get in the way more often than they were useful. And the way the hover-text was laid out, and also having different font sizes for different pieces of text, made it difficult to quickly read. Here's a screen-capture of the original map (click it to see the full size interactive version):



To create my own version, I first had to get the data. I could click the 'Download Info Package' button and it contained a spreadsheet of the data, but that was only for 1 year - whereas I was more interested in the 3 years of data they showed in their table. So I copy-n-pasted their table into a text file. But rather than a country's values (rank, country, and 3 years' scores) all being on one line, each cell in the table came out on a separate line, such as:


Being very flexible, SAS can handle that! Rather than the usual single input statement, I simply used 5 input statements per each observation:

data my_data;
length idname $80;
infile datafile lrecl=80 pad;
input rank;
input idname $ 1-80;
input score_2014;
input score_2013;
input score_2012;

Then, to make my data a little more manageable, I wanted to have 1 variable for score, with separate values for each year (rather than a separate variable for each year's score). There are a couple of ways to accomplish this (such as proc transpose), but I took the simple brute force method using a data step:

data plot_data; set my_data;
year=2012; score=score_2012; output;
year=2013; score=score_2013; output;
year=2014; score=score_2014; output;

Now for the tough decision - how to color my map. I first tried the Proc Gmap default of quantile binning (where 1/5 of the countries were assigned to each color bin), but I didn't really like how the bins changed from year to year. So I decided to manually set which scores went into which color bin. I didn't really know what scores were considered 'good' and 'bad', but after a bit of experimentation I decided to go with 5 evenly-spaced bins, with a range of 20 in each. I used a data step to assign each country to one of these 5 bins (or buckets), and then used a user-defined format to make those bins (1-5) show up as the desired text in the legend:

proc format;
value ranges
1 = 'ge 80'
2 = '60-79'
3 = '40-59'
4 = '20-39'
5 = 'lt 20'
data plot_data; set plot_data;
format bucket ranges.;
if score ge 80 then bucket=1;
else if score ge 60 then bucket=2;
else if score ge 40 then bucket=3;
else if score ge 20 then bucket=4;
else bucket=5;

I then wrote a macro that created a plot for each year (with the _year being part of the filename), and used the note statement to print the 'year' buttons at the bottom of the page, allowing users to easily select the desired year:

note move=(50,3)  font='albany amt/bold' "Select Year:";
note move=(60,3)  link='corruption_perceptions_2012.htm' box=1 "2012";
note move=(65,3)  link='corruption_perceptions_2013.htm' box=1 "2013";
note move=(70,3)  link='corruption_perceptions_2014.htm' box=1 "2014";


Feel free to download the full SAS code, and experiment with it!

Have you visited or done business with any of the orange or red countries? Do you think the perception of these countries is fair? Or do your experiences lead you to feel otherwise?

Post a Comment

How to eliminate data error notes from the SAS log

saslogA student in a SAS class recently asked if there were a way to eliminate data error notes from the SAS log and, instead, write them to a separate file.  Of course there's a way!

Here's a simple datastep.  Notice the missing dollar sign to indicate the variable GENDER (M, F) is a character variable.

data class;
	infile 'c:\temp\class.csv' dsd;
	input name $ gender age;

We've all seen those ugly data error notes in the SAS log!

562  data class;
563     infile 'c:\temp\class.csv' dsd;
564     input name $ gender age;
565  run;
NOTE: The infile 'c:\temp\class.csv' is:
      RECFM=V,LRECL=32767,File Size (bytes)=236,
      Last Modified=03Dec2014:12:46:20,
      Create Time=03Dec2014:12:46:20
NOTE: Invalid data for gender in line 1 8-8.
RULE:     ----+----1----+----2----+----3----+----4----+----5         
1         Alfred,M,14 11
name=Alfred gender=. age=14 _ERROR_=1 _N_=1
NOTE: Invalid data for gender in line 2 7-7.
2         Alice,F,13 10
name=Alice gender=. age=13 _ERROR_=1 _N_=2
NOTE: Invalid data for gender in line 3 9-9.
3         Barbara,F,13 12
name=Barbara gender=. age=13 _ERROR_=1 _N_=3
NOTE: Invalid data for gender in line 4 7-7.
4         Carol,F,14 10
name=Carol gender=. age=14 _ERROR_=1 _N_=4
NOTE: Invalid data for gender in line 5 7-7.
5         Henry,M,14 10
name=Henry gender=. age=14 _ERROR_=1 _N_=5
NOTE: Invalid data for gender in line 6 7-7.
6         James,M,12 10
name=James gender=. age=12 _ERROR_=1 _N_=6
NOTE: Invalid data for gender in line 7 6-6.
7         Jane,F,12 9
name=Jane gender=. age=12 _ERROR_=1 _N_=7
NOTE: 19 records were read from the infile 'c:\temp\class.csv'.
      The minimum record length was 9.
      The maximum record length was 12.
NOTE: The data set WORK.CLASS has 19 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

Suppressing the data error notes in the SAS log is easy!

options errors=0;

But the above approach only masks the bad news.  That is why the student wanted to write these notes to a separate file.  Here's how!

data class;
	infile 'c:\temp\class.csv' dsd;
	input name $ gender age;
	if _error_=1 then do;
		file 'c:\temp\MyInvalidDataNotes.txt';
		put 'NOTE: Invalid data in line ' _N_;
		put _infile_;
		put _all_;

In SAS, there's always a way!

For more datastep tricks, you may wish to attend SAS Programming 2: Data Manipulation Techniques.

Post a Comment

Zombie fever: A chronological and quantitative analysis

Do you, or someone you know, have Zombie fever - and is it spreading? How might you analyze the spread of such a disease using SAS? Here's some sample output from a simulation I helped SAS author, Mike Zdeb with, that can be used to simulate the spread of diseases ...


But "Zombie fever" isn't a real disease - it's the addiction to Zombie TV shows, which are all the rage these days! For example, The Walking Dead TV series, which started in 2010, is one of the most popular shows on TV. And a new series called Z Nation might be heading in the same direction.

The Walking Dead seems to be more of a 'realistic' representation of  a Zombie apocalypse (survival & social issues). Whereas Z Nation is a little more 'fantastical' (NSA infrastructure and cameras still working when there should probably be no electricity, glowing zombies around a nuclear reactor, etc) and frequently interjects a bit of dark humor (Liberty Bell bouncing down the street taking out zombies, getting a zombie high with marijuana smoke, etc).

I've been watching both shows, and I was wondering how the two might compare head-to-head.

So I downloaded the IMDb ratings data, and started experimenting with some graphs. It was a bit difficult to compare the two using just the IMDb data (season/episode/rating) - I could compare the ratings, but that didn't really show me what I wanted. One show has been around for several years, whereas the other has only recently started, and I wanted to somehow include that in my analysis. Therefore I looked up the date each episode aired, and added that to the data. And after much experimentation, I came up with the following plot that allows me to compare the two shows both chronologically and quantitatively.


You can click the graph image above to see the full size interactive version, with hover-text showing each episode name/number, and drill down links to find out more details (such as summaries and reviews) about each episode.

Are you a fan of these shows? Which series do you prefer (and why)?

How long do you think Zombie fever will last, and is there some data that you're basing your theory on? :)

Post a Comment

Bringing order to holiday shopping chaos

The holidays are sometimes chaotic, especially for those tasked with analyzing consumer shopping data! I would like to share a few tips on adding order to your chaotic data.

SAS recently published an interesting article, sharing the results from a consumer survey. The infographics in the article showed high-level summary information, but there was also a link to the survey results in a more raw/numeric form. And of course I could not help but graph some of those numbers...

Several of the survey questions rated the answers on the same Agree<->Disagree scale, therefore I decided to plot those all together in a single graph. Here are the results, naively plotted using the default alphabetic ordering of all the text values, and using the default colors:


I can quickly 'see' all the data together now, but the graph doesn't really help me make sense of the data. By default, SAS picks colors that look okay together, and that are easy to discern. But since the survey allows users to choose answers on a scale from Agree<->Disagree, it would be more useful to assign colors such that Agree is green and Disagree is red. I can accomplish this by assigning colors manually in pattern statements (in the same order as the alphabetic items appear in the legend):

pattern1 v=s c=cxa6d96a; /* agree */
pattern2 v=s c=cxfdae61; /* disagree */
pattern3 v=s c=cxffffbf; /* neutral */
pattern4 v=s c=cx1a9641; /* strongly agree */
pattern5 v=s c=cxd7191c; /* strongly disagree */


The colors are now meaningful, but they're not in a logical order in the legend, or stacked in a meaningful order in the bars. To fix that problem, I assigned numeric values (stack_order) in a data step, and then plotted by the numeric values instead of the text.

if response='Strongly Agree' then stack_order=1;
if response='Agree' then stack_order=2;
if response='Neutral' then stack_order=3;
if response='Disagree' then stack_order=4;
if response='Strongly Disagree' then stack_order=5;


Now the colors in the legend and bars are in a logical order, but there's still a bit of 'non-order' in the graph. The questions/statements are still in their default alphabetic order, which doesn't really benefit us. Therefore I assigned a numeric value to each question, based on how much the users agree/disagree with it (specifically, the value was based on the middle value of the 'neutral' colored segment). I then plotted the graph by these numeric values, rather than the question text:


Now the graph looks very sharp, and logically-ordered ... but the numeric values don't tell us much about the questions and the answers. So I used a little trick called user-defined-formats to make those numeric values show up as the desired text. Here's the code that creates the user-defined format for the legend:

proc sql;
create table foo as select unique stack_order as start, response as label from tran_data;
quit; run;
data control; set foo;
fmtname = 'stackfmt';
type = 'N';
end = START;
proc format lib=work cntlin=control;

Now we have a wonderful graph, where the bar segments (colors) are stacked in a logical order, and the questions themselves are even ordered in a logical way:


So, were any of these survey results a 'surprise' to you? What other questions do you think would be interesting/useful to add to the survey?

Post a Comment

SAS takes your word for it

Rotman_2Last year, 50 companies came knocking on the doors of Rotman School of Management in Canada to recruit Masters of Business Administration (MBA) grads with SAS skills. This year, Rotman partnered with SAS to offer SAS Programming to interested candidates.

Last Sunday, I had the amazing opportunity to teach SAS Programming to 60 students at the university. The country’s future leaders, some of the brightest minds, gathered together on this cold Sunday morning. Even skipping out on the Santa Claus parade to learn SAS. You’re probably wondering, “Why on Earth did you pick Sunday to teach?” It was to work with their timing, given their super busy schedules.  I experienced great commitment from all of the 60 students that showed up. They had many questions and were completely engaged.

A big question for them was SAS behavior when they tried to group sorted data. Take a look below:

1. We sorted the sales dataset BY country (in default ascending order) and within that BY Salary in descending order (using the descending keyword guarantees the order)

19162  libname orion 'c:\workshop';
NOTE: Libref ORION was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\workshop
19163  proc sort data=orion.sales
19164             out=work.sales;
19165     by Country descending Salary;
19166  run;
NOTE: There were 165 observations read from the data set ORION.SALES.
NOTE: The data set WORK.SALES has 165 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.06 seconds
      cpu time            0.01 seconds

2. Quick peek at the sorted dataset


That looks about right. Data is in order BY country (primary sort key) and within that BY descending salary.

3. Students then submitted this code to group data BY salary

19167  proc print data=work.sales noobs;
19168      by salary;
19169  run;
ERROR: Data set WORK.SALES is not sorted in ascending sequence. The current BY
       group has Salary = 108255 and the next BY group has Salary = 87975.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.SALES.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.33 seconds
      cpu time            0.01 seconds

But why did SAS complain?

Because SAS assumes that you are telling the truth. SAS looks at the BY statement in your PROC PRINT. Any time it sees a BY statement in either a procedure (PROC) or a DATA step, the assumption SAS makes is that the data is sorted by that variable.

“Haven’t I sorted the data by salary,” asked one of the students.

Yes you have, but it’s not the primary sort order. You have to respect that. See the data again. When SAS sees your PROC PRINT step with a BY salary, it expects your data to be sorted BY Salary in ascending order. But the data was not in that order, the previous PROC SORT you specified sorted the data by Country and within that BY descending salary.

That was the mismatch for SAS. On one hand it wants to believe you. That’s why it tries to group the data by salary. But when SAS looks at the data, just like you did in the data grid, it sees that the first row Salary value is 108255. And the next row salary has a value of 87975.

This is when it says, “Wait a minute.” The sales data is not in ascending sorted order BY salary. And so it sighs, throws up its hands and comes to a complete stop.

This is why it’s really critical that you know your data before you work with it. Try a PROC CONTENTS before to check the sort order. And then at the very least at least try to respect the primary sort order key. What that means is that if you sorted your data BY country, gender, and descending salary, try to repeat this order in your BY statement. Or at the very least try to respect the primary sort order. You can safely get away with a BY statement order of just country. Or country and gender.

Hope this helps you understand SAS a little bit better and the integrity with which it operates.

The students also asked for some yoga in between to wake up their minds. We moved in unison to wake up mind and body. I’ll leave you with a super brain yoga tip.

Try it anytime you feel an energy slump and want to wake up a tired brain. This is an amazing technique, and if practiced regularly can get you in the flow experienced by top performing athletes. Take my word for it!!


Post a Comment

How a software geek prepares for the holidays

I'm not really into traveling and eating with family at Thanksgiving, but what are the local restaurant alternatives? The list was a bit overwhelming, so I used SAS to help analyze my options...

My holiday meal fall-back has always been the Waffle House - they're open 24/7, 365 days a year, and I like their kind of food (what can I say - I'm a "cheap date"!) But what if I want a more traditional Thanksgiving dinner?

I did a few Web searches, and found some lists of restaurants that are open on Thanksgiving day and offering a special/traditional menu. But I didn't recognize the names of many of the restaurants, nor their street addresses, therefore I didn't know what part of town they were in (I'm downtown-averse, for example), or how far they might be from my house. I thought about plugging each address into Google to plot a map and get a driving time estimate ... but that seemed like a lot of manual work. So, to solve this challenge, I turned to SAS software ...

I entered all the addresses into a SAS dataset, ran them through Proc Geocode (to estimate their latitude/longitude), and then plotted all the restaurants as markers on a street map. I set up each marker with html hover-text so I could easily see the name of the restaurants, and I set up html href drilldowns for each marker to launch a Google search (which would give me easy access to each restaurant's Web page, and also online review sites, etc).

Here's a thumbnail of the map - click it to see the full-size interactive version:


Here are some of the cool technical details:

I’m not really plotting the markers on a traditional SAS map (Proc Gmap), but rather I’m plotting them against a background of several map images (slippy map tiles). Also, I wanted my markers to show up brighter, and the map to be more subdued, therefore I annotated an alpha-transparent white polygon on top of the map to dim it, and then plotted my markers on top of that. ODS HTML created the hotspots for each marker, and set up a html page with my hover-text and drilldown tags. And now that my code is set up, I could easily swap out the data and re-use the code to plot something else!

Any other programmer geeks out there who use software to help prepare for the holidays? Feel free to share your geekiness in a comment! :)

Post a Comment

We will remember them

Every year, on 11 November at 11 am – the eleventh hour of the eleventh day of the eleventh month – we pause to remember those men and women who have died or suffered in all wars, conflicts and peace operations. Therefore November 11 is also known as Remembrance Day, a memorial day observed in Commonwealth of Nations (formally known as the British Commonwealth) member states since the end of the First World War to remember the members of their armed forces who have died in the line of duty.

Remembrance Day has a special meaning for me as I grew up in Flanders Fields, the Belgian region where the First World War saw more than 500,000 soldiers killed. Every morning when I went to school, I passed underneath the Menin Gate Memorial to the Missing.

DSC02607 (2)

The Menin Gate Memorial to the Missing is a war memorial in Ypres, Belgium which bears the names of more than 54,000 officers and men from United Kingdom and Commonwealth Forces (except New Zealand and Newfoundland) who fell in the Ypres Salient before 16th August 1917 and who have no known grave.

Even though I was not a data scientist at that time yet, I have always wanted to know more about those engraved names. As the First World War started now 100 years ago, in 1914, I thought it was time for an investigation and found what I had been looking for on the Commonwealth War Graves Commission website.

DSC02614 (3)

Not only did I find all the names of the casualties, I also learned about their country of origin, their date of death, their age and their rank in the army. I loaded these data into SAS Visual Analytics in order to quickly gain some insights.

We will remember... their nationalities

A simple pie chart teaches us that about 75% of the deceased came from the United Kingdom, there were about 10% Canadians and 10% Australians and about 1% of the engraved names on the Menin Gate are Indian and 1% South African.


We will remember... their date of death

Secondly, I created a line chart with the date of death on the X-axis. What immediately struck me, is the peak on 31/07/1917. Some research told me that the battle of Passchendaele started on that day. It became infamous not only for the scale of casualties, but also for the mud.

Another insight this chart is giving us, is that the British were present from the beginning of the war in 1914 until the end while there seems to be a shift for the others: first the Indian troops, than the Canadian forces, followed by the Australians and finally the South Africans.

We will remember... their ranks

The fatalities on the Menin Gate are associated with 63 different ranks but five of them represent 93% of the total: Private (70%), Lance Corporal (8%), Rifleman (7%), Sergeant and Corporal (both 4%). In the bar chart below we see how the countries stack up.


We will remember... their age

Although the age of 19 was the legal limit for armed service overseas in the United Kingdom, many young boys served their country in the First World War. When we look at the distribution of the age, we clearly observe a heavily skewed distribution.

Although about half of the values of age are missing, the box plot below is a good indicator for the spread of the ages among the different ranks. The youngest victims were the riflemen, with an average age of 25. The sergeants were the “oldest” when they died at the age of 28 on average. The other ranks (private, corporal and lance corporal) were on average 26 years old when they lost their life in the battlefield.


I would like to conclude this post with an extract from “For the Fallen”, a Poem by Robert Laurence Binyon (1869-1943), published in The Times newspaper on 21st September 1914.

They shall grow not old, as we that are left grow old:
Age shall not weary them, nor the years condemn.
At the going down of the sun and in the morning
We will remember them.

DSC02608 (2)

Post a Comment