Solving Sudoku with SAS/IML – Part 1

Sudoku solvers have been written in SAS using a variety of methods (e.g., the DATA step, PROC SQL, and PROC CLP). Surprisingly, SAS/IML appears to have been overlooked for this purpose. On a challenge from a coworker, I wrote this blog post to demonstrate the flexibility of SAS/IML in the context of solving Sudoku puzzles. This topic is split into two parts. Part 1 (this post) describes how Sudoku puzzles can be treated as exact cover problems and, in many cases, solved with simple logic. Part 2 describes how advanced Sudoku puzzles can be solved through a combination of the simple solver presented in part 1 and an efficient backtracking algorithm developed by Donald Knuth (which he labeled Algorithm X). Part 2 will be released in a separate post in the next two weeks.

Background

A Sudoku puzzle consists of a partially filled-in 9x9 grid. The grid contains 9 rows, 9 columns, and 9 3x3 boxes. Each cell must contain a single integer between 1 and 9. Each row, column, and box must contain every integer between 1 and 9.

Exact Cover

Before examining how Sudoku can be represented as an exact cover problem, let’s discuss exact cover problems generally. Consider a set containing the integers 1 through 9:  X = {1,2,3,4,5,6,7,8,9}. Now consider a collection of subsets of X, A-F.

A = {1,3,9}
B = {1,2,3}
C = {2,8,9}
D = {4,5,6}
E = {2,7,8}
F = {1,5,8}

Each subset contains some, but not all, of the numbers in X. The solution of an exact cover problem is the collection of subsets that represent every number in X exactly once. Because subsets A, D, and E represent every value in X once and only once, the subcollection {A,D,E} is said to be an exact cover of X.

A = {1,3,9}
D = {4,5,6}
E = {2,7,8}

Read More »

Post a Comment

Using SAS to help locate the 100 best restaurants in the US

I recently read an article that listed the 100 best restaurants in the US - but the article didn't have a map. I decided to use my SAS skills to change that!

When it comes to restaurants, I eat out a lot (and by 'a lot' I mean I never eat at home, lol). My personal favorites are Bojangles, Ole Time Barbecue, and D&S Cafeteria - you can probably tell I don't have the most sophisticated dining tastes, but I do like good food! Every food blog needs a photo of a great meal to get you in the mood, and here's a picture from my foodie friend Claudia - now that's some good eatin'! ...

claudia_dinner

And now, back to the Top 100 list...

When I recently saw OpenTable.com's article I thought it would be interesting to check their list and see how many of the restaurants were close to where I live, and see if I had perhaps eaten at any of them. When I tried to check their list (which was sorted alphabetically by restaurant name), it was a daunting task!  It contained pictures of all the restaurants, and their names & addresses - but there was no map, and no way to easily narrow it down to a certain geographical area (such as the area where I live, or the areas I've visited). Here's a screen-capture of how they present the info:

open_table_list

What would a SAS Graph guy do in a situation like this?...

You guessed it! -- I saved the html code from the web page into a text file, and wrote some SAS code to parse the restaurant names & addresses out of the html. I then used Proc Geocode to estimate the latitude & longitude of each restaurant, and programmatically annotated markers at those locations on a SAS/Graph gmap. You can hover over the markers to see the restaurant names, and click on the markers/states to see a test list of all the winning restaurants in that state. And in the text list, you can click the restaurant name to see the opentable.com information about that restaurant.

Here's a snapshot of my SAS map - click it to see the interactive version with html hover-text and drilldowns:

best_restaurants_2014

With my SAS map & drilldown table, I was able to quickly see that 6 North Carolina restaurants made it into the Top 100, and 5 of those 6 are very close to Cary (where I live). One is actually right beside SAS headquarters - Herons - and I have eaten there!

How many of the Top 100 have you eaten at?  If you ever come to SAS headquarters, perhaps you can eat at Herons and add it to your list :-)

Post a Comment

Most common New Year's resolutions

Can you guess the 5 most common New Year's resolutions? Compare your guesses to the survey results!...

Being a data meister, I'm always on the lookout for seasonal data. I recently did a Web search for "new year's resolutions" and found an interesting article on the You Gov UK website. Of the 30% of British adults who make New Year's resolutions, the article put the resolutions into 14 categories, and created the following bar chart:

resolutions

The chart "drew me in" but it took too much studying & concentration to make sense of it. First I wondered about the numbers - were they the number of people, or a percent value? And then I started trying to match up the bar colors with the color legend, and found that to be quite a chore (with 14 colors to try and keep track of).

So, you guessed it - I did a SAS makeover of the graph, where I keep it simple and easy to read!

new_years_resolutions

Were you able to guess the "top 5" resolutions?

What others do you think should be in the list?  And do you have any unique resolutions you'd like to share?!?

Post a Comment

Is more alcohol consumed during winter months?

The BACtrack mobile breathalyzer company recently published a report purporting that "most alcohol consumed during winter months." I wondered if the data would tell the same story, after a slightly different interrogation ...

alcohol_picture

Here's a portion of BACtrack's calendar chart of BAC levels (click it to see the full-size version). They used a tall up/down orientation (going from Nov 2013 at the top, to Dec 2014 at the bottom), and a 2-color scheme where shades of blue represented the lower BAC levels, and shades of brown represented the higher values.

calendar_tableaucalendar_legend_tableau

I think a calendar chart is great for this sort of data, and I used the same type of chart in my SAS analysis. But instead of using a tall vertical orientation, I used a horizontal calendar (with a separate row for each year of data). This allows the chart to be easily seen all at once, with no scrolling. It also allows you to more easily compare one year to another. And rather than a 2-color scheme, I went with shades of a single color (which I think is more appropriate and easier to visually grasp).

I experimented with several different ways to do the color/legend binning, and discovered something interesting - a suspect-bad value in the data. Using "choro bac / levels=5 midpoints=old" I got the following chart, which shows that Sept 28, 2014 has a much lower value than any other day. That day had a BAC value of 0.000 (zero) ... which I think is highly suspect/unlikely.

alcohol_level_bad_data

For my final chart, I decided to go with custom legend-ranges. Anything below 0.06 would go into the lowest/lightest shade, and each increase in color shade represents an increase of 0.01 in BAC.

alcohol_level

At first glance, the winter months do seem to have higher BAC levels. But upon closer examination, I think there might be other factors at work... If you look closely at the November values, you'll notice that November 2014 had much lower values than November 2013. So perhaps the more recent (non-winter) months don't have lower values because they're non-winter, but because they're more recent?

One possible explanation for the decreasing BAC values in more recent months might be that the BACtrack users are training themselves to stay below the drunk-driving limit. Another possible explanation might be that when the BACtrack app first came out, people were playing with the device and intentionally trying to see how high a BAC value they could blow (a neat "frat party trick" to impress your friends?) - and maybe now that the novelty has worn off people just use it to monitor their normal drinking.

Now it's your turn! - What other factors might be affecting this analysis? What further analyses do you recommend, and what do you think the data would say?

 

Post a Comment

Santa's information dashboard - Version 2.0

How does Santa keep track of everything on his big night? I can't confirm or deny that he uses SAS software -- but if he does, it would look a lot like this! ...

A couple of years ago, I blogged about a prototype dashboard that I wrote for Santa. This year, I've enhanced the dashboard to include some recently-added SAS animation functionality. Now there is an "Animate!" button on the map pane of the dashboard, and when you click it you'll see Santa's secret sleigh route through that continent animated. Click the image below to try out the dashboard, and see the animations:

santa_93

When you're viewing the dashboard, click a continent in the upper/left map, and you'll see that continent's route in the large (top/right) map pane. You can hover your mouse over the map areas to see the country names, and click on them to launch a Google search for Christmas traditions in that country. The naughty/nice bar chart is also updated for each continent. And don't forget to scroll down to the reindeer stats in the bottom/right, to see whether the reindeer have met their target in several performance categories.

Feel free to share this dashboard with all the little kids you know ... and even the big kids you think might enjoy it!

Post a Comment

Where to find the cheapest gas

"Over the river and through the woods, to grandmother's house we go" ... and how much should we expect to pay for gasoline during the trip? Read this blog to find out!

I recently wrote a blog post about some of the factors that impact the price of gasoline. The blog had some interesting information, but when it comes down to it what we really want to know is how much will we be paying for our next tank of gas!

You've probably got your favorite low-price gas station in your local city, but did you know that the prices vary widely from state to state? According to AAA, the state averages for regular grade gasoline range from $2.25 to $3.70 per gallon today! Therefore, if you're going to be driving on a long trip, you might be able to save some money by planning your route, or the timing of your fill-ups, so you're buying gas in one of the lower-priced states.

To analyze & visualize the data in SAS, I wrote a little program that reads in the HTML code of the AAA Web page, and parses out the state names and gasoline prices, and then plots the prices on a map, color coded by quintiles (each 1/5 of the states are a different color, based on their gasoline price):

state_gasoline_prices

The above map is great for quickly determining which states have higher and lower gasoline prices. But what about the 'impact' of those prices? I thought it would also be interesting to know how many people the high & low prices are affecting. For example, although Wyoming is a fairly large state geographically (and therefore has a lot of 'red' in the map above), the population is low, and therefore the high gasoline price in Wyoming doesn't affect many people.

In order to visually 'weight' the map by population, I plotted the data using a rectangular tree map, where the size of each rectangle represents the number of people in each state. Now it's easier to see, for example, that Texas has a low gasoline price (bright green), and also has a large population (large rectangle size).

gasoline

 

Now that you've seen the data, what state do you plan to fill up in, while traveling over the holidays?

Post a Comment

10 most annoying airline seatmate behaviors

I recently saw an interesting infographic about the top 10 most annoying airline seatmate behaviors. I decided to try to create a more efficient SAS graph to visualize the same data, and share it with you in this blog.

Not everyone is as fortunate as my buddy David (pictured below), who has a pilot license, and does not have to worry about annoying behaviors of random seatmates.

dave_pilot

The rest of us non-pilots have to fly on commercial airlines, and the people sitting around us have a variety of behaviors - some good, and some annoying. Expedia's infographic about annoying seatmates was 'cute,' but I had to do so much scrolling to view the 1000x4515 pixel image that by the time I viewed the last few items, I had forgotten what the first few items were!

Therefore I decided to create a no-nonsense SAS graph that allows me to see and compare all the data, in a graph that easily fits on one page without scrolling. I first input the data into a SAS dataset, and then used the following minimal code to produce the default graph. Here's the code and the output:

proc gchart data=my_data;
hbar category / type=sum sumvar=percent;
run;

airplane_annoyances

That's a decent graph, and allows me to easily see and compare the 10 items ... but it could be a lot better with a little more work. Using several built-in options, and specifying characteristics in axis statements, I was able to get a graph that was much better looking, and also easier to compare the 10 items:

pattern1 v=s c=cx00B2EE; /* blue color */
 
axis1 label=none value=(j=right) offset=(4,4);
axis2 label=none major=none minor=none style=0;
 
proc gchart data=my_data;
hbar category / type=sum sumvar=percent descending nostats
maxis=axis1 raxis=axis2
noframe space=0 width=3.7 coutline=white
autoref clipref cref=graydd;
run;

airplane_annoyances1

That's a fine graph, but there is always room for improvement! I found myself looking from left-to-right quite a bit, to see the bar text and then the bar height. So I thought, why not print the text inside the bar, so I don't have to look all the way on the left side of the graph? There's no built-in option to do that, but I suppressed the default text labels by using value=none on the axis statement, and then used data-driven annotate to write the text on the bars. Depending on your preference, either of these two graphs might suit your needs.

data my_anno; set my_data;
length text $100;
xsys='2'; ysys='2'; hsys='3'; when='a';
midpoint=category; x=percent;
function='label'; position='<'; style='albany amt/bold';
text=trim(left(category))||'a0a0'x;
run;
 
axis1 label=none value=none offset=(4,4);
axis2 label=none major=none minor=none style=0;
 
proc gchart data=my_data anno=my_anno;
hbar category / type=sum sumvar=percent descending nostats
maxis=axis1 raxis=axis2
noframe space=0 width=3.7 coutline=white
autoref clipref cref=graydd;
run;

airplane_annoyances2

Do you agree with Expedia's "Top 10" list?  What's the most annoying behavior you've personally seen on a plane?

Post a Comment

Top 10 SAS Training Post blogs

December is all about traditions. Some of mine include holiday shopping, baking (I really mean eating) Christmas cookies and putting together my annual list of most read blogs on the SAS Training Post.

So as traditions go… here’s my list of the top 10 most read blogs in 2014.

  1. How to land a job as a SAS professional by Charu Shankar
  2. How to handle percent (%) values in SAS by Robert Allison
  3. Zero to SAS Certified Base Programmer in 3 months (Part 1) by Mark Stevens
  4. Top 6 recommendations for SAS Certification preparation by Michele Reister
  5. The Bayes theorem, explained to an above-average squirrel by Michele Reister
  6. Airplanes that have disappeared without a trace by Robert Allison
  7. Free SAS Software for students by Robert Allison
  8. Top 10 reasons to get SAS certified by Terry Barham
  9. Behind the scenes: Statistical Business Analysis certification exam by Mark Stevens
  10. Zero to SAS Certified Base Programmer in 3 months (Part II) by Mark Stevens

We hope you've been able to learn something new, and maybe even laugh a little with us here at the SAS Training Post over the last year. Please let us know if you have any suggestions for 2015 topics.

And since we’re talking traditions. Do you have any holiday traditions you look forward to each year? Leave me a message in the comment section below.

Post a Comment

Plotting the Corruption Perceptions Index using SAS

I recently found some very interesting data - a numeric Corruption Perceptions Index for each country. And of course I just had to plot that data on a SAS map! Here's my map (click the image below to see the interactive map, with html hover-text over each country, and buttons/links to select the desired year):

corruption_perceptions_2014

 

Technical Details:

Some of you might be satisfied with just looking at the output, but I know that many of you are, like myself, data meisters ... and you feel compelled to know all the details of how I created this map (so you can create similar maps using your own data). OK - here goes...

First, let's look at the original map on the transparency.org site. They mapped all the values using a yellow-orange-red gradient, the map had hover-text for each country, and you could zoom and pan the map. I didn't really like the color gradient, because it was difficult trying to visually match a country's shade to the continuous legend. Also, all the colors in the gradient seemed to represent bad/caution (ie, the countries with a 'good' score still looked 'bad' on the map). The pan & zoom capability seemed nice at first, but in practice the features seemed to get in the way more often than they were useful. And the way the hover-text was laid out, and also having different font sizes for different pieces of text, made it difficult to quickly read. Here's a screen-capture of the original map (click it to see the full size interactive version):

corruption_original

 

To create my own version, I first had to get the data. I could click the 'Download Info Package' button and it contained a spreadsheet of the data, but that was only for 1 year - whereas I was more interested in the 3 years of data they showed in their table. So I copy-n-pasted their table into a text file. But rather than a country's values (rank, country, and 3 years' scores) all being on one line, each cell in the table came out on a separate line, such as:

1
Denmark
92
91
90

Being very flexible, SAS can handle that! Rather than the usual single input statement, I simply used 5 input statements per each observation:

data my_data;
length idname $80;
infile datafile lrecl=80 pad;
input rank;
input idname $ 1-80;
input score_2014;
input score_2013;
input score_2012;
output;
run;

Then, to make my data a little more manageable, I wanted to have 1 variable for score, with separate values for each year (rather than a separate variable for each year's score). There are a couple of ways to accomplish this (such as proc transpose), but I took the simple brute force method using a data step:

data plot_data; set my_data;
year=2012; score=score_2012; output;
year=2013; score=score_2013; output;
year=2014; score=score_2014; output;
run;

Now for the tough decision - how to color my map. I first tried the Proc Gmap default of quantile binning (where 1/5 of the countries were assigned to each color bin), but I didn't really like how the bins changed from year to year. So I decided to manually set which scores went into which color bin. I didn't really know what scores were considered 'good' and 'bad', but after a bit of experimentation I decided to go with 5 evenly-spaced bins, with a range of 20 in each. I used a data step to assign each country to one of these 5 bins (or buckets), and then used a user-defined format to make those bins (1-5) show up as the desired text in the legend:

proc format;
value ranges
1 = 'ge 80'
2 = '60-79'
3 = '40-59'
4 = '20-39'
5 = 'lt 20'
;
run;
 
data plot_data; set plot_data;
format bucket ranges.;
if score ge 80 then bucket=1;
else if score ge 60 then bucket=2;
else if score ge 40 then bucket=3;
else if score ge 20 then bucket=4;
else bucket=5;
run;

I then wrote a macro that created a plot for each year (with the _year being part of the filename), and used the note statement to print the 'year' buttons at the bottom of the page, allowing users to easily select the desired year:

note move=(50,3)  font='albany amt/bold' "Select Year:";
note move=(60,3)  link='corruption_perceptions_2012.htm' box=1 "2012";
note move=(65,3)  link='corruption_perceptions_2013.htm' box=1 "2013";
note move=(70,3)  link='corruption_perceptions_2014.htm' box=1 "2014";

 

Feel free to download the full SAS code, and experiment with it!

Have you visited or done business with any of the orange or red countries? Do you think the perception of these countries is fair? Or do your experiences lead you to feel otherwise?

Post a Comment

How to eliminate data error notes from the SAS log

saslogA student in a SAS class recently asked if there were a way to eliminate data error notes from the SAS log and, instead, write them to a separate file.  Of course there's a way!

Here's a simple datastep.  Notice the missing dollar sign to indicate the variable GENDER (M, F) is a character variable.

data class;
	infile 'c:\temp\class.csv' dsd;
	input name $ gender age;
run;

We've all seen those ugly data error notes in the SAS log!

562  data class;
563     infile 'c:\temp\class.csv' dsd;
564     input name $ gender age;
565  run;
 
NOTE: The infile 'c:\temp\class.csv' is:
      Filename=c:\temp\class.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=236,
      Last Modified=03Dec2014:12:46:20,
      Create Time=03Dec2014:12:46:20
 
NOTE: Invalid data for gender in line 1 8-8.
RULE:     ----+----1----+----2----+----3----+----4----+----5         
1         Alfred,M,14 11
name=Alfred gender=. age=14 _ERROR_=1 _N_=1
NOTE: Invalid data for gender in line 2 7-7.
2         Alice,F,13 10
name=Alice gender=. age=13 _ERROR_=1 _N_=2
NOTE: Invalid data for gender in line 3 9-9.
3         Barbara,F,13 12
name=Barbara gender=. age=13 _ERROR_=1 _N_=3
NOTE: Invalid data for gender in line 4 7-7.
4         Carol,F,14 10
name=Carol gender=. age=14 _ERROR_=1 _N_=4
NOTE: Invalid data for gender in line 5 7-7.
5         Henry,M,14 10
name=Henry gender=. age=14 _ERROR_=1 _N_=5
NOTE: Invalid data for gender in line 6 7-7.
6         James,M,12 10
name=James gender=. age=12 _ERROR_=1 _N_=6
NOTE: Invalid data for gender in line 7 6-6.
7         Jane,F,12 9
name=Jane gender=. age=12 _ERROR_=1 _N_=7
...
NOTE: 19 records were read from the infile 'c:\temp\class.csv'.
      The minimum record length was 9.
      The maximum record length was 12.
NOTE: The data set WORK.CLASS has 19 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

Suppressing the data error notes in the SAS log is easy!

options errors=0;

But the above approach only masks the bad news.  That is why the student wanted to write these notes to a separate file.  Here's how!

data class;
	infile 'c:\temp\class.csv' dsd;
	input name $ gender age;
	if _error_=1 then do;
		file 'c:\temp\MyInvalidDataNotes.txt';
		put 'NOTE: Invalid data in line ' _N_;
		put _infile_;
		put _all_;
		put;
	end;
run;

In SAS, there's always a way!

For more datastep tricks, you may wish to attend SAS Programming 2: Data Manipulation Techniques.

Post a Comment