Solving Sudoku with SAS/IML – Part 2

Figure 1

Part 1 of this topic presented a simple Sudoku solver. By treating Sudoku as an exact cover problem, the algorithm efficiently found solutions to simple Sudoku problems using basic logic. Unfortunately, the simple solver fails when presented with more difficult Sudoku problems. The puzzle on the right was obtained from Sudoku Garden and is reproduced here in its original form under the Creative Commons Attribution license. The basic solver manages to solve the majority of the puzzle before giving up. This post demonstrates how the entire puzzle can be solved through a combination of the simple solver from Part 1 and an adaptation of a backtracking algorithm called Algorithm X. The solver can also solve the world’s hardest Sudoku puzzles, but those puzzles are not reproduced here due to copyright.

Adaptation of Algorithm X

Figure 2

Algorithm X is an efficient backtracking algorithm for solving exact cover problems (Knuth, 2009). The algorithm forms a search tree with columns of the exact cover matrix forming the nodes of the tree and rows of the exact cover matrix forming the branches of the tree. The algorithm navigates the search tree until a solution is found, and then terminates. The adaptation of Algorithm X used in this post is a bit convoluted, so it is described only in the context of the example puzzle above.
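For readers who have not met Algorithm X before, here is a minimal Python sketch of the textbook algorithm on a tiny exact cover problem. It is not the modified IML version used in this post. The function names (`solve`, `select`, `deselect`, `exact_covers`) and the six-row example matrix are assumptions made purely for illustration.

```python
def select(X, Y, r):
    """Choose row r: remove every column it covers and every clashing row."""
    removed = []
    for j in Y[r]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)
        removed.append(X.pop(j))
    return removed


def deselect(X, Y, r, removed):
    """Undo select(): restore columns and rows in reverse order (backtrack)."""
    for j in reversed(Y[r]):
        X[j] = removed.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)


def solve(X, Y, solution=None):
    """Yield each exact cover as a list of row names."""
    if solution is None:
        solution = []
    if not X:                       # no columns left: every column is covered
        yield list(solution)
        return
    # Node of the search tree: the column with the fewest candidate rows.
    c = min(X, key=lambda col: len(X[col]))
    for r in list(X[c]):            # branches: the rows that cover column c
        solution.append(r)
        removed = select(X, Y, r)
        yield from solve(X, Y, solution)
        deselect(X, Y, r, removed)
        solution.pop()


def exact_covers(rows):
    """Build the column -> rows index and return all exact covers."""
    X = {}
    for r, cols in rows.items():
        for c in cols:
            X.setdefault(c, set()).add(r)
    return [sorted(s) for s in solve(X, rows)]


# Hypothetical 6-row, 7-column exact cover matrix (illustration only).
rows = {
    "A": [1, 4, 7],
    "B": [1, 4],
    "C": [4, 5, 7],
    "D": [3, 5, 6],
    "E": [2, 3, 6, 7],
    "F": [2, 7],
}
print(exact_covers(rows))  # [['B', 'D', 'F']]
```

Rows B, D and F together cover each of the seven columns exactly once. A Sudoku puzzle uses the same idea with a much larger matrix, where each row is a (cell, digit) placement and each column is a constraint.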

Running the basic solver on the puzzle at the beginning of this post yields the grid on the right. Having failed to obtain a solution using the basic solver, the IML program then calls my modified version of Algorithm X. The search tree formed by the algorithm is shown below. Note that this tree is a graphical representation of the process taken by the algorithm, and is not generated by the IML code.


Figure 3

Which North Carolina state park is trending?

North Carolina is one of those lucky states that has a huge variety of scenic destinations, such as mountains, piedmont, coastal plains, beaches, and 'outer banks' islands. We have state parks in all of these areas, but can you guess which state park has been trending the most during the past 10 years?

If you guessed the one that is right here within sight of the SAS headquarters in Cary, you guessed right! Umstead park is a 5,579-acre forest (just across the road from SAS), with a few hills/rocks/streams, and lots of hiking trails. The park attendance has grown from ~500k in 2004, to over 1.25 million in 2014.

Here's a picture of a hiking trail that my friend Jennifer took. It's very similar to the trails in Umstead park, but this one is actually in the Occoneechee/Eno River park, on the other side of the RTP:


You might be wondering what data I'm using to determine which park is trending. Well that was a bit of a challenge ... Each year, North Carolina State Parks publishes the annual totals, but it is in a jpg image of a table. Here's a screen-capture of a portion of the 2014 table:


These annual tables were not stored in one central location, so I had to do several web searches to find all the tables from previous years. I then visually read the numbers from the jpg images, and manually entered each value as text into a file that I could import into SAS (hopefully not making any typos!). To help error-check my SAS dataset, I calculated the grand total for each year, and compared it to the annual totals at the bottom of the jpg image tables ... and yeah, I found that I had made a few typos entering the data by hand. I fixed my typos, and then was ready to plot the data!
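The grand-total check described above is easy to sketch. The Python below is only an illustration of the idea, not the SAS code I used; the function name `find_total_mismatches`, the park names, and all of the numbers are made up for the example.

```python
def find_total_mismatches(attendance, published_totals):
    """Return {year: (computed_total, published_total)} where the two differ."""
    mismatches = {}
    for year, by_park in attendance.items():
        computed = sum(by_park.values())
        if computed != published_totals[year]:
            mismatches[year] = (computed, published_totals[year])
    return mismatches


# Hypothetical hand-entered values (NOT the real park attendance data).
attendance = {
    2013: {"Park A": 100, "Park B": 250},
    2014: {"Park A": 120, "Park B": 310},  # suppose 310 was mistyped for 300
}
# Hypothetical totals read from the bottom of each year's table.
published_totals = {2013: 350, 2014: 420}

print(find_total_mismatches(attendance, published_totals))  # {2014: (430, 420)}
```

A mismatch only tells you which year contains a typo, not which park, so you still have to re-read that year's table. Two typos that cancel each other out would also slip through.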

I started with something simple - a bar chart of the current year's data. I wanted to be able to easily relate the data in this chart to my other charts, so I added a bit of color-coding. I made Umstead park red, and since there seemed to be a 'natural divide' in the data, I also shaded the other higher-attendance parks darker than the lower-attendance ones:


Now that we've seen the data for 2014, how about a similar plot for all the years? I used a stacked bar chart, and color-coded it similarly to the single-year bar chart. This shows that total park attendance was increasing from 2004 to 2007, and then took a dip in 2008 (maybe because of the 'great recession'?). Attendance leveled off from 2009 to 2013, and then increased again in 2014.


The stacked bar chart 'hints' that Umstead's attendance was generally increasing, but it's difficult to compare it to the other parks. So let's also plot the data in a line chart, which will make it easier to compare all the parks:


Once again, I made the low-attendance parks light gray, and Umstead red. But it was difficult to follow the lines for other high-attendance parks when they were all dark gray (because there are several places where the lines intersect), so I used a different color for each of them. From this plot, it is easy to see (at least for the high-attendance parks) that Umstead's attendance is definitely trending upward, faster than the other parks.

How high will Umstead's attendance go? Well, we could make a forecast based on the past attendance values and the current trend ... but that doesn't take into account other things, such as limiting factors. For example, the parking lots are almost at capacity these days when I visit the park, so maybe the park is nearing its capacity (unless they build additional parking lots and/or entrances)? I guess time (and more data) will tell!

What's your favorite state (or other) park? Do you prefer parks with a lot of people, or fewer people?


Is Google Fiber coming to your city?

Google recently announced that they will be adding Google Fiber high speed network and TV to my area. This was great news, because it will give us more choices ... and a little competition among providers tends to make them all 'try harder' to please the customer. :-)

I was curious what other areas have Google Fiber, so I did a few web searches and came up with a couple of maps. But neither of them allowed me to quickly/intuitively 'see' what I wanted to see.

Here's the map from the Washington Post:


And here's the map from Google itself:


All I can really tell by glancing at those maps is the city locations. I have to study them intently, and read the color legend, and match up the (non-intuitive) markers to the color legend, to determine which are current/planned/potential Google Fiber cities.

So, of course, I set out to try to create a better map with SAS ... In my map, I wanted to make it easy to identify the cities that currently have Google Fiber, therefore I made their marker the brightest, and also added a check-mark in it. I made the 'planned' cities slightly lighter, with no check-mark. And the potential cities are just a light gray. Also, if you click the snapshot below and view the interactive version of my map, it has html hover-text and you can click on the markers to launch a Google search for more information about Google Fiber in that city.



What do you think of my new version of the map? What other changes & enhancements would you recommend?


Eating liver and prepping data: How are they similar?

This week, I finally ate some liver, for the first time in over 20 years - and I realized it's a lot like prepping data (which I'll explain in this blog post). Here are a few of the similarities:

  • They're both good for you.
  • Thinking about them makes you go Eiwww!!!!
  • You might dread doing them, but find "it's not so bad" once you start.

And to give you a mental image, here's a picture of my friend Becky's daughter with the "Eiwww Face" most kids make when thinking about eating liver (or prepping data) ...


And now, back to this liver that I ate ... I knew it was "good for me" but being a data person I wanted to quantify "how good". So I did a few random Web searches, and found a page with this table of data comparing liver to several other foods.

Thinking about retiring in another country?

Have you ever thought about retiring in another country, where your money might go further? Well here's some quantitative data to help you make an informed decision! ...

First, to get you in the mood, here's a picture of my friend Erik checking out the prices at a pedal-powered food cart in Thailand. Erik and his wife Joy have done more world traveling than any of my other friends, so they probably have good insight into what it might be like to retire in another country.


I recently ran across some interesting information on a website that had combined data from several different sources to come up with several indices that can be used to compare the prices of various things in different countries: Consumer Price Index, Rent Index, Groceries Index, Restaurant Index, and Local Purchasing Power Index. They let you select the data, and plot a map such as the following:


I'm glad they mapped the data (it's much easier to analyze than just a table), but I guess you could say I'm a little picky about my maps. I'm not a big fan of continuous color gradients (it's just too difficult to look at a continuous shade, and determine what value it represents, compare it to other countries, etc). I'm also not a fan of the projection they used (much of the available space is consumed by Greenland and Northern Canada ... which aren't really important in this analysis). So, of course, I decided to try creating my own maps using SAS.

I decided to go with 5 colors, and assign an equal number of countries to each - this way each color represents 1/5 of the countries (quantile binning). I also used a projection that de-emphasizes the extreme northern areas, and allows the other (more populous) areas to make use of more space. Here's the Rent Price Index map, for example (click the image to see all 5 maps):
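If you'd like to see what equal-count (quantile) binning does, here is a small Python sketch of the idea. It is not the Proc Gmap implementation, which may handle ties and uneven group sizes differently; the function name `quantile_bins`, the country labels, and the index values are all made up for illustration.

```python
from collections import Counter


def quantile_bins(values, levels=5):
    """Rank the items by value and split the ranking into `levels` equal-count bins.

    Returns {name: bin_number}, where bin 0 holds the lowest values.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return {name: rank * levels // n for rank, name in enumerate(ordered)}


# Hypothetical index values for ten made-up countries (illustration only).
rent_index = {
    "C0": 12.0, "C1": 55.5, "C2": 8.3, "C3": 97.1, "C4": 30.2,
    "C5": 41.7, "C6": 19.9, "C7": 63.0, "C8": 25.4, "C9": 74.8,
}

bins = quantile_bins(rent_index, levels=5)
print(sorted(Counter(bins.values()).items()))  # [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)]
```

Each of the 5 bins ends up with 2 of the 10 countries, no matter how unevenly the values themselves are spread. That's what lets every color on the map cover the same number of countries.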


Technical Details:

I copied the data from the page and pasted it into an Excel spreadsheet, and then used Proc Import to get the data into a SAS dataset. I used Proc Gmap to draw the map, and the levels=5 option to perform the quantile binning. You can see the complete SAS code here.

So, after reviewing the data, what country would you like to retire to? What are some other factors to consider, in addition to these indices?


What's the solar power potential in your area?

Have you ever wondered whether the area where you live is a good location for producing solar power?  Let's create a SAS map to help find out!

To get you in the right frame of mind, here is an awesome picture of some Arizona sunshine, that my good friend Eva took on one of her recent trips:


There's been a lot of buzz lately about solar power - especially now that the price of solar panels has come down. SAS has a solar farm here at the Cary headquarters, and I've seen several other solar farms popping up around our state lately.

But I got to wondering - are certain parts of the country better than others for producing solar power? Intuitively, it seems like areas that don't have many clouds or much rain would be better than areas that are generally cloudy & overcast. But how can I quantify that?

After a few Web searches, I found some data at NASA's Atmospheric Science Data Center. They let you enter a latitude/longitude, and provide an html table which contains the "Daily solar radiation - horizontal, kWh/m2/d". So I wrote some SAS code that looped through a grid of all the latitudes/longitudes I wanted to plot on a map, and then parsed the desired data out of each of those html pages and appended them to a SAS dataset (the code is pretty neat, if you want to have a look at it!).

I then used Proc Ginside to determine which points in my lat/long grid were 'inside' the US, and then used annotate to plot color-coded dots on the map to represent the solar data. I think the map came out pretty cool:
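To give a feel for the grid-and-filter step, here is a Python sketch that builds a small grid of points and keeps only those inside a polygon, using the common ray-casting test. I'm not claiming this is how Proc Ginside works internally; the function names (`point_in_polygon`, `grid_points`) and the square "boundary" are assumptions for illustration, standing in for the real map polygons.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def grid_points(start, stop, step):
    """All (x, y) pairs on a square grid from start to stop (inclusive)."""
    coords = list(range(start, stop + 1, step))
    return [(x, y) for x in coords for y in coords]


# Hypothetical square boundary (a stand-in for a real map outline).
boundary = [(0, 0), (4, 0), (4, 4), (0, 4)]

kept = [p for p in grid_points(-1, 5, 2) if point_in_polygon(p[0], p[1], boundary)]
print(kept)  # [(1, 1), (1, 3), (3, 1), (3, 3)]
```

Of the 16 grid points, only the 4 that fall inside the square are kept. Those are the ones that would get a color-coded dot on the map.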



While I was grabbing the solar data, it was also easy to grab the wind data - so I went ahead and created a wind map also. This map might indicate which areas of the country have more wind, and might be better for windmills and wind turbines:



And now for something fun - here's a video clip of me on one of my adventures, in a very windy location (hopefully you can view a .wmv file). Can you guess this windy location?!?



Panning for corporate gold requires gold standard skills

We now live in the era of ‘big data’, where data and its analysis have become crucial to the modern economy.   In fact, "big data is the new 'corporate gold'," according to Mark Wilkinson, managing director of SAS UK & Ireland.

A recent study by Cebr found that companies in the UK are increasingly assigning a financial value to their data, with nearly three-quarters of business leaders seeing real benefits from using analytics to increase revenue, reduce costs and make decisions faster.

But ‘panning’ for that corporate gold requires gold standard skills. Research published by SAS UK & Ireland and the Tech Partnership revealed that by 2020, there will be 56,000 job opportunities a year for big data analysts. However, serious skills shortages are emerging with recruitment companies reporting that 77 percent of positions were either “very” or “fairly” difficult to fill. Tech Partnership Director, Karen Price, said, "Investment in education and training opportunities is vital to securing a strong talent pipeline for the digital economy."

To address this, we’ve launched a new “SAS Data Scientist Curriculum” to give both students and experienced data scientists the knowledge and skills to better prepare, analyse and extract value from big data.

What’s more, we’ve also been awarded a Gold Accreditation from the Learning and Performance Institute (LPI) for our education programme. More than 95 percent of customers rate our instructors, the courses and course administration as “excellent” or “good” and are willing to recommend these courses to others.

Solid gold proof that SAS Education can help you to take full advantage of the opportunities offered in the age of big data.




Millennials will outnumber Baby Boomers in 2015

To get into the mood for this blog post, you should first listen to the music video of The Who singing My Generation...

I guess everybody has 'their generation' and here in the U.S. the most famous generation has been the Baby Boomers. Many companies have tried to design products they think the Baby Boomers would like (such as the 1964 Ford Mustang), to capitalize on the similar interests and buying power of the boomers. But this year, for the first time (according to this article from Pew Research) another generation will become the most populous in the U.S. - the Millennials!

The Pew article had a nice graph that shows which years people from each generation were born in, and how many people were born each year (note that there's not 100% agreement on when each generation starts & stops, but we'll go with these numbers for now).  Here is the graph from their article:


Jedi SAS Tricks: DS2 & APIs - GET the data you are looking for

While perusing the SAS 9.4 DS2 documentation, I ran across the section on the HTTP package. This intrigued me because, as DS2 has no text file handling statements, I assumed all hope of leveraging Internet-based APIs was lost. But even a Jedi is wrong now and then! And what better API to test my API-wielding skills than the Star Wars API (SWAPI)?


Learn how to maximize your data with SAS and Hadoop

California or bust

Outside, the Cary, NC sky is gray and winds are blowing freezing rain, but a group of statisticians at SAS are channeling warm green hills and the soft, gold light of a California evening. Team conversations alternate between distributed processing, PROC IMSTAT and how many pairs of shorts to pack.

For the past several months, the Advanced Analytics training team here in Cary has been hard at work developing a course especially for the Strata + Hadoop World conference entitled Machine Learning and Exploratory Modeling with SAS® and Hadoop. I’m very excited about this unique course. It blends many topics, and focuses exclusively on enhancing and refining students’ analytic skills in Hadoop.

The course will be held in San Jose, CA Feb 17-18 and was created for analytic professionals who want to make the most of their big data with Hadoop and SAS by incorporating high-performance, machine learning algorithms with predictive modeling best practices.

On the first day, we’ll primarily spend time using SAS Visual Analytics and Visual Statistics to perform analyses using the point-and-click interface. Because there will always be a need to do more than you see in the GUI, the second day is devoted to using PROC IMSTAT and High-Performance procedures for predictive modeling and text analytics, and the RECOMMEND procedure to build a recommendation system.

Featuring this course at the Strata conference is the perfect fit and a great value for analytic professionals. Your course registration fee includes a 2-day Expo Hall pass. This gives you the opportunity to network with data science professionals from around the world, who are experienced in many different technologies.  Good news! We are offering a special 30 percent discount to SAS customers. To take advantage of the discount, register using the promo code SASML. I’d love to see you there.
