Tracking the increase in marijuana's THC content

Since Colorado legalized recreational marijuana use in 2012, the drug has been a much more frequent news topic than before - even from a data analysis perspective...

I was recently looking for 'interesting' data to analyze with SAS, and I noticed some articles about the increasing potency of marijuana in recent years. I followed the data 'upstream' and found an interesting report from the Drug Enforcement Administration (DEA). And on p. 27 they showed the following graph:


Their graph tells an interesting story about how the amount of THC in marijuana has more than doubled in recent years. But the graph is somewhat painful to look at, and difficult to read. Here are a few of the problems that jump out at me:

  • It's difficult to know exactly what point along the line the pointlabels refer to.
  • There are 27 %-signs in the graph, which seems an excessive use of ink & space.
  • The y-axis needlessly shows 2 decimal places.
  • The x-axis has staggered year labels.
  • The year labels are staggered in the opposite up/down direction from the line's pointlabels.
  • The graph doesn't mention marijuana (you have to read the article to intuit that).

Well, of course it might be considered rude to point out flaws in a graph, without going to the effort to produce an improved version ... So here's my SAS version! I think it's a lot cleaner, and easier to read.


Which of my changes do you like, and which do you not like? What other changes would you recommend?



Visualizing the eradication of smallpox

Smallpox was declared eradicated in 1979, after an extensive vaccination campaign in the 19th and 20th centuries. This blog post contains a visual analysis of the final years of this disease in the US ...

In my previous blog post, I imitated and improved infectious disease graphs from a recent Wall Street Journal article. I focused mainly on measles in that post - I now focus on smallpox. Here's my calendar chart of the smallpox data from the Tycho website.


6 questions with forecasting expert Charlie Chase

Charlie Chase


Charlie Chase is considered an expert in sales forecasting, market response modeling, econometrics and supply chain management. Now he's sharing some of his expertise in his Business Knowledge Series (BKS) course, Best Practices in Demand-Driven Forecasting. I had the chance to ask him some questions about his course and the state of the forecasting industry.

  1. What do you think has been the biggest advancement in forecasting over the last 10 years?

[CC]: Data collection, storage and processing capabilities, along with large-scale automatic forecasting technology that can automatically forecast up/down a business hierarchy for hundreds of thousands of products.

  2. If you could forecast (so to speak) how you think forecasting will evolve over the next 10 years, what do you predict will change?

[CC]: Predictive analytics will take center stage supporting demand sensing and shaping utilizing both structured and unstructured data. A new position entitled “demand analyst” will supplement demand planners with analytics and will become standard practice across all industries. Companies will create analytics centers of excellence supporting not only demand forecasting and planning, but all facets of the company’s analytical needs.  Multi-Tiered Causal Analysis (MTCA) will be common practice for those companies who have access to POS/Syndicated Scanner data to improve forecast accuracy.

  3. What’s the biggest mistake forecasters make and how can they fix it and learn from it?

[CC]: Business knowledge alone is not enough to become a good forecaster. Forecasting requires two key things: 1) analytics, and 2) domain knowledge, not “gut feeling” judgment. Forecasters need to supplement their business knowledge with analytics by taking classes at local universities, attending business forecasting workshops (SAS BKS Workshops), attending business forecasting conferences, and getting certified as a “Certified Professional Forecaster” through the Institute of Business Forecasting.

  4. What’s the best advice you can give forecasters?

[CC]: Continue to develop your skills, knowledge, and domain experience. This also includes developing your communication skills and span of knowledge across the supply chain, which includes the commercial side (sales and marketing) of the business.  The future of demand management will be the ability to support sales and marketing with analytics to supplement and enhance the demand-driven forecasting and planning process.

  5. You created a new Business Knowledge Series course, Best Practices in Demand-Driven Forecasting. Why did you create the course and who can benefit from taking it?

[CC]: I created this BKS course to share my knowledge of demand-driven forecasting best practices based on my past experiences, and provide practitioners with a framework to implement a demand-driven forecasting process. Most forecasting courses only focus on algorithms and proofs with little attention to applying analytics and domain knowledge.  This BKS course focuses on applying, interpreting, and implementing statistical methods using domain knowledge.

It's designed for demand forecasting analysts/planners, demand forecasting and planning directors/managers, marketing analysts/planners/managers/directors, and supply chain analysts/planners/managers/directors, as well as financial planners/managers.

  6. Can you share any tips from the course?

[CC]: The course focuses on the demand-driven process, analytics, and enabling technology with emphasis on applying different statistical methods, interpreting the results, and applying the appropriate methods that will give the best results. There will be no programming with code.

Learn more and sign up for the course - Best Practices in Demand-Driven Forecasting


How to make infectious diseases look better

The Wall Street Journal recently published some graphs about seven infectious diseases, and I tried using SAS to improve the graphs ... it's a veritable infectious disease (graph) bake-off!

Let's start with Measles ... here's a screen-capture of WSJ's measles graph:


Their graph is eye-catching, and I learned a lot (in general) about the data by looking at it. But upon studying the graph a little more closely, I noticed several problems:

  • It is difficult to distinguish zero values (very light blue) from missing-data (light gray).
  • There was not enough room for all the state values along the left edge, so they just left out about 1/2 of the state labels.
  • When you hover your mouse over the colored blocks to see the hover-text, the box turns light blue - which could mislead you to think that is the data-color of the box.
  • Although it is explained in the introductory paragraph, the graphs themselves don't mention that the unit of measurement is cases per 100k people per year.
  • They don't keep their graphical unit polygons square, but rather stretch them out to fill the entire page - this makes it difficult to know how many years the graph spans.
  • It is sometimes difficult to quickly determine which color represents more/fewer disease cases, using the semi-rainbow color scale (especially in the yellow/green/blue end of the scale).

Therefore I set about creating my own SAS version of the graphic, to see if I could do a better job. I located the data on the Tycho website, downloaded the csv files, and imported them into SAS datasets. I then created a rectangular polygon for each block in the graphic, so I could plot them using Proc Gmap. I assigned custom color bins, so I could control exactly which ranges of values were mapped to which gradient shade of my color (I used shades of a single color, rather than multiple/rainbow colors), and used a hash-pattern for the missing values. Here's my measles graph:


In addition to the visual aspects of the graph, I also made a slight change to the way the data is summarized. The Tycho data was provided as weekly number of cases per 100k people, and (it appears) WSJ summed those weekly numbers to get the annual number they plotted. But the data contains a lot of 'missing' values, and the Tycho faq page specifically mentions that 'missing' values are different from a value of zero ...

"The '-' value indicates that there is no data for that particular week, disease, and location. The '0' value indicates a report of zero cases or deaths for that particular week, disease, and location."

If you have 11 weeks with 'missing' data (such as Alabama in 1932), and you simply sum the other 41 weeks that do have data, and call that the annual rate ... I'm thinking that probably under-reports the true annual value somewhat(?) Therefore, in hopes of getting a more valid value to plot, I calculate the weekly average (rather than the yearly sum).
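To make the difference concrete, here is a small Python sketch (not the SAS code behind the graphs). It uses made-up weekly rates for one state-year with 41 reported weeks and 11 missing weeks, as in the Alabama 1932 example. The function names and the numbers are assumptions for illustration only.

```python
from typing import Optional, Sequence


def annual_sum(weekly: Sequence[Optional[float]]) -> float:
    """Sum the reported weeks, which silently treats missing weeks as zero."""
    return sum(v for v in weekly if v is not None)


def weekly_average(weekly: Sequence[Optional[float]]) -> Optional[float]:
    """Average over the reported weeks only; None if no week was reported."""
    reported = [v for v in weekly if v is not None]
    if not reported:
        return None
    return sum(reported) / len(reported)


# Hypothetical year: 41 weeks reporting 2.0 cases per 100k, 11 weeks missing ('-').
year = [2.0] * 41 + [None] * 11

print(annual_sum(year))      # 82.0
print(weekly_average(year))  # 2.0
```

If all 52 weeks had been reported at that same rate, the annual sum would have been 104.0, so the sum over 41 weeks understates it by about 21%. The weekly average is unaffected, as long as the missing weeks resemble the reported ones.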

Here are links to my plots for all 7 diseases:  Measles, Hepatitis A, Mumps, Pertussis, Polio, Rubella, and Smallpox.

So, now for the big question ... have you ever had any of these diseases?



Have a traditional SAS/Graph Valentine's Day!

Nobody puts an arrow through a heart any better than Sam Cooke & Cupid ... but SAS/Graph comes close!

If you've been following my blog, you know that my favorite of all the SAS Procedures are the traditional SAS/Graph Procs, such as GPlot and GMap. They're rock-solid reliable, and flexible enough that you can create just about any graphic visualization that you can imagine.

Therefore I've created a special Valentine's example, of a traditional heart, using traditional SAS/Graph procs - hopefully SAS/Graph has not only put an arrow through these hearts, but one through yours as well!





And for a special Valentine's treat, click either heart above to go to the interactive version, and then click the red heart to see other SAS Valentine's blogs.  Also, here's the SAS code if you'd like to see how this example was created.



Everything’s bigger in Texas, so … SAS Education offers HUGE training savings!

SAS Global Forum brings together the most die-hard SAS users, both veteran and novice, once a year. It’s one of those can’t-miss events, and each year it just gets better.

2015 will bring us all together in Dallas, Texas for several days of active learning and excitement from SAS users and subject matter experts. But the learning doesn’t end on April 29 (or start the 26th, for that matter).

SAS Education is once again bringing training to the table. For the conference event, we’re offering training in our Dallas training center, an easy trip from downtown, at a 40% discount for SAS Global Forum attendees. Choose from SAS Programming 3 before the conference and SAS Macro Language 2 after.

Post-conference training will be held at the Kay Bailey Hutchison Convention Center in the heart of downtown Dallas. Texas does big – but these savings are HUGE! Check out the courses being offered at $399 a day … and then register. These seats will fill fast.

April 30 – May 1

  • Developing Custom Tasks for SAS® Enterprise Guide®
  • DS2 Programming: Essentials
  • Introduction to SAS® and Hadoop
  • SAS® Visual Statistics: Interactive Model Building

April 30

  • SAS® Visual Analytics: Getting Started
  • Working with Process Jobs

Ready to get certified? This is a great time to really bring it home – we’re offering a Certification event on both Saturday and Sunday prior to the SAS Global Forum festivities. All exams are being offered at 1 p.m. Saturday, and most are also offered at 9 a.m. Sunday morning.

So go BIG before you go home, and take some large SAS knowledge back with you!


SAS Certification hits 75k


Susan Langan, SAS Certified Professional

We’re all about numbers here at SAS. So when the Global Certification program hit its 75,000th credential – we had to make it a big deal.

We tracked the 75,000th credential to Susan Langan, a research analyst in Maryland. Even more special than Langan’s title as the 75,000th credential holder is how she’s using SAS to make the world a better place.

Langan analyzes HIV research at the Johns Hopkins Bloomberg School of Public Health. The information she discovers could save lives or even lead to a cure for the disease.

During our phone conversation, she told me how she started working with SAS about 15 years ago. Over time, she started using different capabilities of SAS in her job, so she decided to get certified as a Base Programmer. She had no idea it would lead to such a light-bulb moment.

“During the process of studying for the exam, I started to understand all of the programming stages like I never had before,” said Langan. “I learned how the data are read into SAS, gets processed through the program data vector and finally written to the new data set. After years of concentrating on code writing, it has been refreshing to now really know the behind-the-scenes process. This knowledge helps quickly pinpoint any problems encountered, and allows for quickly fixing and not repeating that error.”

With a full-time job and a six-year-old daughter, Langan is like many other professionals I’ve interviewed who must prepare for a challenging exam while juggling a busy life. She devoted several weeks to intensively studying the SAS Certification Prep Guide and taking all of the quizzes and practice exams. “I think it’s helpful to be able to go through the prep guide and practice tests,” said Langan. “Preparing with SAS allowed me to work on catching errors in programs written by others – a nice perspective, since you don’t always get the luxury of writing all the programs on a project yourself. The exam was a good mix of memorization and applying my experiences with programming in SAS.”

Langan qualified for the academic discount, which gave her 50 percent off the exam. That discount is available to most employees, educators, students and staff in the field of academia.

Geographical hotspot visualization of election data

I'm ramping up my visualization skills in preparation for the next big election, and I invite you to do the same! Let's start by plotting some county-level election data on a map...

To get you into the spirit of elections, here's a picture of my friend Sara's dad, when he was running for office. Does this say Classic (Southern) American Politician, or what?!? You can't help but like this candidate! :-)


I'm not a very 'political' person, but as a huge 'data' person I'm drawn to the visualizations of the election results. Especially the maps. Therefore when I found some interesting election maps on John Mack's page, I decided to try my hand at creating some similar maps with SAS.

This first map shows the presidential election results, by county. There's not a color legend, but one would assume the red counties were won by the Republican candidate, and the blue counties were won by the Democratic candidate - the darker the color, the larger the margin of the win. Reading the text on the page, it appears they used the log-ratio as the variable to plot. Although that is an easy way to plot the data, most people would have a difficult time relating to log-ratio numbers (which probably explains why the map has no color-legend).


Creating the same map using SAS would not be difficult - just calculate the log-ratio with 1 line in a data step, assign 6 gradient colors in pattern statements, and tell Proc Gmap to use levels=6. But I decided to take a different approach. I calculated the % votes for McCain, and the % votes for Obama, and then used if-statements to put each county into a 'bucket' based on those values. I then explained in the color legend exactly which % values were in each bucket. This way people viewing the map know exactly what each color represents. I added state outlines to the county map, which I think is very useful. I also added html hover-text for each county, if you click to see the interactive version.
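The two approaches can be sketched side by side. This is a Python illustration, not the SAS data step used for the map. The bucket cut points (55% and 65%), the two-party vote share, the direction of the ratio, and the function names are all assumptions made for the example.

```python
import math


def log_ratio(votes_rep: int, votes_dem: int) -> float:
    """Natural log of the Republican/Democratic vote ratio (0.0 means a tie).

    Assumes both counts are positive; the ratio's direction is an assumption.
    """
    return math.log(votes_rep / votes_dem)


def margin_bucket(votes_rep: int, votes_dem: int) -> str:
    """Put a county into a labeled bucket based on the winner's vote share.

    The cut points (55% and 65%) are hypothetical, not the ones used on the map.
    """
    total = votes_rep + votes_dem
    if total == 0:
        return "No votes"
    if votes_rep == votes_dem:
        return "Tie"
    pct_rep = 100.0 * votes_rep / total
    pct_dem = 100.0 * votes_dem / total
    winner, pct = ("McCain", pct_rep) if pct_rep > pct_dem else ("Obama", pct_dem)
    if pct >= 65:
        return f"{winner} 65%+"
    elif pct >= 55:
        return f"{winner} 55-65%"
    else:
        return f"{winner} <55%"


# Hypothetical county vote counts.
print(margin_bucket(7000, 3000))  # McCain 65%+
print(margin_bucket(4000, 6000))  # Obama 55-65%
print(margin_bucket(5100, 4900))  # McCain <55%
```

The bucket labels can go straight into the color legend, which is the point of the approach: a reader can relate to "55-65%" much more easily than to a log-ratio value.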

Solving Sudoku with SAS/IML – Part 2

Figure 1

Part 1 of this topic presented a simple Sudoku solver. By treating Sudoku as an exact cover problem, the algorithm efficiently found solutions to simple Sudoku problems using basic logic. Unfortunately, the simple solver fails when presented with more difficult Sudoku problems. The puzzle on the right was obtained from Sudoku Garden and is reproduced here in its original form under the Creative Commons Attribution license. The basic solver manages to solve the majority of the puzzle before giving up. This post demonstrates how the entire puzzle can be solved through a combination of the simple solver from Part 1 and an adaptation of a backtracking algorithm called Algorithm X. The solver can also solve the world’s hardest Sudoku puzzles, but those puzzles are not reproduced here due to copyright.

Adaptation of Algorithm X

Figure 2

Algorithm X is an efficient backtracking algorithm for solving exact cover problems (Knuth, 2009). The algorithm forms a search tree with columns of the exact cover matrix forming the nodes of the tree and rows of the exact cover matrix forming the branches of the tree. The algorithm navigates the search tree until a solution is found, and then terminates. The adaptation of Algorithm X used in this post is a bit convoluted, so it is only described in the context of the example puzzle above.
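For readers who have not seen it before, here is a minimal Python sketch of the plain, textbook form of Algorithm X on a tiny exact cover problem. It is not the adapted SAS/IML version used in this post, and it uses dictionaries of sets rather than a matrix. The function names and the toy problem (seven columns, six candidate rows) are assumptions chosen for illustration.

```python
from typing import Dict, Iterator, List, Set


def exact_cover(universe: List[int], rows: Dict[str, List[int]]) -> List[List[str]]:
    """Return every set of row names that covers each column exactly once."""
    # X maps each column to the set of rows that cover it.
    X: Dict[int, Set[str]] = {col: set() for col in universe}
    for name, cols in rows.items():
        for col in cols:
            X[col].add(name)
    return list(_search(X, rows, []))


def _search(X, Y, partial: List[str]) -> Iterator[List[str]]:
    if not X:  # no columns left to cover: partial is a solution
        yield list(partial)
        return
    # Node of the search tree: the column with the fewest candidate rows.
    col = min(X, key=lambda c: len(X[c]))
    for row in sorted(X[col]):  # branches: the rows that cover this column
        partial.append(row)
        removed = _select(X, Y, row)
        yield from _search(X, Y, partial)
        _deselect(X, Y, row, removed)  # backtrack
        partial.pop()


def _select(X, Y, row: str) -> List[Set[str]]:
    """Remove the columns covered by row, and every row that conflicts with it."""
    removed = []
    for j in Y[row]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)
        removed.append(X.pop(j))
    return removed


def _deselect(X, Y, row: str, removed: List[Set[str]]) -> None:
    """Undo _select in reverse order."""
    for j in reversed(Y[row]):
        X[j] = removed.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)


# Toy problem: cover the columns 1..7 exactly once.
rows = {
    "A": [1, 4, 7],
    "B": [1, 4],
    "C": [4, 5, 7],
    "D": [3, 5, 6],
    "E": [2, 3, 6, 7],
    "F": [2, 7],
}
print(exact_cover([1, 2, 3, 4, 5, 6, 7], rows))  # [['B', 'D', 'F']]
```

In a Sudoku setting, each row would be one candidate placement (digit, row, column) and each column one constraint. Unlike the solver described in this post, the sketch enumerates all solutions rather than stopping at the first one.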

Running the basic solver on the puzzle at the beginning of this post yields the grid on the right. Having failed to obtain a solution using the basic solver, the IML program then calls my modified version of Algorithm X. The search tree formed by the algorithm is shown below. Note that this tree is a graphical representation of the process taken by the algorithm, and is not generated by the IML code.


Figure 3

Which North Carolina state park is trending?

North Carolina is one of those lucky states that has a huge variety of scenic destinations, such as mountains, piedmont, coastal plains, beaches, and 'outer banks' islands. We have state parks in all of these areas, but can you guess which state park has been trending the most during the past 10 years?

If you guessed the one that is right here within sight of the SAS headquarters in Cary, you guessed right! Umstead park is a 5,579-acre forest (just across the road from SAS), with a few hills/rocks/streams, and lots of hiking trails. The park attendance has grown from ~500k in 2004, to over 1.25 million in 2014.

Here's a picture of a hiking trail that my friend Jennifer took. It's very similar to the trails in Umstead park, but this one is actually in the Occoneechee/Eno River park, on the other side of the RTP:


You might be wondering what data I'm using to determine which park is trending. Well that was a bit of a challenge ... Each year, North Carolina State Parks publishes the annual attendance totals, but only as a jpg image of a table. Here's a screen-capture of a portion of the 2014 table:


These annual tables were not stored in one central location, therefore I had to do several web searches to find all the tables from previous years. I then visually read the numbers from the jpg images, and manually entered each value as text into a file that I could import into SAS (hopefully not making any typos!) To help error-check my SAS dataset, I calculated the grand total for each year, and compared it to the annual totals at the bottom of the jpg image tables ... and yeah, I found that I had made a few typos entering the data by hand. I fixed my typos, and then was ready to plot the data!
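The error check itself is simple enough to sketch in a few lines. This is Python rather than the SAS code I used, and the park names, the numbers, and the function name are all made up for illustration.

```python
from typing import Dict, List


def find_total_mismatches(
    attendance: Dict[int, Dict[str, int]],
    published_totals: Dict[int, int],
) -> List[int]:
    """Return the years whose hand-entered park values do not sum to the published total."""
    return [
        year
        for year, parks in sorted(attendance.items())
        if sum(parks.values()) != published_totals[year]
    ]


# Hypothetical hand-entered values (2014 "Park B" has two digits transposed).
entered = {
    2013: {"Park A": 1200, "Park B": 3400},
    2014: {"Park A": 1250, "Park B": 3540},
}
# Hypothetical grand totals read from the bottom of each table.
published = {2013: 4600, 2014: 4700}

print(find_total_mismatches(entered, published))  # [2014]
```

A check like this only tells you which year contains a typo, not which park, and it will miss two typos that happen to cancel out. It still narrows down the search considerably.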

I started with something simple - a bar chart of the current year's data. I wanted to be able to easily relate the data in this chart to my other charts, so I added a bit of color-coding. I made Umstead park red, and since there seemed to be a 'natural divide' in the data, I also shaded the other higher-attendance parks darker than the lower-attendance ones:


Now that we've seen the data for 2014, how about a similar plot for all the years? I used a stacked bar chart, and color-coded it similarly to the single-year bar chart. This shows that total park attendance was increasing from 2004 to 2007, and then took a dip in 2008 (maybe because of the 'great recession'?). Attendance leveled off from 2009 to 2013, and then increased again in 2014.


The stacked bar chart 'hints' that Umstead's attendance was generally increasing, but it's difficult to compare it to the other parks. So let's also plot the data in a line chart, which will make it easier to compare all the parks:


Once again, I made the low-attendance parks light gray, and Umstead red. But it was difficult to follow the lines for the other high-attendance parks when they were all dark gray (because there are several places where the lines intersect), so I used a different color for each of them. From this plot, it is easy to see (at least for the high-attendance parks) that Umstead's attendance is definitely trending upward, faster than the other parks.

How high will Umstead's attendance go? Well, we could make a forecast based on the past attendance values and the current trend ... but that doesn't take into account other things, such as limiting factors. For example the parking lots are almost at capacity these days when I visit the park, therefore maybe the park is nearing its capacity (unless they build additional parking lots and/or entrances)? I guess time (and more data) will tell!

What's your favorite state (or other) park? Do you prefer parks with a lot of people, or fewer people?
