SAS takes your word for it

Last year, 50 companies came knocking on the doors of the Rotman School of Management in Canada to recruit Master of Business Administration (MBA) grads with SAS skills. Because of that demand, Rotman is now partnering with SAS to offer SAS Programming.

Last Sunday, I had the amazing opportunity to teach SAS Programming to 60 students at the university. The country’s future leaders, some of its brightest minds, gathered on a cold Sunday morning, even skipping the Santa Claus parade to learn SAS. You’re probably wondering, “Why on Earth did you pick Sunday to teach?” It was to work around their super busy schedules. All 60 students who showed up were fully committed: they asked many questions and were completely engaged.

A big question for them was how SAS behaves when you try to group sorted data. Take a look below:

1. We sorted the sales dataset BY Country (in the default ascending order) and, within that, BY Salary in descending order (the DESCENDING keyword guarantees that order).

19162  libname orion 'c:\workshop';
NOTE: Libref ORION was successfully assigned as follows:
      Engine:        V9
      Physical Name: c:\workshop
 
19163  proc sort data=orion.sales
19164             out=work.sales;
19165     by Country descending Salary;
19166  run;
 
NOTE: There were 165 observations read from the data set ORION.SALES.
NOTE: The data set WORK.SALES has 165 observations and 9 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.06 seconds
      cpu time            0.01 seconds

2. A quick peek at the sorted dataset:

SASdata

That looks about right. The data is in order BY Country (the primary sort key) and, within that, BY descending Salary.

3. The students then submitted this code to group the data BY Salary:

19167  proc print data=work.sales noobs;
19168      by salary;
19169  run;
 
ERROR: Data set WORK.SALES is not sorted in ascending sequence. The current BY
       group has Salary = 108255 and the next BY group has Salary = 87975.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.SALES.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.33 seconds
      cpu time            0.01 seconds

But why did SAS complain?

Because SAS assumes that you are telling the truth. SAS looks at the BY statement in your PROC PRINT step. Any time it sees a BY statement in a procedure (PROC) or a DATA step, SAS assumes the data is already sorted by that variable.

“Haven’t I sorted the data by salary?” asked one of the students.

Yes, you have, but Salary is not the primary sort key, and you have to respect that. Look at the data again. When SAS sees your PROC PRINT step with BY Salary, it expects the data to be sorted BY Salary in ascending order. But the data was not in that order: the earlier PROC SORT sorted it BY Country and, within that, BY descending Salary.

That was the mismatch. On one hand, SAS wants to believe you, so it tries to group the data BY Salary. But when it looks at the data, just as you did in the data grid, it sees that the first row has a Salary of 108255 and the next row has a Salary of 87975.

That is when it says, “Wait a minute.” The sales data is not sorted in ascending order BY Salary, so it sighs, throws up its hands and comes to a complete stop.

This is why it’s really critical to know your data before you work with it. Run a PROC CONTENTS first to check the sort order. Then respect the primary sort key: if you sorted your data BY Country, Gender, and descending Salary, try to repeat that order in your BY statement. At the very least, keep the leading keys; a BY statement of just Country, or of Country and Gender, is perfectly safe.
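Here’s a minimal sketch of those options, using the WORK.SALES dataset from above:

proc contents data=work.sales;        /* Option 1: check the Sort Information (Sortedby) */
run;                                  /* before you write any BY statement               */

proc print data=work.sales noobs;     /* Option 2: make the BY statement match the stored */
   by Country descending Salary;      /* sort order - same keys, same order               */
run;

proc sort data=work.sales             /* Option 3: if you really want groups BY Salary    */
          out=work.sales_by_salary;   /* alone, re-sort the data first                    */
   by Salary;
run;

proc print data=work.sales_by_salary noobs;
   by Salary;
run;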

Hope this helps you understand SAS a little bit better and the integrity with which it operates.

The students also asked for some yoga in between to refresh their minds. We moved in unison to wake up mind and body. I’ll leave you with a super brain yoga tip.

Try it anytime you feel an energy slump and want to wake up a tired brain. This is an amazing technique and, if practiced regularly, can get you into the flow experienced by top-performing athletes. Take my word for it!

Rotman_1


How a software geek prepares for the holidays

I'm not really into traveling and eating with family at Thanksgiving, but what are the local restaurant alternatives? The list was a bit overwhelming, so I used SAS to help analyze my options...

My holiday meal fall-back has always been the Waffle House - they're open 24/7, 365 days a year, and I like their kind of food (what can I say - I'm a "cheap date"!) But what if I want a more traditional Thanksgiving dinner?

I did a few Web searches and found some lists of restaurants that are open on Thanksgiving Day and offering a special/traditional menu. But I didn't recognize the names of many of the restaurants, nor their street addresses, so I didn't know what part of town they were in (I'm downtown-averse, for example) or how far they might be from my house. I thought about plugging each address into Google to plot a map and get a driving time estimate ... but that seemed like a lot of manual work. So, to solve this challenge, I turned to SAS software ...

I entered all the addresses into a SAS dataset, ran them through Proc Geocode (to estimate their latitude/longitude), and then plotted all the restaurants as markers on a street map. I set up each marker with html hover-text so I could easily see the name of each restaurant, and I set up html href drilldowns for each marker to launch a Google search (which would give me easy access to each restaurant's Web page, online review sites, etc).
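Here's a rough sketch of that first step. The restaurant data and libref names are made up for illustration (this isn't my actual program), and the street-level lookup data for Proc Geocode has to be downloaded and installed separately:

data work.restaurants;                        /* made-up example addresses */
   infile datalines dsd;
   length name $40 address $50 city $25 state $2;
   input name address city state zip;
datalines;
"Some Restaurant","123 Main St","Raleigh","NC",27601
"Another Eatery","456 Oak Ave","Cary","NC",27511
;
run;

proc geocode method=street                    /* estimate latitude/longitude */
   data=work.restaurants out=work.restaurants_geo
   lookupstreet=lookup.usm;                   /* assumed libref for the street lookup data */
run;

data work.markers;                            /* one annotate marker per restaurant */
   set work.restaurants_geo;                  /* Proc Geocode adds X/Y (longitude/latitude) */
   length function style color $8 html $250;
   xsys = '2'; ysys = '2'; hsys = '3'; when = 'a';
   function = 'pie'; rotate = 360;            /* a filled dot */
   style = 'psolid'; color = 'red'; size = 1.5;
   /* hover-text, plus an href drilldown that launches a Google search */
   html = 'title=' || quote(trim(name)) ||
          ' href=' || quote('http://www.google.com/search?q=' || urlencode(trim(name)));
run;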

Here's a thumbnail of the map - click it to see the full-size interactive version:

thanksgiving_meal_rtp

Here are some of the cool technical details:

I’m not really plotting the markers on a traditional SAS map (Proc Gmap), but rather I’m plotting them against a background of several map images (slippy map tiles). Also, I wanted my markers to show up brighter and the map to be more subdued, so I annotated an alpha-transparent white polygon on top of the map to dim it, and then plotted my markers on top of that. ODS HTML created the hotspots for each marker and set up an HTML page with my hover-text and drilldown tags. And now that my code is set up, I could easily swap out the data and re-use the code to plot something else!
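That "dim the map" trick is basically an annotated polygon covering the whole graph area. Here's a rough sketch of just that piece (in the real version the fill color is alpha-transparent rather than opaque white):

data work.dim_layer;
   length function style color $8;
   xsys = '3'; ysys = '3';               /* coordinates as percent of the graphics output area */
   when = 'a';                           /* draw after (on top of) the map background          */
   color = 'white'; style = 'solid';     /* the real version uses a semi-transparent fill      */
   function = 'poly';     x = 0;   y = 0;   output;
   function = 'polycont'; x = 100; y = 0;   output;
   function = 'polycont'; x = 100; y = 100; output;
   function = 'polycont'; x = 0;   y = 100; output;
run;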

Any other programmer geeks out there who use software to help prepare for the holidays? Feel free to share your geekiness in a comment! :)


We will remember them

Every year, on 11 November at 11 am – the eleventh hour of the eleventh day of the eleventh month – we pause to remember those men and women who have died or suffered in all wars, conflicts and peace operations. That is why 11 November is also known as Remembrance Day, a memorial day observed in Commonwealth of Nations (formerly known as the British Commonwealth) member states since the end of the First World War to remember the members of their armed forces who have died in the line of duty.

Remembrance Day has a special meaning for me as I grew up in Flanders Fields, the Belgian region where the First World War saw more than 500,000 soldiers killed. Every morning when I went to school, I passed underneath the Menin Gate Memorial to the Missing.

DSC02607 (2)

The Menin Gate Memorial to the Missing is a war memorial in Ypres, Belgium, which bears the names of more than 54,000 officers and men from United Kingdom and Commonwealth forces (except New Zealand and Newfoundland) who fell in the Ypres Salient before 16 August 1917 and who have no known grave.

Even though I was not yet a data scientist at the time, I have always wanted to know more about those engraved names. As the First World War started 100 years ago, in 1914, I thought it was time for an investigation, and I found what I had been looking for on the Commonwealth War Graves Commission website.

DSC02614 (3)

Not only did I find all the names of the casualties, I also learned about their country of origin, their date of death, their age and their rank in the army. I loaded these data into SAS Visual Analytics in order to quickly gain some insights.

We will remember... their nationalities

A simple pie chart shows that about 75% of the deceased came from the United Kingdom, about 10% were Canadian and 10% Australian, while about 1% of the engraved names on the Menin Gate are Indian and another 1% South African.

Graph_3

We will remember... their date of death

Secondly, I created a line chart with the date of death on the X-axis. What immediately struck me is the peak on 31/07/1917. Some research told me that the Battle of Passchendaele started on that day. It became infamous not only for the scale of casualties, but also for the mud.

VisualAnalytics_4
Another insight this chart gives us is that the British were present from the beginning of the war in 1914 until the end, while there seems to be a shift for the others: first the Indian troops, then the Canadian forces, followed by the Australians and finally the South Africans.

We will remember... their ranks

The fatalities on the Menin Gate are associated with 63 different ranks but five of them represent 93% of the total: Private (70%), Lance Corporal (8%), Rifleman (7%), Sergeant and Corporal (both 4%). In the bar chart below we see how the countries stack up.

VisualAnalytics_5

We will remember... their age

Although 19 was the minimum legal age for armed service overseas in the United Kingdom, many younger boys served their country in the First World War. When we look at the distribution of age, we clearly observe a heavily skewed distribution.

VisualAnalytics_6
Although about half of the age values are missing, the box plot below is a good indicator of the spread of ages among the different ranks. The youngest victims were the riflemen, with an average age of 25. The sergeants were the “oldest”, dying at the age of 28 on average. The other ranks (private, corporal and lance corporal) were on average 26 years old when they lost their lives on the battlefield.

VisualAnalytics_7

I would like to conclude this post with an extract from “For the Fallen”, a poem by Robert Laurence Binyon (1869-1943), published in The Times newspaper on 21 September 1914.

They shall grow not old, as we that are left grow old:
Age shall not weary them, nor the years condemn.
At the going down of the sun and in the morning
We will remember them.

DSC02608 (2)


Credit risk modeling: Remove the guess work

What's the probability that a firm will default on its debt? That’s the big question for many financial institutions. One way you can answer it is with credit risk modeling.

Starting today, we’re offering a new Business Knowledge Series course on that topic through our popular e-Learning format. That means you can take the course anywhere, anytime. (Like right now.)

The course, Credit Risk Modeling Using SAS, will help you learn how to develop credit risk models in the context of the recent Basel guidelines.

I caught up with one of the instructors, Bart Baesens, to find out more about the course, the benefits, and how it can solve real-world problems.

 

Interested? You can start the course today.


Hey! Where have you been?!?

There's recently been a "States I've Visited" application going around on Facebook, where users create a map showing all the US states they've visited, and then post it on their page for their friends to see. I wondered: could SAS do a better job?...

Here's a screen-capture of one of my friends' maps, created with the m.maploco.com application. It's a pretty simple map, using two colors, and from a visual perspective I have no complaints. It's a good map.

states_ive_visited

But when I thought about creating my own map, I felt a bit limited by the application. For example, there are certain states that I've visited, but only while driving through to get somewhere else. And I also wanted to provide some details about where I went in each state, or what I did there.

So I used SAS to create my own map, and do it the way I wanted. I made the states I had only driven through a lighter red, to distinguish them from states I'd actually spent quality time in. And I added html hover-text with more information about my 'visit' to each state. Below is a snapshot of my map - click it to see the full-size map with html hover-text:

states_visited
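If you just want the flavor of the approach before grabbing the full program, here's a minimal sketch (the states, values and hover-text are made up, and this is not the downloadable code):

data work.my_states;                          /* 1 = quality time, 2 = only drove through */
   infile datalines dsd;
   length st $2 note $60 myhtml $100;
   input st visit note;
   state  = stfips(st);                       /* numeric FIPS code, to match maps.us      */
   myhtml = 'title=' || quote(trim(note));    /* hover-text for each state                */
datalines;
NC,1,Home state - lots of quality time here
VA,2,Just drove through on the way to DC
;
run;

/* run with an HTML destination (ODS HTML) so the hover-text hotspots become active */
pattern1 value=msolid color=cxCC0000;         /* visit=1: full red    */
pattern2 value=msolid color=cxFFAAAA;         /* visit=2: lighter red */

proc gmap data=work.my_states map=maps.us all;
   id state;
   choro visit / discrete coutline=gray html=myhtml;
run;
quit;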

Feel free to download my code, and modify it to create your own map (and post it to your Facebook page, or wherever you want). Who knows, you might even want to add some changes and enhancements!

 


Are patent trolls finally on the decline?

Are you a legitimate, hard-working company that has been threatened with a lawsuit by a patent troll? If so, the graphs in this blog should make you happy!

Speaking of 'happy' and 'troll' - here's a picture of a happy Troll Doll from my friend Hannah. Don't you just hate how patent trolls give the word 'troll' a bad name? Hannah commented that, "Patent trolls could learn a thing or two from good luck Trolls," and I wholeheartedly agree! - hahaha!

troll_doll

In recent years, patent trolling has been very profitable. Perhaps that is one of the reasons that the number of patent legal-case filings had been increasing (in combination with other factors, such as the joinder provisions of the America Invents Act that block a patentee from filing suit against multiple unrelated parties in a single lawsuit).

But enough about the number of patent case filings going up - we're here to talk about them recently going down! I found an interesting article by Lex Machina that showed a graph of the number of patent case filings per month since 2011, and the graph showed that the number of filings had gone down considerably during the past few months. I decided to create a similar SAS graph, and I include it below (click the graph to see the full-size version, with html hover-text on the plot markers). Note that I suppressed the horizontal axis tick marks and values, and annotated the years along that axis instead (annotate is so handy that way!)

patent_case_filings
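For the curious, the shell of that line plot looks something like the sketch below. The dataset and variable names are made up, and the annotated year labels are left out to keep it short:

axis1 label=none major=none minor=none value=none;   /* suppress the x-axis ticks and values */
axis2 label=(angle=90 'Patent cases filed');
symbol1 interpol=join value=dot height=0.8 color=blue;

proc gplot data=work.patent_filings;                 /* assumed: filing_month (a date), cases */
   plot cases*filing_month / haxis=axis1 vaxis=axis2;
run;
quit;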

Although a line plot is a good way to show trends, I always like to plot data in several different ways to get a more complete mental picture. With quantities, I often like to use bar charts, because the heights of the bars give you a way to visually compare the quantities, and the 'area' of the bars gives you a more direct visual representation of the data. So here's a bar chart of the same data, grouped by year, and using the SAS default colors for the htmlblue style. Note that no annotation was required in the bar chart (the year labels are provided by the group axis).

patent_case_filings1
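The bar chart version is about this simple (again, made-up dataset and variable names; the year labels come for free from the group axis):

proc gchart data=work.patent_filings;                /* assumed: year, month, cases */
   vbar month / discrete sumvar=cases
                group=year                           /* one cluster of bars per year          */
                space=0 gspace=4;                    /* bars touch within a year, gap between */
run;
quit;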

 So, which do you prefer in this situation - the line chart or the bar chart, and why?


Why I love the Analytics conference series

Maggie Miller and I were chatting about the reasons for enjoying the Analytics 2014 conference in Las Vegas and I made a comment she thought was peculiar. I said that I liked the conference because I could talk about complicated things. She asked me to explain what I meant by that. I will do my best in this post to explain.

analytics

Here's me presenting about some "complicated things" at the Analytics conference.

I learned how to talk about complicated concepts in graduate school. I imagine many of you learned this skill in graduate school as well. I will never forget the first time I gave a seminar. WHAT A DISASTER!! I still recall my dissertation advisor saying, “Yeah that sucked.” So what happened? Why was my talk a disaster, and how did I fix it? Aside from adjusting to the structure of an economics seminar (a free-for-all), the single biggest change I made was that I learned to talk to my graduate school colleagues about my research. I realized that I had a collection of people in my inner circle who were all smarter than me. It provided me an opportunity to put my ideas to the test. And that is what I did. At each opportunity I talked about my “identification strategy” and potential “endogeneity” issues. And, guess what? The next seminar was a success and I’ve never looked back.

So how does this relate to Analytics 2014? Selling your analytic models isn’t about the quality of your estimates. It isn’t about the complexity of your models. It is entirely determined by your ability to explain your models to other people. Sometimes it is to people who know more statistics than you. Sometimes it is to smart people who don’t know statistics. Acquiring these skills is entirely the result of practice. And in my three years of attending conferences as part of SAS, I have found no better place to talk about complicated models with sophisticated modelers and smart people who don’t have as much statistical experience.

The Analytics conference gives us the opportunity to start a conversation about a complicated econometric or statistical model with people who know more than we do. We can start our conversation without extensive background on the assumptions of the model. The time with SAS employees at the exhibitor booths allows for interaction with developers of models who know the math but perhaps not the applications. These interactions allow us all to be more comfortable talking about complicated statistical models.

So, when Maggie asked me to explain what I meant, perhaps I was reflecting on what I learned early in my graduate school career about presenting complicated material. Practice helps. While it will never be easy to present complicated material, the Analytics Conference Series events provide an additional opportunity to stand and confidently talk about your work in analytics. We hope to see you next year in Las Vegas at Analytics 2015.


Dr. Mohammad Abbas, SAS UK & Ireland's top data scientist, forecasts the UK's energy consumption in 2020 using SAS

SAS UK & Ireland recently ran a competition to find the region’s ‘top data scientist’; the challenge was to produce a forecast of energy demand for the UK in the year 2020 based on the data provided. Competition for this coveted award was fierce, with the winner claiming a trip to SAS Global Forum in the USA and the chance to feature their submission on the SAS Professionals Network.

I recently caught up with Dr. Mohammad Abbas to discuss how he solved the challenge.

Phil:  Could you tell us a bit about your background?

Mohammad: I hold a Master’s degree in organic geochemistry and a Ph.D. in inorganic chemistry. While working in the public sector as a chemical analyst in an animal health laboratory, I developed a strong interest in how statistical applications and experimental design are used in animal health. I pursued this interest by gaining a Diploma in Statistics from the Open University, and I’ve since devoted considerable time to experimenting with analytics using data sets drawn from various disciplines.

Phil: Why did you choose to enter this competition?

Mohammad: Well, I saw the Top Data Scientist Competition as an opportunity to test drive my skills in Big Data Analytics.  Tackling a large analytical project in a predefined time scope was a worthy challenge. It offered me the opportunity to constantly re-evaluate my skills and identify ways to achieve a result.

Phil: The challenge was to forecast energy consumption in 2020; how did you go about tackling the problem?

Mohammad: Having spent some time examining the 47 or so datasets and doing some background reading on energy consumption, I was in a position to develop some approaches to tackling the problem. In essence, it consisted of three key phases: exploratory data analysis, identifying the key model parameters and then selecting a model.

Phil:  An interesting approach, could you tell me a bit more about each phase?

Mohammad: Generally, exploratory data analysis is by far the most important step in any analytical process, and I started by investing a significant amount of time in understanding and visualising the data. It was through this step that I was able to build data blocks and make logical connections between data objects.

Next, I needed to identify the key model parameters. With energy data, there are a lot of variables which can be used at a later stage in the modelling process. The task at this stage was to be able to ask questions of the data and subdivide those answers into clearly defined groups. For example, what impact do economic factors have on energy consumption? How should factors such as gross domestic product, housing, population and disposable income be taken into account? How was energy 'intensity' (that is, energy consumption across the domestic, industrial, transport and services sectors) calculated and presented in the data sets? What was the relationship between energy consumption in primary equivalents and final energy consumption?

Phil:  What do you mean by energy consumption in primary equivalents and final energy consumption?

Mohammad: By this I mean the difference between the amount of energy generated and the final amount consumed. Some energy is lost in the production and transmission of power; burning coal to generate electricity loses some of the coal's energy in the process, and further power is lost when that electricity is transmitted via pylons, for example.

I needed to answer all of these questions and more to choose the best variables. Based upon these findings, I subdivided the key parameters into three distinct groups:

  1. Economic factors and related energy consumption variables
  2. Energy intensities by sector (domestic, industry, transport and services)
  3. Energy consumption in primary equivalents and final energy consumption.

Phil:  OK, so how did you go about selecting the best model?

Mohammad: SAS offers a wide array of modelling procedures, and choosing which model to use depends upon a clear understanding of the analytical problem and how much you know about the various statistical modelling methods available. Of course, you also need solid diagnostic skills.

To meet the challenge, it was essential to reduce the number of variables analysed to as few as were relevant; this is known in statistical parlance as 'reducing dimensionality'. I also needed to take data quality into account, and standardisation was needed because some figures were expressed in thousands and others in millions. In addition, some energy consumption data was expressed in tonnes of oil equivalent and some in terawatt-hours, so these units had to be converted.

Phil:  How did you go about reducing the number of variables, the 'dimensionality' as it's called?

Mohammad: There are a number of ways to reduce dimensionality, one of which is a model that combines both dimensionality reduction techniques and regression models. You can use methods such as 'factor analysis' and 'principal component analysis', which can be applied individually to reduce dimensionality, or combine them with a regression model to obtain a powerful unified approach known as a 'partial least squares regression model'. Of course, SAS provides the ability to do all of this.
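For readers who want to experiment with this idea in SAS, a partial least squares regression can be fitted with PROC PLS. The sketch below uses made-up dataset and variable names; it is not Dr. Abbas's actual model:

proc pls data=work.energy method=pls cv=one;         /* leave-one-out cross validation */
   model final_consumption = gdp population housing
                             disposable_income mean_temp / solution;
   output out=work.pls_scores predicted=pred_consumption;
run;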

Phil:  So which fundamental questions were you trying to answer?

Mohammad: I was trying to address two key questions. Firstly, how much variation within the predictor variables (those variables which explain the values of other variables, sometimes known as independent variables) could I explain? For example, atmospheric temperature could explain energy consumption: as it gets colder, more people turn on their heating and hence use more power. Secondly, how much variability in the target variables could be explained by my choice of predictor variables? In other words, my target variables concerned energy consumption in 2020, so to what extent did the predictor variables I had chosen help to explain, and hence forecast, that?

Phil:  So what results came out of this process?

Mohammad: My dimensionality reduction techniques reduced the large number of variables to a handful of factors. Then the partial least squares model generated what are known as factor loadings, weights and scores, which helped me to explain how much each factor contributed to the final forecast and how accurate those forecasts would be. Also, examining the various models' outputs and their associated diagnostic plots helped me to shape the final prediction process.

Actual and predicted electricity demand in the UK

Obviously, trying to predict a single value (energy consumption in 2020) has a large amount of uncertainty associated with it. So I ran the model a number of times using different inputs. I tried broad economic factors, electricity consumption and energy intensity (consumption) for each specific economic sector, and finally I used randomisation as a means of assessing my model's ability to differentiate between irrelevant (noise) variables and those with real predictive power. This allowed me to forecast electricity consumption for the UK in 2020 with a difference of approximately 80 TW-h (terawatt-hours) between the highest and the lowest predicted values.

Phil:  Amazing, so what did you find out?

Mohammad: I predict that the overall demand for electricity in the UK in 2020 will be 527 TW-h (+/- 30 TW-h). This represents an increase of 14.6% relative to 2013. Given the potential growth in population, housing and usage of electrical devices in the UK over the next few years, I think this is pretty accurate.

Finally, I would like to say that I am delighted to have been named the first winner of this competition. For me, the most appealing aspect of this competition was the challenge of taming a large volume of data, drawing valuable insights from it and relating those findings to the real world we live in. This is what Big Data Analytics is all about.

UK firms are struggling to find the big data skills they need. Click here to read new research by SAS and the Tech Partnership highlighting the extent of the problem facing British businesses.



Graphs that make you go hmm... (early voting data)

Back in the 90s, there was a song by C+C Music Factory about things that just didn't quite make sense - the song was called Things That Make You Go Hmm.... And in that same spirit, this blog post is about Graphs That Make You Go Hmm...

I'm not really into politics, but I look forward to elections just so I can see what they do with all the data. Some of the graphs are good, some are bad, and some just make me go "Hmm... what were they thinking???". Here's an example of such a graph that appeared in our local newspaper:

early_voters1

At first glance, the graph seemed OK, and told me that as working-age voters get older they tend to do more early voting, and then after they retire they do less early voting (I presume that's partly because retired people don't need to worry about their work schedule any more, and partly because there are fewer older people still living).

But then I started looking at the finer details of the graph, such as the tick marks along the axes. Why did they choose 17 as the starting point for the age axis? You have to be 18 to vote. And the horizontal axis tick marks were spaced by 11 years ... until they reached age 61, then they incremented by 10 years, and then went back to 11 years. Why? And why didn't they choose some increment of 5,000 for the tick marks on the vertical axis? Why did they include a shadow to the right of the bars? The shadow makes the entire graph look visually biased towards the right. The hover-text for the bars shows values such as 'Count: 36800' - why not format the number with a comma like the axis values, maybe something like 'Voters: 36,800'? Hmm...

But the thing that really just made me SMH - when I re-sized my browser window, the vertical axis got re-drawn with a scale way taller than needed (making the bars very short), and the most significant digit of the highest value on the axis got chopped off (at first glance, I thought it was 30,000 instead of 100,000). I guess whatever dynamic/interactive software they're using to draw the graph is just too fancy. Hmm...

early_voters2

So, I decided to create my own SAS version of this graph, and try to avoid all the problems shown above. Here's what I came up with. Can you name all the improvements? What else would you change?

nc_early_voters_2014
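If you want to build something similar with your own county's data, the core of my version is little more than a bar chart with sensible axes. Here's a minimal sketch with made-up dataset and variable names (not my exact program):

axis1 label=('Number of early voters') order=(0 to 40000 by 5000) minor=none;

proc gchart data=work.early_voters;        /* assumed: age (18 and up), voters */
   format voters comma8.;                  /* comma-formatted counts           */
   vbar age / discrete sumvar=voters
              raxis=axis1
              coutline=gray;               /* flat bars, no drop shadow        */
run;
quit;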


Sports analytics - visualize your results!

In sports these days, there's a lot more data to keep track of than just the score! How can you make sense of it all? Being the Graph Guy, of course I recommend graphing it!

Here's an example that's up close and personal for me - dragon boat racing... Below is a cool head-on picture of our team paddling a dragon boat back to the dock after a race (note that the boat is over 40 feet long, and there are 22 people in it). We are Raleigh Relentless!

dragonboat_head_on

When our team goes to a race it's an all-day event, with each team racing in several heats against various combinations of the other teams. In each heat, you can easily see how you did against the other boats in that heat, but what's more interesting and important is how your team is performing compared to all the other teams across the entire day. The times are available in tabular form, but I created the following chart, which I think provides a much better way to make sense of the data:

dragonboat_richmond_races_2014
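A chart like this is basically one line per team across the day's heats. Here's a minimal sketch with made-up dataset and variable names (not the code behind the chart above):

symbol1 interpol=join value=dot;
axis1 label=('Race heat') minor=none;
axis2 label=(angle=90 'Time (seconds)') minor=none;

proc gplot data=work.race_times;           /* assumed: team, heat, seconds */
   plot seconds*heat=team / haxis=axis1 vaxis=axis2 legend;
run;
quit;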

A nice visual analysis of sports data can help capture the fans' attention, help teams know which strategies work better than others, or even help with logistics and seating charts in the stadiums.  I've prepared a collection of examples to show you a few of the ways SAS can be used to help visualize sports-related analytics - hopefully you can reuse some of these examples with your own data. Click the screen-capture below to see the samples page:

sports_analytics

 So, what's the most interesting sport to you, from a data perspective? How do you analyze (or wish you could analyze) the data from this sport?
