Missing unicorns - 10 tips on finding data scientists (Part 1)

As this article on the mythical data scientist describes, many people call this special kind of analytical talent "unicorns," because the breed can be so hard to find. To help close the analytical talent gap that McKinsey Global Institute and others have predicted, and that many of you experience today, SAS launched SAS Analytics U in March of this year to feed the pipeline of analytical talent. This higher education initiative aims to help address the skills gap by offering free versions of SAS software, university partnerships, and more. Yes, I did say free, and the free SAS® University Edition even runs on Mac, in addition to PC and Linux! Meanwhile, since data scientists can be hard to find, I'll share ten tips to use in your hunt, illustrated with examples from some of our own legendary unicorns at SAS.

Five of the tips relate to academic recruiting:

1.  Hire from an MS in Analytics program
2.  Hire from a great program you’ve never heard of
3.  Recruit from untraditional disciplines
4.  Look beyond STEM - recruit from social sciences
5.  Try before you buy – create an intern program

 

Five more relate to other best practices:

6.  Invest in sponsorship for foreign nationals
7.  Use social networks to hire friends of unicorns
8.  Hire the curious who want to solve problems
9.  Think about what kind of data scientist you need
10.  Don’t expect unicorns to grow horns overnight

 

Each tip is worth expansion, so I'll share two in this post and more in subsequent posts.


Patrick Hall, MS in Analytics, NC State University

1. Hire from an MS in Analytics program

SAS proudly helped launch and continues to support the Institute for Advanced Analytics at NC State University, led by Dr. Michael Rappa. It is the granddaddy of them all, for good reason: over 90% of its graduates have offers by graduation, because in the intensive 10-month program they receive not only an outstanding academic foundation but also targeted attention to the "softer" skills, like public speaking, teamwork, and business problem identification and formulation, that are so essential to the practice of analytics. Patrick Hall, pictured here while getting his MS in Analytics from this program, is one of the machine learning experts on the SAS® Enterprise Miner™ R&D team and even a certified data scientist, being one of the few to pass the rigorous Cloudera Certified Professional: Data Scientist (CCP:DS) exam. SAS works with scores of these rapidly multiplying programs, and they can be great places to recruit graduates with training in analytics and experience using SAS software.

 


Dr. C (far left), Murali Pagolu (fifth from left) and Satish Garla (sixth from left), both MS in Management of Information Systems/Analytics, Oklahoma State University

2. Hire from a great program you’ve never heard of

In addition to the many well-known programs, there are some great ones that you might not have heard of, like the one at Oklahoma State University (OSU) run by Dr. Goutam Chakraborty (just call him Dr. C.), who has graduated 700+ unicorns in the last decade. Designed to recognize students with advanced knowledge of SAS, these joint certificate programs supported by the SAS Global Academic Program require students to complete a minimum number of credit hours in relevant courses. Murali Pagolu and Satish Garla both received an MS in Management Information Systems/Analytics from this program and are pictured here as winners in the 2011 SAS Analytics Shootout, held annually at our Analytics Conference. Murali and Satish work in our Professional Services Division, helping customers implement SAS software and get their analytical models in place. They are just two of the many OSU graduates who have won countless awards. An executive at a large Midwestern manufacturer recently told me that he had to persuade his Human Resources Department to send a recruiting team to Stillwater, Oklahoma, but it paid off – they found two of their own unicorns there. Or convince HR to visit Kennesaw, Georgia, home of Dr. Jennifer Priestley’s program run out of the statistics department at Kennesaw State University, which was recently cited by Computerworld as having the most innovative academic program in Big Data Analytics. There are many more programs like these around the country where you can recruit, so don't limit yourself to universities with which you are familiar.

I'll explain more tips and show more unicorns in future posts, but if you're attending the Analytics 2014 conference in Las Vegas October 20-21, there will be a virtual herd of SAS unicorns galloping around! I'll be giving a demo theater presentation on analytical talent where I'll share all ten of my tips for finding it. Stop me and say hi if you’re there – I always like meeting unicorns and could introduce you to the many others we'll have on hand. Many of the people mentioned in this post are on the great conference agenda and will be presenting:

  • Dr. Michael Rappa, who leads the Institute of Advanced Analytics at NC State University, will give a keynote session on "Solving the Analytics Talent Gap."
  • Patrick Hall, SAS unicorn and one of Rappa's former students, will give a presentation on "An Overview of Machine Learning with SAS® Enterprise Miner™" and a super demo on "R integration in SAS® Enterprise Miner™."
  • Murali Pagolu, SAS unicorn and OSU graduate, will present with his former professor, Dr. C, on "Unstructured Data Analysis: Real World Applications and Business Case Studies." Dr. C will bring 35 of his current and former students to the conference and has two teams that are finalists, plus another that earned Honorable Mention, in the 2014 Analytics Shootout.
  • Dr. Jennifer Priestley of Kennesaw State University will talk about "What You Don't Know About Education and Data Science."
  • Stop by the Networking Hall to visit booths on SAS Analytics U, the Global Academic Program, and programs from NCSU, OSU, and Kennesaw State University, as well as many other academic sponsors who run great programs you should add to your recruiting list.

 

(Unicorn poster image credit: photo by Arvind Grover, used under a Creative Commons attribution license. Other photos courtesy of the unicorn pictured.)


How discrete-event simulation can help project prison populations

In 2011, the passage of the federal Justice Reinvestment Act (JRA) brought significant changes to North Carolina’s criminal sentencing practices, particularly in relation to the supervision of offenders released into the community on probation or post-release supervision. A recent New York Times article highlighted how NC has used the JRA to implement cost-saving strategies. Each year the NC Sentencing Commission prepares prison projections that are used by the state Department of Public Safety and the NC General Assembly to help determine correctional resource needs for adult offenders, but the changes resulting from the JRA threw a huge kink into the long-established process used to generate those projections. Discrete-event simulation software from SAS helped smooth out the kinks.

The NC Sentencing Commission had been using a simulation model written in C-based code to project NC’s prison population for more than twenty years. The changes imposed by the JRA required new functionality not available in the existing simulation. The Administrative Office of the Courts (AOC) ultimately contracted with SAS to develop a more flexible and transparent prison population projection model using discrete-event simulation.

Traditional time series methods are ineffective for prison population projections because of dynamic factors like sentence length, prior criminal history, revocations of community supervision, and legislative changes. As an alternative, the SAS Advanced Analytics and Optimization Services Group (AAOS) used SAS® Simulation Studio to build a discrete-event simulation model that approximates the journeys of offenders through the criminal justice system. In general, discrete-event simulation is used to model systems where the state of the model is dynamic and changes in the state (called events) occur only at countable, distinct points in time. Examples of events in the prison model include an arrival of an offender in prison or a probation violation.

Process flowcharts provided the framework for the Simulation Studio prison projection model (for those interested in more detail, these flowcharts can be found in a more extended paper on this model presented at SAS Global Forum in 2013). Even though most of the JRA provisions went into effect on December 1, 2011, there will be a period of time for which portions of the JRA do not apply to certain offenders. As a result, the new simulation model incorporates both pre-existing and new legislative policies.

The AAOS Group in R&D translated the logic contained in the flowcharts into a Simulation Studio model. The entities (or objects) that flow through the model represent criminal cases. They have attributes (or properties) such as case number, gender, race, age, and prison term. At simulation execution, case entities are generated and routed according to their attributes over a ten-year period. For example, when it is time for a case entity to be released from prison, a check is done to see if that entity qualifies for post-release supervision. If so, then the entity is routed to logic that samples a random number stream to determine whether or not that entity will commit a violation at some point in the future.
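
For readers who have not worked with discrete-event simulation before, here is a toy sketch of the mechanics described above: an event queue ordered by time, entities flowing through it, and random sampling to decide what happens to each entity next. It is not the Simulation Studio model itself, just a rough Python illustration, and every rate and probability in it is made up.

```python
# A toy discrete-event simulation of admissions and releases, loosely
# inspired by the prison model described above. All numbers are hypothetical.
import heapq
import random

random.seed(1)
horizon = 120                       # ten-year horizon, in months
events = []                         # priority queue of (time, event_type, case_id)
population = 0

# Schedule hypothetical admissions over the horizon (~50 per month on average).
t, case_id = 0.0, 0
while t < horizon:
    t += random.expovariate(50)
    case_id += 1
    heapq.heappush(events, (t, "admission", case_id))

while events:
    time, kind, cid = heapq.heappop(events)
    if time >= horizon:
        break
    if kind == "admission":
        population += 1
        sentence = random.lognormvariate(2.5, 0.6)          # months served (made up)
        heapq.heappush(events, (time + sentence, "release", cid))
    else:                                                    # release event
        population -= 1
        if random.random() < 0.3:                            # post-release violation?
            lag = random.uniform(1, 24)                      # months until return
            heapq.heappush(events, (time + lag, "admission", cid))

print("simulated population at the end of the horizon:", population)
```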

The inputs to the Simulation Studio model are in the form of SAS data sets and include the following:

  1. Stock data, provided by the Department of Public Safety’s Division of Adult Correction: includes inmates in prison at the beginning of the projection period and their projected release date.
  2. Court data, provided by the AOC: contains convictions and sentence imposed in the most recent fiscal year.
  3. Growth estimates: projected growth rate for convictions as determined by the Sentencing Commission’s forecasting advisory group after examining demographic trends, crime trends, and arrest trends.

The court and stock data include both individual-level information (such as demographics, offense, and sentence) as well as aggregate-level information (such as the probability of receiving an active sentence by offense class and the lag-time between placement on probation and a return to prison for a violation).

At the end of the simulation, two SAS data sets are generated, providing a complete history of prison admissions and releases over a ten-year period. From this data, monthly and annual projections can be prepared at an aggregate level as well as by variables of interest such as gender, race, age, and offense class.

After the AAOS group finished building the simulation model, it was handed over to the Sentencing Commission, along with documentation and training for the Simulation Studio modeling interface, so that the Sentencing Commission could then run the model and make changes as needed. They have used the model to prepare projections for two years now, with the first official results being published in February of 2013. Figure 1 shows the projected prison population and capacity for FY 2014 through FY 2023. The prison population is projected to increase from 37,679 in June 2014 to 38,812 in June 2023, an increase of 3%. A comparison of the projections with the operating capacity indicates that the projected prison population will be below prison capacity for the ten-year projection period. In June 2014 the actual average prison population was 37,731, so the model-projected population of 37,679 was within 0.2% of the actual. The current projections, as well as projections from previous years, are located on the Sentencing Commission’s website.

This project demonstrates a very promising application of discrete-event simulation in practice. The resulting Simulation Studio model not only incorporates changes to correctional policies as a result of the JRA, but it can easily be modified by the Sentencing Commission to incorporate any future legislative acts that affect the prison process, providing the state the flexibility and transparency it desired. The model can also be extended to project other criminal justice populations (such as juveniles) in both NC and other states.

Figure 1: Projected prison population and prison capacity, FY 2014 through FY 2023

Building a $1 billion machine learning model

At the KDD conference this week I heard a great invited presentation called How to Create a $1 billion Model in 20 days: Predictive Modeling in the Real World – A Sprint Case Study. It was presented by Tracey de Poalo from Sprint and former Kaggle President and well-known machine learning expert Jeremy Howard (@jeremyphoward). Jeremy convinced Sprint’s CEO that machine learning could help their business, so he was brought on as a consultant to work with Tracey and her team. The result was the $1 billion model, which he called the highest-value machine learning case he had ever seen.

Jeremy had the executive blessing they needed to get access to key teams, so they conducted 40-50 interviews to identify which business problems to prioritize for their work. Based on these interviews they decided to prototype models for churn, application credit, behavioral credit, and cross-sell. When ready to tackle the data, Jeremy was impressed that they were ahead of the curve. Tracey’s team had already built a data mart of 10,000 features on each customer. Jeremy said their thorough and well-organized data dictionary was the best he’d seen in his career.

For a planned benchmarking exercise, Jeremy chose his favorite Kaggle-winning scripts from the R packages caret and randomForest. Based on his past Kaggle success, he felt confident he’d beat her existing models. When the results were in, he confessed he was shocked that his were almost the same as hers, which were based on logistic regression. Kudos to Jeremy for his refreshing honesty, as someone commented during the Q&A.

Tracey’s team’s process was very rigorous and completely automated. It used: 1) missing value imputation; 2) outlier treatment; 3) variable reduction (cutting the variables by ~65%); 4) transformations; 5) VIF (limited to 10); 6) stepwise regression (down to ~1,000 variables); 7) model refitting (50-75 variables left). Jeremy was most amazed at Tracey's strategic use of variable clustering, commenting that it is an interesting approach that he hadn’t seen elsewhere. She ranked her variables by R-squared and then picked one variable per cluster.
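
For readers who want to try the general idea, here is a minimal sketch of variable clustering followed by a pick-one-per-cluster reduction. It is not Sprint's pipeline and not SAS PROC VARCLUS, just a rough illustration on simulated data, with the cluster count chosen arbitrarily.

```python
# A minimal sketch of VARCLUS-style variable reduction, assuming a pandas
# DataFrame X of numeric candidate predictors (hypothetical data).
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 20)),
                 columns=[f"x{i}" for i in range(20)])

# 1) Cluster the variables on their correlation structure.
corr = X.corr().values
dist = 1.0 - np.abs(corr)                         # similar variables -> small distance
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
labels = fcluster(Z, t=5, criterion="maxclust")   # e.g. 5 clusters, chosen arbitrarily

# 2) Within each cluster, keep the variable with the highest R-squared
#    against the cluster's first principal component (a stand-in for the
#    "one variable per cluster" rule described in the post).
keep = []
for c in np.unique(labels):
    cols = X.columns[labels == c]
    sub = X[cols] - X[cols].mean()
    pc1 = np.linalg.svd(sub.values, full_matrices=False)[0][:, 0]
    r2 = {col: np.corrcoef(sub[col], pc1)[0, 1] ** 2 for col in cols}
    keep.append(max(r2, key=r2.get))

print("representative variables:", keep)
```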

As a result of their work together, their new model identified nine variables that explained the majority of bad debt. Combining these factors with customer credit data, they were able to estimate customer lifetime value, which allowed them to quantify the cost of making a bad call on credit. Adding those costs up, you reach $1 billion in value.

A history of machine learning in SAS

What I love about the machine learning model Tracey's team had in place is that it has its roots in a very early SAS procedure, VARCLUS, which goes back to at least the early 1980s. As I wrote before, machine learning is not new territory for SAS. SAS implemented a k-means clustering algorithm in 1982 (as described in this paper with PROC FASTCLUS in SAS/STAT®), but after reading my post Warren Sarle pointed out that PROC DISCRIM did k-nearest-neighbor discriminant analysis at least as far back as SAS 79. This early procedure was written by a certain J. H. Goodnight, whom some may recognize as SAS founder and CEO.

Perceptron

A neural learning technique called the perceptron algorithm was developed as far back as 1958. But neural network research made slow progress until the early 1990s, when the intersection of computer science and statistics reignited the popularity of these ideas. In Warren Sarle’s 1994 paper Neural Networks and Statistical Models (where I found the illustration to the left), he even says that “the most commonly used artificial neural networks, called multilayer perceptrons, are nothing more than non-linear regression and discriminant models that can be implemented with standard statistical software.” He then explains that he will translate “neural network jargon into statistical jargon.”
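
To make Sarle's point concrete, here is a tiny sketch (not from his paper) that fits a one-hidden-layer perceptron to simulated data exactly the way you would fit any other nonlinear regression; the data, network size, and solver settings are all arbitrary.

```python
# A tiny illustration of "an MLP is just nonlinear regression":
# fit a one-hidden-layer network to a simulated smooth curve.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 300).reshape(-1, 1)
y = np.tanh(2 * x).ravel() + rng.normal(0, 0.05, 300)   # made-up data

mlp = MLPRegressor(hidden_layer_sizes=(5,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(x, y)
print("MLP R^2:", round(mlp.score(x, y), 3))  # a smooth nonlinear fit, no "neural" magic required
```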

Flash forward to today, when this article from Forbes reports that the most popular course at Stanford is one on machine learning. The field is popular once again, and the discussions and papers at KDD this week certainly reflected that trend. While machine learning is nothing new for SAS, there is a lot of new machine learning in SAS. You can read more about machine learning in SAS® Enterprise Miner in this paper and in SAS® Text Miner in this paper, to name just a few of our products with machine learning features. Now grab some and go build your own $1 billion model!

Why corporate economists are hot again and a great source for analytical talent

A while back The Wall Street Journal published the article “Corporate Economists Are Hot Again,“ which chronicles the resurgence of in-house economists in corporate America. The role of a corporate economist may conjure up images of classic economist stereotypes (watch Ben Stein play to this stereotype as a teacher in the great 1986 movie Ferris Bueller's Day Off - search for "anyone, anyone" and the movie title for a good laugh). These types of prognosticators were popular in the 1970s and 1980s as companies attempted to turn the volatile macroeconomic environment into a competitive advantage. The subsequent near-twenty-year economic expansion and decreasingly volatile economy reduced the need for full-time economists, since the future continued to appear near-certain. Recently, economists are being hired again, but this time for a completely different reason, one that I have been evangelizing since my start at SAS: economists are a great source of analytical talent. They have all the necessary skills, which is why many companies are hiring them into these roles. Economists are poised to break into data science roles for these five reasons:

  1.  We understand objective functions: Economists love objective functions, since they dictate how the players in a system behave. This is important both in predicting outcomes and in conducting analysis. If the objective is to understand how price affects quantity, variable selection mechanisms cannot be allowed to eliminate the price variable (see the sketch after this list).
  2.  Economists have a very strong linear regression toolkit: While economists often do not have the breadth of statistical methods that a formally trained statistician has (we miss out on clustering and variable reduction, to name a few), we know what we know with great depth. And fortunately, very few problems require more than linear regression. There is one subtle tweak to an economist’s regression toolkit, which is….
  3.  We own observational data and causality: Economists never assume we have the luxury of experimental data. We always assume that the data are rife with issues such as measurement error, censoring, and sample selection, and for these reasons economists have tweaked their regression training to address all of these problems. Nearly all the corporate customers of SAS I have met model data generated outside a lab. The data are collected retrospectively and have all the problems listed above and more.
  4.  Articulating the problem and the solution: This reason is closely tied to the first point. Economists can talk about the problem and explain the solution. I have heard my fellow economists call this trait “storytelling” (hat tip to John Moreau), and I think that term perfectly describes our skills here. SAS customers often tell me that they like the way economists conduct regression, because they look at the coefficients to verify they align with theory. Part of the storytelling proficiency is skill at explaining what incentives led to a given response. Other disciplines tend to focus on statistical fit rather than explanation.
  5.  We work with big data: While this might not be immediately obvious, economists are very skilled at dealing with data that are uncomfortably large. Nearly every labor or health economics course requires a data replication project involving multiple years of the US Census Bureau’s Current Population Survey or its 5-percent Public Use Microdata Sample (PUMS). These datasets are easily multiple gigabytes in size and require programming efficiency to process.
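
As promised in item 1, here is a minimal sketch of the kind of regression an economist would insist on: price stays in the specification so its coefficient can be read as an elasticity. The data, coefficients, and the statsmodels-based workflow are purely illustrative.

```python
# A minimal sketch of a log-log demand regression where price must stay
# in the model. Data and coefficients are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1_000
price = rng.uniform(1.0, 10.0, n)
promo = rng.binomial(1, 0.3, n)
# Simulated demand: a 1% price increase lowers quantity by about 1.2%.
log_q = 3.0 - 1.2 * np.log(price) + 0.4 * promo + rng.normal(0, 0.3, n)

X = sm.add_constant(np.column_stack([np.log(price), promo]))
fit = sm.OLS(log_q, X).fit()
print(fit.params)   # the coefficient on log(price) is the price elasticity (~ -1.2)
```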

In fact, perhaps one of the most famous advocates of the “economist as data scientist” argument is Hal Varian. While his comment about statisticians being sexy is far better known, he is an economist himself, and the full quote sums it up best:

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves.” –Hal Varian, Chief Economist, Google[1]

Too bad he didn't call economists sexy.

So what holds economists back? I have my theories. I believe there are three key areas we must address: 1) terminology, 2) methodology and 3) technology. I will elaborate on these during my upcoming talk at the National Association for Business Economics Annual Meeting in Chicago September 27-30. If you find yourself in the area, I hope you can attend.

Looking backwards, looking forwards: SAS, data mining, and machine learning

Looking forward, ten of my SAS colleagues and I are heading to New York City this weekend for KDD 2014: Data Science for the Social Good, which runs August 24-27. The event’s full name is the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (ACM SIGKDD being the Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining), but it is more commonly known as just KDD for short.

Looking backwards, the first KDD workshop was held in 1989, and those workshops eventually grew into this series of conferences. Whether you still call it data mining, or prefer machine learning or data science, the fact that this year’s conference is sold out, with its 2,200 registered attendees exceeding all expectations, is a sign of how much this topic is trending. KDD’s tagline today is “bringing together the data mining, data science, and analytics community,” so this nexus is right where SAS has played for years. In fact, the picture below is taken from a data mining primer course SAS offered in 1998.

Venn diagram from a 1998 SAS data mining primer course, showing statistics overlapping with data mining and machine learning

The SAS story starts in the statistics circle above: the language was first developed in 1966, multiple regression and ANOVA were added in 1968, the first licenses were sold in 1972, and the company incorporated in 1976. SAS moved into the data mining and machine learning circle early, when in 1982 the FASTCLUS procedure implemented k-means clustering. But while there’s more to this history, I’ll save it for another post and return to a forward-looking view.

I’m looking forward to hearing a keynote on Sunday night by Pedro Domingos (Department of Computer Science and Engineering at the University of Washington), who is the 2014 winner of the ACM SIGKDD Innovation Award and will be giving the talk associated with that award at the conference. I found his paper A Few Useful Things to Know about Machine Learning to be an excellent resource. On Monday morning Oren Etzioni (Executive Director of the Allen Institute for Artificial Intelligence, from the same department at the University of Washington) will give a talk on “The Battle for the Future of Data Mining,” which certainly will inform my forward-looking view. It will be interesting to hear where he thinks the field is heading, and where the battles will lie.

On Monday morning, right after we’ve heard Dr. Etzioni look to the future, my colleague Zheng Zhao will present a paper he co-authored with our fellow SAS researchers James Cox and Jun Liu, “Safe and Efficient Screening For Sparse Support Vector Machine,” in the Feature Selection Research Track. The emergence of big-data analysis poses new challenges for model selection with large-scale data consisting of tens of millions of samples and features. The paper proposes a novel screening technique to accelerate model selection for SVMs and effectively improve their scalability: it can precisely identify inactive features in the optimal solution of an SVM model and remove them before training. Experimental results on five high-dimensional benchmark data sets demonstrate the power of the proposed technique.

SAS will be in the exhibit hall with a booth (#14). In addition to talking about the products SAS offers for machine learning, we will be talking about our new SAS Analytics U initiative, which includes SAS® University Edition, a free, downloadable version of select SAS statistical software that runs on PCs, Macs, and Linux and is designed for teaching and learning SAS. We'll also be giving away some copies of our colleague Jared Dean's new book, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. In the booth on Monday and Tuesday we will also offer what we call superdemos, 15-minute demos on focused topics. Here is the list:

Monday, August 25, 10:00-10:15 a.m.
Deep learning for dimensionality reduction/visualization
Jorge Silva
We will showcase deep learning with PROC NEURAL, using a deep auto-encoder architecture to visualize clustering results on medical provider data. 
 
Monday, August 25, 1:00-1:15 p.m.
Contextual Recommendation using Text Analysis
Yue Qi
The collaborative filtering-based recommender is prone to the cold start problem and long tail problem, so this demo will show how to derive contextual recommendations using text analysis to address both problems.
 
Monday, August 25, 3:00-3:15 p.m.
Time series dimension reduction for data mining using SAS
Catherine Lopes
This demo introduces SAS procedures for time series dimension reduction in data mining.
 
Monday, August 25, 5:00-5:15 p.m.
New techniques for doing association classification and a demonstration of their usefulness for mining text
Jim Cox
We will describe two new algorithms for pattern discovery with a single consequent or external category: Bool-yer and AssoCat.
 
Tuesday, August 26, 10:00-10:15 a.m.
R integration node
Jorge Silva
This demo will illustrate the diagram and workflow user interface and also focus on how people can try their favorite R algorithms while taking advantage of data handling and pre-processing capabilities built into SAS® Enterprise Miner.
 
Tuesday, August 26, 1:00-1:15 p.m.
Classification Using Bayesian Networks in SAS® Enterprise Miner
Weihua Shi
Using a newly developed high-performance Bayesian network procedure (PROC HPBNET), this demo will illustrate the graphical-modeling approach using real-world data.
 
Tuesday, August 26, 3:00-3:15 p.m.
Interactive Stratified Modeling using SAS® Visual Statistics
Wayne Thompson
This demo will show how to develop stratified models based on group-by variables, use decision trees to derive segments and enforce business rules, and cluster demographic data followed by supervised models using transactional data.
 

If you are already planning on attending KDD, come by booth #14 and see us. If you didn’t register in advance you’re probably out of luck, since the conference is sold out. But I plan to blog again after the conference and will offer some impressions from the event, as well as share some more history about SAS, data mining, and machine learning, continuing with my backward and forward looks.

An intuitive approach to the appropriate use of forecasts

It is a mild summer evening in July at Lake Neusiedl here in Austria. The participants of the traditional YES Cup Regatta are sitting with beer and barbecue chops on the terrace of our clubhouse. The mood is relaxed, and everyone wants to tell their story after two eventful races.

A conversation at the end of our table draws my attention, because it is about forecasting, more specifically the usability and accuracy of weather and wind forecasts. As expected, the opinions differ substantially. From "mostly wrong" to "we should be thankful that we have them – in earlier times no forecast existed on that level of detail" to "I make my decisions based on the cloud pictures."

Knowing the wind conditions before a regatta is important, because it enables good decisions such as: "What size of sail should I use to start the race, so that I don’t have to change during the regatta?" or "What wind direction will prevail, and which areas of the lake will therefore be favored?"

Marc, an old stager in race sailing, explains his use of wind forecasts as follows:

"I always consider several available forecasts; Windguru, Windfinder, Otto Lustyk, swz.at and ORF Burgenland . So I get a picture of the diversity or uniformity of the possible wind scenarios - because obviously the stations use different weather models. So I can judge whether weather and wind for the race weekend is easy or hard to predict and how much I can trust the forecasts in general. In addition, I also monitor how much the predictions for the weekend change during the week. If they stay stable all week, the weather seems to allow a clear prediction; if the predictions change daily, it seems that we get very unstable whether conditions. On the race day itself, watching the clouds and the sky is very important. Short-term and local facts cannot be included in these models and give me additional information based on my experience.”


A smile can be seen on my face, and I intentionally do not participate in the conversation, because I do not want to be seen as the statistician who “always considers everything so mathematically." And even more importantly, there is nothing to add to Marc's statement. Without knowing it, he has summarized the most important principles of business forecasting and described the proper handling of statistical forecasts. And although his professional background is definitely not in dealing with data, forecasts, or things like "business intelligence," Marc could work alongside me at the forecasting demo station at the next Analytics Conference, because what he has just explained maps to important features in SAS® Forecast Server.

  • Combined models for stable forecasts: "I always consider several available forecasts."
  • Segmentation of time series: "I can judge whether weather and wind for the race weekend are easy or hard to predict."
  • Confidence intervals for forecasts: "How much I can trust the forecasts in general."
  • Forecast stability analysis and rolling simulations: "I also monitor how much the predictions for the weekend change during the week."
  • Overrides and judgmental forecasts: "Short-term and local facts cannot be included in these models and give me additional information."


So enjoy the fact that there is software that does the very things people consider intuitively correct, smile with satisfaction, and head toward the beer tap for another beer. At least, that's what we did after Marc shared his intuition with us.

 

Combined forecasts: what to do when one model isn’t good enough

My esteemed colleague and recently published author Jared Dean shared some thoughts on how ensemble models help make better predictions. For predictive modeling, Jared explains the value of the two main forms of ensembles - bagging and boosting. It should not be surprising that the idea of combining predictions from more than one model can also be applied to other analytical domains, such as statistical forecasting.

Forecast combinations, also called ensemble forecasting, are the subject of many academic papers in statistical and forecasting journals; they are a known technique for improving forecast accuracy and reducing the variability of the resulting forecasts. In their article “The M3 Competition: Results, Conclusions, and Implications," published in the International Journal of Forecasting, Spyros Makridakis and Michèle Hibon write about the results of a forecasting competition and share as one of their four conclusions: “The accuracy of the combination of various methods outperforms, on average, the specific methods being combined and does well in comparison with other methods.”

The lesson from this statement is that a combination of forecasts from simple models can add substantial value in terms of enhancing the quality of the forecasts produced, but the statement also concedes that combinations might not always perform better than a suitably crafted single model.

But how do you combine statistical forecasts? Similar to ensembles for predictive models, the basic idea is to combine the forecasts created by individual models, such as exponential smoothing models or ARIMA models. Let’s have a look at three combination techniques typically used (a small sketch follows the list):

  • Simple average
    • Every individual forecast receives the same weight – while this sounds like a simplistic idea, it has proven very successful for practitioners, particularly when the individual forecasts are very different from each other.
  • Ordinary least squares (OLS) weights
    • In this approach an OLS regression is used to combine the individual forecasts. The main idea is to assign higher weights to the more accurate forecasts.
  • Restricted least squares weights
    • This extends the idea of OLS weights by forcing constraints on the individual weights. For example, it might make sense to force all weights to be non-negative.
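
Here is the small sketch promised above: a rough Python illustration of all three weighting schemes on a short, made-up holdout sample. It is not how SAS® Forecast Server implements combination; the numbers and the model labels in the comments are hypothetical.

```python
# Combining three hypothetical forecasts against holdout-period actuals.
import numpy as np
from scipy.optimize import nnls

y = np.array([112., 118., 121., 127., 133., 140.])          # actuals
F = np.column_stack([                                        # columns = individual forecasts
    np.array([110., 116., 123., 125., 131., 138.]),          # e.g. exponential smoothing
    np.array([115., 120., 119., 130., 136., 141.]),          # e.g. ARIMA
    np.array([108., 114., 122., 126., 130., 139.]),          # e.g. regression with inputs
])

# 1) Simple average: equal weight for every model.
avg_combo = F.mean(axis=1)

# 2) OLS weights: regress the actuals on the individual forecasts.
w_ols, *_ = np.linalg.lstsq(F, y, rcond=None)
ols_combo = F @ w_ols

# 3) Restricted least squares: constrain the weights to be non-negative.
w_nn, _ = nnls(F, y)
nnls_combo = F @ w_nn

for name, combo in [("average", avg_combo), ("OLS", ols_combo), ("NNLS", nnls_combo)]:
    print(name, "MAE:", np.mean(np.abs(y - combo)))
```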

It is worth mentioning that estimating prediction error variance needs to be considered separately. In all cases, the estimated prediction error variance of the combined forecast uses the estimates of prediction error variance from the forecasts that are combined.

Not every time series forecast benefits from combination. The power of this technique becomes apparent when you consider that modern software such as SAS® Forecast Server allows for combination methods to be applied to large-scale time series forecasting of hierarchically structured data. The software makes it possible to generate combinations for inclusion into its model selection process in an automated fashion. In all cases, combined forecasts must prove their worth by their performance in comparison to other forecasts in the model selection process. If you are interested in more details this paper provides an extended explanation.

How ensemble models help make better predictions

My oldest son is in the school band, and they are getting ready for their spring concert. Their fall concert was wonderful; hearing dozens of students with their specific instruments playing together creates beautiful, rich-sounding music. The depth of sound from orchestral or symphonic music is unmatched.

In data mining, and specifically in the area of predictive modeling, a similar effect can be created using ensembles of models, which lead to results that are more “beautiful” than those of a single model. A predictive model ensemble combines the posterior predictions from more than one model. When you combine multiple models you create model crowdsourcing. Each individual model is described by a set of rules, and when the rules are applied in concert you can consider the "opinions" of many models. How to use these opinionated models depends on the goal. The two main ways are to (1) let every model vote and decide the target label democratically or (2) label the target with the opinion of the most confident model (probabilistically speaking).

Types of Ensembles

The two main forms of ensembles are boosting and bagging (more precisely, bootstrap aggregating). The most popular ensembles use decision trees; random forests and gradient boosting machines are two examples that are very popular in the data mining community right now. But decision trees are not the only option: any modeling algorithm can be part of an ensemble, and heterogeneous ensembles can be quite powerful.

Bagging

Bagging, as its full name suggests, takes repeated unweighted samples, with replacement, of the data to build models and then combines them. Think of your observations as grains of wild rice in a bag. Your objective is to identify the black grains, because they have a resale price 10x greater when sold separately.

  1. Take a scoop of rice from the bag.
  2. Use your scoop of rice to build a model based on the grain’s characteristics, excluding that of color.
  3. Write down your model classification logic and fit statistics.
  4. Pour the scoop of rice back into the bag.
  5. Shake the bag for good measure and repeat.

How big the scoop is relative to the bag, and how many scoops you take, will vary by industry and situation, but I usually use 25-30% of my data and take 7-10 samples. With those settings, each observation is likely to be included one to two times across the models.
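
For readers who want to see the recipe in code, here is a minimal bagging sketch on simulated data that follows the scoop-of-rice steps above, using scikit-learn decision trees as the member models; the sample fraction, number of rounds, and data are arbitrary choices.

```python
# Hand-rolled bagging: repeated samples with replacement, one model per
# "scoop," and a democratic vote at prediction time. Data are simulated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
rng = np.random.default_rng(1)

models = []
for _ in range(8):                                   # 7-10 "scoops"
    idx = rng.choice(len(X), size=int(0.25 * len(X)), replace=True)
    models.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))

# Democratic vote: average the members' predicted classes and round.
votes = np.mean([m.predict(X) for m in models], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
print("ensemble training accuracy:", (ensemble_pred == y).mean())
```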

Boosting

Boosting is similar to bagging except that the observations in the samples are now weighted. To follow the rice problem from above, after step 3 I would set aside the grains of rice I had incorrectly classified (e.g., black grains I said were non-black, or non-black grains I thought were black). I would then take a scoop of rice from the bag, leaving some room to add the grains I had incorrectly classified. By including previously misclassified grains at a higher rate, the algorithm has more opportunities to identify the characteristics that lead to correct classifications. This is the same idea behind giving more review time to flashcards of facts you didn't know than to those you did. For what it's worth, I tend to use bagging models for prediction problems and boosting for classification problems.

By taking multiple samples of the data and modeling over iterations, you allow factors that are otherwise weak to be explored. This provides a more stable and generalizable solution. When model accuracy is the most important consideration, ensemble models will be your best bet. This topic was recently discussed in much greater detail at SAS Global Forum; see this paper by Miguel Maldonado for more details.
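
Here is a correspondingly rough sketch of the boosting idea, again on simulated data: each round up-weights the observations the previous model got wrong, in the spirit of AdaBoost. It illustrates the reweighting mechanics only and does not reflect any specific SAS implementation.

```python
# AdaBoost-style boosting: misclassified observations get more weight in
# the next round, and the rounds vote with confidence weights. Simulated data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=2)
w = np.full(len(X), 1.0 / len(X))                   # start with equal weights

stumps, alphas = [], []
for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)           # confidence of this round
    w *= np.exp(alpha * np.where(miss, 1, -1))      # up-weight the misses
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote over rounds (labels mapped to -1/+1 for the sign trick).
score = sum(a * np.where(s.predict(X) == 1, 1, -1) for a, s in zip(alphas, stumps))
pred = (score > 0).astype(int)
print("boosted training accuracy:", (pred == y).mean())
```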

Image credit: photo by Ludovico Sinz, used under a Creative Commons attribution license

How Bayesian analysis might help find the missing Malaysian airplane

At the time this blog entry was written, there were still few to no signs of the missing Malaysian flight MH370 being located. The area of search, although already narrowed down from the size of the United States at one point to the size of Poland, is still vast and presents great challenges to all participating nations. Everything we’ve seen in the news so far has been leads that turn out to be nothing but dead ends.

There are a great many uncertainties surrounding the disappearance of flight MH370, making a search and rescue operation seem like nothing short of finding a needle in an ocean-sized haystack. There is, however, an established statistical framework based on Bayesian inference that has had great success in locating, amongst other things, a hydrogen bomb lost over the Mediterranean Sea [1], a sunken US Navy nuclear submarine (USS Scorpion) [1], and the wreckage of Air France Flight 447 just several years ago.

The U.S. Coast Guard’s SAROPS (Search and Rescue Optimal Planning System) is based on the same Bayesian search framework, refined to accommodate ocean drift and crosswinds. As there is currently no evidence that the Malaysian government or Malaysia Airlines is employing a Bayesian optimal search method, it is worthwhile to point out why a Bayesian search strategy should at least be considered for a situation such as the missing MH370 case.

Unknown variables

First of all, there are still many unknowns regarding the missing MH370. Unknown variables are typically modeled probabilistically in the statistical world. Most of us are familiar with the frequency definition of probability. If I handed you an old beat-up coin and asked you to tell me the probability of heads when the coin is flipped, your best bet would be to flip the coin, say, 5000 times and record the number of times it came up heads. Then you would divide the number of heads by 5000 and get a pretty good estimate of the probability in question. This is the frequency interpretation of probability: the probability of an event is the relative frequency of the event happening in an infinite population of repeatable trials.

In the real world, however, we are often faced with rare and unique events, events that are non-repeatable. Hopefully, we wouldn’t have to study 5000 plane crashes to get a good estimate of the probability of a plane accident. In reality, there have only been 80 recorded missing planes since 1948. This calls for a different interpretation of probability, a subjective one that reflects an expert’s degree of belief. The subjective nature of the uncertainties of a rare event such as the loss of flight MH370 places us squarely in the domain of Bayesian inference. In the case of Air France 447, the prior distribution (the initial belief about the crash location) over the search area was taken to be a mixture of three probability distributions, each representing a different scenario. The mixture weights were then decided based on consultations with experts at the BEA.

All information is useful

A big advantage of employing a Bayesian search method is that a Bayesian framework provides a systematic way to incorporate all available information via Bayes’ rule. This is invaluable in a large and complex search operation where new information constantly emerges and the situation can change at a moment’s notice, requiring the search strategy to be updated continually. The important thing to note here is that any information is considered useful. One area turning up empty will lower the probability of the wreckage being in that area after a Bayesian update, but at the same time, it will increase the probability of the wreckage being in the areas not yet searched.

Air France 447 went missing in June 2009. When the BEA commissioned the scientific consulting firm Metron Scientific Solutions to come up with a probability map of the search area in 2011, two years of search efforts had turned up nothing. In their model, the Metron team took into account all four unsuccessful previous searches when updating their prior distribution of the crash location. Based on their recommendation to resume the new round of search efforts in the region with the highest posterior probability, the wreck was located only one week into the search [2].

While there are many intricate steps involved in deploying a Bayesian search strategy, particularly in coming up with the prior distribution and quantifying the likelihood of the different accident scenarios, the core math involved is surprisingly straightforward. For illustration purposes, assume that the search area is divided up into N grids, labelled x1 through xN. Let the prior probability of the wreck being in grid xk be denoted by p(xk+), for k=1,…,N. Now let the probability of successful detection in grid xk, given that the wreck is in grid xk, be denoted p(Sk+|xk+). If the search in grid xk turns up empty, then the posterior probability of the wreck being in grid xk, given that the search in grid xk was unsuccessful, is:

$$ p(x_k^+ \mid S_k^-) \;=\; \frac{p(x_k^+)\,\bigl(1 - p(S_k^+ \mid x_k^+)\bigr)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)} $$

Meanwhile, the posterior probability of the wreck being in any other grid xm is also updated by the information that the search in grid xk turned up unsuccessful:

$$ p(x_m^+ \mid S_k^-) \;=\; \frac{p(x_m^+)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)} $$

Note that p(xk+|Sk-) < p(xk+), and p(xm+|Sk-) > p(xm+).

A Bayesian search strategy would start with the grids that have the highest prior probability mass. If nothing is found in those grids, the posterior probability of all grids is updated via Bayes' theorem, and the process starts over, treating the new posterior probabilities as the current prior probabilities. This could be a lengthy process, but as long as the wreck lies within the prior region, it will eventually be located.
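
As a toy illustration of that loop, the sketch below simulates a sequential Bayesian search over a small grid. The prior, detection probabilities, and grid size are all invented; it shows only the mechanics of the update, not SAROPS or the Metron team's model.

```python
# Sequential Bayesian search over a hypothetical grid of N cells.
import numpy as np

rng = np.random.default_rng(7)
N = 50
p = rng.dirichlet(np.ones(N))          # prior probability the wreck is in each cell
d = np.full(N, 0.8)                    # P(detect | wreck is in the searched cell)
true_cell = rng.choice(N, p=p)         # where the wreck "really" is in this simulation

for step in range(1, 1001):
    k = int(np.argmax(p))              # always search the most probable cell next
    if k == true_cell and rng.random() < d[k]:
        print(f"found in cell {k} after {step} searches")
        break
    # Unsuccessful search of cell k: Bayes update for every cell.
    denom = 1.0 - p[k] * d[k]
    p = p / denom                      # all cells gain from the negative result...
    p[k] *= (1.0 - d[k])               # ...except cell k, which also pays the (1-d) factor
else:
    print("not found within 1000 searches")
```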

An unprecedented search area

When AF447 crashed, the BEA was able to quickly establish that the plane had to lie within a circle of 40 nautical miles' radius around the plane’s last known location. That is roughly 6,600 square miles of initial search area, compared to MH370’s current Poland-sized search area of more than 100,000 square miles. Considering that it took two years and five rounds of search efforts to finally locate AF447, the difficulty involved in finding MH370 is unprecedented in the history of modern aviation. While a Bayesian search method might not locate the remains of MH370 any time soon, its flexibility and systematic nature, not to mention its past successes, make it a powerful tool to seriously consider for the current search efforts.

For interested readers, here is the paper that documented the Metron team's efforts in using Bayesian inference to develop the probability map of AF447’s location.

References:

1. S. B. McGrayne. “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy”, Yale University Press, 2011.

2. L. D. Stone, C. M. Keller, T. M. Kratzke and J. P. Strumpfer. “Search for the Wreckage of Air France Flight AF 447”, submitted to Statistical Science, 2013.

March Madness and Predictive Modeling

Jared Dean and son at a 2013 NCAA Tournament game

In my region of North Carolina (Raleigh, Durham, and Chapel Hill) one of the most anticipated times of the year has arrived— the NCAA basketball tournament. This is a great time of year for me, because I get to combine several of my passions.

For those who don’t live among crazed college basketball fans, the NCAA (National Collegiate Athletic Association) holds an annual tournament that seeds the regional conference winners and the best non-conference-winning teams in a single-elimination tournament of 68 teams to determine the national champion in collegiate basketball. The teams are ranked and seeded so that the perceived best teams don’t face each other until the later rounds.

In the tournament history stretching back more than 75 years, only 14 universities have won more than one championship, and three schools local to SAS world headquarters are on that list (the University of North Carolina, Duke University, and North Carolina State University). That concentration, combined with the fact that this area is a well-known cluster for statistics, means that I am not alone amongst my neighbors in combining my passions.

The NCAA tournament carries with it a tradition of office betting pools, where coworkers, families, and friends predict the outcomes of the 67 games to earn money, pride, or both. Unbeknownst to many of them, they are building predictive models, something near and dear to my heart. As a data miner, I analyze data and build predictive models about human behavior, machine failures, creditworthiness, and so on. But predictive modeling in the NCAA tournament can be as simple as choosing the winner by favorite color, most fierce mascot, or alphabetical order. Others rely on their observations of the teams throughout the regular season and conference championships to inform their decisions, and then they use their “gut” to pick a winner when they have little or no information about one or both of the teams.

I’m sure some readers have used these kinds of strategies and lost – or maybe even won the “kitty” – in these betting pools, but the best results will come from using historical information to identify patterns in the data. For example, did you know that since 2008 the 12th seed has won 50% of the time against the 5th seed? Or that the 12th seed has beaten the 5th seed more often than the 11th seed has beaten the 6th seed?

Upon analyzing tournament data, patterns like these emerge about the tournament, specific teams (e.g. NC State University struggles to make free throws in the clutch), or certain conferences. To make the best predictions, use this quantitative information in conjunction with your own domain expertise, in this case about basketball.

Predictive modeling methodology generally comes from two groups: statisticians and computer scientists (who may take a more machine learning approach). The field of data mining encompasses both groups, with the same aim - to make correct predictions of a future event. Common data mining techniques include logistic regression, decision trees, generalized linear models, support vector machines (SVM), neural networks, and many, many more (all available in SAS).

While these techniques are applied to a broad range of problems, professors Jay Coleman and Mike DuMond have successfully used those from SAS to create their NCAA “dance card,” a prediction of the winners that has had a 98% success rate over the last three years.

If you think you have superior basketball knowledge and analytical skills, then hopefully you entered the ultimate payday competition from Warren Buffett. He will pay you $1 billion if you can produce a perfect bracket. Before you go out and start ordering extravagant gifts, it is worth considering that the odds of winning at random are 1 in 148 quintillion (148,000,000,000,000,000,000), but with some skill your odds could improve to 1 in 1 billion. In this new world of crowdsourcing, about 8,000 people have united to try and win the billion-dollar prize. I don’t know how many picked Dayton to beat Ohio State last night, but that one game appears to have eliminated about 80% of participants.

If you’re looking for even more opportunities to combine basketball and predictive analytics, then check out this Kaggle contest with a smaller payday but better odds.

Statistician George Box is famous for saying, “essentially, all models are wrong but some are useful”.  I wish you luck in your office pool, and if you beat the odds remember the bloggers in your life :)