Building a $1 billion machine learning model

At the KDD conference this week I heard a great invited presentation called How to Create a $1 billion Model in 20 days: Predictive Modeling in the Real World – A Sprint Case Study. It was presented by Tracey de Poalo from Sprint and Jeremy Howard (@jeremyphoward), former Kaggle President and well-known machine learning expert. Jeremy convinced Sprint’s CEO that machine learning could help their business, so he was brought on as a consultant to work with Tracey and her team. The result was the $1 billion model, which he called the highest-value machine learning case he’s ever seen.

Jeremy had the executive blessing they needed to get access to key teams, so they conducted 40-50 interviews to identify which business problems to prioritize for their work. Based on these interviews they decided to prototype models for churn, application credit, behavioral credit, and cross-sell. When ready to tackle the data, Jeremy was impressed that they were ahead of the curve. Tracey’s team had already built a data mart of 10,000 features on each customer. Jeremy said their thorough and well-organized data dictionary was the best he’d seen in his career.

For a planned benchmarking exercise, Jeremy chose his favorite Kaggle-winning scripts from the R packages caret and randomForest. Based on his past Kaggle success he felt confident he’d beat her existing models. When the results were in he confessed he was shocked that his were almost the same as hers, which were based on logistic regression. Kudos to Jeremy for his refreshing honesty, as someone commented during the Q&A.

Tracey’s team’s process was very rigorous and completely automated. It used: 1) missing value imputation; 2) outlier treatment; 3) variable reduction (cutting the variables by ~65%); 4) transformations; 5) a variance inflation factor (VIF) check (limit of 10); 6) stepwise regression (down to ~1,000 variables); 7) model refitting (50-75 left). Jeremy was most amazed at Tracey's strategic use of variable clustering, commenting that it is an interesting approach that he hadn’t seen elsewhere. She ranked her variables by R² and then picked one variable per cluster.
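To make that last step concrete, here is a minimal Python sketch of picking one variable per cluster. The variable names, cluster labels and R² values are invented for illustration, and the talk did not say exactly which R² the team ranked on, so treat this as one plausible reading rather than Sprint's actual procedure.

```python
def pick_one_per_cluster(variables):
    """Keep the variable with the highest R-squared in each cluster.

    variables -- iterable of (name, cluster, r_squared) tuples
    """
    best = {}
    for name, cluster, r_squared in variables:
        # Replace the current pick only if this variable ranks higher.
        if cluster not in best or r_squared > best[cluster][1]:
            best[cluster] = (name, r_squared)
    return {cluster: name for cluster, (name, _) in best.items()}


# Hypothetical variables, clusters and R-squared values, for illustration only.
candidates = [
    ("late_payments_12m", "billing", 0.81),
    ("avg_bill_amount", "billing", 0.64),
    ("avg_monthly_minutes", "usage", 0.77),
    ("data_overage_count", "usage", 0.52),
]
chosen = pick_one_per_cluster(candidates)
```

Each cluster contributes exactly one representative, which is how a very wide data mart can shrink without discarding whole groups of related information.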

As a result of their work together their new model identified nine variables that explained the majority of bad debt. Combining these factors with customer credit data they were able to estimate customer lifetime value, which allowed them to quantify the cost for making a bad call on credit. Adding these costs up you reach $1 billion in value.

A history of machine learning in SAS

What I love about the machine learning model Tracey's team had in place is that it has its roots in a very early SAS procedure, VARCLUS, which goes back to at least the early 1980’s. As I wrote before, machine learning is not new territory for SAS. SAS implemented a k-means clustering algorithm in 1982 (as described in this paper with PROC FASTCLUS in SAS/STAT®), but after reading my post Warren Sarle pointed out that PROC DISCRIM did k-nearest-neighbor discriminant analysis at least as far back as SAS 79.  This early procedure was written by a certain J. H. Goodnight, who some may recognize as SAS founder and CEO.


A neural learning technique called the perceptron algorithm was developed as far back as 1958. But neural network research made slow progress until the early 1990’s, when the intersection of computer science and statistics reignited the popularity of these ideas. In Warren Sarle’s 1994 paper Neural Networks and Statistical Models (where I found the illustration to the left), he even says that “the most commonly used artificial neural networks, called multilayer perceptrons, are nothing more than non-linear regression and discriminant models that can be implemented with standard statistical software.” He then explains that he will translate “neural network jargon into statistical jargon.”

Flash forward to today, where this article from Forbes reports that the most popular course at Stanford is one on machine learning. It is popular once again, and the discussions and papers at KDD this week certainly reflected this trend. While machine learning is nothing new for SAS, there is a lot of new machine learning in SAS. You can read more on machine learning in SAS® Enterprise Miner in this paper and in SAS® Text Miner in this paper, to name just a few of our products with machine learning features. Now grab some and go build your own $1 billion model!

Why corporate economists are hot again and a great source for analytical talent

A while back The Wall Street Journal published the article “Corporate Economists Are Hot Again,” which chronicles the resurgence of in-house economists in corporate America. The role of a corporate economist may conjure images of classic economist stereotypes (watch Ben Stein play to this stereotype as a teacher in the great 1986 movie Ferris Bueller's Day Off - search for "anyone, anyone" and the movie title for a good laugh). These types of prognosticators were popular in the 1970’s and 1980’s as companies attempted to turn the volatile macroeconomic environment into a competitive advantage. The subsequent near-twenty-year economic expansion and decreasingly volatile economy reduced the need for full-time economists, since the future continued to appear near-certain. Recently, economists have been hired again, but this time for a completely different reason, one that I have been evangelizing since my start at SAS. Economists are a great source of analytical talent. They have all the necessary skills, which is why many companies are hiring them into these roles. Economists are poised to break into data science roles for these five reasons:

  1. We understand objective functions: Economists love objective functions, since they dictate how the players in a system behave. This can be important in both predicting outcomes and conducting analysis. If the objective is to understand how price affects quantity, automated variable selection can be risky because it may eliminate the price variable.
  2. Economists have a very strong linear regression toolkit: While economists often do not have the depth of statistical methods that a formally-trained statistician has (we miss out on clustering and variable reduction, to name a few), we know what we know with great depth. And fortunately, very few problems require more than linear regression. There is one subtle tweak to an economist’s regression toolkit, which is….
  3. We own observational data and causality:  Economists never assume we have the luxury of experimental data. We always assume that the data are rife with issues such as measurement error, censoring and sample selection. For these reasons, economists have tweaked their regression training to address all these problems. Nearly all the corporate customers of SAS I have met model data generated outside a lab. The data are collected retroactively and have all the problems listed above and more.
  4. Articulating the problem and the solution: This reason is closely tied to the first point. Economists can talk about the problem and explain the solution. I have heard my fellow economists call this trait “storytelling” (hat tip to John Moreau). I think that term perfectly describes our skills here. SAS customers often tell me that they like the way economists conduct regression, because they look at the coefficients to verify they align with theory. Part of the storytelling proficiency is skill at explaining what incentives led to this response. Other disciplines tend to focus on statistical fit rather than explanation.
  5. We work with big data: While this might not be immediately obvious, economists are very skilled at dealing with data that are uncomfortably large. Nearly every labor or health economics course requires a data replication project involving multiple years of the US Census Bureau’s Current Population Survey or their 5-percent Public Use Microdata Sample (PUMS). These datasets are easily multiple gigabytes in size and require programming efficiency to process.

In fact, perhaps one of the most famous advocates of the “economist as data scientist” argument is Hal Varian. While his comment about statisticians being sexy is far better known, he is an economist himself, and the full quote sums it up best:

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves.” –Hal Varian, Chief Economist, Google[1]

Too bad he didn't call economists sexy.

So what holds economists back? I have my theories. I believe there are three key areas we must address: 1) terminology, 2) methodology and 3) technology. I will elaborate on these during my upcoming talk at the National Association for Business Economics Annual Meeting in Chicago September 27-30. If you find yourself in the area, I hope you can attend.

Looking backwards, looking forwards: SAS, data mining, and machine learning

Looking forward, ten of my SAS colleagues and I are heading to New York City this weekend for KDD 2014: Data Science for the Social Good, which runs August 24-27. This event’s full name is the 20th Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining, but it is more commonly known as ACM SIGKDD, or just KDD for short.

Looking backwards, the first KDD workshop was held in 1989, and these workshops eventually grew into the series of conferences. Whether you still call it data mining, or prefer machine learning or data science, the fact that this year the conference is sold out, with the 2,200 registered exceeding all expectations, is a sign of the trending of this topic. KDD’s tagline today is “bringing together the data mining, data science, and analytics community,” so this nexus is right where SAS has played for years. In fact, the picture below is taken from a data mining primer course SAS offered in 1998.

[Figure: data mining Venn diagram]

The SAS story starts with the statistics circle above, when the language was first developed in 1966, multiple regression and ANOVA were added in 1968, the first licenses sold in 1972, and the company incorporated in 1976. SAS moved into the data mining and machine learning circle early, when in 1982 the FASTCLUS procedure implemented k-means clustering. But while there’s more to this history, I’ll save it for another post and return to a forward-looking view.

I’m looking forward to hearing a keynote on Sunday night by Pedro Domingos (Department of Computer Science and Engineering at the University of Washington), who is the 2014 winner of the ACM SIGKDD Innovation Award and will be giving the talk associated with that award at the conference. I found his paper A Few Useful Things to Know about Machine Learning to be an excellent resource. On Monday morning Oren Etzioni (Executive Director of the Allen Institute for Artificial Intelligence, from the same department at the University of Washington) will give a talk on “The Battle for the Future of Data Mining,” which certainly will inform my forward-looking view. It will be interesting to hear where he thinks the field is heading, and where the battles will lie.

On Monday morning, right after we’ve heard Dr. Etzioni look to the future, my own colleague Zheng Zhao will present a paper he co-authored with our fellow SAS peers James Cox and Jun Liu on “Safe and Efficient Screening For Sparse Support Vector Machine” in the Feature Selection Research Track. In this paper, a novel screening technique is proposed to accelerate model selection for SVM and effectively improve its scalability. The emergence of big-data analysis poses new challenges for model selection with large-scale data that consist of tens of millions of samples and features. This technique can precisely identify inactive features in the optimal solution of an SVM model and remove them before training. Experimental results on five high-dimensional benchmark data sets demonstrate the power of the proposed technique.

SAS will be in the exhibit hall with a booth (#14). In addition to talking about the products SAS offers for machine learning, we will be talking about our new SAS Analytics U initiative, which includes SAS® University Edition, a free, downloadable version of select SAS statistical software that runs on PCs, Macs, and Linux and is designed for teaching and learning SAS. We'll also be giving away some copies of our colleague Jared Dean's new book, Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. In the booth on Monday and Tuesday we will also offer what we call superdemos, which are 15-minute long demos on focused topics. Here is the list:

Monday, August 25, 10:00-10:15 a.m.
Deep learning for dimensionality reduction/visualization
Jorge Silva
We will showcase deep learning with PROC NEURAL, using a deep auto-encoder architecture to visualize clustering results on medical provider data.

Monday, August 25, 1:00-1:15 p.m.
Contextual Recommendation using Text Analysis
Yue Qi
The collaborative filtering-based recommender is prone to the cold start problem and long tail problem, so this demo will show how to derive contextual recommendations using text analysis to address both problems.

Monday, August 25, 3:00-3:15 p.m.
Time series dimension reduction for data mining using SAS
Catherine Lopes
This demo introduces SAS procedures for time series dimension reduction in data mining.

Monday, August 25, 5:00-5:15 p.m.
New techniques for doing association classification and a demonstration of their usefulness for mining text
Jim Cox
We will describe two new algorithms for pattern discovery with a single consequent or external category: Bool-yer and AssoCat.

Tuesday, August 26, 10:00-10:15 a.m.
R integration node
Jorge Silva
This demo will illustrate the diagram and workflow user interface and also focus on how people can try their favorite R algorithms while taking advantage of data handling and pre-processing capabilities built into SAS® Enterprise Miner.

Tuesday, August 26, 1:00-1:15 p.m.
Classification Using Bayesian Networks in SAS® Enterprise Miner
Weihua Shi
Using a newly developed high-performance Bayesian network procedure (PROC HPBNET), this demo will illustrate the graphic-modeling approach using real-world data.

Tuesday, August 26, 3:00-3:15 p.m.
Interactive Stratified Modeling using SAS® Visual Statistics
Wayne Thompson
This demo will show how to develop stratified models based on group-by variables, decision trees to derive segments and enforce business rules, and clustering demographic data followed by supervised models using transactional data.

If you are already planning on attending KDD, come by booth #14 and see us. If you didn’t register in advance you’re probably out of luck, since the conference is sold out. But I plan to blog again after the conference and will offer some impressions from the event, as well as sharing some more history about SAS, data mining, and machine learning, continuing with my backward and forward looks.

An intuitive approach to the appropriate use of forecasts

It is a mild summer evening in July at Lake Neusiedl here in Austria. The participants of the traditional YES Cup Regatta are sitting with beer and barbecue chops on the terrace of our clubhouse. The mood is relaxed, and everyone wants to tell their story after two eventful races.

A conversation at the end of our table draws my attention, because it is about forecasting, more specifically the usability and accuracy of weather and wind forecasts. As expected, the opinions differ substantially, from "mostly wrong" to "we should be thankful that we have them – in earlier times no forecast existed on that level of detail" to "I make my decisions based on the cloud pictures."

Knowing the wind conditions before a regatta is important, to enable good decisions such as: "What size of sail shall I use to start the race, so that I don’t have to change during the regatta?" or "What wind direction will prevail and which areas of the lake will therefore be favored?"

Marc, an old stager in race sailing, explains his use of wind forecasts as follows:

"I always consider several available forecasts: Windguru, Windfinder, Otto Lustyk, and ORF Burgenland. So I get a picture of the diversity or uniformity of the possible wind scenarios - because obviously the stations use different weather models. So I can judge whether weather and wind for the race weekend is easy or hard to predict and how much I can trust the forecasts in general. In addition, I also monitor how much the predictions for the weekend change during the week. If they stay stable all week, the weather seems to allow a clear prediction; if the predictions change daily, it seems that we get very unstable weather conditions. On the race day itself, watching the clouds and the sky is very important. Short-term and local facts cannot be included in these models and give me additional information based on my experience."

A smile can be seen on my face, and I intentionally do not participate in the conversation, because I do not want to be seen as the statistician who “always considers everything so mathematically." And even more importantly: there is nothing to add to Marc's statement. Without knowing it, he has summarized the most important principles of business forecasting and talked about the proper handling of statistical forecasts. And although his professional background is definitely not in dealing with data, forecasts, or things like "business intelligence," at the next Analytics Conference Marc can work together with me at the forecasting demo station, because what he has just explained corresponds to important features in SAS® Forecast Server.

  • Combine models for stable forecasts: "I always consider several available forecasts."
  • Segmentation of time series: "I can judge whether weather and wind for the race weekend is easy or hard to predict."
  • Confidence intervals for forecasts: "How much I can trust the forecasts in general."
  • Forecast stability analysis and rolling simulations: "I also monitor how much the predictions for the weekend change during the week."
  • Overrides and judgemental forecasts: "Short-term and local facts cannot be included in these models and give me additional information."


So enjoy the fact that there is software that does the very things that people consider as intuitively correct, smile with satisfaction, and head towards the beer tap for another beer. At least, that's what we did after Marc shared his intuition with us.


Combined forecasts: what to do when one model isn’t good enough

My esteemed colleague and recently-published author Jared Dean shared some thoughts on how ensemble models help make better predictions. For predictive modeling, Jared explains the value of two main forms of ensembles: bagging and boosting. It should not be surprising that the idea of combining predictions from more than one model can also be applied to other analytical domains, such as statistical forecasting.

Forecast combinations, also called ensemble forecasting, are the subject of many academic papers in statistical and forecasting journals; they are a known technique for improving forecast accuracy and reducing the variability of the resulting forecasts. In their article “The M3 Competition: Results, Conclusions, and Implications,” published in the International Journal of Forecasting, Spyros Makridakis and Michèle Hibon write about the results of a forecasting competition and share as one of their four conclusions: “The accuracy of the combination of various methods outperforms, on average, the specific methods being combined and does well in comparison with other methods.”

The lesson from this statement is that a combination of forecasts from simple models can add substantial value in terms of enhancing the quality of the forecasts produced, but the statement also concedes that combinations might not always perform better than a suitably-crafted model.

But how do you combine statistical forecasts? Similar to ensembles for predictive models, the basic idea is to combine the forecasts created by individual models, such as exponential smoothing models or ARIMA models. Let’s have a look at three combination techniques typically used:

  • Simple average
    • Every forecast is given the same weight. While this sounds simplistic, practitioners have found it very successful, in particular when the individual forecasts are very different from each other.
  • Ordinary least squares (OLS) weights
    • In this approach an OLS regression is used to combine the individual forecasts. The main idea is to assign higher weights to the more accurate forecast.
  • Restricted least squares weights
    • Extends the idea of OLS weights by forcing constraints on the individual weights. For example, it might make sense to force all weights to be non-negative.
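
As a rough illustration of the three techniques above, here is a small Python sketch that combines two forecasts. The numbers are made up, the OLS version assumes no intercept, and the restricted version assumes the weights must be non-negative and sum to one; none of this is meant to mirror how any particular software implements combinations.

```python
def simple_average(forecasts):
    """Equal-weight combination: average the forecasts period by period."""
    return [sum(values) / len(forecasts) for values in zip(*forecasts)]


def ols_weights(f1, f2, actual):
    """Least-squares weights (no intercept) from the 2x2 normal equations."""
    s11 = sum(a * a for a in f1)
    s22 = sum(b * b for b in f2)
    s12 = sum(a * b for a, b in zip(f1, f2))
    s1y = sum(a * y for a, y in zip(f1, actual))
    s2y = sum(b * y for b, y in zip(f2, actual))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


def restricted_weights(f1, f2, actual):
    """Least-squares weights forced to be non-negative and to sum to one."""
    num = sum((a - b) * (y - b) for a, b, y in zip(f1, f2, actual))
    den = sum((a - b) ** 2 for a, b in zip(f1, f2))
    w1 = min(1.0, max(0.0, num / den))  # clip the unconstrained optimum to [0, 1]
    return w1, 1.0 - w1


def combine(weights, f1, f2):
    """Apply a pair of weights to two forecasts."""
    return [weights[0] * a + weights[1] * b for a, b in zip(f1, f2)]


# Hypothetical history: actual values and two competing model forecasts.
actual = [100, 110, 120, 130]
f1 = [98, 112, 118, 133]
f2 = [104, 108, 125, 127]

averaged = simple_average([f1, f2])
ols = ols_weights(f1, f2, actual)
restricted = restricted_weights(f1, f2, actual)
```

In practice the weights would be estimated on a fit period and judged on a holdout period, since weights tuned to history can flatter the combination.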

It is worth mentioning that estimating prediction error variance needs to be considered separately. In all cases, the estimated prediction error variance of the combined forecast uses the estimates of prediction error variance from the forecasts that are combined.

Not every time series forecast benefits from combination. The power of this technique becomes apparent when you consider that modern software such as SAS® Forecast Server allows for combination methods to be applied to large-scale time series forecasting of hierarchically structured data. The software makes it possible to generate combinations for inclusion into its model selection process in an automated fashion. In all cases, combined forecasts must prove their worth by their performance in comparison to other forecasts in the model selection process. If you are interested in more details this paper provides an extended explanation.

How ensemble models help make better predictions

My oldest son is in the school band, and they are getting ready for their spring concert. Their fall concert was wonderful; hearing dozens of students with their specific instruments playing together creates beautiful, rich sounding music. The depth of sound from orchestral or symphonic music is unmatched. In data mining, and specifically in the area of predictive modeling, a similar effect can be created using ensembles of models that leads to results that are more “beautiful” than a single model. A predictive model ensemble combines the posterior predictions from more than one model. When you combine multiple models together you create model crowdsourcing. Each individual model is described by a set of rules, and when the rules are applied in concert you can consider the "opinions" of many models. How to use these opinionated models depends on the goal. The two main ways are to (1) let every model vote and decide democratically the target label or (2) label the target with the opinion of the most confident model (probabilistically speaking).
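
The two ways of using these model "opinions" can be sketched in a few lines of Python. The labels and probabilities below are hypothetical, and each model is assumed to report its predicted label together with the posterior probability it assigns to that label.

```python
from collections import Counter


def majority_vote(predictions):
    """Every model gets one vote; the most common label wins."""
    labels = [label for label, _ in predictions]
    return Counter(labels).most_common(1)[0][0]


def most_confident(predictions):
    """Take the label from the model with the highest posterior probability."""
    return max(predictions, key=lambda p: p[1])[0]


# Hypothetical output of three models for a single observation.
predictions = [("yes", 0.55), ("yes", 0.60), ("no", 0.95)]

by_vote = majority_vote(predictions)
by_confidence = most_confident(predictions)
```

Here the two rules disagree: two lukewarm models outvote one very confident model, which is exactly why the choice depends on the goal.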

Types of Ensembles

The two main forms of ensembles are boosting and bagging (more specifically called bootstrap aggregating). The most popular ensembles use decision trees. Random forest and gradient boosting machines are two examples that are very popular in the data mining community right now. While decision trees are the most popular, they are not the only option. Any modeling algorithm can be part of an ensemble, and heterogeneous ensembles can be quite powerful.


Bagging, as the name suggests, takes repeated unweighted samples with replacement of the data to build models and then combines them. Think of your observations like grains of wild rice in a bag. Your objective is to identify the black grains because they have a resale price 10x greater when sold separately.

  1. Take a scoop of rice from the bag.
  2. Use your scoop of rice to build a model based on the grain’s characteristics, excluding that of color.
  3. Write down your model classification logic and fit statistics.
  4. Pour the scoop of rice back into the bag.
  5. Shake the bag for good measure and repeat.

[Image: examples of mixed rice]

How big the scoop is relative to the bag, and how many scoops you take, will vary by industry and situation, but I usually use 25-30% of my data and take 7-10 samples. On average, each observation then appears in roughly two to three of the samples (the sampling fraction multiplied by the number of samples).
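
For readers who want to play with the scoop sizes, here is a minimal Python sketch of the sampling part of bagging only. It assumes 1,000 observations, a 25% scoop and 8 scoops (all illustrative choices), and it leaves out the model that would be fitted on each scoop.

```python
import random


def draw_scoops(n_obs, fraction, n_scoops, seed=0):
    """Draw n_scoops samples, each holding `fraction` of the observations.

    Each scoop is poured back before the next one is taken, so an
    observation can land in several scoops (or in none).
    """
    rng = random.Random(seed)
    scoop_size = int(n_obs * fraction)
    return [rng.sample(range(n_obs), scoop_size) for _ in range(n_scoops)]


def inclusion_counts(scoops, n_obs):
    """Count how many scoops each observation ended up in."""
    counts = [0] * n_obs
    for scoop in scoops:
        for i in scoop:
            counts[i] += 1
    return counts


n_obs = 1000
scoops = draw_scoops(n_obs, fraction=0.25, n_scoops=8)
counts = inclusion_counts(scoops, n_obs)
average_inclusions = sum(counts) / n_obs  # equals fraction * n_scoops
never_sampled = counts.count(0)           # observations no scoop picked up
```

The average number of scoops an observation lands in is simply the scoop fraction times the number of scoops, while `never_sampled` shows that some observations can still be missed entirely.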


Boosting is similar to bagging except that the observations in the samples are now weighted. To follow the rice problem from above, after step 3 I would take the grains of rice I had incorrectly classified (e.g. black grains I said were non-black or non-black grains I thought were black) and place them aside. I would then take a scoop of rice from the bag and leave some room to add the grains I had incorrectly classified. By including previously misclassified grains at a higher rate, the algorithm has more opportunities to identify the characteristics for correct classifications. This is the same idea behind giving more time to review flashcards of facts you didn’t know than those you did.

For what it's worth, I tend to use bagging models for prediction problems and boosting for classification problems. By taking multiple samples of the data and modelling over iterations you allow factors that are otherwise weak to be explored. This provides a more stable and generalizable solution. When model accuracy is the most important consideration, ensemble models will be your best bet.

This topic was recently discussed in much greater detail at SAS Global Forum. See this paper by Miguel Maldonado for more details.
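
The reweighting idea behind boosting, where misclassified grains are included at a higher rate in the next round, can be sketched in Python as follows. The fixed `boost` factor is a hypothetical simplification; real boosting algorithms such as AdaBoost derive the factor from the error rate of each round.

```python
def reweight(weights, misclassified, boost=2.0):
    """Raise the weight of misclassified observations, then renormalize.

    weights       -- current sampling weights, summing to 1
    misclassified -- True where the latest model got the observation wrong
    boost         -- hypothetical multiplier applied to the misclassified ones
    """
    raised = [w * boost if wrong else w for w, wrong in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised]


# Four grains with equal weights; the model misclassified the second one.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, True, False, False]
new_weights = reweight(weights, misclassified)
```

After one round the misclassified grain is twice as likely as any other to be drawn into the next scoop, which is the flashcard effect described above.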

Image credit: photo by Ludovico Sinz // attribution by creative commons

How Bayesian analysis might help find the missing Malaysian airplane

As of this writing, there is still little to no sign of the missing Malaysian flight MH370. The area of search, although already narrowed down from the size of the United States at one point to the size of Poland, is still vast and presents great challenges to all participating nations. Every lead reported in the news so far has turned out to be a dead end.

There are a great many uncertainties surrounding the disappearance of flight MH370, making the search and rescue operation seem like finding a needle in an ocean-sized haystack. There is, however, an established statistical framework based on Bayesian inference that has had great success in locating, among other things, a hydrogen bomb lost over the Mediterranean Sea [1], a sunken US Navy nuclear submarine (USS Scorpion) [1], and the wreckage of Air France Flight 447 just several years ago.

The U.S. Coast Guard’s SAROPS (Search and Rescue Optimal Planning System) is based on the same Bayesian search framework, refined to accommodate ocean drift and crosswinds. As there is currently no evidence that the Malaysian government or Malaysia Airlines is employing a Bayesian optimal search method, it is worthwhile to point out why a Bayesian search strategy should at least be considered for a situation such as the missing MH370 case.

Unknown variables

First of all, there are still many unknowns regarding the missing MH370. Unknown variables are typically modeled probabilistically in the statistical world. Most of us are familiar with the frequency definition of probability. If I handed you an old beat-up coin and asked you to tell me what the probability of heads is if the coin is flipped, your best bet is to flip the coin, say, 5000 times and record the number of times it came up heads. Then you would divide the number of heads by 5000 and get a pretty good estimate of the probability in question. This is the frequency interpretation of probability: the probability of an event is the relative frequency of the event happening in an infinite population of repeatable trials.
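
The coin experiment is easy to simulate. The short Python sketch below stands in for the beat-up coin with a random number generator; the true probability of 0.5 and the seed are assumptions made purely for the simulation.

```python
import random


def estimate_heads_probability(n_flips, true_p=0.5, seed=42):
    """Flip a simulated coin n_flips times and return the share of heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_flips) if rng.random() < true_p)
    return heads / n_flips


estimate = estimate_heads_probability(5000)
```

With 5000 flips the estimate typically lands within a percentage point or two of the true value, and it tightens further as the number of flips grows.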

In the real world, however, we are often faced with rare and unique events, events that are non-repeatable. Hopefully, we wouldn’t have to study 5000 plane crashes to get a good estimate of the probability of a plane accident. In reality, there have only been 80 recorded missing planes since 1948. This calls for a different interpretation of probability, a subjective one that reflects an expert’s degree of belief. The subjective nature of the uncertainties of a rare event such as the loss of flight MH370 places us squarely in the domain of Bayesian inference. In the case of Air France 447, the prior distribution (initial belief about the crash location) of the search area was taken to be a mixture of three probability distributions, each representing a different scenario. The mixture weights were then decided based on consultations with experts at the BEA.

All information is useful

A big advantage of employing a Bayesian search method is that a Bayesian framework provides a systematic way to incorporate all available information via Bayes’ rule. This is invaluable in a large and complex search operation where new information will constantly emerge and the situation could change at a moment’s notice, requiring the search strategy to be constantly updated. The important thing to note here is that any information is considered useful. One area turning up empty will lower the probability of the wreckage in that area after a Bayesian update, but at the same time, it will increase the probability of the wreckage in other areas yet unsearched.

Air France 447 went missing in June 2009. When the BEA commissioned scientific consulting firm Metron Scientific Solutions in 2011 to come up with a probability map of the search area, two years of search efforts had turned up nothing. In their model, the Metron team took into account all four unsuccessful previous searches when updating their prior distribution of the crash location. Based on their recommendation to resume the search around the region with the highest posterior probability, the wreck was located only one week into the new search [2].

While there are many intricate steps involved in deploying a Bayesian search strategy, particularly in coming up with the prior distribution and quantifying the likelihood of the different accident scenarios, the core math involved is surprisingly straightforward. For illustration purposes, assume that the search area is divided up into $N$ grids, labelled $x_1$ through $x_N$. Let the prior probability of the wreck being in grid $x_k$ be denoted by $p(x_k^+)$, for $k = 1, \dots, N$. Now let the probability of successful detection in grid $x_k$ given the wreck is in grid $x_k$ be denoted $p(S_k^+ \mid x_k^+)$. If the search in grid $x_k$ turns up empty, then the posterior probability of the wreck being in grid $x_k$ given the fact that the search in grid $x_k$ is unsuccessful is:

$$p(x_k^+ \mid S_k^-) = \frac{p(x_k^+)\,\bigl(1 - p(S_k^+ \mid x_k^+)\bigr)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)}$$

Meanwhile the posterior probability of the wreck being in any other grid $x_m$ (with $m \neq k$) is also updated by the information that the search in grid $x_k$ turned up unsuccessful:

$$p(x_m^+ \mid S_k^-) = \frac{p(x_m^+)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)}$$

Note that $p(x_k^+ \mid S_k^-) < p(x_k^+)$, and $p(x_m^+ \mid S_k^-) > p(x_m^+)$.

A Bayesian search strategy would start from the grids with the highest prior probability mass. If nothing is found in those grids, the posterior probability of every grid is updated via Bayes' theorem, and the process starts over, treating the new posterior probabilities as the current prior probabilities. This could be a lengthy process, but as long as the wreck lies within the prior region, it would eventually be located.
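To make the update concrete, here is a minimal Python sketch of the search loop described above. It assumes a search can miss the wreck but never produces a false find. The function name `update_after_failed_search`, the three-grid prior, and the 0.8 detection probability are all made up for illustration and are not taken from the Metron analysis.

```python
def update_after_failed_search(prior, k, p_detect):
    """Return posterior grid probabilities after an unsuccessful search of grid k.

    prior    : list of prior probabilities p(x_i+), summing to 1
    k        : index of the grid that was searched
    p_detect : p(Sk+|xk+), chance of finding the wreck if it is in grid k
    """
    # Probability that the search of grid k fails: either the wreck is
    # elsewhere, or it is in grid k and was missed.
    p_fail = 1.0 - prior[k] * p_detect
    posterior = [p / p_fail for p in prior]              # unsearched grids go up
    posterior[k] = prior[k] * (1.0 - p_detect) / p_fail  # searched grid goes down
    return posterior


# Hypothetical three-grid example; the numbers are illustrative only.
prior = [0.5, 0.3, 0.2]
belief = prior
for step in range(3):
    # Always search the grid that currently looks most likely.
    k = max(range(len(belief)), key=belief.__getitem__)
    belief = update_after_failed_search(belief, k, p_detect=0.8)
    print(step, k, [round(p, 3) for p in belief])
```

Each pass treats the previous posterior as the new prior, so a grid that has been searched without success can still become the most likely location again later if the other grids also turn up empty.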

An unprecedented search area

When AF447 crashed, BEA was able to quickly establish that the plane would have to lie within a circle of 40 nautical miles radius around the plane’s last known location. This is roughly 6600 square miles of initial search area, compared to MH370’s current Poland-sized search area of more than 100,000 square miles. Considering it took two years and five rounds of search efforts to finally locate AF447, the difficulty involved in finding MH370 is unprecedented in the history of modern aviation. While a Bayesian search method might not locate the remains of MH370 any time soon, its flexibility and systematic nature, not to mention its past successes, make it a powerful tool to seriously consider for the current search efforts.

For interested readers, here is the paper that documented the Metron team’s efforts in using Bayesian inference to develop the probability map of AF447’s location.


1. S. B. McGrayne. “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy”, Yale University Press, 2011.

2. L. D. Stone, C. M. Keller, T. M. Kratzke and J. P. Strumpfer. “Search for the Wreckage of Air France Flight AF 447”, submitted to Statistical Science, 2013.

March Madness and Predictive Modeling

Jared Dean and son at a 2013 NCAA Tournament game one

In my region of North Carolina (Raleigh, Durham, and Chapel Hill) one of the most anticipated times of the year has arrived: the NCAA basketball tournament. This is a great time of year for me, because I get to combine several of my passions.

For those who don’t live among crazed college basketball fans, the NCAA (National Collegiate Athletic Association) holds an annual tournament that seeds the regional conference winners and the best non-conference winning teams in a single elimination tournament of 68 teams to determine the national champion in collegiate basketball.  The teams are ranked and seeded so that the perceived best teams don’t face each other until the later rounds.

In the tournament history stretching back more than 75 years, only 14 universities have won more than one championship, and three schools local to SAS world headquarters are on that list (the University of North Carolina, Duke University, and North Carolina State University). That concentration, combined with the fact that this area is a well-known cluster for statistics, means that I am not alone amongst my neighbors in combining my passions.

The NCAA tournament carries with it a tradition of office betting pools, where coworkers, families, and friends predict the outcome of the 67 games to earn money, pride, or both. Unbeknownst to many of them, they are building predictive models, something near and dear to my heart. As a data miner, I analyze data and build predictive models about human behavior, machine failures, credit worthiness, and so on. But predictive modeling in the NCAA tournament can be as simple as choosing the winner by favorite color, fiercest mascot, or alphabetical order. Others rely on their observation of the teams throughout the regular season and conference championships to inform their decisions, and then they use their “gut” to pick a winner when they have little or no information about one or both of the teams.

I’m sure some readers have used these kinds of strategies and lost or maybe even won the “kitty” in these betting pools, but the best results will come from using historical information to identify patterns in the data. For example, did you know that since 2008 the 12th seed has won 50% of the time against the 5th seed? Or that the 12th seed has beaten the 5th seed more often than the 11th seed has beaten the 6th seed?

Upon analyzing tournament data, patterns like these emerge about the tournament, specific teams (e.g. NC State University struggles to make free throws in the clutch), or certain conferences. To make the best predictions, use this quantitative information in conjunction with your own domain expertise, in this case about basketball.

Predictive modeling methodology generally comes from two groups: statisticians and computer scientists (who may take a more machine learning approach). The field of data mining encompasses both groups with the same aim: to make correct predictions of a future event. Common data mining techniques include logistic regression, decision trees, generalized linear models, support vector machines (SVM), neural networks, and many more (all available in SAS).

While these techniques are applied to a broad range of problems, professors Jay Coleman and Mike DuMond have successfully used those from SAS to create their NCAA “dance card,” a prediction of the winners that has had a 98% success rate over the last three years.

If you think you have superior basketball knowledge and analytical skills, then hopefully you entered the ultimate payday competition from Warren Buffett. He will pay you $1 billion if you can produce a perfect bracket. Before you go out and start ordering extravagant gifts, it is worth considering that the odds of winning at random are 1 in 148 quintillion (148,000,000,000,000,000,000), but with some skill your odds could improve to 1 in 1 billion. In this new world of crowdsourcing, about 8,000 people have united to try and win the billion-dollar prize. I don’t know how many picked Dayton to beat Ohio State last night, but that one game appears to have eliminated about 80% of participants.
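The random-guess figure can be checked with a few lines of Python. This sketch assumes all 67 games are picked independently with a coin flip, which is where the quoted number comes from; the variable names are just for illustration.

```python
# Each of the 67 games has two possible outcomes, so a bracket filled in
# by coin flip is one of 2**67 equally likely brackets.
games = 67
possible_brackets = 2 ** games

print(f"{possible_brackets:,} possible brackets")
print(f"about {possible_brackets / 1e18:.0f} quintillion")
```

The count rounds to the 148 quintillion quoted above. The 1-in-1-billion figure for a skilled picker cannot be derived this way, since it depends on how often the picker is right in each game.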

If you’re looking for even more opportunities to combine basketball and predictive analytics, then check out this Kaggle contest with a smaller payday but better odds.

Statistician George Box is famous for saying, “essentially, all models are wrong but some are useful”.  I wish you luck in your office pool, and if you beat the odds remember the bloggers in your life :)

Skills needed for competitive advantage in analytics (hint, it's not just the math)

I'm a big believer in both/and thinking, so I'll stand squarely in the middle and say that the most important skills for competitive advantage in analytics include a combination of top-notch modeling abilities along with business acumen, critical thinking, and curiosity. I was intrigued by a blog post on this topic in which editor Beth Schulz began with the provocative title "Quants Not so Necessary for Predictive Modeling?" She takes her cue from a recent TDWI best practices report on "Predictive Analytics for Competitive Advantage," which, among other things, shared the results of a survey on the skills necessary to perform predictive analytics. More than two thirds of respondents agreed that knowledge of the business, critical thinking, and understanding of the data are essential. Agreement then begins to diminish, with 41% citing training in predictive analytics but only 34% agreeing on a degree in this field.

Greta Roberts of Talent Analytics has her own angle on this question, based on a study they did of analytics professionals themselves to profile their traits. Beyond the obvious findings, their research points to curiosity and creativity, as well as discipline, as top attributes to look for when hiring this kind of talent. But when I am in conversations with our customers I hear varying opinions on this question. Some say these skills don't all exist in one person, which is why they have teams. Others believe that the key is training the quants in the "soft skills."

A solid foundation in the fundamentals of a quantitative discipline provides an excellent start for a career in analytics, and those who come into analytics from other paths will find themselves going back to learn things they hadn't studied in school. It can be a tough slog to try to catch up on linear algebra at night. But the "math" alone won't solve a business problem. It is essential to understand the business context around a problem to formulate its solution. This proposed solution must then typically be explained to a group of stakeholders that includes at least some people without deep analytical training. And implementing most solutions involves collaborating across different business units with diverse backgrounds and training. So a much wider set of skills is necessary to go from problem to solution.

To highlight this challenge, SAS teamed up with the Analytics Section of INFORMS for the Student Analytical Scholar Competition, which requires applicants to read a case study and submit a Statement of Work explaining how they would address the business problem in the case study. Naturally this requires analytical skills, but those skills alone don't lead to the best applications. In this interview, 2013 winner Alex Akulov talks about his perspective after finishing his bachelor's degree in math (with a minor in optimization). He assumed that when presented with a problem he would formulate it and then be done. "It's an optimal solution, so you present it to the manager and of course, he's going to say, 'Yeah, let's do it because it's optimal.'" But how often does that ever happen?

Communication skills are essential as well, which is why this competition offers students a chance to ask questions as if they were consultants on the job interacting with the "customers." In this discussion forum (open until February 14 at 5:00 pm EST), they can query the individuals involved in the case study, who will respond as they see fit. That will give applicants more information to incorporate into their submission, which is due by midnight on February 17. The winner will have their expenses paid to attend the INFORMS Conference on Business Analytics and Operations Research in Boston March 30-April 1, where they will have a fantastic opportunity to attend sessions given by analytics practitioners and to network with them. Alex cited that experience as a great benefit of winning.

LinkedIn discussions are full of students asking what it takes to succeed in analytics. How would you advise them - what do you look for when hiring analytics teams?

Using panel data to measure the economic impact of the Super Bowl

MetLife Stadium in East Rutherford, New Jersey is the site of Super Bowl 48

This year the Super Bowl will take place in East Rutherford, New Jersey at MetLife Stadium just outside New York City. For the first time in this event’s 48-year history, the game will take place outdoors in a cold-weather environment, potentially subjecting players and fans to sub-freezing temperatures. The fans in their excitement will find ways to not freeze, but will the local economy be “warmed” by the games? While the game itself will likely sell out, the more lasting question is what effect, if any, the game has on the local economy.

Economists at University of Maryland-Baltimore County (Coates and Humphreys, 2002) have used historical information to estimate the marginal effect of the game on the local economy. They compile annual per capita income (a measure of local prosperity) for cities that hosted a playoff game from 1969 to 1997. They also record a number of potential confounding factors such as population growth, other economic drivers, and sports-related events. These other factors are necessary controls in order to separate the effect of the games from other spurious effects.

The model estimated in their paper is a two-way fixed effects specification of the general form

y_it = x_it'β + μ_i + λ_t + ε_it

where y_it is per capita income in area i in year t, x_it collects the postseason variables and the controls, μ_i is an area-specific fixed effect, λ_t is a time-specific fixed effect common to all areas, and ε_it is the idiosyncratic error term. Economists are fans of specifying these area-specific effects as fixed effects (FE) rather than random effects (RE), which tend to be more popular among statisticians. While these differences seem subtle, the implications are large. The choice depends on an assumption regarding the correlation between the unobserved effect and the explanatory variables: random effects requires that the two be uncorrelated, while fixed effects does not. In a purely randomized experiment the individual-specific effect is, by construction, uncorrelated with the variable of interest. When using observational data, however, we cannot credibly make this assumption.*

Why? Well, can we randomly select cities to host playoff games and observe everything affecting the income of a city? No, of course not. We only get the data the world gives us and do the best we can. In this case, we observe cities that host playoff games, cities that do not, and what those cities look like before and after. It is from these observational data that we make inference. Economists economize everywhere. Even with statistics. So what do our fearless economists find?

Using the FE estimator, they find an economically and statistically insignificant impact of a postseason game on the per capita income of an area, holding all else constant (this model can be estimated with the PANEL procedure in SAS/ETS®). The only effect of the Super Bowl on income appears to be on the average income of the city whose team wins the game: the winning team’s city gets a positive income bump. The authors chalk this up to an unidentified productivity enhancement, which in lay terms means the Denver or Seattle economy might get a boost in the near future.
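For readers who want to see what a fixed effects estimator does mechanically, here is a minimal Python sketch of the two-way "within" transformation on a balanced panel. It is not the authors' code and not the PANEL procedure. The data are synthetic, and the function name `two_way_fe_slope`, the single regressor, and the true slope of 2.0 are assumptions made purely for illustration.

```python
import numpy as np


def two_way_fe_slope(y, x):
    """Slope from a two-way fixed effects (within) regression.

    y, x : arrays of shape (n_areas, n_years) from a balanced panel.
    Subtracting area means and year means (and adding back the grand mean)
    sweeps out any area-specific and year-specific effects.
    """
    def demean(a):
        return (a - a.mean(axis=1, keepdims=True)
                  - a.mean(axis=0, keepdims=True) + a.mean())

    y_w, x_w = demean(y), demean(x)
    return float((x_w * y_w).sum() / (x_w ** 2).sum())


# Synthetic panel: 30 areas observed over 29 years (illustrative sizes).
rng = np.random.default_rng(0)
n_areas, n_years = 30, 29
area_effect = rng.normal(size=(n_areas, 1))
year_effect = rng.normal(size=(1, n_years))

# The regressor is deliberately correlated with the area effect, which is
# the situation where a random effects assumption would be violated.
x = rng.normal(size=(n_areas, n_years)) + area_effect
noise = 0.1 * rng.normal(size=(n_areas, n_years))
y = 2.0 * x + area_effect + year_effect + noise

beta_hat = two_way_fe_slope(y, x)
print(f"estimated slope: {beta_hat:.3f}")
```

Because the area and year effects are removed before the slope is computed, the estimate stays close to the true value even though the regressor is correlated with the area effect.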

From an economic perspective, the result of “no impact” for New York City (or any area) is not surprising. Economic activity requires available resources. Cities hosting games are likely to have only so many hotels and restaurants with which to host visitors. These resources are likely to be busy most of the year anyway. This is especially true of the New York City Metro area. Visitors for the "big game" likely dissuade other visitors who might otherwise visit the city. In essence, we have a “crowding out.”

One of New York City’s most beloved sports figures, Yogi Berra, sums up crowding out well when he says, “Nobody goes there anymore, it’s too crowded.”

So, while commercials, politicians and the media will likely portray Super Bowl 48 as a boon for the New York City economy, we can thank some clever econometrics for setting us straight.

*In fact, one can even test the validity of the fixed effect vs. random effect assumption by using a Hausman Test, which tests for systematic differences in the estimates. The authors test this assumption and soundly reject the use of random effects. This test is automatically calculated when the random effects estimator is employed in the PANEL procedure.


Coates, D. and B. R. Humphreys. 2002. The Economic Impact of Postseason Play in Professional Sports. Journal of Sports Economics, 3(3): 291-299.

SAS Institute, The PANEL Procedure, SAS/ETS(R) 13.1

Image credit: photo by Anthony Quintano // attribution by creative commons