How Bayesian analysis might help find the missing Malaysian airplane

At the time this blog entry was written, there still appear to be few if any signs of the missing Malaysian flight MH370. The search area, although already narrowed down from the size of the United States at one point to the size of Poland, is still vast and presents great challenges to all participating nations. Every lead we’ve seen in the news so far has turned out to be a dead end.

There are a great many uncertainties surrounding the disappearance of flight MH370, making the search and rescue operation seem like finding a needle in an ocean-sized haystack. There is, however, an established statistical framework based on Bayesian inference that has had great success in locating, among other things, a hydrogen bomb lost over the Mediterranean Sea [1], a sunken U.S. Navy nuclear submarine (USS Scorpion) [1], and the wreckage of Air France Flight 447 just a few years ago.

The U.S. Coast Guard’s SAROPS (Search and Rescue Optimal Planning System) is based on the same Bayesian search framework, refined to accommodate ocean drift and crosswinds. As there is currently no evidence that the Malaysian government or Malaysia Airlines is employing a Bayesian optimal search method, it is worthwhile to point out why a Bayesian search strategy should at least be considered for a situation such as the missing MH370 case.

Unknown variables

First of all, there are still many unknowns regarding the missing MH370. Unknown variables are typically modeled probabilistically in the statistical world. Most of us are familiar with the frequency definition of probability. If I handed you an old beat-up coin and asked you to tell me the probability of heads when the coin is flipped, your best bet would be to flip the coin, say, 5000 times and record the number of times it came up heads. Then you would divide the number of heads by 5000 and get a pretty good estimate of the probability in question. This is the frequency interpretation of probability: the probability of an event is the relative frequency of the event happening in an infinite population of repeatable trials.
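As a minimal sketch of the coin experiment described above, the following Python snippet simulates 5000 flips and computes the relative frequency of heads. The coin's true bias (`TRUE_P = 0.55`) and the helper names are assumptions made up for illustration; a real coin would simply be flipped by hand.

```python
import random

def estimate_heads_probability(flip, n_flips=5000):
    """Relative frequency of heads over n_flips calls to flip()."""
    heads = sum(1 for _ in range(n_flips) if flip())
    return heads / n_flips

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_P = 0.55             # hypothetical bias of the beat-up coin

def worn_coin():
    """Simulated flip: True means heads."""
    return rng.random() < TRUE_P

estimate = estimate_heads_probability(worn_coin)
print(f"estimated P(heads) = {estimate:.3f}")
```

With 5000 flips the estimate typically lands within a percentage point or two of the true value, which is exactly what makes the frequency approach attractive when trials can be repeated.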

In the real world, however, we are often faced with rare and unique events that are non-repeatable. Hopefully, we won’t have to study 5000 plane crashes to get a good estimate of the probability of a plane accident. In reality, there have only been 80 recorded missing planes since 1948. This calls for a different interpretation of probability, a subjective one that reflects an expert’s degree of belief. The subjective nature of the uncertainties of a rare event such as the loss of flight MH370 places us squarely in the domain of Bayesian inference. In the case of Air France 447, the prior distribution (the initial belief about the crash location) over the search area was taken to be a mixture of three probability distributions, each representing a different scenario. The mixture weights were then decided based on consultations with experts at the BEA.

All information is useful

A big advantage of employing a Bayesian search method is that a Bayesian framework provides a systematic way to incorporate all available information via Bayes’ rule. This is invaluable in a large and complex search operation where new information will constantly emerge and the situation could change at a moment’s notice, requiring the search strategy to be constantly updated. The important thing to note here is that any information is considered useful. One area turning up empty will lower the probability of the wreckage in that area after a Bayesian update, but at the same time, it will increase the probability of the wreckage in other areas yet unsearched.

Air France 447 went missing in June 2009. By the time the BEA commissioned the scientific consulting firm Metron Scientific Solutions in 2011 to come up with a probability map of the search area, two years of search efforts had turned up nothing. In their model, the Metron team took into account all four unsuccessful previous searches when updating their prior distribution of the crash location. Based on their recommendation to resume the search around the region with the highest posterior probability, the wreck was located only one week into the new search [2].

While there are many intricate steps involved in deploying a Bayesian search strategy, particularly in coming up with the prior distribution and quantifying the likelihood of the different accident scenarios, the core math involved is surprisingly straightforward. For illustration purposes, assume that the search area is divided up into N grids, labelled $x_1$ through $x_N$. Let the prior probability of the wreck being in grid $x_k$ be denoted by $p(x_k^+)$, for $k = 1, \dots, N$. Now let the probability of successful detection in grid $x_k$, given that the wreck is in grid $x_k$, be denoted $p(S_k^+ \mid x_k^+)$. If the search in grid $x_k$ turns up empty, then by Bayes’ rule the posterior probability of the wreck being in grid $x_k$, given that the search in grid $x_k$ was unsuccessful, is:

$$p(x_k^+ \mid S_k^-) = \frac{p(x_k^+)\,\bigl(1 - p(S_k^+ \mid x_k^+)\bigr)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)}$$

Meanwhile, the posterior probability of the wreck being in any other grid $x_m$ is also updated by the information that the search in grid $x_k$ was unsuccessful:

$$p(x_m^+ \mid S_k^-) = \frac{p(x_m^+)}{1 - p(x_k^+)\,p(S_k^+ \mid x_k^+)}$$

Note that $p(x_k^+ \mid S_k^-) < p(x_k^+)$ and $p(x_m^+ \mid S_k^-) > p(x_m^+)$.

A Bayesian search strategy would start from the grids with the highest prior probability mass. If nothing is found in those grids, update the posterior probability of all grids via Bayes’ theorem and start over, treating the new posterior probabilities as the current prior probabilities. This could be a lengthy process, but as long as the wreck is located within the prior region, it would eventually be located.
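The update-and-repeat loop can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions: a three-grid prior, a single detection probability shared by all grids, and function names (`update_after_failed_search`, `plan_searches`) invented here. Operational systems such as SAROPS also model drift, search effort, and much more.

```python
def update_after_failed_search(prior, k, p_detect):
    """Posterior grid probabilities after an unsuccessful search of grid k.

    prior    -- list of probabilities that the wreck is in each grid (sums to 1)
    p_detect -- probability of detecting the wreck in grid k if it is there
    """
    p_miss = 1.0 - prior[k] * p_detect          # P(search of grid k finds nothing)
    posterior = [p / p_miss for p in prior]     # every other grid is scaled up
    posterior[k] = prior[k] * (1.0 - p_detect) / p_miss  # searched grid goes down
    return posterior

def plan_searches(prior, p_detect, n_steps):
    """Grids a greedy searcher would visit if every search came up empty."""
    belief, visited = list(prior), []
    for _ in range(n_steps):
        k = max(range(len(belief)), key=belief.__getitem__)  # most probable grid
        visited.append(k)
        belief = update_after_failed_search(belief, k, p_detect)
    return visited

prior = [0.5, 0.3, 0.2]   # hypothetical prior over three grids
posterior = update_after_failed_search(prior, k=0, p_detect=0.8)
print(posterior)          # grid 0 drops, grids 1 and 2 rise
print(plan_searches(prior, p_detect=0.8, n_steps=6))
```

Because detection is imperfect, a grid that has already been searched is never ruled out entirely; the greedy plan returns to it once the other grids have also been searched without success.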

An unprecedented search area

When AF447 crashed, the BEA was able to quickly establish that the plane had to lie within a circle of 40 nautical mile radius around its last known location. This is roughly 6,600 square miles of initial search area, compared to MH370’s current Poland-sized search area of more than 100,000 square miles. Considering that it took two years and five rounds of search efforts to finally locate AF447, the difficulty involved in finding MH370 is unprecedented in the history of modern aviation. While a Bayesian search method might not locate the remains of MH370 any time soon, its flexibility and systematic nature, not to mention its past successes, make it a powerful tool to seriously consider for the current search efforts.
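For readers who want to check the area comparison, here is the arithmetic as a short Python sketch. The only inputs are the 40 nautical mile radius and the 100,000 square mile figure quoted above, plus the standard conversion of about 1.15078 statute miles per nautical mile; the variable names are just for illustration.

```python
import math

NMI_TO_MILES = 1.15078          # statute miles per nautical mile
radius_nmi = 40                 # AF447 search radius quoted above

area_sq_nmi = math.pi * radius_nmi ** 2
area_sq_miles = area_sq_nmi * NMI_TO_MILES ** 2

mh370_sq_miles = 100_000        # lower bound quoted above for MH370
ratio = mh370_sq_miles / area_sq_miles

print(f"AF447 circle: {area_sq_miles:,.0f} square miles")
print(f"MH370 area is at least {ratio:.0f} times larger")
```

So even the narrowed-down MH370 search area is roughly fifteen times the size of the area in which AF447 took two years to find.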

For interested readers, the paper by Stone et al. [2] documents the Metron team’s efforts in using Bayesian inference to develop the probability map of AF447’s location.


1. S. B. McGrayne. “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy”, Yale University Press, 2011.

2. L. D. Stone, C. M. Keller, T. M. Kratzke and J. P. Strumpfer. “Search for the Wreckage of Air France Flight AF 447”, submitted to Statistical Science, 2013.

March Madness and Predictive Modeling

Jared Dean and son at a 2013 NCAA Tournament game one

In my region of North Carolina (Raleigh, Durham, and Chapel Hill) one of the most anticipated times of the year has arrived: the NCAA basketball tournament. This is a great time of year for me, because I get to combine several of my passions.

For those who don’t live among crazed college basketball fans, the NCAA (National Collegiate Athletic Association) holds an annual tournament that seeds the regional conference winners and the best non-conference winning teams in a single elimination tournament of 68 teams to determine the national champion in collegiate basketball.  The teams are ranked and seeded so that the perceived best teams don’t face each other until the later rounds.

In the tournament history stretching back more than 75 years, only 14 universities have won more than one championship, and three schools local to SAS world headquarters are on that list (the University of North Carolina, Duke University, and North Carolina State University). That concentration, combined with the fact that this area is a well-known cluster for statistics, means that I am not alone amongst my neighbors in combining my passions.

The NCAA tournament carries with it a tradition of office betting pools, where coworkers, families, and friends predict the outcome of the 67 games to earn money, pride, or both. Unbeknownst to many of them, they are building predictive models, something near and dear to my heart. As a data miner, I analyze data and build predictive models about human behavior, machine failures, credit worthiness, and so on. But predictive modeling in the NCAA tournament can be as simple as choosing the winner by favorite color, fiercest mascot, or alphabetical order. Others rely on their observation of the teams throughout the regular season and conference championships to inform their decisions, and then they use their “gut” to pick a winner when they have little or no information about one or both of the teams.

I’m sure some readers have used these kinds of strategies and lost or maybe even won the “kitty” in these betting pools, but the best results will come from using historical information to identify patterns in the data. For example, did you know that since 2008 the 12th seed has won 50% of the time against the 5th seed? Or that the 12th seed has beaten the 5th seed more often than the 11th seed has beaten the 6th seed?

Upon analyzing tournament data, patterns like these emerge about the tournament, specific teams (e.g. NC State University struggles to make free throws in the clutch), or certain conferences. To make the best predictions, use this quantitative information in conjunction with your own domain expertise, in this case about basketball.

Predictive modeling methodology generally comes from two groups: statisticians and computer scientists (who may take a more machine learning approach). The field of data mining encompasses both groups with the same aim - to make correct predictions of a future event. Common data mining techniques include logistic regression, decision trees, generalized linear models, support vector machines (SVM), neural networks, and many many more (all available in SAS).

While these techniques are applied to a broad range of problems, professors Jay Coleman and Mike DuMond have successfully used those from SAS to create their NCAA “dance card,” a prediction of the winners that has had a 98% success rate over the last three years.

If you think you have superior basketball knowledge and analytical skills, then hopefully you entered the ultimate payday competition from Warren Buffett. He will pay you $1 billion if you can produce a perfect bracket. Before you go out and start ordering extravagant gifts, it is worth considering that the odds of winning at random are 1 in 148 quintillion (148,000,000,000,000,000,000), but with some skill your odds could improve to 1 in 1 billion. In this new world of crowdsourcing, about 8,000 people have united to try to win the billion-dollar prize. I don’t know how many picked Dayton to beat Ohio State last night, but that one game appears to have eliminated about 80% of participants.
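The "at random" figure is easy to reproduce. The sketch below assumes that every one of the 67 tournament games is an independent coin flip, which is the simplest possible model and the one that yields the number quoted above; the variable names are illustrative only.

```python
GAMES = 67                      # games in the 68-team single-elimination field

random_brackets = 2 ** GAMES    # two possible winners per game
print(f"{random_brackets:,} equally likely brackets")
print(f"about {random_brackets / 1e18:.0f} quintillion")
```

Any real knowledge about the teams (for instance, that top seeds rarely lose in the first round) makes many of those brackets very unlikely, which is where the much better "with some skill" odds come from.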

If you’re looking for even more opportunities to combine basketball and predictive analytics, then check out this Kaggle contest with a smaller payday but better odds.

Statistician George Box is famous for saying, “essentially, all models are wrong but some are useful”.  I wish you luck in your office pool, and if you beat the odds remember the bloggers in your life :)

Skills needed for competitive advantage in analytics (hint, it's not just the math)

I'm a big believer in both/and thinking, so I'll stand squarely in the middle and say that the most important skills for competitive advantage in analytics include a combination of top-notch modeling abilities along with business acumen, critical thinking, and curiosity. I was intrigued by a blog post on this topic in which editor Beth Schulz began with the provocative title "Quants Not so Necessary for Predictive Modeling?" She takes her cue from a recent TDWI best practices report on "Predictive Analytics for Competitive Advantage," which, among other things, shared the results of a survey on the skills necessary to perform predictive analytics. More than two-thirds of respondents agreed that knowledge of the business, critical thinking, and understanding of the data are essential. Agreement then begins to diminish, with 41% citing training in predictive analytics but only 34% agreeing on a degree in this field.

Greta Roberts of Talent Analytics has her own angle on this question, based on a study they did of analytics professionals themselves to profile their traits. Beyond the obvious findings their research points to curiosity and creativity, as well as discipline, as top attributes to look for when hiring this kind of talent. But when I am in conversations with our customers I hear varying opinions on this question. Some say these skills don't all exist in one person, which is why they have teams. Others believe that the key is training the quants in the "soft skills."

A solid foundation in the fundamentals of a quantitative discipline provides an excellent start for a career in analytics, and those who come into analytics from other paths will find themselves going back to learn things they hadn't studied in school. It can be a tough slog to try to catch up on linear algebra at night. But the "math" alone won't solve a business problem. It is essential to understand the business context around a problem to formulate its solution. This proposed solution must then typically be explained to a group of stakeholders that includes at least some people without deep analytical training. And implementing most solutions involves collaborating across different business units with diverse backgrounds and training. So a much wider set of skills is necessary to go from problem to solution.

To highlight this challenge SAS teamed up with the Analytics Section of INFORMS for the Student Analytical Scholar Competition, which requires applicants to read a case study and submit a Statement of Work explaining how they would address the business problem in the case study. Naturally this requires analytical skills, but those skills alone don't lead to the best applications. In an interview, 2013 winner Alex Akulov talks about his perspective after finishing his bachelor's degree in math (with a minor in optimization). He assumed that when presented with a problem he would formulate it and then be done. "It's an optimal solution, so you present it to the manager and of course, he's going to say, 'Yeah, let's do it because it's optimal.'" But how often does that ever happen?

Communication skills are essential as well, which is why this competition offers students a chance to ask questions as if they were consultants on the job interacting with the "customers." In this discussion forum (open until February 14 at 5:00 pm EST), they can query the individuals involved in the case study, who will respond as they see fit. That will give applicants more information to incorporate into their submission, which is due by midnight on February 17. The winner will have their expenses paid to attend the INFORMS Conference on Business Analytics and Operations Research in Boston March 30-April 1, where they will have a fantastic opportunity to attend sessions given by analytics practitioners and to network with them. Alex cited that experience as a great benefit of winning.

LinkedIn discussions are full of students asking what it takes to succeed in analytics. How would you advise them - what do you look for when hiring analytics teams?

Using panel data to measure the economic impact of the Super Bowl

MetLife Stadium in East Rutherford, New Jersey is the site of Super Bowl 48

This year the Super Bowl will take place in East Rutherford, New Jersey at MetLife Stadium just outside New York City. For the first time in this event’s 48-year history, the game will take place outdoors in a cold-weather environment, potentially subjecting players and fans to sub-freezing temperatures. The fans in their excitement will find ways to not freeze, but will the local economy be “warmed” by the game? While the game itself will likely sell out, the more lasting question is what effect, if any, the game has on the local economy.

Several economists at University of Maryland-Baltimore County have used historical information to estimate the marginal effect of the game on the local economy. They compile annual per capita income (a measure of local prosperity) from 1969 to 1997 for cities that host a playoff game. They also record a number of potential confounding factors such as population growth, other economic drivers, and sports-related events. These other factors are necessary controls in order to separate the effect of the games from other spurious effects.

The model estimated in their paper is a two-way fixed-effects panel regression of the general form

$$y_{it} = x_{it}'\beta + \mu_i + \lambda_t + \varepsilon_{it}$$

where $y_{it}$ is per capita income in area $i$ in year $t$, $x_{it}$ collects the postseason variables and the controls, $\mu_i$ is an area-specific fixed effect, and $\lambda_t$ is a time-specific fixed effect common to all areas. (The symbols are generic panel-data notation and not necessarily the authors' own.) Economists are fans of specifying these area-specific effects as fixed effects (FE) rather than random effects (RE), which tend to be more popular among statisticians. While these differences seem subtle, the implications are large. The choice depends on an assumption regarding the correlation between the unobserved effect and the explanatory variables. In a purely randomized experiment the individual-specific effect is, by construction, uncorrelated with the variable of interest. When using observational data, however, we cannot credibly make this assumption. Why? Well, can we randomly select cities to host playoff games and observe everything affecting the income of a city? No, of course not. We only get the data the world gives us and do the best we can.

In this case, we observe cities that host playoff games, cities that do not, and what those cities look like before and after. It is from these observational data that we make inferences. Economists economize everywhere. Even with statistics. So what do our fearless economists find?

Using the FE estimator they find an economically and statistically insignificant impact of a postseason game on the per capita income of an area, holding all else constant (this model can be estimated with the PANEL procedure in SAS/ETS®). The only effect of the Super Bowl on income appears to be on the average income of the city whose team wins the game: the winning team’s city gets a positive income bump. The authors chalk this up to an unidentified productivity enhancement, which in lay terms means the Denver or Seattle economy might have a positive economic boost in the near future.
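To see why the fixed-effects choice matters, here is a small Python sketch on made-up data. It is not the authors' model or data and it does not use the PANEL procedure: it simulates a balanced panel in which an unobserved area effect is correlated with a single regressor, then compares a pooled regression slope with the slope obtained after removing area and year means (the "within" transformation that underlies the FE estimator). All names and parameter values are assumptions chosen for illustration.

```python
import random

def double_demean(panel):
    """Subtract area means and year means, adding back the grand mean."""
    n, t = len(panel), len(panel[0])
    area_mean = [sum(row) / t for row in panel]
    year_mean = [sum(row[j] for row in panel) / n for j in range(t)]
    grand = sum(area_mean) / n
    return [[panel[i][j] - area_mean[i] - year_mean[j] + grand
             for j in range(t)] for i in range(n)]

def ols_slope(x, y):
    """Slope of a simple regression (with intercept) of y on x."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

rng = random.Random(0)
N_AREAS, N_YEARS, TRUE_BETA = 30, 20, 2.0   # hypothetical panel dimensions

area_effect = [rng.gauss(0, 3) for _ in range(N_AREAS)]
year_effect = [rng.gauss(0, 1) for _ in range(N_YEARS)]
# The regressor is deliberately correlated with the unobserved area effect.
x = [[0.8 * area_effect[i] + rng.gauss(0, 1) for _ in range(N_YEARS)]
     for i in range(N_AREAS)]
y = [[TRUE_BETA * x[i][j] + area_effect[i] + year_effect[j] + rng.gauss(0, 0.5)
      for j in range(N_YEARS)] for i in range(N_AREAS)]

beta_pooled = ols_slope(x, y)                             # ignores area effects
beta_fe = ols_slope(double_demean(x), double_demean(y))   # within estimator
print(f"pooled: {beta_pooled:.2f}  fixed effects: {beta_fe:.2f}  true: {TRUE_BETA}")
```

In this simulation the pooled slope is pulled well away from the true value because it attributes the area effect to the regressor, while the within estimate stays close to it. That is the practical reason to prefer fixed effects when the unobserved effect may be correlated with the variable of interest.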

From an economic perspective, the result of “no impact” for New York City (or any area) is not surprising. Economic activity requires available resources. Cities hosting games are likely to have only so many hotels and restaurants with which to host visitors. These resources are likely to be busy most of the year anyway. This is especially true of the New York City Metro area. Visitors for the "big game" likely dissuade other visitors who might otherwise visit the city. In essence, we have a “crowding out.”

One of New York City’s most beloved sports figures, Yogi Berra, sums up crowding out well when he says, “Nobody goes there anymore, it’s too crowded.”

So, while commercials, politicians and the media will likely portray Super Bowl 48 as a boon for the New York City economy, we can thank some clever econometrics for setting us straight.

*In fact, one can even test the validity of the fixed effect vs. random effect assumption by using a Hausman Test, which tests for systematic differences in the estimates. The authors test this assumption and soundly reject the use of random effects. This test is automatically calculated when the random effects estimator is employed in the PANEL procedure.


Coates, D. and B. R. Humphreys. 2002. The Economic Impact of Postseason Play in Professional Sports. Journal of Sports Economics, 3(3): 291-299.

SAS Institute. The PANEL Procedure, SAS/ETS(R) 13.1.

Image credit: photo by Anthony Quintano // attribution by creative commons

"The hungry statistician" – or why we never can get enough data

As the “Year of Statistics” comes to a close, I write this blog in support of the many statisticians who carefully fulfil their analysis tasks day by day, and to defend what may appear to be demanding behavior when it comes to data requirements.

How do statisticians get this reputation?

Are we really that complicated, with data requirements that are hard to fulfil? Or are we just pushed by the business question itself and the subsequent demands of the appropriate analytical methods?

Let's accept the presumption of innocence and postulate that none of us ask IT departments to excavate data from historical time periods or add just any old variable to the data mart just for fun.

Very often a statistician's work is assessed on the quality of the analytical results. The more convincing, beneficial, and significant the results, the clearer it is to attribute the success to the statistician. Also, it is well-known that good results come from high quality and meaningful data. Thus it is understandable that statisticians emphasize the importance of the data warehouse.

And this insistence is not driven by selfishness “to appear in a good light,” but because we take seriously our work of making well-informed decisions based on the analysis.

Let’s consider three frequent data requirements in more detail.

Who is interested in old news? – can we learn from history?

In order to make projections about the future, historic patterns need to be discovered, analysed, and extrapolated. To do so, historic data is needed. For many operational IT systems, historic versions of the data are irrelevant; their focus is on having the current version of the data to keep the operational process up and running.

Consider the example of a tariff (fee) change with your mobile phone provider. The operational billing system primarily requires the currently contracted tariff, in order to bill each phone call correctly. To analyze customer behaviour we must also know the prior tariff, to find out which pattern of tariff change frequently leads to a certain event, like a product upgrade or a cancellation.

For many analyses we need to differentiate between historic data and the historic snapshot of data.

To forecast the number of rented cars for a car rental agency for the next four weeks, we may need not only the daily number of rented cars but often the bookings that have been already received. For example, for November 18, 2013 the statistical model will use the following data:

  • Number of rented cars on November 18
  • Number of bookings for the rental day, November 18, that are known as of November 17 (the day before)
  • Number of bookings for the rental day, November 18, that are known as of November 16 (two days before)

As the historic booking status for a selected rental day is continuously overwritten by the operational system, the required data can only be provided if they are historicized in a data warehouse.
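A minimal Python sketch of what "historicizing" means here: instead of keeping only a running count per rental day, each booking is stored together with the date on which it was received, so the booking status as of any earlier day can be reconstructed. The booking log and the function name are hypothetical.

```python
from datetime import date

# Hypothetical booking log: (rental_day, booked_on)
bookings = [
    (date(2013, 11, 18), date(2013, 11, 10)),
    (date(2013, 11, 18), date(2013, 11, 16)),
    (date(2013, 11, 18), date(2013, 11, 17)),
    (date(2013, 11, 18), date(2013, 11, 18)),
    (date(2013, 11, 19), date(2013, 11, 15)),
]

def bookings_known_as_of(log, rental_day, as_of):
    """Number of bookings for rental_day that had been received by as_of."""
    return sum(1 for day, booked_on in log
               if day == rental_day and booked_on <= as_of)

rental_day = date(2013, 11, 18)
known_day_before = bookings_known_as_of(bookings, rental_day, date(2013, 11, 17))
known_two_days_before = bookings_known_as_of(bookings, rental_day, date(2013, 11, 16))
print(known_day_before, known_two_days_before)
```

An operational system that only stores the latest count for November 18 could not answer either of these "as of" questions after the fact.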

“More” is almost always better

In order to draw well-founded conclusions from statistical results, a certain minimum data quantity is needed. This minimum quantity (also called sample size or number of cases) depends on the analysis task and the distribution of the data. The area of sample size planning deals with determining the number of cases needed to make sure that a potential difference in the data can be detected with a certain statistical significance.
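As one hedged illustration of sample size planning, the following Python sketch uses the common normal-approximation formula for comparing two group means, n = 2 * ((z_alpha + z_beta) / d)^2 cases per group, where d is the difference to be detected measured in standard deviations. The choice of this particular formula, the defaults (5% two-sided significance, 80% power), and the function name are assumptions; other analysis tasks call for other formulas.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate cases per group to detect a standardized mean difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):
    print(f"difference of {d} standard deviations: {n_per_group(d)} cases per group")
```

The required sample grows with the inverse square of the difference, so halving the effect to be detected roughly quadruples the data that must be available.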

In predictive modelling, where for example the probability of a certain event will be predicted, we care not only about the number of observations but also the number of events. In a campaign response analysis, a data sample with 30 buyers and 70 non-buyers will allow us to make better conclusions about the reasons for a product purchase, compared to a situation where only five buyers and 95 non-buyers are in the data (although both cases have a sample size of n=100).
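A rough way to quantify this is sketched below. Suppose we compare the average of some attribute between buyers and non-buyers and, as a simplifying assumption, both groups have the same unit variance. The standard error of the difference in means is then proportional to sqrt(1/n_buyers + 1/n_non_buyers), so it is dominated by the smaller group. The function name is made up for illustration.

```python
import math

def se_factor(n_events, n_non_events):
    """Standard error of a difference in group means, in units of the common SD."""
    return math.sqrt(1 / n_events + 1 / n_non_events)

balanced = se_factor(30, 70)   # 30 buyers, 70 non-buyers
rare = se_factor(5, 95)        # 5 buyers, 95 non-buyers
print(f"30/70 split: {balanced:.3f}")
print(f" 5/95 split: {rare:.3f}")
print(f"ratio: {rare / balanced:.2f}")
```

Under these assumptions the 5/95 sample is more than twice as noisy as the 30/70 sample, even though both contain exactly 100 observations.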

“More” can also mean that a larger number of attributes need to be in the data warehouse. These additional attributes can potentially increase the accuracy of the prediction or reveal additional relationships. The increase in the number of attributes can be achieved by including additional data sources or by creating derived variables from transactional data.

Analyzing more data can also make the data volume hard to handle. Because of its computing power and the ability to handle large amounts of data (big data), SAS has always been excellently prepared for that task. Now we also offer a specialized SAS High Performance Solution.

Detailed data vs. aggregated data – or why external data are not always the solution to the data availability problem

External data are often considered as the solution to enrich analysis data with those aspects that are missing in your own data. In many cases this is truly possible; for example, where socio-demographic data per district is used to describe customer background.

Sometimes, however, the features in this data are not available for individual customers but only aggregated by group, whereas analytical methods often need detailed data per analysis subject.

In my book “Data Quality for Analytics” I use an example of the performance of a sailboat during a sailing race. The boat has a GPS-tracking device on board but no wind-measuring device. Thus the position, speed, and compass heading are available for the boat but not the wind speed and the wind direction.

We could assume that “external data” from a meteorological station in the harbour would be a good substitute for this data. A more detailed view reveals that these data give a good picture of the general wind situation. But they are measured far away from the race area and are not representative of the conditions an individual boat experiences during the race. In addition, they are only collected in five-minute intervals and do not allow a detailed analysis of short-term wind shifts.

We care! That’s why we are demanding.

When we request comprehensive, historical, detailed data, we statisticians do not want to be nasty; we just want to treat our respective analysis question with the right amount of carefulness.

Got your interest?

If I have caught your interest, you can find more details in my books “Data Quality for Analytics Using SAS” and “Data Preparation for Analytics Using SAS”.


How WAVELETS can help separate the signal from the noise

Wavelet analysis is an exciting and relatively new field of study that enables one to extract underlying patterns either from spatially varying or temporally varying data.  Pixel values representing the relative brightness and color that constitute an image are an example of spatially varying data, and daily variations of financial market prices are examples of temporally varying data. By focusing on underlying trends and patterns in the data, wavelet analysis has been used to make significant advances in information and image storage and retrieval, medical diagnosis, and even speech and voice recognition. Wavelet analysis is accomplished in SAS by leveraging various built-in subroutines available in the SAS/IML® software.

You might ask what a wavelet is and why wavelets are important.  As the name suggests, a wavelet is a short-lived ripple/oscillation that has a specific mathematical representation. An example of a wavelet is shown below:


Wavelet-based techniques employ wavelets to extract underlying patterns present in the data, remove noise from the data, and achieve data compression.

To understand how wavelet analysis works, let us consider its application to the electrocardiogram (ECG/EKG) time-series data of a patient with occasional arrhythmia, shown in the figure below (in red). An EKG signal records the electrical activity of the heart and is used to determine whether or not the heart is functioning normally. The morphology and regularity of the heart-related features in an EKG signal are analyzed to detect abnormalities in the heart. However, these features are often embedded in noise and other spurious features that complicate the analysis. To make a definitive diagnosis of possible heart disease, it is crucial to remove these irrelevant features from the EKG signal as well as possible. The figure shows two such irrelevant features in the EKG signal. First, the patient’s respiration causes the overall signal to slowly drift from a reference baseline; this drift, known as the baseline drift, is captured by the slowly undulating dashed blue line in the graphic. Second, the EKG signal is partially embedded in noise generated by artifacts unrelated to heart activity, captured by the irregular fluctuations in the graphic. Both the noise and the baseline drift must be removed for accurate diagnosis.

EKG time series plot

To understand how wavelets are used in this type of analysis, we simply have to change our perspective from ‘the whole’ to the ‘sum of parts’, that is, from the signal itself to the components or building blocks that constitute the composite EKG signal. With the help of wavelets, a complex signal such as the EKG can be broken down, or decomposed, into individual components that capture patterns repeating at distinct time intervals. For example, the baseline drift in the EKG signal has a pattern that is quite distinct from the noise pattern as well as from the heart rhythm pattern; these different patterns can be detected and isolated by wavelet analysis, leaving us with data that pertains to the heart activity alone and allowing for accurate diagnosis. Furthermore, reconstruction of the decomposed signal after removal of noise and other artifacts not only improves the fidelity of the patient’s heart recording but also reduces the size of the dataset by getting rid of extraneous data. In essence, wavelets allow us to convey pertinent information using fewer data samples.
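The decompose, clean, and reconstruct idea can be sketched in plain Python with the simplest wavelet of all, the Haar wavelet. This is an illustration only: it does not use the SAS/IML wavelet subroutines, the "signal" is a synthetic noisy step rather than an EKG, and the number of levels and the threshold are arbitrary assumptions.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_step(values):
    """One Haar level: scaled pairwise sums (approximation) and differences (detail)."""
    pairs = list(zip(values[0::2], values[1::2]))
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail

def haar_unstep(approx, detail):
    """Invert haar_step, interleaving the reconstructed pairs."""
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / SQRT2, (a - d) / SQRT2]
    return out

def denoise(signal, levels, threshold):
    """Decompose, zero out small detail coefficients, then reconstruct."""
    approx, details = list(signal), []
    for _ in range(levels):                  # signal length must divide by 2**levels
        approx, detail = haar_step(approx)
        details.append(detail)
    for detail in reversed(details):         # coarsest level is rebuilt first
        kept = [d if abs(d) > threshold else 0.0 for d in detail]
        approx = haar_unstep(approx, kept)
    return approx

def mse(a, b):
    """Mean squared difference between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(1)
clean = [0.0] * 32 + [1.0] * 32                      # synthetic step signal
noisy = [v + rng.gauss(0, 0.2) for v in clean]       # add measurement noise
smoothed = denoise(noisy, levels=3, threshold=0.6)   # threshold of about 3 noise SDs

print(f"error before: {mse(noisy, clean):.4f}  after: {mse(smoothed, clean):.4f}")
```

Most detail coefficients of the noisy step contain nothing but noise, so setting them to zero removes noise and also means far fewer numbers need to be stored, which is the compression effect described above. A slowly varying component such as baseline drift would instead show up in the coarse approximation coefficients.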

So how does wavelet analysis work on image storage and retrieval?  If you think of an image as a stream of pixelated information that varies in terms of its lightness and darkness, especially when reading from left to right and top to bottom, the information stream has characteristic oscillations that define content areas of the picture.  If we capture only the most important aspects of the contrast and color through wavelet analysis, then we can reduce the amount of information saved.  By reversing the process of converting the transformations back to regular data, we can accurately and precisely reconstruct the image.

Other areas of application of wavelets include:

  • The bio-medical industry performs DNA/protein and blood-pressure analysis, cancer detection, and breathing pattern analysis in new-born babies using wavelets.
  • In government, wavelet-based techniques are being employed for facial recognition algorithms, fingerprint detection, etc.
  • In the finance industry, quick variation in market prices and trading patterns are being studied using wavelets.
  • In the oil and gas industry, estimation of the subsoil properties required for oil exploration employs wavelet-based analysis. This allows companies to focus on underground features likely to hold the most oil or to map out the underlying rock structure.

Though wavelets were traditionally used for image analysis and compression, they are quickly gaining momentum in a variety of problems that require careful interpretation of complex patterns present in the data at different spatial/temporal resolutions and for isolating weak signals from the noise.

Get started today to leverage the value of wavelets in SAS/IML® software to reveal underlying patterns in the data for a better understanding of your business problem.


We are in the final countdown to the Analytics 2013 Conference to be held next week, October 21-22, at the Hyatt Regency Orlando.  There is a power-packed agenda for the conference, featuring four very strong keynote presentations from Dr. Jim Goodnight, CEO of SAS; Ed Gaffin, Walt Disney World; Will Hakes, Link Analytics; and Sven Crone, Lancaster University.

The full conference agenda includes seven vertical tracks, with up to five hourly presentations in each track on both days.  The tracks cover a wide variety of analytical topics such as Big Data, Health Care, Marketing, Forecasting, Text Mining, and Operations Research.  There will be something for everyone, regardless of industry.  For the first time, we are offering a “SAS Presents” track with a focus on ‘Updates from Advanced Analytics’.

The conference has an outstanding group of sponsors,  with Teradata and Intel as Platinum Sponsors.  In addition to the commercial sponsors, we have 15 academic sponsors who will be showcasing the programs that are educating the next generation of analytical talent.

The Analytics and Data Mining Shootout features the top three student teams presenting their solutions to a prepared, hands-on problem.  Over 50 teams compete each year; the top three are awarded prizes, and three honorable mentions are announced.  Support for the competitions has been provided over the past seven years by the Institute for Health & Business Insight.  Stop by the presentations on Tuesday and discover the amazing work these students produce.

In addition to the shootout competition, we have over 50 posters that will be on display.  Students competed in a poster competition, and six poster winners were selected to receive an all-expenses paid trip to the conference.  Visit the sessions and chat with these student researchers.

The activities planned for attendees begin on Saturday, with an offering of SAS’ certification exam for predictive modeling using SAS® Enterprise Miner™, and continue on Sunday morning with additional certification exams. A number of workshops are scheduled for Sunday.  Following the completion of the conference agenda on Tuesday afternoon, eight three-day courses are offered beginning Wednesday morning. Course topics include predictive modeling, data mining, operations research, customer segmentation, econometrics, and forecasting.  The week is rounded out with four two-day courses and several one-day events.  These classes cover text mining, high performance analytics and visual analytics.

So, come join us for the Analytics 2013 Conference in Orlando and stay over for some great training opportunities.  Learn, share and network during the day and enjoy the social activities in the evenings.

Why my aunt Susanne and her friends give us a hard time in statistical analysis

My aunt Susanne is an elderly lady who lives in the countryside and looks forward to celebrating her 80th birthday soon. Since the 1960s she has had a telephone connection with her fixed line provider. At that time, and for many years later, in the country where my aunt lives, you had to apply for a telephone contract and hope that you received one. This was long before topics like "customer relationship management" or "customer care" became important. There was almost no personal data (date of birth, demographics) collected during the application process, as there was no need for it. The most important detail was the postal address of the telephone line, so the provider could send out the bill.

In the 1990s, topics like "customer segmentation" and "know your customers" became more and more important, also at aunt Susanne’s phone provider. Since then it has been mandatory to provide the date of birth with every new contract or contract change. My aunt, however, never changed or extended her phone contract (she says, “A simple phone is enough!”) and never participated in customer surveys or marketing campaigns. Thus, no additional data were collected from her. And she is not the only one in this situation. In her circle of friends there are many with a similar “data history.”

The statistician in his cubicle

If the statistician in the analysis department of aunt Susanne’s phone provider now looks into the customer database and creates an analysis of "customer age," he might see the following picture.

Customer age distribution


The age distribution by years shows how many customers are in which customer age groups. Based on that information, it is possible to define priorities for product bundles and selections for marketing campaigns. Additionally, in this diagram, the statistician will see the proportion of missing values, where customer age could not be calculated because of a missing date of birth. In our case this proportion is 9.1 percent.

The statistician now must decide how to deal with the missing values.

  • Shall a group with "age unknown" be created?
  • Shall the observations with missing values just be excluded from the analysis?
  • Shall an average age of 42 years be assumed?
  • Or shall the imputation values be sampled from the true distribution?

The last two options assume implicitly that there is no pattern behind the fact that age is missing.
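The four options above can be sketched in a few lines of Python. This is only an illustration on made-up ages, not the provider's data or a recommended procedure; the variable names are hypothetical. Since the true age distribution is unknown, the sketch samples from the observed ages as a stand-in, which is itself an assumption.

```python
import random
import statistics

# Made-up customer ages; None marks a missing date of birth.
ages = [23, 35, None, 41, 52, None, 38, 47, 29, None]
observed = [a for a in ages if a is not None]

# Option 1: keep everyone and create an explicit "age unknown" group.
groups = ["unknown" if a is None else "known" for a in ages]

# Option 2: exclude the observations with missing values.
complete_cases = list(observed)

# Option 3: replace every missing value with the average of the observed ages.
mean_age = statistics.mean(observed)
mean_imputed = [mean_age if a is None else a for a in ages]

# Option 4: draw each replacement at random from the observed ages
# (a stand-in for the unknown true distribution). Seeded for repeatability.
rng = random.Random(0)
sampled = [rng.choice(observed) if a is None else a for a in ages]
```

Note that options 3 and 4 only ever produce ages that look like the customers we already know about, which is exactly the implicit assumption mentioned above.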

If we, however, now return to my aunt Susanne and her friends, we can assume that the missing values occur for customers in a higher age group. After a certain year it was not even possible to get a contract without providing the date of birth. So we can assume that the distribution of the missing age values does not cover the whole range of values, but is located at the right end of the distribution. The determination of an optimal replacement value for “age missing” has to consider this fact in the form of a business rule.


Customer age distribution with missing values for aunt Susanne and her friends highlighted

The red area in the histogram thus is the "my aunt Susanne and her friends" group. In fact, they represent a specific customer segment: older, long-term customers who did not show affinity for product upgrades or contract changes. And they should be treated differently in marketing actions. These customers probably have a demand for specific hardware (a phone with large keys, simple usage). Or they need special assistance through the customer care hotline.

Open your mind!

What do we statisticians learn from this story? The data that we analyse have a history! They do not only reflect the value that they measure, but are also influenced by the business process, the type of data collection and data storage. To generate good results, it is mandatory for us not only to look at the data from the statistical point of view. We also have to observe the business background. For statistical analysis, we have to consider that things happen randomly in only very few cases. Let’s think twice before we treat features in the data like missing values, outliers and biases as random, and ask whether we need to investigate the background and handle our data individually.

Statistical methods and SAS can help here to decide whether missing values occur randomly or whether systematic patterns lie behind them. Methods to detect these patterns include tile charts for the missing value pattern as shown in my blog contribution from February 2013, or multivariate methods like principal components analyses for the missing value indicator. Another option is to use the missing Yes/No flag as the target in a predictive model to analyse which variables are correlated with the fact that the date of birth is missing. In the case of my aunt Susanne, those variables would be “long term customer relationship” or “basic product bundle.”
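A much simpler relative of the predictive-model idea is to compare the share of missing values across groups. The Python sketch below does just that on a handful of invented customer records; the field names (`tenure_years`, `bundle`, `birth_date`) and the helper `missing_rate_by` are hypothetical, and a real analysis would of course use a proper model and far more data.

```python
# Invented records: long-term customers on the basic bundle tend to lack a birth date.
customers = [
    {"tenure_years": 45, "bundle": "basic", "birth_date": None},
    {"tenure_years": 40, "bundle": "basic", "birth_date": None},
    {"tenure_years": 12, "bundle": "basic", "birth_date": "1970-03-14"},
    {"tenure_years": 8, "bundle": "premium", "birth_date": "1982-11-02"},
    {"tenure_years": 3, "bundle": "premium", "birth_date": "1990-06-21"},
    {"tenure_years": 5, "bundle": "premium", "birth_date": "1985-01-30"},
]


def missing_rate_by(records, key):
    """Share of records with a missing birth date, per value of `key`."""
    counts = {}
    for record in records:
        total, missing = counts.get(record[key], (0, 0))
        is_missing = record["birth_date"] is None
        counts[record[key]] = (total + 1, missing + (1 if is_missing else 0))
    return {value: missing / total for value, (total, missing) in counts.items()}


rates = missing_rate_by(customers, "bundle")
```

If the rates differ strongly between groups, as they do for the two bundles in this toy data, that is a hint the values are not missing at random and deserve a closer look at the business background.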

Got your interest?

If this post arouses your interest, you can find more details in my new book, Data Quality for Analytics Using SAS, or you can download the slides from my presentation at Analytics 2013 in London. You can also find a picture-blog of my books here.

The truth about the big data evolution (or at least as I see it)

Some recent press articles question the value of big data while a book takes the opposite approach; I’ll choose the middle way. The New York Times article ‘Is Big Data an Economic Big Dud?’ questions the value of digital data and the resulting increase in the amount of data. This CNBC article points out that NASCAR hasn’t seen any impact on its bottom line.

These articles on the negative end of the spectrum are countered by a recently released book.

‘Big Data: A Revolution That Will Transform How We Live, Work, and Think’ takes the opposite perspective. This book argues that big data will transform our world and goes to the extreme making statements such as ‘with big data you don’t have to understand why a correlation occurs.’

The truth lies somewhere between these viewpoints.

Here are some thoughts and examples from recent projects where my consulting team has been directly involved with customers wrestling with these issues. Results we’ve seen at SAS support the idea of fundamental change underway in how we work with big data and the business results. Unlike the aforementioned book authors, I view the progress being made as an ‘evolution’, not a revolution. The market is already multiple years into working with in-memory and in-database technologies. It’s not as if there has been an overnight shift.

Utilizing in-memory technologies, along with time-tested predictive algorithms, our clients have seen orders-of-magnitude decreases in the time needed to do their analyses. One large customer saw several different predictive models decrease from taking hours to run down to minutes, and even seconds in some cases.

In addition to the reduced number-crunching time, our clients are now able to work with larger amounts of data. Given this ability, our clients can work with entire databases instead of subsets of their data. One implication is the ability to model an entire credit portfolio, and to do so in a shorter time window than was previously possible with only a sample of that database.

Now that we are able to leverage a larger volume of data at increased velocity we can do more analyses, looking at more and different business problems. We can consider modeling entire databases without sampling in these big data environments.

Though we are in the early days of the big data evolution, it’s clear the book ‘Big Data’ is more on the mark than the negative articles that have recently been published. If one considers we are at the ‘crawling stage’ in working with big data, and already seeing orders of magnitude changes in working with data, one can only imagine the impact on applications like data mining, forecasting, text mining, optimization and quality control.  In turn, businesses will realize quicker time to value, increased profit margins, reduced costs, and more satisfied customers by leveraging big data and the technologies to work with that data.

How incremental response modeling can help you reach the right target group more precisely

Did you know that more than 30,000 Americans die in traffic accidents every year? Interestingly, the U.S. import of mangoes from Brazil is found to be highly correlated with this fatality rate, as shown in the graph below.  But are mango imports a good indicator of the future traffic fatality rate? The answer is obviously, “No!”, but it does illustrate well that correlation does not always imply causation. This premise is critical to understanding marketing response models, which predict whether the receipt of a marketing offer incents customers to buy more products.

Traditional response models score customers based on their likelihood to purchase so that marketing campaigns can be targeted towards those customers who will maximize the response rate. Although correlation may exist between the response rate and marketing incentive, traditional predictive models fail to address whether the response was caused by the marketing campaign, because they cannot distinguish between customers who would have responded positively regardless of the marketing offer and those who responded positively only because of the offer. Clearly we only want to target the latter group to measure the extra revenue they generate.

Causality can be established and isolated from correlation by using incremental response modeling[1]. Application of this modeling technique requires designing an experiment where you categorize a group of randomly selected subjects into treatment and control groups. While the treatment group receives a “treatment” (e.g. a marketing offer), the control group does not. Ideally, the subjects in the two groups are identical, so any difference in outcomes between the two groups can be attributed solely to the “treatment.” We then isolate the effect of the treatment by assessing how the two groups reacted.

Incremental response modeling is an advanced modeling technique designed exactly for this purpose, because it enables one to measure the incremental impact of a certain treatment (e.g. the incremental/additional revenue that was generated because of the marketing offer). This functionality (available in SAS® Enterprise Miner™) is a convenient way to build models that can predict the incremental impact of an action by measuring the difference in outcomes between the treatment and control groups.
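At its core, the quantity being modeled is simply the difference in outcomes between the two groups. The following Python sketch shows that arithmetic on invented response data (1 = bought, 0 = did not buy). It is not the SAS® Enterprise Miner™ functionality, which goes on to model this difference at the level of individual customers; the function names and numbers are hypothetical.

```python
def response_rate(outcomes):
    """Fraction of subjects who responded (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def incremental_response(treatment, control):
    """Difference in response rates: the part attributable to the treatment,
    assuming the two groups were randomly assigned and otherwise comparable."""
    return response_rate(treatment) - response_rate(control)


# Invented outcomes for ten customers in each group.
treatment = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # received the marketing offer
control = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # did not receive the offer

uplift = incremental_response(treatment, control)
```

Here six of ten treated customers responded against three of ten in the control group, so only about 30 percentage points of the response can be credited to the offer; the remaining buyers would have purchased anyway.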

Since incremental response models predict the difference in outcomes rather than the direct outcome, we encounter “second-order effects” in the model. To illustrate, consider predicting the trajectory of a golf ball. Gravity is the dominant effect while wind is a second-order effect in this prediction. Measuring the incremental impact due to small second-order effects could be challenging. In the golf ball analogy, this translates to measuring an incremental change in the ball’s trajectory because of the presence of wind; on a calm day the second-order effect due to wind would be small. The incremental response model functionality in SAS® Enterprise Miner™  handles small second-order effects by allowing the user the flexibility to identify and rank the variables that will maximize the incremental impact, improving model stability and predictive power. For instance, the clubhead speed may be relatively more important than the humidity in maximizing the incremental change in the golf ball's trajectory!

Benefits from this type of modeling can be seen in a diverse set of applications:

  • Producing higher responses from properly targeted marketing campaigns
  • Isolating people who will benefit from retention campaigns
  • Targeting the right medicine to the right patient in clinical healthcare
  • Generating additional votes by targeting the “swing states” in political election campaigns

In essence, incremental response modeling could be applicable in any experimental design where treatment and control groups can be identified to measure the incremental impact of a certain action. At SAS I have worked with several universities to identify additional admissions that could be generated by targeting fellowships or other financial aid (“treatment”) to the right prospects. This practice leads to better distribution of the limited fellowship resources among the prospective students who will be impacted the most.

To underscore, correlation doesn’t imply causation! In the context of mango imports and fatality rate mentioned in the opening paragraph, the incremental change in fatality rate for a controlled experiment would certainly be zero, revealing that the relation between traffic deaths and mangoes is correlation only.

The paper Using Incremental Response Modeling with SAS® Enterprise Miner™ shows how this technique can lead to powerful decision making in a variety of applications, thereby helping you gain that extra competitive advantage. So use your budget strategically. Start leveraging the power of SAS for incremental response modeling with prospects where the incentive will be the tipping point that influences their decision.

[1] Incremental response modeling is also known as ‘net lift’, ‘true lift’, ‘differential response’, ‘incremental impact’, ‘incremental lift’, ‘true response’, or ‘net response’ modeling.