An intuitive approach to the appropriate use of forecasts

It is a mild summer evening in July at Lake Neusiedl here in Austria. The participants of the traditional YES Cup Regatta are sitting with beer and barbecue chops on the terrace of our clubhouse. The mood is relaxed, and everyone wants to tell their story after two eventful races.

A conversation at the end of our table draws my attention, because it is about forecasting, more specifically the usability and accuracy of weather and wind forecasts. As expected, the opinions differ substantially. From "mostly wrong" to "we should be thankful that we have them – in earlier times no forecast existed on that level of detail" to "I make my decisions based on the cloud pictures."

Knowing the wind conditions before a regatta is important because it enables good decisions, such as: "What sail size should I use to start the race, so that I don't have to change it during the regatta?" or "Which wind direction will prevail, and which areas of the lake will therefore be favored?"

Marc, an old stager in race sailing, explains his use of wind forecasts as follows:

"I always consider several available forecasts; Windguru, Windfinder, Otto Lustyk, swz.at and ORF Burgenland . So I get a picture of the diversity or uniformity of the possible wind scenarios - because obviously the stations use different weather models. So I can judge whether weather and wind for the race weekend is easy or hard to predict and how much I can trust the forecasts in general. In addition, I also monitor how much the predictions for the weekend change during the week. If they stay stable all week, the weather seems to allow a clear prediction; if the predictions change daily, it seems that we get very unstable whether conditions. On the race day itself, watching the clouds and the sky is very important. Short-term and local facts cannot be included in these models and give me additional information based on my experience.”


A smile crosses my face, and I intentionally do not join the conversation, because I do not want to be seen as the statistician who "always considers everything so mathematically." And even more importantly: there is nothing to add to Marc's statement. Without knowing it, he has summarized the most important principles of business forecasting and described the proper handling of statistical forecasts. And although his professional background is definitely not in data, forecasting, or things like "business intelligence," Marc could join me at the forecasting demo station at the next Analytics Conference, because what he has just explained corresponds to important features in SAS® Forecast Server.

  • Combining models for stable forecasts: "I always consider several available forecasts."
  • Segmentation of time series: "I can judge whether weather and wind for the race weekend is easy or hard to predict."
  • Confidence intervals for forecasts: "How much I can trust the forecasts in general."
  • Forecast stability analysis and rolling simulations: "I also monitor how much the predictions for the weekend change during the week."
  • Overrides and judgmental forecasts: "Short-term and local facts cannot be included in these models and give me additional information."

So enjoy the fact that there is software that does the very things that people consider as intuitively correct, smile with satisfaction, and head towards the beer tap for another beer. At least, that's what we did after Marc shared his intuition with us.

 


Combined forecasts: what to do when one model isn’t good enough

My esteemed colleague and recently-published author Jared Dean shared some thoughts on how ensemble models help make better predictions. For predictive modeling, Jared explains the value of the two main forms of ensembles: bagging and boosting. It should not be surprising that the idea of combining predictions from more than one model can also be applied to other analytical domains, such as statistical forecasting.

Forecast combinations, also called ensemble forecasting, are the subject of many academic papers in statistical and forecasting journals; they are a known technique for improving forecast accuracy and reducing the variability of the resulting forecasts. In their article "The M3 Competition: Results, Conclusions, and Implications," published in the International Journal of Forecasting, Spyros Makridakis and Michèle Hibon write about the results of a forecasting competition and share as one of their four conclusions: "The accuracy of the combination of various methods outperforms, on average, the specific methods being combined and does well in comparison with other methods."

The lesson from this statement is that a combination of forecasts from simple models can add substantial value in terms of enhancing the quality of the forecasts produced, but the statement also concedes that combinations might not always perform better than a suitably-crafted model.

But how do you combine statistical forecasts? Similar to ensembles for predictive models, the basic idea is to combine the forecasts created by individual models, such as exponential smoothing models or ARIMA models. Let's have a look at three combination techniques typically used, with a small numerical sketch after the list:

  • Simple average
    • Every forecast created is combined using a similarly-weighted value – while this sounds like a simplistic idea, it has been proven very successful by practitioners, in particular if the individual forecasts are very different from each other.
  • Ordinary least squares (OLS) weights
    • In this approach an OLS regression is used to combine the individual forecasts. The main idea is to assign higher weights to the more accurate forecast.
  • Restricted least squares weights
    • Extends the idea of OLS weights by forcing constraints on the individual weights. For example, it might make sense to force all weights to be non-negative.
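
To make the three techniques concrete, here is a toy sketch in Python with made-up holdout values; the arrays, weights, and model labels are purely illustrative assumptions and not how SAS Forecast Server computes its combinations.

    import numpy as np
    from scipy.optimize import nnls  # used for the non-negativity restricted variant

    # Hypothetical holdout sample: actual values and the forecasts of three candidate models
    actuals = np.array([102.0, 98.0, 110.0, 105.0, 99.0, 107.0])
    forecasts = np.column_stack([
        np.array([100.0, 97.0, 108.0, 103.0, 101.0, 106.0]),  # e.g. an exponential smoothing model
        np.array([105.0, 95.0, 112.0, 108.0, 97.0, 110.0]),   # e.g. an ARIMA model
        np.array([ 99.0, 99.0, 107.0, 104.0, 100.0, 105.0]),  # e.g. a seasonal naive model
    ])

    # 1) Simple average: every model receives the same weight
    avg_combined = forecasts.mean(axis=1)

    # 2) OLS weights: regress the actuals on the individual forecasts (no intercept here),
    #    so the more accurate models receive the larger weights
    ols_weights, *_ = np.linalg.lstsq(forecasts, actuals, rcond=None)
    ols_combined = forecasts @ ols_weights

    # 3) Restricted least squares weights: the same regression, but with the weights
    #    constrained to be non-negative
    nn_weights, _ = nnls(forecasts, actuals)
    nn_combined = forecasts @ nn_weights

    print("OLS weights:", ols_weights)
    print("non-negative weights:", nn_weights)

In practice the weights would be estimated on a holdout region and applied to future forecasts, and the combined forecast would still have to prove itself against the individual models in the model selection process described below.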

It is worth mentioning that the prediction error variance needs to be considered separately: in all cases, the estimated prediction error variance of the combined forecast is built from the prediction error variance estimates of the individual forecasts being combined.

Not every time series forecast benefits from combination. The power of this technique becomes apparent when you consider that modern software such as SAS® Forecast Server allows combination methods to be applied to large-scale time series forecasting of hierarchically structured data. The software makes it possible to generate combinations for inclusion in its model selection process in an automated fashion. In all cases, combined forecasts must prove their worth by their performance in comparison to other forecasts in the model selection process. If you are interested in more details, this paper provides an extended explanation.

How ensemble models help make better predictions

My oldest son is in the school band, and they are getting ready for their spring concert. Their fall concert was wonderful; hearing dozens of students with their specific instruments playing together creates beautiful, rich sounding music. The depth of sound from orchestral or symphonic music is unmatched. In data mining, and specifically in the area of predictive modeling, a similar effect can be created using ensembles of models, which lead to results that are more "beautiful" than those of a single model. A predictive model ensemble combines the posterior predictions from more than one model. When you combine multiple models you create a kind of model crowdsourcing. Each individual model is described by a set of rules, and when the rules are applied in concert you can consider the "opinions" of many models. How to use these opinionated models depends on the goal. The two main ways are to (1) let every model vote and decide the target label democratically, or (2) label the target with the opinion of the most confident model (probabilistically speaking).

Types of Ensembles

The two main forms of ensembles are boosting and bagging (more precisely, bootstrap aggregating). The most popular ensembles are built from decision trees; random forests and gradient boosting machines are two examples that are very popular in the data mining community right now. While decision trees are the most popular base learners, they are not the only option: any modeling algorithm can be part of an ensemble, and heterogeneous ensembles can be quite powerful.

Bagging

Bagging, as the name alludes, takes repeated unweighted samples with replacement of the data to build models and then combines them. Think of your observations like grains of wild rice in a bag. Your objective is to identify the black grains because they have a resale price 10x greater when sold separately.

  1. Take a scoop of rice from the bag.
  2. Use your scoop of rice to build a model based on the grain’s characteristics, excluding that of color.
  3. Write down your model classification logic and fit statistics.
  4. Pour the scoop of rice back into the bag.
  5. Shake the bag for good measure and repeat.

How big the scoop is relative to the bag, and how many scoops you take, will vary by industry and situation, but I usually use 25-30% of my data and take 7-10 samples. This results in each observation being included, on average, about 1-2 times across the models.
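
To make the scoop-and-pour procedure concrete, here is a minimal bagging sketch in Python; the synthetic data, the scikit-learn decision trees, and the 8 scoops of 30% are illustrative assumptions, not a prescription.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(42)

    # Hypothetical "bag of rice": 1000 grains with two measured characteristics
    # (say, length and width) and a label we want to predict (black grain or not)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    n_scoops, scoop_fraction = 8, 0.3       # roughly the 7-10 scoops of 25-30% mentioned above
    scoop_size = int(scoop_fraction * len(X))

    models = []
    for _ in range(n_scoops):
        # steps 1-5: take a scoop (sample with replacement), fit a model, pour the scoop back
        idx = rng.choice(len(X), size=scoop_size, replace=True)
        models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

    # Bagged prediction: let every scoop-level model vote and take the majority
    votes = np.mean([m.predict(X) for m in models], axis=0)
    bagged_prediction = (votes >= 0.5).astype(int)
    print("training accuracy of the bagged ensemble:", (bagged_prediction == y).mean())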

Boosting

Boosting is similar to bagging except that the observations in the samples are now weighted. To follow the rice problem from above, after step 3 I would take the grains of rice I had incorrectly classified (e.g. black grains I said were non-black or non-black grains I thought were black) and place them aside. I would then take a scoop of rice from the bag and leave some room to add the grains I had incorrectly classified. By including previously misclassified grains at a higher rate, the algorithm has more opportunities to identify the characteristics for correct classifications. This is the same idea behind giving more time to review flashcards of facts you didn’t know than those you did. For what it's worth, I tend to use bagging models for prediction problems and boosting for classification problems. By taking multiple samples of the data and modelling over iterations you allow factors that are otherwise weak to be explored. This provides a more stable and generalizable solution. When model accuracy is the most important consideration, ensemble models will be your best bet. This topic was recently discussed in much greater detail at SAS Global Forum. See this paper by Miguel Maldonado for more details.
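
The reweighting idea can be sketched in the same toy setting; this is a deliberately simplified boosting loop (doubling the weight of misclassified observations rather than using the exact AdaBoost or gradient boosting updates), so treat it as an illustration of the principle only.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Same kind of hypothetical grain data as in the bagging sketch
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    n_rounds = 10
    weights = np.full(len(X), 1.0 / len(X))   # start with every grain weighted equally
    models = []

    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)        # fit on the weighted sample
        misclassified = stump.predict(X) != y
        # "set the misclassified grains aside and include them at a higher rate":
        # increase the weight of the observations the current model got wrong
        weights[misclassified] *= 2.0
        weights /= weights.sum()
        models.append(stump)

    # Combine the rounds by majority vote
    votes = np.mean([m.predict(X) for m in models], axis=0)
    boosted_prediction = (votes >= 0.5).astype(int)
    print("training accuracy of the boosted ensemble:", (boosted_prediction == y).mean())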

Image credit: photo by Ludovico Sinz // attribution by creative commons

How Bayesian analysis might help find the missing Malaysian airplane

At the time this blog entry was written, there still appear to be few if any signs of locating the missing Malaysian flight MH370. The area of search, although already narrowed down from the size of the United States at one point to the size of Poland, is still vast and presents great challenges to all participating nations. Everything we've seen in the news so far has been leads that turn out to be nothing but dead ends.

There are a great many uncertainties surrounding the disappearance of flight MH370, making a search and rescue operation seem like finding a needle in an ocean-sized haystack. There is, however, an already established statistical framework based on Bayesian inference that has had great success in locating, amongst other things, a hydrogen bomb lost in the Mediterranean Sea1, a sunken US Navy nuclear submarine (USS Scorpion)1, and the wreckage of Air France Flight 447 just several years ago.

The U.S. Coast Guard's SAROPS (Search and Rescue Optimal Planning System) is based on the same Bayesian search framework, refined to accommodate ocean drift and crosswinds. As there is currently no evidence that the Malaysian government or Malaysia Airlines is employing a Bayesian optimal search method, it is worthwhile to point out why a Bayesian search strategy should at least be considered for a situation such as the missing MH370 case.

Unknown variables

First of all, there are still many unknowns regarding the missing MH370. Unknown variables are typically modeled probabilistically in the statistical world. Most of us are familiar with the frequency definition of probability. If I handed you an old beat-up coin and asked you for the probability of heads when the coin is flipped, your best bet would be to flip the coin, say, 5000 times and record the number of times it comes up heads. Then you would divide the number of heads by 5000 and get a pretty good estimate of the probability in question. This is the frequency interpretation of probability: the probability of an event is the relative frequency of the event happening in an infinite population of repeatable trials.

In the real world, however, we are often faced with rare and unique events, events that are non-repeatable. Hopefully, we wouldn't have to study 5000 plane crashes to get a good estimate of the probability of a plane accident. In reality, there have only been 80 recorded missing planes since 1948. This calls for a different interpretation of probability, a subjective one that reflects an expert's degree of belief. The subjective nature of the uncertainties of a rare event such as the loss of flight MH370 places us squarely in the domain of Bayesian inference. In the case of Air France 447, the prior distribution (the initial belief about the crash location) of the search area was taken to be a mixture of three probability distributions, each representing a different scenario. The mixture weights were then decided based on consultations with experts at the BEA.

All information is useful

A big advantage of employing a Bayesian search method is that a Bayesian framework provides a systematic way to incorporate all available information via Bayes’ rule. This is invaluable in a large and complex search operation where new information will constantly emerge and the situation could change at a moment’s notice, requiring the search strategy to be constantly updated. The important thing to note here is that any information is considered useful. One area turning up empty will lower the probability of the wreckage in that area after a Bayesian update, but at the same time, it will increase the probability of the wreckage in other areas yet unsearched.

Air France 447 went missing in June 2009. When the BEA commissioned the scientific consulting firm Metron Scientific Solutions to come up with a probability map of the search area in 2011, two years of search efforts had turned up nothing. In their model, the Metron team took into account all four unsuccessful previous searches when updating their prior distribution of the crash location. Based on their recommendation to resume the new round of search efforts around the region with the highest posterior probability, the wreck was located only one week into the search2.

While there are many intricate steps involved in deploying a Bayesian search strategy, particularly in coming up with the prior distribution and quantifying the likelihood of the different accident scenarios, the core math involved is surprisingly straightforward. For illustration purposes, assume that the search area is divided up into N grids, labelled x1 through xN. Let the prior probability of the wreck being in grid xk be denoted by p(xk+), for k=1,…,N. Now let the probability of successful detection in grid xk, given that the wreck is in grid xk, be denoted p(Sk+|xk+). If the search in grid xk turns up empty, then the posterior probability of the wreck being in grid xk, given that the search in grid xk is unsuccessful, is:

p(xk+|Sk-) = p(xk+)[1 - p(Sk+|xk+)] / [1 - p(xk+)p(Sk+|xk+)]

Meanwhile, the posterior probability of the wreck being in any other grid xm is also updated by the information that the search in grid xk turned up unsuccessful:

p(xm+|Sk-) = p(xm+) / [1 - p(xk+)p(Sk+|xk+)]

Note that p(xk+|Sk-) < p(xk+), and p(xm+|Sk-) > p(xm+).

A Bayesian search strategy would start from the grids with the highest prior probability mass; if nothing is found in those grids, the posterior probability of all grids is updated via Bayes' theorem and the process starts over, treating the new posterior probabilities as the current prior probabilities. This could be a lengthy process, but as long as the wreck lies within the prior region, it would eventually be located.
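
A small simulation in Python shows the update at work; the number of grids, the prior, and the 70% detection probability are all made-up assumptions chosen for illustration, not values from the actual search.

    import numpy as np

    rng = np.random.default_rng(7)

    N = 50                                        # hypothetical number of search grids
    prior = rng.random(N)
    prior /= prior.sum()                          # prior p(xk+) over the grids
    p_detect = np.full(N, 0.7)                    # assumed p(Sk+|xk+): chance of finding the
                                                  # wreck when searching the grid that holds it
    true_grid = rng.choice(N, p=prior)            # where the wreck "really" is in this simulation

    searches = 0
    while True:
        k = int(np.argmax(prior))                 # search the grid with the highest current probability
        searches += 1
        if k == true_grid and rng.random() < p_detect[k]:
            print(f"wreck found in grid {k} after {searches} searches")
            break
        # unsuccessful search of grid k: apply the Bayesian update from the formulas above
        evidence = 1.0 - prior[k] * p_detect[k]   # probability of an unsuccessful search, p(Sk-)
        posterior = prior.copy()
        posterior[k] = prior[k] * (1.0 - p_detect[k]) / evidence
        other = np.arange(N) != k
        posterior[other] = prior[other] / evidence
        prior = posterior                         # today's posterior is tomorrow's prior

Every empty search drains probability from the grid just searched and spreads it over the grids not yet ruled out, which is exactly the behaviour described above.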

An unprecedented search area

When AF447 crashed, the BEA was able to quickly establish that the plane had to lie within a circle of 40 nautical miles radius around the plane's last known location. This is roughly 6600 square miles of initial search area, compared to MH370's current Poland-sized search area of more than 100,000 square miles. Considering it took two years and five rounds of search efforts to finally locate AF447, the difficulty involved in finding MH370 is unprecedented in the history of modern aviation. While a Bayesian search method might not locate the remains of MH370 any time soon, its flexibility and systematic nature, not to mention its past successes, make it a powerful tool to seriously consider for the current search efforts.

For interested readers, here is the paper that documented the Metron teams’ efforts in using Bayesian inference to develop the probability map of AF447’s location.

References:

1. S. B. McGrayne. “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy”, Yale University Press, 2011.

2. L. D. Stone, C. M. Keller, T. M. Kratzke and J. P. Strumpfer. “Search for the Wreckage of Air France Flight AF 447”, submitted to Statistical Science, 2013.

March Madness and Predictive Modeling

Jared Dean and son at a 2013 NCAA Tournament game one

In my region of North Carolina (Raleigh, Durham, and Chapel Hill) one of the most anticipated times of the year has arrived— the NCAA basketball tournament. This is a great time of year for me, because I get to combine several of my passions.

For those who don’t live among crazed college basketball fans, the NCAA (National Collegiate Athletic Association) holds an annual tournament that seeds the regional conference winners and the best non-conference winning teams in a single elimination tournament of 68 teams to determine the national champion in collegiate basketball.  The teams are ranked and seeded so that the perceived best teams don’t face each other until the later rounds.

In the tournament history stretching back more than 75 years, only 14 universities have won more than one championship, and three schools local to SAS world headquarters are on that list (the University of North Carolina, Duke University, and North Carolina State University). That concentration, combined with the fact that this area is a well-known cluster for statistics, means that I am not alone amongst my neighbors in combining my passions.

The NCAA tournament carries with it a tradition of office betting pools, where coworkers, families, and friends predict the outcome of the 67 games to earn money, pride, or both. Unbeknownst to many of them, they are building predictive models, something near and dear to my heart. As a data miner, I analyze data and build predictive models about human behavior, machine failures, credit worthiness, and so on. But predictive modeling in the NCAA tournament can be as simple as choosing the winner by favorite color, most fierce mascot, or alphabetical order. Others rely on their observation of the teams throughout the regular season and conference championships to inform their decisions, and then they use their "gut" to pick a winner when they have little or no information about one or both of the teams.

I'm sure some readers have used these kinds of strategies and lost, or maybe even won the "kitty" in these betting pools, but the best results will come from using historical information to identify patterns in the data. For example, did you know that since 2008 the 12th seed has won 50% of the time against the 5th seed? Or that the 12th seed has beaten the 5th seed more often than the 11th seed has beaten the 6th seed?

Upon analyzing tournament data, patterns like these emerge about the tournament, specific teams (e.g. NC State University struggles to make free throws in the clutch), or certain conferences. To make the best predictions, use this quantitative information in conjunction with your own domain expertise, in this case about basketball.

Predictive modeling methodology generally comes from two groups: statisticians and computer scientists (who may take a more machine learning approach). The field of data mining encompasses both groups with the same aim - to make correct predictions of a future event. Common data mining techniques include logistic regression, decision trees, generalized linear models, support vector machines (SVM), neural networks, and many many more (all available in SAS).

While these techniques are applied to a broad range of problems, professors Jay Coleman and Mike DuMond have successfully used those from SAS to create their NCAA "dance card," a prediction of the winners that has had a 98% success rate over the last three years.

If you think you have superior basketball knowledge and analytical skills, then hopefully you entered the ultimate payday competition from Warren Buffett. He will pay you $1 billion if you can produce a perfect bracket. Before you go out and start ordering extravagant gifts, it is worth considering that the odds of winning at random are 1 in 148 quintillion (148,000,000,000,000,000,000), but with some skill your odds could improve to 1 in 1 billion. In this new world of crowdsourcing, about 8,000 people have united to try to win the billion dollar prize. I don't know how many picked Dayton to beat Ohio State last night, but that one game appears to have eliminated about 80% of participants.
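
Where does a number like 148 quintillion come from? If every one of the 67 games were a pure 50/50 guess, there would be 2 to the 67th power possible brackets, which a couple of lines of Python (purely illustrative) confirm:

    # Number of possible brackets if all 67 games are coin flips
    random_odds = 2 ** 67
    print(f"{random_odds:,}")          # 147,573,952,589,676,412,928, i.e. about 148 quintillion
    print(f"about 1 in {random_odds:.2e}")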

If you’re looking for even more opportunities to combine basketball and predictive analytics, then check out this Kaggle contest with a smaller payday but better odds.

Statistician George Box is famous for saying, “essentially, all models are wrong but some are useful”.  I wish you luck in your office pool, and if you beat the odds remember the bloggers in your life :)

Skills needed for competitive advantage in analytics (hint, it's not just the math)

I'm a big believer in both/and thinking, so I'll stand squarely in the middle and say that the most important skills for competitive advantage in analytics include a combination of top-notch modeling abilities along with business acumen, critical thinking, and curiosity. I was intrigued by a blog post on this topic from AllAnalytics.com, where editor Beth Schulz began with the provocative title "Quants Not so Necessary for Predictive Modeling?" She takes her cue from a recent TDWI best practices report on "Predictive Analytics for Competitive Advantage," where among other things they shared the results of a survey on the skills necessary to perform predictive analytics. More than two thirds of respondents agreed that knowledge of the business, critical thinking, and understanding of the data are essential. Agreement then begins to diminish, with 41% citing training in predictive analytics but only 34% agreeing on a degree in this field.

Greta Roberts of Talent Analytics has her own angle on this question, based on a study they did of analytics professionals themselves to profile their traits. Beyond the obvious findings, their research points to curiosity and creativity, as well as discipline, as top attributes to look for when hiring this kind of talent. But when I am in conversations with our customers I hear varying opinions on this question. Some say these skills don't all exist in one person, which is why they have teams. Others believe that the key is training the quants in the "soft skills."

A solid foundation in the fundamentals of a quantitative discipline provides an excellent start for a career in analytics, and those who come into analytics from other paths will find themselves going back to learn things they hadn't studied in school. It can be a tough slog to try to catch up on linear algebra at night. But the "math" alone won't solve a business problem. It is essential to understand the business context around a problem to formulate its solution. And then this proposed solution must typically be explained to a group of stakeholders that includes at least some people without deep analytical training. And implementing most solutions involves collaborating across different business units with diverse backgrounds and training. So a much wider set of skills is necessary to go from problem to solution.

To highlight this challenge, SAS teamed up with the Analytics Section of INFORMS for the Student Analytical Scholar Competition, which requires applicants to read a case study and submit a Statement of Work explaining how they would address the business problem in the case study. Naturally this requires analytical skills, but those skills alone don't lead to the best applications. In this interview, 2013 winner Alex Akulov talks about his perspective after finishing his bachelor's degree in math (with a minor in optimization). He assumed that when presented with a problem he would formulate it and then be done: "It's an optimal solution, so you present it to the manager and of course, he's going to say, 'Yeah, let's do it because it's optimal.'" But how often does that ever happen?

Communication skills are essential as well, which is why this competition offers students a chance to ask questions as if they were consultants on the job interacting with the "customers." In this discussion forum (open until February 14 at 5:00 pm EST), they can query the individuals involved in the case study, who will respond as they see fit. That will give applicants more information to incorporate into their submissions, which are due by midnight on February 17. The winner will have their expenses paid to attend the INFORMS Conference on Business Analytics and Operations Research in Boston March 30-April 1, where they will have a fantastic opportunity to attend sessions given by analytics practitioners and network amongst them. Alex cited that experience as a great benefit of winning.

LinkedIn discussions are full of students asking what it takes to succeed in analytics. How would you advise them - what do you look for when hiring analytics teams?

Using panel data to measure the economic impact of the Super Bowl

MetLife Stadium in East Rutherford, New Jersey is the site of Super Bowl 48

This year the Super Bowl will take place in East Rutherford, New Jersey at MetLife Stadium just outside New York City. For the first time in this event's 48-year history, the game will take place outdoors in a cold-weather environment, potentially subjecting players and fans to sub-freezing temperatures. The fans in their excitement will find ways not to freeze, but will the local economy be "warmed" by the game? While the game itself will likely sell out, the more lasting question is what economic effect, if any, the game has on the local economy.

Several economists at the University of Maryland-Baltimore County have used historical information to estimate the marginal effect of the game on the local economy. They compile annual per capita income (a measure of local prosperity) for cities that hosted a playoff game from 1969 to 1997. They also record a number of potential confounding factors such as population growth, other economic drivers, and sports-related events. These other factors are necessary controls in order to separate the effect of the games from other spurious effects.

The model estimated in their paper takes the two-way fixed-effects form

yit = xitβ + αi + γt + εit

where yit is per capita income in area i and year t, xit collects the playoff-game indicators and control variables, αi is an area-specific fixed effect, and γt is a time-specific fixed effect common to all areas. Economists are fans of specifying these area-specific effects as fixed effects (FE) rather than random effects (RE), which tend to be more popular among statisticians. While these differences seem subtle, the implications are large. The fixed effects strategy depends on an assumption regarding the correlation between the unobserved effect and the idiosyncratic error term. In a purely randomized experiment the individual-specific effect is, by construction, uncorrelated with the variable of interest. When using observational data, however, we cannot credibly make this assumption. Why? Well, can we randomly select cities to host playoff games and observe everything affecting the income of a city? No, of course not. We only get the data the world gives us and do the best we can. In this case, we observe cities that host playoff games, cities that do not, and what those cities look like before and after. It is from these observational data that we make inference. Economists economize everywhere. Even with statistics. So what do our fearless economists find?
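
To see what a two-way fixed effects estimate looks like mechanically, here is a toy sketch in Python on synthetic data; the city count, the hosting indicator, and the zero "true" hosting effect are all made-up assumptions, and in SAS this kind of model would be fit with the PANEL procedure mentioned below rather than hand-rolled least squares.

    import numpy as np

    rng = np.random.default_rng(1)

    n_cities, n_years = 30, 25
    city = np.repeat(np.arange(n_cities), n_years)
    year = np.tile(np.arange(n_years), n_cities)

    # Hypothetical panel: income is driven by city and year effects only,
    # so the true effect of hosting a game is zero by construction
    hosted = (rng.random(city.shape) < 0.05).astype(float)
    income = 30.0 + 0.2 * city + 0.1 * year + 0.0 * hosted + rng.normal(scale=1.0, size=city.shape)

    # Two-way fixed effects via the dummy-variable (least squares) approach:
    # income_it = beta * hosted_it + alpha_i + gamma_t + error_it
    city_dummies = (city[:, None] == np.arange(n_cities)).astype(float)
    year_dummies = (year[:, None] == np.arange(1, n_years)).astype(float)  # drop one year to avoid collinearity
    design = np.column_stack([hosted, city_dummies, year_dummies])

    coef, *_ = np.linalg.lstsq(design, income, rcond=None)
    print("estimated effect of hosting a game:", round(coef[0], 3))   # close to zero, as built in

The area and time dummies soak up everything that is specific to a city or to a year, so whatever is left attached to the hosting indicator is the marginal effect the authors are after.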

Using the FE estimator they find an economically and statistically insignificant impact of a postseason game on the per capita income of an area, holding all else constant (this model can be estimated with the PANEL procedure in SAS/ETS ®).  The only effect of the Super Bowl on income appears to be on the average income of the city whose team wins the game - the winning team’s city gets a positive income bump.  The authors chalk this up to an unidentified productivity-enhancement, which in lay terms means the Denver or Seattle economy might have a positive economic boost in the near future.

From an economic perspective, the result of “no impact” for New York City (or any area) is not surprising. Economic activity requires available resources. Cities hosting games are likely to have only so many hotels and restaurants with which to host visitors. These resources are likely to be busy most of the year anyway. This is especially true of the New York City Metro area. Visitors for the "big game" likely dissuade other visitors who might otherwise visit the city. In essence, we have a “crowding out.”

One of New York City’s most beloved sports figures, Yogi Berra, sums up crowding out well when he says, “Nobody goes there anymore, it’s too crowded.”

So, while commercials, politicians and the media will likely portray Super Bowl 48 as a boon for the New York City economy, we can thank some clever econometrics for setting us straight.

*In fact, one can even test the validity of the fixed effect vs. random effect assumption by using a Hausman Test, which tests for systematic differences in the estimates. The authors test this assumption and soundly reject the use of random effects. This test is automatically calculated when the random effects estimator is employed in the PANEL procedure.

References:

Coates, D. and B. R. Humphreys. 2002. The Economic Impact of Postseason Play in Professional Sports. Journal of Sports Economics, 3(3): 291-299.

SAS Institute, The PANEL Procedure, SAS/ETS(R) 13.1, http://support.sas.com/documentation/cdl/en/etsug/66840/HTML/default/viewer.htm#etsug_panel_details40.htm

Image credit: photo by Anthony Quintano // attribution by creative commons

"The hungry statistician" – or why we never can get enough data

As the “Year of Statistics” comes to a close, I write this blog in support of the many statisticians who carefully fulfil their analysis tasks day by day, and to defend what may appear to be demanding behavior when it comes to data requirements.

How do statisticians get this reputation?

Are we really that complicated, with data requirements that are hard to fulfil? Or are we just pushed by the business question itself and the subsequent demands of the appropriate analytical methods?

Let's accept the presumption of innocence and postulate that none of us ask IT departments to excavate data from historical time periods or add just any old variable to the data mart just for fun.

Very often a statistician's work is assessed on the quality of the analytical results. The more convincing, beneficial, and significant the results, the clearer it is to attribute the success to the statistician. Also, it is well-known that good results come from high quality and meaningful data. Thus it is understandable that statisticians emphasize the importance of the data warehouse.

And this insistence is not driven by selfishness “to appear in a good light,” but because we take seriously our work of making well-informed decisions based on the analysis.

Let’s consider three frequent data requirements in more detail.

Who is interested in old news? – can we learn from history?

In order to make projections about the future, historic patterns need to be discovered, analysed, and extrapolated. To do so, historic data is needed. For many operational IT systems, historic versions of the data are irrelevant; their focus is on having the actual version of the data to keep the operational process up and running.

Consider the example of a tariff (fee) change with your mobile phone provider. The operational billing system primarily requires the currently contracted tariff, in order to bill each phone call correctly. To analyze customer behaviour, however, we must also know the prior tariff, so we can find out which patterns of tariff change frequently lead to a certain event, like a product upgrade or a cancellation.

For many analyses we need to differentiate between historic data and the historic snapshot of data.

To forecast the number of rented cars for a car rental agency for the next four weeks, we may need not only the daily number of rented cars but also the bookings that have already been received. For example, for November 18, 2013 the statistical model will use the following data:

  • Number of rented cars on November 18
  • Number of bookings for the rental day, November 18, that are known as of November 17 (the day before)
  • Number of bookings for the rental day, November 18, that are known as of November 16 (two days before)

As the historic booking status for a selected rental day is continuously overwritten by the operational system, the required data can only be provided if they are historicized in a data warehouse.
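
A minimal sketch of what such a historicized table could look like, and how it turns into model inputs, is shown below in Python with pandas; the table layout, column names, and booking numbers are invented for illustration, not taken from a real rental system.

    import pandas as pd

    # Hypothetical historicized snapshot table from the data warehouse:
    # one row per (rental_day, snapshot_day) with the bookings known at that point in time
    snapshots = pd.DataFrame({
        "rental_day":   ["2013-11-18"] * 3 + ["2013-11-19"] * 3,
        "snapshot_day": ["2013-11-16", "2013-11-17", "2013-11-18",
                         "2013-11-17", "2013-11-18", "2013-11-19"],
        "bookings_known": [180, 230, 260, 150, 190, 240],
    })
    snapshots["rental_day"] = pd.to_datetime(snapshots["rental_day"])
    snapshots["snapshot_day"] = pd.to_datetime(snapshots["snapshot_day"])

    # Lead time in days between the snapshot and the rental day
    snapshots["days_before"] = (snapshots["rental_day"] - snapshots["snapshot_day"]).dt.days

    # Pivot into one row per rental day with the bookings known 0, 1 and 2 days before,
    # which is exactly the shape of input the forecast model described above needs
    features = snapshots.pivot(index="rental_day", columns="days_before", values="bookings_known")
    features.columns = [f"bookings_known_{d}_days_before" for d in features.columns]
    print(features)

Without the historicized snapshots, only the latest booking status per rental day would survive in the operational system and the lead-time columns could never be reconstructed.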

“More” is almost always better

In order to get well-founded conclusions from statistical results, a certain minimum quantity of data is needed. This minimum quantity (also called the sample size or number of cases) depends on the analysis task and the distribution of the data. The area of sample size planning deals with determining the number of cases that are needed to make sure that a potential difference in the data can be recognized with a certain statistical significance.

In predictive modelling, where for example the probability of a certain event will be predicted, we care not only about the number of observations but also the number of events. In a campaign response analysis, a data sample with 30 buyers and 70 non-buyers will allow us to make better conclusions about the reasons for a product purchase, compared to a situation where only five buyers and 95 non-buyers are in the data (although both cases have a sample size of n=100).

"More" can also mean that a larger number of attributes needs to be in the data warehouse. These additional attributes can potentially increase the accuracy of the prediction or reveal additional relationships. The increase in the number of attributes can be achieved by including additional data sources or by creating derived variables from transactional data.

Analyzing more data can also make the data volume hard to handle. Because of its computing power and the ability to handle large amounts of data (big data), SAS has always been excellently prepared for that task. Now we also offer a specialized SAS High Performance Solution.

Detailed data vs. aggregated data – or why external data are not always the solution to the data availability problem

External data are often considered as the solution to enrich analysis data with those aspects that are missing in your own data. In many cases this is truly possible; for example, where socio-demographic data per district is used to describe customer background.

But sometimes the features in such data are not available for individual customers, only aggregated by group, while analytical methods often need detailed data per analysis subject.

In my book “Data Quality for Analytics” I use an example of the performance of a sailboat during a sailing race. The boat has a GPS-tracking device on board but no wind-measuring device. Thus the position, speed, and compass heading are available for the boat but not the wind speed and the wind direction.

We could assume that "external data" from a meteorological station in the harbour would be a good substitute. A closer look reveals that this data gives a good picture of the general wind situation, but it is measured far away from the race area and is not representative of the individual race behaviour of a boat. In addition, it is only collected at five-minute intervals and does not allow a detailed analysis of short-term wind shifts.

We care! That's why we are demanding.

When we request comprehensive, historical, detailed data, we statisticians do not want to be nasty; we just want to treat our respective analysis question with the right amount of carefulness.

Got your interest?

If I have caught your interest, you can find more details in my books "Data Quality for Analytics Using SAS" and "Data Preparation for Analytics Using SAS".

 

How WAVELETS can help separate the signal from the noise

Wavelet analysis is an exciting and relatively new field of study that enables one to extract underlying patterns either from spatially varying or temporally varying data.  Pixel values representing the relative brightness and color that constitute an image are an example of spatially varying data, and daily variations of financial market prices are examples of temporally varying data. By focusing on underlying trends and patterns in the data, wavelet analysis has been used to make significant advances in information and image storage and retrieval, medical diagnosis, and even speech and voice recognition. Wavelet analysis is accomplished in SAS by leveraging various built-in subroutines available in the SAS/IML® software.

You might ask what a wavelet is and why wavelets are important.  As the name suggests, a wavelet is a short-lived ripple/oscillation that has a specific mathematical representation. An example of a wavelet is shown below:

wavelet

Wavelet-based techniques employ wavelets to extract underlying patterns present in the data, remove noise from the data, and achieve data compression.

To understand how wavelet analysis works, let us consider its application in the context of electrocardiogram (ECG/EKG) time-series data of a patient with occasional arrhythmia, shown in the figure below (in red). An EKG signal records the electrical activity of the heart and is used to determine whether or not the heart is functioning normally. The morphology and regularity of the heart-related features in an EKG signal are analyzed to detect abnormalities in the heart. However, these features are often embedded in noise and other spurious features that complicate the analysis. To make a definitive diagnosis of possible heart disease, it is crucial to remove these irrelevant features from the EKG signal as well as possible. Shown in the figure are two such irrelevant features in the EKG signal. First, the patient's respiration causes the overall signal to slowly drift from a reference baseline; this drift, known as the baseline drift, is captured in the slowly undulating dashed blue line in the graphic. Second, the EKG signal is partially embedded in noise generated from artifacts unrelated to the heart activity, captured in the irregular fluctuations in the graphic. Both the noise and the baseline drift must be removed for an accurate diagnosis.

EKG time series plot

To understand how wavelets are used in this type of analysis, we simply have to change our perspective from 'the whole' to 'the sum of parts', or from the signal itself to the components or building blocks that constitute the composite EKG signal. With the help of wavelets, a complex signal such as the EKG can be broken down or decomposed into individual components that capture patterns repeating at distinct time intervals. For example, the baseline drift in the EKG signal has a pattern that is quite distinct from the noise pattern as well as from the heart rhythm pattern; these different patterns can be detected and isolated by wavelet analysis, leaving us with data that pertains to the heart activity alone and allowing for accurate diagnosis. Furthermore, reconstruction of the decomposed signal after removal of noise and other artifacts not only improves the fidelity of the patient's heart recording but also compresses the size of the dataset by getting rid of extraneous data. In essence, wavelets allow us to convey pertinent information using fewer data samples.
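
The decompose-threshold-reconstruct cycle can be sketched in a few lines of Python with the PyWavelets package; the synthetic "EKG", the db4 wavelet, the decomposition level, and the threshold value are all illustrative assumptions rather than settings from a real clinical workflow (the post itself points to the SAS/IML subroutines for this kind of analysis).

    import numpy as np
    import pywt  # PyWavelets

    rng = np.random.default_rng(3)

    # Synthetic stand-in for an EKG-like recording: sharp repeating "heartbeat" spikes,
    # a slow sinusoidal baseline drift (the "respiration") and random noise
    t = np.linspace(0, 10, 2048)
    heartbeat = np.where((t * 1.2) % 1 < 0.03, 1.0, 0.0)
    drift = 0.5 * np.sin(2 * np.pi * 0.1 * t)
    noise = 0.1 * rng.normal(size=t.size)
    recording = heartbeat + drift + noise

    # Decompose into approximation (coarse trend) and detail (fine structure) coefficients
    coeffs = pywt.wavedec(recording, "db4", level=6)

    # Remove the baseline drift by zeroing the coarsest approximation coefficients,
    # and suppress noise by soft-thresholding the finest detail coefficients
    coeffs[0] = np.zeros_like(coeffs[0])
    coeffs[-1] = pywt.threshold(coeffs[-1], 0.3, mode="soft")
    coeffs[-2] = pywt.threshold(coeffs[-2], 0.3, mode="soft")

    # Reconstruct the cleaned signal from the modified coefficients
    cleaned = pywt.waverec(coeffs, "db4")[: recording.size]
    print("standard deviation of what was removed:", round(float(np.std(recording - cleaned)), 3))

The same pattern of decomposing, keeping or shrinking only the coefficients that matter, and reconstructing is what drives the image compression use case described next.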

So how does wavelet analysis work for image storage and retrieval? If you think of an image as a stream of pixelated information that varies in terms of its lightness and darkness, especially when reading from left to right and top to bottom, the information stream has characteristic oscillations that define the content areas of the picture. If we capture only the most important aspects of the contrast and color through wavelet analysis, then we can reduce the amount of information saved. By reversing the process and converting the transformed coefficients back to regular pixel data, we can accurately and precisely reconstruct the image.

Other areas of application of wavelets include:

  • The bio-medical industry performs DNA/protein and blood-pressure analysis, cancer detection, and breathing-pattern analysis in newborn babies using wavelets.
  • In government, wavelet-based techniques are employed for facial recognition, fingerprint detection, and more.
  • In the finance industry, quick variations in market prices and trading patterns are studied using wavelets.
  • In the oil and gas industry, the estimation of subsoil properties required for oil exploration employs wavelet-based analysis. This allows companies to focus on underground features likely to hold the most oil or to map out the underlying rock structure.

Though wavelets were traditionally used for image analysis and compression, they are quickly gaining momentum in a variety of problems that require careful interpretation of complex patterns present in the data at different spatial/temporal resolutions and for isolating weak signals from the noise.

Get started today to leverage the value of wavelets in SAS/IML® software to reveal underlying patterns in the data for a better understanding of your business problem.

Analytics 2013 – a best practice conference for analytics professionals

We are in the final countdown to the Analytics 2013 Conference to be held next week, October 21-22, at the Hyatt Regency Orlando.  There is a power-packed agenda for the conference, featuring four very strong keynote presentations from Dr. Jim Goodnight, CEO of SAS; Ed Gaffin, Walt Disney World; Will Hakes, Link Analytics; and Sven Crone, Lancaster University.

The full conference agenda includes seven vertical tracks, with up to five hourly presentations in each track on both days.  The tracks cover a wide variety of analytical topics such as Big Data, Health Care, Marketing, Forecasting, Text Mining, and Operations Research.  There will be something for everyone, regardless of industry.  For the first time, we are offering a “SAS Presents” track with a focus on ‘Updates from Advanced Analytics’.

The conference has an outstanding group of sponsors,  with Teradata and Intel as Platinum Sponsors.  In addition to the commercial sponsors, we have 15 academic sponsors who will be showcasing the programs that are educating the next generation of analytical talent.

The Analytics and Data Mining Shootout features the top three student teams presenting their solutions to a prepared, hands-on problem.  Over 50 teams compete each year; the top three are awarded prizes, and three honorable mentions are announced.  Support for the competitions has been provided over the past seven years by the Institute for Health & Business Insight.  Stop by the presentations on Tuesday and discover the amazing work these students produce.

In addition to the shootout competition, we have over 50 posters that will be on display.  Students competed in a poster competition, and six poster winners were selected to receive an all-expenses paid trip to the conference.  Visit the sessions and chat with these student researchers.

The activities planned for attendees begin on Saturday with an offering of SAS' certification exam for predictive modeling using SAS® Enterprise Miner™ and continue on Sunday morning with additional certification exams.  A number of workshops are scheduled for Sunday.  Following the completion of the conference agenda on Tuesday afternoon, eight three-day courses are offered beginning Wednesday morning.  Course topics include predictive modeling, data mining, operations research, customer segmentation, econometrics, and forecasting.  The week is rounded out with four two-day courses and several one-day events.  These classes cover text mining, high-performance analytics, and visual analytics.

So, come join us for the Analytics 2013 Conference in Orlando and stay over for some great training opportunities.  Learn, share and network during the day and enjoy the social activities in the evenings.