Is Poker a Skill Game? A Panel Data Analysis

The annual SAS Analytics Conference is upon us again. This year it is known by a different name, Analytics Experience 2016, but the location, Las Vegas, is the same as it has been the previous two years. Just like last year, I will be attending and presenting on analytics for panel data using SAS/ETS® for econometrics and time series.

While preparing for my trip I was reminded of a paper I once read in Chance magazine (Croson, Fishman and Pope 2008) that concluded that poker, like golf, is a game of skill rather than luck.  The paper was published in 2008 during the heyday of televised poker, when it seemed that ESPN aired poker tournaments and little else.  The paper especially struck me because it quoted one of my favorite movies:

"Why do you think the same five guys make it to the final table of the World Series of Poker every year? What are they, the luckiest guys in Las Vegas?" – Mike McDermott (played by Matt Damon in Rounders)

Upon rereading the paper I realized the datasets the authors gathered followed a design for panel data.

Panel data occur when a set of individuals, or panel, are each measured on several occasions. Panel data are ubiquitous in all fields, because they allow each individual to act as their own control group. That allows you to focus on identifying causal relationships between response and regressor, knowing that you can control for all factors specific to the individual, both measured and unmeasured.

In the case of Croson et al. (2008), the individuals were poker players whose results were recorded over multiple poker tournaments. The authors gathered two panel datasets, one for poker players and one for professional golfers. They surmised that if the associations you see for poker mimic those for golf, then you should conclude that poker, like golf, is a game of skill. After all, one would never theorize that Tiger Woods has won 14 major championships based purely on good karma.

Focusing on the data for poker, the authors gathered tournament results on 899 poker players. Because poker tournaments vary in the number of entries, only results in the top 18 were considered, and that number was chosen because it corresponds to the final two tables of 9 players each. The response was the final rank (1 through 18, lower being better) and the regression variables were three measures of previous performance. One such measure was experience, a variable indicating whether the player had a previous top 18 finish.

Among other similar analyses, the authors fit a least-squares regression of rank on experience:

\(Rank_{ij} = \beta_{0} + \beta_{1} Experience_{ij} + \epsilon_{ij}\)

where i represents the player and j the player’s ordered top-18 finish.  From the analysis they found a statistically significant negative association between current rank and previous success. Because lower ranks are better, they concluded that good previous performance was associated with good present performance. Furthermore, the magnitude of the association was analogous to the parallel analysis they performed for golf. They concluded that because you can predict current results based on previous performance – in the same way you can with golf – then poker must be a skill game.
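To make the setup concrete, here is a minimal sketch of that pooled regression in Python with numpy. The panel is simulated and every number below is made up for illustration; the authors' actual analysis was a least-squares fit on the real tournament data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel: 600 player-tournament observations (the real data
# cover 899 players; all values here are simulated for illustration).
n = 600
experience = rng.integers(0, 2, n).astype(float)      # previous top-18 finish?
rank = 9.5 - 1.2 * experience + rng.normal(0, 3, n)   # lower rank = better

# Pooled OLS of Rank on Experience: Rank = b0 + b1*Experience + e
X = np.column_stack([np.ones(n), experience])
b0, b1 = np.linalg.lstsq(X, rank, rcond=None)[0]
print(f"intercept = {b0:.2f}, experience effect = {b1:.2f}")
```

Because lower ranks are better, a negative estimate for the experience coefficient corresponds to previous success predicting present success.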

The authors used simple least squares regression, with the only adjustment for the panel design being that they calculated "cluster robust" standard errors that controlled for intra-player correlation. They did not consider directly whether there were any player effects in the regression.

After obtaining the data, I used PROC PANEL in SAS/ETS to explore this issue.  I considered three different estimation strategies applied to the previous regression. PROC PANEL compactly summarized the results as follows:

Table: Comparison of model parameter estimates

The OLS Regression column precisely reproduces the analysis of Croson et al. (2008) and shows a significant negative association between current rank and previous experience.  The Within Effects column is from a fixed-effects estimation that utilizes only within-player comparisons. You can interpret that coefficient (0.39) as the effect of experience for a given player. Conversely, the Between Effects column is from a regression using only player-level means, that is, the estimator uses only between-player comparisons. Because the estimator of the within effect for experience is not significant and that for the between effect is strongly significant, you can conclude the data exhibit substantial latent player effects. That is not surprising, because measures of player ability (technical, psychological or mystical) weren’t included in the model.
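The within and between estimators are easy to describe: the within estimator regresses player-demeaned responses on player-demeaned regressors, and the between estimator regresses player-level means on player-level means. Here is a toy numpy sketch, with a simulated latent "skill" effect standing in for unmeasured player ability; everything here is hypothetical, and PROC PANEL does the real work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated panel: 150 players, 4 finishes each; latent "skill" raises
# experience and lowers (improves) rank, mimicking a player effect.
n_players, n_per = 150, 4
skill = rng.normal(0, 2, n_players)
player = np.repeat(np.arange(n_players), n_per)
experience = (skill[player] + rng.normal(0, 1, n_players * n_per) > 0).astype(float)
rank = 9.5 - skill[player] + rng.normal(0, 3, n_players * n_per)

def within_between(y, x, ids):
    """Within (fixed-effects) and between slope estimates for one regressor."""
    n = ids.max() + 1
    cnt = np.bincount(ids, minlength=n)
    ybar = np.bincount(ids, weights=y, minlength=n) / cnt
    xbar = np.bincount(ids, weights=x, minlength=n) / cnt
    yd, xd = y - ybar[ids], x - xbar[ids]            # demeaned within players
    within = (xd @ yd) / (xd @ xd)
    yc, xc = ybar - ybar.mean(), xbar - xbar.mean()  # player-level means
    between = (xc @ yc) / (xc @ xc)
    return within, between

w, b = within_between(rank, experience, player)
print(f"within effect = {w:.2f}, between effect = {b:.2f}")
```

In this simulation the between effect is large because skill drives both experience and rank, while the within effect is near zero: the same qualitative pattern that signals a substantial latent player effect.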

The augmented analysis does nothing to invalidate the Croson et al. (2008) conclusion that poker involves more skill than luck. However, to believe that premise you must begin with the untested (yet reasonable) assumption that luck is something that, even if it is a factor in one tournament, cannot be maintained over a career. You must rely on common sense and not the data at hand to rule out luck as a latent (and mystical) player ability. With that question settled, the data go on to indicate that luck is not even a factor for single tournaments, each of which can be thought of as a long-run realization of hundreds of poker hands.

The PROC PANEL output merely furthers the point that some poker players (like their golfing counterparts) are just better at their craft than others.

Then again, maybe they really are the luckiest guys in Vegas.

If you are curious to know more about panel data, what's available in SAS and how it may be applied, you can catch my theater presentation (that's just a fancy way to say "talk"), "Modeling Panel Data: Choosing the Correct Strategy," at the SAS Analytics Experience conference September 12-14 in Vegas. I'll be speaking on Wednesday, September 14, 1:15 PM - 2:00 PM. You will not catch me at the poker tables, however. My poker game stinks.



Croson, R., P. Fishman, and D. G. Pope. 2008. "Poker Superstars: Skill or Luck? Similarities between golf - thought to be a game of skill - and poker." Chance 21(4): 25-28.

SAS Institute Inc. "The PANEL Procedure." SAS/ETS® 14.1 documentation.


Spatial econometric modeling using PROC SPATIALREG

In our previous post, Econometric and statistical methods for spatial data analysis, we discussed the importance of spatial data. For most people, understanding that importance is relatively easy because spatial data are often found in our daily lives and we are all accustomed to analyzing them. We can all relate to the first law of geography—“Everything is related to everything else, but near things are more related than distant things”—and we can agree that our interaction with close things around us plays an important role in our decision process. Applications of spatial data in our daily lives are often seamless, and you could argue that we are all spatial statisticians and econometricians without even realizing it. Although most human beings have an innate ability to incorporate spatial information, computer-based analytics need to be given tools to include such information in their analyses. SAS/ETS 14.2 introduces one such tool, the SPATIALREG procedure, which enables you to include spatial information in the analysis and improve the econometric inference and statistical properties of estimators.

In this post, we discuss how you can use the SPATIALREG procedure to analyze 2013 home value data in North Carolina at the county level. The five variables in the data set are county (county name), homeValue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people with a bachelor's degree or higher who live in the county), and crime (rate of Crime Index offenses per 100,000 people). The data for home values, income, and bachelor's degree percentages in each county were obtained from the website of the United States Census Bureau and computed using the 2009–2013 American Community Survey five-year estimates. Data for crime were retrieved from the website of the North Carolina Department of Public Safety. For the purpose of numerical stability and interpretation, all five variables are log-transformed during the process of data cleansing. We use this data set to demonstrate the modeling capabilities of the SPATIALREG procedure and to understand the impact of household income, crime rate, and educational attainment on home values.

As a preliminary data analysis, we first show a map of North Carolina that depicts the county-level home values in Figure 1. It is easy to see that the home values tend to be clustered together. Higher values are found in the coastal, urban, and mountain areas of North Carolina and lower home values can be found in rural areas. Home values of neighboring counties more closely resemble each other than home values of counties that are far apart.

Figure 1: Median value of owner-occupied housing units


From a modeling perspective, findings from Figure 1 suggest that the data might contain a spatial dependence, which needs to be accounted for in the analysis.  In particular, an endogenous interaction effect might exist in the data—home values tend to be spatially correlated with each other. PROC SPATIALREG enables you to analyze the data by using a variety of spatial econometric models.

Table 1: parameter estimates for a linear regression model


To lay the groundwork for discussion, you can start the analysis with a linear regression. For this model, the value of Akaike’s information criterion (AIC) is –106.12. The results of parameter estimation from a linear regression model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.01 level. Moreover, crime exerts a negative impact on home values, indicating that high crime rates reduce home values. On the other hand, both income and bachelor have positive impacts on home values.

Figure 2 provides the plot of predicted homeValue from the linear regression model. Although the comparison of Figure 1 and Figure 2 might suggest that predicted homeValue from the linear regression model captures the general pattern in the observed data, you need to be careful about some underlying assumptions for linear regression. Among those assumptions, a critical one is that the values of the dependent variable are independent of each other, which is not likely for the data at hand. As a matter of fact, both Moran’s I test and Geary’s C test suggest that there is a spatial autocorrelation in homeValue at the 0.01 significance level. Consequently, if you ignore the spatial dependence in the data by fitting a linear regression model to the data, you run the risk of false inference.
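Moran's I is simple enough to sketch directly: it compares each value's deviation from the mean with the weighted deviations of its neighbors, with positive values suggesting clustering. A toy numpy example with a hypothetical four-county chain follows; PROC SPATIALREG (and related SAS procedures) compute the real test with proper inference.

```python
import numpy as np

def morans_i(y, W):
    """Moran's I: (n / sum(W)) * (z' W z) / (z' z), with z = y - mean(y)."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# Toy example: four counties in a chain (0-1-2-3), each bordering the next.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)             # row-standardize

home_value = np.array([10.0, 8.0, 2.0, 0.0])  # clustered: high end, low end
print(f"Moran's I = {morans_i(home_value, W):.3f}")  # positive => clustering
```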


Figure 2: predicted median value of owner-occupied housing units using a linear regression model


Because of the spatial dependence in homeValue, a good candidate model to consider might be a spatial autoregressive (SAR) model for its ability to accommodate the endogenous interaction effect. You can use PROC SPATIALREG to fit a SAR model to the data. Before you proceed with model fitting, you need to provide a spatial weights matrix. Generally speaking, a spatial weights matrix summarizes the spatial neighborhood structure; entries in the matrix represent how much influence one unit exerts over another.

Table 2: parameter estimates for a SAR model


The spatial weights matrix specification is of vital importance in spatial econometric modeling. Despite many different ways of specifying such a matrix, results can be sensitive to the choice of a spatial weights matrix.  Without delving into the nitty-gritty of such choice, you can simply define two counties to be neighbors of each other if they share a common border. After creating the spatial weights matrix, you can feed it into PROC SPATIALREG and run a SAR model. Table 2 presents the results of parameter estimation from a SAR model.
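A first-order contiguity matrix of this kind is straightforward to build from a list of bordering pairs. Here is a sketch with a hypothetical five-county border list (the real matrix covers all 100 North Carolina counties):

```python
import numpy as np

# Hypothetical first-order contiguity: county pairs that share a border.
borders = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5

W = np.zeros((n, n))
for i, j in borders:
    W[i, j] = W[j, i] = 1.0            # neighbors get weight 1, symmetric

W /= W.sum(axis=1, keepdims=True)      # row-standardize: each row sums to 1
```

Row standardization makes the spatial lag Wy an average over each county's neighbors, which is the most common convention for feeding a weights matrix to a SAR model.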

For this model, the value of AIC is –110.79. The regression coefficients that correspond to income, crime, and bachelor are all significantly different from 0 at the 0.01 level of significance. Both income and bachelor exhibit a significantly positive short-run direct impact on home values. In contrast, crime rate shows a significantly negative short-run direct impact on home values. In addition, the spatial autoregressive coefficient ρ is significantly different from zero at the 0.01 level, suggesting that there is a significantly positive spatial dependence in home values.

Figure 3 shows the predicted values for homeValue from the SAR model. Comparing Figures 1 and 3 suggests that the fitted home values capture the trend in the data reasonably well.
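Those predictions come from the reduced form of the SAR model, y = (I − ρW)⁻¹(Xβ + ε): solving a linear system in the estimated ρ and β propagates each county's covariates through its neighbors. A sketch with hypothetical estimates (ρ, β, and X below are all made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Row-standardized contiguity matrix for five hypothetical counties.
n = 5
W = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
W /= W.sum(axis=1, keepdims=True)

rho_hat = 0.4                                   # hypothetical estimates
beta_hat = np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Mean prediction from the reduced form: (I - rho*W) y_hat = X beta_hat
y_hat = np.linalg.solve(np.eye(n) - rho_hat * W, X @ beta_hat)
```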

Figure 3: predicted median value of owner-occupied housing units using a SAR model


In this post, we introduced the SPATIALREG procedure, fit a SAR model, and compared predicted values from the SAR model to those from linear regression. Even though the SAR model presented an improvement over the linear model in terms of AIC, many other models are available in the SPATIALREG procedure that might provide even more desirable results and more accurate predictions. These models include the spatial Durbin model (SDM), spatial error model (SEM), spatial Durbin error model (SDEM), spatial autoregressive confused (SAC) model, spatial autoregressive moving average (SARMA) model, spatial moving average (SMA) model, and so on. In the next post, we will discuss their features and show you how to select the most suitable model for the home value data set. We will also be giving a talk, "Location, Location, Location! SAS/ETS® Software for Spatial Econometric Modeling," at the SAS Analytics Experience conference September 12-14, 2016 in Las Vegas, so stop by and let's talk spatial!

This post was co-written with Jan Chvosta.



The benefits of artificial intelligence


Photo courtesy of U.S. Luggage, Briggs & Riley

Asking about the benefits of artificial intelligence and machine learning reminds me a little of the transition to suitcases with wheels. Do you remember lugging around those old suitcases? If not, good for you - this original advertisement from US Luggage will take you back! Thank Bernard Sadow for persistence with his idea to add wheels, because when he pitched his idea people thought he was crazy. Surely no one would want to pull their own suitcase? His patent application stated, “Whereas formerly, luggage would be handled by porters and be loaded or unloaded at points convenient to the street, the large terminals of today, particularly air terminals, have increased the difficulty of baggage-handling….Baggage-handling has become perhaps the biggest single difficulty encountered by an air passenger.”

We can wheel our own suitcases these days, but baggage handling is still a challenge for airlines. One of the benefits of artificial intelligence and machine learning is the improvements companies like Amadeus are applying to baggage handling in airports to reduce the risk of lost bags. And to improve the overall customer experience for travelers moving through Frankfurt Airport, its operator Fraport uses predictive modeling from SAS, part of the extensive set of machine learning capabilities from SAS.

I hear plenty of verbal and online chatter predicting that artificial intelligence and machine learning will eliminate jobs. But a review of history shows that many such past predictions have not come true. Remember the introduction of ATMs? The expectation was that bank tellers would become an anachronism, but in fact demand for tellers has grown faster than average. Automation reduced the number of tellers needed per bank, but the savings allowed banks to open new branches, thus stimulating demand for tellers.

The same pattern repeated with the introduction of grocery store scanners and cashiers, and with electronic document discovery and paralegals. Today your friendly bellhop still greets you at the hotel as you roll your suitcase to the entrance; in fact, the US Bureau of Labor Statistics predicts average growth in demand for baggage porters and bellhops. I believe that the benefits of artificial intelligence and machine learning include increased productivity that will lead to job creation. Plenty of enthusiastic electronic ink has been spilled about the benefits of artificial intelligence and machine learning for business, so I'm going to focus on another reason why I'm excited about this field: the public benefit in areas like our health, economic development, the environment, child welfare, and public services.

Machine learning and artificial intelligence help use data for good

In a blog post on LinkedIn, Microsoft CEO Satya Nadella envisions a future where computers and humans work together to address some of society's biggest challenges. Instead of believing computers will displace humans, he argues that at Microsoft "we want to build intelligence that augments human abilities and experiences." He understands the trepidation some have about jobs and even the supposed Singularity (the idea that machines will run amok and take over), writing "…we also have to build trust directly into our technology," to address privacy, transparency and security. He cites an example of the social benefits of machine learning and artificial intelligence: a young Microsoft engineer who lost his sight at an early age but who works with his colleagues to build what is essentially a mini-computer, worn like glasses, that gives him information in an audible form he can consume.

Nadella's example of his young colleague is one of many where machine learning and artificial intelligence are making fantastic advances for people with disabilities, in the form of various health care wearables and prosthetics. Health care is replete with examples, as deep learning and other techniques show rapid gains on humans for diagnosis. For example, the deep learning startup Enlitic makes software that in trials is 50% more accurate than humans in classifying malignant tumors, with no false negatives (i.e., saying that scans show no cancer when in fact there is malignancy) when tested against three expert human radiologists (who produced false negatives 7% of the time). In the field of population health management, AiCure makes a mobile phone app that increases medication adherence among high-risk populations using facial recognition and motion detection. Their technology makes sure that the right person is taking the right medication at the right time.

There are nonprofits that have been drawn to the benefits of artificial intelligence and machine learning, such as DataKind, which "harnesses the power of data science in the service of humanity." In a project with the nonprofit GiveDirectly, DataKind volunteers worked on an algorithm to classify satellite images to identify the poorest households in rural villages in Kenya and Uganda. A team from SAS is working with DataKind and the Boston Public Schools to improve transportation for their students, using optimization. Thorn: Digital Defenders of Children uses technology and innovation to fight child sexual exploitation. Much of the trafficking is done online, so analysis of chatter, images, and other data can aid in identifying the children and the predators.

Trafficking in elephant ivory leads to an estimated 96 elephant deaths every day, but a machine learning app is helping wildlife patrols predict the best routes to track poachers. The app drew on 14 years of poaching activity data, produces routes that are randomized so poachers can be foiled, and learns from new data entered. So far its routes have outperformed those of previous ranger patrols. Protection Assistant for Wildlife Security (PAWS) was developed by Milind Tambe, a professor at the University of Southern California, based on security game theory. Tambe has also built these kinds of algorithms for federal agencies like Homeland Security, the Transportation Security Administration, and the Coast Guard to optimize the placement of staff and surveillance to combat smuggling and terrorism.

Machine learning and artificial intelligence in the public sector

Other public sector organizations also realize the benefits of artificial intelligence and machine learning. The New York Police Department has developed the Domain Awareness System, which uses sensors, databases, devices, and more, along with operations research and machine learning, to put updated information in the hands of cops on the beat and at the precincts. Delivering this information even faster than the dispatchers means cops are better prepared when they arrive on the scene. Teams from the University of Michigan's Flint and Ann Arbor campuses are working together with the City of Flint to use machine learning and predictive algorithms to predict where lead levels are highest and build an app to help both residents and city officials with resources to better identify issues and prioritize responses. It took a lot of work to gather all the disparate information together, but interestingly their initial findings indicate that the troubles are not in the lines themselves but in individual homes, although the distribution of the problems doesn't cluster like you'd expect.

These are just a few of the many examples of the social benefits of artificial intelligence and machine learning, but they illustrate why I’m excited about their potential to improve our society. Automation fueled by artificial intelligence is likely to result in what economists call "structural unemployment," when there is a mismatch between the skills some workers have and those the economy demands, typically a result of technological change. This disruption is undoubtedly devastating for those who lose their jobs, and I believe as a society we have an obligation to provide workforce development programs and training to help those impacted shift to new skills. But I am hopeful that machine learning will be able to offer help to those disrupted by these changes.

And it may even offer job opportunities. SAS is working with our local Wake Technical Community College, which has launched the nation's first Associate's Degree in Business Analytics, fueled in part by a grant from the US Trade Adjustment Assistance Community College and Career Training initiative. They will also offer a certificate program aimed at displaced or underemployed workers, who earn a certificate of training by completing 12 credit hours. While these graduates will not likely start off doing machine learning, they may move in that direction, and at a minimum contribute to teams that do use these methods.

And LinkedIn uses machine learning extensively, for recommendations, image analysis, and more, but through their Economic Graph and LinkedIn for Good initiatives the company aims to connect talent to opportunities by filling in gaps in skills. In partnership with the Markle Foundation their new LinkedIn Cities program offers training for middle skill workers, those with a high school diploma and some college but no degree, and is piloting in Phoenix and Denver. The combination of online and offline tools with connections to educators and employers will help these individuals improve their opportunities.

SAS will highlight the data for good movement at our upcoming Analytics Experience conference in Las Vegas September 12-14. Jake Porway, the Founder and Executive Director of DataKind, will be one of the keynote speakers. My colleague Jinxin Yi will be giving a super demo on the SAS/DataKind project I mentioned that aims to improve transportation for the Boston Public Schools. His session is one of several that have been tagged in the program as Data for Good sessions. We'll have a booth where you can learn more and get engaged with #data4good. Stop by and say hi to me if you're there!

Suitcase image credit: Photo courtesy of U.S. Luggage, Briggs & Riley
Bank teller image credit: photo by AMISOM Public Information // attribution by creative commons
Xray image credit: photo by Yale Rosen // attribution by creative commons
Elephants image credit: photo by Michele Ursino // attribution by creative commons
NYPD image credit: photo by Justin Norton // attribution by creative commons
Bus image credit: photo by ThoseGuys119 // attribution by creative commons

Machine learning applications for NBA coaches and players

Machine learning applications for NBA coaches and players might seem like an odd choice for me to write about. Let us get something out of the way: I don’t know much about basketball. Or baseball. Or even soccer, much to the chagrin of my friends back home in Europe. However, one of the perks of working in data science and machine learning is that I can still say somewhat insightful things about sports, as long as I have data. In other words, instant expertise! So with that expertise I’ll weigh in to offer some machine learning applications for basketball.

During a conversation with my good colleague Ray Wright, who does know quite a bit about basketball and had been looking at historical data from NBA games, we suddenly realized something about player shooting. There are dozens of shot types, ranges and zones… and no player ever tries them all. What if we could automatically suggest new shot combinations appropriate for each individual player? Who knew there could be machine learning applications for the NBA?

Such a system that suggests actions or items to users is called a recommender system. Large companies in retail and media regularly use recommender systems to suggest movies, songs and other items to users based on their behavior history, as well as that of other similar users, so you've likely used such a system from Amazon, Netflix, etc. In basketball terms, the users are the players, and the items are shot types. As with the other domains mentioned above, the available data does not even come close to covering all possible combinations, in this case of players and shots. When the available data matches this scenario it is called sparse. And fortunately, SAS has a new offering, SAS® Viya™ Data Mining and Machine Learning, that includes a new method specifically designed for sparse predictive modeling: PROC FACTMAC, for factorization machines.

Let me quickly introduce you to factorization machines. Originally proposed by Steffen Rendle (Rendle 2010), they are a generalization of matrix factorization that allows multiple features with high cardinality (lots of unique values) and sparse observations. The parameters of this flexible model can be estimated quickly, even in the presence of massive amounts of data, through stochastic gradient descent, which is the same type of optimization solver behind the recent successes of deep learning.
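For intuition, here is what repeated stochastic gradient descent updates look like for the bias-plus-interaction part of such a model, stripped down to a single (player, action) observation with squared-error loss. This is a deliberate simplification of what PROC FACTMAC actually does, and all values are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
k, lr = 8, 0.05
v_p = rng.normal(0, 0.1, k)    # latent factors for one player
v_a = rng.normal(0, 0.1, k)    # latent factors for one action type
b_p = b_a = 0.0                # player and action biases

y = 1.0                        # observed outcome for this pair (made shot)
for _ in range(200):           # repeated SGD updates on the same observation
    pred = b_p + b_a + v_p @ v_a
    err = pred - y             # gradient of 0.5*(pred - y)^2 w.r.t. pred
    b_p -= lr * err
    b_a -= lr * err
    # update both factor vectors using the pre-update values
    v_p, v_a = v_p - lr * err * v_a, v_a - lr * err * v_p
```

Each update nudges the biases and factors so the prediction moves toward the observed outcome; over many observations the shared factors come to encode player and shot-type structure.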

Factorization Machines return bias parameters and latent factors, which in this case can be used to characterize players and shot combinations. You can think of a player’s bias as the overall propensity to successfully score, whereas the latent factors are more fine-grained characteristics that can be related to play style, demographics and other information.
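Putting the pieces together, the prediction for a (player, shot-action) pair is the global bias, plus the two feature biases, plus the inner product of the two latent factor vectors. Here is a sketch of scoring and ranking with random stand-in parameters; the dimensions match the post, but the real parameter values come from fitting PROC FACTMAC.

```python
import numpy as np

rng = np.random.default_rng(3)
n_players, n_actions, k = 359, 36, 25          # dimensions from the post

w0 = 0.1                                       # global bias
b_player = rng.normal(0, 0.1, n_players)       # player biases
b_action = rng.normal(0, 0.1, n_actions)       # action-type biases
V_player = rng.normal(0, 0.1, (n_players, k))  # latent factors
V_action = rng.normal(0, 0.1, (n_actions, k))

def predicted_log_odds(p, a):
    """FM prediction for (player p, action a): biases plus factor interaction."""
    return w0 + b_player[p] + b_action[a] + V_player[p] @ V_action[a]

# Recommend: rank all action types for player 0 by predicted log-odds.
scores = np.array([predicted_log_odds(0, a) for a in range(n_actions)])
top3 = np.argsort(scores)[::-1][:3]
```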

Armed with this thinking, our trusty machine learning software from SAS, and some data science tricks up our sleeves, we decided to try our hand at machine learning applications in the form of automated basketball coaching (sideline yelling optional!). Before going into our findings, let’s take a look at the data. We have information about shots taken during the 2015-2016 NBA basketball season through March 2016. A total of 174,190 shots were recorded during this period. Information recorded for each shot includes the player, shot range, zone, and style (“action type”), and whether the shot was successful. After preprocessing we retained 359 players, 4 ranges, 5 zones, and 36 action types.

And here is what we found after fitting a factorization machine model. First, let's examine some established wisdom: does height matter much for shot success? As the box-and-whisker plot below shows, the answer is yes, somewhat, but not quite as much as one would think. The figure depicts the distribution of bias values for players, grouped by their height. There is a bump for the 81-82 inch group, but it is not overwhelming. And it decays slightly for the 82-87 inch group.

Figure: distribution of player bias values, grouped by height

Now look at the following figure, which shows made shots (red) vs missed (blue), by location in the court and by action type. There is definitely a very significant dependency! Now if only someone explained to me again what a “driving layup” is…

Figure: made shots (red) vs. missed shots (blue), by court location and action type

Let us investigate the biases again, now by action type. The following figure shows the bias values in a horizontal bar plot. It is clear that all actions involving “dunk” lead to larger bars, corresponding to greater probability of success.

Figure: bias values by action type

What about other actions? What should we recommend to the typical player? That is what the following two tables show.

Most recommended shots

Least recommended shots

Based on the predicted log-odds, the typical player should strive for dunk shots and avoid highly acrobatic and complicated actions, or highly contested ones such as jump shots. Now, of course not all players are "typical." The following figure shows a 2D embedding of the fitted factors for players (red) and actions (blue). There is significant affinity between Manu Ginobili and driving floating layups. Players Ricky Rubio and Derrick Rose exhibit similar characteristics based on their shot profiles, as do Russell Westbrook and Kobe Bryant, and others. Also, dunk shot action types form a grouping of their own!
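The groupings in that embedding reflect the geometry of the latent factors: players with similar shot profiles end up with factor vectors pointing in similar directions, which you can quantify with cosine similarity. A sketch with made-up factor vectors (the real ones are estimated by the model, and in 25 dimensions rather than 4):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two factor vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical fitted factors for three players (k = 4 for illustration).
factors = {
    "Westbrook": np.array([0.9, 0.1, 0.4, -0.2]),
    "Bryant":    np.array([0.8, 0.2, 0.5, -0.1]),
    "Rubio":     np.array([-0.3, 0.7, -0.1, 0.6]),
}
sim_wb = cosine(factors["Westbrook"], factors["Bryant"])  # high: similar profiles
sim_wr = cosine(factors["Westbrook"], factors["Rubio"])   # low: different profiles
print(f"Westbrook-Bryant: {sim_wb:.2f}, Westbrook-Rubio: {sim_wr:.2f}")
```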

Figure: 2D embedding of the fitted factors for players (red) and action types (blue)

Overall, our 25-factor factorization machine model is successful in predicting log-odds of shot success with high accuracy: RMSE=0.929, which outperforms other models such as SVMs. Recommendations can be tailored to specific players, and many different insights can be extracted. So if any NBA coaches or players want to call about our applications of machine learning for basketball we are available for consultation!

We are delighted that this analysis has been accepted for presentation at the 2016 KDD Large-Scale Sports Analytics workshop this Sunday, August 14, where Ray will be representing our work with this paper: "Shot Recommender System for NBA Coaches." And my other colleague (and basketball fan), Brett Wujek, will be giving a demo theater presentation on “Improving NBA Shot Selection: A Matrix Factorization Approach” at the SAS Analytics Experience Conference September 12-14, 2016 in Las Vegas.

Surely, many basketball experts will be able to give us good tips to augment our applications of machine learning for the NBA. One thing is certain, though: when in doubt, always dunk!


Rendle, S. (2010). Factorization Machines. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM).


Multi-echelon inventory optimization at a major durable goods company

Multi-echelon inventory optimization is ever more a requirement in this era of globalization, which is both a boon and a bane for manufacturing companies. Optimizing pricing is also important. Global reach allows these companies to expand to new territories but at the same time increases the competition on their home turf. Consider General Motors (GM), which faces new competition at home from Asian manufacturers such as Hyundai and Kia but has also benefited from global expansion. In 2015, GM sold 63% of its vehicles outside North America. In fact, GM and its joint ventures sold as many vehicles in China as in North America. The supply chains of most manufacturing companies have also expanded globally, because most of them source raw materials from different parts of the world. That added complexity is why a holistic approach to optimizing the entire supply chain is critical. Customers still expect the same great service despite the complexity, which is why multi-echelon inventory optimization can provide real benefits to global companies, particularly if they also optimize their pricing.

A major durable goods company in this situation partnered with SAS to leverage our advanced analytics capabilities to improve their profitability and right-size their inventory. The company built a new pricing platform that helps them design data-driven pricing and promotion strategies, and SAS® Demand-Driven Planning and Optimization provided a structured and efficient process to right-size the inventory in their complex, multi-echelon supply chain network. This new adaptive platform uses SAS/OR, SAS Visual Analytics, and SAS Office Analytics to provide a scalable solution for multi-echelon inventory optimization.

The durable goods company sells their products to end consumers through retailers, who are the company's direct customers. Transactions between the company and its retail customers are called invoice data, and transactions between retail customers and end consumers are called retail sales data; the retail sales data is ten times larger than the invoice data. The ability to visualize these transaction data and identify patterns and areas of improvement is critical to designing a successful pricing strategy.

Process flow

The new platform consists at its heart of a powerful server that can easily be scaled to the growing needs of the durable goods company. SAS/ACCESS engines enable easy connections to a number of different data sources, such as .NET databases or ODBC connections. The results from this computing environment can be surfaced on the web using SAS Visual Analytics or through desktop applications such as Microsoft Excel, Word, or PowerPoint using SAS Office Analytics.

The new pricing process consists of three simple steps:

Varun chart 2.jpg

First step: identify areas of improvement

Two primary areas of improvement for this company related to profitability and market share. Using SAS Visual Analytics, you can visualize data with heat maps, waterfall charts, and time series plots to quickly identify the customers and products that need improvement. Below is an example of how easy it is to identify areas of improvement using a SAS Visual Analytics report in the form of a Pareto chart. The X axis shows groupings of customers, the bars show net sales by customer, and the green line indicates the profitability of these customers for the manufacturer. Customers are sorted in descending order of net revenue, so the largest customers appear on the left side of the Pareto chart and the smallest on the right. Due to economies of scale and their buying power, the largest customers are expected to provide the lowest profit margin percentage to the manufacturer, so in an ideal scenario the profitability line should increase monotonically from left to right. In the chart below, Cust3, Cust5, and Cust9 are clear outliers whose profitability can be improved.

Varun chart 3

Second step: design strategies

The second step involves designing and simulating different pricing strategies to overcome the challenges identified in step 1; after evaluating them, we select the best strategy for implementation. Consider Cust2 from the chart above, whose cost components we break down in detail using a waterfall chart. The first bar indicates the gross price. The second bar indicates the gross-to-net (G2N) discounts that are offered to customers to increase sales. An expert can immediately spot that 13% is very high, because G2N discounts typically range between 4% and 10%. So in this case, the way to improve profitability is to reduce the G2N discounts.

Varun chart 4

Lowering the G2N discounts will certainly lower sales volume. Win rate curves help quantify the volume changes that result from price changes: using SAS/OR, you can derive a piecewise linear equation that describes the cumulative historical win rate at different price points, and then use this model to calculate the volume changes due to price changes. For example, consider a product with a selling price of $300/item and an average cost of $200/item. Current annual sales volume is 376,000 (win rate = 95%), and annual earnings before interest and taxes (EBIT) is $5.6 million.

Varun chart 5

If we were to increase the price to $370, the win rate drops to 82%; the new expected annual sales volume is 325,000, and annual EBIT is $27.5 million. Sales volume can drop sharply if you raise prices too high: at $580, the win rate falls to only 18%, which might be unacceptable, because most companies want to maintain a minimum market share. SAS/OR can solve this optimization problem to identify the price that maximizes profitability while satisfying constraints such as market share requirements.
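To make the mechanics concrete, here is a minimal sketch (in Python rather than SAS/OR) of pricing against a piecewise linear win-rate curve. The anchor points are read off the figures above; the total market size, the margin-only profit measure, and the market-share floor are simplifying assumptions for illustration (the article's EBIT figures include costs not modeled here):

```python
import numpy as np

# Hypothetical win-rate curve anchors: (price, win rate).
# A real curve would come from a SAS/OR piecewise linear fit.
prices = np.array([300.0, 370.0, 580.0])
win_rates = np.array([0.95, 0.82, 0.18])

MARKET = 376_000 / 0.95   # addressable annual volume implied by the text
UNIT_COST = 200.0
MIN_WIN_RATE = 0.50       # assumed market-share floor

def ebit(price):
    """Margin-only profit at a price (a simplification of EBIT)."""
    wr = np.interp(price, prices, win_rates)   # piecewise linear win rate
    volume = wr * MARKET
    return volume * (price - UNIT_COST), wr

# Grid-search the price that maximizes profit subject to the share floor
candidates = np.arange(300, 581, 5)
feasible = [(p, *ebit(p)) for p in candidates if ebit(p)[1] >= MIN_WIN_RATE]
best_price, best_ebit, best_wr = max(feasible, key=lambda t: t[1])
print(f"best price ${best_price}, profit ${best_ebit/1e6:.1f}M, win rate {best_wr:.0%}")
```

The grid search stands in for the constrained optimization that SAS/OR would perform; the structure (win-rate curve in, profit-maximizing price out) is the same.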

Varun chart 6

Third step: execute and monitor

In the third step, we implement the strategies identified in step 2 and, importantly, monitor their performance using a SAS Visual Analytics report. Real-time monitoring enables us to refine a strategy after it goes live to make sure we achieve the goals outlined in step 2.

Varun chart 7

Inventory Optimization

The durable goods company implemented SAS® Inventory Optimization Workbench to right-size the inventory and achieve multi-echelon inventory optimization across their complex supply chain network. The SAS inventory optimization process consists of four steps:

Varun chart 8.jpg

In the first step, we identify the probability distribution that best fits the forecasted demand. The SAS Inventory Optimization Workbench (IOW) supports a number of continuous distributions, such as normal and lognormal, and discrete distributions, such as Poisson and binomial. One of the key differentiators of SAS IOW is its ability to handle intermittent demand. In most supply chains, the top 20% of items account for 80% of total demand, leaving many items with low demand and high variance. SAS IOW uses a specialized technique called Croston's method for these intermittent-demand items: the nonzero demand size is estimated through exponential smoothing, and the interval between demands is estimated separately. In addition, we use historical performance to estimate the right amount of safety stock to satisfy the service level requirements.
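For readers who want to see the idea rather than the workbench internals, here is a minimal sketch of Croston's method in Python. The smoothing constant and initialization are illustrative assumptions, not IOW's implementation:

```python
# Croston's method for intermittent demand: exponentially smooth the
# nonzero demand sizes and, separately, the intervals between them.
def croston_forecast(demand, alpha=0.1):
    """Return the estimated demand rate per period after the series."""
    size = None      # smoothed nonzero-demand size
    interval = None  # smoothed interval between nonzero demands
    periods_since = 0
    for d in demand:
        periods_since += 1
        if d > 0:
            if size is None:                 # initialize on first demand
                size, interval = d, periods_since
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 0
    if size is None:
        return 0.0
    return size / interval   # expected demand per period

# Example: demand of ~8-10 units arrives roughly every 4th period,
# so the per-period rate lands near 2
print(croston_forecast([0, 0, 0, 8, 0, 0, 0, 10, 0, 0, 0, 9]))
```

Dividing the smoothed size by the smoothed interval is what keeps the forecast from collapsing toward zero on the many zero-demand periods, which is exactly where ordinary exponential smoothing struggles.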

The second step is to calculate the inventory target, or order-up-to level. The inventory target consists of three components: pipeline stock, cycle stock, and safety stock. Pipeline stock is the inventory required to cover demand during the lead time, cycle stock is the inventory required to cover the period between replenishments, and safety stock is the inventory required to cover uncertainty in demand. The variance of the forecasted demand and the desired service level for a given product-location pair are the main drivers of the safety stock calculation.
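A simplified version of this calculation, assuming normally distributed demand and illustrative parameters (this is a textbook-style sketch, not the workbench's formula):

```python
from math import sqrt
from statistics import NormalDist

# Order-up-to level = pipeline stock + cycle stock + safety stock,
# for a single product-location pair with normal demand (illustrative).
def order_up_to_level(mean_demand, demand_sd, lead_time, review_period,
                      service_level):
    pipeline = mean_demand * lead_time            # covers the lead time
    cycle = mean_demand * review_period           # covers time between replenishments
    z = NormalDist().inv_cdf(service_level)       # safety factor for the service level
    safety = z * demand_sd * sqrt(lead_time + review_period)
    return pipeline + cycle + safety

target = order_up_to_level(mean_demand=100, demand_sd=30,
                           lead_time=2, review_period=1, service_level=0.95)
print(round(target))  # pipeline 200 + cycle 100 + safety stock
```

Note how the demand variance and the service level enter only through the safety-stock term, matching the drivers described above.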

In the third step, we use Monte Carlo simulation to generate 200 scenarios of forecasted demand over the horizon and calculate KPIs such as on-hand inventory, service level, backlog, and replenishment orders. In the final step, we update the Visual Analytics report with the output from the SAS Inventory Optimization Workbench. Users can easily visualize the inventory target, forecasted demand, and projected orders over time and use these reports to spot any outliers.
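The simulation step can be sketched in a few lines: draw demand scenarios over a protection interval and score a candidate order-up-to target. The demand distribution, interval, and target below are illustrative, not the workbench's actual logic:

```python
import random

# Toy Monte Carlo evaluation of an order-up-to target: across scenarios,
# measure the cycle service level and the average backlog.
def simulate_kpis(target, mean=100, sd=30, interval=3, scenarios=200, seed=7):
    rng = random.Random(seed)
    stockout_free = 0
    backlog = 0.0
    for _ in range(scenarios):
        # demand over the protection interval (lead time + review period)
        demand = sum(max(0.0, rng.gauss(mean, sd)) for _ in range(interval))
        if demand <= target:
            stockout_free += 1
        backlog += max(0.0, demand - target)
    return stockout_free / scenarios, backlog / scenarios

service, avg_backlog = simulate_kpis(target=385)
print(f"cycle service level: {service:.1%}, average backlog: {avg_backlog:.1f}")
```

Running this across candidate targets shows the trade-off the optimization balances: higher targets raise the service level but tie up more inventory.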

Varun chart 9.jpg

Inventory is a significant investment for every company, so it is critical for the supply chain team to estimate the benefits of inventory optimization in order to set the right expectations with their management team. In addition, many parameters need to be fine-tuned periodically to optimize system performance. The SAS team developed a new simulation-optimization approach called Tuning and Validation to overcome these challenges. In the first step, tuning, we use historical data to automatically calibrate the parameters of the inventory optimization. In the second step, validation, we simulate the optimized policy and compare it to historical performance to quantify the improvements in KPIs such as inventory cost, backlog, and service level.

Varun chart 10.jpg

Conclusion

For this durable goods manufacturer, multi-echelon inventory optimization improved the service level from 65% to 92% and reduced backorders from 8% to 2% within 10 months of implementation. In addition, with the new analytical pricing platform, the company's analysts can design pricing and promotion strategies with the click of a button and monitor performance in real time with visualizations. With SAS Demand-Driven Planning and Optimization, they have an enhanced, integrated platform for demand planning and multi-echelon inventory optimization, eliminating reliance on multiple Excel-based processes. To learn more about this project, you may wish to read the SAS Global Forum 2016 paper, "Leveraging Advanced Analytics in Pricing and Inventory Decisions at a Major Durable Goods Company."


Econometric and statistical methods for spatial data analysis

We live in a complex world that overflows with information. As human beings, we are very good at navigating this maze, where different types of input hit us from every possible direction. Without really thinking about it, we take in the inputs, evaluate the new information, combine it with our experience and previous knowledge, and then make decisions (hopefully, well-informed decisions). If you think about the process and types of information (data) we use, you quickly realize that most of the information we are exposed to contains a spatial component (a geographical location), and our decisions often include neighborhood effects. Are you shopping for a new house? In the process of choosing the right one, you will certainly consider its location, neighboring locations, schools, road infrastructure, distance from work, store accessibility, and many other inputs (Figure 1). Going on a vacation abroad? Visiting a small country with a low population will probably be very different from visiting a popular destination surrounded by densely populated larger countries. All these examples illustrate the value of econometric and statistical methods for spatial data analysis.


Figure 1: Median Listing Price of Housing Units in U.S. Counties (Source:; retrieved on June 6th 2016)

We are all exposed to spatial data, which we use in our daily lives almost without thinking about it. Not until recently have spatial data become popular in formal econometric and statistical analysis. Geographical information systems (GIS) have been around since the early 1960s, but they were expensive and not readily available until recently. Today every smart phone has a GPS, cars have tracking devices showing their locations, and positioning devices are used in many areas including aviation, transportation, and science. Great progress has also been made in surveying, mapping, and recording geographical information in recent years. Do you want to know the latitude and longitude of your house? Today that information might not be much further away than typing your address into a search engine.

Thanks to technological advancement, spatial data are now only a mouse-click away. Though their variety and volume might vary, data of interest for econometric and statistical methods for spatial data analysis can be divided into three categories: spatial point-referenced data, spatial point-pattern data, and spatial areal data. The widespread use of spatial data has put spatial methodology and analysis front and center. Currently, SAS enables you to analyze spatial point-referenced data with the KRIGE2D and VARIOGRAM procedures and spatial point-pattern data with the SPP procedure, all of which are in SAS/STAT. The next release of SAS/ETS (version 14.2) will include a new SPATIALREG procedure for analyzing spatial areal data, the type of data that is the focus of spatial econometrics.

Spatial econometrics was developed in the 1970s in response to the need for a new methodological foundation for regional and urban econometric models. At its core are principles that deal with two main spatial aspects of the data: spatial dependence and spatial heterogeneity. Simply put, spatial econometrics concentrates on accounting for spatial dependence and heterogeneity in the data in a regression setting. This matters because ignoring spatial dependence and heterogeneity can lead to biased or inefficient parameter estimates and flawed inference. Unlike standard econometric models, spatial econometric models do not assume that observations are independent. In addition, the quantification of spatial dependence and heterogeneity is often based on the proximity of regions, represented by a spatial weights matrix. The idea behind such quantification resonates with the first law of geography: "Everything is related to everything else, but near things are more related than distant things."
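To make the spatial weights matrix concrete, here is a toy example in Python: four regions arranged on a line, with a row-standardized contiguity matrix whose spatial lag W·y replaces each region's value with the average of its neighbors. The layout is hypothetical; real applications derive W from map contiguity or distances:

```python
import numpy as np

# Toy contiguity matrix for four regions on a line: W[i, j] = 1 when
# regions i and j share a border (here, when they are adjacent).
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W /= W.sum(axis=1, keepdims=True)   # row-standardize: each row sums to 1

# The spatial lag W @ y is the building block of spatial econometric
# models: each region's value becomes the average of its neighbors'.
y = np.array([10.0, 20.0, 30.0, 40.0])
print(W @ y)
```

In a spatial lag model, this W·y term enters the regression as an extra regressor, which is how the dependence between neighboring regions is encoded.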

In spatial econometric modeling, the key challenge often centers on choosing a model that describes the data at hand well. As a general guideline, model specification starts with understanding where spatial dependence and heterogeneity come from, which is often problem-specific. Examples of such problems are pricing policies in marketing research, land use in agricultural economics, and housing prices in real estate economics. For instance, car sales at one auto dealership might depend on sales at a nearby dealership, either because the two dealerships compete for the same customers or because of some form of unobserved heterogeneity common to both. Based on this understanding, you proceed with a model capable of addressing the spatial dependence and heterogeneity that the data exhibit, and then revise it until you identify one that meets criteria such as Akaike's information criterion (AIC) or the Schwarz-Bayes criterion (SBC). Three types of interaction contribute to spatial dependence and heterogeneity: exogenous interaction, endogenous interaction, and interaction among the error terms. Among the wide range of spatial econometric models, some are well suited to one type of interaction effect, whereas others suit the alternatives. If you don't choose your model properly, your analysis can provide false assurance and flawed inference about the underlying data.
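The selection criteria mentioned above have simple closed forms. The sketch below computes AIC and SBC for two hypothetical fits, showing how SBC penalizes an extra parameter more heavily than AIC (the log-likelihood values are made up for illustration):

```python
from math import log

# AIC = 2k - 2 ln L,  SBC (BIC) = k ln n - 2 ln L, where k is the
# number of parameters, n the sample size, and ln L the maximized
# log-likelihood. Lower is better for both.
def aic(loglik, k):
    return 2 * k - 2 * loglik

def sbc(loglik, k, n):
    return k * log(n) - 2 * loglik

# Hypothetical fits: a spatial model with one extra parameter must
# improve the log-likelihood enough to justify the added complexity.
print(aic(loglik=-120.0, k=3), aic(loglik=-118.9, k=4))
print(sbc(loglik=-120.0, k=3, n=100), sbc(loglik=-118.9, k=4, n=100))
```

With these made-up numbers, AIC slightly favors the richer model while SBC prefers the simpler one, which is exactly the kind of tension you weigh when revising a spatial specification.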

In the next blog post, we’ll talk more about econometric and statistical methods for spatial data analysis by discussing spatial econometric analysis that uses the SPATIALREG procedure. In particular, we’ll discuss some useful features in the SPATIALREG procedure (such as parameter estimation, hypothesis testing, and model selection), and we’ll demonstrate these features by analyzing a real-world example. In the meantime, you can also read more in our 2016 SAS Global Forum paper, How Do My Neighbors Affect Me? SAS/ETS® Methods for Spatial Econometric Modeling.

Optimization for machine learning and monster trucks


My son with a beloved monster truck

Optimization for machine learning is essential to ensure that data mining models can learn from training data in order to generalize to future test data. Data mining models can have millions of parameters that depend on the training data and, in general, have no analytic definition. In such cases, effective models with good generalization capabilities can be found only by using optimization strategies. Optimization algorithms come in all shapes and sizes, just like anything in life. Attempting to create a single optimization algorithm for all problems would be as foolhardy as seeking to create a single motor vehicle for all drivers; there is a reason we have semi-trucks, automobiles, motorcycles, and so on, each developed for a specific use case. Just as in the motor vehicle world, optimization for machine learning has its superstars that receive more attention than other approaches that may be more practical in certain scenarios. Let me explain some ways to think about different algorithms with an analogy from my personal experience with monster trucks.

My son fell in love with monster trucks sometime after he turned two years old. He had a favorite monster truck that he refused to put down, day or night. I unwisely thought he would enjoy a monster truck rally and purchased tickets, imagining the father-and-son duo making great memories together. If you have never been to a monster truck rally, I don't think I can convey to you how loud the sporadically revved monster truck engine can truly be. The sound feels like a physical force hitting squarely in the chest, sending waves of vibrations outward. Despite my purchasing sound suppression "monster truck tire" ear muffs for my son, he was not a happy camper and refused to enter the stadium. Instead he repeatedly asked to go home. I circled the outer ring of the stadium, where convenience stands sell hot dogs, turkey drumsticks, and monster truck paraphernalia, naïvely hoping he would relax and decide to go inside to watch the show. I even resorted to bribery: "whatever monster truck you want I'll buy if you just go inside the stadium with me." But after 20 minutes of the rally and a relentless loop of "I want to go home, Daddy," the mighty father-and-son duo passed perplexed ticket-taking staff on our retreat to the parking lot in search of our humble but reliable 2010 Toyota Corolla.


Just a few of his monster trucks

My biggest fear was that my son's deep love and fascination with monster trucks had been destroyed. Fortunately, he continued watching Hard Hat Harry monster truck episodes incessantly with his current favorite monster truck snuggled safe beside him, so I know more fun facts about monster trucks than I care to admit. For example, monster trucks can actually drive on water, have front- and back-wheel steering, have windows at the driver's feet for wheelie control, and have engines that can be shut off remotely; they are capable of consecutive back-flips, can jump over 230 feet without injury, and put away seven gallons of gas traveling less than a mile. When not on exhibition, monster trucks require constant tuning and tweaking, such as shaving rubber from their gigantic tires to reduce weight and let them fly as high as possible on jumps. Monster trucks are their owners' passion, and the owners don't mind spending 24/7 maintaining these (I now realize) technological marvels created by human ingenuity and perhaps too much spare time.

Despite how much my son loves them, I have zero desire to own a monster truck. I am a busy dad and value time even more than money (to my wife’s lament). Though my son would be overjoyed if I brought home a truck with tires that towered over him, I personally prioritize factors of safety, fuel efficiency, reliability, and simplicity.

There are reliable and powerful motor vehicles that we use to make our lives easier, and then there are even more powerful ones that astound and enlighten us at competitions. We similarly have superstar optimization algorithms for machine learning that can train rich and expressive deep learning models to perform amazing feats of learning, such as the recent showdown between Atari games and a deep learning tool. Among the stars of today's machine learning show are stochastic gradient methods and their dual coordinate descent variants. These methods:

  • are surprisingly simple to describe in serial, while requiring elegant mathematics to explain why they converge in practice
  • can train a model remarkably fast (with careful initialization)
  • are currently producing world-record results.

This trifecta makes stochastic gradient descent (SGD) ideal material for journal articles: prototype approaches can be created relatively quickly, require publishable mathematics to prove convergence, and are pounded home with surprisingly good results. Like a pendulum swinging, my personal opinion is that stochastic gradient algorithms are (1) awesome; (2) incredibly useful, practical, and relevant; and (3) possibly eclipsing classes of algorithms that for many end users could actually be more practical and relevant when applicable.
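For readers who have only met SGD through the hype, the core loop really is that simple to describe in serial. A minimal sketch fitting a one-parameter least-squares model, updating from one randomly drawn example at a time (the data and learning rate are illustrative):

```python
import random

# Minimal SGD: fit y ~ w * x by least squares, one sample per update.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1))            # true slope is 3
        for x in [random.uniform(-1, 1) for _ in range(200)]]

w, lr = 0.0, 0.1
for _ in range(2000):
    x, y = random.choice(data)        # one random example: the "stochastic" part
    grad = 2 * (w * x - y) * x        # gradient of (w*x - y)^2 w.r.t. w
    w -= lr * grad                    # descend along the noisy gradient
print(f"estimated slope: {w:.2f}")
```

The same loop, scaled to millions of parameters and clever sampling schemes, is what trains today's deep learning models; the simplicity is exactly why the convergence analysis, not the code, fills the journal pages.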

As my career has progressed and I have been immersed in diverse optimization groups, this experience has impressed upon me the philosophy that for every class of algorithms there exists a class of problems where the given algorithm appears superior. This is sometimes referred to as the no free lunch theorem of optimization. For end users, this simply means that true robustness cannot come from any single best class of algorithms, no matter how loud its monster engine may roar.  Robustness must start with diversity.

Figure 1 Objective value for different choices of learning rate specified by the user. Beforehand it is almost impossible to know what learning rate range works. The user must hunt and choose.


An example of someone thinking outside the current mainstream is the relatively recent work on Hessian-free optimization by James Martens. Deep learning models are notoriously hard to train. Martens shows that adapting the second-order methodology of Newton's method in a Hessian-free context creates an approach that can be used almost out of the box. This is in stark contrast to many existing approaches, which work only after careful option tuning and initialization. A veteran data miner may spend weeks repeatedly training the same model with the same data in search of the golden needle in the haystack: a set of solver options that results in the best solve time and solution. For example, Figure 1 shows how the choice of the so-called "learning rate" can decide whether an algorithm converges to a meaningful solution or goes nowhere.
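The learning-rate cliff in Figure 1 is easy to reproduce even on a toy problem. The sketch below runs plain gradient descent on f(w) = w², where the update is w ← w − lr·2w: a tiny rate crawls, a moderate rate converges, and a rate past the stability threshold diverges (the specific rates are illustrative):

```python
# Gradient descent on f(w) = w^2; behavior depends solely on the
# learning rate, mirroring the hunt-and-choose problem in Figure 1.
def gradient_descent(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w^2 is 2w
    return abs(w)                # distance from the minimizer w = 0

for lr in (0.01, 0.1, 1.5):
    print(f"lr={lr}: |w| after 50 steps = {gradient_descent(lr):.3g}")
```

On this quadratic, each step multiplies the error by (1 − 2·lr), so any rate above 1.0 doubles the error or worse every iteration, which is the divergence the user must avoid without knowing the threshold in advance.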

This is why I am excited about the Hessian-free approach: although it is not currently mainstream and lacks the rock-star status of stochastic gradient descent (SGD), it has the potential to save significant user processing time. I am certainly also excited about SGD, a surprisingly effective rock-star algorithm being rapidly developed and expanded as part of the SAS software portfolio. Ultimately, our focus is to provide users an arsenal of options to approach any future problem and, ideally, like a monster truck, readily roll over any challenge undaunted. We are also working to provide an auto-tune feature that uses the cloud efficiently to find the best tool for any given problem.

Stochastic gradient descent (SGD) approaches dominate the arena these days, their power proving itself in how frequently they are used in competitions. However, what is not typically reported or advertised is the amount of time the data miners spent tuning both the model and solver options, often called hyper-parameters. The time reported for the resulting models typically measures only the training time using these best parameters, leaving unsaid the long hours of hunting, accumulated over weeks, for the hyper-parameters themselves. This consuming tuning process is a well-known property of stochastic gradient methods, because the solvers surface a set of user options that typically need to be tuned for each new problem.

But on a day-to-day basis, data miners seek to solve real data problems for their companies and organizations, not to ascend the leaderboard in a competition. Not all problems are big data problems, and not all problems require the full power and flexibility of SGD. Two questions that all data miners must ask are:

  1. What model configuration to use?
  2. What solver should train the model?

The user searches for the unquantifiable best model, best model options, best solver, and best solver options. Combinatorially speaking, there are billions of potential combinations one might try, and at times a human expert is needed to narrow down the choices based on as-yet-undefinable and inexplicable human intuition (another artificial intelligence opportunity!). A primary concern when picking a model and a corresponding training algorithm is solve time: how long must the user wait to obtain a working model, and what is the corresponding solution quality? In general, users want the best model in the least time.

Personally when I consider solver time I think of two separate quantities:

  1. User processing time.  How much time does the user spend babysitting the procedure, sitting in front of a computer?
  2. Computer processing time.  How much time does the user wait for a given procedure to finish?

Oftentimes we can quantify only (2) and end up leaving (1) unreported. However, a SAS customer's satisfaction likely hinges on (1) being relatively small, unless tuning is their day job. I personally love algorithm development and enjoy spending my waking hours making algorithms faster and more robust. Because of this, I jealously guard my time when it comes to other things in my life, such as the car I drive. I don't own a monster truck and would never be interested in the team of mechanics and constant maintenance required to achieve world-record results. I just want a vehicle that is stable, reliable, and gets me to work, so I can focus entirely on what interests me most.

Figure 2 Very little communication is needed when training models in parallel, making it a great fit for the cloud or any available cluster.


At SAS we want to support all types of customers. We work with experts who prefer total control and are happy to spend significant time manually searching for the best hyper-parameters. But we also work with practitioners who may be new to data mining or have time constraints and prefer a set-it-and-forget-it scenario, where SAS searches for options for them. Now, in a cloud environment, we can train many models in parallel for you in the time it takes to build a single model. Figure 2 shows an example where we train up to 32 models concurrently, using four workers for each training session (a total of 128 workers), to create 1800 different candidate models. Imagine attempting to run all 1800 models sequentially, manually deciding which options to try next based on the previous solve's performance. I have done this for weeks at a time and found it cumbersome, circular (I would try the same options multiple times), and exhausting. Once presented with this parallel capability, I obtained new records in hours (instead of weeks) in exchange for a single click. Further, I am free to take care of other things while the auto-tune procedure works in the background for me.
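The parallel tuning idea can be sketched in a few lines: score many hyper-parameter candidates concurrently and keep the best. The score function below is a hypothetical stand-in for a full training run, and the grid and worker count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

# Hypothetical validation-loss surface with its best point at
# lr=0.1, batch_size=64; in practice each call would train a model
# and return its validation loss.
def score(lr, batch_size):
    return (lr - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2

grid = list(itertools.product([0.001, 0.01, 0.1, 1.0], [16, 32, 64, 128]))

# Evaluate candidates concurrently instead of one painful run at a time
with ThreadPoolExecutor(max_workers=8) as pool:
    losses = list(pool.map(lambda p: score(*p), grid))

best = grid[losses.index(min(losses))]
print(f"best candidate: lr={best[0]}, batch_size={best[1]}")
```

Because the candidate evaluations never talk to each other, the same pattern scales from eight threads on a laptop to hundreds of workers in the cloud, which is what makes this workload such a natural cloud fit.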

Out with the old, in with the new


Another way to cut down on user processing time is to use alternative algorithms that require less tuning but perhaps longer run times, shifting time from (1) to (2) much as auto-tuning in parallel does. This approach may mean replacing rock-star algorithms like SGD with algorithms that are slower and less exciting, such as James Martens' Hessian-free work mentioned earlier. His approach may never be popular in competitions, where data mining experts prefer the numerous controls offered by alternative first-order algorithms, because ultimately they need only report computer processing time for the final solve. However, the greatest strength of Martens' work is the potential reduction in user processing time. In a cloud computing environment we can always conduct multiple training sessions at the same time, so we don't necessarily have to choose a favorite when we can run everything at once. A good toolbox does not offer a single tool used to achieve record results but a quality, diverse ensemble of tools that covers all possible use cases. For that reason I think it best to offer both SGD and Hessian-free methods, as well as other standard solvers. A diverse portfolio increases the chance that you have a warm start on the next great idea or algorithm. For example, my son, now five and a half, spends more time at the toy store looking at Power Rangers than monster trucks.

My sturdy and reliable Toyota Corolla, like second-order algorithms, may never be as carefully and lovingly tuned to win competitions, but practitioners may be far more satisfied with the results and the lesser maintenance time. After all, judging by the vehicles we see on the road, most of us, most of the time, prefer a practical choice for our transportation rather than a race car or monster truck.

I’ll be expanding on optimization for machine learning later this week at the International Conference on Machine Learning on Friday, June 24. My paper, “Unconventional iterative methods for nonconvex optimization in a matrix-free environment,” is part of a workshop on Optimization Methods for the Next Generation of Machine Learning, and I’m pleased to be joined by such rock stars as Yoshua Bengio of the University of Montreal  and Leon Bottou of Facebook AI Research, who are just a few of the stellar people presenting on such topics during the event.


Analytics, OR, data science and machine learning: what's in a name?


Casa di Giulietta balcony, where Juliet supposedly stood while Romeo declared his love

Analytics, statistics, operations research, data science, and machine learning: which term do you prefer to associate with? Are you from the House of Capulet or Montague, or do you even care? Shakespeare's Juliet derides excess identification with names in the famous play Romeo and Juliet.

"What's in a name? That which we call a rose
By any other name would smell as sweet."

Romeo was from the house of Montague and Juliet from the house of Capulet, a distinction that meant their families were sworn enemies. The play is a tragedy because, by its end, the two lovers are dead as a result of this long-running feud. Statistics, data science, and machine learning are but a few of the "houses" that feud today over names, and while to my knowledge no deaths have resulted from this debate, the competing camps have nearly come to blows.

"Operations Research? Management Science? Analytics? What’s in a brand name? How has the emerging field of Analytics impacted the Operations Research Profession? Is Analytics part of OR or the other way around? Is it good, bad, relevant, a nuisance or an opportunity for the OR profession? Is OR just Prescriptive or is it something more? In this panel discussion, we will explore these topics in a session with some of the leading thinkers in both OR and Analytics. Be sure to attend to have your questions answered on these highly complementary and valuable fields.” This was the abstract for a panel at the INFORMS Annual Meeting last year that included two past presidents of INFORMS and other long-time members. To many long-time members of INFORMS this abstract was provocative indeed, because not all embrace of the term "analytics" to describe what they do.

Interestingly, the American Statistical Association (ASA) is host to a very similar debate. Last fall the ASA released a statement on the Role of Statistics in Data Science. There’s a whole wing of data science practitioners who are downright hostile to statisticians. This camp makes assertions like "sampling is obsolete,” due to computational advances for processing big data. Popular blogger Vincent Granville has even said "Data Science without statistics is possible, even desirable," extending the obsolescence to statisticians themselves. INFORMS member and self-described statistical data scientist Randy Bartlett, who is both a Certified Analytics Professional (the CAP certification offered by INFORMS) and an Accredited Professional Statistician (the PSTAT certification from ASA), has written about this statistics denial in an excellent series of blog posts he publishes from his LinkedIn page. In the face of such direct attacks on their profession it is no wonder the ASA felt a need to take a stance.

Many statisticians assert "Aren't We Data Science?," as Marie Davidian (professor of statistics at NC State University) did in 2013 in an article published during her tenure as president of the ASA. More recently David Donoho (professor of statistics at Stanford University) made a similar but more elaborate argument in a long-form piece, "50 Years of Data Science," which he released last fall (after a presentation on it at the Tukey Centennial workshop). Donoho is equally dismayed at much of the current data science movement. As he puts it, "The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers." Donoho points out the harm in this huge oversight of the contributions of statistics, while also exhorting academic statisticians to expand beyond a narrow focus on theoretical statistics and "fancy-seeming methods." He proposes a definition of data science based on people who are "learning from data," drawing upon a remarkably prescient article John Tukey published more than 50 years ago, "The Future of Data Analysis," in The Annals of Mathematical Statistics. In making his case, Donoho also points to subsequent essays by John Chambers (of Bell Labs and co-developer of the S language), William Cleveland (also of Bell Labs and arguably the one who coined the term data science in 2001), and Leo Breiman (of the University of California at Berkeley). Together these gentlemen argue for addressing a wider portion of the lifecycle of analysis, such as the preparation of data as well as its presentation, and for the importance of prediction (and not just inference). I heartily recommend reading Donoho's excellent analysis.

While I haven’t seen a similar assault on operations research, there are those within the OR/MS community who see terms like analytics as a threat to the survival of OR. Several years ago, when INFORMS began exploring whether to embrace the term analytics, researchers surveyed the INFORMS membership and published their results in an Interfaces article entitled "INFORMS and the Analytics Movement: The View of the Membership." Member views of the relationship between OR and analytics were roughly divided into three camps: OR is a subset of analytics, analytics is a subset of OR, and OR and analytics overlap. While the numbers may have shifted, I doubt they have yet converged into a clear definition or consensus among members. There will always be naysayers - 6% of the membership surveyed at the time thought there was no relationship between OR and analytics, and for that matter there are statisticians who see no association with data science or analytics at all. Today INFORMS embraces the term analytics, describing itself as "the largest society in the world for professionals in the field of operations research (O.R.), management science, and analytics." By adding the word analytics instead of replacing operations research, INFORMS has shown that this is not an either/or question - it values its roots while acknowledging a present that includes those who describe their work as "analytics."

Trying to wrangle a clear definition of analytics, statistics, data science, machine learning, and even operations research can be as messy as cleaning up a typical data set! These are distinct disciplines, related but not the same. Increasingly analytics is used synonymously with data science, which derives in large part from statistics, which in turn is a foundation for machine learning, which relies upon optimization techniques, which I’d argue are part of analytics. These days I observe increased cross-fertilization, like my operations research peers clamoring to attend machine learning conferences and machine learning talks that are standing-room-only at the Allied Social Sciences Annual Meeting (where the economists gather).

We can parse terminology all day, but instead we should invest our energy in the opportunity at hand and drive toward increased adoption. Some of these buzzy terms get people’s attention and provide an incredible opportunity to use mathematically based methods to make a significant impact. Last year my friend Jack Levis accepted my invitation to give a keynote at an analytics conference SAS hosted. He spoke about ORION, the OR project he leads at UPS, which Tom Davenport has called "arguably the largest operations research project in the world." While few of the conference attendees likely understood the operations research methods employed in great detail, all were amazed at the impact his team has had on saving miles, time, and money for UPS, happily tweeting their excitement during his talk. No doubt this impact is why Jack's team won the 2016 INFORMS Edelman Competition, which some call "the Super Bowl of OR."

The important advances in research presented at conferences like the Joint Statistical Meetings and the INFORMS Annual Meeting pave the way for progress that enables success in practice at places like UPS. We need academics and other researchers to continue to invest in advancing the unique approaches of their disciplines. Most INFORMS members would call what Jack's team does operations research. Most attendees at the conference where Jack spoke probably thought of it as analytics. Does it really matter what we call it, if people value what was done, want to share the story, and pave the path for adoption through their enthusiasm? If it leads to the expansion of OR, I don’t care if this application of operations research is popularly referred to as analytics, because, after all, in the words of the bard, "a rose by any other name would smell as sweet." Each of the historic disciplines has a chance at not just a bigger slice of the pie but a bigger slice of a bigger pie if analytics is embraced at large. I understand the value and pride in disciplines like operations research and statistics, and their contributions to data science and machine learning, which form the knowledge base analytics draws upon. We don't have to throw out older terms to embrace new ones. The houses of Capulet and Montague can celebrate their unique heritages while putting an end to their feuds. This is not an either/or proposition, but I do believe the opportunity is ours to squander. Instead of feuding, let us learn more, put that learning into practice, and make the world a better, smarter place.


Image credit: photo by Adam W // attribution by creative commons

Understanding data mining clustering methods

When you go to the grocery store, you see that similar items are displayed near each other. When you organize the clothes in your closet, you put similar items together (e.g., shirts in one section, pants in another). Every personal organizing tip on the web that promises to save you from your clutter suggests some sort of grouping of similar items. Even if we don't notice it, we group similar objects together in every aspect of our lives. In machine learning this is called clustering, so in this post I will provide an overview of data mining clustering methods.

In machine learning or data mining, clustering groups similar objects together in order to discover structure in data that has no labels. It is one of the most popular unsupervised machine learning techniques, with varied use cases. When you have a huge volume of information and want to turn it into a manageable stack, cluster analysis is a useful approach. It can summarize the data as well as prepare it for other techniques. For instance, assume you have a large number of images and wish to organize them based on their content. In this case, you first identify the objects in the images and then apply clustering to find meaningful groups. Or consider this situation: you have an enormous amount of customer data and want to identify a marketing strategy for your product, as illustrated in figure 1. Here you could first cluster your customer data into groups with similar structure and then plan a different marketing campaign for each group.


Figure 1. Clustering: group the data based on the similarities.

Defining similarity / dissimilarity is an important part of clustering, because it affects the structure of your groups. You need to choose your measure based on the application at hand: a measure that works well for continuous numeric variables may be a poor fit for nominal variables. Euclidean distance, the geometric distance in multidimensional space, is one of the most popular measures used in distance-based clustering. When the dimensionality of the data is high, however, Euclidean distance loses its validity as a distance metric due to the so-called ‘curse of dimensionality’: the distances between all pairs of points become nearly equal. In such cases, other measures like cosine similarity and the Jaccard coefficient are among the most popular alternatives.
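To make the contrast concrete, here is a minimal sketch in plain Python (no clustering library assumed; the example vectors are illustrative). Two vectors can be far apart in Euclidean terms yet have cosine similarity near 1 because they point in the same direction:

```python
import math

def euclidean(a, b):
    """Geometric distance in multidimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# b = 2 * a: geometrically distant, but identical in direction.
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
d = euclidean(a, b)          # about 3.74 -- "far" by Euclidean distance
s = cosine_similarity(a, b)  # about 1.0 -- "identical" by cosine similarity
```

Which verdict is right depends on the application, which is exactly why the choice of measure matters.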


Figure 2 - K-means clustering

One of the most popular distance-based clustering algorithms is ‘k-means’. K-means is conceptually simple and computationally fast relative to other clustering algorithms, which makes it one of the most widely used. Because k-means is centroid-based, it is most effective when the clusters are globular in shape. Figure 2 shows the clustering results k-means yields when the clusters in the data are globular.
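The centroid-based idea fits in a few lines. Here is a minimal, library-free sketch (the two-blob data and fixed seed are illustrative assumptions, not from the original post), alternating between assigning points to the nearest centroid and moving each centroid to its cluster's mean:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: alternate between assigning each point to
    its nearest centroid and recomputing each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                   # assignments stopped changing
            break
        centroids = new
    return centroids, clusters

# Two globular blobs; k-means recovers them cleanly.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
centroids, clusters = kmeans(points, 2)
```

Production implementations add refinements such as smarter initialization (k-means++) and multiple restarts, but the core loop is exactly this.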

Density-based algorithms can discover clusters of any shape in the data. In density-based clustering, clusters are areas of higher density than the rest of the data set. Objects in the sparse areas that separate clusters are usually considered noise or border points. The most popular density-based clustering method is DBSCAN. Figure 3 shows the results yielded by DBSCAN on data with non-globular clusters.
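As a rough sketch of the density-based idea (a simplified DBSCAN in plain Python; the sample points and parameter values are illustrative), a cluster is grown outward from each "core" point, i.e. a point with at least `min_pts` neighbors within radius `eps`, and points in sparse areas end up labeled as noise:

```python
import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN sketch: grow a cluster from each unvisited core
    point (one with at least min_pts neighbors within eps); points in
    sparse areas get label -1 (noise) unless a cluster absorbs them."""
    labels = {p: None for p in points}

    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    cluster = -1
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1                     # noise, for now
            continue
        cluster += 1                           # p is a core point: new cluster
        labels[p] = cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster            # border point absorbed
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbors(q)) >= min_pts:   # q is itself a core point
                queue.extend(neighbors(q))
    return labels

# Two dense groups plus one isolated point, which becomes noise (-1).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Note that, unlike k-means, nothing here assumes globular shapes: the clusters are simply whatever the density-connected regions turn out to be, and no cluster count is supplied.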


Figure 3. DBSCAN algorithm.

For both the k-means and DBSCAN clustering methods mentioned above, each data point is assigned to exactly one cluster. But consider this kind of situation: when a streaming music vendor tries to categorize its customers into groups for better music recommendations, a customer who likes songs from Reba McEntire may be suitable for the “country music lovers” category. But at the same time, s/he has considerable purchase records for pop music as well, indicating the “pop music lovers” category is also appropriate. Such situations call for ‘soft clustering,’ where multiple cluster labels can be associated with a single data point, and each data point is assigned a probability of belonging to each cluster. A Gaussian mixture model fit with the expectation-maximization algorithm (GMM-EM) is a prominent example of probabilistic cluster modeling.
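The soft assignment lives in the E-step of GMM-EM. As an illustrative sketch (the one-dimensional "taste score" and the component parameters below are hypothetical, not from the original post), each point's cluster probabilities are its normalized weighted likelihoods under the mixture components:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """E-step of GMM-EM: the posterior probability that x came from each
    (weight, mu, sigma) component -- the 'soft' cluster assignment."""
    likelihoods = [w * gaussian_pdf(x, mu, s) for w, mu, s in components]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

# Hypothetical "country lovers" vs. "pop lovers" components on a taste score.
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
mixed_taste = responsibilities(2.5, components)   # midway: roughly [0.5, 0.5]
clear_taste = responsibilities(0.5, components)   # strongly the first cluster
```

A full EM fit would alternate this E-step with an M-step that re-estimates the weights, means, and variances from the responsibilities, but the partial memberships themselves, like the Reba-plus-pop customer above, come from exactly this calculation.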

So far, all the clustering methods discussed require specifying the number of clusters beforehand - for example, the ‘k’ in k-means (as shown in figure 4). In a typical unsupervised learning task, however, the number of clusters is unknown and needs to be learned during the clustering. This is why ‘nonparametric methods’ are used in the practice of clustering. A classic nonparametric approach replaces the fixed, finite mixture distribution in a Gaussian mixture model with a stochastic process. The stochastic process can help discover the number of clusters that yields the highest probability for the data under analysis - in other words, the most suitable number of clusters for the data.


Figure 4. The uncertainty in determining the number of clusters in data.

Finally, if you are interested in the internal hierarchy of the clusters in the data, then a hierarchical clustering algorithm can be applied, as shown in figure 5.
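The bottom-up (agglomerative) variant is easy to sketch. Here is a minimal single-linkage version (single linkage is one common choice of cluster distance; the data points are illustrative): start with every point in its own cluster and repeatedly merge the two closest clusters until the desired number remain, which traces out exactly the hierarchy a dendrogram would display:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering sketch: repeatedly merge the two closest
    clusters, where cluster distance is the distance between their two
    closest members (single linkage), until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None                            # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))    # merge the closest pair
    return clusters

# Two tight pairs and one loner: merging stops at three clusters.
pts = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 0.0)]
merged = single_linkage(pts, 3)
```

Recording the sequence of merges (rather than stopping at k) yields the full hierarchy shown in heat-matrix/dendrogram plots like figure 5.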


Figure 5. An example of a heat matrix and hierarchy of clusters in a hierarchical clustering.

There are many different ways to cluster data. Each method tries to cluster the data from the unique perspective of its intended application. As mentioned, clustering discovers patterns in data without explaining why the patterns exist. It is our job to dig into the clusters and interpret them. That is why clusters should be profiled extensively to build their identity, to understand what they represent, and to learn how they are different from each other.

Clustering is a fundamental machine learning practice to explore properties in your data. The overview presented here about data mining clustering methods serves as an introduction, and interested readers may find more information in a webinar I recorded on this topic, Clustering for Machine Learning.

I am grateful to my colleague Yingjian Wang, who provided invaluable assistance to complete this article.

All images courtesy of the author except the last, for which credit goes to Phylogeny Figures // attribution by creative commons

I've seen the future of data science....

"I've seen the future of data science, and it is filled with estrogen!" This was the opening remark at a recent talk I heard. If only I'd seen that vision of the future when I was in college. You see, I’ve always loved math (and still do). My first calculus class in college was at 8 a.m. on Mondays, Wednesdays and Fridays, and I NEVER missed a class. I showed up bright-eyed and bushy tailed, sat on the front row, took it all in, and aced every test. When class was over, I’d all but sprint back to my dorm room to do homework assignments for the next class. The same was true for all my math and statistics classes. But despite this obsession, I never considered it a career option. I don’t know why, maybe because I didn’t know any other female mathematicians or statisticians, or I didn’t know what job opportunities even existed. Estrogen wasn't visible in the math side of my world in those days; I didn't see myself as part of the future of data science.

Fast forward (many) years later, and I find myself employed at SAS in a marketing and communications capacity working closely with colleagues who are brilliant mathematical and analytical minds, many of whom are women. They are definitely the future of data science!

Several of these colleagues helped establish the NC Chapter of the Women in Machine Learning and Data Science (WiMLDS) Meetup that just held its inaugural gathering a couple of weeks ago. The Meetup was founded to facilitate stronger collaboration, networking and support among women in machine learning and data science communities as well as grow the talent base in these fields of study. In other words, build the future of data science and populate it with women! The NC chapter plans to host quarterly, informal Meetup events that will feature presentations from practitioners and the academic community, as well as tutorials and learning opportunities.

This inaugural event featured guest speaker Jennifer Priestley, Professor of Applied Statistics and Data Science at Kennesaw State University, who greeted the estrogen-filled audience. She talked at length about the field of data science and the talent gap, and she made the case for getting a PhD in data science.

She said she’s starting to see PhD programs recognize data science as a unique discipline or field of study. She referred to data science as “the science of understanding data; the science of how to translate massive amounts of data into meaningful information to find patterns, engage in research and solve business problems.”

Priestley attributes the rise of data science to the 3 Vs – volume, velocity and variety – across all industries and sectors. She said companies that wouldn’t have classified themselves as data companies a few years ago do now, and they require skilled labor to help them manage that data and use it to make business decisions.

To help fill this talent gap, she talked about the need for PhD programs in data science, but explained that such programs needed to be “21st-century” programs built around applied curriculum. That is how they've built their own Ph.D. Program in Analytics and Data Science at Kennesaw State University.

As I sat in the back listening in, I wondered what would have happened had I been exposed to a network like this during my early college days when I was trying to pick a major and think about a career. Would I have been part of the future of data science? Maybe I’d have made a different decision. Who knows – maybe it’s not too late. It’s certainly not too late to inspire another woman to tap into a supportive network like this.

For more information, visit the NC WiMLDS Meetup website.