How Santa’s Workshop uses social network analysis

Mrs. Claus

I recently met Mrs. Claus at the INFORMS Annual Meeting, where we got to talking about the social network analysis session she’d just attended. It turns out Mrs. Claus and I are both fans of a book by Alex Pentland, Social Physics: How Social Networks Can Make Us Smarter. Apparently years ago she foresaw the trend toward analytics and returned to school for dual PhDs in Computer Science and Statistics at Stanford. She now carries the title of Chief Data Scientist of Santa’s Workshop. Who knew? We chatted about the many ways she and her team at Santa’s Workshop use social network analysis, some commonly employed, others surprising adaptations.

Santa's Workshop

Santa’s Workshop first started using social network analysis to uncover fraud. While naughty children exist (which requires predicting coal delivery, but that’s another post), the perpetrators Santa is after are adults. I bet you didn’t know that some households pretend to have young children by leaving out notes for Santa, even if they have no children at all or their children are grown and have left home. The challenge in finding fraudsters is not spotting a pattern (fraud is by definition a rare event) but making meaningful connections among disparate data activities that may reveal it.

Social network analysis allows investigators to look at lots of data from multiple sources at the level of a network, where they can see different people (nodes) and their relationships (ties) in the form of a graph. The connections between people may not exist at the transactional level but jump out when viewed graphically in network form. Just as the Los Angeles County Department of Public Social Services uses social network analysis to quickly identify potential co-conspirators in fraud rings, Santa’s Workshop uses it to find the bad guys. Mrs. Claus has uncovered several fraud rings this way and stopped delivery of presents to those homes.

Telecom companies are not the only ones to worry about churn. Santa’s Workshop also has to worry about this problem, which arises when children stop believing in Santa Claus and cancel their “Santa service” prematurely. You may think this transition of beliefs is an isolated event that is just a natural function of age, but Mrs. Claus uses the SAS® Enterprise Miner™ Link Analysis node (link analysis is a popular form of social network analysis) to uncover notable connections among parents, siblings, and schools that suggest the possibility of churn. They look at cell phone call records and social media connections to understand relationships, and then use targeted interventions to offer parents tools to allow children to continue to believe.

Drawing on Pentland’s research, they have applied careful use of social network incentives to encourage older children not to tell their younger siblings or other children in the neighborhood. Pentland’s lab has found that this kind of positive social pressure is best applied to people in the target’s network rather than directly to the target. For Santa this means offering incentives to the respected friends of the older children at risk of “explaining” Santa to their younger siblings, rather than to those older siblings themselves. Pentland’s research shows that this kind of nudge is far more effective than standard economic incentives, because it recognizes that we are social actors strongly influenced by our social ties. Mrs. Claus and her team have also found that children at risk of churning held onto their beliefs longer when they received special mail messages from Santa himself.

Social network analysis

Using social network analysis to detect fraud and churn is common, but what really intrigued me was how Santa’s Workshop uses it to generate ideas for presents by improving idea flow among the elves. You may have thought that Santa is only an order taker, but consider the challenge he faces when a child asks for something that is simply infeasible (pony requests in New York City, for example). So each year he must conceive of, design, and produce items that may not have been requested but are likely to please. Product Manager Elves are responsible for finding ideas for these gifts and delivering product specifications to the Manufacturing Elves.

Research shows that the best way to stimulate idea flow is to increase both exploration and engagement. Early in the year Product Manager Elves travel around (incognito, of course) to hunt for ideas. The Product Manager Elves most consistently successful at generating creative product ideas are those Pentland labels explorers. You know this kind of person – they know lots of different kinds of people, love talking ideas with them, and then share those freshly gathered ideas in subsequent conversations. As Pentland describes, their focus is not “the ‘best’ people or ‘best’ ideas” but “people with different views and different ideas.” They then filter the best ideas by learning which ones generate the most traction in subsequent conversations with others.

The other key to idea flow is engagement, which happens when new ideas are shared among teams. The best ideas the Product Manager Elves discover go nowhere if they aren’t adopted and championed by other Elves. So, drawing upon an example in Pentland’s book about improving idea flow at a call center, Mrs. Claus scheduled a common lunch hour, so everyone breaks at the same time to eat. Previously lunch was staggered to avoid bringing down the line, but they’ve learned that when Elves from different departments circulate at lunch, they share ideas. The ideas that stick are the ones whole teams get excited about; they start contributing to design specs and begin to see the ideas as belonging to Santa’s Workshop, not just to the Product Manager Elf who first discovered them. This also seems to have helped Elf retention, because everyone feels part of the entire process.

What does Mrs. Claus have on her 2017 data science horizon? She’s been exploring the use of the sociometric badges that Pentland first employed in his research. Commonly known as sociometers, these are small wearable electronic devices that collect data on people’s interactions (face-to-face time, conversation, gestures, physical proximity, etc.). Pentland’s devices are the size of thick badges, but Santa’s Workshop has developed tiny ones they can surreptitiously place on toys to track similar behavior, with the added element of serving as gateways so the data can be analyzed in real time. She hopes to make tweaks to gift-giving in 2017 as Santa travels around the world, drawing upon children’s initial reactions to new gifts.

I was glad to meet Mrs. Claus, another fan of Pentland’s book, Social Physics, because it is full of interesting ideas. I’m clearly an explorer, because when I read a book like this I want to discuss it with other people, hear their reactions, and learn new things. Later chapters talk about how the concepts of social physics can lead to smarter cities and even smarter societies. To ensure that the kind of data collected is used ethically, Pentland even proposes a New Deal on Data. I'm encouraged by research like this that can be applied as part of the growing #data4good movement. So lots of good stuff here! If any of you have read this book, please chime in and let me know what caught your attention.

Mrs. Claus image credit: photo by Public Information Office // attribution by creative commons

Santa's Workshop image credit: photo by Loozrboy // attribution by creative commons

4 tips to ensure your data4good efforts have an impact

In honor of today’s #GivingTuesday, which “harnesses the potential of social media and the generosity of people around the world to bring about real change in their communities,” I’ve been thinking about what constitutes “real change” and the role analytics can play on the many social issues our planet faces. The beauty of #GivingTuesday is that “it provides a platform … to encourage the donation of time, resources and talents to address local challenges.” As more and more data scientists want to use their talents in the data4good movement, what does it take for analytics to make a serious impact?

Two weeks ago I heard two very interesting talks on data4good at the INFORMS Annual Meeting, where 5,000+ people focused on operations research gathered together. The first was on “Challenges and Lessons Learned from Influencing Policy Change in Organ Transplantation.” As you can see from the photo I took, this session combined quite a distinguished group of operations research (OR) academics and transplant surgeons who both want to make an impact. For many reasons, a pure market-based approach is not the best way to allocate organs for transplant, so the process is governed by policy makers, who have divided the country into regions. All parties agree that the current regional system results in disparities in access and is broken, but policy makers have been unable to settle on a solution.

Because this situation is a classic market-matching problem, it has drawn the attention of the operations research field.* Over the years the academics on the panel had proposed a variety of mathematical solutions. But the most elegant mathematical model for a real-world problem adds little value if it is not implemented. So why haven’t they solved the problem? Part of the answer is as simple and frustrating as problem definition. Are they trying to help those who are the sickest or those most likely to succeed with a transplant? Is it fair that some regions have shorter waiting lists because more organs are available due to increased deaths? Agreeing on the problem definition is tough, and as the clinicians explained, they “argue a lot, because life matters.”

The other session that triggered my thinking was a tutorial on healthcare analytics by Joris van de Klundert of the Erasmus University Institute of Health Policy and Management, who gave a challenge to the OR professionals in attendance. Healthcare analytics is a critical area for the world’s population but one where his fellow researchers in OR are making only a modest contribution. A big part of the problem is too much emphasis on research and too little on results that actually improve healthcare. His literature review of articles on healthcare analytics at various stages of the analytical life cycle highlighted this fact: the vast majority of research is in model building, with fewer and fewer articles published as you move along the cycle to solution development, then model implementation, and finally evaluation and monitoring. There was a lively discussion among the audience about the challenges, which include the difficulty academics face getting involved in practice, the conflict between the simple models most often needed and those that will result in publication, and the risk tenure-track faculty take on by doing work that may not result in the right kind of publications.

After listening to these data4good talks, I propose these tips to ensure your application of analytics has a real impact:

• Take time to listen to your “customer.” Even in the social sector, you still have “customers”: the people or groups for whom you are trying to solve the problem. The transplant surgeons emphasized that it takes a lot of time to build relationships between clinicians and what they called “engineers,” in part because of the big gap that can exist between what these two groups value. Be sure to explain your results to increase the credibility of your model. As the San Bernardino County Department of Behavioral Health found, discussing their analysis with many of their partners in care helped them align on the goals they shared.
• Build models that match the problem as well as the solution. This means ensuring that you have defined the problem correctly, which, as the thorny organ transplant discussion shows, may be far more than half the battle. This also relates back to listening to your “customer.” SAS is working with DataKind and the Boston Public Schools to optimize their bus routes, and as we do so we have to periodically check in to see if the initial models we propose would make sense in practice. People who know math must talk to people who know buses to know if the model will work.
• Focus on putting models into practice. Modeling the problem is important, but as van de Klundert’s literature review shows, the OR community has plenty of success in this area. The challenge is working closely enough (see items 1 and 2 above) with your “customer” to find a path to implementation. After all, as one of the academics said “our endless models don’t necessarily provide the details practitioners want.” So find the details they do want, put them into your model, and work with them to put that model into practice. As Jake Porway, founder of DataKind, blogged: “we cannot make change with technology or data alone.”
• Remember Occam’s razor, or the simplest solution is often the best. For all your interest in trying out the latest non-negative matrix factorization model, a simple logistic regression is often hard to beat. And it will be far more interpretable to most non-analytics professionals.

Today’s #GivingTuesday celebrates giving of all forms, and the social sector could benefit so much from the talent of data scientists (in fact, what sector wouldn’t benefit?). But as Jake Porway likes to say, “you can’t just hack your way to social change.” You must consider impact from the start for your data4good efforts to succeed. After all, who wants to give their time and talent without it making a difference?
* Alvin Roth shared the Nobel Prize for his work in this area, "for the theory of stable allocations and the practice of market design," and while he is a professor of economics at Stanford his PhD is in operations research.

Local Search Optimization for HyperParameter Tuning

When shopping for a new TV, with many sets next to each other across a store wall, it is easy to compare picture quality and brightness. What is not immediately evident is how different the set will look in your home from how it looked in the store. HDTV pictures are calibrated by default for large, bright stores, because that is where the purchase decision is made. In most cases, the backlight setting for LED HDTVs is set at the factory to its maximum for bright display in stores. Many other adjustable settings also affect the quality of the picture – brightness, contrast, sharpness, color, tint, color temperature, picture mode, and more advanced picture control options like motion mode and noise reduction.

While most people simply connect the TV and use the out-of-the-box settings, it turns out that modern HDTVs need to be calibrated to the room size and typical lighting, which vary from home to home. Simply reducing the backlight can make a huge difference (in my case, I reduced this setting from the peak of 20 down to 6!). Adjusting all the options manually and independently can be tricky. Luckily there are online recommendations for most TV models for the average room. These can be a good start, but again, each room is different. Calibration discs and/or professional calibration technicians can help tweak the advanced settings and truly find an optimal setting for your environment. Wouldn’t it be nice if a TV could calibrate itself to its environment? Perhaps this is not far off, but for now calibration is a manual process.

Once a TV is calibrated, it is ready to enjoy. The visual data, the broadcast information, can be observed, processed, and understood in real time. When it comes to data analytics, however, with raw data in the form of numbers, text, images, etc., gathered from sensors and online transactions, ‘seeing’ the information contained within is not so easy, especially as the source grows rapidly. Machine learning is a form of self-calibration of predictive models given training data. These modeling algorithms are commonly used to find hidden value in big data. Facilitating effective decision making requires the transformation of relevant data into high-quality descriptive and predictive models. The transformation presents several challenges, however. As an example, take a neural network (Figure 1). A set of outputs is predicted by transforming a set of inputs through a series of hidden layers defined by activation functions linked with weights. How do we determine the activation functions and the weights that produce the best model configuration? This is a complex optimization problem.

Figure 1: neural network
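As a concrete illustration, here is a minimal sketch (in Python rather than SAS) of the transformation Figure 1 depicts: inputs passed through hidden layers of activation functions linked with weights. The tanh activation and the specific weight values here are illustrative assumptions, not trained values from any real model.

```python
import math

def forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases) layers.

    Each layer transforms its input with a weighted sum followed by a
    tanh activation -- the 'activation functions linked with weights'
    described above. The weights are illustrative, not trained values.
    """
    activations = inputs
    for weights, biases in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# A 2-input network with one hidden layer of 2 neurons and 1 output.
layers = [
    ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                    # output layer
]
output = forward([1.0, 0.5], layers)
```

Training is the search for the weight values in `layers` that make `output` match the observed targets as closely as possible across the training data.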

The goal in this model training optimization problem is to find the weights that minimize the error in model predictions given the training data, validation data, specified model configuration (number of hidden layers, number of neurons in each hidden layer), and regularization levels designed to reduce overfitting to the training data. One recently popular approach to solving for the weights is the stochastic gradient descent (SGD) algorithm. The performance of this algorithm, as with all optimization algorithms, depends on a number of control parameters, for which no single set of default values is best for all problems. SGD parameters include, among others, a learning rate controlling the step size for selecting new weights, a momentum parameter to avoid slow oscillations, a mini-batch size for sampling a subset of observations in a distributed environment, and adaptive decay and annealing rates to adjust the learning rate for each weight and over time. See the related blog post ‘Optimization for machine learning and monster trucks’ for more on the benefits and challenges of SGD for machine learning.
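To make the role of the learning rate and momentum concrete, here is a toy sketch in Python of gradient descent with momentum on a one-dimensional quadratic. The objective and all parameter values are illustrative assumptions standing in for a real training error surface.

```python
def sgd_momentum(grad, w0, learning_rate=0.1, momentum=0.9, steps=100):
    """Gradient descent with momentum on a single weight.

    learning_rate controls the step size for selecting new weights;
    momentum accumulates past steps to damp slow oscillations.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

# Toy objective: error(w) = (w - 3)^2, so grad(w) = 2 * (w - 3).
# With these settings, w converges toward the minimizer at w = 3.
w_star = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With too large a momentum or learning rate, the iterates overshoot and oscillate around the minimizer rather than settling into it, which is the behavior the momentum discussion below describes.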

Figure 2: momentum parameter

The best values of the control parameters must be chosen very carefully. For example, the momentum parameter dictates whether the algorithm oscillates slowly in the ravines where solutions lie, jumping back and forth across the ravine, or dives in quickly; but if momentum is too high, it can jump past the solution (Figure 2). The best values for these parameters also vary for different data sets, just as the ideal adjustments for an HDTV depend on the characteristics of its environment. These options, which must be chosen before model training begins, dictate not only the performance of the training process but, more importantly, the quality of the resulting model – again like the tuning parameters of a modern HDTV controlling the picture quality. Because these parameters are external to the training process – they are not the model parameters (the weights in the neural network) being optimized during training – they are often called ‘hyperparameters’. Settings for these hyperparameters can significantly influence the accuracy of the resulting predictive models, and there are no clear defaults that work well across different data sets.

In addition to the optimization options already discussed for the SGD algorithm, the machine learning algorithms themselves have many hyperparameters. Following the neural net example, the number of hidden layers, the number of neurons in each hidden layer, the distribution used for the initial weights, etc., are all hyperparameters specified up front for model training that govern the quality of the resulting model.

Finding the ideal values for hyperparameters – tuning a model to a given data set – has traditionally been a manual effort. Even with expertise in machine learning algorithms and their parameters, however, the best settings change with different data and are difficult to predict from previous experience. To explore alternative configurations, typically a grid search or parameter sweep is performed. But a grid search is often too coarse: because expense grows exponentially with the number of parameters and the number of discrete levels of each, a grid search will often fail to identify an improved model configuration. More recently, random search has been recommended. For the same number of samples, a random search covers the space better, but it can still miss good hyperparameter values and combinations, depending on the size and uniformity of the sample. A better approach is a random Latin hypercube sample. In this case, samples are exactly uniform across each hyperparameter but random in combinations. This approach is more likely to find good values of each hyperparameter, which can then be used to identify good combinations (Figure 3).

Figure 3: hyperparameter search
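A random Latin hypercube sample is easy to sketch. The Python code below (illustrative only, on the unit hypercube) splits each hyperparameter’s range into as many bins as there are samples, uses every bin exactly once per parameter, and pairs the bins randomly across parameters:

```python
import random

def latin_hypercube(n_samples, n_params, seed=0):
    """Random Latin hypercube sample on the unit hypercube.

    Each parameter's range is divided into n_samples equal bins; every
    bin is used exactly once per parameter (exactly uniform coverage),
    but bins are paired randomly across parameters (random
    combinations), unlike a grid.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_params):
        # One point per bin, jittered within its bin, then shuffled.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)
        columns.append(column)
    return list(zip(*columns))  # n_samples points in n_params dimensions

points = latin_hypercube(10, 2)
```

Scaling each coordinate from [0, 1) to a hyperparameter’s actual range (a learning rate interval, a count of hidden neurons, and so on) turns these points into candidate configurations with the uniformity-per-parameter property described above.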

True hyperparameter optimization, however, should allow searching between these discrete samples to find good combinations of hyperparameter values, because a discrete sample is unlikely to land on even a local accuracy peak or error valley in the hyperparameter space. But because training and scoring are a complex black box to the tuning algorithm, machine learning creates a challenging class of optimization problems:

• Machine learning algorithms typically include not only continuous, but also categorical and integer variables. These variables can lead to very discrete changes in the objective.
• In some cases, the space is discontinuous where the objective blows up.
• The space can also be very noisy and non-deterministic. This can happen when distributed data is moved around due to unexpected rebalancing.
• Objective evaluations can fail due to grid node failure, which can derail a search strategy.
• Often the space contains many flat regions – many configurations give very similar models.

An additional challenge is the unpredictable computation expense of training and validating predictive models with changing hyperparameter values. Adding hidden layers and neurons to a neural network can significantly increase the training and validation time, resulting in a wide range of potential objective expense. A very flexible and efficient search strategy is needed.

SAS Local Search Optimization, part of the SAS/OR® offering, is a hybrid derivative-free optimization strategy that operates in a parallel/distributed environment to overcome the challenges and expense of hyperparameter optimization. It comprises an extendable suite of search methods driven by a hybrid solver manager that controls concurrent execution of the search methods. Objective evaluations (different model configurations, in this case) are distributed across multiple evaluation worker nodes in a grid implementation and coordinated in a feedback loop supplying data from all concurrently running search methods. The strengths of this approach include handling of continuous, integer, and categorical variables; handling of nonsmooth, discontinuous spaces; and ease of parallelizing the search strategy. Multi-level parallelism is critical for hyperparameter tuning. For very large data sets, distributed training is necessary, and even with distributed training the expense of training severely restricts the number of configurations that can be evaluated when tuning sequentially. For small data sets, cross validation is typically recommended for model validation, a process that also increases the tuning expense. Parallel training (distributed data and/or parallel cross validation folds) and parallel tuning can be managed – very carefully – in a parallel/threaded/distributed environment. This is rarely discussed in the literature or implemented in practice; typically either ‘data parallel’ or ‘model parallel’ (parallel tuning) is exercised, but not both.
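The ‘model parallel’ half of this idea – evaluating several candidate configurations concurrently – can be sketched in a few lines of Python. This is not SAS’s implementation; the toy error surface and the `validation_error` stand-in are assumptions made so the example is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def validation_error(config):
    """Stand-in for training and scoring one model configuration.

    A real evaluation would train a model with these hyperparameters
    and return its validation error; this toy error surface simply
    prefers mid-range values so the example runs on its own.
    """
    return sum((value - 0.5) ** 2 for value in config.values())

def tune_parallel(configs, max_workers=4):
    """Evaluate candidate configurations concurrently ('model parallel'
    tuning) and return the best configuration with its error."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errors = list(pool.map(validation_error, configs))
    best = min(range(len(configs)), key=errors.__getitem__)
    return configs[best], errors[best]

candidates = [{"learning_rate": lr, "momentum": m}
              for lr in (0.1, 0.5, 0.9) for m in (0.0, 0.5, 0.9)]
best_config, best_error = tune_parallel(candidates)
```

In a full ‘data parallel’ plus ‘model parallel’ setup, each of these concurrent evaluations would itself fan out over distributed data or cross validation folds, which is the careful multi-level coordination described above.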

Optimization for hyperparameter tuning can typically lead quite quickly to a several percent reduction in model error over default parameter settings. More extensive optimization, facilitated through parallel tuning to explore more configurations, can refine the parameter values further still. Nor is the neural net example discussed here the only machine learning algorithm that can benefit from tuning: the depth and number of bins of a decision tree, the number of trees and number of variables to split on in a random forest or gradient boosted trees, the kernel parameters and regularization in SVM, and many more can all benefit from tuning. The more parameters that are tuned, the larger the dimensions of the hyperparameter space, the more difficult a manual tuning process becomes, and the coarser a grid search becomes. An automated, parallelized search strategy can also benefit novice machine learning users.

Machine learning hyperparameter optimization is the topic of a talk Funda Günes and I will present at The Machine Learning Conference (MLconf) in Atlanta on September 23. The talk, “Local Search Optimization for Hyperparameter Tuning,” includes more details on the approach, parallel training and tuning, and tuning results.

image credit: photo by kelly // attribution by creative commons

Machine learning fun at KDD

Who says machine learning can't be fun? A crew of us from SAS went to San Francisco for the recent KDD conference, which bills itself as "a premier interdisciplinary conference, [which] brings together researchers and practitioners from data science, data mining, knowledge discovery, large-scale data analytics, and big data." We brought these buttons with us, and they were a huge hit!

Polly and Simran setting up the booth

But we weren't at KDD just to have fun, of course. We came to learn and share, in our booth and in many other ways. Simran Bagga came to talk about all things text analytics, and she was nice enough to pitch in and help me set up the booth. Naturally, her favorite button was "I'm Feeling Unstructured Today." She gave two extended demos in the booth: "Combining Structured and Unstructured Data for Predictive Modeling Using SAS® Text Miner" and "Topic Identification and Document Categorizing Using SAS® Contextual Analysis."

Wayne Thompson served as a senior editor on the Review Board, which means he oversaw a group of volunteers who had the hard task of reviewing and making selections from the many excellent papers submitted for the Applied Data Science track. He was also a panelist in a "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data." His favorite button was "Talk Data to Me," which he did during his panel, "Internet of Things, Industrial Internet, and Instrumented Environments: the Furious Need for Standards." He also gave an extended demo in the SAS booth on "Machine Learning on the Go."

Udo is third from left on this panel

"Can Tools Effectively Unleash the Power of Big Data?" Udo Sglavo thinks so, and he said as much on a panel he joined in the Applied Data Science Invited Talks track. Having been involved in data mining for many years, Udo chose "I Support Vector Machines" as his favorite button. This button was popular, because it was also Wei Xiao's favorite. Wei was busy attending many sessions, but he did give his own extended demo in the booth on "A Probabilistic Machine Learning Approach to Detect Industrial Plant Fault."

Patrick in the booth

Susan Haller, who leads teams responsible for data mining and machine learning at SAS, had a different favorite button: "How Random are Your Forests?" The favorite of Ray Wright, on Susan's team, was "You Can Engineer My Features." Ray is interested in automation, too, which was the subject of his extended demo: "Modeling Automation With SAS® Enterprise Miner™ and SAS® Factory Miner." But Ray also focused on basketball, presenting a poster in the Large Scale Sports Analytics Workshop on "Shot Recommender System for NBA Coaches," which he co-authored with Ilknur Kaynar Kabul and Jorge Silva. Jorge didn't attend the conference, but Ilknur did, and her favorite button was "I'm Having a Cold Start Today." However, Ilknur was not having a cold start when she presented her extended demo: "Auto-Tuning Your Decision Tree, Random Forest and Neural Networks Models." Another member of Susan's team, Patrick Hall, spent a lot of time in the SAS booth, where he was great at answering all kinds of questions. He couldn't decide on a favorite button, though, because it was a tie between "I'm Feeling Unstructured Today" and "I Support Vector Machines." Patrick answered a lot of questions on options for integrating open source software with SAS, and this was the topic of his extended demo: "Options for Open Source Integration in SAS® Enterprise Miner™." Taiping He, also on Susan's team, liked "I'm Feeling Unstructured Today," which may be a surprise, because his extended demo was "Distributed Support Vector Machines in SAS® Viya™ System." Guess who develops our SVM procedure in SAS Enterprise Miner?

Scott with a smile on his face, as usual

KDD has a nice balance of practitioners and academics in attendance, so we were glad to interact with both groups. We met many students and professors in the booth, and Scott MacConnell was on hand from our Academic Outreach and Collaborations group to talk about all the great free resources SAS has to offer academics. Scott's favorite button was "I Am Feeling Unstructured Today."

Mural from the wall of the Stinking Rose restaurant

We made time for fun, too, and one night many of us ate dinner together at a restaurant called The Stinking Rose, which calls itself "A garlic restaurant." They had fun murals on the wall showing garlic in all kinds of ways you never even dreamed of! I had the Forty Clove Garlic Chicken, and even though I didn't eat anywhere near that number they provided, I do hope my choice didn't depress traffic in the booth. The food was delicious! And my favorite button? "My Networks Run Deep."

The Internet of medical things and of intern things

The internet of medical things, spurred by the advent of wearable sensors, has dramatic consequences for industry, healthcare, and analytics, just as the internet of things and analytics have consequences for education. When I began my internship at SAS in May, I knew little about the internet of things, the wearable sensors that make up the internet of medical things, or analytics, but I knew I wanted to use data for good, and I knew how to program.

This past summer I used data from cell phones attached at the waist to predict the activity of the owner, an exciting application of the internet of medical things. There are a number of immediate applications of this research: contextualizing electrocardiogram signals, improving exercise analysis, and assisting in care for the elderly. As an intern, my first assignment was simply to replicate the results from an existing activity recognition paper, using SAS/IML® to extract features from a time series and SAS® Enterprise Miner™ to produce an accurate model. As I mentioned earlier, I started my summer knowing how to program in a few languages, including SAS, but I didn't know what a time series of data was or how to program in IML, and I knew absolutely nothing about how to use a neural network model.

My first obstacle in my summer project on the internet of medical things was figuring out how I learn best. With SAS Enterprise Miner, at first I spent time going through the documentation to get a feel for the different nodes and their respective settings and options. This was helpful to a point, but what I discovered was that I learned best by testing different options and examining the results. I found this to be true in other parts of my research: when I spent time plotting the time series data, using different graph types, styles, and filters, I was able to understand my data at a deeper level. When extracting features from a time series, it is important to extract intuitive and meaningful features that capture a characteristic of the time series that would be evident if you looked at it in its entirety. This is almost impossible without spending some time examining the data. I think this is a common trend in our new age of data science and analytics: it's not about what you think the data should say, but about what the data are actually saying.
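The actual feature extraction was done in SAS/IML, but the idea of intuitive window features can be sketched in a few lines of Python. The features and the toy signal below are illustrative assumptions, not the feature set from the paper I replicated.

```python
import math

def window_features(window):
    """Summarize one window of sensor readings with simple features.

    Mean, standard deviation, and the number of mean crossings are the
    kind of intuitive features described above: each captures a
    characteristic you could see by plotting the series itself.
    """
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    crossings = sum(1 for a, b in zip(window, window[1:])
                    if (a - mean) * (b - mean) < 0)
    return {"mean": mean, "std": std, "mean_crossings": crossings}

# A toy high-activity window: rapid swings produce many mean crossings.
jittery = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
features = window_features(jittery)
```

Features like these, computed per window, become the input columns for a classifier such as the neural network model I built in SAS Enterprise Miner.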

Throughout this project I observed some characteristics of the internet of medical things, but I also learned what I call the Internet of Intern Things.

2. Collaboration is not only helpful, but imperative.
If I were to summarize my summer experience at SAS into one word, it would be “collaboration.” Collaboration was crucial to my summer project, and to navigating such a large organization as an intern. After giving my first presentation of my preliminary work on my summer project, several other interns contacted me and shared their projects, and we found overlaps. While I was working on modeling human activity using feature engineering with a goal of classifying healthy or unhealthy heartbeats, others were working on motif discovery and motif comparison.

These projects logically overlap in our ultimate motivation: classification of health signals. My project focused on extracting information from a time series, while others were examining the actual pattern of a time series in a pictorial sense. After realizing this overlap, we began to compare notes and share helpful resources for visualization. In my final intern presentation, I actually used a visualization application shared by a fellow intern. Our collaboration not only benefited our summer projects, but also reflected the culture of modern tech companies, which prize teamwork and shared effort. Moreover, it points to the central theme of the internet of things: everything is connected in some way, and thus should be used in tandem for the most efficient and accurate results.

My intern experience this summer has impacted my research focus, education plans, and career path. Another amazing opportunity that grew out of my summer experience is presenting a student e-poster at the 2016 SAS Analytics Experience conference in Las Vegas. Besides being able to present my research, I am also very excited about this opportunity because I will be able to hear a talk given by Jake Porway, founder and executive director of DataKind, an organization committed to using “data for good”, along with many other interesting talks, sessions, and demos.

Having an experience at SAS (my own personal Internet of Intern Things) in the middle of my college career was perfect timing. I realized that knowing mathematics, statistics, and computer science are very important, but recognizing the overlaps and interconnectedness of these disciplines is crucial, just as in the internet of things, and as I have found, in the internet of medical things.

Time series machine learning techniques in healthcare

Time series machine learning techniques show great promise for the analysis of health care wearable data. As our busy lifestyles make continuous monitoring more and more essential, the need to analyze these data streams and find correlations between them grows as well, because those correlations can provide important cues to people. These cues could be as simple as reminding a person to take a walk or move around, which many of today’s wearables from Fitbit, Garmin, Nike, and others already do. However, while these popular devices monitor the current state of an individual, they are not able to perform the complex predictions that correlate the captured information at a higher level or uncover causal relationships between the data streams. My research aims to develop advanced algorithms for analyzing time series data to estimate and predict physiological parameters (such as heart rate or respiration rate) from kinematic and physiological data. My current work applies time series machine learning techniques for greater insight.

I am currently a graduate student intern in machine learning at SAS and also a research assistant at the Center for Advanced Self-Powered Systems of Integrated Sensors and Technologies (ASSIST) at North Carolina State University. The ASSIST Center is a National Science Foundation-sponsored Nanosystems Engineering Research Center (NERC) that develops and employs nanotechnology-enabled energy harvesting and storage, ultra-low-power electronics, and sensors to create innovative, body-powered, wearable health monitoring systems. SAS is one of the ASSIST Center’s industry partners, and SAS’s insights on real-time data analysis have proven very helpful for our research. The motivation behind this research can be explained through a simple example: suppose an individual has a pre-existing condition like asthma, where the surroundings and their activities could trigger an attack. In such cases, predicting respiration rate in advance could be beneficial. For example, if the person is biking and plans to continue for another 20 minutes, the predicted respiration rate could help them decide whether to bike the full 20 minutes or cut the ride short to stay within healthy levels. The goal is to notify people about these parameters by identifying the right activities, which then become an index to predict the physiological parameters. In my research, I address the problem of identifying activities by creating hierarchical models to learn robust parameters, which is one application of time series machine learning techniques. In the near future we will be able to use these models to predict respiration rate and heart rate.

There have been numerous studies that use supervised learning for activity recognition, drawing on motion capture data and inertial measurements obtained from inertial measurement units (IMUs). An IMU is a device that measures and reports linear and angular motion of the body; one widely available example is a smartphone. Most of these studies use techniques such as feature extraction, clustering, and machine learning approaches for classification. Feature extraction techniques range from statistical moments of the data (e.g., mean, variance, kurtosis) to bag-of-words representations of poses and their temporal differences. Machine learning methods used include support vector machines (SVMs), neural networks, and probabilistic graphical models (e.g., hidden Markov models and conditional random fields). There are also approaches using semi-supervised techniques, and even unsupervised techniques that rely on clustering with user-defined similarity metrics to identify single activities. However, most of these approaches work only at a fixed scale. That is, they do not capture hierarchies in the activities, which are required to explain complex dependencies between activities. For example, a person’s arm swinging can be part of a simple activity, such as walking, or a complex activity, such as dancing. A two-level hierarchy has been captured through the computation of so-called motifs that compose activities. Higher-level hierarchies may also be essential but have not been carefully studied. The aim of this research is to capture these dependencies in a computationally efficient framework that provides a robust characterization of the existing hierarchical structures.

Topological tools for high-dimensional data analysis have gained popularity in recent years. These techniques often focus on tracking the homology of a space, a group structure that carries information about its connectivity and number of holes. Techniques such as persistent homology have been used for the analysis of point cloud data, quantifying the stability of the extracted features in a computationally efficient way via stability theorems. These techniques have been applied in a variety of settings, including the study of protein shapes, image analysis, and speech pattern analysis. For this research project we use topological data analysis to find robust parameters and build hierarchical graphical representations to classify activities.
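To make "tracking the homology of a space" concrete, here is a minimal Python sketch of zero-dimensional persistence for a 1-D signal: sweep the sublevel sets of the series from low values to high, let connected components be born at local minima, and record a (birth, death) pair whenever two components merge (the younger one dies, per the elder rule). This union-find toy is illustrative only, not the algorithm used in the project:

```python
def sublevel_persistence(f):
    """0-dimensional persistence pairs of a 1-D function on a path graph.

    Components are born at local minima of f and die when the growing
    sublevel set merges them; the component with the higher birth value
    dies (elder rule). Returns finite (birth, death) pairs; the global
    minimum's component never dies.
    """
    n = len(f)
    parent, birth = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: f[k]):
        parent[i], birth[i] = i, f[i]
        for j in (i - 1, i + 1):       # neighbors on the path graph
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    if birth[ri] < birth[rj]:
                        ri, rj = rj, ri  # ri is now the younger component
                    if birth[ri] < f[i]:  # skip zero-persistence pairs
                        pairs.append((birth[ri], f[i]))
                    parent[ri] = rj
    return pairs

# Two valleys separated by a bump: the shallower valley (born at 1) dies
# when the sweep reaches the bump at height 2
pairs = sublevel_persistence([0, 2, 1, 3])
```

Features with long lifetimes (death minus birth) are the "stable" ones that the stability theorems say are robust to noise.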

Our approach builds a hierarchical representation of the data streams by comparing segments of data over various window sizes. A graphical model is extracted by first clustering the segments over a fixed window size τ and then connecting clusters with sufficient overlap across τ values. The structure of the hierarchical graphical model depends on a clustering parameter ε. We propose a new methodology for selecting robust graphical structures from this data via the use of an aggregate version of the persistence diagram. We also provide a methodology for selecting parameter values for this representation based on inference performance and power consumption considerations.
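A toy version of this construction might look like the following sketch, where segments are summarized by their means, greedily clustered with a threshold ε, and clusters at a coarser window size are linked to the finer clusters they sufficiently overlap. The mean-based summaries, greedy clustering, and 0.5 overlap threshold are simplifying assumptions for illustration, not the method used in the research:

```python
def segment_clusters(ts, tau, eps):
    """Cut ts into non-overlapping windows of length tau and greedily cluster
    windows whose means are within eps of a cluster's first member.
    Each cluster records the set of time indices it covers."""
    clusters = []
    for start in range(0, len(ts) - tau + 1, tau):
        seg = ts[start:start + tau]
        m = sum(seg) / tau
        covered = set(range(start, start + tau))
        for c in clusters:
            if abs(c["mean"] - m) <= eps:
                c["members"] |= covered
                break
        else:
            clusters.append({"mean": m, "members": covered})
    return clusters

def link_levels(coarse, fine, min_overlap=0.5):
    """Connect a coarse-scale cluster to each fine-scale cluster whose
    covered time indices it overlaps by at least min_overlap."""
    edges = []
    for i, c in enumerate(coarse):
        for j, f in enumerate(fine):
            overlap = len(c["members"] & f["members"]) / len(f["members"])
            if overlap >= min_overlap:
                edges.append((i, j))
    return edges

ts = [0, 0, 0, 5, 5, 5, 0, 0, 0]
fine = segment_clusters(ts, tau=3, eps=1.0)    # two clusters: low and high
coarse = segment_clusters(ts, tau=9, eps=1.0)  # one cluster covering everything
edges = link_levels(coarse, fine)              # the hierarchy's graph edges
```

Varying τ over many values and repeating the linking step yields the layered graphical model; the persistence machinery then helps pick ε values for which this structure is stable.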

From our approach we are able to report the prediction accuracy for each of the activities in our dataset (walking, bicycling, sitting, golfing and waving). We also show how persistence diagrams can help reduce computation time and help choose stable models for our hierarchical representations. Some of the future work will involve testing this method on other datasets and comparing it with other existing algorithms.

I personally am really excited about the advantages wearable technologies provide us! They are changing lifestyles at a personalized level. Coming from a biomedical background, I have always wanted to work closely with wearable devices and understand how they can help us achieve better, healthier living. Being able to apply time series machine learning techniques from my current studies in electrical engineering to health care wearables leverages my biomedical experience in exciting new ways!

I’ll be presenting this work as an e-poster during the SAS Analytics Experience Conference in Vegas September 12-14, 2016, so look for me if you’re there to learn more!

Editor’s note: Namita was one of six winners of the e-poster competition offered at the conference, which meant she won a free trip to the event, so be sure to check out her work! This past summer Namita was also a SAS Summer Fellow in Machine Learning, a highly selective program SAS offers for PhD students each year.

Is Poker a Skill Game? A Panel Data Analysis

The annual SAS Analytics Conference is upon us again. This year it is known by a different name, Analytics Experience 2016, but the location, Las Vegas, is the same as it has been the previous two years. Just like last year, I will be attending and presenting on analytics for panel data using SAS/ETS® for econometrics and time series.

While preparing for my trip I was reminded of a paper I once read in Chance magazine (Croson, Fishman and Pope 2008) that concluded that poker, like golf, is a game of skill rather than luck.  The paper was published in 2008 during the heyday of televised poker, when it seemed that ESPN aired poker tournaments and little else.  The paper especially struck me because it quoted one of my favorite movies:

"Why do you think the same five guys make it to the final table of the World Series of Poker every year? What are they, the luckiest guys in Las Vegas?" – Mike McDermott (played by Matt Damon in Rounders)

Upon rereading the paper I realized the datasets the authors gathered followed a design for panel data.

Panel data occur when a set of individuals, or panel, is measured on several occasions. Panel data are ubiquitous across fields because they allow each individual to act as their own control group. That lets you focus on identifying causal relationships between response and regressor, knowing that you can control for all factors specific to the individual, both measured and unmeasured.

In Croson et al. (2008), the individuals were poker players whose results were recorded over multiple tournaments. The authors gathered two panel datasets, one for poker players and one for professional golfers. They surmised that if the associations you see for poker mimic those for golf, then you should conclude that poker, like golf, is a game of skill. After all, one would never theorize that Tiger Woods has won 14 major championships based purely on good karma.

Focusing on the data for poker, the authors gathered tournament results on 899 poker players. Because poker tournaments vary in the number of entries, only results in the top 18 were considered, and that number was chosen because it corresponds to the final two tables of 9 players each. The response was the final rank (1 through 18, lower being better) and the regression variables were three measures of previous performance. One such measure was experience, a variable indicating whether the player had a previous top 18 finish.

Among other similar analyses, the authors fit a least-squares regression of rank on experience:

$Rank_{ij} = \beta_{0} + \beta_{1} Experience_{ij} + \epsilon_{ij}$

where i represents the player and j the player’s ordered top-18 finish. From the analysis they found a statistically significant negative association between current rank and previous success. Because lower ranks are better, they concluded that good previous performance was associated with good present performance. Furthermore, the magnitude of the association was comparable to that found in the parallel analysis they performed for golf. They concluded that because you can predict current results from previous performance, in the same way you can with golf, poker must be a skill game.

The authors used simple least-squares regression, with the only adjustment for the panel design being that they calculated "cluster-robust" standard errors that controlled for intra-player correlation. They did not directly consider whether there were any player effects in the regression.

After obtaining the data, I used PROC PANEL in SAS/ETS to explore this issue.  I considered three different estimation strategies applied to the previous regression. PROC PANEL compactly summarized the results as follows:

The OLS Regression column precisely reproduces the analysis of Croson et al. (2008) and shows a significant negative association between current rank and previous experience.  The Within Effects column is from a fixed-effects estimation that utilizes only within-player comparisons. You can interpret that coefficient (0.39) as the effect of experience for a given player. Conversely, the Between Effects column is from a regression using only player-level means, that is, the estimator uses only between-player comparisons. Because the estimator of the within effect for experience is not significant and that for the between effect is strongly significant, you can conclude the data exhibit substantial latent player effects. That is not surprising, because measures of player ability (technical, psychological or mystical) weren’t included in the model.
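The three estimation strategies differ only in which variation in the data they use, which a stripped-down sketch makes concrete. The tiny two-player panel below is invented purely for illustration; the real analysis uses PROC PANEL on the tournament data:

```python
def slope(x, y):
    """Least-squares slope of y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# (experience, rank) observations per player -- hypothetical data
panel = {
    "A": [(1, 4), (1, 5), (0, 9)],
    "B": [(0, 12), (0, 10), (1, 8)],
}

# Pooled OLS: treat all observations as if they were independent
x = [e for obs in panel.values() for e, _ in obs]
y = [r for obs in panel.values() for _, r in obs]
pooled = slope(x, y)

# Within (fixed effects): demean within each player, so only
# within-player variation identifies the coefficient
xw, yw = [], []
for obs in panel.values():
    me = sum(e for e, _ in obs) / len(obs)
    mr = sum(r for _, r in obs) / len(obs)
    xw += [e - me for e, _ in obs]
    yw += [r - mr for _, r in obs]
within = slope(xw, yw)

# Between: regress player-level means on player-level means, using
# only between-player variation
xb = [sum(e for e, _ in obs) / len(obs) for obs in panel.values()]
yb = [sum(r for _, r in obs) / len(obs) for obs in panel.values()]
between = slope(xb, yb)
```

When the within and between slopes diverge sharply, as they do in the poker data, that is the signature of latent individual effects that pooled OLS silently averages over.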

The augmented analysis does nothing to invalidate the Croson et al. (2008) conclusion that poker involves more skill than luck. However, to believe that premise you must begin with the untested (yet reasonable) assumption that luck is something that, even if it plays a factor in one tournament, cannot be maintained over a career. You must rely on common sense, not the data at hand, to rule out luck as a latent (and mystical) player ability. With that question settled, the data go on to indicate that luck is not even a factor for single tournaments, each of which can be thought of as a long-run realization of hundreds of poker hands.

The PROC PANEL output merely furthers the point that some poker players (like their golfing counterparts) are just better at their craft than others.

Then again, maybe they really are the luckiest guys in Vegas.

If you are curious to know more about panel data, what’s available in SAS, and how it may be applied, you can catch my theater presentation (that’s just a fancy way to say "talk"), "Modeling Panel Data: Choosing the Correct Strategy," at the SAS Analytics Experience conference September 12-14 in Vegas. I'll be speaking on Wednesday, September 14, 1:15 PM - 2:00 PM. You will not catch me at the poker tables, however. My poker game stinks.

References:

Croson, R., P. Fishman, and D. G. Pope. 2008. Poker Superstars: Skill or Luck? Similarities between golf (thought to be a game of skill) and poker. Chance 21(4): 25-28.

SAS Institute. The PANEL Procedure. SAS/ETS® 14.1 documentation.

Spatial econometric modeling using PROC SPATIALREG

In our previous post, Econometric and statistical methods for spatial data analysis, we discussed the importance of spatial data. For most people, understanding that importance is relatively easy because spatial data are often found in our daily lives and we are all accustomed to analyzing them. We can all relate to the first law of geography—“Everything is related to everything else, but near things are more related than distant things”—and we can agree that our interaction with close things around us plays an important role in our decision process. Applications of spatial data in our daily lives are often seamless, and you could argue that we are all spatial statisticians and econometricians without even realizing it. Although most human beings have an innate ability to incorporate spatial information, computer-based analytics need to be given tools to include such information in their analyses. SAS/ETS 14.2 introduces one such tool, the SPATIALREG procedure, which enables you to include spatial information in the analysis and improve the econometric inference and statistical properties of estimators.

In this post, we discuss how you can use the SPATIALREG procedure to analyze 2013 home value data in North Carolina at the county level. The five variables in the data set are county (county name), homeValue (median value of owner-occupied housing units), income (median household income in 2013 in inflation-adjusted dollars), bachelor (percentage of people with a bachelor’s degree or higher who live in the county), and crime (rate of Crime Index offenses per 100,000 people). The data for home values, income, and bachelor’s degree percentages in each county were obtained from the website of the United States Census Bureau and computed using the 2009–2013 American Community Survey five-year estimates. Data for crime were retrieved from the website of the North Carolina Department of Public Safety. For numerical stability and ease of interpretation, the four numeric variables are log-transformed during data cleansing. We use this data set to demonstrate the modeling capabilities of the SPATIALREG procedure and to understand the impact of household income, crime rate, and educational attainment on home values.

As a preliminary data analysis, we first show a map of North Carolina that depicts the county-level home values in Figure 1. It is easy to see that the home values tend to be clustered together. Higher values are found in the coastal, urban, and mountain areas of North Carolina and lower home values can be found in rural areas. Home values of neighboring counties more closely resemble each other than home values of counties that are far apart.

Figure 1: Median value of owner-occupied housing units

From a modeling perspective, findings from Figure 1 suggest that the data might contain a spatial dependence, which needs to be accounted for in the analysis.  In particular, an endogenous interaction effect might exist in the data—home values tend to be spatially correlated with each other. PROC SPATIALREG enables you to analyze the data by using a variety of spatial econometric models.

Table 1: parameter estimates for a linear regression model

To lay the groundwork for discussion, you can start the analysis with a linear regression. For this model, the value of Akaike’s information criterion (AIC) is –106.12. The results of parameter estimation from a linear regression model, shown in Table 1, suggest that three predictors—income, crime, and bachelor—are all significant at the 0.01 level. Moreover, crime exerts a negative impact on home values, indicating that high crime rates reduce home values. On the other hand, both income and bachelor have positive impacts on home values.

Figure 2 provides the plot of predicted homeValue from the linear regression model. Although the comparison of Figure 1 and Figure 2 might suggest that predicted homeValue from the linear regression model captures the general pattern in the observed data, you need to be careful about some underlying assumptions for linear regression. Among those assumptions, a critical one is that the values of the dependent variable are independent of each other, which is not likely for the data at hand. As a matter of fact, both Moran’s I test and Geary’s C test suggest that there is a spatial autocorrelation in homeValue at the 0.01 significance level. Consequently, if you ignore the spatial dependence in the data by fitting a linear regression model to the data, you run the risk of false inference.
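Moran's I, mentioned above, measures whether similar values cluster among neighbors: it is the spatially weighted cross-product of deviations from the mean, scaled by the total weight and the variance, so positive values indicate that neighbors resemble each other. Here is a minimal sketch on a toy chain of "counties"; the values and adjacency are invented for illustration, and in practice you would use the built-in tests rather than roll your own:

```python
def morans_i(x, w):
    """Moran's I for values x and a symmetric 0/1 adjacency matrix w.

    I = (n / S0) * sum_ij w_ij * (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2,
    where S0 is the sum of all weights.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Six areas on a chain; low values cluster at one end, high at the other
x = [1, 1, 1, 5, 5, 5]
w = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
I = morans_i(x, w)  # positive: neighbors resemble each other
```

Under spatial independence the expected value of I is -1/(n-1), close to zero, so a value well above that (as for the home value data) signals positive spatial autocorrelation.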

Figure 2: predicted median value of owner-occupied housing units
using a linear regression model

Because of the spatial dependence in homeValue, a good candidate model to consider might be a spatial autoregressive (SAR) model, for its ability to accommodate the endogenous interaction effect. You can use PROC SPATIALREG to fit a SAR model to the data. Before you proceed with model fitting, you need to provide a spatial weights matrix. Generally speaking, a spatial weights matrix summarizes the spatial neighborhood structure; entries in the matrix represent how much influence one unit exerts over another.

Table 2: parameter estimates for a SAR model

The specification of the spatial weights matrix is of vital importance in spatial econometric modeling. There are many different ways to specify such a matrix, and results can be sensitive to the choice. Without delving into the nitty-gritty of that choice, you can simply define two counties to be neighbors of each other if they share a common border. After creating the spatial weights matrix, you can feed it into PROC SPATIALREG and run a SAR model. Table 2 presents the results of parameter estimation from a SAR model.
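To make the border-contiguity idea concrete, here is a small Python sketch that builds a row-standardized weights matrix from a list of neighboring pairs. The three-county example is made up; in practice the matrix covers all 100 North Carolina counties and is supplied to PROC SPATIALREG:

```python
def contiguity_weights(names, borders, row_standardize=True):
    """Build a spatial weights matrix: w[i][j] = 1 if the two areas share a
    border. With row standardization each row sums to 1, so multiplying W by
    a variable gives the average of each area's neighbors."""
    idx = {name: i for i, name in enumerate(names)}
    n = len(names)
    w = [[0.0] * n for _ in range(n)]
    for a, b in borders:
        w[idx[a]][idx[b]] = 1.0
        w[idx[b]][idx[a]] = 1.0  # sharing a border is symmetric
    if row_standardize:
        for row in w:
            s = sum(row)
            if s > 0:
                for j in range(n):
                    row[j] /= s
    return w

# Three counties in a row: A borders B, and B borders C
W = contiguity_weights(["A", "B", "C"], [("A", "B"), ("B", "C")])
```

Row standardization is a common convention because it keeps the spatial autoregressive coefficient ρ on an interpretable scale, but it is one of the choices results can be sensitive to.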

For this model, the value of AIC is –110.79. The regression coefficients that correspond to income, crime, and bachelor are all significantly different from 0 at the 0.01 level of significance. Both income and bachelor exhibit a significantly positive short-run direct impact on home values. In contrast, crime rate shows a significantly negative short-run direct impact on home values. In addition, the spatial autoregressive coefficient ρ is significantly different from zero at the 0.01 level, suggesting a significantly positive spatial dependence in home values.

Figure 3 shows the predicted values for homeValue from the SAR model. Comparing Figures 1 and 3 suggests that the fitted home values capture the trend in the data reasonably well.

Figure 3: predicted median value of owner-occupied housing units using a SAR model

In this post, we introduced the SPATIALREG procedure, fit a SAR model, and compared predicted values from the SAR model to those from linear regression. Even though the SAR model presented an improvement over the linear model in terms of AIC, many other models are available in the SPATIALREG procedure that might provide even more desirable results and more accurate predictions. These include the spatial Durbin model (SDM), spatial error model (SEM), spatial Durbin error model (SDEM), spatial autoregressive combined (SAC) model, spatial autoregressive moving average (SARMA) model, spatial moving average (SMA) model, and so on. In the next post, we will discuss their features and show you how to select the most suitable model for the home value data set. We will also be giving a talk, "Location, Location, Location! SAS/ETS® Software for Spatial Econometric Modeling," at the SAS Analytics Experience conference September 12-14, 2016 in Las Vegas, so stop by and let's talk spatial!

This post was co-written with Jan Chvosta.

The benefits of artificial intelligence

Photo courtesy of U.S. Luggage, Briggs & Riley

Asking about the benefits of artificial intelligence and machine learning reminds me a little of the transition to suitcases with wheels. Do you remember lugging around those old suitcases? If not, good for you; this original advertisement from US Luggage will take you back! Thank Bernard Sadow for persisting with his idea to add wheels, because when he pitched it, people thought he was crazy. Surely no one would want to pull their own suitcase? His patent application stated, “Whereas formerly, luggage would be handled by porters and be loaded or unloaded at points convenient to the street, the large terminals of today, particularly air terminals, have increased the difficulty of baggage-handling….Baggage-handling has become perhaps the biggest single difficulty encountered by an air passenger.”

We can wheel our own suitcases these days, but baggage handling is still a challenge for airlines. One of the benefits of artificial intelligence and machine learning is the improvements companies like Amadeus are making to baggage handling in airports to reduce the risk of lost bags. And to improve the overall customer experience of moving through Frankfurt Airport, its operator Fraport uses predictive modeling from SAS, part of the extensive set of machine learning capabilities from SAS.

I hear plenty of verbal and online chatter predicting that artificial intelligence and machine learning will eliminate jobs. But a review of history shows that many such past predictions have not come true. Remember the introduction of ATMs? The expectation was that bank tellers would become an anachronism, but in fact demand for tellers has grown faster than average. Automation reduced the number of tellers needed per bank, but the savings allowed banks to open new branches, thus stimulating demand for tellers.

The same pattern repeated with the introduction of grocery store scanners (and cashiers) and electronic document discovery (and paralegals). Today your friendly bellhop still greets you at the hotel as you roll your suitcase to the entrance: the US Bureau of Labor Statistics predicts average growth in demand for baggage porters and bellhops. I believe that the benefits of artificial intelligence and machine learning include increased productivity that will lead to job creation. Plenty of enthusiastic electronic ink has been spilled about the benefits of artificial intelligence and machine learning for business, so I’m going to focus on another reason why I’m excited about this field: the public benefit in areas like our health, economic development, the environment, child welfare, and public services.

Machine learning and artificial intelligence help use data for good

In a blog post on LinkedIn, Microsoft CEO Satya Nadella envisions a future where computers and humans work together to address some of society’s biggest challenges. Instead of believing computers will displace humans, he argues that at Microsoft “we want to build intelligence that augments human abilities and experiences.” He understands the trepidation some have about jobs and even the supposed Singularity (the idea that machines will run amok and take over), writing “…we also have to build trust directly into our technology” to address privacy, transparency, and security. He cites an example of the social benefits of machine learning and artificial intelligence: a young Microsoft engineer who lost his sight at an early age works with his colleagues to build what is essentially a mini-computer worn like glasses that gives him information in an audible form he can consume.

Nadella's young colleague is one of many examples of machine learning and artificial intelligence making fantastic advances for people with disabilities in the form of health care wearables and prosthetics. Health care is replete with examples, as deep learning and other techniques show rapid gains on humans for diagnosis. For example, the deep learning startup Enlitic makes software that in trials was 50% more accurate than humans in classifying malignant tumors, with no false negatives (i.e., saying that scans show no cancer when in fact there is malignancy) when tested against three expert human radiologists (who produced false negatives 7% of the time). In the field of population health management, AICure makes a mobile phone app that increases medication adherence among high-risk populations using facial recognition and motion detection. Their technology makes sure that the right person is taking the right medication at the right time.

Nonprofits have also been drawn to the benefits of artificial intelligence and machine learning, such as DataKind, which “harnesses the power of data science in the service of humanity.” In a project with the nonprofit GiveDirectly, DataKind volunteers worked on an algorithm to classify satellite images to identify the poorest households in rural villages in Kenya and Uganda. A team from SAS is working with DataKind and the Boston Public Schools to improve transportation for their students, using optimization. Thorn: Digital Defenders of Children uses technology and innovation to fight child sexual exploitation. Much of the trafficking is done online, so analysis of chatter, images, and other data can aid in identifying the children and their predators.

Trafficking in elephant ivory leads to an estimated 96 elephant deaths every day, but a machine learning app is helping wildlife patrols predict the best routes to track poachers. The app drew on 14 years of poaching activity data, produces randomized routes so poachers can be foiled, and learns from new data as it is entered. So far its routes have outperformed those of previous ranger patrols. Protection Assistant for Wildlife Security (PAWS) was developed by Milind Tambe, a professor at the University of Southern California, based on security game theory. Tambe has also built these kinds of algorithms for federal agencies like Homeland Security, the Transportation Security Administration, and the Coast Guard to optimize the placement of staff and surveillance to combat smuggling and terrorism.

Machine learning and artificial intelligence in the public sector

Other public sector organizations also realize the benefits of artificial intelligence and machine learning. The New York Police Department has developed the Domain Awareness System, which uses sensors, databases, devices, and more, along with operations research and machine learning, to put updated information in the hands of cops on the beat and at the precincts. Delivering this information even faster than the dispatchers can means cops are better prepared when they arrive on the scene. Teams from the University of Michigan’s Flint and Ann Arbor campuses are working with the City of Flint to use machine learning and predictive algorithms to predict where lead levels are highest and to build an app that gives both residents and city officials resources to better identify issues and prioritize responses. It took a lot of work to gather all the disparate information together, but interestingly, their initial findings indicate that the trouble is not in the water lines themselves but in individual homes, although the distribution of the problems doesn’t cluster as you’d expect.

These are just a few of the many examples of the social benefits of artificial intelligence and machine learning, but they illustrate why I’m excited about their potential to improve our society. Automation fueled by artificial intelligence is likely to result in what economists call "structural unemployment," when there is a mismatch between the skills some workers have and those the economy demands, typically a result of technological change. This disruption is undoubtedly devastating for those who lose their jobs, and I believe as a society we have an obligation to provide workforce development programs and training to help those impacted shift to new skills. But I am hopeful that machine learning will be able to offer help to those disrupted by these changes.

And it may even offer job opportunities. SAS is working with our local Wake Technical Community College, which has launched the nation's first Associate's Degree in Business Analytics, fueled in part by a grant from the US Trade Adjustment Assistance Community College and Career Training initiative. Wake Tech will also offer a certificate program aimed at displaced or underemployed workers, who will earn a certificate of training after completing 12 credit hours. While these graduates will not likely start off doing machine learning, they may move in that direction, and at a minimum contribute to teams that do use these methods.

And LinkedIn uses machine learning extensively, for recommendations, image analysis, and more, but through their Economic Graph and LinkedIn for Good initiatives the company aims to connect talent to opportunities by filling in gaps in skills. In partnership with the Markle Foundation, their new LinkedIn Cities program offers training for middle-skill workers, those with a high school diploma and some college but no degree, and is piloting in Phoenix and Denver. The combination of online and offline tools with connections to educators and employers will help these individuals improve their opportunities.

SAS will highlight the data for good movement at our upcoming Analytics Experience conference in Las Vegas September 12-14. Jake Porway, the Founder and Executive Director of DataKind, will be one of the keynote speakers. My colleague Jinxin Yi will be giving a super demo on the SAS/DataKind project I mentioned that aims to improve transportation for the Boston Public Schools. His session is one of several that have been tagged in the program as Data for Good sessions. We’ll have a booth where you can learn more and get engaged with #data4good. Stop by and say hi to me if you're there!

Suitcase image credit: Photo courtesy of U.S. Luggage, Briggs & Riley
Bank teller image credit: photo by AMISOM Public Information // attribution by creative commons
Xray image credit: photo by Yale Rosen // attribution by creative commons
Elephants image credit: photo by Michele Ursino // attribution by creative commons
NYPD image credit: photo by Justin Norton // attribution by creative commons
Bus image credit: photo by ThoseGuys119 // attribution by creative commons

Machine learning applications for NBA coaches and players

Machine learning applications for NBA coaches and players might seem like an odd choice for me to write about. Let us get something out of the way: I don’t know much about basketball. Or baseball. Or even soccer, much to the chagrin of my friends back home in Europe. However, one of the perks of working in data science and machine learning is that I can still say somewhat insightful things about sports, as long as I have data. In other words, instant expertise! So with that expertise I’ll weigh in to offer some machine learning applications for basketball.

During a conversation with my good colleague Ray Wright, who does know quite a bit about basketball and had been looking at historical data from NBA games, we suddenly realized something about player shooting. There are dozens of shot types, ranges and zones… and no player ever tries them all. What if we could automatically suggest new shot combinations appropriate for each individual player? Who knew there could be machine learning applications for the NBA?

Such a system that suggests actions or items to users is called a recommender system. Large companies in retail and media regularly use recommender systems to suggest movies, songs and other items to users based on their behavior history, as well as that of other similar users, so you've likely used such a system from Amazon, Netflix, etc. In basketball terms, the users are the players, and the items are shot types. As with the other domains mentioned above, the available data does not even come close to covering all possible combinations, in this case of players and shots. Data that matches this scenario is called sparse. And fortunately, SAS has a new offering, SAS® Viya™ Data Mining and Machine Learning, that includes a new method specifically designed for sparse predictive modeling: PROC FACTMAC, for factorization machines.
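To get a feel for just how sparse this setting is, here's a quick back-of-the-envelope calculation using the counts reported later in this post (359 players, 4 ranges, 5 zones, 36 action types, and 174,190 recorded shots):

```python
# Sparsity check using the counts quoted in this post.
players, ranges, zones, actions = 359, 4, 5, 36
possible_combinations = players * ranges * zones * actions

# Even if every one of the 174,190 recorded shots were a distinct
# combination, a third of the grid would still be unobserved -- and in
# practice popular combinations repeat, so coverage is far lower.
shots = 174_190
coverage_upper_bound = shots / possible_combinations

print(f"possible combinations: {possible_combinations:,}")   # 258,480
print(f"coverage is at most {coverage_upper_bound:.0%}")
```

Most (player, shot) cells are empty, which is exactly the regime where factorization-based methods shine over models that need dense observations.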

Let me quickly introduce you to factorization machines. Originally proposed by Steffen Rendle (2010), they are a generalization of matrix factorization that allows multiple features with high cardinality (lots of unique values) and sparse observations. The parameters of this flexible model can be estimated quickly, even in the presence of massive amounts of data, through stochastic gradient descent, which is the same type of optimization solver behind the recent successes of deep learning.
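To make the idea concrete, here is a minimal numpy sketch (not PROC FACTMAC itself) of the model in the simplest two-field case: with one-hot player and action features, the factorization machine prediction reduces to w0 + w_player + w_action + a dot product of latent factors, trained by stochastic gradient descent on squared error. The toy data and sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_players, n_actions, k = 5, 3, 4          # toy sizes; the post uses k = 25 factors
w0 = 0.0                                   # global bias
w_p = np.zeros(n_players)                  # player biases
w_a = np.zeros(n_actions)                  # action biases
v_p = rng.normal(0, 0.1, (n_players, k))   # player latent factors
v_a = rng.normal(0, 0.1, (n_actions, k))   # action latent factors

def predict(p, a):
    """FM prediction for one (player, action) pair with one-hot features:
    the pairwise interaction term collapses to a single dot product."""
    return w0 + w_p[p] + w_a[a] + v_p[p] @ v_a[a]

# Invented training data: (player index, action index, target log-odds).
data = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 0.5), (2, 2, -0.5)]

lr = 0.05
for epoch in range(200):
    for p, a, y in data:
        err = predict(p, a) - y            # gradient factor for squared loss
        w0 -= lr * err
        w_p[p] -= lr * err
        w_a[a] -= lr * err
        grad_vp = err * v_a[a]             # chain rule through the dot product
        v_a[a] -= lr * err * v_p[p]
        v_p[p] -= lr * grad_vp

mse = np.mean([(predict(p, a) - y) ** 2 for p, a, y in data])
print(f"training MSE after SGD: {mse:.4f}")
```

The update loop touches only the parameters of the features present in each observation, which is why the method stays fast even on massive, sparse data.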

Factorization Machines return bias parameters and latent factors, which in this case can be used to characterize players and shot combinations. You can think of a player’s bias as the overall propensity to successfully score, whereas the latent factors are more fine-grained characteristics that can be related to play style, demographics and other information.

Armed with this thinking, our trusty machine learning software from SAS, and some data science tricks up our sleeves, we decided to try our hand at machine learning applications in the form of automated basketball coaching (sideline yelling optional!). Before going into our findings, let’s take a look at the data. We have information about shots taken during the 2015-2016 NBA basketball season through March 2016. A total of 174,190 shots were recorded during this period. Information recorded for each shot includes the player, shot range, zone, and style (“action type”), and whether the shot was successful. After preprocessing we retained 359 players, 4 ranges, 5 zones, and 36 action types.
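The post doesn't detail the preprocessing, but a common scheme for feeding nominal inputs to a factorization machine gives each field (player, range, zone, action type) its own contiguous block of one-hot indices, so every shot activates exactly four out of 404 features. A hypothetical sketch using the cardinalities above:

```python
# Hypothetical sparse encoding of one shot record; the field names and
# cardinalities come from the counts quoted in this post.
fields = {
    "player": 359,
    "range": 4,
    "zone": 5,
    "action": 36,
}

# Assign each field a contiguous block of feature indices.
offsets, total = {}, 0
for name, cardinality in fields.items():
    offsets[name] = total
    total += cardinality

def encode(player, shot_range, zone, action):
    """Return the four active feature indices for one shot."""
    values = {"player": player, "range": shot_range,
              "zone": zone, "action": action}
    return [offsets[f] + v for f, v in values.items()]

print(total)                # 404 features in total
print(encode(10, 2, 0, 7))  # [10, 361, 363, 375]
```

Each observation is then just these four indices plus a 0/1 outcome (or its log-odds), which is the sparse format the model training above consumes.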

And here is what we found after fitting a factorization machine model. First, let's examine some established wisdom: does height matter much for shot success? As the box and whisker plot below shows, the answer is yes, somewhat, but not quite as much as one would think. The figure depicts the distribution of bias values for players, grouped by their height. There is a bump for the 81-82 inch group, but it is not overwhelming. And it decays slightly for the 82-87 inch group.

Now look at the following figure, which shows made shots (red) vs missed (blue), by location in the court and by action type. There is definitely a very significant dependency! Now if only someone explained to me again what a “driving layup” is…

Let us investigate the biases again, now by action type. The following figure shows the bias values in a horizontal bar plot. It is clear that all actions involving “dunk” lead to larger bars, corresponding to greater probability of success.

What about other actions? What should we recommend to the typical player? That is what the following two tables show.

Most recommended shots

Least recommended shots

Based on the predicted log-odds, the typical player should strive for dunk shots and avoid highly acrobatic and complicated actions, or highly contested ones such as jump shots. Now, of course, not all players are "typical." The following figure shows a 2D embedding of the fitted factors for players (red) and actions (blue). There is significant affinity between Manu Ginobili and driving floating layups. Players Ricky Rubio and Derrick Rose exhibit similar characteristics based on their shot profiles, as do Russell Westbrook and Kobe Bryant, and others. Also, dunk shot action types form a grouping of their own!
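A figure like that comes from projecting the fitted 25-dimensional factor vectors down to the plane. The post doesn't say which embedding method was used, so purely as an illustration, here is a minimal PCA projection with numpy on stand-in factor vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for fitted latent factors: one row per player or action type,
# one column per factor dimension (k = 25 in the post's model).
factors = rng.normal(size=(20, 25))

# PCA via SVD: center the rows, then project onto the top two right
# singular vectors to get 2D coordinates for plotting.
centered = factors - factors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
embedding = centered @ Vt[:2].T        # shape (20, 2)

print(embedding.shape)
```

Players and actions that sit near each other in this plane have similar factor vectors, which is exactly the affinity the figure visualizes.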

Overall, our 25-factor factorization machine model is successful in predicting log-odds of shot success with high accuracy: RMSE=0.929, which outperforms other models such as SVMs. Recommendations can be tailored to specific players, and many different insights can be extracted. So if any NBA coaches or players want to call about our applications of machine learning for basketball we are available for consultation!

We are delighted that this analysis has been accepted for presentation at the 2016 KDD Large-Scale Sports Analytics workshop this Sunday, August 14, where Ray will be representing our work with this paper: "Shot Recommender System for NBA Coaches." And my other colleague (and basketball fan), Brett Wujek, will be giving a demo theater presentation on “Improving NBA Shot Selection: A Matrix Factorization Approach” at the SAS Analytics Experience Conference September 12-14, 2016 in Las Vegas.

Surely, many basketball experts will be able to give us good tips to augment our applications of machine learning for the NBA. One thing is certain, though: when in doubt, always dunk!

Reference
Rendle, S. (2010). Factorization Machines. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM).