Of Big Data Competitions, Sports Analytics, and the Internet of Bugs

It is said that everything is big in Texas, and that includes big data. During my recent trip to Austin I had the privilege of being a judge in the final round of the Texata Big Data World Championship, a fantastic example of big data competitions.

It felt fitting that I arrived on the day of a much-anticipated University of Texas Longhorns game and witnessed the city awash with college students proudly wearing burnt-orange shirts. Their enthusiasm notwithstanding, my personal sport of choice is not really football but rather big data competitions! And I saw plenty of competitiveness in this particular venue.

Texata is quite new, this being only its second year, but it has already generated significant buzz among big data competitions. Competitors come from all over the world and tend to be seasoned professionals or graduate students from prestigious university programs. By the time they reach the finals, they have already undergone two intense rounds of question answering, data exploration and coding. Unlike other competitions where teams have months to play with a dataset (often preprocessed and curated), and get ranked based on very specific quantitative criteria, in the Texata finals each individual participant is given a real dataset on the spot and only four hours to work with it and extract some sort of meaningful value. This year’s dataset was a collection of customer support tickets and chat logs from Cisco, in whose facility the finals took place. This closely resembles the real world of a data scientist... messy unstructured data, open problem definitions, and a running clock.

Not having a leaderboard, as some other big data competitions do, means that the judges must evaluate the candidates and pick the winner. This was a tough choice, given twelve very talented people, many having traveled from other continents, who put in a large effort and gave their very best. All of us on the judging panel took the responsibility very seriously. At the same time, it was sheer fun to see how each candidate took their own approach, reaching into a large toolbox: latent semantic analysis, clustering, multi-dimensional scaling, and graph algorithms, for example. Some contestants focused on categorizing, others on visualizing, and yet others on inferring causal relations. Every single solution yielded some unique and valuable insight into the data.

At the end of the day, the winner was Kristin Nguyen from Vietnam. Her analysis had the best balance between technical soundness of the code, variety of techniques and presentation clarity. Plus, in 2014 she had already placed second, so this was no fluke. Well deserved, Kristin!

As an added treat, on the following day I got to speak at the companion Texata Summit event. That gave me the chance to show off some exciting examples of using SAS in sports analytics, such as season ticket customer retention in football (both the American and European versions of it!). Also baseball – remember the 2011 movie Moneyball? Scouting continues to be a major application of analytics, allowing small teams to punch well above their weight. Many other sports use analytics, from basketball to Olympic rowing.

Perhaps most exciting of all, there are novel frontier areas identified in the comprehensive report “Analytics in Sports: The New Science of Winning” by Thomas Davenport. For instance, image and video data can be used for crowd management in a stadium, or to track players in the field. In other cases, athlete performance monitoring is of interest. This allowed me to slightly lift the veil on new R&D work related to images, video and wearable sensors:

[Images: slide1, slide2]


Thanks to SAS’s ongoing collaboration with Prof. Edgar Lobaton and Prof. Alper Bozkurt of North Carolina State University, involving multiple groups within the Advanced Analytics division of R&D at SAS, I am now aware that golf is actually a rather stressful sport! By looking at EKG activity, it is apparent that the heart rate goes up to enormous levels in the moments before a swing. Also, while wearables and the Internet of Things are hot topics right now, we should all keep an eye on Profs. Lobaton and Bozkurt’s other work - I like to call it the Internet of Bugs:


As featured in the New Scientist and Popular Science, these cyborg-augmented hissing cockroaches can be instrumental in search and rescue operations. Responders can steer them by applying electrical impulses to their antennas and locate potential survivors in rubble via directional microphones and positioning sensors. SAS has very strong tools that are uniquely suited for processing and analyzing this type of streaming data – for example, SAS® Event Stream Processing can acquire real-time sensor signals, while SAS® Forecast Server and SAS® Enterprise Miner™ can perform signal filtering, detect cycles and spikes, and analyze the aggregate position coordinates of the insects, for example to map a structure and find locations of interest.

After the presentation, #Texata and @SASSoftware Twitter traffic contained multiple variations of the words “weird” and “awesome” - which is a fitting description of data science itself. Truly you never know where your data will come from!



Decision tree learning: What economists should know

As an economist, I started at SAS with a disadvantage when it comes to predictive modeling. After all, like most economists, I was taught how to estimate marginal effects of various programs, or treatment effects, with non-experimental data. We use a variety of identification assumptions and quasi-experiments to make causal interpretations, but we rarely focus on the quality of predictions. That is, we care about the “right-hand side” (RHS) of the equation. For those trained in economics in the past 20 years, predictive modeling was regarded as “data mining,” a dishonorable practice.

Since beginning my journey at SAS I have been exposed to many practical applications of predictive modeling that I believe would be valuable for economists. In this and a series of future blogs, I will write about how I think about the “left hand side” (LHS) of the equation with respect to the tools of predictive modeling. Up first: decision tree learning.


Decision tree learning using the HPSPLIT procedure in SAS/STAT

Decision trees are not unfamiliar to economists. In fact, almost all economists have used trees as they learned about game theory. Game theory courses use trees to illustrate sequential games, that is, games where one agent moves first. One such example is Stackelberg competition, in which firms sequentially compete in quantities with the first mover earning greater profits. We use trees to understand sequences of decisions. The use of trees in data analysis has many similarities but some important differences. First, what are decision trees?

Decision tree learning is a very simple data categorization tool that also happens to have great predictive power. How does it work? My colleague and SAS author Barry de Ville provides a nice introduction to the basics of decision tree algorithms. If you think about what these algorithms actually do, they are trying to separate data into homogeneous clusters, where each cluster has highly similar explanatory covariates and similar outcomes. You can find a great guide to many of those algorithms and more in my colleague Padraic Neville’s primer on decision trees.

So why do trees work for prediction? They subset data into observations that are highly similar on a number of dimensions. The algorithms choose to use certain explanatory factors (X’s, covariates, features), as well as interactions of those factors, to create homogeneous groups. At that point, the prediction equation is derived by some pretty complicated math... a simple average!

That’s right, once the data set is broken down into subsets (a process known as ‘splitting’ and ‘pruning’), the fancy prediction math is nothing but a simple average. And the prediction equation? Equally simple. It is a series of if-then statements following the path of the tree, eventually leading to that calculated sample average. So what does the decision tree help me to do as an economist? Here are my top 3 things to love about a decision tree:

  1. Like regression, decision tree output can be interpreted. There are no coefficients but the results read like if-then-else business rules.
  2. Decision trees inform about predictive power of variables WITH concerns for redundancy. Variables will be split on if they matter for creating homogeneous groups and discarded otherwise. One caveat, however, is that only one of two highly collinear variables might be chosen.
  3. They inform about interaction effects for later regression analysis. A split tells us that an interaction effect matters in prediction. This could be useful for controlling for various forms of unobserved heterogeneity or for turning continuous variables into categorical variables.
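To make the "simple average" and "if-then" points concrete, here is a minimal sketch of a regression tree in plain Python. It is not how PROC HPSPLIT works internally; the data, the variable names (educ, exper, y) and the squared-error splitting rule are all assumptions chosen for illustration, and there is no pruning step.

```python
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the group average."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, features):
    """Try every midpoint on every feature; return (error, feature, threshold) or None."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r["y"] for r in rows if r[f] <= t]
            right = [r["y"] for r in rows if r[f] > t]
            candidate = (sse(left) + sse(right), f, t)
            if best is None or candidate < best:
                best = candidate
    return best

def grow(rows, features, depth=0, max_depth=2):
    """Split recursively; a leaf stores nothing but the average outcome of its rows."""
    split = best_split(rows, features) if depth < max_depth else None
    if split is None:
        return {"predict": mean([r["y"] for r in rows])}
    _, f, t = split
    return {
        "feature": f,
        "threshold": t,
        "left": grow([r for r in rows if r[f] <= t], features, depth + 1, max_depth),
        "right": grow([r for r in rows if r[f] > t], features, depth + 1, max_depth),
    }

def predict(node, row):
    """Follow the if-then path down to a leaf and return its average."""
    while "predict" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["predict"]

def rules(node, conditions=()):
    """Write the tree out as readable if-then rules, one per leaf."""
    if "predict" in node:
        clause = " and ".join(conditions) or "always"
        return [f"if {clause} then predict {node['predict']:.1f}"]
    f, t = node["feature"], node["threshold"]
    return (rules(node["left"], conditions + (f"{f} <= {t}",))
            + rules(node["right"], conditions + (f"{f} > {t}",)))

# Hypothetical data: hourly wage (y) by years of education and experience.
rows = [
    {"educ": 10, "exper": 2, "y": 20.0}, {"educ": 10, "exper": 20, "y": 24.0},
    {"educ": 12, "exper": 3, "y": 22.0}, {"educ": 12, "exper": 25, "y": 26.0},
    {"educ": 16, "exper": 2, "y": 40.0}, {"educ": 16, "exper": 18, "y": 60.0},
    {"educ": 18, "exper": 4, "y": 44.0}, {"educ": 18, "exper": 22, "y": 64.0},
]
tree = grow(rows, ["educ", "exper"])
for rule in rules(tree):
    print(rule)
```

In this made-up data the experience premium is larger for the highly educated group, so the tree splits on education first and then on experience within each branch. That nested split is the kind of interaction hint mentioned in item 3.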

So that is my list. What else should economists know about decision trees? Do you feel strongly that trees are a better exploratory data analysis tool than predictive tool?

Next time we will augment this discussion of decision trees by talking about what economists should know about ... random forests.

The future of analytics – top 5 analytics predictions for 2016


My view of the world is shaped by where I stand, but from this spot the future of analytics for 2016 looks pretty exciting! Analytics has never been more needed or interesting.

  1. Machine learning established in the enterprise

Machine learning dates back to at least 1950 but until recently has been the domain of elites and subject to “winters” of inattention. I predict that it is here to stay, because large enterprises are embracing it. In addition to researchers and digital natives, these days established companies are asking how to move machine learning into production. Even in regulated industries, where low interpretability of models has historically choked their usage, practitioners are finding creative ways to use machine learning techniques to select variables for models, which can then be formulated using more commonly accepted techniques. Expect greater interest across academic disciplines, because machine learning benefits from many different approaches. Consider the popular keynote from the INFORMS Annual Meeting last year, where Dimitris Bertsimas talked about “Statistics and Machine Learning via a Modern Optimization Lens.” My colleague Patrick Hall offers his own perspective about "Why Machine Learning? Why Now?"

  2. Internet of Things hype hits reality

The Internet of Things (IoT) is at the peak of the Gartner Hype Cycle, but in 2016 I expect this hype to hit reality. One real barrier is plumbing – there’s a lot of it! One of my colleagues is analyzing the HVAC system on our newest building as an IoT test project. The building is replete with sensors, but getting to the data was not easy. Facilities told him data are the domain of IT, who then sent him to the manufacturer, because while the HVAC system collects the data, it is sent to the manufacturer. “Data ownership” is an emerging issue – you produce the data but may not have access to it. An even larger challenge for IoT will be to prove its value. There are limited implementations of IoT in full production at the enterprise level. The promise of IoT is fantastic, so in 2016 look to early adopters to work out the kinks and deliver results.

  3. Big data moves beyond hype to enrich modeling

Big data has moved beyond hype to provide real value. Modelers today can access a wider than ever range of data types (e.g., unstructured data, geospatial data, images, voice), which offer great opportunities to enrich models. Another new gain from big data is due to competitions, which have moved beyond gamification to provide real value via crowdsourcing and data sharing. Consider the Prostate Cancer DREAM Challenge, where teams were challenged to address open clinical research questions using anonymized data provided by four different clinical trials run by multiple providers, much of it publicly available for the first time. An unprecedented number of teams competed, and winners beat existing models developed by the top researchers in the field.

  4. Cybersecurity improved via analytics

And as IoT grows, the growing use of sensors must thrill cybercriminals, who use these devices to hack in using a slow but insidious Trojan Horse approach. Many traditional fraud detection techniques do not apply, because detection is no longer seeking one rare event but requires understanding an accumulation of events in context. Similar to IoT, one challenge of cybersecurity involves data, because streaming data is managed and analyzed differently. I expect advanced analytics to shed new light on detection and prevention as our methods catch up with the data. Unfortunately, growing methods for big data collaboration are off limits, because we don’t want the bad guys to know how we’ll find them, and much of the best work is done behind high security clearance. But that won't stop SAS and others from focusing heavily on cybersecurity in 2016.

  5. Analytics drives increased industry-academic interaction

The Institute for Advanced Analytics (IAA) at NC State University tracks the growth in analytics masters programs, and new programs seem to pop up daily. Industry demand for recruits fuels this growth, but I see increased interest in research. More companies are setting up academic outreach with an explicit interest in research collaborations. Sometimes this interest goes beyond partnership and into direct hiring of academic superstars, who either take sabbaticals, work on the side, or even go back and forth. For example, top machine learning researcher Yann LeCun worked at Bell Labs, became a professor at NYU, was the founding director of the NYU Center for Data Science, and now leads Artificial Intelligence Research at Facebook. INFORMS supports this academic-industry collaboration by providing academics a resource of teaching materials related to analytics. In 2016 INFORMS will offer industry a searchable database of analytics programs to facilitate connections and the new Associate Certified Analytics Professional credential to help vet recent graduates.

I wrote a shorter version of this piece for INFORMS, and it first appeared in Information Management earlier this week.

Image credit: photo by Chris Pelliccione // attribution by creative commons

Vector autoregressive models for generating economic scenarios


Diagnostic information from PROC VARMAX

Macroeconometrics is not dead (and I wish I had paid better attention in my time series course)

I wrote this on the way to see one of our manufacturing clients in Austin, Texas, anticipating a discussion of how to use vector autoregressive models in process control. It is a typical use case, especially in the age of the “Internet of Things,” to use multiple sensors on a device to detect mechanical issues early. Industrial engineers often use rolling window principal component analysis or a dynamic factor model to detect trends. These are fairly common applications. In my customer engagements, these approaches had been the primary applications of multivariate time series methods. As an applied microeconomist, I believed that my encounters confirmed my decision during graduate school to shirk time series econometric courses and instead focus on cross-sectional and panel data econometrics. Until I visited a banking customer in Dallas this fall...

See, for the past several years, the banking industry has been busy hiring quants for a regulatory overhaul called CCAR/DFAST. Quantitative analysts are being employed to estimate various forms of loss models to meet CCAR/DFAST regulations, which require various economic scenarios to be factors in these loss models. Some of these models resemble cross-sectional models and some resemble autoregressive time series models, but both kinds are univariate in nature. My colleague Christian Macaro and I wrote a paper, "Incorporating External Economic Scenarios into Your CCAR Stress Testing Routines," that provides an overview of both methods. Although some of these models “could” use multivariate vector autoregressive models, as my buddy Rajesh Selukar, developer of state space models at SAS (PROC SSM), says, “If you don’t have to use multivariate time series methods, don’t use them,” conveying the complexity of modeling these interactions. Up to this point, my shirking had paid off. No PROC VARMAX needed.

Then came my trip to Dallas. At this bank I met a new type of CCAR/DFAST team, whose mission is to create macroeconomic forecasts of the US and international economies to be consumed by CCAR/DFAST modeling teams. Until now, most CCAR/DFAST economic scenarios have been generated by either the Federal Reserve or by one of several small consulting firms specializing in this analysis. These groups have long used complicated multivariate techniques such as dynamic stochastic general equilibrium (DSGE) or vector autoregressive models (VAR), but they tend to use smaller niche software tools in their consulting work. This new internal economic scenario team was charged with bringing economic scenario generation in house. I had heard that the Fed has become increasingly critical of relying on purchased forecasts. Tier 1 banks are now being required to generate these forecasts on their own, and the customer I met with is one of these banks.
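For readers who, like me, shirked the time series course: the core of a first-order VAR is that each variable next period is a linear function of all variables this period. The sketch below is only a toy illustration of scenario generation under that assumption. The two variables, the coefficient values in `c` and `A`, and the shock size are all hypothetical. It is not a description of PROC VARMAX or of any bank's models.

```python
def var1_path(c, A, y0, steps, shocks=None):
    """Iterate y_t = c + A * y_(t-1) + shock_t forward for `steps` periods.

    `shocks` maps a period number to a vector added in that period only.
    """
    shocks = shocks or {}
    n = len(y0)
    path = [list(y0)]
    for t in range(1, steps + 1):
        prev = path[-1]
        shock = shocks.get(t, [0.0] * n)
        path.append([
            c[i] + sum(A[i][j] * prev[j] for j in range(n)) + shock[i]
            for i in range(n)
        ])
    return path

# Hypothetical system for [GDP growth %, unemployment rate %].
c = [1.0, 1.4]
A = [[0.5, -0.1],
     [-0.2, 0.8]]
start = [0.75, 6.25]  # long-run mean implied by c and A (solves y = c + A y)

# Baseline scenario: no shocks. Adverse scenario: growth drops 2 points in period 1.
baseline = var1_path(c, A, start, steps=8)
adverse = var1_path(c, A, start, steps=8, shocks={1: [-2.0, 0.0]})

for t, (b, a) in enumerate(zip(baseline, adverse)):
    print(t, [round(x, 2) for x in b], [round(x, 2) for x in a])
```

The point of the multivariate setup shows up in the adverse path: the shock hits growth only, but unemployment rises in the following period because of the off-diagonal coefficient. A univariate model of either series could not produce that spillover.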

Well darn. I guess I should have paid more attention to Dr. Ali’s ECO703 course! One of these days I will get back to the basics of multivariate time series analytics. After all, PROC SSM and VARMAX are two procedures in SAS/ETS® that work with multivariate time series models. In fact, when I get around to it, I will likely utilize a new book by Anders Milhøj, Multiple Time Series Modeling Using the SAS VARMAX Procedure, which provides both a theoretical and practical introduction to VAR models in SAS. If you already know a bit about multivariate time series and just want to get started with SAS, try the example programs with the VARMAX and SSM procedures and SAS Studio using free access to SAS OnDemand for Academics, whether you are a student, professor, or independent learner.

Pattern recognition software - is it for the birds?


Can pattern recognition software tell us if it is a Hermit Thrush or a Swainson's Thrush we've seen? A few of us have been debating an identification question at work, because we agreed to help Fulbright Scholar and Duke University PhD student Natalia Ocampo-Peñuela with research she is doing related to bird collisions with windows. A sad little band of us at SAS spent three weeks this fall doing daily perambulations of multiple buildings on the SAS campus to look around the perimeter for dead birds, casualties of run-ins with our shiny pretty glass buildings. We recorded the species if possible (sometimes predators left us scanty evidence), hence the need for identification. I can tell you that a Hermit Thrush and Swainson's Thrush look very similar. As an avid birdwatcher myself, I have several field guide apps on my iPhone, but it got me wondering what algorithmic magic was behind the search tools most of these apps now have. You input features like state, the month, size, color, etc. and the app returns a filtered list of possibilities likely to be seen. But a new app, Merlin Bird Photo ID, developed in collaboration with the Cornell Lab of Ornithology and others, takes this flow a step further using machine learning techniques from computer vision to help identify birds. You upload an image of the bird you've seen, and Merlin compares features in the photo to those expected to be seen on that day in your location, based on a data set supplied by birders who report their sightings to a site called eBird (it's decently large data - 9.5 million observations were reported in the month of May alone!).

A quick search on pattern recognition software identified many papers on machine learning for bird identification. Improved Automatic Bird Identification through Decision Tree based Feature Selection and Bagging uses audio recordings instead of images for identification. Two researchers at Queen Mary University London argue that Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning, and they are even raising money on Kickstarter to build Warblr, a birdsong recognition app. It will use machine learning to help you figure out which bird just serenaded you (or a prospective mate, really, but you can consider it a gift anyhow). In their larger study they trained and tested a random forest classifier, which is more than ironic given that certainly many of the birdsongs were recorded in forests! Of course, birdsong doesn't work in the case of our task identifying a limp little bird, but many birds are more commonly heard than seen, so this approach offers great advantages.

Technical challenges include noise (you can't exactly get birds into a sound studio) and scalability, given the computational intensity. But some kinds of pattern identification pose even greater challenges. What if you just had a footprint to use? The bird survey at SAS was initiated by a connection with the great folks at Wildtrack, who use JMP Software from SAS to analyze data for their Footprint Identification Technique, a non-invasive method used to track elusive endangered animals. Wildtrack's Zoe Jewell and Sky Allibhai have partnered with researchers from NC State University to improve upon footprint identification, and some of their work includes a Manifold learning approach to curve identification with applications to footprint segmentation. It's a tough nut but they keep working to crack it.

My own colleagues in SAS Advanced Analytics R&D are doing interesting work on pattern recognition. Patrick Hall, Ilknur Kaynar Kabul, and Jorge Silva used PROC NEURAL in SAS Enterprise Miner to extract representative features from a training set for digit recognition, a specific challenge for pattern recognition software to tackle. They built a stacked denoising autoencoder from the Modified National Institute of Standards and Technology (MNIST) digits data, which they describe in this paper on Machine Learning in SAS Enterprise Miner. The code is in Patrick's GitHub repo. Now if I can just get them interested in bird recognition maybe we'll be able to settle the debate about Hermit vs. Swainson's Thrush...


Image credit: photo by Kelly Colgan Azar // attribution by creative commons

6 machine learning resources for getting started

If you tuned in for my recent webinar, Machine Learning: Principles and Practice, you may have heard me talking about some of my favorite machine learning resources, including recent white papers and some classic studies.

As I mentioned in the webinar, machine learning is not new. SAS has been pursuing machine learning in practice since the early 1980s. Over the decades, professionals at SAS have created many machine learning technologies and have learned how to apply machine learning to create value for organizations. This webinar series is one of many resources that can help you understand what machine learning is and how to use it. I'll also be at the MLConf in San Francisco on November 13, so stop by our booth to say hello if you're there and I'd be glad to show you any of these resources in person.

SAS Resources:

Machine Learning with SAS Enterprise Miner

See how a team of SAS Enterprise Miner developers used machine learning techniques to predict customer churn in a famous telecom dataset.

An Overview of Machine Learning with SAS Enterprise Miner

This technical white paper includes SAS code examples for supervised learning from sparse data, determining the number of clusters in a dataset, and deep learning.

Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners

Written for corporate leaders, and technology and marketing executives, this book shows how organizations can harness the power of high performance computing architectures and data mining, text analytics, and machine learning algorithms.

External Resources:

Statistical Modeling: The Two Cultures

The grandfather of machine learning, Leo Breiman, outlines the fundamental ideas and philosophies of the discipline and discusses two different approaches to modeling and data analysis.

7 common mistakes of machine learning

Whether you’re a seasoned pro or a noob, machine learning is tricky. Save yourself some time by avoiding these common mistakes.

11 clever Methods of overfitting and how to avoid them

Probably the most common mistake in machine learning, and one of the hardest to avoid, is overfitting your training data. This article highlights some common (and not so common) practices that can lead to overfitting.


Principal Component Analysis for Dimensionality Reduction

When you work with big data, you often deal with both a large number of observations and a large number of features. When the number of features is large, they can be highly correlated, resulting in a significant amount of redundancy in the data. Principal component analysis can be a very effective method in your toolbox in a situation like this.

Consider a facial recognition example, in which you train algorithms on images of faces. If training is on 16x16 grayscale images, you will have 256 features, where each feature corresponds to the intensity of one pixel. Because the values of adjacent pixels in an image are highly correlated, most of the raw features are redundant. This redundancy is undesirable, because it can significantly reduce the efficiency of most machine learning algorithms. Feature extraction methods such as principal component analysis (PCA) and autoencoder networks enable you to approximate the raw image by using a much lower-dimensional space, often with very little error.

PCA is an algorithm that transforms a set of possibly-correlated variables into a set of uncorrelated linear combinations of those variables; these combinations are called principal components. PCA finds these new features in such a way that most of the variance of the data is retained in the generated low-dimensional representation. Even though PCA is one of the simplest feature extraction methods (compared to other methods such as kernel PCA, autoencoder networks, independent component analysis, and latent Dirichlet allocation), it can be very efficient in reducing dimensionality of correlated high-dimensional data.
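As a rough illustration of that definition, the sketch below finds the first principal component of a tiny two-feature data set in plain Python. It uses power iteration on the covariance matrix, which is just one simple way to get the leading component and is not necessarily what any SAS procedure does. The data points are invented, and the helper names are my own.

```python
import math

def center(data):
    """Subtract each column's mean so the data is centered at zero."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    Returns (direction, variance along that direction). Assumes cov is not all zeros.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

# Invented data: two highly correlated features (second is roughly twice the first).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

centered = center(data)
cov = covariance(centered)
direction, variance = leading_component(cov)

total_variance = sum(cov[i][i] for i in range(len(cov)))
share = variance / total_variance  # fraction of variance retained by one component

# Each observation is now described by a single score instead of two features.
scores = [sum(row[j] * direction[j] for j in range(len(direction))) for row in centered]

print("direction:", [round(x, 3) for x in direction])
print("share of variance retained:", round(share, 4))
print("scores:", [round(s, 2) for s in scores])
```

Because the two features move together, a single component keeps almost all of the variance. The same idea, with many more features and several components, is what lets a 256-pixel image be summarized by a short vector of weights.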

For the facial recognition problem described above, suppose you reduce the dimension to 18 principal components while retaining 99% of the variation in the data. Each principal component corresponds to an “eigenface,” as shown in Figure 1, which is a highly representative mixture of all the faces in the training data. Using the 18 representative faces generated by principal components, you can represent each image in your training set by an 18-dimensional vector of weights (\(w_{1}\),...,\(w_{18}\)) that tells you how to combine the 18 eigenfaces, instead of using the original 256-dimensional vector of raw pixel intensities.

Figure 1: Eigenfaces method for facial recognition


Now suppose you have a new image, and you wonder if this image belongs to a person in your training set. You simply need to calculate the Euclidean distance between this new image’s weight vector (\(w_{1}\),...,\(w_{18}\)) and the weight vectors of the images in your training set. If the smallest Euclidean distance is less than some predetermined threshold value, voilà – facial recognition! Tag this new image as the corresponding face in your training data; otherwise tag it as an unrecognized face. If you want to learn more about face recognition, see this famous paper, Face Recognition Using Eigenfaces, for your enjoyment.
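The matching step is short enough to sketch directly. The example below assumes the weight vectors have already been computed; it uses three weights per face instead of 18 to keep it readable, and the names, weights and threshold are all made up for illustration.

```python
import math

def recognize(new_weights, gallery, threshold):
    """Return the closest known face, or None if nothing is within the threshold.

    `gallery` maps a label to that training image's weight vector.
    """
    label, distance = min(
        ((name, math.dist(new_weights, weights)) for name, weights in gallery.items()),
        key=lambda pair: pair[1],
    )
    return label if distance < threshold else None

# Hypothetical weight vectors for two people in the training set.
gallery = {
    "alice": [0.9, -0.2, 0.4],
    "bob": [-0.5, 0.7, 0.1],
}

close_match = recognize([0.85, -0.25, 0.45], gallery, threshold=0.3)
no_match = recognize([3.0, 3.0, 3.0], gallery, threshold=0.3)

print(close_match)
print(no_match)
```

Choosing the threshold is the judgment call: set it too low and known faces go unrecognized, too high and strangers get tagged as someone in the training set. `math.dist` requires Python 3.8 or later.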

Labeling and suggesting tags in images are common uses of reduced dimensional data. Similarly this approach could be used for analyzing audio data for speech recognition or in text mining for web search or spam detection. Perhaps a more common application of dimension reduction is in predictive modeling. You can feed your reduced-dimensional data into a supervised learning algorithm, such as a regression, to generate predictions more efficiently (and sometimes even more accurately).

Another major use of dimension reduction is to visualize your high-dimensional data, which you might not be able to otherwise visualize. It’s easy to see one, two, or three dimensions. But how would you make a four-dimensional graph? What about a 1000-dimensional graph? Visualization is a great way to understand your data, and it can also help you check the results of your analysis. Consider the chart in Figure 2. A higher-dimensional data set, which describes hospitals in the United States, was clustered and projected onto two dimensions. You can see that the clusters are grouped nicely, for the most part. If you are familiar with the US health-care system, you can also see that the outliers in the data make sense, because they are some of the best-regarded hospitals in the US! (Of course, just because an analysis makes sense to you does not guarantee that it is mathematically correct. However, some agreement between human and machine is usually a good thing.)

Figure 2: Sixteen Clusters of Hospitals Projected onto Two Dimensions Using a Dimension Reduction Technique


If these examples have caught your interest and you now want more information about PCA, tune into my webcast, Principal Component Analysis for Machine Learning, where I discuss PCA in greater detail, including the math behind it, and how to implement it using SAS®. If you are a SAS® Enterprise Miner™ user, you can even try the hospital example for yourself with the code we’ve placed in one of our GitHub repos.

Data science training - is it possible?

Ok, so the title is a little provocative, but some people are dubious that data science training is even possible, because they believe data science entails skills one can learn only on the job and not in a classroom. I am not in that camp, although I do believe that data science is something you learn by doing, with an emphasis on both the learning and the doing. So how and where can you learn to do data science, if you want to become a data scientist?

There is no agreed-upon definition of data science, but I like to think of it as three legs of a stool - strong quantitative foundation, excellent programming skills, and keen understanding of business and communication. A quantitative foundation is most often learned in university, but as I've written previously the disciplines studied can range from statistics to archaeology, and it can pay to go recruiting outside the traditional academic disciplines. A solid academic background is an invaluable start, although I hear businesses complain that many graduates are not prepared for real-life problems, where data can be messy and/or sparse and you may not have the luxury of an elegant solution. More and more academic programs incorporate case studies, practicums, internships, etc. Earlier this year Tom Davenport hosted a Tweetchat on the top ten ways businesses can influence analytics education, which is a good read if you want to influence the graduate pipeline. There are even emerging PhD programs in Data Science.

Programming skills are a second leg of the stool. While there are many classes in universities that incorporate the use of software, more and more people seek to learn on their own, whether they be current students or working professionals. MOOCs have become increasingly popular, with many turning first to a source like Coursera to find the content they want. SAS jumped into this game with the launch last year of SAS® University Edition. This new offering was designed to address the demand we hear from those who want to learn SAS, as well as those who want to hire students with SAS skills. This offering has proven very popular: as of today it has been downloaded over 407,000 times, from Afghanistan to Zimbabwe. While it is called University Edition, it is available for anyone seeking to learn for non-commercial purposes. The SAS Analytics U Community offers a ton of free resources to help your learning, including tutorials, e-learning, a discussion board, etc. It's a powerful offering, with no limitations on the data you use, and it comes as a downloadable package of selected SAS products that runs on Windows, Linux and Mac.

The third leg of the stool is business acumen and the ability to communicate well. These are the skills that are hardest to pick up in a university program and may be best learned on the job. One shortcut could be the SAS Academy for Data Science, which is an intensive data science certification program that combines hands-on learning and case studies in a collaborative environment. In addition to covering key topics like machine learning, time series forecasting, and optimization, students will learn important programming skills in a blended approach with SAS, Hadoop, and open source technologies. There's even a module on Communicating Technical Findings with a Non-Technical Audience. The Academy covers all the content necessary to sit for the new Big Data Certification and Data Science Certification that SAS is offering.

If these topics are of interest to you and you'll be attending the SAS Analytics 2015 Conference in Las Vegas, October 26-27, you're in luck! On Monday Dr. Jennifer Priestley of Kennesaw State University is giving a talk on Is It Time for a PhD in Data Science? On Tuesday afternoon Cat Truxillo will be talking about our new certifications for data science in a table talk called World-Class Data Science Certification From the Experts. And my colleague Sharad Prabhu and I will be leading a table talk on Tuesday afternoon on SAS® University Edition – Connecting SAS® Software in New Ways to Develop More SAS® Users. If you're there, come join us!

 image credit: photo by nraden // attribution by creative commons

Pitching analytics: recommendations on how to sell your story (part 2)

My last post, Pitching analytics: recommendations on how to sell your story, discussed the steps I consider when winding up for an analytics pitch. In part 2 of this series I share the tips and tricks I have acquired for throwing strikes during your analytics pitch. Like everyone, sometimes I throw more balls than strikes, so the post concludes with some of the potential pitfalls.

How to Ace the Delivery

1. Craft a story about why the problem is important: I tell Ph.D. and master's students to create two pitches for the same research, one to industry and one to academics. My academic research used secondary market college football tickets to answer some pricing questions. The two pitches were:

To Academics: A monopolist pricing a differentiated good to a heterogeneous population should be able to charge marginal valuation on quality sufficiently high to induce optimal sorting. I use secondary ticket market data to test this hypothesis.

To Industry: If the Mets need to figure out how to price their lower level and upper level seats for their playoff games (Go Mets!), let’s go scrape some data from Stubhub.com to see how the market pays for better seats.

2. Interpret for the purpose: You ran your model for a reason. Make sure you explain it with that in mind. If you run a regression to estimate a marginal effect, be sure to print your results in that form and talk about that part of your model.

3. Conclusions and Counterfactuals: A strong conclusion will set the tone for implementation and next steps. The most important component of the conclusion and the sale will be the counterfactual. That is, in the absence of intervention, what would have happened? E.g. The statistical forecast had X% less excess inventory than the judgmental forecast.
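
As a small illustration of tip 2, here is a minimal sketch of reporting a regression as a marginal effect in plain language. The numbers are invented toy values, not real ticket data, and the variable names (`rows_closer`, `price`) and the helper `ols_slope_intercept` are hypothetical; the point is only that the headline output is the one number the audience asked about.

```python
# Toy illustration only: the numbers below are invented, not real ticket data.

def ols_slope_intercept(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical predictor: how many rows closer to the field a seat is.
rows_closer = [0, 10, 20, 30, 40]
# Hypothetical outcome: resale price in dollars.
price = [50.0, 70.0, 90.0, 110.0, 130.0]

slope, intercept = ols_slope_intercept(rows_closer, price)

# Lead with the marginal effect in the audience's terms, not a coefficient table.
summary = f"Each row closer to the field adds about ${slope:.2f} to the resale price."
print(summary)
```

The full coefficient table can still go in an appendix; the sentence printed at the end is what belongs on the slide.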


How to walk the batter: Surefire ways to lose your audience

Here are some potential pitfalls, from my own experience.

1. Talk about the brand of software you used: SAS, Stata, R, MATLAB, Octave, ILOG CPLEX, SPSS, ArcView, eViews, etc. can all do most statistical or optimization routines. Some might be faster or simpler to use for a particular purpose, but this won’t sway your audience. Instead, talk about the methods and the solutions.

2. Talk about the complexity or elegance of the solution: Only academics care about the difficulty of the solution or how “elegant” it is. In fact, as long as the objective is met, the simpler the solution the better.

3. Only talk about work you, and you alone, did in entirety: Instead, present the work with a sense of ownership and from a position of authority. Be sure to highlight that multiple people worked on the project, but present your findings on behalf of your team. Your audience needs to know that YOU own the results. This skill may take time to develop. It is NOT dishonest to take ownership of a project that was a team effort. Most statistical results I present were created by someone on my team. My contribution tends to be in problem definition, data acquisition or interpretation. This is a skill that the best analytics leaders have mastered. Excel at pitching others' work. (Graduate school training tends to discourage the development of this skill.)

4. Talk about how smart you are: This arrogance is something that many people lose in graduate school. For others, this mental invincibility persists. Instead, when presenting work, be confident but humble. Those letters behind your name mean you know what you are talking about. Your audience doesn’t need to be reminded.

5. Use “jargon” specific to only your discipline: This is the most important warning I offer. If you are talking about an optimization solution, do not say minimax but rather, “here we attempt to minimize the maximum loss” or similar. A smart but lay audience understands the latter.

Well, that was my list of ways to throw more strikes than balls. What did you like? What tips did I miss? Please feel free to include your favorite tips in the comments section below. I will be in Las Vegas next week for the Analytics 2015 conference, where I'll be leading a table talk on selling analytics to management. If you are attending the conference, please stop by and let me know if this helped and what I might want to add to my list.

Image credit: photo by Tom Thai // attribution by creative commons

How analytical web services can help scale your machine learning

Have you been in your attic lately? Or maybe cleaned out that closet that all of your “stuff” seems to gravitate to? Sure, mostly you’ll just find old junk that is no longer useful or purely nostalgic, but every once in a while you come across those long lost treasures and think “why haven’t I been using this?” No – I’m not talking about that Def Leppard Union Jack sleeveless shirt or the strobe light (though I’ll admit, the latter has some serious entertainment value). Think baseball glove, Trivial Pursuit game, electronic keyboard, desk lamp…value is subjective, but whatever it is for you, you have to admit it does happen once in a while. Which is what happened to me recently when I learned about analytical web services in the form of SAS BI Web Services.

The ability to deploy advanced analytics as analytical web services that can be accessed from a wide variety of clients or entry points is a treasure in the attic worth bringing downstairs. This capability has been offered by SAS for over a decade, and while some people have certainly taken advantage of it over the years, it doesn’t seem to be as widely used as I would expect in today’s analytics environments, where flexibility, customization, and remote access are all desired, if not expected. Deploying analytics, and in particular predictive models, as analytical web services is all the rage these days, and for good reason. There’s no need to forsake scalable analytic solutions that employ advanced machine learning capabilities merely to remain working in your programming comfort zone. SAS’ web services framework is a bridge from common programming languages to SAS® High Performance Analytics running in a distributed computing environment such as Hadoop.

I recently worked with a small team of folks here at SAS to re-assess our analytical web services in light of these greater expectations (not to mention competitor pitches claiming it to be a differentiator). We discovered first-hand how straightforward it is to create custom applications that invoke SAS’ advanced analytics, running in a high performance, distributed computing environment (backed by Hadoop). In some sense, it’s the ideal API, exposing any potential input variables and data streams and providing any potential desired output parameters and data streams, while leaving all of the plumbing to the SAS BI Web Services infrastructure. Whether you use Java, HTML5, Python, or something else, the client interface you choose for your application is immaterial as long as it can compose and send an HTTP request to the REST endpoint URI for your web service. Developing custom applications tailored to your specific business problem and backed by a distributed analytics infrastructure is fairly straightforward.
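
To make "compose and send an HTTP request" concrete, here is a minimal Python sketch using only the standard library. The endpoint URI, the input field names, and the helper `build_scoring_request` are all hypothetical placeholders, and the sketch assumes the service accepts a JSON body over POST; the real URI, parameters, and payload format depend on how your web service was deployed.

```python
import json
from urllib import request

# Hypothetical endpoint, for illustration only; substitute the REST endpoint
# URI of your own deployed web service.
ENDPOINT = "https://sas.example.com/rest/score_model"


def build_scoring_request(inputs):
    """Compose (but do not send) a POST request carrying the model inputs as JSON."""
    body = json.dumps(inputs).encode("utf-8")
    return request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# Hypothetical input variables for a predictive model.
req = build_scoring_request({"age": 42, "income": 55000})
print(req.get_method(), req.full_url)

# Sending it would look like this (left commented out because the endpoint
# above is a placeholder):
# with request.urlopen(req) as resp:
#     result = json.load(resp)
```

Any client that can build an equivalent request, whether a Java application, a browser page, or a notebook, could call the same service in the same way.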

SAS Advanced Analytics running on a Hadoop cluster, invoked from an iPython notebook

I’ll be talking about these analytical web services and our work to scale them for machine learning at the SAS Analytics 2015 conference in Las Vegas. If you happen to be attending, stop in to my session on “Machine Learning at Scale Using SAS Tools and Web Services” to hear more about it – you might just learn something cool and useful that you never even knew you had. You might find a treasure that can revive or inspire your creativity in your approach to analytics. And I might (or might not) be wearing the Def Leppard shirt.....