6 machine learning resources for getting started

If you tuned in to my recent webinar, Machine Learning: Principles and Practice, you may have heard me talking about some of my favorite machine learning resources, including recent white papers and some classic studies.

As I mentioned in the webinar, machine learning is not new. SAS has been pursuing machine learning in practice since the early 1980s. Over the decades, professionals at SAS have created many machine learning technologies and have learned how to apply machine learning to create value for organizations. This webinar series is one of many resources that can help you understand what machine learning is and how to use it. I'll also be at MLconf in San Francisco on November 13, so stop by our booth to say hi if you're there, and I'd be glad to show you any of these resources in person.

SAS Resources:

Machine Learning with SAS Enterprise Miner

See how a team of SAS Enterprise Miner developers used machine learning techniques to predict customer churn in a famous telecom dataset.

An Overview of Machine Learning with SAS Enterprise Miner

This technical white paper includes SAS code examples for supervised learning from sparse data, determining the number of clusters in a dataset, and deep learning.

Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners

Written for corporate leaders, and technology and marketing executives, this book shows how organizations can harness the power of high performance computing architectures and data mining, text analytics, and machine learning algorithms.

External Resources:

Statistical Modeling: The Two Cultures

The grandfather of machine learning, Leo Breiman, outlines the fundamental ideas and philosophies of the discipline and discusses two different approaches to modeling and data analysis.

7 common mistakes of machine learning

Whether you’re a seasoned pro or a noob, machine learning is tricky. Save yourself some time by avoiding these common mistakes.

11 clever methods of overfitting and how to avoid them

Probably the most common mistake in machine learning, and one of the hardest to avoid, is overfitting your training data. This article highlights some common (and not so common) practices that can lead to overfitting.


Principal Component Analysis for Dimensionality Reduction

When you work with big data, you often deal with both a large number of observations and a large number of features. When the number of features is large, they can be highly correlated, resulting in a significant amount of redundancy in the data. Principal component analysis can be a very effective method in your toolbox in a situation like this.

Consider a facial recognition example, in which you train algorithms on images of faces. If training is on 16x16 grayscale images, you will have 256 features, where each feature corresponds to the intensity of one pixel. Because the values of adjacent pixels in an image are highly correlated, most of the raw features are redundant. This redundancy is undesirable, because it can significantly reduce the efficiency of most machine learning algorithms. Feature extraction methods such as principal component analysis (PCA) and autoencoder networks enable you to approximate the raw image by using a much lower-dimensional space, often with very little error.

PCA is an algorithm that transforms a set of possibly-correlated variables into a set of uncorrelated linear combinations of those variables; these combinations are called principal components. PCA finds these new features in such a way that most of the variance of the data is retained in the generated low-dimensional representation. Even though PCA is one of the simplest feature extraction methods (compared to other methods such as kernel PCA, autoencoder networks, independent component analysis, and latent Dirichlet allocation), it can be very efficient in reducing dimensionality of correlated high-dimensional data.

For the facial recognition problem described above, suppose you reduce the dimension to 18 principal components while retaining 99% of the variation in the data. Each principal component corresponds to an “eigenface,” as shown in Figure 1, which is a highly representative mixture of all the faces in the training data. Using the 18 representative faces generated by principal components, you can represent each image in your training set by an 18-dimensional vector of weights (\(w_{1}\),...,\(w_{18}\)) that tells you how to combine the 18 eigenfaces, instead of using the original 256-dimensional vector of raw pixel intensities.
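As a rough sketch of this idea (using scikit-learn in Python rather than the SAS tools discussed here, and random numbers standing in for real face images), the projection from pixels to eigenface weights looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training set: 500 flattened 16x16 grayscale faces (256 pixel-intensity features each)
faces = np.random.rand(500, 256)  # stand-in for real face images

# Keep enough principal components to retain 99% of the variance
pca = PCA(n_components=0.99)
weights = pca.fit_transform(faces)  # each row is a low-dimensional weight vector for one face

print(pca.n_components_)            # components kept (about 18 for real faces as above; far more for pure noise)
eigenfaces = pca.components_.reshape(-1, 16, 16)  # each component can be viewed as a 16x16 "eigenface"
```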

Figure 1: Eigenfaces method for facial recognition

Now suppose you have a new image, and you wonder if this image belongs to a person in your training set. You simply need to calculate the Euclidean distance between this new image’s weight vector (\(w_{1}\),...,\(w_{18}\)) and the weight vectors of the images in your training set. If the smallest Euclidean distance is less than some predetermined threshold value, voilà – facial recognition! Tag this new image as the corresponding face in your training data; otherwise, tag it as an unrecognized face. If you want to learn more about face recognition, see the famous paper Face Recognition Using Eigenfaces.
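Continuing the sketch above (again with placeholder data and an arbitrary threshold), the recognition step is just a nearest-neighbor search in the eigenface weight space:

```python
# Project a new image into the eigenface space and compare it to the training weights
new_image = np.random.rand(1, 256)                         # stand-in for a new 16x16 face
new_weights = pca.transform(new_image)

distances = np.linalg.norm(weights - new_weights, axis=1)  # Euclidean distance to each training face
best_match = distances.argmin()

THRESHOLD = 5.0  # arbitrary cutoff; in practice you would tune this on held-out data
if distances[best_match] < THRESHOLD:
    print(f"Recognized as training face #{best_match}")
else:
    print("Unrecognized face")
```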

Labeling and suggesting tags in images are common uses of reduced-dimensional data. Similarly, this approach could be used for analyzing audio data for speech recognition, or in text mining for web search or spam detection. Perhaps a more common application of dimension reduction is in predictive modeling. You can feed your reduced-dimensional data into a supervised learning algorithm, such as a regression, to generate predictions more efficiently (and sometimes even more accurately).

Another major use of dimension reduction is to visualize your high-dimensional data, which you might not be able to otherwise visualize. It’s easy to see one, two, or three dimensions. But how would you make a four-dimensional graph? What about a 1000-dimensional graph? Visualization is a great way to understand your data, and it can also help you check the results of your analysis. Consider the chart in Figure 2. A higher-dimensional data set, which describes hospitals in the United States, was clustered and projected onto two dimensions. You can see that the clusters are grouped nicely, for the most part. If you are familiar with the US health-care system, you can also see that the outliers in the data make sense, because they are some of the best-regarded hospitals in the US! (Of course, just because an analysis makes sense to you does not guarantee that it is mathematically correct. However, some agreement between human and machine is usually a good thing.)

Figure 2: Sixteen Clusters of Hospitals Projected onto Two Dimensions Using a Dimension Reduction Technique
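A minimal sketch of that cluster-then-project workflow (with synthetic data standing in for the hospital data set and arbitrary parameter choices) might look like this: cluster in the original high-dimensional space, then project to two dimensions only for plotting.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for a higher-dimensional data set (e.g., hospital attributes)
X = np.random.rand(1000, 50)

# Cluster in the original 50-dimensional space
labels = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(X)

# Project to two dimensions purely for visualization
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap="tab20")
plt.title("Clusters projected onto two dimensions")
plt.show()
```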

If these examples have caught your interest and you now want more information about PCA, tune in to my webcast, Principal Component Analysis for Machine Learning, where I discuss PCA in greater detail, including the math behind it and how to implement it using SAS®. If you are a SAS® Enterprise Miner™ user, you can even try the hospital example for yourself with the code we’ve placed in one of our GitHub repos.

Data science training - is it possible?

OK, so the title is a little provocative, but some people are dubious that data science training is even possible, because they believe data science entails skills one can learn only on the job and not in a classroom. I am not in that camp, although I do believe that data science is something you learn by doing, with an emphasis on both the learning and the doing. So how and where can you learn to do data science, if you want to become a data scientist?

There is no agreed-upon definition of data science, but I like to think of it as three legs of a stool - a strong quantitative foundation, excellent programming skills, and a keen understanding of business and communication. A quantitative foundation is most often learned in university, but as I've written previously, the disciplines studied can range from statistics to archaeology, and it can pay to go recruiting outside the traditional academic disciplines. A solid academic background is an invaluable start, although I hear businesses complain that many graduates are not prepared for real-life problems, where data can be messy and/or sparse and you may not have the luxury of an elegant solution. More and more academic programs incorporate case studies, practicums, internships, etc. Earlier this year Tom Davenport hosted a Tweetchat on the top ten ways businesses can influence analytics education, which is a good read if you want to influence the graduate pipeline. There are even emerging PhD programs in Data Science.

Programming skills are the second leg of the stool. While there are many classes in universities that incorporate the use of software, more and more people seek to learn on their own, whether they are current students or working professionals. MOOCs have become increasingly popular, with many turning first to a source like Coursera to find the content they want. SAS jumped into this game with the launch last year of SAS® University Edition. This new offering was designed to address the demand we hear from those who want to learn SAS, as well as those who want to hire students with SAS skills. It has proven very popular — as of today it has been downloaded over 407,000 times, from Afghanistan to Zimbabwe. While it is called University Edition, it is available for anyone seeking to learn for non-commercial purposes. The SAS Analytics U Community offers a ton of free resources to help your learning, including tutorials, e-learning, a discussion board, etc. It's a powerful offering, with no limitations on the data you use, and it is accessible as a downloadable package of selected SAS products that runs on Windows, Linux, and Mac.

The third leg of the stool is business acumen and the ability to communicate well. These are the skills that are hardest to pick up in a university program and may be best learned on the job. One shortcut could be the SAS Academy for Data Science, which is an intensive data science certification program that combines hands-on learning and case studies in a collaborative environment. In addition to covering key topics like machine learning, time series forecasting, and optimization, students will learn important programming skills in a blended approach with SAS, Hadoop, and open source technologies. There's even a module on Communicating Technical Findings with a Non-Technical Audience. The Academy covers all the content necessary to sit for the new Big Data Certification and Data Science Certification that SAS is offering.

If these topics are of interest to you and you'll be attending the SAS Analytics 2015 Conference in Las Vegas October 26-27, you're in luck! On Monday Dr. Jennifer Priestley of Kennesaw State University is giving a talk on Is It Time for a PhD in Data Science? On Tuesday afternoon Cat Truxillo will be talking about our new certifications for data science in a table talk called World-Class Data Science Certification From the Experts. And my colleague Sharad Prabhu and I will be leading a table talk on Tuesday afternoon on SAS® University Edition – Connecting SAS® Software in New Ways to Develop More SAS® Users. If you're there, come join us!

Image credit: photo by nraden // attribution by creative commons

Pitching analytics: recommendations on how to sell your story (part 2)

throwing strikes

My last post, Pitching analytics: recommendations on how to sell your story, discussed the steps I consider when winding up for an analytics pitch. In part 2 of this series I share the tips and tricks I have acquired for throwing strikes during your analytics pitch. Like everyone, sometimes I throw more balls than strikes, so the post concludes with some of the potential pitfalls.

How to Ace the Delivery

1. Craft a story about why the problem is important: I tell Ph.D. and master's students to create two pitches for the same research, one to industry and one to academics. My academic research used secondary-market college football tickets to answer some pricing questions. The two pitches were:

To Academics: A monopolist pricing a differentiated good to a heterogeneous population should be able to charge marginal valuation on quality sufficiently high to induce optimal sorting. I use secondary ticket market data to test this hypothesis.

To Industry: If the Mets need to figure out how to price their lower level and upper level seats to their playoff games (Go Mets!), let’s go scrape some data from Stubhub.com to see how the market pays for better seats.

2. Interpret for the purpose: You ran your model for a reason. Make sure you explain it with that in mind. If you run a regression to estimate a marginal effect, be sure to print your results in that form and talk about that part of your model.

3. Conclusions and Counterfactuals: A strong conclusion will set the tone for implementation and next steps. The most important component of the conclusion and the sale will be the counterfactual. That is, in the absence of intervention, what would have happened? E.g. The statistical forecast had X% less excess inventory than the judgmental forecast.


How to walk the batter: Surefire ways to lose your audience

Here are some potential pitfalls, from my own experience.

1. Talk about the brand of software you used: SAS, Stata, R, MATLAB, Octave, ILOG CPLEX, SPSS, ArcView, EViews, etc. can all do most statistical or optimization routines. Some might be faster or simpler to use for a particular purpose, but this won’t sway your audience. Instead, talk about the methods and the solutions.

2. Talk about the complexity or elegance of the solution: Only academics care about the difficulty of the solution or how “elegant” it is. In fact, as long as the objective is met, the simpler the solution the better.

3. Only talk about work you, and you alone, did in its entirety: Present the work with a sense of ownership. You must present the work from a position of authority. Be sure to highlight that multiple people worked on the project, but present your findings on behalf of your team. Your audience needs to know that YOU own the results. This skill may take time to develop. It is NOT dishonest to take ownership of a project that was a team effort. Most statistical results I present were created by someone on my team. My contribution tends to be in problem definition, data acquisition, or interpretation. This is a skill that the best analytics leaders have mastered. Excel at pitching others' work. (Graduate school training tends to discourage the development of this skill.)

4. Talk about how smart you are: This arrogance is something that many people lose in graduate school. For others, this mental invincibility persists. Instead, when presenting work, be confident but humble. Those letters behind your name mean you know what you are talking about. Your audience doesn’t need to be reminded.

5. Use “jargon” specific to only your discipline: This is the most important warning I offer. Instead, if you are talking about an optimization solution, do not say minimax but rather, “here we attempt to minimize the maximum loss” or similar. A smart but lay audience understands the latter.

Well, that was my list of ways to throw more strikes than balls. What did you like? What tips did I miss? Please feel free to include your favorite tips in the comments section below. I will be in Las Vegas next week for the Analytics 2015 conference, where I'll be leading a table talk on selling analytics to management. If you are attending the conference, please stop by and let me know if this helped and what I might want to add to my list.

Image credit: photo by Tom Thai // attribution by creative commons

How analytical web services can help scale your machine learning

Have you been in your attic lately? Or maybe cleaned out that closet that all of your “stuff” seems to gravitate to? Sure, mostly you’ll just find old junk that is no longer useful or purely nostalgic, but every once in a while you come across those long lost treasures and think “why haven’t I been using this?” No – I’m not talking about that Def Leppard Union Jack sleeveless shirt or the strobe light (though I’ll admit, the latter has some serious entertainment value). Think baseball glove, Trivial Pursuit game, electronic keyboard, desk lamp…value is subjective, but whatever it is for you, you have to admit it does happen once in a while. Which is what happened to me recently when I learned about analytical web services in the form of SAS BI Web Services.

The ability to deploy advanced analytics as analytical web services that can be accessed from a wide variety of clients or entry points is a treasure in the attic worth bringing downstairs. This capability has been offered by SAS for over a decade, and while some people have certainly taken advantage of it over the years, it doesn’t seem to be as widely used as I would expect in today’s analytics environments, where flexibility, customization, and remote access are all desired, if not expected. Deploying analytics, and in particular predictive models, as analytical web services is all the rage these days, and for good reason. There’s no need to forsake scalable analytic solutions that employ advanced machine learning capabilities merely to remain working in your programming comfort zone. SAS’ web services framework is a bridge from common programming languages to SAS® High Performance Analytics running in a distributed computing environment such as Hadoop.

I recently worked with a small team of folks here at SAS to re-assess our analytical web services in light of these greater expectations (not to mention competitor pitches claiming it to be a differentiator). We discovered first-hand how straightforward it is to create custom applications that invoke SAS’ advanced analytics, running in a high performance, distributed computing environment (backed by Hadoop). In some sense, it’s the ideal API, exposing any potential input variables and data streams and providing any potential desired output parameters and data streams – leaving all of the plumbing to the SAS BI Web Services infrastructure. Java, HTML5, Python, etc. - the client interface you choose for your application is immaterial as long as it can compose and send an http request to the REST endpoint URI for your web service. Developing custom applications tailored to your specific business problem and backed by a distributed analytics infrastructure is fairly straightforward.
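For example, a minimal Python sketch of such a call might look like the following. The endpoint URI, input parameters, and response shape here are hypothetical placeholders, not an actual SAS BI Web Services contract:

```python
import requests

# Hypothetical REST endpoint for a deployed analytical web service (placeholder URI)
ENDPOINT = "https://your-sas-server/SASBIWS/rest/yourAnalyticalService"

# Hypothetical input parameters the service might expose
payload = {"customer_id": 12345, "model": "churn_score"}

response = requests.post(ENDPOINT, json=payload, timeout=30)
response.raise_for_status()

# The service returns whatever output parameters or data streams it defines
print(response.json())
```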

SAS Advanced Analytics running on a Hadoop cluster, invoked from an iPython notebook

I’ll be talking about these analytical web services and our work to scale them for machine learning at the SAS Analytics 2015 conference in Las Vegas. If you happen to be attending, stop in to my session on “Machine Learning at Scale Using SAS Tools and Web Services” to hear more about it – you might just learn something cool and useful that you never even knew you had. You might find a treasure that can revive or inspire your creativity in your approach to analytics. And I might (or might not) be wearing the Def Leppard shirt.....

Multi-stage modeling delivers the ROI for internet of things

Gartner has stated that there are nearly five billion connected devices throughout the world today and predicts that there will be more than 25 billion by 2020, making the potential of this technology unlimited. The connected devices in industrial settings, in personal devices, and in our homes are creating a great potential for improving business outcomes and enhancing personal lives. Many companies have defined strategic plans for collecting and exploiting the data coming from these connected devices. We are at the onset of a revolution due to this new internet of things (IoT) reality. Exploiting the value in all these connected devices requires new tricks of the trade to deal with unique issues such as capturing the right amount of data in the right aggregations, building statistical models to capture stable system operations, and being able to score these models on streaming data to create alerts for degradations or upcoming failures. Multi-stage modeling is one way to ensure you get the return on investment out of the IoT. Building multi-stage models means your predictive analytical models run in the cloud and update with the data streaming in, impacting a wider set of decisions more rapidly, more frequently, and more automatically.

IoT is the new frontier in scale, speed, and predictive modeling for analytics. From the collection of the data, to streaming data analytics, to big data analytics in the cloud, IoT is destined to change the way we manage our information: from health and fitness to the maintenance of large capital industrial equipment in oil and gas, energy, transportation, and manufacturing. Many industrial use cases involve early detection of system degradation and then optimizing predictive maintenance to avoid costly down times.

To make the scale of the data off the stream concrete, consider the commercial airplanes of today. A typical airplane currently has roughly 6,000 sensors and creates 2.5 terabytes of data per day. By 2020, this number may triple or quadruple to over 7.5 terabytes. Such planes generate an enormous amount of data, which is a good thing, since analytics based on these data will likely allow us to detect aircraft issues before a life-threatening problem manifests and/or before major damage occurs to expensive parts, causing costly downtime. These are very sophisticated machines with high availability expectations that are costly to build, fly, repair, etc. Any intelligence that can be gained from multi-stage modeling to avoid problems proactively can generate huge ROI.

Event stream processing (ESP) enables real-time analytics from the data streaming in on the edge, meaning right at the point of capture, as opposed to data that has been accumulated and sent to the cloud. ESP therefore allows for analysis, filtering, and aggregation of sensor or machine data much closer to the occurrence of events and offers low latency. The ability to filter and aggregate event data on the edge is critical, given that the incoming volume will likely be too large to feasibly process in the cloud. Also, analyzing event data on the edge in sub-second timeframes enables opportunities otherwise lost due to latencies associated with analyzing it in the compute cloud. Multi-stage modeling is modeling on the edge, in the cloud, and/or wherever makes the most sense for the situation.
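As a conceptual sketch (plain Python, not the SAS Event Stream Processing API, with an arbitrary window length and threshold), edge-side aggregation might reduce a raw sensor stream to one summary record per window before anything is sent upstream:

```python
from statistics import mean

def aggregate_on_edge(sensor_stream, window_size=60, alert_threshold=120.0):
    """Reduce a raw stream of (timestamp, value) readings to per-window summaries."""
    window = []
    for timestamp, value in sensor_stream:
        window.append(value)
        if len(window) == window_size:
            yield {                                      # only this summary leaves the edge device
                "end_time": timestamp,
                "mean": mean(window),
                "max": max(window),
                "alert": max(window) > alert_threshold,  # simple degradation flag
            }
            window = []
```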

New analytical models for high-frequency machine (or sensor) data analysis will enhance the value created by ESP and enable that all-important ROI.  Analytics answers questions about which sensor(s) give us predictive power, which aggregations need to be kept for in-depth modeling, whether we should keep the original sensor data or a transformation of it (change, lag, etc.), and whether we have an early indicator of degradation to come. Given the huge volumes of data, intelligent dimension reduction is one of the preliminary steps adding analytical value.

Short-time Fourier transform plot

Working with customer data from industries like oil and gas, manufacturing, automotive, energy, and health wearables has allowed my team to develop new methods to exploit the value coming from streaming data. For example, we have developed unique extensions to principal components, singular value decomposition, stability monitoring, optimal lag detection, and time-frequency analysis for data that are timestamped every (sub)second to every minute, with hundreds of variables (sensors) and unique events and modes of failure. As an illustration, the plot above shows the short-time Fourier transform of voltage on the day of the failure compared to the day before and a week after (click the image to enlarge). We focus our research on multi-stage models we can score with the streaming data. Although we can capture the models of a ‘stable’ system in the history, we need to be able to identify alerts or degradations on the data that is streaming in.
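To illustrate the time-frequency piece (a generic sketch with synthetic data, not our production code), SciPy's short-time Fourier transform can be used to compare a signal's frequency content before and after a suspected degradation:

```python
import numpy as np
from scipy.signal import stft

fs = 1000                      # samples per second for a hypothetical voltage sensor
t = np.arange(0, 60, 1 / fs)   # one minute of data

# Synthetic voltage: a 50 Hz component plus noise, with a 120 Hz component emerging halfway through
voltage = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(t.size)
voltage[t > 30] += 0.5 * np.sin(2 * np.pi * 120 * t[t > 30])

# Short-time Fourier transform: frequencies, segment times, and complex spectrum
f, seg_times, Zxx = stft(voltage, fs=fs, nperseg=1024)
power = np.abs(Zxx) ** 2       # comparing this power map across days can reveal degradation
```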

I’ll be talking more about multi-stage modeling at the SAS Analytics 2015 Conference in Rome in November. If you’ll be there, I welcome you to my talk on The Power of Advanced Analytics Applied to Streaming Data.

Ensemble modeling for machine learning – it just makes sense

I’ve often heard people say about weather forecasters “they have the best job…they just report what the models are telling them, and when they’re wrong they can always blame it on Mother Nature throwing us a curve.” While that’s true, this glass-half-empty crowd is failing to appreciate how amazing the science and technology are to accurately predict the temperature, amount of precipitation, and path of a storm over an extended period of time, given all the atmospheric variation, complexity, and potential instability that exists. When I can know on Tuesday that there’s a good chance my trip to the beach will be a rain out on Saturday, I can start to make alternate plans. Thank you Mr. Weatherman and your super smart modeling algorithm. And thank you ensemble modeling.

Because it turns out we actually have the concept of ensemble modeling to thank for those generally accurate forecasts (as my local Raleigh weatherman Greg Fishel will profess as often as you will let him). It’s not a single model providing those predictions but a combination of the results of many models that serve to essentially average out the variance. The National Hurricane Center Storm Track Forecast Cone shown below is a perfect example; this uncertainty cone is built from an ensemble of many different models forecasting the storm’s path, each one considering different atmospheric conditions in that time period.

ensemble weather charts

While weather forecasting is one of the more prominent applications of ensemble modeling, this technique is used extensively in many domains. In general, ensemble modeling is all about using many models in collaboration to combine their strengths and compensate for their weaknesses, as well as to make the resulting model generalize better for future data. And that last point is key. The goal of building these models is not to fit the existing data perfectly (running the risk of overfitting) but to identify and represent the overall trend of the data such that future events can be predicted accurately.

To that end, combining predictions from diverse models just makes sense - just like diversifying your investment portfolio and investing in mutual funds for a stronger overall yield. Given that market conditions are in constant flux (just as new data to be scored on a model deviates from training data), you know it’s smart to diversify with a combination of stocks from a variety of industries, company sizes, and types. Just as timing the market to buy and sell the right stocks at the right time is an overwhelming and time-consuming task, determining the single most effective machine learning algorithm (and its tuning parameters) to use for a given problem domain and data set is daunting and often futile — even for experts. Ensemble modeling can take some of that weight off your shoulders and gives you peace of mind that the predictions are the result of a collaborative effort among models trained either from (a) different algorithms that approach the problem from different perspectives or (b) the same algorithm applied to different samples and/or with different tuning parameter settings.
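As a minimal sketch of approach (a), combining diverse algorithms by vote, here is what that might look like in scikit-learn (synthetic data, arbitrary model choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Three models that approach the problem from different perspectives
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    voting="soft",  # average predicted probabilities rather than counting hard votes
)

print(cross_val_score(ensemble, X, y, cv=5).mean())  # often beats any single member
```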

A lot of smart people have formulated algorithms specifically around this concept over many decades – from general bagging and boosting approaches (and the many variants derived from them) to more specific algorithms such as the popular and effective random forest algorithm developed by Leo Breiman and Adele Cutler. These ensemble modeling algorithms give machine learning practitioners powerful tools for generating more robust and accurate models. My colleagues at SAS wrote a great paper describing ensemble modeling in SAS Enterprise Miner. For more on the topic of ensemble modeling, tune in to the webcast, Ensemble Modeling for Machine Learning, where I cover some common applications of ensemble modeling and go into more details on various implementations of ensemble modeling. This webinar is one installment in a series on machine learning techniques – starting with Machine Learning: Principles and Practice, and including other topics such as Principal Component Analysis for Machine Learning and an upcoming one on clustering.

Using panel data to assess the impact of legalized gambling on crime

The SAS Analytics 2015 Conference is coming soon. It is my first time attending, so when I discovered that the conference is in Las Vegas, I must admit I became more than a little excited to partake in some casual gambling. My thing is sports betting, specifically college football, and you can be sure I’ll be wagering a few dollars on the various games the weekend preceding the conference. Note to readers: As of this writing, the N.C. State Wolfpack are 4-0 against the spread. But how does my interest in sports and betting relate to panel data?

When I’m not busy analyzing point spreads and over/unders, I work on the econometrics team at SAS, developing software for panel data analysis, mainly PROC PANEL in SAS/ETS. My colleague Ken Sanford already linked panel analysis to sports in a post last year on using panel data to assess the economic impact of the Super Bowl. So the upcoming conference reminded me of a paper I read about a panel data analysis that measured whether crime rates were affected by the presence of nearby casinos.

The paper is by Falls and Thompson (2014), who performed a panel-data analysis of Michigan’s 83 counties over the years 1994-2010. Following the passage of the Indian Gaming Regulatory Act of 1988, nineteen Native American (and three additional “non-tribal”) casinos were established in Michigan, so the goal of the paper was to analyze how this change affected crime rates. The authors fit random-effects (RE) models relating county crime rates to the presence of casinos.
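As a generic illustration (not the authors’ exact specification, which is not reproduced here), a random-effects panel model for county \(i\) in year \(t\) typically takes the form \(y_{it} = \beta_{0} + x_{it}'\beta + u_{i} + \varepsilon_{it}\), where \(y_{it}\) is the crime rate, \(x_{it}\) collects the regressors (for example, an indicator for a nearby casino), \(u_{i}\) is a county-specific random effect, and \(\varepsilon_{it}\) is the idiosyncratic error.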


Pitching analytics: recommendations on how to sell your story (part 1)

I routinely speak with executives who tell me that the ability to “sell” analytical results is just as important as producing them. In this post I will share some of what I have learned in several years of presenting complicated analytical results to audiences, both technical and lay. Some of my tips address the material and some the audience. Why do they care about what you have to say? What is the opportunity cost of inaction? These are just some of the topics that any strong analytics pitch should address. Since we are well into the Major League Baseball season, and the division-leading New York Mets use SAS analytical tools to improve their decision making on and off the field, I will frame our analytics pitch with inspiration from America’s pastime. Plus, pitching comes naturally to me, as you can see in the photos I dug out.

The Wind Up: Prepare these topics before the pitch.

1.The Project: How do I support my methodology?

Each analytics project involves some methodological choices, so it is important to articulate these when pitching analytics. Some of my points of emphasis are:

    • Prepare to talk about a model: Whether you are optimizing, estimating or predicting, thinking in terms of a model provides a basis for a well-conceived project. It forms the basis for all data cleansing, transformation and enrichment. There is no way to do analytics without some notion of a model. Writing it down will provide value, saving time and budget.
    • Simplify: Have you made all simplifications possible to solve your objective? Do you know why you chose to simplify? Your solution does not need elegance, unless it materially improves the outcome.

2. The Data: What data am I using and why those data? Some important features of your data might be:

    • Data Generating Process (DGP): What created the observations? Are your data actually telling you about consumer behavior or do your data actually represent poor data collection processes? For instance, is my transactional data system blending retail items that gradually change qualities? This might happen with new products. Will those data ever enlighten me about the product life cycle? Or does B2B data from a warehouse actually tell us anything about downstream consumer behavior? Having an understanding of how the data were created will aid in evaluating their suitability to answer certain questions.
    • Data source: Are these from a transactional database or a third-party? Have you enriched the data? Can you?
    • Perfect data: What would perfect data look like? Understanding the perfect data will help to build a mental model for the project. Concessions can be made later. While being enamored with the academically “best” answer can be time consuming and wasteful, having an idea of the perfect data set will help you prioritize your efforts around data cleaning and acquisition.

3. The Audience: Who am I presenting to? Understanding the audience is the single best chance you have for adoption when pitching analytics. What can you do to improve your chances? Do a little light research (don’t be a stalker!). LinkedIn and quick internet searches may reveal some useful info such as:

    • Highest Degree Obtained: A recent Ph.D. or Masters in a technical discipline will indicate considerable comfort in concepts of regression and optimization. This may not indicate comfort with certain “jargon.”
    • Field of study: This is the most important item I learn. Why? It tells me exactly how much and what type of math/statistics has been taught and what jargon is used. Only formal statisticians, actuaries, and biostatisticians use the term GLM (generalized linear model). Other disciplines might call a GLM with a Poisson distribution simply a Poisson regression.
    • Publications and presentations: Look on a person’s CV or LinkedIn for publications or presentations. It will indicate where their passion is. Can you motivate your chosen methodology with an example they can identify with?

Do you have any other suggestions on preparing your pitch? If you do, please leave me a comment below or look for me at the Analytics 2015 conference, where I'll be leading a table talk on selling analytics to management. In my next blog post I will focus on how to develop materials, deliver a strike, and avoid potential missteps on your path to successfully pitching analytics.

5 Challenges of cloud analytics

The need for fast and easy access to high-powered analytics has never been greater than it is today. Fortunately, cloud processing still holds the promise of making analytics more transparent and ubiquitous than ever before. Yet a significant number of challenges still exist that prevent more widespread adoption of cloud analytics.

Broadly speaking, most modern cloud deployments don’t suffer from a lack of hardware resource availability but rather are compromised due to poor software architecture and design. As with in-memory distributed computing, software has to be written specifically to take advantage of the way that cloud systems need to work; otherwise, gains in productivity and cost reductions often will fail to materialize. Many cloud analytics adopters have found themselves locked into poorly functioning cloud environments because they didn’t ask the right questions up front related to software architecture and associated dependencies.[1]

The most important issues that cloud analytics processing systems must address are:

  1. Guaranteeing security.
  2. Optimizing work throughput through the support of different processing paradigms.
  3. Ensuring high availability in spite of required maintenance.
  4. Allowing tracking and charge-back of individual units of work.
  5. Transparency around the total cost of ownership (TCO) and what are often considered ‘hidden’ costs.

Indirectly these challenges speak to the maturity of any software or application system and really reflect the amount of design effort that has been implemented by a specific vendor. More directly, any data processing system, analytics enabled or not, that does not support these five capabilities fails to exhibit robust cloud resiliency and exposes potential liabilities in terms of long-term adoption. Let’s look at each in more detail – and what to watch for when addressing them to ensure the success of your cloud analytics effort.

Perhaps the biggest issue on most cloud users’ minds is security. Since cloud supporters don’t want private information to be exposed or stolen, cloud-based software must support built-in flexibilities that allow it to work easily with a multitude of popular security tools. Software vendors need to be adept at identifying popular and emerging trends and design their software to work among a variety of different proven security protocols.

The next big obstacle to implementing a well-functioning cloud analytics application is to have software that can run on practically any operating system or hardware configuration; this is where the concept of dynamic throughput becomes important. Ideally, analytics software should not be confined to a specific processing strategy but rather have built-in “smart” capabilities that allow it to distinguish between different processing scenarios (topologies) and dynamically choose how the analytics are executed without “fork-lifting,” or moving the data to the analytics. This is not a trivial task. It means that the software has to be literally self-aware of what resources it has at its disposal, maybe choosing between highly distributed MPP in-memory networks, in-stream processing, a grid environment, a single-machine (SMP) instantiation, or even a slower single-threaded processor, depending on what’s available. The ability to switch between different processing paradigms, while at the same time scaling up or down in resources, and without requiring any user intervention or code modification, is key to having modern sustainable cloud analytics application systems in the future.
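A toy sketch of that self-aware dispatch idea (purely illustrative; real systems are far more involved, and the resource names below are hypothetical):

```python
def choose_execution_strategy(resources):
    """Pick a processing topology from what the environment reports as available.

    `resources` is a hypothetical dict such as {"mpp_nodes": 8, "grid_nodes": 0, "cores": 16}.
    """
    if resources.get("mpp_nodes", 0) > 1:
        return "distributed in-memory (MPP)"
    if resources.get("grid_nodes", 0) > 1:
        return "grid"
    if resources.get("cores", 1) > 1:
        return "single-machine multithreaded (SMP)"
    return "single-threaded"

print(choose_execution_strategy({"mpp_nodes": 8, "grid_nodes": 0, "cores": 16}))
```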

The next major challenge for cloud systems involves the issue of maintainability and governance. IT departments are legitimately concerned about their business users installing software they don’t have any control over and cannot support. Add to these worries the fact that most mission-critical apps need nearly 100 percent uptime.[2] From an availability perspective, it is no longer acceptable to shut down a server in order to replace or upgrade it. Ideally, redundancies need to be built into the software so that duplication of control and processing exists, making maintenance easier.

Charge-back is the fourth roadblock on the “cloud minders” list. Generally there exists a consensus that costs need to be generated on basic units of work in order that resource usage can be recovered and/or assigned to different business units. If the software doesn’t allow for the possibility of cost assignment to the most basic units of work (usually associated with individual processes), then the cloud environment cannot support a la carte pay-as-you-go pricing.

The final hurdle for cloud analytics is being able to assess total ownership costs. A lot of on-line analytics vendors provide services that appear to be cheap at the beginning until you actually want to use your results. And don’t think about making any mistakes where you might have to repeat the process. What is promoted as low cost on the front-end actually represents the accumulated costs for data storage, database access, transfer of data (bandwidth), memory allocations, numbers of users, row-level scoring, and a variety of other tasks and resources used (like consulting and IT support). In order to assess profitability, the entire analytics lifecycle, including deployment costs, needs to be quantified and assessed. Cloud analytics needs to be able to compartmentalize each of these costs up-front so there are no “surprises” later on. This will allow potential cloud users to know whether their use of the technology is actually cheaper than running the software on their own network or servers.

We are still at the dawn of cloud analytics computing. Vendors are struggling with the challenges listed above and how to architect their software to accommodate the vision and needs of a true cloud environment. The good news is that SAS has a vision for the future that will meet and exceed all of these requirements. Learn more about cloud analytics from SAS.

[1] http://www.cio.com/article/2825257/cloud-computing/cio-face-cloud-computing-challenges-pitfalls.html

[2] http://www.entrepreneur.com/article/241449

I live in Arizona, so I picked a cloud image from our big southwest sky. Image credit: photo by Umberto Salvagnin // attribution by creative commons