I've seen the future of data science....

"I've seen the future of data science, and it is filled with estrogen!" This was the opening remark at a recent talk I heard. If only I'd seen that vision of the future when I was in college. You see, I’ve always loved math (and still do). My first calculus class in college was at 8 a.m. on Mondays, Wednesdays and Fridays, and I NEVER missed a class. I showed up bright-eyed and bushy tailed, sat on the front row, took it all in, and aced every test. When class was over, I’d all but sprint back to my dorm room to do homework assignments for the next class. The same was true for all my math and statistics classes. But despite this obsession, I never considered it a career option. I don’t know why, maybe because I didn’t know any other female mathematicians or statisticians, or I didn’t know what job opportunities even existed. Estrogen wasn't visible in the math side of my world in those days; I didn't see myself as part of the future of data science.

Fast-forward (many) years, and I find myself employed at SAS in a marketing and communications capacity, working closely with colleagues who are brilliant mathematical and analytical minds, many of whom are women. They are definitely the future of data science!

Several of these colleagues helped establish the NC Chapter of the Women in Machine Learning and Data Science (WiMLDS) Meetup, which just held its inaugural gathering a couple of weeks ago. The Meetup was founded to facilitate stronger collaboration, networking and support among women in the machine learning and data science communities, as well as to grow the talent base in these fields of study. In other words, to build the future of data science and populate it with women! The NC chapter plans to host informal quarterly Meetup events that will feature presentations from practitioners and the academic community, as well as tutorials and learning opportunities.

This inaugural event featured guest speaker Jennifer Priestley, Professor of Applied Statistics and Data Science at Kennesaw State University, who greeted the estrogen-filled audience. She talked at length about the field of data science and the talent gap, and she made the case for getting a PhD in data science.

She said she’s starting to see PhD programs recognize data science as a unique discipline or field of study. She referred to data science as “the science of understanding data; the science of how to translate massive amounts of data into meaningful information to find patterns, engage in research and solve business problems.”

Priestley attributes the rise of data science to the 3 Vs – volume, velocity and variety – across all industries and sectors. She said companies that wouldn’t have classified themselves as data companies a few years ago do now, and they require skilled labor to help them manage that data and use it to make business decisions.

To help fill this talent gap, she talked about the need for PhD programs in data science, but explained that such programs need to be "21st-century" programs built around an applied curriculum. That is the approach behind Kennesaw State University's own PhD program in Analytics and Data Science.

As I sat in the back listening in, I wondered what would have happened had I been exposed to a network like this during my early college days when I was trying to pick a major and think about a career. Would I have been part of the future of data science? Maybe I’d have made a different decision. Who knows – maybe it’s not too late. It’s certainly not too late to inspire another woman to tap into a supportive network like this.

For more information, visit the NC WiMLDS Meetup website.


Self-service analytics with SAS – and what we should NOT borrow from the ancient Egyptians

I recently read the book "Die Zahl die aus der Kälte kam" (which would be The Number That Came in from the Cold in English) written by the Austrian mathematician Rudolf Taschner. He is ingenious at presenting complex mathematical relationships to a broader audience. One of his examples deals with the power of the high priests in ancient Egypt. While reading that chapter, I came to a realization about self-service analytics.

The high priests of ancient Egypt would certainly have forbidden SAS Visual Analytics!

Why? The high priests in ancient Egypt had a lot of power. This power derived from a very important fact: they knew how to calculate. Using their calculations, they were able to "predict" the cycle of the Nile's flooding. To ordinary people this ability seemed preternatural and superhuman, so they were very thankful for the instructions the high priests gave about sowing and harvesting.

It is therefore no surprise that the high priests had no interest in having their knowledge spread among the population, as this would have reduced their power significantly. They were most definitely opposed to self-service analytics.

The expansion of analytics in companies and organizations

In companies and organizations, data exploration and model creation were limited to a small group of people for many years. While this small group did not necessarily have the status of the high priests, many people were excluded from the circle. They could only passively receive results; they were not able to perform analyses on their own.

With SAS, self-service analytics and the democratization of analytics become reality!

With solutions like SAS Visual Analytics and SAS Visual Statistics, SAS makes it possible for business users to explore their own data, generate results, and test analytical models on their own first.

Is this a good thing? Sure! Business experts know the history and business background of the data better than anyone else. They can assess results and findings from a business point of view and put them in the context of the analysis question.

For example:

  • Finding contradictions in the data that go undetected in basic data quality profiling.
  • Spotting notable patterns in the data that should be used as important explanatory variables in a predictive model.
  • Detecting relationships in the data that can then be analyzed in detail with a statistician in an analytical project.

Should analytical experts fear for their jobs?

No. Because appetite comes with eating. The more people in an organization are dealing with data analysis, the more knowledge and ideas are generated. Consequently, more analytic expertise is needed for important decisions in the company.

In companies and organizations, however, it will be necessary to establish the right "Analytic Culture." Those who detect relationships and anomalies in the data need a platform where they can communicate their findings and receive feedback.

Analytic Culture – The SAS Analytics 2015 Conference in Rome

"Analytic Culture:" this was the slogan of the SAS Analytics 2015 conference in Rome. Top-class presenters and more the 700 attendees made the conference to the top analytic event of the year in Europe. My presentation on “Discovery Analytics with SAS Visual Analytics and SAS Visual Statistics" can be downloaded at the SAS Community Website.

This blog contribution is also available in German at the SAS Mehr Wissen Blog.

Image credit: photo by Dennis Jarvis and V Manninen  // attribution by creative commons

Of Big Data Competitions, Sports Analytics, and the Internet of Bugs

It is said that everything is big in Texas, and that includes big data. During my recent trip to Austin I had the privilege of being a judge in the final round of the Texata Big Data World Championship, a fantastic example of big data competitions.

It felt fitting that I arrived on the day of a much-anticipated University of Texas Longhorns game and witnessed the city awash with college students proudly wearing burnt-orange shirts. Their enthusiasm notwithstanding, my personal sport of choice is not really football but rather big data competitions! And I saw plenty of competitiveness in this particular venue.

Texata is quite new, this being only its second year, but it has already generated significant buzz among big data competitions. Competitors come from all over the world and tend to be seasoned professionals or graduate students from prestigious university programs. By the time they reach the finals, they have already undergone two intense rounds of question answering, data exploration and coding. Unlike other competitions where teams have months to play with a dataset (often preprocessed and curated), and get ranked based on very specific quantitative criteria, in the Texata finals each individual participant is given a real dataset on the spot and only four hours to work with it and extract some sort of meaningful value. This year’s dataset was a collection of customer support tickets and chat logs from Cisco, in whose facility the finals took place. This closely resembles the real world of a data scientist... messy unstructured data, open problem definitions, and a running clock.

Not having a leaderboard, as some other big data competitions do, means that the judges must evaluate the candidates and pick the winner. This was a tough choice, given twelve very talented people, many having traveled from other continents, who put in a large effort and gave their very best. All of us on the judging panel took the responsibility very seriously. At the same time, it was sheer fun to see how each candidate took their own approach, reaching into a large toolbox: latent semantic analysis, clustering, multi-dimensional scaling, and graph algorithms, for example. Some contestants focused on categorizing, others on visualizing, and yet others on inferring causal relations. Every single solution yielded some unique and valuable insight into the data.

At the end of the day, the winner was Kristin Nguyen from Vietnam. Her analysis had the best balance between technical soundness of the code, variety of techniques and presentation clarity. Plus, in 2014 she had already placed second, so this was no fluke. Well deserved, Kristin!

As an added treat, on the following day I got to speak at the companion Texata Summit event. That gave me the chance to show off some exciting examples of using SAS in sports analytics, such as season ticket customer retention in football (both the American and European versions of it!). Also baseball – remember the 2011 movie Moneyball? Scouting continues to be a major application of analytics, allowing small teams to punch well above their weight. Many other sports use analytics, from basketball to Olympic rowing.

Perhaps most exciting of all, there are novel frontier areas identified in the comprehensive report “Analytics in Sports: The New Science of Winning” by Thomas Davenport. For instance, image and video data can be used for crowd management in a stadium, or to track players in the field. In other cases, athlete performance monitoring is of interest. This allowed me to slightly lift the veil on new R&D work related to images, video and wearable sensors:

[Presentation slides: R&D work on images, video and wearable sensors]

Thanks to SAS's ongoing collaboration with Prof. Edgar Lobaton and Prof. Alper Bozkurt of North Carolina State University, involving multiple groups within the Advanced Analytics division of R&D at SAS, I am now aware that golf is actually a rather stressful sport! Looking at EKG activity, it is apparent that heart rate climbs to enormous levels in the moments before a swing. Also, while wearables and the Internet of Things are hot topics right now, we should all keep an eye on Profs. Lobaton and Bozkurt's other work - I like to call it the Internet of Bugs:

[Presentation slide: the cyborg cockroach "Internet of Bugs" work]

As featured in the New Scientist and Popular Science, these cyborg-augmented hissing cockroaches can be instrumental in search and rescue operations. Responders can steer them by applying electrical impulses to their antennae and locate potential survivors in rubble via directional microphones and positioning sensors. SAS has very strong tools that are uniquely suited for processing and analyzing this type of streaming data – for example, SAS® Event Stream Processing can acquire real-time sensor signals, while SAS® Forecast Server and SAS® Enterprise Miner™ can perform signal filtering, detect cycles and spikes, and analyze the aggregate position coordinates of the insects, say, to map a structure and find locations of interest.

After the presentation, #Texata and @SASSoftware Twitter traffic contained multiple variations of the words “weird” and “awesome” - which is a fitting description of data science itself. Truly you never know where your data will come from!


Decision tree learning: What economists should know

As an economist, I started at SAS with a disadvantage when it comes to predictive modeling. After all, like most economists, I was taught how to estimate marginal effects of various programs, or treatment effects, with non-experimental data. We use a variety of identification assumptions and quasi-experiments to make causal interpretations, but we rarely focus on the quality of predictions. That is, we care about the “right-hand side” (RHS) of the equation. For those trained in economics in the past 20 years, predictive modeling was regarded as “data mining,” a dishonorable practice.

Since beginning my journey at SAS I have been exposed to many practical applications of predictive modeling that I believe would be valuable for economists. In this and a series of future blogs, I will write about how I think about the "left-hand side" (LHS) of the equation with respect to the tools of predictive modeling. Up first: decision tree learning.


Decision tree learning using the HPSPLIT procedure in SAS/STAT

Decision trees are not unfamiliar to economists. In fact, almost all economists have used trees as they learned about game theory. Game theory courses use trees to illustrate sequential games, that is, games where one agent moves first. One such example is Stackelberg competition, in which firms sequentially compete in quantities and the first mover earns greater profits. We use trees to understand sequences of decisions. The use of trees in data analysis has many similarities but some important differences. First, what are decision trees?

Decision tree learning is a very simple data categorization tool that also happens to have great predictive power. How do decision trees work? My colleague and SAS author Barry de Ville provides a nice introduction to the basics of decision tree algorithms. If you think about what these algorithms actually do, they are trying to separate data into homogeneous clusters, where each cluster has highly similar explanatory covariates and similar outcomes. You can find a great guide to many of those algorithms and more in my colleague Padraic Neville's primer on decision trees.

So why do trees work for prediction? They subset the data into groups of observations that are highly similar on a number of dimensions. The algorithms choose certain explanatory factors (X's, covariates, features), as well as interactions of those factors, to create homogeneous groups. At that point, the prediction equation is derived by some pretty complicated math ... a simple average!

That's right: once the data set is broken down into subsets (through a process known as 'splitting' and 'pruning'), the fancy prediction math is nothing but a simple average. The prediction equation is equally simple: a series of if-then statements that follow the path of the tree and eventually lead to that calculated sample average. So what does a decision tree help me do as an economist? Here are my top 3 things to love about a decision tree (a minimal PROC HPSPLIT sketch follows the list):

  1. Like regression, decision tree output can be interpreted. There are no coefficients but the results read like if-then-else business rules.
  2. Decision trees reveal the predictive power of variables while accounting for redundancy. Variables are split on if they matter for creating homogeneous groups and discarded otherwise. One caveat, however, is that only one of two highly collinear variables might be chosen.
  3. They reveal interaction effects that can inform later regression analysis. A split tells us that an interaction effect matters for prediction. This could be useful for controlling for various forms of unobserved heterogeneity or for turning continuous variables into categorical variables.
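
To make the discussion concrete, here is a minimal sketch of a classification tree fit with the HPSPLIT procedure in SAS/STAT. The data set and variable names (a hypothetical customer file with a binary churn outcome) are placeholders for illustration; the interesting part is the CODE statement, which writes the fitted tree out as exactly the kind of if-then business rules described above.

  proc hpsplit data=customers maxdepth=4 seed=12345;
     class churn region;                       /* categorical target and input */
     model churn = region income age tenure;   /* binary outcome and candidate covariates */
     grow entropy;                              /* splitting criterion */
     prune costcomplexity;                      /* prune the tree back to avoid overfitting */
     code file='tree_rules.sas';                /* write the tree as if-then scoring rules */
  run;

The file written by the CODE statement contains nothing but nested if-then logic ending in each terminal node's sample proportion (or sample mean, for an interval target), which is the "simple average" prediction discussed above.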

So that is my list. What else should economists know about decision trees? Do you feel strongly that trees are a better exploratory data analysis tool than predictive tool?

Next time we will augment this discussion of decision trees by talking about what economists should know about ... random forests.

The future of analytics – top 5 analytics predictions for 2016


My view of the world is shaped by where I stand, but from this spot the future of analytics for 2016 looks pretty exciting! Analytics has never been more needed or interesting.

  1. Machine learning established in the enterprise

Machine learning dates back to at least 1950 but until recently has been the domain of elites and subject to “winters” of inattention. I predict that it is here to stay, because large enterprises are embracing it. In addition to researchers and digital natives, these days established companies are asking how to move machine learning into production. Even in regulated industries, where low interpretability of models has historically choked their usage, practitioners are finding creative ways to use machine learning techniques to select variables for models, which can then be formulated using more commonly accepted techniques. Expect greater interest across academic disciplines, because machine learning benefits from many different approaches. Consider the popular keynote from the INFORMS Annual Meeting last year, where Dimitris Bertsimas talked about “Statistics and Machine Learning via a Modern Optimization Lens.” My colleague Patrick Hall offers his own perspective about "Why Machine Learning? Why Now?"

  2. Internet of Things hype hits reality

The Internet of Things (IoT) is at the peak of the Gartner Hype Cycle, but in 2016 I expect this hype to hit reality. One real barrier is plumbing – there’s a lot of it! One of my colleagues is analyzing the HVAC system on our newest building as an IoT test project. The building is replete with sensors, but getting to the data was not easy. Facilities told him data are the domain of IT, who then sent him to the manufacturer, because while the HVAC system collects the data, it is sent to the manufacturer. “Data ownership” is an emerging issue – you produce the data but may not have access to it. An even larger challenge for IoT will be to prove its value. There are limited implementations of IoT in full production at the enterprise level. The promise of IoT is fantastic, so in 2016 look to early adopters to work out the kinks and deliver results.

  3. Big data moves beyond hype to enrich modeling

Big data has moved beyond hype to provide real value. Modelers today can access a wider range of data types than ever (e.g., unstructured data, geospatial data, images, voice), which offers great opportunities to enrich models. Another new gain from big data is due to competitions, which have moved beyond gamification to provide real value via crowdsourcing and data sharing. Consider the Prostate Cancer DREAM Challenge, where teams were challenged to address open clinical research questions using anonymized data provided by four different clinical trials run by multiple providers, much of it publicly available for the first time. An unprecedented number of teams competed, and winners beat existing models developed by the top researchers in the field.

  4. Cybersecurity improved via analytics

As IoT grows, the proliferation of sensors must thrill cybercriminals, who use these devices to hack in through a slow but insidious Trojan horse approach. Many traditional fraud detection techniques do not apply, because detection is no longer seeking one rare event but requires understanding an accumulation of events in context. Similar to IoT, one challenge of cybersecurity involves data, because streaming data is managed and analyzed differently. I expect advanced analytics to shed new light on detection and prevention as our methods catch up with the data. Unfortunately, growing methods for big data collaboration are off limits, because we don't want the bad guys to know how we'll find them, and much of the best work is done behind high security clearance. But that won't stop SAS and others from focusing heavily on cybersecurity in 2016.

  5. Analytics drives increased industry-academic interaction

The Institute for Advanced Analytics (IAA) at NC State University tracks the growth in analytics masters programs, and new programs seem to pop up daily. Industry demand for recruits fuels this growth, but I see increased interest in research. More companies are setting up academic outreach with an explicit interest in research collaborations. Sometimes this interest goes beyond partnership and into direct hiring of academic superstars, who either take sabbaticals, work on the side, or even go back and forth. For example, top machine learning researcher Yann LeCun worked at Bell Labs, became a professor at NYU, was the founding director of the NYU Center for Data Science, and now leads Artificial Intelligence Research at Facebook. INFORMS supports this academic-industry collaboration by providing academics a resource of teaching materials related to analytics. In 2016 INFORMS will offer industry a searchable database of analytics programs to facilitate connections and the new Associate Certified Analytics Professional credential to help vet recent graduates.

I wrote a shorter version of this piece for INFORMS, and it first appeared in Information Management earlier this week.

Image credit: photo by Chris Pelliccione // attribution by creative commons

Vector autoregressive models for generating economic scenarios


Diagnostic information from PROC VARMAX

Macroeconometrics is not dead (and I wish I had paid better attention in my time series course)

I wrote this on the way to see one of our manufacturing clients in Austin, Texas, anticipating a discussion of how to use vector autoregressive models in process control. It is a typical use case, especially in the age of the "Internet of Things," to use multiple sensors on a device to detect mechanical issues early. Industrial engineers often use rolling-window principal component analysis or a dynamic factor model to detect trends. These are fairly common applications, and in my customer engagements they had been the primary uses of multivariate time series methods. As an applied microeconomist, I believed these encounters confirmed my decision during graduate school to shirk time series econometrics courses and focus instead on cross-sectional and panel data econometrics. Until I visited a banking customer in Dallas this fall....

See, for the past several years, the banking industry has been busy hiring quants for a regulatory overhaul called CCAR/DFAST. Quantitative analysts are being employed to estimate various forms of loss models to meet CCAR/DFAST regulations, which require various economic scenarios to be factored into these loss models. Some of these models resemble cross-sectional models and some resemble autoregressive time series models, but both types are univariate in nature. My colleague Christian Macaro and I wrote a paper, "Incorporating External Economic Scenarios into Your CCAR Stress Testing Routines," that provides an overview of both methods. Although some of these models "could" use multivariate vector autoregressive models, my buddy Rajesh Selukar, developer of state space models at SAS (PROC SSM), says, "If you don't have to use multivariate time series methods, don't use them," conveying the complexity of modeling these interactions. Up to this point, my shirking had paid off. No PROC VARMAX needed.

Then came my trip to Dallas. At this bank I met a new type of CCAR/DFAST team, whose mission is to create macroeconomic forecasts of the US and international economies to be consumed by CCAR/DFAST modeling teams. Until now, most CCAR/DFAST economic scenarios have been generated by either the Federal Reserve or by one of several small consulting firms specializing in this analysis. These groups have long used complicated multivariate techniques such as dynamic stochastic general equilibrium (DSGE) or vector autoregressive models (VAR), but they tend to use smaller niche software tools in their consulting work. This new internal economic scenario team was charged with bringing economic scenario generation in house. I had heard that the Fed has become increasingly critical of relying on purchased forecasts. Tier 1 banks are now being required to generate these forecasts on their own, and the customer I met with is one of these banks.

Well darn. I guess I should have paid more attention to Dr. Ali’s ECO703 course! One of these days I will get back to the basics of multivariate time series analytics. After all, PROC SSM and VARMAX are two procedures in SAS/ETS® that work with multivariate time series models. In fact, when I get around to it, I will likely utilize a new book by Anders Milhøj, Multiple Time Series Modeling Using the SAS VARMAX Procedure, which provides both a theoretical and practical introduction to VAR models in SAS. If you already know a bit about multivariate time series and just wanted to get started with SAS, try the example programs with the VARMAX and SSM procedures and SAS Studio using free access to SAS OnDemand for Academics, whether you are a student, professor, or independent learner.
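
If you would rather see a toy example before opening the book, here is a minimal sketch of what a VAR model looks like in PROC VARMAX. The data set and series names (a hypothetical file of quarterly macro indicators) are assumptions for illustration, not anything from the bank engagement described above.

  /* Hypothetical quarterly macro series: GDP growth, unemployment, CPI inflation */
  proc varmax data=macro_quarterly;
     id date interval=qtr;                        /* quarterly time index */
     model gdp_growth unemp_rate cpi_inflation /
           p=2 print=(estimates diagnose);        /* VAR(2): two lags of every series */
     output out=var_forecasts lead=8;             /* joint 8-quarter-ahead forecasts */
  run;

The MODEL statement fits a VAR(2), so each series is regressed on two lags of all three series, and the OUTPUT statement produces joint forecasts that a scenario team could then shock or condition on to build stress paths.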

Pattern recognition software - is it for the birds?

hermit thrush

Can pattern recognition software tell us if it is a Hermit Thrush or a Swainson's Thrush we've seen? A few of us have been debating an identification question at work, because we agreed to help Fulbright Scholar and Duke University PhD student Natalia Ocampo-Peñuela with research she is doing related to bird collisions with windows. A sad little band of us at SAS spent three weeks this fall doing daily perambulations of multiple buildings on the SAS campus to look around the perimeter for dead birds, casualties of run-ins with our shiny pretty glass buildings. We recorded the species if possible (sometimes predators left us scanty evidence), hence the need for identification. I can tell you that a Hermit Thrush and Swainson's Thrush look very similar. As an avid birdwatcher myself, I have several field guide apps on my iPhone, but it got me wondering what algorithmic magic was behind the search tools most of these apps now have. You input features like state, the month, size, color, etc. and the app returns a filtered list of possibilities likely to be seen. But a new app, Merlin Bird Photo ID, developed in collaboration with the Cornell Lab of Ornithology and others, takes this flow a step further using machine learning techniques from computer vision to help identify birds. You upload an image of the bird you've seen, and Merlin compares features in the photo to those expected to be seen on that day in your location, based on a data set supplied by birders who report their sightings to a site called eBird (it's decently large data - 9.5 million observations were reported in the month of May alone!).

A quick search on pattern recognition software identified many papers on machine learning for bird identification. Improved Automatic Bird Identification through Decision Tree based Feature Selection and Bagging uses audio recordings instead of images for identification. Two researchers at Queen Mary University of London argue that Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning, and they are even raising money on Kickstarter to build Warblr, a birdsong recognition app. It will use machine learning to help you figure out which bird just serenaded you (or a prospective mate, really, but you can consider it a gift anyhow). In their larger study they trained and tested a random forest classifier, which is more than a little ironic given that many of the birdsongs were surely recorded in forests! Of course, birdsong doesn't help with our task of identifying a limp little bird, but many birds are more commonly heard than seen, so this approach offers great advantages.

Technical challenges include noise (you can't exactly get birds into a sound studio) and scalability, given the computational intensity. But some kinds of pattern identification pose even greater challenges. What if you just had a footprint to use? The bird survey at SAS was initiated by a connection with the great folks at Wildtrack, who use JMP Software from SAS to analyze data for their Footprint Identification Technique, a non-invasive method used to track elusive endangered animals. Wildtrack's Zoe Jewell and Sky Allibhai have partnered with researchers from NC State University to improve upon footprint identification, and some of their work includes a Manifold learning approach to curve identification with applications to footprint segmentation. It's a tough nut but they keep working to crack it.

My own colleagues in SAS Advanced Analytics R&D are doing interesting work on pattern recognition. Patrick Hall, Ilknur Kaynar Kabul, and Jorge Silva used PROC NEURAL in SAS Enterprise Miner to extract representative features from a training set for digit recognition, a specific challenge for pattern recognition software to tackle. They built a stacked denoising autoencoder from the Modified National Institute of Standards and Technology (MNIST) digits data, which they describe in this paper on Machine Learning in SAS Enterprise Miner. The code is in Patrick's GitHub repo. Now if I can just get them interested in bird recognition, maybe we'll be able to settle the debate about Hermit vs. Swainson's Thrush...

Additional resources

Image credit: photo by Kelly Colgan Azar // attribution by creative commons

6 machine learning resources for getting started

If you tuned in for my recent webinar, Machine Learning: Principles and Practice, you may have heard me talking about some of my favorite machine learning resources, including recent white papers and some classic studies.

As I mentioned in the webinar, machine learning is not new. SAS has been pursuing machine learning in practice since the early 1980s. Over the decades, professionals at SAS have created many machine learning technologies and have learned how to apply machine learning to create value for organizations. This webinar series is one of many resources that can help you understand what machine learning is and how to use it.

SAS Resources:

Machine Learning with SAS Enterprise Miner

See how a team of SAS Enterprise Miner developers used machine learning techniques to predict customer churn in a famous telecom dataset.

An Overview of Machine Learning with SAS Enterprise Miner

This technical white paper includes SAS code examples for supervised learning from sparse data, determining the number of clusters in a dataset, and deep learning.

Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners

Written for corporate leaders and technology and marketing executives, this book shows how organizations can harness the power of high-performance computing architectures together with data mining, text analytics, and machine learning algorithms.

External Resources:

Statistical Modeling: The Two Cultures

The grandfather of machine learning, Leo Breiman, outlines the fundamental ideas and philosophies of the discipline and discusses two different approaches to modeling and data analysis.

7 common mistakes of machine learning

Whether you’re a seasoned pro or a noob, machine learning is tricky. Save yourself some time by avoiding these common mistakes.

11 clever Methods of overfitting and how to avoid them

Probably the most common mistake in machine learning, and one of the hardest to avoid, is overfitting your training data. This article highlights some common (and not so common) practices that can lead to overfitting.


Principal Component Analysis for Dimensionality Reduction

When you work with big data, you often deal with both a large number of observations and a large number of features. When the number of features is large, they can be highly correlated, resulting in a significant amount of redundancy in the data. Principal component analysis can be a very effective method in your toolbox in a situation like this.

Consider a facial recognition example, in which you train algorithms on images of faces. If training is on 16x16 grayscale images, you will have 256 features, one for the intensity of each pixel. Because the values of adjacent pixels in an image are highly correlated, most of the raw features are redundant. This redundancy is undesirable, because it can significantly reduce the efficiency of most machine learning algorithms. Feature extraction methods such as principal component analysis (PCA) and autoencoder networks enable you to approximate the raw image by using a much lower-dimensional space, often with very little error.

PCA is an algorithm that transforms a set of possibly-correlated variables into a set of uncorrelated linear combinations of those variables; these combinations are called principal components. PCA finds these new features in such a way that most of the variance of the data is retained in the generated low-dimensional representation. Even though PCA is one of the simplest feature extraction methods (compared to other methods such as kernel PCA, autoencoder networks, independent component analysis, and latent Dirichlet allocation), it can be very efficient in reducing dimensionality of correlated high-dimensional data.

For the facial recognition problem described above, suppose you reduce the dimension to 18 principal components while retaining 99% of the variation in the data. Each principal component corresponds to an "eigenface," a highly representative mixture of all the faces in the training data, as shown in Figure 1. Using the 18 representative faces generated by the principal components, you can represent each image in your training set by an 18-dimensional vector of weights (\(w_{1}\),...,\(w_{18}\)) that tells you how to combine the 18 eigenfaces, instead of using the original 256-dimensional vector of raw pixel intensities.

Figure 1: Eigenfaces method for facial recognition

Now suppose you have a new image, and you wonder if this image belongs to a person in your training set. You simply need to calculate the Euclidean distance between this new image's weight vector (\(w_{1}\),...,\(w_{18}\)) and the weight vectors of the images in your training set. If the smallest Euclidean distance is less than some predetermined threshold value, voilà – facial recognition! Tag this new image as the corresponding face in your training data; otherwise tag it as an unrecognized face. If you want to learn more about face recognition, see the famous paper Face Recognition Using Eigenfaces.
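
As a rough sketch of how these steps might look in SAS, the code below extracts 18 principal components with PROC PRINCOMP, projects a new image onto the same components with PROC SCORE, and finds the nearest training face by Euclidean distance in a DATA step. The data set names (faces_train, new_face), the person identifier, and the pixel variables are hypothetical placeholders, not a production facial recognition pipeline.

  /* Hypothetical training set: one row per labeled face, pixel1-pixel256 = 16x16 intensities */
  proc princomp data=faces_train out=train_scores outstat=pca_stats n=18;
     var pixel1-pixel256;
  run;

  /* Project a new, unlabeled image onto the same 18 components ("eigenfaces") */
  proc score data=new_face score=pca_stats out=new_score;
     var pixel1-pixel256;
  run;

  /* Nearest neighbor in the 18-dimensional weight space */
  data distances;
     if _n_ = 1 then
        set new_score(keep=Prin1-Prin18 rename=(Prin1-Prin18=q1-q18));
     set train_scores(keep=person Prin1-Prin18);
     array w{18} Prin1-Prin18;
     array q{18} q1-q18;
     dist = 0;
     do j = 1 to 18;
        dist = dist + (w{j} - q{j})**2;
     end;
     dist = sqrt(dist);
     keep person dist;
  run;

  proc sort data=distances;
     by dist;                /* smallest distance = best candidate match */
  run;

If the smallest dist value in the sorted output falls below your chosen threshold, tag the new image as that person; otherwise treat it as unrecognized.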

Labeling and suggesting tags in images are common uses of reduced-dimensional data. Similarly, this approach can be used for analyzing audio data for speech recognition, or in text mining for web search or spam detection. Perhaps a more common application of dimension reduction is in predictive modeling. You can feed your reduced-dimensional data into a supervised learning algorithm, such as a regression, to generate predictions more efficiently (and sometimes even more accurately).

Another major use of dimension reduction is to visualize your high-dimensional data, which you might not be able to otherwise visualize. It’s easy to see one, two, or three dimensions. But how would you make a four-dimensional graph? What about a 1000-dimensional graph? Visualization is a great way to understand your data, and it can also help you check the results of your analysis. Consider the chart in Figure 2. A higher-dimensional data set, which describes hospitals in the United States, was clustered and projected onto two dimensions. You can see that the clusters are grouped nicely, for the most part. If you are familiar with the US health-care system, you can also see that the outliers in the data make sense, because they are some of the best-regarded hospitals in the US! (Of course, just because an analysis makes sense to you does not guarantee that it is mathematically correct. However, some agreement between human and machine is usually a good thing.)

Figure 2: Sixteen Clusters of Hospitals Projected onto Two Dimensions Using a Dimension Reduction Technique
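
If you would like to build a similar view of your own data, here is a minimal sketch of one way to do it: cluster the high-dimensional records, project them onto the first two principal components, and plot the projection colored by cluster. The data set and variable names are hypothetical, and the original figure may well have used a different clustering or projection technique; this is just one simple recipe.

  /* Hypothetical hospital data set with many numeric measures per hospital */
  proc fastclus data=hospitals maxclusters=16 out=clustered;
     var measure1-measure50;
  run;

  /* Project the same measures onto two principal components for plotting */
  proc princomp data=clustered out=projected n=2;
     var measure1-measure50;
  run;

  proc sgplot data=projected;
     scatter x=Prin1 y=Prin2 / group=cluster;   /* CLUSTER is created by PROC FASTCLUS */
  run;

Outliers that sit far from every cluster, like the well-regarded hospitals mentioned above, show up immediately in such a plot.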

If these examples have caught your interest and you now want more information about PCA, tune in to my webcast, Principal Component Analysis for Machine Learning, where I discuss PCA in greater detail, including the math behind it and how to implement it using SAS®. If you are a SAS® Enterprise Miner™ user, you can even try the hospital example for yourself with the code we've placed in one of our GitHub repos.

Data science training - is it possible?

Ok, so the title is a little provocative, but some people are dubious that data science training is even possible, because they believe data science entails skills one can learn only on the job and not in a classroom. I am not in that camp, although I do believe that data science is something you learn by doing, with an emphasis on both the learning and the doing. So how and where can you learn to do data science, if you want to become a data scientist?

There is no agreed-upon definition of data science, but I like to think of it as three legs of a stool - a strong quantitative foundation, excellent programming skills, and a keen understanding of business and communication. A quantitative foundation is most often learned in university, but as I've written previously, the disciplines studied can range from statistics to archaeology, and it can pay to go recruiting outside the traditional academic disciplines. A solid academic background is an invaluable start, although I hear businesses complain that many graduates are not prepared for real-life problems, where data can be messy and/or sparse and you may not have the luxury of an elegant solution. More and more academic programs incorporate case studies, practicums, internships, etc. Earlier this year Tom Davenport hosted a Tweetchat on the top ten ways businesses can influence analytics education, which is a good read if you want to influence the graduate pipeline. There are even emerging PhD programs in data science.

Programming skills are the second leg of the stool. While many university classes incorporate the use of software, more and more people seek to learn on their own, whether they are current students or working professionals. MOOCs have become increasingly popular, with many turning first to a source like Coursera to find the content they want. SAS jumped into this game with the launch last year of SAS® University Edition. This new offering was designed to address the demand we hear from those who want to learn SAS, as well as from those who want to hire students with SAS skills. It has proven very popular: as of today it has been downloaded over 407,000 times, from Afghanistan to Zimbabwe. While it is called University Edition, it is available to anyone seeking to learn for non-commercial purposes. The SAS Analytics U Community offers a ton of free resources to help your learning, including tutorials, e-learning, a discussion board, etc. It's a powerful offering, with no limitations on the data you use, and it is accessible as a downloadable package of selected SAS products that runs on Windows, Linux and Mac.

The third leg of the stool is business acumen and the ability to communicate well. These are the skills that are hardest to pick up in a university program and may be best learned on the job. One shortcut could be the SAS Academy for Data Science, which is an intensive data science certification program that combines hands-on learning and case studies in a collaborative environment. In addition to covering key topics like machine learning, time series forecasting, and optimization, students will learn important programming skills in a blended approach with SAS, Hadoop, and open source technologies. There's even a module on Communicating Technical Findings with a Non-Technical Audience. The Academy covers all the content necessary to sit for the new Big Data Certification and Data Science Certification that SAS is offering.

If these topics are of interest to you and you'll be attending the SAS Analytics 2015 Conference in Las Vegas October 26-27, you're in luck! On Monday Dr. Jennifer Priestley of Kennesaw State University is giving a talk on Is It Time for a PhD in Data Science? On Tuesday afternoon Cat Truxillo will be talking about our new certifications for data science in a table talk called World-Class Data Science Certification From the Experts. And my colleague Sharad Prabhu and I will be leading a table talk on Tuesday afternoon on SAS® University Edition – Connecting SAS® Software in New Ways to Develop More SAS® Users. If you're there, come join us!

Image credit: photo by nraden // attribution by creative commons