Econometric and statistical methods for spatial data analysis

We live in a complex world that overflows with information. As human beings, we are very good at navigating this maze, where different types of input hit us from every possible direction. Without really thinking about it, we take in the inputs, evaluate the new information, combine it with our experience and previous knowledge, and then make decisions (hopefully, good informed decisions). If you think about the process and the types of information (data) we use, you quickly realize that most of the information we are exposed to contains a spatial component (a geographical location), and our decisions often include neighborhood effects. Are you shopping for a new house? In the process of choosing the right one, you will certainly consider its location, neighboring locations, schools, road infrastructure, distance from work, store accessibility, and many other inputs (Figure 1). Going on a vacation abroad? Visiting a small country with a low population will probably be very different from visiting a popular destination surrounded by densely populated larger countries. All these examples illustrate the value of econometric and statistical methods for spatial data analysis.


Figure 1: Median Listing Price of Housing Units in U.S. Counties (Source: http://www.trulia.com/home_prices/; retrieved on June 6th 2016)

We are all exposed to spatial data, which we use in our daily lives almost without thinking about it. Not until recently have spatial data become popular in formal econometric and statistical analysis. Geographical information systems (GIS) have been around since the early 1960s, but they were expensive and not readily available until recently. Today every smart phone has a GPS, cars have tracking devices showing their locations, and positioning devices are used in many areas including aviation, transportation, and science. Great progress has also been made in surveying, mapping, and recording geographical information in recent years. Do you want to know the latitude and longitude of your house? Today that information might not be much further away than typing your address into a search engine.

Thanks to technological advancement, spatial data are now only a mouse-click away. Though they vary in volume and variety, data of interest for econometric and statistical methods for spatial data analysis fall into three categories: spatial point-referenced data, spatial point-pattern data, and spatial areal data. The widespread use of spatial data has put spatial methodology and analysis front and center. Currently, SAS enables you to analyze spatial point-referenced data with the KRIGE2D and VARIOGRAM procedures and spatial point-pattern data with the SPP procedure, all of which are in SAS/STAT. The next release of SAS/ETS (version 14.2) will include a new SPATIALREG procedure that has been developed for analyzing spatial areal data. This type of data is the focus of spatial econometrics.

Spatial econometrics was developed in the 1970s in response to the need for a new methodological foundation for regional and urban econometric models. At the core of this new methodology are the principles on which modern spatial econometrics is based. These principles essentially deal with two main spatial aspects of the data: spatial dependence and spatial heterogeneity. Simply put, spatial econometrics concentrates on accounting for spatial dependence and heterogeneity in the data in a regression setting. This matters because ignoring spatial dependence and heterogeneity can lead to biased or inefficient parameter estimates and flawed inference. Unlike standard econometric models, spatial econometric models do not assume that observations are independent. In addition, the quantification of spatial dependence and heterogeneity is often based on the proximity of two regions, which is represented by a spatial weights matrix in spatial econometrics. The idea behind such quantification resonates with the first law of geography: "Everything is related to everything else, but near things are more related than distant things."
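To make the role of the spatial weights matrix concrete, the simplest spatial lag (SAR) model can be written in standard textbook notation (generic notation, not tied to any particular procedure):

\[ y = \rho W y + X\beta + \varepsilon \]

Here W is an n-by-n spatial weights matrix whose entry in row i and column j is nonzero only when regions i and j are neighbors (and is typically row-standardized so that each row sums to one), X and beta form the usual regression part, and rho measures the strength of spatial dependence: the outcome in one region depends directly on the outcomes of its neighbors.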

In spatial econometric modeling, the key challenge often centers on how to choose a model that describes the data at hand well. As a general guideline, model specification starts with understanding where spatial dependence and heterogeneity come from, which is often problem-specific. Some examples of such problems are pricing policies in marketing research, land use in agricultural economics, and housing prices in real estate economics. For instance, car sales at one auto dealership might depend on sales at a nearby dealership, either because the two dealerships compete for the same customers or because of some form of unobserved heterogeneity common to both dealerships. Based on this understanding, you proceed with a particular model that is capable of addressing the spatial dependence and heterogeneity that the data exhibit. You then revise the model until you identify one that performs well on criteria such as Akaike's information criterion (AIC) or the Schwarz-Bayes criterion (SBC). Three types of interaction contribute to spatial dependence and heterogeneity: exogenous interaction, endogenous interaction, and interaction among the error terms. Among the wide range of spatial econometric models, some are well suited to one type of interaction effect, whereas others accommodate the alternatives. If you don't choose your model properly, your analysis can provide false assurance and flawed inference about the underlying data.
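For reference, the three interaction types map directly onto terms of a general spatial model, again written in standard notation rather than the syntax of any procedure:

\[ y = \rho W y + X\beta + W X\theta + u, \qquad u = \lambda W u + \varepsilon \]

The term involving rho captures endogenous interaction (neighbors' outcomes affect your outcome), the term involving theta captures exogenous interaction (neighbors' characteristics affect your outcome), and the term involving lambda captures interaction among the error terms. Restricting theta and lambda to zero gives the spatial autoregressive model, restricting rho and theta to zero gives the spatial error model, and restricting only lambda to zero gives the spatial Durbin model.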

In the next blog post, we’ll talk more about econometric and statistical methods for spatial data analysis by discussing spatial econometric analysis that uses the SPATIALREG procedure. In particular, we’ll discuss some useful features in the SPATIALREG procedure (such as parameter estimation, hypothesis testing, and model selection), and we’ll demonstrate these features by analyzing a real-world example. In the meantime, you can also read more in our 2016 SAS Global Forum paper, How Do My Neighbors Affect Me? SAS/ETS® Methods for Spatial Econometric Modeling.

Optimization for machine learning and monster trucks


My son with a beloved monster truck

Optimization for machine learning is essential to ensure that data mining models can learn from training data in order to generalize to future test data. Data mining models can have millions of parameters that depend on the training data and, in general, have no analytic definition. In such cases, effective models with good generalization capabilities can only be found by using optimization strategies. Optimization algorithms come in all shapes and sizes, just like anything in life. Attempting to create a single optimization algorithm for all problems would be as foolhardy as seeking to create a single motor vehicle for all drivers; there is a reason we have semi-trucks, automobiles, motorcycles, and so on. Each prototype is developed for a specific use case. Just as in the motor vehicle world, optimization for machine learning has its superstars that receive more attention than other approaches that may be more practical in certain scenarios. So let me explain some ways to think about different algorithms with an analogy drawn from my personal experience with monster trucks.

My son fell in love with monster trucks sometime after he turned two years old. He had a favorite monster truck that he refused to put down, day or night. When he turned two, I unwisely thought he would enjoy a monster truck rally and purchased tickets, imagining the father and son duo making great memories together. If you have never been to a monster truck rally, I don't think I can convey to you how loud the sporadically revved monster truck engines can truly be. The sound feels like a physical force hitting squarely in the chest, sending waves of vibrations outward. Despite purchasing sound suppression "monster truck tire" ear muffs for my son, he was not a happy camper and refused to enter the stadium. Instead he repeatedly asked to go home. I circled the outer ring of the stadium, where convenience stands sell hot dogs, turkey drumsticks, and monster truck paraphernalia. I naïvely hoped he would relax and decide to go inside to watch the show. I even resorted to bribery: "Whatever monster truck you want, I'll buy if you just go inside the stadium with me." But within 20 minutes of the rally starting, after a relentless loop of "I want to go home, Daddy," the mighty father and son duo passed perplexed ticket-taking staff on our retreat to the parking lot in search of our humble but reliable 2010 Toyota Corolla.


Just a few of his monster trucks

My biggest fear was that my son's deep love and fascination with monster trucks had been destroyed. Fortunately, he continued watching Hard Hat Harry monster truck episodes incessantly with his current favorite monster truck snuggled safe beside him. So I know more fun facts about monster trucks than I care to admit. For example, monster trucks can actually drive on water, have front- and back-wheel steering, have windows at the driver's feet for wheelie control, have engines that shut off remotely, are capable of consecutive back-flips, can jump over 230 feet without injury, and put away seven gallons of gas traveling less than a mile. When not on exhibition, monster trucks require constant tuning and tweaking, such as shaving rubber from their gigantic tires to reduce weight and permit them to fly as high as possible on jumps. Monster trucks are their owners' passion, and the owners don't mind spending 24/7 maintaining these (I now realize) technological marvels created by human ingenuity and perhaps too much spare time.

Despite how much my son loves them, I have zero desire to own a monster truck. I am a busy dad and value time even more than money (to my wife’s lament). Though my son would be overjoyed if I brought home a truck with tires that towered over him, I personally prioritize factors of safety, fuel efficiency, reliability, and simplicity.

There are reliable and powerful motor vehicles that we use to make our lives easier, and then there are even more powerful ones that astound and enlighten us at competitions. We similarly have superstar algorithms for optimization for machine learning that can be used to train rich and expressive deep learning models to perform amazing feats of learning, such as a deep learning system recently teaching itself to play Atari games. Among the stars of today's machine learning show are stochastic gradient methods and their dual coordinate descent variants. These methods:

  • are surprisingly simple to describe in serial form, while requiring elegant mathematics to explain their convergence in theory and their behavior in practice
  • can train a model remarkably fast (with careful initialization)
  • are currently producing world-record results.

This trifecta makes stochastic gradient descent (SGD) ideal material for journal articles. Prototype approaches can be created relatively quickly, require publishable mathematics to prove convergence, and are driven home with surprisingly good results. My personal opinion, swinging like a pendulum, is that stochastic gradient algorithms are (1) awesome, (2) incredibly useful, practical, and relevant, and (3) possibly eclipsing classes of algorithms that, for many end users, could actually be more practical and relevant when applicable.
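As a concrete (if toy) illustration of how simple SGD is to describe, here is a minimal sketch for least-squares regression in Python with NumPy. The data, learning rate, and epoch count are all made up for illustration and do not represent any particular production implementation:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Vanilla stochastic gradient descent on the average squared error
w = np.zeros(d)
learning_rate = 0.01          # too large a value can make the iterates diverge
for epoch in range(20):
    for i in rng.permutation(n):              # visit observations in random order
        grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of the i-th squared error
        w -= learning_rate * grad_i

print("estimation error:", np.linalg.norm(w - w_true))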

As my career has progressed, immersion in diverse optimization groups has impressed upon me the philosophy that for every class of algorithms there exists a class of problems on which that algorithm appears superior. This is sometimes referred to as the no free lunch theorem of optimization. For end users, this simply means that true robustness cannot come from any single best class of algorithms, no matter how loud its monster engine may roar. Robustness must start with diversity.

Figure 1 Objective value for different choices of learning rate specified by the user. Beforehand it is almost impossible to know what learning rate range works. The user must hunt and choose.

An example of someone thinking outside the current mainstream is the relatively recent work on Hessian-free optimization by James Martens. Deep learning models are notoriously hard to solve. Martens shows that adapting the second-order methodology of Newton's method in a Hessian-free context creates an approach that can be used almost out of the box. This is in stark contrast to many existing approaches, which work only after careful option tuning and initialization. Note that a veteran data miner may spend weeks repeatedly training the same model with the same data in search of the golden needle in the haystack: a set of solver options that results in the best solve time and solution. For example, Figure 1 shows how the choice of the so-called "learning rate" can decide whether an algorithm converges to a meaningful solution or goes nowhere.
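To give a flavor of what "Hessian-free" means, here is a minimal truncated-Newton sketch in Python for a least-squares objective. The Hessian is never formed; the conjugate gradient solver only needs Hessian-vector products, which are approximated by finite differences of the gradient. Everything below, including the data, is an illustrative sketch rather than Martens' actual implementation:

import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad(w):
    # Gradient of the least-squares objective 0.5 * ||Xw - y||^2 / n
    return X.T @ (X @ w - y) / n

def hess_vec(w, v, g, eps=1e-6):
    # Finite-difference Hessian-vector product: H v ~ (grad(w + eps v) - grad(w)) / eps
    return (grad(w + eps * v) - g) / eps

def newton_direction(w, g, iters=10):
    # Conjugate gradient: approximately solve H d = -g without ever forming H
    d_step, r = np.zeros_like(g), -g.copy()
    p = r.copy()
    for _ in range(iters):
        if r @ r < 1e-12:
            break
        Hp = hess_vec(w, p, g)
        alpha = (r @ r) / (p @ Hp)
        d_step += alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d_step

w = np.zeros(d)
for it in range(5):                        # a handful of outer Newton steps
    g = grad(w)
    w += newton_direction(w, g)            # no learning rate to hunt for here
    print(it, "gradient norm:", np.linalg.norm(grad(w)))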

This is why I am excited about the Hessian-free approach: although it is currently not mainstream and lacks the rock star status of stochastic gradient descent (SGD) approaches, it has the potential to save the user significant processing time. I am certainly also excited about SGD, which is a surprisingly effective rock star algorithm being rapidly developed and expanded as part of the SAS software portfolio. Ultimately, our focus is to provide users an arsenal of options to approach any future problem and ideally, like a monster truck, readily roll over any challenge undaunted. We are also working to provide an auto-tune feature that permits the efficient use of the cloud to find the best tool for any given problem.

Stochastic gradient descent (SGD) approaches dominate the arena these days, with their power proving itself in the frequency with which they are used in competitions. However, what is not typically reported or advertised is the amount of time the data miners spent tuning both the model and solver options, often called hyper-parameters. Typically the time reported for the resulting models only measures the training time using these best parameters, and left unsaid are the long hours of concerted effort, accumulated over weeks, spent hunting for these hyper-parameters. This consuming tuning process is a well-known property of stochastic gradient methods, because the solvers surface a set of user options that typically need to be tuned for each new problem.

But on a day-to-day basis, data miners seek to solve real data problems for their respective companies and organizations, not to ascend the leaderboard in a competition. Not all problems are big data problems, and not all problems require the full power and flexibility of SGD. Two questions that all data miners must ask are:

  1. What model configuration should be used?
  2. What solver should train the model?

The user searches for the unquantifiable best model, best model options, best solver, and best solver options. Combinatorially speaking, there are billions of potential combinations one might try, and at times one needs a human expert to narrow down the choices based on as-yet-undefinable and inexplicable human intuition (another artificial intelligence opportunity!). A primary concern when picking a model and a corresponding training algorithm is solve time. How long must the user wait to obtain a working model? And what is the corresponding solution quality? In general, users want the best model in the least time.

Personally when I consider solver time I think of two separate quantities:

  1. User processing time.  How much time does the user need to babysit the procedure sitting in front of a computer?
  2. Computer processing time.  How much time does the user need to wait for a given procedure to finish?

Often we can only quantify (2) and end up leaving (1) unreported. However, a SAS customer's satisfaction likely hinges on (1) being relatively small, unless this is their day job. I personally love algorithm development and enjoy spending my waking hours seeking to make algorithms faster and more robust. Because of this, I jealously guard my time when it comes to other things in my life, such as the car I drive. I don't own a monster truck and would never be interested in the team of mechanics and constant maintenance required to achieve world-record results. I just want a vehicle that is stable, reliable, and gets me to work, so I can focus entirely on what interests me most.

Figure 2 There is very little communication needed when training models in parallel and thus a great fit for the cloud or any available cluster.

At SAS we want to support all types of customers. We work with experts who prefer total control and are happy to spend significant time manually searching for the best hyper-parameters. But we also work with other practitioners who may be new to data mining or have time constraints and prefer a set-it-and-forget-it scenario, where SAS searches for options for them. Now, in a cloud environment, we can train many models in parallel for you in the time it takes to build a single model. Figure 2 shows an example where we train up to 32 models concurrently, using four workers for each training session (for a total of 128 workers) to create 1800 different candidate models. Imagine attempting to run all 1800 models sequentially, manually deciding which options to try next based on the previous solve performance. I have done this for weeks at a time and found it cumbersome, circular (I would try the same options multiple times), and exhausting. Once presented with this parallel capability, I obtained new records in hours (instead of weeks) in exchange for a single click. Further, I am free to take care of other things while the auto-tune procedure works constantly in the background for me.
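The same idea is easy to mimic on a small scale. The sketch below uses scikit-learn's randomized search with cross-validation to train many candidate models in parallel; the model, parameter ranges, and counts are placeholders for illustration rather than SAS's actual auto-tune implementation:

from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Candidate hyper-parameter distributions (illustrative choices)
param_distributions = {
    "learning_rate": loguniform(1e-3, 1e0),
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=60,        # number of candidate models to train
    cv=3,             # each candidate is evaluated with 3-fold cross-validation
    n_jobs=-1,        # train candidates in parallel across all available cores
    random_state=0,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))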

Out with the old, in with the new

Another way to cut down on user processing time is to use alternative algorithms that require less tuning but perhaps longer run times; basically, shifting time from (1) to (2), much as auto-tuning in parallel does. This approach may mean replacing rock-star algorithms like SGD with algorithms that are slower and less exciting, such as James Martens' Hessian-free work mentioned earlier. His approach may never be popular in competitions, where data mining experts prefer the numerous controls offered by alternative first-order algorithms, because ultimately they need only report computer processing time for the final solve. However, the greatest strength of Martens' work is the potential gain in reducing the amount of user processing time. In a cloud computing environment we can always conduct multiple training sessions at the same time, so we don't necessarily have to choose a favorite when we can run everything at once. A good tool box is not one that offers a single tool used to achieve record results but a quality and diverse ensemble of tools that cover all possible use cases. For that reason I think it best to offer both SGD and Hessian-free, as well as other standard solvers. By having a diverse portfolio, you increase the chance that you have a warm start on the next great idea or algorithm. For example, my son, now five and a half, spends more time at the toy store looking at Power Rangers than monster trucks.

My sturdy and reliable Toyota Corolla, like second-order algorithms, may never be as carefully and lovingly tuned to win competitions, but practitioners may be far more satisfied with the results and the reduced maintenance time. After all, judging by the vehicles we see on the road, most of us, most of the time, will prefer a practical choice for our transportation rather than a race car or monster truck.

I’ll be expanding on optimization for machine learning later this week at the International Conference on Machine Learning on Friday, June 24. My paper, “Unconventional iterative methods for nonconvex optimization in a matrix-free environment,” is part of a workshop on Optimization Methods for the Next Generation of Machine Learning, and I’m pleased to be joined by such rock stars as Yoshua Bengio of the University of Montreal  and Leon Bottou of Facebook AI Research, who are just a few of the stellar people presenting on such topics during the event.

 

Analytics, OR, data science and machine learning: what's in a name?


Casa di Giulietta balcony, where Juliet supposedly stood while Romeo declared his love

Analytics, statistics, operations research, data science and machine learning - with which term do you prefer to associate? Are you from the House of Capulet or Montague, or do you even care? Shakespeare's Juliet derides excess identification with names in the famous play, Romeo and Juliet.

"What's in a name? That which we call a rose
By any other name would smell as sweet."

Romeo was from the house of Montague, Juliet from the house of Capulet, and this distinction meant that their families were sworn enemies. The play is a tragedy, because by the end the two lovers end up dead as a result of this long-running feud. Statistics, data science and machine learning are but a few of the "houses" that feud today over names, and while to my knowledge no deaths have resulted from this debate, the competing camps have nearly come to blows.

"Operations Research? Management Science? Analytics? What’s in a brand name? How has the emerging field of Analytics impacted the Operations Research Profession? Is Analytics part of OR or the other way around? Is it good, bad, relevant, a nuisance or an opportunity for the OR profession? Is OR just Prescriptive or is it something more? In this panel discussion, we will explore these topics in a session with some of the leading thinkers in both OR and Analytics. Be sure to attend to have your questions answered on these highly complementary and valuable fields.” This was the abstract for a panel at the INFORMS Annual Meeting last year that included two past presidents of INFORMS and other long-time members. To many long-time members of INFORMS this abstract was provocative indeed, because not all embrace of the term "analytics" to describe what they do.

Interestingly, the American Statistical Association (ASA) is host to a very similar debate. Last fall the ASA released a statement on the Role of Statistics in Data Science. There’s a whole wing of data science practitioners who are downright hostile to statisticians. This camp makes assertions like "sampling is obsolete,” due to computational advances for processing big data. Popular blogger Vincent Granville has even said "Data Science without statistics is possible, even desirable," extending the obsolescence to statisticians themselves. INFORMS member and self-described statistical data scientist Randy Bartlett, who is both a Certified Analytics Professional (the CAP certification offered by INFORMS) and an Accredited Professional Statistician (the PSTAT certification from ASA), has written about this statistics denial in an excellent series of blog posts he publishes from his LinkedIn page. In the face of such direct attacks on their profession it is no wonder the ASA felt a need to take a stance.

Many statisticians assert "Aren't We Data Science?," as Marie Davidian (professor of statistics at NC State University) did in 2013 in an article published during her tenure as president of the ASA. More recently David Donoho (professor of statistics at Stanford University) makes a similar but more complex argument in a long-form piece, "50 Years of Data Science," which he released last fall (after a presentation on it at the Tukey Centennial workshop). Donoho is equally dismayed at much of the current data science movement. As he puts it, "The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers." Donoho points out the harm in the huge oversight of the contributions of statistics while also exhorting academic statisticians to expand beyond a narrow focus on theoretical statistics and "fancy-seeming methods." He proposes a definition of data science, based on people who are "learning from data," drawing upon a remarkably prescient article John Tukey published more than 50 years ago, "The future of data analysis," in The Annals of Mathematical Statistics. In making his case, Donoho also points to subsequent essays by John Chambers (of Bell Labs and co-developer of the S language), William Cleveland (also of Bell Labs and arguably the one who coined the term data science in 2001), and Leo Breiman (of the University of California at Berkeley). These gentlemen together argue for addressing a wider portion of the lifecycle of analysis, such as the preparation of data as well as its presentation, and for the importance of prediction (and not just inference). I heartily recommend reading Donoho's excellent analysis.

While I haven't seen a similar assault on operations research, there are those within the OR/MS community who see terms like analytics as a threat to the survival of OR. Several years ago, when INFORMS began exploring whether to embrace the term analytics, researchers surveyed the INFORMS membership and published their results in an article in Interfaces entitled "INFORMS and the Analytics Movement: The View of the Membership." Member views of the relationship between OR and analytics were roughly divided into three camps: OR is a subset of analytics, analytics is a subset of OR, and OR and analytics intersect. While the numbers may have shifted, I doubt they have yet converged into a clear definition or consensus among members. There will always be naysayers - 6% of the membership surveyed at the time thought there was no relationship between OR and analytics, and for that matter there are statisticians who see no association with data science or analytics at all. Today INFORMS embraces the term analytics, describing itself as "the largest society in the world for professionals in the field of operations research (O.R.), management science, and analytics." By adding the word analytics, instead of replacing operations research, INFORMS has shown that this is not an either/or question - it values its roots while acknowledging a present that includes those who describe their work as "analytics."

Trying to wrangle a clear definition of analytics, statistics, data science, machine learning, and even operations research can be as messy as cleaning up a typical data set! These are distinct disciplines, related but not the same. Increasingly analytics is used synonymously with data science, which is derived in large part from statistics, which also is a foundation for machine learning, which relies upon optimization techniques, which I'd argue are part of analytics. These days I observe increased cross-fertilization, like my operations research peers clamoring to attend machine learning conferences and machine learning talks that are standing-room-only at the Allied Social Sciences Annual Meeting (where the economists gather).

We can parse terminology all day, but instead we should invest our energy in the opportunity at hand and drive towards increased adoption. Some of these buzzy terms get people's attention and provide an incredible opportunity to use mathematically based methods to make a significant impact. Last year my friend Jack Levis accepted my invitation to give a keynote at an analytics conference SAS hosted. He spoke about ORION, the OR project he leads at UPS, which Tom Davenport has called "arguably the largest operations research project in the world." While few of the conference attendees likely understood in great detail the operations research methods employed, all were amazed at the impact his team has had on saving miles, time, and money for UPS, happily tweeting their excitement during his talk. No doubt this impact is why Jack's team won the 2016 INFORMS Edelman Competition, which some call "the Super Bowl of OR."

The important advances in research presented at conferences like the Joint Statistical Meetings and the INFORMS Annual Meeting pave the way for progress that enables success in practice at places like UPS. We need academics and other researchers to continue to invest in advancing the unique approaches of their disciplines. Most INFORMS members would call what Jack's team does operations research. Most attendees at the conference where Jack spoke probably thought of it as analytics. Does it really matter what we call it, if people value what was done, want to share the story, and pave the path for adoption through their enthusiasm? If it leads to the expansion of OR, I don't care if this application of operations research is popularly referred to as analytics, because after all, in the words of the bard, "a rose by any other name would smell as sweet." Each of the historic disciplines has a chance to have not only a bigger slice of the pie but a bigger slice of a bigger pie if analytics is embraced at large. I understand the value and pride in disciplines like operations research and statistics, and their contributions to data science and machine learning, which are the knowledge base analytics draws upon. We don't have to throw out older terms to embrace new ones. The houses of Capulet and Montague can celebrate their unique heritage but put an end to their feuds. This is not an either/or proposition, but I do believe the opportunity is ours to squander. Instead of feuding let us learn more, put that learning into practice, and make the world a better, smarter place.

 

Image credit: photo by Adam W // attribution by creative commons

Understanding data mining clustering methods

When you go to the grocery store, you see that items of a similar nature are displayed near each other. When you organize the clothes in your closet, you put similar items together (e.g., shirts in one section, pants in another). Every personal organizing tip on the web to save you from your clutter suggests some sort of grouping of similar items. Even if we don't notice it, we are involved in grouping similar objects together in every aspect of our lives. This is called clustering in machine learning, so in this post I will provide an overview of data mining clustering methods.

In machine learning or data mining, clustering groups similar objects together in order to discover structure in data that doesn't have any labels. It is one of the most popular unsupervised machine learning techniques, with varied use cases. When you have a huge volume of information and want to turn it into a manageable stack, cluster analysis is a useful approach. It can be used to summarize the data as well as to prepare the data for other techniques. For instance, assume you have a large number of images and wish to organize them based on their content. In this case, you first identify the objects in the images and then apply clustering to find meaningful groups. Or consider this situation: you have an enormous amount of customer data and want to identify a marketing strategy for your product, as illustrated in Figure 1. Again, you could first cluster your customer data into groups that have similar structures and then plan different marketing campaigns for each group.


Figure 1. Clustering: grouping the data based on similarities.

Defining similarity or dissimilarity is an important part of clustering, because it affects the structure of your groups. You need to choose your measure based on the application at hand. One measure may be good for continuous numeric variables but not good for nominal variables. Euclidean distance, the geometric distance in multidimensional space, is one of the most popular distance measures used in distance-based clustering. When the dimension of the data is high, the issue called the 'curse of dimensionality' means that the Euclidean distance between any pair of data points loses much of its discriminating power, because the distances all become similar. In such cases, other measures like the cosine similarity and the Jaccard coefficient are among the most popular alternatives.
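As a quick illustration of how the choice of measure changes the notion of "close," the snippet below computes the three measures just mentioned for a pair of example vectors (the vectors are arbitrary and purely illustrative):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0, 6.0])

# Euclidean distance: geometric distance in multidimensional space
print("Euclidean distance:", distance.euclidean(a, b))

# Cosine similarity compares direction and ignores magnitude (here b = 2a, so it is 1.0)
print("Cosine similarity:", 1 - distance.cosine(a, b))

# Jaccard similarity on binary (presence/absence) versions of the vectors
print("Jaccard similarity:", 1 - distance.jaccard(a > 0, b > 0))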


Figure 2 - K-means clustering

One of the most popular distance-based clustering algorithms is 'k-means'. K-means is conceptually simple and computationally relatively fast compared to other clustering algorithms, making it one of the most widely used clustering algorithms. Because k-means is centroid-based, it is most effective when the clusters have globular shapes. Figure 2 shows the clustering results yielded by k-means when the clusters in the data are globular.
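Here is a minimal k-means sketch using scikit-learn on simulated globular clusters (in SAS, a procedure such as FASTCLUS plays a similar role); the data and settings are purely illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three globular clusters in two dimensions
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # 'k' must be chosen up front
labels = kmeans.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("cluster centers:")
print(kmeans.cluster_centers_)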

Density-based algorithms can help discover clusters of arbitrary shape in the data. In density-based clustering, clusters are areas of higher density than the rest of the data set. Objects in the sparse areas - which are required to separate clusters - are usually considered to be noise or border points. The most popular density-based clustering method is DBSCAN. Figure 3 shows the results yielded by DBSCAN on some data with non-globular clusters.
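A matching sketch for DBSCAN shows how a density-based method recovers non-globular clusters without being told how many to find (again with simulated data; the eps and min_samples settings are illustrative and normally need tuning):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-globular clusters plus a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # no number of clusters is specified
labels = db.labels_                          # -1 marks points treated as noise

print("clusters found:", len(set(labels) - {-1}))
print("points labeled as noise:", int(np.sum(labels == -1)))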


Figure 3. DBSCAN algorithm.

For both the k-means and DBSCAN clustering methods mentioned above, each data point is assigned to only one cluster. But consider this kind of situation: when a streaming music vendor tries to categorize its customers into groups for better music recommendations, a customer who likes songs from Reba McEntire may be suitable for the "country music lovers" category. But at the same time, she or he may have a considerable purchase record for pop music as well, indicating that the "pop music lovers" category is also appropriate. Such situations call for 'soft clustering,' where multiple cluster labels can be associated with a single data point, and each data point is assigned a probability of its association with the various clusters. A Gaussian mixture model fitted with the expectation-maximization algorithm (GMM-EM) is a prominent example of probabilistic cluster modeling.
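A minimal soft-clustering sketch with a Gaussian mixture model fitted by EM: instead of a single label, each data point receives a probability of belonging to each cluster (the data and the number of components are illustrative):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping groups, so hard assignments would be misleading near the boundary
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # fitted via EM

# Soft assignments: each row gives the probability of membership in each cluster
membership = gmm.predict_proba(X[:3])
print(membership.round(3))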

So far, most of the clustering methods discussed above involve specifying the number of clusters beforehand - for example, the 'k' in k-means (as shown in Figure 4). In a typical unsupervised learning task, however, the number of clusters is unknown and needs to be learned during the clustering. This is why 'nonparametric methods' are used in the practice of clustering. A classic application of nonparametric clustering is to replace the fixed, finite mixture distribution in a Gaussian mixture model with a stochastic process prior. The stochastic process helps to discover the number of clusters that yields the highest probability for the data under analysis - in other words, the most suitable number of clusters for the data.
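One widely available example of this idea is a Dirichlet-process Gaussian mixture, sketched below with scikit-learn's variational implementation. The model is given a generous upper bound on the number of components and effectively switches off the ones the data do not support (all settings are illustrative):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=1)

dpgmm = BayesianGaussianMixture(
    n_components=10,                                     # an upper bound, not the answer
    weight_concentration_prior_type="dirichlet_process",
    random_state=1,
).fit(X)

# Components with non-negligible weight suggest the number of clusters supported by the data
print("effective clusters:", int(np.sum(dpgmm.weights_ > 0.01)))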


Figure 4. The uncertainty in determining the number of clusters in data.

Finally, if you are interested in the internal hierarchy of the clusters in the data, a hierarchical clustering algorithm can be applied, as shown in Figure 5.
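A brief sketch of agglomerative hierarchical clustering with SciPy: the linkage matrix encodes the full merge hierarchy that a dendrogram (like the one in Figure 5) visualizes, and it can be cut at any level to produce flat clusters (the data are simulated):

from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=7)

# Build the merge hierarchy bottom-up using Ward linkage
Z = linkage(X, method="ward")

# Cut the tree into three flat clusters, or plot the full hierarchy as a dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster assignments:", labels[:10])
# dendrogram(Z)   # requires matplotlib to display the tree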


Figure 5. An example of a heat matrix and hierarchy of clusters in a hierarchical clustering.

There are many different ways to cluster data. Each method tries to cluster the data from the unique perspective of its intended application. As mentioned, clustering discovers patterns in data without explaining why the patterns exist. It is our job to dig into the clusters and interpret them. That is why clusters should be profiled extensively to build their identity, to understand what they represent, and to learn how they are different from each other.

Clustering is a fundamental machine learning practice to explore properties in your data. The overview presented here about data mining clustering methods serves as an introduction, and interested readers may find more information in a webinar I recorded on this topic, Clustering for Machine Learning.

I am grateful to my colleague Yingjian Wang, who provided invaluable assistance to complete this article.

All images courtesy of the author except the last image, where credit goes to Phylogeny Figures // attribution by creative commons

I've seen the future of data science....

"I've seen the future of data science, and it is filled with estrogen!" This was the opening remark at a recent talk I heard. If only I'd seen that vision of the future when I was in college. You see, I’ve always loved math (and still do). My first calculus class in college was at 8 a.m. on Mondays, Wednesdays and Fridays, and I NEVER missed a class. I showed up bright-eyed and bushy tailed, sat on the front row, took it all in, and aced every test. When class was over, I’d all but sprint back to my dorm room to do homework assignments for the next class. The same was true for all my math and statistics classes. But despite this obsession, I never considered it a career option. I don’t know why, maybe because I didn’t know any other female mathematicians or statisticians, or I didn’t know what job opportunities even existed. Estrogen wasn't visible in the math side of my world in those days; I didn't see myself as part of the future of data science.

Fast forward (many) years later, and I find myself employed at SAS in a marketing and communications capacity working closely with colleagues who are brilliant mathematical and analytical minds, many of whom are women. They are definitely the future of data science!

Several of these colleagues helped establish the NC Chapter of the Women in Machine Learning and Data Science (WiMLDS) Meetup that just held its inaugural gathering a couple of weeks ago. The Meetup was founded to facilitate stronger collaboration, networking and support among women in machine learning and data science communities as well as grow the talent base in these fields of study. In other words, build the future of data science and populate it with women! The NC chapter plans to host quarterly, informal Meetup events that will feature presentations from practitioners and the academic community, as well as tutorials and learning opportunities.

This inaugural event featured guest speaker Jennifer Priestley, Professor of Applied Statistics and Data Science at Kennesaw State University, who greeted the estrogen-filled audience. She talked at length about the field of data science and the talent gap, and she made the case for getting a PhD in data science.

She said she’s starting to see PhD programs recognize data science as a unique discipline or field of study. She referred to data science as “the science of understanding data; the science of how to translate massive amounts of data into meaningful information to find patterns, engage in research and solve business problems.”

Priestley attributes the rise of data science to the 3 Vs – volume, velocity and variety – across all industries and sectors. She said companies that wouldn’t have classified themselves as data companies a few years ago do now, and they require skilled labor to help them manage that data and use it to make business decisions.

To help fill this talent gap, she talked about the need for PhD programs in data science, but explained that such programs needed to be “21st-century” programs built around applied curriculum. That is how they've built their own Ph.D. Program in Analytics and Data Science at Kennesaw State University.

As I sat in the back listening in, I wondered what would have happened had I been exposed to a network like this during my early college days when I was trying to pick a major and think about a career. Would I have been part of the future of data science? Maybe I’d have made a different decision. Who knows – maybe it’s not too late. It’s certainly not too late to inspire another woman to tap into a supportive network like this.

For more information, visit the NC WiMLDS Meetup website.

 

 

Self-service analytics with SAS – and what we should NOT borrow from the ancient Egyptians

I recently read the book "Die Zahl, die aus der Kälte kam" (which would be The Number That Came in from the Cold in English), written by the Austrian mathematician Rudolf Taschner. He is ingenious at presenting complex mathematical relationships to a broader audience. One of his examples deals with the power of the high priests in ancient Egypt. While reading that chapter I came to a realization about self-service analytics.

The High Priests in ancient Egypt would certainly have forbidden SAS Visual Analytics!

Why? The high priests in ancient Egypt had a lot of power. This power was derived from a very important fact: they knew how to calculate. Using their calculations, they were able to "predict" the cycle of the flood of the Nile. To ordinary people this ability seemed preternatural and superhuman, so they were very thankful for the instructions the high priests gave about sowing and the harvest.

It is therefore no surprise that the high priests had no interest in having their knowledge spread among the population, as this would have reduced their power significantly. They were most definitely opposed to self-service analytics.

The expansion of analytics in companies and organizations

In companies and organizations, data exploration and model creation were limited to a small group of people for many years. While this small group did not necessarily have the status of the high priests, many people were excluded from this circle. They could only passively receive results and were not able to perform analyses on their own.

With SAS, self-service analytics and the democratisation of analytics become reality!

With solutions like SAS Visual Analytics and SAS Visual Statistics, SAS makes it possible for business users to explore their own data, generate results, and test analytical models on their own first.

Is this a good thing? Sure! Business experts know the history and the business background of the data much better than anyone else. They can assess results and findings from a business point of view and put them in context with the analysis question.

For example:

  • Finding contradictions in the data that remain undetected in basic data quality profiling.
  • Identifying noticeable facts in the data that should be used as important explanatory variables in a predictive model.
  • Detecting relationships in the data that are analyzed in detail with a statistician in an analytical project.

Should analytical people have to fear for their jobs?

No. Because appetite comes with eating. The more people in an organization are dealing with data analysis, the more knowledge and ideas are generated. Consequently, more analytic expertise is needed for important decisions in the company.

In companies and organizations it will, however, be necessary to establish the right "Analytic Culture." Those who detect relationships and abnormalities in the data need a platform where they can communicate their findings and receive feedback.

Analytic Culture – The SAS Analytics 2015 Conference in Rome

"Analytic Culture:" this was the slogan of the SAS Analytics 2015 conference in Rome. Top-class presenters and more the 700 attendees made the conference to the top analytic event of the year in Europe. My presentation on “Discovery Analytics with SAS Visual Analytics and SAS Visual Statistics" can be downloaded at the SAS Community Website.

This blog contribution is also available in German at the SAS Mehr Wissen Blog.

Image credit: photo by Dennis Jarvis and V Manninen  // attribution by creative commons

Of Big Data Competitions, Sports Analytics, and the Internet of Bugs

It is said that everything is big in Texas, and that includes big data. During my recent trip to Austin I had the privilege of being a judge in the final round of the Texata Big Data World Championship, a fantastic example of big data competitions.

It felt fitting that I arrived on the day of a much-anticipated University of Texas Longhorns game and witnessed the city awash with college students proudly wearing burnt-orange shirts. Their enthusiasm notwithstanding, my personal sport of choice is not really football but rather big data competitions! And I saw plenty of competitiveness in this particular venue.

Texata is quite new, this being only its second year, but it has already generated significant buzz among big data competitions. Competitors come from all over the world and tend to be seasoned professionals or graduate students from prestigious university programs. By the time they reach the finals, they have already undergone two intense rounds of question answering, data exploration and coding. Unlike other competitions where teams have months to play with a dataset (often preprocessed and curated), and get ranked based on very specific quantitative criteria, in the Texata finals each individual participant is given a real dataset on the spot and only four hours to work with it and extract some sort of meaningful value. This year’s dataset was a collection of customer support tickets and chat logs from Cisco, in whose facility the finals took place. This closely resembles the real world of a data scientist... messy unstructured data, open problem definitions, and a running clock.

Not having a leaderboard, as some other big data competitions do, means that the judges must evaluate the candidates and pick the winner. This was a tough choice, given twelve very talented people, many having traveled from other continents, who put in a large effort and gave their very best. All of us on the judging panel took the responsibility very seriously. At the same time, it was sheer fun to see how each candidate took their own approach, reaching into a large toolbox: latent semantic analysis, clustering, multi-dimensional scaling, and graph algorithms, for example. Some contestants focused on categorizing, others on visualizing, and yet others on inferring causal relations. Every single solution yielded some unique and valuable insight into the data.

At the end of the day, the winner was Kristin Nguyen from Vietnam. Her analysis had the best balance between technical soundness of the code, variety of techniques and presentation clarity. Plus, in 2014 she had already placed second, so this was no fluke. Well deserved, Kristin!

As an added treat, on the following day I got to speak at the companion Texata Summit event. That gave me the chance to show off some exciting examples of using SAS in sports analytics, such as season ticket customer retention in football (both the American and European versions of it!). Also baseball – remember the 2011 movie Moneyball? Scouting continues to be a major application of analytics, allowing small teams to punch well above their weight. Many other sports use analytics, from basketball to Olympic rowing.

Perhaps most exciting of all, there are novel frontier areas identified in the comprehensive report “Analytics in Sports: The New Science of Winning” by Thomas Davenport. For instance, image and video data can be used for crowd management in a stadium, or to track players in the field. In other cases, athlete performance monitoring is of interest. This allowed me to slightly lift the veil on new R&D work related to images, video and wearable sensors:


 

Thanks to SAS's ongoing collaboration with Prof. Edgar Lobaton and Prof. Alper Bozkurt of North Carolina State University, involving multiple groups within the Advanced Analytics division of R&D at SAS, I am now aware that golf is actually a rather stressful sport! By looking at EKG activity, it is apparent that the heart rate goes up to enormous levels in the moments before a swing. Also, while wearables and the Internet of Things are hot topics right now, we should all keep an eye on Profs. Lobaton and Bozkurt's other work - I like to call it the Internet of Bugs:


As featured in the New Scientist and Popular Science, these cyborg-augmented hissing cockroaches can be instrumental in search and rescue operations. Responders can steer them by applying electrical impulses to their antennas and locate potential survivors in rubble via directional microphones and positioning sensors. SAS has very strong tools that are uniquely suited for processing and analyzing this type of streaming data – for example, SAS® Event Stream Processing can acquire real-time sensor signals, while SAS® Forecast Server and SAS® Enterprise Miner™ can perform signal filtering, detect cycles and spikes, and analyze the aggregate position coordinates of the insects, for example to map a structure and find locations of interest.

After the presentation, #Texata and @SASSoftware Twitter traffic contained multiple variations of the words “weird” and “awesome” - which is a fitting description of data science itself. Truly you never know where your data will come from!

 

Decision tree learning: What economists should know

As an economist, I started at SAS with a disadvantage when it comes to predictive modeling. After all, like most economists, I was taught how to estimate marginal effects of various programs, or treatment effects, with non-experimental data. We use a variety of identification assumptions and quasi-experiments to make causal interpretations, but we rarely focus on the quality of predictions. That is, we care about the “right-hand side” (RHS) of the equation. For those trained in economics in the past 20 years, predictive modeling was regarded as “data mining,” a dishonorable practice.

Since beginning my journey at SAS I have been exposed to many practical applications of predictive modeling that I believe would be valuable for economists. In this and a series of future blogs, I will write about how I think about the “left hand side” (LHS) of the equation with respect to the tools of predictive modeling. Up first: decision tree learning.


Decision tree learning using the HPSPLIT procedure in SAS/STAT

Decision trees are not unfamiliar to economists. In fact, almost all economists have used trees as they learned about game theory. Game theory courses use trees to illustrate sequential games, that is, games in which one agent moves first. One such example is Stackelberg competition, in which firms sequentially compete in quantities, with the first mover earning greater profits. We use trees to understand sequences of decisions. The use of trees in data analysis has many similarities but also some important differences. First, what are decision trees?

Decision tree learning is a very simple data categorization tool that also happens to have great predictive power. How do decision trees work? My colleague and SAS author Barry de Ville provides a nice introduction to the basics of decision tree algorithms. If you think about what these algorithms actually do, they are trying to separate data into homogeneous clusters, where each cluster has highly similar explanatory covariates and similar outcomes. You can find a great guide to many of those algorithms and more in my colleague Padraic Neville's primer on decision trees.

So why do trees work for prediction? They subset the data into groups of observations that are highly similar on a number of dimensions. The algorithms choose certain explanatory factors (X's, covariates, features), as well as interactions of those factors, to create homogeneous groups. At that point, the prediction equation is derived by some pretty complicated math ... a simple average!

That's right, once the data set is broken down into subsets (a process known as 'splitting' and 'pruning'), the fancy prediction math is nothing but a simple average. And the prediction equation? Equally simple. It is a series of if-then statements following the path of the tree, eventually leading to that calculated sample average (the short sketch after the list below illustrates both points). So what does the decision tree help me to do as an economist? Here are my top 3 things to love about a decision tree:

  1. Like regression, decision tree output can be interpreted. There are no coefficients, but the results read like if-then-else business rules.
  2. Decision trees inform about the predictive power of variables while accounting for redundancy. Variables will be split on if they matter for creating homogeneous groups and discarded otherwise. One caveat, however, is that only one of two highly collinear variables might be chosen.
  3. They inform about interaction effects for later regression analysis. A split tells us that an interaction effect matters in prediction. This could be useful for controlling for various forms of unobserved heterogeneity or for turning continuous variables into categorical variables.
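To see the "simple average" and "if-then rules" points concretely, here is a small sketch using a regression tree in Python's scikit-learn (PROC HPSPLIT plays the analogous role in SAS/STAT); the data are simulated and the settings are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))                 # two explanatory factors
y = np.where(X[:, 0] > 0, 3.0, -1.0) + 0.2 * rng.normal(size=500)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# The fitted tree reads as if-then-else business rules ...
print(export_text(tree, feature_names=["x1", "x2"]))

# ... and each leaf's prediction is just the sample average of the outcomes that land in it
in_same_leaf = tree.apply(X) == tree.apply(X[[0]])[0]
print(y[in_same_leaf].mean(), "vs", tree.predict(X[[0]])[0])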

So that is my list. What else should economists know about decision trees? Do you feel strongly that trees are a better exploratory data analysis tool than a predictive tool?

Next time we will augment this discussion of decision trees by talking about what economists should know about ... random forests.

The future of analytics – top 5 analytics predictions for 2016


My view of the world is shaped by where I stand, but from this spot the future of analytics for 2016 looks pretty exciting! Analytics has never been more needed or interesting.

  1. Machine learning established in the enterprise

Machine learning dates back to at least 1950 but until recently has been the domain of elites and subject to “winters” of inattention. I predict that it is here to stay, because large enterprises are embracing it. In addition to researchers and digital natives, these days established companies are asking how to move machine learning into production. Even in regulated industries, where low interpretability of models has historically choked their usage, practitioners are finding creative ways to use machine learning techniques to select variables for models, which can then be formulated using more commonly accepted techniques. Expect greater interest across academic disciplines, because machine learning benefits from many different approaches. Consider the popular keynote from the INFORMS Annual Meeting last year, where Dimitris Bertsimas talked about “Statistics and Machine Learning via a Modern Optimization Lens.” My colleague Patrick Hall offers his own perspective about "Why Machine Learning? Why Now?"

  2. Internet of Things hype hits reality

The Internet of Things (IoT) is at the peak of the Gartner Hype Cycle, but in 2016 I expect this hype to hit reality. One real barrier is plumbing – there’s a lot of it! One of my colleagues is analyzing the HVAC system on our newest building as an IoT test project. The building is replete with sensors, but getting to the data was not easy. Facilities told him data are the domain of IT, who then sent him to the manufacturer, because while the HVAC system collects the data, it is sent to the manufacturer. “Data ownership” is an emerging issue – you produce the data but may not have access to it. An even larger challenge for IoT will be to prove its value. There are limited implementations of IoT in full production at the enterprise level. The promise of IoT is fantastic, so in 2016 look to early adopters to work out the kinks and deliver results.

  3. Big data moves beyond hype to enrich modeling

Big data has moved beyond hype to provide real value. Modelers today can access a wider than ever range of data types (e.g., unstructured data, geospatial data, images, voice), which offer great opportunities to enrich models. Another new gain from big data is due to competitions, which have moved beyond gamification to provide real value via crowdsourcing and data sharing. Consider the Prostate Cancer DREAM Challenge, where teams were challenged to address open clinical research questions using anonymized data provided by four different clinical trials run by multiple providers, much of it publicly available for the first time. An unprecedented number of teams competed, and winners beat existing models developed by the top researchers in the field.

  4. Cybersecurity improved via analytics

As IoT grows, the proliferation of sensors must thrill cybercriminals, who use these devices to hack in using a slow but insidious Trojan Horse approach. Many traditional fraud detection techniques do not apply, because detection is no longer about seeking one rare event but requires understanding an accumulation of events in context. Similar to IoT, one challenge of cybersecurity involves data, because streaming data is managed and analyzed differently. I expect advanced analytics to shed new light on detection and prevention as our methods catch up with the data. Unfortunately, growing methods for big data collaboration are off limits, because we don't want the bad guys to know how we'll find them, and much of the best work is done behind high security clearance. But that won't stop SAS and others from focusing heavily on cybersecurity in 2016.

  5. Analytics drives increased industry-academic interaction

The Institute for Advanced Analytics (IAA) at NC State University tracks the growth in analytics masters programs, and new programs seem to pop up daily. Industry demand for recruits fuels this growth, but I see increased interest in research. More companies are setting up academic outreach with an explicit interest in research collaborations. Sometimes this interest goes beyond partnership and into direct hiring of academic superstars, who either take sabbaticals, work on the side, or even go back and forth. For example, top machine learning researcher Yann LeCun worked at Bell Labs, became a professor at NYU, was the founding director of the NYU Center for Data Science, and now leads Artificial Intelligence Research at Facebook. INFORMS supports this academic-industry collaboration by providing academics a resource of teaching materials related to analytics. In 2016 INFORMS will offer industry a searchable database of analytics programs to facilitate connections and the new Associate Certified Analytics Professional credential to help vet recent graduates.

I wrote a shorter version of this piece for INFORMS, and it first appeared in Information Management earlier this week.

Image credit: photo by Chris Pelliccione // attribution by creative commons

Vector autoregressive models for generating economic scenarios


Diagnostic information from PROC VARMAX

Macroeconometrics is not dead (and I wish I had paid better attention in my time series course)

I wrote this on the way to see one of our manufacturing clients in Austin, Texas, anticipating a discussion of how to use vector autoregressive models in process control. It is a typical use case, especially in the age of the "Internet of Things," to use multiple sensors on a device to detect mechanical issues early. Industrial engineers often use rolling window principal component analysis or a dynamic factor model to detect trends. These are fairly common applications. In my customer engagements, these approaches had been the primary applications of multivariate time series methods. As an applied microeconomist, I believed that my encounters confirmed my decision during graduate school to shirk time series econometrics courses and instead focus on cross-sectional and panel data econometrics. Until I visited a banking customer in Dallas this fall....

See, for the past several years, the banking industry has been busy hiring quants for a regulatory overhaul called CCAR/DFAST. Quantitative analysts are being employed to estimate various forms of loss models to meet CCAR/DFAST regulations, which require various economic scenarios to be factors in these loss models. Some of these models resemble cross-sectional models and some resemble autoregressive time series models, but both types are univariate in nature. My colleague Christian Macaro and I wrote a paper, "Incorporating External Economic Scenarios into Your CCAR Stress Testing Routines," that provides an overview of both methods. Although some of these models "could" use multivariate vector autoregressive models, as my buddy Rajesh Selukar, developer of state space models at SAS (PROC SSM), says, "If you don't have to use multivariate time series methods, don't use them," conveying the complexity of modeling these interactions. Up to this point, my shirking had paid off. No PROC VARMAX needed.

Then came my trip to Dallas. At this bank I met a new type of CCAR/DFAST team, whose mission is to create macroeconomic forecasts of the US and international economies to be consumed by CCAR/DFAST modeling teams. Until now, most CCAR/DFAST economic scenarios have been generated by either the Federal Reserve or by one of several small consulting firms specializing in this analysis. These groups have long used complicated multivariate techniques such as dynamic stochastic general equilibrium (DSGE) or vector autoregressive models (VAR), but they tend to use smaller niche software tools in their consulting work. This new internal economic scenario team was charged with bringing economic scenario generation in house. I had heard that the Fed has become increasingly critical of relying on purchased forecasts. Tier 1 banks are now being required to generate these forecasts on their own, and the customer I met with is one of these banks.
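As a small illustration of the kind of model such a team might start from, here is a sketch of a two-variable VAR fitted and forecast with Python's statsmodels (PROC VARMAX is the SAS/ETS counterpart); the simulated series, lag order, and horizon are purely illustrative:

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Simulate two interrelated quarterly macro series (purely illustrative)
rng = np.random.default_rng(0)
T = 200
gdp_growth, unemployment = np.zeros(T), np.zeros(T)
for t in range(1, T):
    gdp_growth[t] = 0.5 * gdp_growth[t - 1] - 0.2 * unemployment[t - 1] + rng.normal(scale=0.5)
    unemployment[t] = 0.1 * gdp_growth[t - 1] + 0.7 * unemployment[t - 1] + rng.normal(scale=0.3)

data = pd.DataFrame({"gdp_growth": gdp_growth, "unemployment": unemployment})

# Fit a VAR and generate a 12-quarter-ahead path, the raw material for a baseline scenario
results = VAR(data).fit(maxlags=4, ic="aic")              # lag order chosen by AIC
forecast = results.forecast(data.values[-results.k_ar:], steps=12)
print("lags selected:", results.k_ar)
print(forecast[:3])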

Well darn. I guess I should have paid more attention to Dr. Ali's ECO703 course! One of these days I will get back to the basics of multivariate time series analytics. After all, PROC SSM and VARMAX are two procedures in SAS/ETS® that work with multivariate time series models. In fact, when I get around to it, I will likely rely on a new book by Anders Milhøj, Multiple Time Series Modeling Using the SAS VARMAX Procedure, which provides both a theoretical and practical introduction to VAR models in SAS. If you already know a bit about multivariate time series and just want to get started with SAS, try the example programs with the VARMAX and SSM procedures and SAS Studio using free access to SAS OnDemand for Academics, whether you are a student, professor, or independent learner.