Big data has been a hot topic recently, but more often than not the topic is covered from an IT perspective. What do the analysts, data miners and statisticians think? I recall discussing with statisticians, back in the old days, what data mining is and how it fundamentally differs from statistics. In those days, the data mining community argued that data mining is about analyzing large heuristic data sets, while statistics focuses on the analysis of smaller data sets, specially prepared for a specific statistical design.
This debate is returning. The question is: from a statistical point of view, does the size of the data provide any additional benefit? Is bigger truly better?
In general, we analysts are used to inferring useful insights from a sample and projecting these onto a population. This is what statistics is all about. It’s how the census and election estimations work. It’s how clinical trials work. It is simply not feasible to get data from the entire population, so statisticians come up with clever ways of inferring information from a sample that can be extended to the entire population. Some statisticians believe the statistical sampling process is also applicable to "big data" … and I agree. Sampling will always have its place.
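To make the sampling idea concrete, here is a minimal sketch in Python. The population is synthetic and every name in it (`estimate_mean`, the 100,000 made-up values) is an assumption for illustration only, not data from a real project. It draws a simple random sample, estimates the population mean, and attaches an approximate 95% margin of error.

```python
import math
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    half-width based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * std_err


# Hypothetical population: 100,000 synthetic values (an assumption).
_rng = random.Random(42)
population = [_rng.gauss(50.0, 10.0) for _ in range(100_000)]

true_mean = statistics.fmean(population)
estimate, margin = estimate_mean(population, sample_size=1_000)

print(f"true mean:       {true_mean:.2f}")
print(f"sample estimate: {estimate:.2f} +/- {margin:.2f}")
```

With only 1% of the rows, the estimate should land close to the true mean, which is why sampling remains attractive whenever the goal is an aggregate quantity.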
However, current technological advances allow us to consider applying more advanced modeling techniques to larger data. Previously, the users of these types of algorithms faced performance constraints when working on larger data, due primarily to the need to pass through the data repeatedly in order to derive the optimal solution. In many real-life business contexts today, techniques like running multi-layer perceptron neural networks, or multi-pass decision tree algorithms (such as gradient boosting) on large data are helping to extract signals from noisy data that had been hard to detect before.
Sampling is really about reducing the “vertical size” of a data set, since we typically select rows from a traditional rows-by-columns data set. By definition, sampling can only reduce one dimension of the data set: the number of rows. However, the number of columns is also growing rapidly. More and more data attributes are being collected and stored, not to mention the countless ways of creating derived variables from the raw input variables, such as ratios, trends, interactions, etc.
Big analytics opens doors for analysts to think about a new level of variable selection by lifting restrictions on the number of columns that can be calculated. In terms of creating and storing these derived variables, high-performance analytics can help by dynamically calculating these derived variables incredibly fast, instead of needing to store thousands and thousands of variables in a database. I believe that this area will see breakthroughs in the future through research both in analytics and computing to derive new and clever ways of identifying the optimal variables for predictive models from “super-wide” tables with thousands of attributes.
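As a rough illustration of computing derived variables on the fly rather than storing them, the sketch below generates pairwise interactions and ratios lazily from one raw row. The function name `derive_features` and the column names are hypothetical, and a real high-performance platform would do this very differently; the point is only how quickly the column count grows.

```python
from itertools import combinations


def derive_features(row):
    """Lazily yield (name, value) pairs for pairwise interactions and ratios.

    Nothing is stored: each derived variable is computed when requested.
    """
    for a, b in combinations(sorted(row), 2):
        yield f"{a}_x_{b}", row[a] * row[b]        # interaction term
        if row[b] != 0:                             # skip undefined ratios
            yield f"{a}_per_{b}", row[a] / row[b]  # ratio term


# Hypothetical raw input variables for one customer (an assumption).
row = {"balance": 1200.0, "income": 4000.0, "transactions": 30.0}

features = dict(derive_features(row))
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Three raw columns already yield six derived ones. Since the number of pairs grows as n(n-1)/2, a table with a thousand raw attributes would produce roughly half a million pairwise interactions alone, which is why calculating them dynamically is more practical than materializing them.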
Other opportunities for big data include analytical projects that have previously been restricted by performance, such as applying complex algorithms to large, complex data (for example, the analysis of the human genome data set). This research might uncover new ways of analysis not previously considered. The overnight success of social media networks, and the subsequent analysis of these networks, is another example. Additionally, some specific analysis techniques don’t really lend themselves to sampling (such as hierarchical market basket analysis, web path analysis or clustering). This is especially true when isolating rare events: when we want to find an interesting, hard-to-find outlier that might lead us to new insights, searching through the entire data set is the only way. Fraud detection follows this paradigm.
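The rare-event argument can be checked with a little arithmetic. The sketch below computes the exact chance that a simple random sample, drawn without replacement, contains none of the rare rows. The figures used (100,000 transactions, 5 rare cases, a 1% sample) are made-up assumptions, and `prob_sample_misses_all` is a hypothetical helper name.

```python
import math


def prob_sample_misses_all(population_size, rare_count, sample_size):
    """Probability that a simple random sample (without replacement)
    contains none of the rare rows.

    This is the hypergeometric probability of zero hits:
    C(N - k, n) / C(N, n).
    """
    misses = math.comb(population_size - rare_count, sample_size)
    total = math.comb(population_size, sample_size)
    return misses / total


# Assumed figures for illustration only.
N, k = 100_000, 5

p_miss_sample = prob_sample_misses_all(N, k, sample_size=1_000)  # 1% sample
p_miss_full = prob_sample_misses_all(N, k, sample_size=N)        # full scan

print(f"1% sample misses every rare case: {p_miss_sample:.1%}")
print(f"full scan misses every rare case: {p_miss_full:.1%}")
```

Under these assumptions a 1% sample would miss all five cases roughly 95% of the time, while scanning every row cannot miss them by construction.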
In times of increasing online, non-face-to-face transactions, fraud is becoming a growing threat for all enterprises, including the public sector. Being able to look at all the data in incredibly fast ways with “high-tech microscopes” should improve the chances for analysts to find the proverbial needle in the haystack. And since this is an ever-changing needle, with new fraud schemes constantly being devised, this capability could provide invaluable help to organizations in keeping up with these ever-changing conditions.
Since high-performance analytics helps to score more data than ever before, even more raw data can be refined through analytics. For fraud detection, this has the happy secondary effect that more known fraud cases become available, easing the challenge of rare events modeling. This does not necessarily mean that there is more fraud occurring, but that more fraud cases can be automatically detected as all transactions are scanned (and not only a subset).
These are some of the ideas that we see, based on interactions with hundreds of our customers and their business problems over the years. What do you see?
Thanks to Andrew Pease from the SAS Technology Practice who helped a lot in writing this blog post.