How does in-memory analytics really work? What is the simplest and most effective way to work with Hadoop? Why is billion the new million? In this interview, Oliver Schabenberger, Lead Developer for SAS High-Performance Analytics, answers those questions and explains how his team has helped SAS evolve to tackle complex big data problems for customers in many industries.
Alison Bolen: We hear about in-memory computing and in-memory analytics everywhere. What does it mean to you?
Oliver Schabenberger: In-memory computing is nothing new—that is how we have done analytics all along; you don't add numbers on disk. The new angle is that in-memory computing now has evolved into using large distributed systems of fairly inexpensive commodity blade systems. A blade with 24 cores and 96 gigabytes of RAM is commonplace today. A system with many blades offers incredible computing power and memory at the same time.
Data movement and I/O operations continue to be a major performance killer. In order to avoid I/O when solving problems with massive data, we need to make massive amounts of memory available. Once the data are loaded into memory, we need to parallelize the analytics in order to take advantage of the multi-core architecture.
When large amounts of memory are available in a distributed system on multiple machines, the task is not to address memory across machine boundaries, but to find the right model for collectively using the memory and computing cores. You do this with carefully chosen strategies for minimizing network communication in the distributed system, and by distributing and parallelizing the work, taking advantage of local resources where possible. And you extend these principles to all aspects—from loading the data, to transforming the data, to analytics of any complexity.
If the analytic software takes advantage of the resources on a distributed, massively parallel system, then you can, in principle, scale it to solve problems of any size. And because blade systems have become so affordable, solving massive problems without breaking the bank is now a reality.
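To make the idea concrete, here is a minimal Python sketch of the pattern described above: each node reduces its own in-memory partition to a small summary, and only those summaries cross the "network". This is an illustration under stated assumptions, not SAS code. The function names (`partition`, `local_summary`, `combine`, `distributed_mean_variance`) are hypothetical, and threads stand in for the separate machines of a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

# Hypothetical helper names for illustration only; none of this is SAS code.

def partition(data: Sequence[float], n_nodes: int) -> List[Sequence[float]]:
    """Split the data into at most n_nodes contiguous chunks, one per simulated node."""
    if not data or n_nodes < 1:
        raise ValueError("need non-empty data and at least one node")
    size = (len(data) + n_nodes - 1) // n_nodes  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def local_summary(chunk: Sequence[float]) -> Tuple[int, float, float]:
    """Work done locally on a node: reduce a chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(summaries: List[Tuple[int, float, float]]) -> Tuple[float, float]:
    """Merge the small per-node summaries into a global mean and population variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Simple textbook formula; adequate for a sketch, not numerically robust.
    variance = total_sq / n - mean * mean
    return mean, variance

def distributed_mean_variance(data: Sequence[float], n_nodes: int = 4) -> Tuple[float, float]:
    """Each simulated node summarizes its own chunk; only three numbers per node are combined."""
    chunks = partition(data, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        summaries = list(pool.map(local_summary, chunks))
    return combine(summaries)
```

The design point is that the amount of data exchanged depends on the number of nodes (three numbers each), not on the number of rows, which is one simple way to keep network communication small while the heavy work stays local.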
Bolen: What are some of the SAS in-memory platforms and solutions you helped develop?
Schabenberger: Our approach to high-performance analytics is a natural extension of our approach to all analytic problems. We are extremely customer driven and build on expertise. The analytic problems where our customers experienced performance problems fall into different categories:
- High-end analytic problems from statistical modeling, predictive analytics, text analytics and forecasting. The solutions require very complex algorithms and very advanced mathematical and statistical techniques.
- Exploration of large data sets with millions or billions of rows and hundreds or thousands of columns.
Well-established software solutions exist for these types of problems, but increasing data volumes render those solutions impractical and powerless. Plus, usage patterns to solve these complex problems are quite different, and a one-size-fits-all approach does not work in high-performance analytics.
You might think I am stating the obvious here, but it is worth pointing out: If your approach to high-performance analytics is rooted in a single technological approach, then you will struggle to reach the depth and breadth of analytic problems that your business problems demand.
It is dangerous to be guided by the technology you want to employ instead of finding the right technological solution that solves the individual problem. As a result, we have developed several approaches to solving in-memory problems to fit different needs and usage patterns. Let me point out two of them:
- The analytics procedures in the SAS High-Performance Analytics product perform high-end, in-memory analytics (the most difficult analytic problems, if you will). They do that by loading data into memory, performing the analytic task, reporting the results, and releasing the memory. The processes are transient in the distributed environment. While the procedures are running, we expect prolonged and high CPU utilization because of the complexity of the algorithms. These high-performance procedures typically run on high-performance appliances with low concurrency.
- The SAS LASR Analytic Server is an in-memory analytic platform where data are loaded in memory and persist there for as long as the user chooses. The platform is built for high concurrency, for example, when hundreds of users are accessing the in-memory tables to perform data exploration and visualization. Any single request made to a SAS LASR Analytic Server will use CPU for a very short period of time and execute incredibly quickly.
Other in-memory solutions draw on these models. For example, SAS Visual Analytics is built to leverage capabilities of the SAS LASR Analytic Server. We are developing industry- or problem-specific software solutions, such as high-performance retail, on the first model. Finally, our High-Performance Risk solution incorporates aspects of both usage models: a CPU-intensive pricing phase with low concurrency is followed by a high-concurrency querying phase with comparatively low CPU usage per query.
Alison Bolen: During a recent conference for industry analysts, you were quoted as saying "billion is the new million." What did you mean by that?
Schabenberger: When we first started working with billion-record data sets, folks scratched their heads as if this was a really unusual thing to do. Performing substantial analytics on such large data sets was quite revolutionary. Analytic platforms to perform a logistic regression with variable selection and classification factors on a billion records, for example, were lacking. At the same time, there was an incredible need to do just that. Customers were not the ones thinking that it was an unusual thing to do; they were scratching their heads about how to get it done.
With the high degree of parallelism in data manipulation and analytics, and the extreme performance of our in-memory platforms, we have reached the point where you can analyze a table with 1 billion rows as easily as you could analyze a table with 1 million rows two years ago, because we can provide the parallel processing needed for a 1,000-fold speed-up.
In a few years, perhaps we can have this conversation again, and maybe then the slogan will be, "A gazillion is the new billion."
Bolen: We are hearing a lot about Hadoop. What is it and how does SAS software integrate with it?
Schabenberger: Hadoop is an open-source distributed file system with add-on tools and applications. Like the massively parallel databases, it provides data replication and hardware failover, which is key in distributed commodity hardware systems, because, let's face it, things are going to break. Disks are going to go bad, and DIMM chips will need to be replaced.
Because Hadoop can potentially scale to very large systems, because it is free, and because it is trendy, there is much interest in this data platform. As someone once told me, “Hadoop is free like a puppy is free.”
Many of our larger customers are starting to work with Hadoop, and many of them are struggling to make Hadoop work in a way that fits into an enterprise-class IT infrastructure and into an existing enterprise-class analytics environment.
One industry approach has been to stack solutions, such as MapReduce, HBase, or Mahout, on top of Hadoop. I look at it differently. We can make HDFS, the Hadoop Distributed File System, work incredibly well for us in high-performance analytics by leveraging our expertise together with what HDFS does well. We are achieving incredible performance this way. The customer interacts with HDFS indirectly, through SAS and through the SAS LASR Analytic Server.
Bolen: Engineering a high-performance application is a daunting task. How did you go about it?
Schabenberger: You have to think about every element of the solution and every link in the chain, whether it is the operating system you are running on, the type of hardware, the speed of the network connection, the communication protocol between the nodes, and so on. And then you have to re-engineer the analytic code and the interfaces.
I believe that we have all major pieces in place now, and the Hadoop work we did in support of SAS Visual Analytics and the SAS LASR Analytic Server was really a highlight for me. It proved that the pieces we had already built, and the expertise we had acquired through the ongoing high-performance work, could be leveraged to quickly deliver a powerful and clever solution that performs exceedingly well.
What sets our approach apart is that we seek high-performing, enterprise-class solutions in useful environments. High-performance analytics is not about setting the world record in logistic regressions.
You can write clever parallel algorithms for a specific problem and cobble together a hardware environment that outperforms the fastest application out there by 25% or more. But that is not the point. First, we are aiming for transformational performance gains. The goal is to take an analytic process that takes 20 hours down to, say, a minute, because that allows you to apply the analytics in new, more interesting ways: to develop better models, to run "what-if" scenarios, and so on. Whether you do it in a minute or 48 seconds then does not matter much, unless you can achieve another transformational step and perform the same analysis in, say, half a second. But we are not there yet.
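The difference between incremental and transformational gains is easy to check with the numbers quoted above. The short Python sketch below is only an arithmetic illustration; the `speedup` helper is a hypothetical name introduced here.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds

TWENTY_HOURS = 20 * 60 * 60  # 72,000 seconds
ONE_MINUTE = 60

# 20 hours down to a minute: a transformational, three-orders-of-magnitude gain.
transformational = speedup(TWENTY_HOURS, ONE_MINUTE)

# A minute down to 48 seconds: the same 25% edge as a record-setting benchmark.
incremental = speedup(ONE_MINUTE, 48)

# A minute down to half a second: the next transformational step.
next_step = speedup(ONE_MINUTE, 0.5)

print(transformational, incremental, next_step)
```

Going from 20 hours to a minute is a 1,200-fold gain, while going from a minute to 48 seconds is only 1.25-fold, which is why the second improvement rarely changes how the analytics can be used.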
Meanwhile, until we can solve all problems in actual real time, we need to solve them in the right time, and with hardware/software environments that fit organically into the IT ecosystem of our customers.
Whenever I attack a problem for which we have an enterprise-class solution, such as high-performance marketing optimization (HP-MO), I strive to keep the user interface, the front-end, as stable as possible. When a customer migrates from Marketing Optimization to HP-MO, the transition is simple. We change the performance characteristics, but we do not change the way you interact with the software. This is also true for our high-performance enablement of SAS Enterprise Miner, our high-performance analytics procedures, and other high-performance analytics solutions we are rolling out.
On the other hand, when we do something new, like our exciting SAS Visual Analytics product, I love it when we break the mold and surprise our customers with something different and powerful.
You can learn more from other "big data" experts in this special 32-page report on high-performance analytics.