Data asset management and analytic processing for big data were the main topics of interest at the recent Strata conference in Santa Clara, California. The Silicon Valley venue attracted some of the brightest data scientists from America’s top research and academic institutions. Yet, to my ears, the real buzz was around the terms “machine learning” and “deep analytics”: attendees seemed keenly interested in learning how to run predictive analytics on their large collections of data.
SAS debuted new statistical modeling capabilities for Hadoop at the conference. The offering, SAS In-Memory Statistics for Hadoop, is targeted specifically at the data science community: it provides a familiar environment for interactive programming, backed by the advanced statistical algorithms of SAS. Wayne Thompson, SAS’ Chief Data Scientist, gave a great presentation and demo, and other staff members were kept busy in the demo center discussing it and other SAS offerings for Hadoop.
In my own conversations, I talked with many people who have been trying open-source tools such as R, Mahout and IPython to run advanced predictive models on big data. If you're working with these open-source solutions, SAS can serve as an integrating platform that drives the entire analytical lifecycle: data preparation, discovery, modeling, and deployment. Our offerings integrate with open-source tools to extend existing capabilities. For example, SAS just released a new node in Enterprise Miner (our flagship data mining product) designed specifically to incorporate R models in a competitive model tournament. For whichever model wins the tournament, SAS generates score code (even for R models), making model deployment much faster and easier.
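To make the "model tournament" idea concrete, here is a minimal open-source sketch of the champion-challenger pattern in Python with scikit-learn. This is my own illustration, not the Enterprise Miner node itself; the candidate models, metric, and synthetic data are all assumptions chosen for brevity.

```python
# Champion-challenger sketch: fit several candidate models, score each on
# held-out data, and pick the winner. Illustration only; the real Enterprise
# Miner tournament compares SAS and R models and emits SAS score code.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the modeling data set
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# Score every challenger on the holdout set
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

champion = max(scores, key=scores.get)
print(f"champion: {champion}, AUC by model: {scores}")
```

The winning model would then be the one handed off for deployment; in the SAS workflow described above, that handoff is the generated score code.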
There was a lot of buzz around the IPython and scikit-learn capabilities for running machine learning algorithms against large data sets. A few IPython start-ups were showcasing in-memory benchmarks against R batch-processing algorithms, and some of the new routines appear to be 7 to 10 times faster than what traditional R programs can achieve. From an open-source technology perspective, I predict the industry is in store for another seismic shift in coding preferences toward IPython because of these efficiencies.
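For readers who have not tried this style of work, the flavor of those demos looks roughly like the following scikit-learn snippet: load the data into memory once, then fit and time a model interactively. The data set size, model choice, and timing code here are my own assumptions, not figures from the benchmarks mentioned above.

```python
# In-memory model fitting with scikit-learn: everything lives in a NumPy
# array, so iterating on models is fast and interactive. Illustration only.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 100k-row data set held entirely in memory
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

clf = SGDClassifier(random_state=0)  # linear model trained by SGD
start = time.perf_counter()
clf.fit(X, y)
elapsed = time.perf_counter() - start

print(f"trained on {X.shape[0]:,} rows in {elapsed:.2f}s")
print(f"training accuracy: {clf.score(X, y):.3f}")
```

In an IPython session the same fit would typically be wrapped in the `%timeit` magic rather than hand-rolled timing, which is part of what makes the environment attractive for this kind of benchmarking.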
SAS has lightning-fast, in-memory technology for Hadoop, and continues to provide opportunities for side-by-side model comparisons with open-source algorithms. But SAS is also committed to being the leader in developing new machine learning algorithms that set industry standards. For example, SAS is the only software vendor with a fully deployable random forest algorithm that captures an unlimited number of splits across a forest of up to 10,000 trees, a deployment capability that is essential for real-time scoring. With our commitment to an open architecture that external interfaces can easily access, I believe SAS is well positioned for the new era of in-memory computing in which we find ourselves. Exciting times are ahead!
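For context on what a random forest of that scale involves, here is a small open-source analogue using scikit-learn. This is not the SAS implementation; the tree count, data set, and parameters below are illustrative assumptions (scaled down from the 10,000-tree figure so the example runs quickly).

```python
# Random forest sketch: an ensemble of decision trees fit on bootstrap
# samples, with predictions averaged across the forest. Illustration only,
# using scikit-learn rather than the SAS implementation discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, n_features=25, random_state=1)

# n_estimators is the forest size; a production forest could be far larger
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=1)
forest.fit(X, y)

print(f"trees in forest: {len(forest.estimators_)}")
print(f"training accuracy: {forest.score(X, y):.3f}")
```

The deployment challenge the paragraph above alludes to is that scoring a new record means traversing every tree in the forest, which is why an efficient, exportable scoring representation matters for real-time use.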
Photo credit: O'Reilly Conferences
1 Comment
Phil, thanks for sharing your experience from Strata. I'm glad to hear it went well. In addition to the new Hadoop offering, SAS High-Performance Analytics in-memory computing has been available on several Hadoop distributions and MPP appliances for more than 18 months.
To find out more, visit http://www.sas.com/en_us/software/high-performance-analytics.html