How does SAS integrate with Hadoop? That's a question we often hear from people new to SAS who want to learn more about using SAS with Hadoop. At Strata + Hadoop World earlier this month, we not only talked with attendees to answer that question -- we showed them how SAS takes a visual approach with products such as SAS Visual Analytics and SAS Data Loader for Hadoop.
To answer the Hadoop and SAS integration question, I like to think of it in the context of “How does SAS treat Hadoop?”
As a data source for traditional SAS systems
SAS can process data directly on each node in the Hadoop cluster, treating the cluster as a data source for traditional SAS systems. SAS then processes that data and writes the results directly back to the nodes.
As an analytics platform
By embedding our analytics inside Hadoop, SAS enables you to interactively explore billions of rows of data (structured or unstructured) in seconds.
At Strata + Hadoop World, Dan Zaratsian analyzed the #StrataHadoop Twitter feed at one of our demo stations. Using event stream processing, he was able to surface all of the hot topics. Those passing by the booth saw the data being processed in real time, giving an inside look at conference conversations as they unfolded!
As a data management platform
Because our data management technologies are embedded inside Hadoop, SAS lets you profile, transform and cleanse data on Hadoop (or anywhere else it may reside). Access to Hadoop data is quick and easy, and once you're there, applying data quality routines ensures the data is clean and accurate (no more "garbage in, garbage out").
At our Strata + Hadoop World booth, Derek Hardison and Clark Bradley were showing off our latest data management product, SAS Data Loader for Hadoop. It's a great product because business users like me can cleanse data through an intuitive user interface, while more technical data analysts can run SAS code on Hadoop for even better performance.
Gary Spakes, lead of SAS' Americas Enterprise Architecture Practice, breaks it down into three categories in his blog post "Conjunction Junction, What's Your Function? Or, 3 Ways to interact with Hadoop":
"From" means accessing and extracting data from Hadoop for processing and writing any results back to Hadoop. I classify this as "business as usual": you use Hadoop as a data repository while your compute systems perform their traditional operational activities.
By using SAS/ACCESS Interface to Hadoop, or by storing your SAS data sets in SPDE format on the Hadoop cluster (a feature new in SAS 9.4M2), SAS can operate "business as usual." Data moves from storage to compute for processing.
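As a minimal sketch of this "from" pattern (the host, port, schema, path and table names below are hypothetical -- adjust them for your own cluster), a SAS/ACCESS Interface to Hadoop LIBNAME statement exposes Hive tables to ordinary SAS procedures, while the SPDE engine can keep SAS data sets on HDFS:

```sas
/* Hypothetical connection details -- substitute your own cluster's values */
libname hdp hadoop server="hadoop01.example.com" port=10000
        user=sasdemo schema=default;      /* Hive tables via SAS/ACCESS to Hadoop */

libname hdfs spde '/user/sasdemo/sasdata' hdfshost=default;  /* SPDE data sets on HDFS */

/* "Business as usual": data is pulled from Hadoop to SAS for processing */
proc means data=hdp.sales mean sum;
   var amount;
run;

/* Results can be written back, here as an SPDE data set on the cluster */
data hdfs.sales_subset;
   set hdp.sales;
   where amount > 0;
run;
```

Note that in this pattern the data still travels from the cluster to the SAS compute tier -- which is exactly why the "with" and "in" patterns below exist.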
“With” is accessing and processing Hadoop data while keeping the data and computations massively parallel. I classify this as moving data to compute, not through a “straw,” but from each Hadoop node simultaneously.
Some of the SAS in-memory solutions can "lift" data in a massively parallel way into SAS-managed memory for computation. SAS Visual Analytics, SAS Visual Statistics, SAS In-Memory Statistics for Hadoop and the high-performance analytics procedures are examples of SAS working "with" Hadoop.
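To make the "with" pattern concrete, here is a hedged sketch using one of the high-performance analytics procedures (the libref, table and variable names are my own assumptions); the PERFORMANCE statement asks the procedure to run distributed, lifting the Hadoop data into memory from every node in parallel rather than through a single "straw":

```sas
/* Assumes hdp is a SAS/ACCESS to Hadoop libref; table and columns are hypothetical */
proc hplogistic data=hdp.transactions;
   performance nodes=all details;     /* distribute the work across the cluster */
   class region;
   model fraud(event='1') = amount region;
run;
```

The computation happens in SAS-managed memory alongside the Hadoop nodes, so billions of rows can be modeled without first landing them on a single SAS server.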
“In” is processing data directly in the Hadoop cluster. In other words, leveraging the cycles on the Hadoop cluster to perform work.
SAS Code Accelerator for Hadoop and SAS Data Quality Accelerator for Hadoop are examples of SAS processing data directly on each node in the Hadoop cluster. By submitting "work" to a lightweight SAS engine on each node, SAS can process, manipulate, transpose, impute and classify data (I could go on) and then write the results directly back to the node.
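Here is a rough sketch of the "in" pattern with SAS Code Accelerator (table and column names are hypothetical): the DS2ACCEL option asks PROC DS2 to push the thread program down to the lightweight SAS engine on each Hadoop node, so each node scores its own rows locally.

```sas
proc ds2 ds2accel=yes;                    /* request in-Hadoop execution */
   thread score_th / overwrite=yes;
      dcl double flag;
      method run();
         set hdp.transactions;            /* each node reads its local rows */
         if amount > 1000 then flag = 1;  /* simple per-row derivation */
         else flag = 0;
         output;
      end;
   endthread;

   data hdp.transactions_scored (overwrite=yes);
      dcl thread score_th t;
      method run();
         set from t;                      /* results written back on the nodes */
         output;
      end;
   enddata;
run;
quit;
```

The key design point is that the data never leaves the cluster: the thread program travels to the data, not the other way around.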
For those who attended Strata + Hadoop World NYC -- I hope you enjoyed your time in New York City and gained value and insight from your experience at the conference.
For those who did not attend, consider heading to the West Coast for Strata + Hadoop World in San Jose, March 29-31, 2016.
If you'd like to learn more, explore our SAS Solutions for Hadoop. We provide tools throughout the entire analytics lifecycle: simplified data management eases time-consuming data prep; visual data discovery helps you quickly spot what's relevant; and in-memory analytics and machine learning techniques lead you to ask the right questions and get better answers.