Thanks for returning to learn more about this critical technology. Following yesterday’s overview post on the new SAS Hadoop support, we’ll dig a little deeper today and consider the following:
- Under the Hood: A Peek at the Technology
- SAS Hadoop Value Summary
- A Note About the Future
Under the Hood: A Peek at the Technology
Bring the power of SAS® Analytics to Hadoop
The SAS/ACCESS Interface to Hadoop offers seamless and transparent data access to Hadoop via Hive. SAS users access Hive tables as if they were native SAS data sets. SAS tools can run analytic or data processes while optimizing run-time execution in the appropriate environment, Hadoop or SAS.
The SAS/ACCESS Interface to Hadoop enables Hadoop users to tap into the power of SAS by extending support for the complete analytics life cycle to Hadoop, including discovery, data preparation, modeling and deployment. Of particular importance to many organizations is the ability to:
- Visually analyze or explore data in Hadoop as the precursor to more in-depth analytics via SAS Visual Analytics Explorer capabilities.
- Leverage text mining and analytics capability based on data stored in Hadoop.
- Use SAS Metadata Server to create and manage metadata relating to data that is stored in Hadoop.
Technical Details
- The LIBNAME statement makes Hive tables look like SAS data sets.
- PROC SQL provides the ability to execute explicit HiveQL commands in Hadoop.
- SAS procedures (including PROC FREQ, PROC RANK, PROC REPORT, PROC SORT, PROC SUMMARY, PROC MEANS and PROC TABULATE) are supported.
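To make the LIBNAME and PROC SQL points above concrete, here is a minimal sketch. It assumes a Hive server reachable at the made-up host hive.example.com on port 10000, a made-up user sasdemo, and a hypothetical Hive table named weblogs with a status_code column. None of these names come from this post, so substitute your own.

```sas
/* Hypothetical connection values: replace server, port, user and schema. */
libname hdp hadoop server="hive.example.com" port=10000 user=sasdemo schema=default;

/* The hypothetical Hive table WEBLOGS is referenced like a SAS data set. */
proc freq data=hdp.weblogs;
   tables status_code;
run;

/* Explicit pass-through: the inner HiveQL is sent to Hive as written. */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000 user=sasdemo);
   select * from connection to hadoop
      (select status_code, count(*) as hits
         from weblogs
        group by status_code);
   disconnect from hadoop;
quit;
```

The first form lets SAS decide how much work to push down to Hive. The second hands Hive exactly the HiveQL you wrote, which is useful when you want control over the query that runs in the cluster.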
Leverage Hadoop’s Distributed Processing Capability
SAS Hadoop support allows execution of Hadoop functionality, enabling MapReduce programming, scripting support and the execution of HDFS commands from within the SAS environment. This complements SAS/ACCESS capabilities provided for Hive by extending support for Pig, MapReduce and HDFS commands.
More Technical Details
- PROC HADOOP lets you submit MapReduce jobs, Pig scripts and HDFS commands from the SAS execution environment.
- External file references are supported, which provides the ability for Hadoop files to be referenced from any SAS component. Parameters necessary to process the file, such as delimiters, are externalized, which makes it convenient to work with a Hadoop file.
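As a rough illustration of the two points above, the sketch below uses PROC HADOOP to run HDFS commands and then reads the resulting HDFS file through an external file reference. The configuration file path, user name, password placeholder, HDFS paths and the pipe-delimited file layout are all assumptions made for this example, not details from this post.

```sas
/* Hypothetical Hadoop configuration file: point this at your cluster's config. */
filename cfg "/opt/sas/hadoop/hadoop-config.xml";

/* Submit HDFS commands from SAS: create a directory and copy a local file in. */
proc hadoop options=cfg username="sasdemo" password="XXXXXXXX" verbose;
   hdfs mkdir="/user/sasdemo/input";
   hdfs copyfromlocal="/tmp/weblogs.txt" out="/user/sasdemo/input/weblogs.txt";
run;

/* External file reference: the Hadoop access method on the FILENAME statement. */
filename logs hadoop "/user/sasdemo/input/weblogs.txt"
   cfg="/opt/sas/hadoop/hadoop-config.xml" user="sasdemo" pass="XXXXXXXX";

/* Read the HDFS file in a DATA step; the delimiter is supplied on INFILE. */
data work.weblogs;
   infile logs dlm="|" dsd truncover;
   input ip :$15. status_code :3. bytes;
run;
```

Once the fileref is defined, any SAS step that accepts a fileref can use it, which is what makes the Hadoop file look like just another external file.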
Augment Hadoop using SAS® Information Management
One of the issues plaguing Hadoop is the lack or relative immaturity of tools that can be used to develop and manage Hadoop deployments. SAS data management and analytics management offerings can help organizations quickly derive value from Hadoop using fewer resources. Some examples of this include an intuitive graphical user interface to develop Hadoop capability, the ability to create data management and analytic code and deploy it within Hadoop, or the ability to register and manage Hadoop files via the SAS Management Console. This makes it easy to work with Hadoop within SAS, and extends SAS metadata, data lineage, impact analysis and security capability to Hadoop environments.
Still More Technical Details
SAS® Data Integration Studio
- SAS Data Integration Studio includes a set of standard transforms and a job flow builder that can be used with Hadoop data. The transforms support common functionality, such as the ability to load, unload, extract, reformat, read/write multiple files, reference external files, etc.
- SAS Data Integration Studio provides the ability to integrate Hadoop code, including Pig, MapReduce and HDFS commands in-line with a data job flow.
- SAS Data Integration Studio provides an editor for Pig and Hive with visual editing capability, including a syntax checker, for developing Pig and Hive code.
- SAS Data Integration Studio can submit HiveQL via PROC SQL, a capability that is also available through Base SAS and other SAS components.
- Since Hadoop is treated as a SAS data source, data quality capabilities that are provided by SAS and DataFlux can be leveraged to process data that is moving into or out of Hadoop.
Hadoop Function Support
- SAS provides the ability to create user-defined functions (UDFs) that can be deployed within HDFS. This includes the ability to use SAS Enterprise Miner to take analytical scoring code and produce a UDF that can be deployed within HDFS. These UDFs can then be accessed by Hive, Pig or MapReduce code.
Metadata, Lineage & Security
- Using Hadoop within SAS provides the benefit of data lineage (including impact analysis) and additional security. All SAS processing that is done with Hadoop is tracked, and the existing data lineage functionality can be used to better manage Hadoop usage.
- You can register a Hive server using SAS Management Console so that any SAS capability can easily reference Hadoop (for example, via a FILENAME statement, using parameters to interact with Hadoop and identifying delimiters so files can be parsed on the fly). This makes it possible for the entire SAS stack (BI, DI, SAS/STAT, etc.) to work with Hadoop data. It also lets you track which tables are in Hadoop, and provides the basis for lineage.
- SAS honors the underlying security provided by Hadoop. For instance, SAS will not bypass Hadoop security and allow a user to read data without the proper Hadoop permissions. In addition to the underlying security provided by Hadoop, SAS will allow you to further restrict access to Hadoop based on the standard SAS security capabilities.
- The SAS Metadata Server, a component of Base SAS software, provides the ability to generate metadata based on data that is stored in Hadoop. SAS provides flexible parsing support that is not restricted to a preset data definition, allowing support for any custom definition. Once defined, the metadata can be used to optimize interaction with the data stored in Hadoop.
- SAS supports popular Hadoop distributions, including Cloudera, Hortonworks and EMC Greenplum.
SAS® Hadoop Value Summary
The SAS approach marries the power of world-class analytics with Hadoop’s ability to leverage commodity-based storage and perform distributed processing.
The SAS Hadoop integration provides the following value to organizations looking to get the most from their big data assets:
- SAS both simplifies and augments Hadoop. An ability to abstract the complexity of Hadoop by making it function as another data source brings the power of SAS and its well-established community to Hadoop implementations. This is critical, given the skills shortage and the complexity involved with Hadoop. In addition, boosting Hadoop with world-class analytics, along with metadata, security and lineage capabilities, helps ensure that Hadoop will be ready for enterprise expectations.
- SAS provides total Hadoop leverage. SAS support for Hadoop spans the entire information management life cycle, including metadata, lineage, monitoring, federation and security augmentation. These areas are pervasive through the entire data-to-decision life cycle.
How do enterprises benefit from the distinctive SAS Analytics and SAS Data Integration offerings?
- SAS provides a robust, comprehensive, information management life cycle approach to Hadoop that includes data management and analytics management support. This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.
- SAS delivers optimal solutions for each organization’s specific mix of technologies. SAS Data Integration supports Hadoop alongside other data storage and processing technologies. This offers greater flexibility than other vendor-specific products that only use Hadoop as a vehicle for landing more information on certain database or hardware platforms.
A Note About the Future
The exciting news is that this is just the start – we'll be discussing additional topics of interest to the SAS community, such as data governance and Hadoop, MDM and Hadoop, and SAS embedded processing on Hadoop nodes. Please check back to hear more about how to best build your information assets.
Let the SAS Hadoop hype continue!