SAS offers a rich collection of features for working with Hadoop quickly and efficiently. This post provides a brief run-through of the technologies SAS uses to access data in Hadoop and what’s needed to get the job done.
Working with text files
Base SAS software has the built-in ability to communicate with Hadoop. For example, Base SAS can work directly with plain text files in HDFS using either PROC HADOOP or the FILENAME statement. For this to happen, you need:
* TAKE NOTE! You must create the “merged” XML file manually! The idea is to take the relevant portions of the various Hadoop “-site.xml” files (such as hdfs-site.xml, hive-site.xml, and yarn-site.xml) and concatenate their contents into one syntactically valid XML file.
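As a rough sketch of what this looks like in practice, the example below points PROC HADOOP and the FILENAME statement at a merged configuration file. All paths, host directories, and file names here are placeholders; substitute your site’s actual locations.

```sas
/* Placeholder paths: point these at your site's Hadoop client JARs
   and the manually merged configuration XML described above. */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";

/* Run an HDFS command with PROC HADOOP, referencing the merged XML */
proc hadoop cfg='/opt/sas/conf/hadoop-merged.xml' verbose;
   hdfs mkdir='/user/sasdemo/incoming';
run;

/* Read a plain text file in HDFS via the FILENAME statement */
filename src hadoop '/user/sasdemo/incoming/sales.txt'
         cfg='/opt/sas/conf/hadoop-merged.xml';

data work.sales;
   infile src dlm=',';
   input region $ amount;
run;
```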
Working with SPDE data
The SAS Scalable Performance Data Engine (SPDE) is also built into Base SAS. SAS can write SPDE tables directly to HDFS to take advantage of its multiple I/O paths for reading and writing data to disk. You need:
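A minimal sketch of writing an SPDE table to HDFS, assuming the Hadoop environment variables are already set for your site; the libref, HDFS path, and table names are placeholders:

```sas
/* HDFSHOST=DEFAULT tells the SPDE engine to use the Hadoop cluster
   identified by the site's Hadoop configuration settings. */
libname spdat spde '/user/sasdemo/spde' hdfshost=default;

/* The table lands in HDFS as SPDE-format files */
data spdat.orders;
   set sashelp.orsales;
run;
```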
Working with data in SASHDAT
SASHDAT is a SAS proprietary data format optimized for high-performance environments. The software pieces required are:
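A hedged sketch of a SASHDAT library assignment follows; the server name, HDFS path, and install location are placeholders for your high-performance environment:

```sas
/* grid.example.com and /opt/TKGrid are placeholders for your
   HPA grid host and its TKGrid install location. */
libname hdat sashdat path="/hps" server="grid.example.com"
        install="/opt/TKGrid";

/* The table is written across HDFS in SASHDAT blocks */
data hdat.prdsale;
   set sashelp.prdsale;
run;
```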
Working with data in Hive
Hadoop supports Hive as a data warehouse infrastructure, and SAS/ACCESS technology can get to that data. You need:
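For illustration, a SAS/ACCESS Interface to Hadoop LIBNAME statement might look like the following; the server, port, credentials, and table name are all placeholders:

```sas
/* Connect to HiveServer2; values below are placeholders */
libname hv hadoop server='hive.example.com' port=10000
        user=sasdemo password='XXXXXXXX' subprotocol=hive2;

/* Hive tables now behave like SAS librefs */
proc sql;
   select count(*) from hv.web_clicks;
quit;
```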
Working with SAS In-Database Technology
SAS In-Database technology brings the analytics to the data, a more efficient approach when working with very large data volumes. In particular, the SAS Embedded Process is deployed on the Hadoop cluster to work directly where the data resides, performing the requested analysis and returning the results.
SAS In-Database technology for Hadoop is constantly evolving and adding new features. With SAS 9.4, the SAS Embedded Process provides SQL pass-through, code and scoring acceleration, and support for SAS High-Performance procedures. To get started, you need:
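Two of those capabilities can be sketched briefly. The first block shows explicit SQL pass-through, where the query runs entirely in Hive; the second shows code acceleration, where DS2ACCEL=YES pushes a DS2 program into the Embedded Process on the cluster. Server names, librefs, and table names are placeholders:

```sas
/* Explicit SQL pass-through: Hive executes the inner query */
proc sql;
   connect to hadoop (server='hive.example.com' user=sasdemo);
   select * from connection to hadoop
      (select region, sum(amount) as total
          from web_sales
          group by region);
   disconnect from hadoop;
quit;

/* Code acceleration: DS2ACCEL=YES runs the DS2 program inside
   the Embedded Process, next to the data.
   hv is assumed to be a SAS/ACCESS to Hadoop libref. */
proc ds2 ds2accel=yes;
   data hv.web_sales_scored (overwrite=yes);
      method run();
         set hv.web_sales;
         /* scoring or transformation logic goes here */
      end;
   enddata;
run;
quit;
```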
But that’s not all. The SAS Embedded Process is incredibly sophisticated and can offer something else: asymmetric parallel data load to the SAS High-Performance Analytics Environment. To enable that, you also need:
Note that SPDE is on that last list. Without the Embedded Process, SAS streams data from SPDE tables serially to the LASR root node. When the Embedded Process is available, it coordinates the direct transfer of data from each Hadoop data node to its counterpart LASR (or other HPAE) worker node; that is, it enables concurrent, parallel data streams between the two environments.
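As a hedged sketch, a high-performance procedure run against the cluster might look like the following; the grid host, install path, and libref are placeholders. The DETAILS option on the PERFORMANCE statement produces timing output that helps show how the data was accessed:

```sas
/* Placeholders identifying the HPA environment */
option set=GRIDHOST="grid.example.com";
option set=GRIDINSTALLLOC="/opt/TKGrid";

/* hv is assumed to be a SAS/ACCESS to Hadoop libref */
proc hpsummary data=hv.web_sales;
   performance nodes=all details;
   var amount;
run;
```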
For more information on which technique SAS is employing to move data in a given situation, refer to “Determining the Data Access Mode” in the Base SAS 9.4 Procedures Guide: High-Performance Procedures.