SAS offers a rich collection of features for working with Hadoop quickly and efficiently. This post will provide a brief run-through of the various technologies used by SAS to get to data in Hadoop and what’s needed to get the job done.
Working with text files
Base SAS software has the built-in ability to communicate with Hadoop. For example, Base SAS can work directly with plain text files in HDFS using either PROC HADOOP or the FILENAME statement. For this to happen, you need:
- Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
- Hadoop configuration in a “merged” XML file*
- Base SAS PROC HADOOP or FILENAME statement
* TAKE NOTE! The “merged” XML file is manually created by you! The idea is that you must take the relevant portions of the various Hadoop “-site.xml” files (such as hdfs-site.xml, hive-site.xml, yarn-site.xml, etc.) and concatenate the contents into one syntactically correct XML file.
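For example, here is a minimal sketch of copying a local text file into HDFS with PROC HADOOP and then reading it back through the FILENAME statement. The user name, file paths, and configuration file location are placeholders, and SAS_HADOOP_JAR_PATH is assumed to be set already:

/* point to the manually merged Hadoop configuration file */
filename cfg 'C:\hadoop\conf\merged-hadoop-config.xml';

/* copy a local text file into HDFS */
proc hadoop cfg=cfg username='sasdemo' verbose;
   hdfs copyfromlocal='C:\data\sales.txt' out='/user/sasdemo/sales.txt';
run;

/* read the same file back with the FILENAME HADOOP access method */
filename intxt hadoop '/user/sasdemo/sales.txt'
         cfg='C:\hadoop\conf\merged-hadoop-config.xml' user='sasdemo';
data work.sales_raw;
   infile intxt dlm=',' dsd truncover;
   input region :$10. amount;
run;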
Working with SPDE data
The SAS Scalable Performance Data Engine (SPDE) functionality is also built into Base SAS. SAS can write SPDE tables directly to HDFS to take advantage of multiple I/O paths for reading and writing data to disk. You need:
- Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
- Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
- Base SAS LIBNAME specifying SPDE engine
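As a quick illustration, here is a minimal sketch of writing a data set to HDFS through the SPDE engine. The HDFS path is a placeholder, and HDFSHOST=DEFAULT tells the engine to pick up the configuration referenced by SAS_HADOOP_CONFIG_PATH:

/* SPDE library whose data files are stored in HDFS */
libname spdat spde '/user/sasdemo/spde' hdfshost=default;

/* write a sample table to HDFS and confirm its attributes */
data spdat.cars_hdfs;
   set sashelp.cars;
run;

proc contents data=spdat.cars_hdfs;
run;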
Working with data in SASHDAT
SASHDAT is a SAS proprietary data format optimized for high-performance environments. The software pieces required are:
- Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
- Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
- SAS High-Performance Analytics Environment (distributed mode)
- Distributed SAS LASR Analytic Server
- Co-located installation of supported Hadoop
- Base SAS LIBNAME specifying SASHDAT engine
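For illustration, here is a minimal sketch of writing a SASHDAT file to HDFS. The grid host name, TKGrid installation directory, and HDFS path are placeholders for your co-located environment:

/* SASHDAT library pointing at the co-located Hadoop cluster */
libname hdat sashdat host="grid-head.example.com" install="/opt/TKGrid"
        path="/hps";

/* creates /hps/cars.sashdat, distributed across the data nodes */
data hdat.cars;
   set sashelp.cars;
run;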
Working with data in Hive
Hadoop supports Hive as a data warehouse infrastructure. SAS/ACCESS technology can get to that data. You need:
- Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
- Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
- Base SAS and SAS/ACCESS Interface to Hadoop
- LIBNAME statement specifying the Hadoop engine with the Hive2 subprotocol
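Putting those pieces together, a minimal sketch might look like the following; the HiveServer2 host, schema, and table name are placeholders:

/* Hive library through SAS/ACCESS Interface to Hadoop */
libname hdp hadoop server="hive-node.example.com" port=10000
        user="sasdemo" subprotocol=hive2 schema=default;

/* query the Hive table like any other SAS library member */
proc sql;
   select count(*) as row_count
   from hdp.web_orders;
quit;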
Working with SAS In-Database Technology
The SAS In-Database technology brings the statistics to the data, a more efficient approach for working with very large data volumes. In particular, the SAS Embedded Process is deployed into the Hadoop cluster to work directly where the data resides, performing the requested analysis and returning the results.
SAS In-Database technology for Hadoop is constantly evolving and adding new features. With SAS 9.4, the SAS Embedded Process provides SQL passthrough, code and scoring acceleration capabilities, as well as support for SAS High-Performance procedures. To get started, you need:
- Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
- Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
- In-Database Deployment Package for Hadoop (SAS Embedded Process)
- Base SAS
- SAS/ACCESS Interface to Hadoop
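As one example, explicit SQL pass-through pushes the query to Hive so the work runs inside the cluster; with the Embedded Process deployed, the same foundation supports in-database scoring and DS2 code acceleration. This is a minimal sketch with placeholder server and table names:

proc sql;
   connect to hadoop (server="hive-node.example.com" user="sasdemo"
                      subprotocol=hive2);
   /* the aggregation runs in Hadoop; only the results return to SAS */
   select * from connection to hadoop
      (select region, sum(amount) as total_amount
       from web_orders
       group by region);
   disconnect from hadoop;
quit;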
But that’s not all. The SAS Embedded Process is incredibly sophisticated and can offer something else: asymmetric parallel data load to the SAS High-Performance Analytics Environment. To enable that, you also need:
- Remote deployment of distributed LASR (or other HPAE)
- Data stored in Hadoop using:
  - Hive
  - Impala
  - SPDE
Note that SPDE is on that last list. Without the Embedded Process, SAS can stream data from SPDE tables only serially, through the LASR Root Node. When the Embedded Process is available, it coordinates the direct transfer of the data from each of the Hadoop data nodes to its counterpart LASR (or other HPAE) worker node - that is, it enables concurrent, parallel data streams between the two environments!
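To make that concrete, here is a minimal sketch of loading a Hive table into a distributed LASR server with PROC LASR; the host names, port, and TKGrid installation path are placeholders. With the Embedded Process in place, this kind of load can flow in parallel from the Hadoop data nodes to the LASR worker nodes rather than streaming serially through the root node:

libname hdp hadoop server="hive-node.example.com" user="sasdemo"
        subprotocol=hive2;

/* load the Hive table into the distributed LASR Analytic Server */
proc lasr add data=hdp.web_orders port=10010;
   performance host="lasr-head.example.com" install="/opt/TKGrid" nodes=all;
run;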
For more information on which technique SAS is employing to move data in a given situation, refer to “Determining the Data Access Mode” in the Base SAS 9.4 Procedures Guide: High-Performance Procedures.
16 Comments
Hi Rob!
Thank you so much for this detailed info on how to set up the connection of the SPD Engine to a Hadoop server without SAS/ACCESS. Sorry if this question does not make sense; I do not have an IT background. I wanted to confirm whether we have to copy the Hadoop JAR files to a directory where SAS is installed. If that is true, I'm not sure how SAS will connect to the Hadoop server.
Kay, to be clear, the only JAR files from your Hadoop cluster which need to be copied are those which provide the client interfaces for working with Hadoop. SAS acts as a Hadoop client - and so it connects to your Hadoop cluster through the client JARs you've copied and by referencing the Hadoop configuration files as well.
Thanks Rob! Is it possible to connect to Hive using the SAS ODBC driver or the Hortonworks ODBC driver? My SAS license includes the SAS ODBC driver, but I am not sure which ODBC driver to use to connect to Hive.
Kay,
That's a good question - and something you might need to follow up on with your SAS Account Representative. The SAS ODBC Driver provides ODBC-style access to SAS data sources (like SAS/SHARE and SAS Scalable Performance Data Servers). It won't get you to Hive.
The reason I suggest talking to your account rep is so you can get the details of the SAS software licensed at your site. In order for SAS to get into Hive, you'll need to license either SAS/ACCESS Interface to Hadoop software -or- SAS/ACCESS Interface to ODBC. If you choose the SAS/ACCESS to ODBC product, then you'll be able to get data from any ODBC-compliant data provider... by working with that provider's specific ODBC driver. If your Hadoop distro is from Hortonworks, then that's where your Hortonworks ODBC driver would come into play.
More info:
- SAS/ACCESS Interface to ODBC: ODBC is a generic interface to access data from multiple providers.
- SAS/ACCESS Interface to Hadoop: More efficient and with more Hadoop-specific features than SAS/ACCESS Interface to ODBC software.
- SAS ODBC Driver: allows other processes to access SAS data via ODBC.
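To make the contrast concrete, here is a hypothetical sketch of the two LIBNAME routes into Hive; the server name and the ODBC DSN are placeholders you'd define for your own environment:

/* via SAS/ACCESS Interface to Hadoop */
libname hivelib hadoop server="hive-node.example.com" subprotocol=hive2
        user="sasdemo";

/* via SAS/ACCESS Interface to ODBC, using a vendor driver such as the
   Hortonworks Hive ODBC driver behind a DSN defined on the SAS host */
libname hiveodbc odbc dsn="HiveDSN" user="sasdemo";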
HTH,
Rob
Thank you Rob! This was very helpful!
Best wishes,
Kay
Hi Rob, I just wanted to let you know that with the third maintenance release of SAS 9.4 (9.4m3), FILENAME and PROC HADOOP utilize the SAS_HADOOP_CONFIG_PATH option, removing the need to supply a "merged" XML file.
Hi Rob! Thanks for the post! It helped me a lot!!! Best regards, Benjamin.
Rob, thank you for the great blogs on this matter. My question is a bit simple. For SAS/ACCESS Interface to Hadoop configuration, I do not find much on the install and configure part for SAS 9.3 except what is in the Foundation configuration guides - nothing like the 9.4 SAS/ACCESS guide. The FILENAME and PROC HADOOP topics are dispersed and mostly list usage techniques. In short, do the configuration steps listed in the Foundation configuration guide suffice?
Razi, that's a great question. My post was focused on SAS 9.4, but plenty of sites still rely on SAS 9.3 in their production environments.
When it comes to what you're looking for, it is roughly split up into 2 areas:
1. Base SAS + SAS/ACCESS Interface to Hadoop
2. SAS In-Database Embedded Process for Hadoop
It sounds like you've already found the documentation you need for #1 - the SAS/ACCESS® 9.3 for Relational Databases: Reference, Second Edition.
For #2, there are a couple of documents you'll want to review:
- SAS® 9.3 In-Database Products: Administrator's Guide, Fourth Edition
- SAS® High-Performance Analytics Infrastructure: Installation and Configuration Guide
Make note that "In-Database Deployment Package for Hadoop" and "SAS Embedded Process" are often used interchangeably in reference to the product.
Also take care in your planning - for SAS 9.3, the SAS Embedded Process for Hadoop is only supported on Cloudera's CDH4 distribution of Hadoop.
Also, fewer features were offered in the SAS 9.3 era - the SAS Embedded Process for Hadoop was primarily intended to provide high-performance parallel data streaming from the Hadoop provider over to the SAS In-Memory solution components (such as SAS High-Performance Analytics Server). It doesn't offer many of the interesting and valuable features we now enjoy with SAS 9.4.
Hope this helps get you on track,
Rob
Noticing that you don't mention SPD Server. It seems to be a very well-kept secret that from release 5.2 it has the Hadoop options as well. I'm a bit puzzled that SPD Server still gets resources for development, but gets zero attention in SAS communication...
Linus,
I'm glad you've brought up SAS Scalable Performance Data Server 5.2 and its ability to store SPDS data files directly in HDFS. When I first wrote this post a while back, SPDS 5.2 had not been released yet. And since the SPDE component of Base SAS already did support storing its data files in HDFS, I didn't want to confuse matters by mentioning SPDS, too. As you know, SPDE and SPDS are two similar storage technologies but with different licensing and technical requirements.
More information about What's New in SPDS 5.2, including the new Hadoop functionality, can be found in the SAS® Scalable Performance Data Server 5.2: Administrator's Guide.
Rob, would you say that the appliances with Teradata are ones SAS does not want anymore?
From the 7.1 VAUG, chapter 15:
"Analytic Server software is installed on the same hardware as the data provider. The currently supported data providers are the following:
-> SAS High-Performance Deployment of Hadoop (or a customer-supplied Hadoop cluster that has been configured to use the SAS services from SAS High-Performance Deployment of Hadoop)
-> Teradata Data Warehouse Appliance
-> Greenplum Data Computing Appliance"
Jaap,
SAS definitely still works with Teradata appliances - and very well.
The technical nature of that partnership has changed as we've come to better understand how our customers want to use their Teradata warehouse.
We no longer support "co-located" installation of the SAS High-Performance Analytics Environment (providing the LASR Analytic Server and support for the HPA PROCs) on a TD700. Instead, we work with a dedicated TD720 which hosts the SAS LASR (or other HPAE) software "next to" any existing TD appliances. This means that we will need to deploy the SAS In-Database Embedded Process to the TD appliance hosting the TD warehouse. And then the EP can be used for parallel data loads over to LASR (or other HPAE) on the TD720. The EP is also useful for in-database procedures as well as scoring and code acceleration (all under the idea of "taking the analytics to the data").
So definitely feel confident in pursuing a Teradata solution with SAS High-Performance Analytics. It's a very strong partnership and compelling technology stack.
Hi Rob,
This is extremely useful. Thanks a lot.
In the paragraph about "Working with data in SASHDAT", the hyperlinks (HPAE, LIBNAME) incorrectly point to a base URL (http://supportprod.unx.sas.com/ ...) that I couldn't resolve, but the correction is obvious.
=> Why is "Distributed SAS LASR Analytic Server" required in order to work with SASHDAT tables with the HPAE? <=
I couldn't find any mention of this specific requirement in the public documentation.
E.g. https://support.sas.com/documentation/solutions/hpainfrastructure/29/hpaicg294.pdf
or https://support.sas.com/documentation/cdl/en/prochp/67530/HTML/default/viewer.htm#prochp_introcom_sect024.htm
BR
Ronan
Ronan,
Thanks for the feedback! Apologies for the internal-reference hyperlinks; we'll get those fixed ASAP.
The details of why the Distributed SAS LASR Analytic Server is required to work with SASHDAT files are hidden in part behind the proprietary nature of the technology. However, page 3 of the HPAICG document does address it.
I agree it's not super direct and clear on the matter. However, the takeaway here is to know that for the current release, SASHDAT files are only possible on HDFS hosts where LASR is deployed.
I understand. Thanks a lot for taking the time to explain.