How SAS gets to data in Hadoop

By Rob Collum on SAS Users, May 29, 2015 | Topics: Programming Tips, SAS Administrators

SAS offers a rich collection of features for working with Hadoop quickly and efficiently. This post will provide a brief run-through of the various technologies used by SAS to get to data in Hadoop and what’s needed to get the job done.

Working with text files

Base SAS software has the built-in ability to communicate with Hadoop. For example, Base SAS can work directly with plain text files in HDFS using either PROC HADOOP or the FILENAME statement. For this to happen, you need:

Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
Hadoop configuration in a “merged” XML file*
Base SAS PROC HADOOP or FILENAME statement

* TAKE NOTE! You must create the “merged” XML file yourself. Take the relevant portions of the various Hadoop “-site.xml” files (such as hdfs-site.xml, hive-site.xml, and yarn-site.xml) and combine their contents into one syntactically correct XML file.
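
As a rough illustration of how the pieces listed above fit together, a Base SAS session might look like the sketch below. The directory paths, the merged file name (merged-site.xml), the user name, the password, and the HDFS locations are all placeholders invented for this example, and your site may need different authentication options.

```sas
/* Assumed location of the Hadoop client JAR files (placeholder path) */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";

/* Fileref for the manually merged configuration file (placeholder path) */
filename cfg "/opt/sas/hadoop/conf/merged-site.xml";

/* Submit HDFS commands through PROC HADOOP */
proc hadoop options=cfg username="sasdemo" password="XXXXXXXX" verbose;
   hdfs mkdir="/user/sasdemo/demo";
   hdfs copyfromlocal="/tmp/sample.txt"
        out="/user/sasdemo/demo/sample.txt";
run;

/* Read the same text file with the FILENAME statement's Hadoop access method */
filename hdfsin hadoop "/user/sasdemo/demo/sample.txt"
         cfg="/opt/sas/hadoop/conf/merged-site.xml"
         user="sasdemo" pass="XXXXXXXX";

data _null_;
   infile hdfsin;
   input;
   put _infile_;   /* echo each line of the HDFS file to the SAS log */
run;
```

PROC HADOOP is the better fit for file management tasks (creating directories, copying files), while the FILENAME statement lets a DATA step treat an HDFS file like any other external file.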

Working with SPDE data

The SAS Scalable Performance Data Engine (SPDE) functionality is also built into Base SAS. SAS can write SPDE tables directly to HDFS to take advantage of its multiple IO paths for reading and writing data to disk. You need:

Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
Base SAS LIBNAME specifying SPDE engine
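
A minimal sketch of what this could look like follows. The paths, the libref (spdehdfs), and the HDFS directory are assumptions made up for illustration; HDFSHOST=DEFAULT asks the engine to find the cluster through the configuration files named by SAS_HADOOP_CONFIG_PATH.

```sas
/* Assumed locations of the client JARs and configuration files (placeholders) */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* SPDE library whose primary path is an HDFS directory (placeholder path) */
libname spdehdfs spde "/user/sasdemo/spde" hdfshost=default;

/* Write a sample table into HDFS as an SPDE table */
data spdehdfs.class;
   set sashelp.class;
run;
```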

Working with data in SASHDAT

SASHDAT is a SAS proprietary data format optimized for high-performance environments. The software pieces required are:

Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
SAS High-Performance Analytics Environment (distributed mode)
- Distributed SAS LASR Analytic Server
- Co-located installation of supported Hadoop
Base SAS LIBNAME specifying SASHDAT engine
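
For orientation, a LIBNAME statement for the SASHDAT engine might resemble the sketch below. The host name, the installation directory, the HDFS path, and the table name are placeholders and not values from a real deployment; check the options against the documentation for your release.

```sas
/* HOST= and INSTALL= point at the analytics cluster; all values are placeholders */
libname hdat sashdat host="grid001.example.com"
        install="/opt/TKGrid" path="/hps";

/* Distribute a sample table to HDFS in SASHDAT format */
data hdat.class(replace=yes);
   set sashelp.class;
run;
```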

Working with data in Hive

Hadoop supports Hive as a data warehouse infrastructure. SAS/ACCESS technology can get to that data. You need:

Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
Base SAS and SAS/ACCESS Interface to Hadoop
- LIBNAME specifying Hadoop engine
- LIBNAME specifying Hive2 subprotocol
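
The two LIBNAME items above usually end up in a single statement. The sketch below assumes a HiveServer2 instance at a made-up host and port, placeholder credentials, and a hypothetical Hive table named orders.

```sas
/* Assumed locations of the client JARs and configuration files (placeholders) */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* Hadoop engine with the Hive2 subprotocol; server, port, and credentials are placeholders */
libname hive hadoop server="hive.example.com" port=10000 schema=default
        user="sasdemo" password="XXXXXXXX" subprotocol=hive2;

/* Query a hypothetical Hive table through the libref */
proc sql;
   select count(*) as row_count
   from hive.orders;
quit;
```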

Working with SAS In-Database Technology

The SAS In-Database technology achieves the goal of bringing the statistics to the data as a more efficient approach for working with very large volumes. In particular, the SAS Embedded Process is deployed into the Hadoop cluster to work directly where the data resides, performing the requested analysis and returning the results.

SAS In-Database technology for Hadoop is constantly evolving and adding new features. With SAS 9.4, the SAS Embedded Process provides SQL passthrough, code and scoring acceleration capabilities, as well as support for SAS High-Performance procedures. To get started, you need:

Hadoop client JAR files (location specified in SAS_HADOOP_JAR_PATH)
Hadoop configuration files (location specified in SAS_HADOOP_CONFIG_PATH)
In-Database Deployment Package for Hadoop (SAS Embedded Process)
Base SAS
SAS/ACCESS Interface to Hadoop

But that’s not all. The SAS Embedded Process is incredibly sophisticated and can offer something else: asymmetric parallel data load to the SAS High-Performance Analytics Environment. To enable that, you also need:

Remote deployment of distributed LASR (or other HPAE)
Data stored in Hadoop using:
- Hive
- Impala
- SPDE

Note that SPDE is on that last list. Without the Embedded Process, SAS can stream data from SPDE tables using the serial approach to the LASR Root Node. When the Embedded Process is available, then it can coordinate the direct transfer of the data from each of the Hadoop data nodes to their counterpart LASR (or other HPAE) worker nodes – that is, it enables concurrent, parallel data streams between the two services!

For more information on which technique SAS is employing to move data in a given situation, refer to “Determining the Data Access Mode” in the Base SAS 9.4 Procedures Guide: High-Performance Procedures.

About Author

Rob Collum
Advisory Technical Architect

Rob Collum is an Advisory Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. Rob identifies and develops proven practices for the successful architecture and deployment of high-performance SAS solutions at customer sites.

16 Comments

Kay on February 16, 2017 3:16 pm

Hi Rob!

Thank you so much for this detailed info on how to set up a connection from the SPD Engine to a Hadoop server without SAS/ACCESS. Sorry if this question does not make sense; I do not have an IT background. I wanted to confirm whether we have to copy the Hadoop JAR files to a directory where SAS is installed. If that is true, I'm not sure how SAS will connect to the Hadoop server.
- Rob Collum on February 17, 2017 10:36 am
  
  Kay, to be clear, the only JAR files from your Hadoop cluster which need to be copied are those which provide the client interfaces for working with Hadoop. SAS acts as a Hadoop client - and so it connects to your Hadoop cluster through the client JARs you've copied and by referencing the Hadoop configuration files as well.
  - Kay on February 22, 2017 1:16 pm
    
    Thanks Rob! Is it possible to connect to Hive using the SAS ODBC driver or the Hortonworks ODBC driver? My SAS license includes the SAS ODBC driver, but I am not sure which ODBC driver to use to connect to Hive.
    - Rob Collum on February 22, 2017 2:43 pm
      
      Kay,
      
      That's a good question - and something you might need to follow up on with your SAS Account Representative. The SAS ODBC Driver provides ODBC-style access to SAS data sources (like SAS/SHARE and SAS Scalable Performance Data Servers). It won't get you to Hive.
      
      The reason I suggest talking to your account rep is so you can get the details of the SAS software licensed at your site. In order for SAS to get into Hive, you'll need to license either SAS/ACCESS Interface to Hadoop software -or- SAS/ACCESS Interface to ODBC. If you choose the SAS/ACCESS to ODBC product, then you'll be able to get data from any ODBC-compliant data provider... by working with that provider's specific ODBC driver. If your Hadoop distro is from Hortonworks, then that's where your Hortonworks ODBC driver would come into play.
      
      More info:
      - SAS/ACCESS Interface to ODBC: ODBC is a generic interface to access data from multiple providers.
      - SAS/ACCESS Interface to Hadoop: More efficient and with more Hadoop-specific features than SAS/ACCESS Interface to ODBC software.
      - SAS ODBC Driver: allows other processes to access SAS data via ODBC.
      
      HTH,
      Rob
      - Kay on February 22, 2017 3:34 pm
        
        Thank you Rob! This was very helpful!
        
        Best wishes,
        Kay
Salman Maher on September 11, 2015 7:56 pm

Hi Rob, I just wanted to let you know that with the third maintenance release of 9.4 (9.4m3), FILENAME and PROC HADOOP can utilize the SAS_HADOOP_CONFIG_PATH option, removing the need to supply a “merged” XML file.
Benjamim Farah on September 11, 2015 2:31 pm

Hi Rob! Thanks for the post! It helped me a lot!!! Best regards, Benjamin.
Razi on August 9, 2015 8:44 pm

Rob, thank you for the great blogs on this matter. My question is a bit simple. For SAS/ACCESS Interface to Hadoop, I do not find much on the install and configure part for SAS 9.3 except what is in the Foundation configuration guides. There is nothing like the 9.4 ACCESS guide. The FILENAME and PROC HADOOP topics are dispersed, and they mostly list usage techniques. In short, do the configuration steps listed in the Foundation configuration guides suffice?
- Rob Collum on August 10, 2015 9:42 am
  
  Razi, that's a great question. My post was focused on SAS 9.4, but plenty of sites still rely on SAS 9.3 in their production environments.
  
  When it comes to what you're looking for, it is roughly split up into 2 areas:
  1. Base SAS + SAS/ACCESS Interface to Hadoop
  2. SAS In-Database Embedded Process for Hadoop
  
  It sounds like you've already found the documentation you need for #1 - the SAS/ACCESS® 9.3 for Relational Databases: Reference, Second Edition.
  
  For #2, there are a couple of documents you'll want to review:
  - SAS® 9.3 In-Database Products: Administrator's Guide, Fourth Edition
  - SAS® High-Performance Analytics Infrastructure: Installation and Configuration Guide
  
  Make note that "In-Database Deployment Package for Hadoop" and "SAS Embedded Process" are often used interchangeably in reference to the product.
  
  Also take care in your planning - for SAS 9.3, the SAS Embedded Process for Hadoop is only supported on Cloudera's CDH4 distribution of Hadoop.
  
  And then there were fewer features offered in the SAS 9.3 era - the SAS Embedded Process for Hadoop was primarily intended to provide high-performance parallel data streaming from the Hadoop provider over to the SAS In-Memory solution components (such as SAS High-Performance Analytics Server). It doesn't offer many of the interesting and valuable features we now enjoy with SAS 9.4.
  
  Hope this helps get you on track,
  Rob
Linus Hjorth on June 4, 2015 6:27 am

Noticing that you don't mention SPD Server. It seems to be a very well kept secret that from 5.2 you have the Hadoop options as well. I'm a bit puzzled that SPD Server still gets resources for development, but gets zero attention in SAS communication...
- Rob Collum on June 4, 2015 2:11 pm
  
  Linus,
  
  I'm glad you've brought up SAS Scalable Performance Data Server 5.2 and its ability to store SPDS data files directly in HDFS. When I first wrote this post a while back, SPDS 5.2 had not been released yet. And since the SPDE component of Base SAS already did support storing its data files in HDFS, I didn't want to confuse matters by mentioning SPDS, too. As you know, SPDE and SPDS are two similar storage technologies but with different licensing and technical requirements.
  
  More information about What's New in SPDS 5.2, including the new Hadoop functionality, can be found in the SAS® Scalable Performance Data Server 5.2: Administrator's Guide.
jaap karman on June 2, 2015 2:18 pm

Rob, would you say that the Teradata appliances are ones SAS does not want anymore?
From 7.1 VAUG chptr 15:
"Analytic Server software is installed on the same hardware as the data provider. The currently supported data providers are the following:
-> SAS High-Performance Deployment of Hadoop (or a customer-supplied Hadoop cluster that has been configured to use the SAS services from SAS High-Performance Deployment of Hadoop)
-> Teradata Data Warehouse Appliance
-> Greenplum Data Computing Appliance"
- Rob Collum on June 2, 2015 3:04 pm
  
  Jaap,
  
  SAS definitely still works with Teradata appliances - and very well.
  
  The technical nature of that partnership has changed as we've come to better understand how our customers want to use their Teradata warehouse.
  
  We no longer support "co-located" installation of the SAS High-Performance Analytics Environment (providing the LASR Analytic Server and support for the HPA PROCs) on a TD700. Instead, we work with a dedicated TD720 which hosts the SAS LASR (or other HPAE) software "next to" any existing TD appliances. This means that we will need to deploy the SAS In-Database Embedded Process to the TD appliance hosting the TD warehouse. And then the EP can be used for parallel data loads over to LASR (or other HPAE) on the TD720. The EP is also useful for in-database procedures as well as scoring and code acceleration (all under the idea of "taking the analytics to the data").
  
  So definitely feel confident in pursuing a Teradata solution with SAS High-Performance Analytics. It's a very strong partnership and compelling technology stack.
Rob Collum on June 2, 2015 12:00 pm

Ronan,

Thanks for the feedback! Apologies for the internal-reference hyperlinks, we'll get those fixed asap.

The details of why Distributed SAS LASR Analytic Server is required to work with SASHDAT files are hidden in part behind the proprietary nature of the technology. However, page 3 of the HPAICG document does explain:

Some solutions, such as SAS Visual Analytics, rely on a SAS data store that is co-located with the SAS High-Performance Analytic environment on the analytics cluster. One option for this co-located data store is the SAS High-Performance Deployment for Hadoop. This is an Apache Hadoop distribution that is easily configured for use with the SAS High-Performance Analytics environment. It adds services to Apache Hadoop to write SASHDAT file blocks evenly across the HDFS filesystem. This even distribution provides a balanced workload across the machines in the cluster and enables SAS analytic processes to read SASHDAT tables at very impressive rates.
Alternatively, these SAS high-performance analytic solutions can use a pre-existing, supported Hadoop deployment or a Greenplum Data Computing Appliance.

I agree it's not super direct and clear on the matter. However, the takeaway here is to know that for the current release, SASHDAT files are only possible on HDFS hosts where LASR is deployed.
- ronan on June 2, 2015 12:17 pm
  
  I understand. Thanks a lot for taking the time to explain.
ronan on June 2, 2015 9:30 am

Hi Rob,

This is extremely useful. Thanks a lot.
In the paragraph about "Working with data in SASHDAT", the hyperlinks (HPAE, LIBNAME) incorrectly point to a base URL (http://supportprod.unx.sas.com/ ...) that I couldn't resolve, but the correction is obvious.
=> Why is "Distributed SAS LASR Analytic Server" required in order to work with SASHDAT table with HPAE ? <=
I couldn't find any mention of a specific requirement in the public documentation.
Eg https://support.sas.com/documentation/solutions/hpainfrastructure/29/hpaicg294.pdf
or https://support.sas.com/documentation/cdl/en/prochp/67530/HTML/default/viewer.htm#prochp_introcom_sect024.htm

BR
Ronan
