Using SAS PROC SQOOP

SAS 9.4 M3 introduces a new procedure named PROC SQOOP. This procedure enables you to access the Apache Sqoop utility from a SAS session to transfer data between a database and HDFS. PROC SQOOP lets you submit Sqoop commands from within your SAS application to your Hadoop cluster.
PROC SQOOP is licensed with SAS/ACCESS® Interface to Hadoop; it is not part of the Base SAS® license. PROC SQOOP is supported on UNIX and Windows SAS.

Sqoop commands are passed to the cluster using the Apache Oozie Workflow Scheduler for Hadoop. PROC SQOOP defines an Oozie workflow for your Sqoop task, which is then submitted to the Oozie server through a RESTful API.

PROC SQOOP works like the Apache Sqoop command-line interface (CLI) and uses the same syntax. The procedure provides feedback on whether the job completed successfully and where to find more details in your Hadoop cluster.

Database Connector

Sqoop can be used with any Java Database Connectivity (JDBC) compliant database and automatically supports several databases. In some cases, the database vendor's JDBC driver (JAR file) might need to be installed in the $SQOOP_HOME/lib path on the Sqoop client machine.

For example, if you are connecting to a MySQL database, place the MySQL connector JAR file under the Oozie lib folder (/usr/lib/oozie/lib in this example).

    [root@sascdh01 ~]# ls -l /usr/lib/oozie/lib/my*
    -rw-r--r-- 1 root root 972009 Oct 15 02:48 /usr/lib/oozie/lib/mysql-connector-java-5.1.36-bin.jar

SQOOP setup

  • Download Sqoop-1 (1.4.x) from the Apache Sqoop website. Sqoop 1.4.5 is recommended. Be sure to get a Sqoop JAR file that is compatible with your Hadoop version.
  • For each database that you plan to use with Sqoop, you must download a compatible JDBC driver or Sqoop connector from the associated database vendor. The connectors (JAR files) should be placed in the Sqoop lib directory and in the Oozie share lib in HDFS.
  • PROC SQOOP uses your Hadoop cluster configuration files. Set the environment variable SAS_HADOOP_CONFIG_PATH to point to the location of your Hadoop configuration directory.
  • Sqoop JAR files are not required on the SAS client machine. PROC SQOOP uses Apache Oozie, which provides REST API communication to the Hadoop cluster without local JAR files. Set the environment variable SAS_HADOOP_RESTFUL=1 to connect to the Hadoop server by using the WebHDFS or HttpFS REST API, as shown in the sketch after this list.
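
Both environment variables can be set directly from the SAS session with OPTIONS SET statements. The following is a minimal sketch; the configuration path is the same one used in the example later in this post, so substitute the directory on your own SAS server.

/* Point SAS to the directory that contains the Hadoop cluster configuration files
   (example path; use the location on your SAS server) */
OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";

/* Connect to the Hadoop server through the WebHDFS or HttpFS REST API
   instead of local Hadoop JAR files */
OPTIONS SET=SAS_HADOOP_RESTFUL=1;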

To use PROC SQOOP, you must have the following information:

  • Database connection information; each vendor has its own connection options
  • Database user ID and password
  • HDFS file that contains the database password (required for some databases)
  • Hadoop user ID and password
  • Oozie URL (host and port)
  • NameNode service URL information (host and port)
  • JobTracker/Resource Manager service URL information (host and port)
  • Oozie workflow output path
  • Sqoop command

Example
The following code example illustrates a data transfer from a MySQL database table to HDFS. The data transfer operation is executed on the Hadoop cluster using an Oozie workflow and MapReduce steps, and the data is streamed directly from the database server to the Hadoop cluster without routing through the SAS Workspace Server. In the PROC SQOOP statement, you provide the environment properties that describe where the data is located and the target location. In the COMMAND= option, you provide the native Sqoop statement for the specific action required.

OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
OPTIONS SET=SAS_HADOOP_RESTFUL=1 ;

proc sqoop
 hadoopuser='sasdemo'
 dbuser='hdp' dbpwd='xxxxxx'
 oozieurl='http://xxxxxxx.xxxx.sas.com:11000/oozie'
 namenode='hdfs://xxxxxxx.xxxx.sas.com:8020'
 jobtracker='xxxxxxx.xxxx.sas.com:8032'
 wfhdfspath='hdfs://xxxxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
 deletewf
 command=' import --connect jdbc:mysql://XXXXX.XXXX.sas.com/hdpdata --append -m 1 --table department --target-dir /user/sasdemo/department ';
 run;

Log extract from the above code execution
………….
…………………..
NOTE: SAS initialization used:
real time 0.03 seconds
cpu time 0.02 seconds

1 OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
2 OPTIONS SET=SAS_HADOOP_RESTFUL=1 ;
3
4
5
6 proc sqoop
7 hadoopuser='sasdemo'
8 dbuser='hdp' dbpwd=XXXXXXXXX
9 oozieurl='http://xxxxx.xxxx.sas.com:11000/oozie'
10 namenode='hdfs://xxxxx.xxxx.sas.com:8020'
11 jobtracker='xxxxx.xxxx.sas.com:8032'
12 wfhdfspath='hdfs://xxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
13 deletewf
14 command=' import --connect jdbc:mysql://xxxxx.xxxx.sas.com/hdpdata --append -m 1 --table department
--target-dir
14 ! /user/sasdemo/department ';
15 run;

NOTE: Job ID : 0000004-151015031507797-oozie-oozi-W
NOTE: Status : SUCCEEDED
NOTE: PROCEDURE SQOOP used (Total process time):
real time 55.89 seconds
cpu time 0.05 seconds
……….
………………..

On the Oozie web console, you can see the Oozie jobs that have been submitted; their status shows as running, killed, or succeeded.
On the YARN Resource Manager user interface, you can see the MapReduce tasks that have been submitted and are executing.

On HDFS, you can see the data files that have been created per the --target-dir in the Sqoop command.

 

[root@sascdh01 ~]# hadoop fs -ls /user/sasdemo/department
Found 3 items
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 10:56 /user/sasdemo/department/part-m-00000
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 11:54 /user/sasdemo/department/part-m-00001
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 12:03 /user/sasdemo/department/part-m-00002

Dependency on the SAS_HADOOP_RESTFUL environment variable
The SAS_HADOOP_RESTFUL environment variable determines whether SAS connects to the Hadoop server through JAR files, HttpFS, or WebHDFS. The default setting is 0, which connects to the Hadoop cluster using JAR files. PROC SQOOP uses Apache Oozie, which provides REST API communication to the Hadoop cluster without JAR files, so when you run PROC SQOOP this variable must be set to 1.
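
As a quick check, you can echo the environment variable to the SAS log with the %SYSGET macro function. This is a small sketch; if the variable has not been set in the session, %SYSGET writes a warning instead of a value.

/* Write the current value of SAS_HADOOP_RESTFUL to the SAS log.
   A value of 1 means PROC SQOOP connects through the WebHDFS or HttpFS REST API. */
%put SAS_HADOOP_RESTFUL=%sysget(SAS_HADOOP_RESTFUL);
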
If you do not set the environment variable SAS_HADOOP_RESTFUL=1 in your SAS session, you might see a confusing error message when you execute the PROC SQOOP statement. The following SAS log reports a problem with the hadoopuser parameter and assumes that the Hadoop cluster has Kerberos security enabled, even though security is disabled on this cluster. The Kerberos error message is therefore misleading.

..........
.....
NOTE: SAS initialization used:
real time 0.02 seconds
cpu time 0.02 seconds

1 OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
2 /* OPTIONS SET=SAS_HADOOP_RESTFUL=1; */
3
4
5 proc sqoop
6 hadoopuser='sasdemo'
7 dbuser='hdp' dbpwd=XXXXXXXXX
8 oozieurl='http://xxxxx.xxxx.sas.com:11000/oozie'
9 namenode='hdfs://xxxxx.xxxx.sas.com:8020'
10 jobtracker='xxxxx.xxxx.sas.com:8032'
11 wfhdfspath='hdfs://xxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
12 deletewf
13 command=' import --connect jdbc:mysql://xxxxx.xxxx.sas.com/hdpdata --append -m 1 --table department
--target-dir
13 ! /user/sasdemo/department ';
14 run;

ERROR: HADOOPUSER should not be provided for Kerberos enabled clusters.
ERROR: The path was not found: /user/sasdemo/SAS_SQOOPaq7ddo96.
NOTE: PROCEDURE SQOOP used (Total process time):
real time 3.41 seconds
cpu time 0.09 seconds
...........
......

Reference document

Base SAS® Procedures Guide, Fourth Edition

About Author

Uttam Kumar

Sr Technical Architect

Uttam Kumar is a Senior Technical Architect in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. He has focused on Hadoop, SAS integration with Hadoop, Hadoop data access from SAS, configuration, and troubleshooting. Uttam received his Master of Science degree in Computer Science from Boston University, USA.

9 Comments

  1. Jesper Bæk Overgaard

    Hi Uttam,

    We would like to transfer/offload SAS data sets from an old existing UNIX SAS server in our environment to Hadoop. Would PROC SQOOP be the Swiss Army knife for that exercise, since we have had great issues embedding this old UNIX server in our existing metadata server setup and hence must rely on an offload strategy?

    • Hi Jesper,
      Thanks for reading my blog and asking a related question!
      PROC SQOOP is meant for moving data between an external RDBMS and a Hadoop cluster. In your case, you can use a standard SAS/ACCESS Interface to Hadoop connection to move SAS data sets from the UNIX environment to the Hadoop cluster as Hive tables (a minimal sketch follows below). You may also choose to store these SAS tables as SPD Engine (SPDE) tables on HDFS using the SPDE engine, which comes with Base SAS.
      -Uttam
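
      A minimal sketch of that LIBNAME approach (the host name, credentials, and schema below are placeholders, and the exact connection options depend on your HiveServer2 setup):

      /* SAS/ACCESS Interface to Hadoop: connect to Hive on the cluster */
      libname hdp hadoop server="xxxxx.xxxx.sas.com" user=sasdemo password="xxxxxx" schema=default;

      /* Copy a SAS data set from the UNIX SAS server into a Hive table */
      data hdp.department;
         set work.department;
      run;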

  2. Thanks, Uttam, for the prompt reply...

    However, after following the procedure you suggested and reviewing the SAS documentation about PROC SQOOP, we are not able to submit the Oozie job, which, if submitted successfully, should be visible on the Oozie web console. The URL for the Oozie web console has been verified and is correct.
    Could you please provide some insight if you have encountered such a problem?
    Should we make any modifications to oozie-site.xml or any other Hadoop configuration XML files? Your quick response is appreciated and would be very helpful for us.

    • Hi Saurabh,

      When you submit the SAS code, do you see an Oozie workflow ID in the SAS log, something like this:
      NOTE: Job ID : 0000004-151015031507797-oozie-oozi-W

      Can you post or send the SAS log from the code execution? Can you also list the XML files in your SAS_HADOOP_CONFIG_PATH folder? Can you describe your environment a bit: the SAS compute server, the Hadoop cluster, how it is hosted, and whether there is any firewall between them?

      I have not experienced such an issue in my test environment.

      -Uttam

  3. Hi Uttam,
    The above information is very useful, but I have a couple of questions as I am not able to execute it:
    1. As the JobTracker is not present in MapReduce version 2, should we consider the Resource Manager as an alternative for it?
    2. I am not able to find the Oozie workflow.xml file. Can you please guide me on how to track it down, or how to create and validate it?

    • Hi Saurabh,
      Thank you for following my blog and asking relevant questions.

      1. Yes, when you have a YARN infrastructure, you use the Resource Manager address for the jobtracker parameter. In the example, you can see that I used jobtracker='xxxxx.xxxx.sas.com:8032'. In my test environment, the Resource Manager is available on port 8032.

      2. The Oozie workflow.xml file is created when the PROC SQOOP statement executes. It should be located at the HDFS location specified with the wfhdfspath option. In the example, I have wfhdfspath='hdfs://xxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml', so the file should be located in the /user/sasdemo HDFS folder. If you cannot see the file at the specified location, verify the user permissions for that folder. The user you use to connect to the Hadoop cluster (hadoopuser='sasdemo') should have read and write permission on the folder specified with the wfhdfspath= option.

  4. Hi Uttam,

    Very interesting post.

    I would like to know a bit more about SAS and Hadoop integration.

    I know that you can connect SAS to Hadoop via a LIBNAME statement using SAS/ACCESS to Hadoop. As far as I know, in that case you interact with Hadoop using HiveQL: you can insert SAS tables into HDFS using HiveQL, and you can read Hive tables from SAS.

    Another way is using the Data Loader product. I think it also connects with HiveQL.

    You can also use PROC HADOOP to execute HDFS commands, or PROC SQOOP as you explained so well in your post.

    I would like to know how you can run analytics procedures in Hadoop, for example the high-performance analytics procedures. I have read that these HP procedures can work with Hadoop, but how does it work? Is it also a connector using HiveQL? Can you execute the procedures on the Hadoop cluster using MapReduce? I want to know more about this interaction: whether you only get data from Hadoop, or whether you can take advantage of the Hadoop cluster's performance by executing in a parallel architecture.

    Another question: SAS Visual Analytics and Hadoop. I suppose that you can get information from Hadoop (via HiveQL) and upload it to the LASR server, is that right? And all the calculations and aggregations are made in LASR, not in the Hadoop cluster, is that right?

    Thanks in advance

    • Hi Juan,

      Thank you for reading my blog and asking relevant questions.

      SAS provides an in-database component for Hadoop called the SAS Embedded Process (EP) for Hadoop, which enables you to process data on the Hadoop cluster. The SAS EP for Hadoop lets you execute SAS DATA steps, the SAS Code Accelerator, and in-database procedures on the Hadoop cluster in the form of MapReduce tasks/jobs. While processing data on the Hadoop cluster, it uses the resources available on the cluster in conjunction with YARN.

      In the case of SAS Visual Analytics integration with an external Hadoop cluster, the SAS EP again helps you load data into LASR using a multi-threaded parallel process. Once the data is available in LASR, all calculations and aggregations occur on the LASR server.
