SAS and secure Hadoop: 3 deployment requirements


In previous posts, we’ve shared the importance of understanding the fundamentals of Kerberos authentication and how we can simplify processes by placing SAS and Hadoop in the same realm. For SAS applications to interact with a secure Hadoop environment, we must address the third key practice:

Ensure Kerberos prerequisites are met when installing and configuring SAS applications that interact with Hadoop.

Three prerequisites must be met during the installation and deployment of SAS software, specifically SAS/ACCESS Interface to Hadoop for SAS 9.4.

1) Make the correct versions of the Hadoop JAR files available to SAS.

If you’ve installed other SAS/ACCESS products before, you’ll find installing SAS/ACCESS Interface to Hadoop is different. For other SAS/ACCESS products, you generally install the RDBMS client application and then make parts of this client available via the LD_LIBRARY_PATH environment variable.

With SAS/ACCESS to Hadoop, the client is essentially a collection of JAR files. When you access Hadoop through SAS/ACCESS to Hadoop, these JAR files are loaded into memory. The SAS Foundation interacts with Java through the jproxy process, which loads the Hadoop JAR files.

You will find the instructions for copying the required Hadoop JAR files and setting the SAS_HADOOP_JAR_PATH environment variable in the product documentation, such as the SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS®.
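The environment variable can be set in the SAS configuration files or, for a quick test, from within a SAS session with the OPTIONS SET= statement. A minimal sketch, assuming the Hadoop JAR files were copied to a hypothetical /opt/sas/hadoopjars directory:

/* Hypothetical directory holding the Hadoop client JAR files copied from the cluster */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoopjars";

%put %sysget(SAS_HADOOP_JAR_PATH);  /* confirm the value is visible to the SAS session */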

2) Make the appropriate configuration files available to SAS.

The configuration for the Hadoop client is provided via XML files. Enabling Kerberos changes the cluster configuration, so you must refresh the copies of the configuration files made available to SAS once Kerberos is enabled. The XML files contain security-specific properties, and the files required depend on the version of MapReduce being used in the Hadoop cluster. When Kerberos is enabled, these XML configuration files must contain all the appropriate options for SAS and Hadoop to connect properly.

  • If you are using MapReduce 1, you need the Hadoop core, Hadoop HDFS, and MapReduce configuration files.
  • If you are using MapReduce 2, you need the Hadoop core, Hadoop HDFS, MapReduce 2, and YARN configuration files.

The files are placed in a directory available to SAS Foundation and this location is set via the SAS_HADOOP_CONFIG_PATH environment variable.  The SAS® 9.4 Hadoop Configuration Guide for Base SAS® and SAS/ACCESS® describes how to make the cluster configuration files available to SAS Foundation.
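The same approach applies here. A minimal sketch, assuming the cluster configuration files (such as core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml) were copied to a hypothetical /opt/sas/hadoopconfig directory:

/* Hypothetical directory holding the XML configuration files copied from the cluster */
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoopconfig";

%put %sysget(SAS_HADOOP_CONFIG_PATH);  /* confirm the value is visible to the SAS session */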

3) Make the user’s Kerberos credentials available to SAS.

The SAS process needs access to the user’s Kerberos credentials to make a successful connection to the Hadoop cluster.  There are two ways this can be achieved, but essentially SAS requires access to the user’s Kerberos Ticket-Granting Ticket (TGT) via the Kerberos Ticket Cache.

Enable users to enter a kinit command interactively from the SAS server. My previous post Understanding Hadoop security described the steps required for a Hadoop user to access a Hadoop client:

  • launch a remote connection to a server
  • run a kinit command
  • then run the Hadoop client.

The same steps apply when you are accessing the client through SAS/ACCESS to Hadoop. You can make a remote SSH connection to the server where SAS is installed. Once logged into the system, you run the kinit command, which prompts for your Kerberos password, obtains your TGT and places it in the Kerberos Ticket Cache. Once completed, you can start a SAS session and run SAS code containing SAS/ACCESS to Hadoop statements. This method provides access to the secure Hadoop environment, and SAS will interact with Kerberos to provide the strong authentication of the user.
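For example, once kinit has been run, a SAS/ACCESS to Hadoop LIBNAME statement can be submitted without any user name or password options; the TGT in the ticket cache authenticates the connection. A minimal sketch, where the server name, port and schema are placeholders for your own environment:

/* Hive server name, port and schema are placeholders for your site.     */
/* No USER= or PASSWORD= options: the Kerberos TGT in the ticket cache   */
/* is used to authenticate the connection.                               */
libname hdplib hadoop server="hive-node.example.com" port=10000 schema=default;

proc datasets lib=hdplib;  /* list the Hive tables visible to this user */
quit;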

However, in reality, how many SAS users run their SAS code by first making a remote SSH connection to the server where SAS is installed? Clearly, SAS clients such as SAS Enterprise Guide or the new SAS Studio do not function in this way: these are proper client-server applications. SAS software does not directly interact with Kerberos; instead, SAS relies on the underlying operating system and APIs to make those connections. If you’re running a client-server application, the interactive shell environment isn’t available, and users cannot run the kinit command. SAS clients need the operating system to perform the kinit step for users automatically. This requirement means that the operating system itself must be integrated with Kerberos, using the user’s Kerberos password to obtain a Kerberos Ticket-Granting Ticket (TGT).

Integrate the operating system of the SAS server into the Kerberos realm for Hadoop. Integrating the operating system with Kerberos does not necessarily mean that the user accounts are stored in a directory server. You can configure Kerberos for authentication with local accounts. However, the user accounts must exist with all the same settings (UID, GID, etc.) on all of the hosts in the environment. This requirement includes the SAS server and the hosts used in the Hadoop environment.

Managing all these local user accounts across multiple machines creates considerable overhead for the environment. As such, it makes sense to use a directory server such as LDAP to store the user details in one place. The operating system can then be configured to use Kerberos for authentication and LDAP for user properties.

If SAS is running on Linux, you’d expect to use a PAM (Pluggable Authentication Module) configuration to perform this step, with PAM configured to use Kerberos for authentication. This results in a TGT being generated as a user’s session is initialized.

The server where SAS code will be run must also be configured to use PAM, either through the SAS Deployment Wizard during the initial deployment or manually after the deployment is complete.  Both methods update the sasauth.conf file in the <SAS_HOME>/SASFoundation/9.4/utilities/bin directory and set the value of methods to “pam”.

This step alone is not sufficient for SAS to use PAM.  You must also make entries in the PAM configuration that describe which authentication services are used when sasauth performs an authentication.  Specifically, the “account” and “auth” module types are required.  The PAM configuration of the host is locked down to the root user, so you will need the support of your IT organization to complete this step. More details can be found in the Configuration Guide for SAS 9.4 Foundation for UNIX Environments.
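As a rough illustration only, since the exact module names depend on the Linux distribution and on how the Kerberos integration is provided (for example pam_krb5 directly, or via SSSD), a sasauth service definition in /etc/pam.d might look something like this:

# /etc/pam.d/sasauth - illustrative only; actual entries depend on your distribution
# and on how Kerberos authentication is provided (pam_krb5 or SSSD)
auth     include   system-auth
account  include   system-auth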

With this configuration in place, a Kerberos Ticket-Granting Ticket should be generated as the user’s session is started by the SAS Object Spawner. The TGT will then be automatically available for the client-server applications. On most Linux systems, this Kerberos TGT is placed in the user’s Kerberos Ticket Cache, which is a file located, by default, in /tmp. The ticket cache normally has a name of the form /tmp/krb5cc_<uid>_<rand>, where the last part of the filename is a set of random characters that allows a user to log in multiple times and have separate Kerberos Ticket Caches.

Given that SAS does not know in advance what the full filename will be, the PAM configuration should define an environment variable, KRB5CCNAME, that points to the correct Kerberos Ticket Cache.  SAS and other processes use this environment variable to access the Kerberos Ticket Cache. Running the following code in a SAS session will print the value of the KRB5CCNAME environment variable in the SAS log:

%let krb5env=%sysget(KRB5CCNAME);
%put &KRB5ENV;

This should write something like the following to the SAS log:

43         %let krb5env=%sysget(KRB5CCNAME);
44         %put &KRB5ENV;
FILE:/tmp/krb5cc_100001_ELca0y

Now that the Kerberos Ticket-Granting Ticket is available to the SAS session running on the server, the end user is able to submit code using SAS/ACCESS to Hadoop statements that access a secure Hadoop environment.
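As an illustration, a Base SAS PROC HADOOP step makes a quick end-to-end check, since it relies on the same JAR files, configuration files and Kerberos credentials; the HDFS path below is a placeholder for your site:

/* Quick end-to-end check -- the HDFS path is a placeholder for your site. */
/* No USERNAME= or PASSWORD= options: the Kerberos TGT is used instead.    */
proc hadoop verbose;
   hdfs mkdir="/tmp/sas_kerberos_check";
run;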

In my next blog in the series, we will look at what happens when we connect to a secure Hadoop environment from a distributed High Performance Analytics Environment.


About Author

Stuart Rogers

Architecture and Security Lead

Stuart Rogers is an Architecture and Security Lead in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. His areas of focus include the SAS Middle Tier and security authentication.

2 Comments

  1. Bhushan Mehendale

    Our Kerberos TGT is set to expire every 8 hours. This causes an issue when a SAS job is executed from SAS EG or SAS Studio, because the Kerberos cache is created only once, when the session starts through the Object Spawner. Code that runs against Hadoop for longer than 8 hours then fails.

    Company policy prohibits extension of TGT duration beyond 8 hours.

    Is there any other mechanism to auto-renew the TGT cache so that the SAS/Hadoop code does not fail after 8 hours of execution?

    I would also like to mention that creating keytabs for individual user accounts is also prohibited by company policy.

  2. Hello, great information, kudos and thank you.

    What would be the plan of action for AIX and older SAS versions, for example 9.3 TS1M2 or M3?
