SAS High-Performance Analytics: connecting to secure Hadoop

In this post we dig deeper into the fourth recommended practice for securing the SAS-Hadoop environment through Kerberos authentication:

When configuring SAS and Hadoop jointly in a high-performance environment, ensure that all SAS servers are recognized by Kerberos.

Before explaining the complex steps in connecting to secure Hadoop within a SAS High-Performance Analytics environment such as SAS Visual Analytics, let’s start by reviewing a simpler connection from a standard SAS session through the SAS/ACCESS Interface to Hadoop.

Making connections in a standard SAS session

Here's a behind-the-scenes look at the steps involved in the connection. Click on the thumbnail to view the full diagram. This graphic depicts an environment where the SAS servers are configured to authenticate users against the same back-end directory server as the Hadoop instance. The setup is relatively straightforward.

Let’s say a SAS user (we’ll call this user Mary) logs into her machine, and the standard Windows logon procedure obtains the Ticket-Granting Ticket (TGT) from the main corporate Active Directory. This step happens on all domain machines: Kerberos is tied into the standard deployment of Active Directory. Also, this step is completely isolated from Mary’s access to SAS.

At some later point in the day (perhaps after grabbing her morning cup of coffee!), Mary may open SAS Enterprise Guide. As we know, starting the SAS session makes a connection to the SAS Metadata Server, and her credentials (username and password) in the connection profile are authenticated by the Metadata Server. As we noted in previous posts, the SAS servers have been configured to use the same directory server as Hadoop, so this authentication step uses Pluggable Authentication Modules (PAM) to validate our user.

Next, SAS Enterprise Guide initiates Mary’s Workspace Session by connecting to the Object Spawner. The Object Spawner runs a shell script (WorkspaceServer.sh) as “Mary” and spawns the Workspace Session. In this step, as her credentials are authenticated, the PAM stack on the server obtains a Ticket-Granting Ticket (TGT) for Mary. This TGT is placed in her Ticket Cache, ready to be used later.
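
If you want to see this behind-the-curtain step for yourself, a quick check from the Workspace Session can confirm that the ticket cache has been populated. This is only a sketch: it assumes the Workspace Server runs on Linux, that the XCMD option is allowed for the server, and that the klist utility is on the PATH.

/* Sketch only: list the Kerberos ticket cache that the PAM stack */
/* populated when the Workspace Session was spawned.              */
filename tkts pipe 'klist 2>&1';

data _null_;
   infile tkts;
   input;
   put _infile_;   /* echo each line of klist output to the SAS log */
run;

filename tkts clear;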

To Mary, all of this has happened as SAS Enterprise Guide was opened. She has not been required to perform any special actions to get to this stage. All the “magic” has been taking place behind the curtain.

So Mary can submit her SAS code to connect to Hadoop using the standard LIBNAME statement with the required options. (Remember that a username and password are not valid when connecting with Kerberos; the connection options identify the Kerberos service principals instead.) Also, as discussed last time, the step for connecting to Hadoop in a SAS session can be moved behind the curtain by ensuring the principals are in the configuration file used to make the connection to Hadoop. The Hadoop client libraries then use the TGT to request the Service Tickets for HIVE and/or HDFS, and SAS makes the connection to Hadoop using the Service Tickets. Our SAS user Mary is authenticated on the Hadoop side by validating the Service Tickets provided in the connection.
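
As a rough sketch of what that LIBNAME statement might look like, the example below assumes the Hadoop client JAR files and configuration files (including the hive-site.xml that carries the Kerberos service principals) have been staged on the SAS server. The paths, host name, port, and schema are placeholders, and the exact options available depend on your SAS and Hadoop releases.

/* Placeholder paths: point SAS at the Hadoop client JARs and the     */
/* configuration files that contain the Kerberos service principals.  */
options set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/jars";
options set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";

/* No USER= or PASSWORD= options: the TGT in Mary's ticket cache and  */
/* the principals in the configuration files drive the connection.    */
libname hdp hadoop server="hive.example.com" port=10000 schema=default;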

How connections are made in a SAS High-Performance Analytics session

So, this is the setup in a standard SAS session. What now happens if the environment uses something like SAS High-Performance Analytics or SAS Visual Analytics to make the connection to the secured Hadoop environment? Let’s look at the steps involved if our SAS user Mary wants to make a connection to SAS Visual Analytics. Click on the thumbnail to view a larger graphic showing these steps. Wow – that looks a bit more complicated under the covers! We’ll start with understanding how the connections are made and then look at some of the configuration options.

So, the starting points in this process are the same as before. We have the same steps occurring up to making the connection to Hadoop (step 7 in the diagram). At this point, we’ll want to explore a little more detail about the connection that is made by the LIBNAME statement in Mary’s SAS code.

One detail that didn’t matter when connecting with a standard SAS session is that the XML configuration file is actually written to a temporary location in HDFS when Mary connects to Hadoop from SAS. This XML file will be used later by the distributed processes in the SAS High-Performance Analytics environment.

After submitting the LIBNAME statement to connect to Hadoop, Mary now submits a PROC LASR or high-performance procedure statement to access the high-performance environment. Submitting these procedures initiates the SSH connection from the Workspace Session to the SAS High-Performance Analytics Environment. Since Mary needs her SAS High-Performance Analytics session to be Kerberos-aware, the SSH connection must be made using Kerberos. At this point, the SSH client uses the TGT to obtain a Service Ticket for HOST on the HPA General or LASR Root Node. SAS then uses the SSH client to start the HPA General and passes details of how to connect to HDFS.
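
To make that concrete, here is a minimal sketch of the kind of statement that triggers this SSH connection, in this case starting a LASR Analytic Server. The host name, port, and TKGrid install path are placeholders for your own environment, and the exact options vary by release.

/* Sketch only: start a LASR Analytic Server across the grid. The SSH  */
/* connections to the HPA General (and from there to the workers) are  */
/* made with Kerberos, using the TGT from Mary's Workspace Session.    */
proc lasr create port=10010 path="/tmp";
   performance host="hpa-general.example.com" install="/opt/TKGrid" nodes=all;
run;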

The SSH Daemon (a server process running on the HPA General) generates a TGT for Mary on the HPA General as part of authenticating her credentials. This TGT on the HPA General is then used to request Service Tickets for all the worker nodes, and the parallel SSH connections are made to initialize the HPA Captains. With the HPA processes now running, the HPA General initiates the connection to Hadoop using the TGT—this time to request service tickets for HDFS and HIVE.

The HPA General connects to HDFS to retrieve the XML file placed there by the LIBNAME statement. Our SAS user Mary is authenticated using the Service Ticket for HDFS. The HPA General now submits a MapReduce job. This MapReduce job initiates the SAS Embedded Process (EP) running on the Hadoop nodes. The SAS Embedded Process connects first to the HPA General and then makes connections to the assigned HPA Captains using UNIX sockets. This process is not authenticated since the two sets of processes have already been authenticated.

The SAS Embedded Process runs as a standard MapReduce job and has corresponding MapReduce tasks running on each node of the Hadoop environment. The MapReduce tasks connect as necessary to HDFS and HIVE using the standard Hadoop internal tokens. These tokens are used by the tasks of a MapReduce job rather than Kerberos tickets. More details about these internals of the Hadoop system can be found in the Hortonworks Technical Report:  Adding Security to Apache Hadoop. Each SAS Embedded Process then passes the data back and forth in parallel as required by the HPA processes.

Configuration requirements for SAS High-Performance Analytics

So while the diagram looks complicated, I believe we can distill this down into the following two requirements:

  1. The SAS Workspace Server must still have access to the user's TGT.
  2. The HPA General or LASR Root Node must have access to the user's TGT.

The simplest method of ensuring that both the SAS Workspace Server and the HPA General or LASR Root Node have access to the user’s TGT is to configure SSH to use Kerberos and to ensure the following options are set.

In /etc/ssh/sshd_config:

    GSSAPIAuthentication yes
    GSSAPICleanupCredentials yes

In /etc/ssh/ssh_config:

    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes

Setting these options on all SAS High-Performance Analytics Environment machines should ensure that all SSH connections made using Kerberos obtain a valid TGT for each open session. Remember that for SSH to use Kerberos, the HOST Service Principal must be registered in the Kerberos Key Distribution Center and the HOST keytab must be available on each machine (normally stored in /etc/krb5.keytab).

If you have questions about configuring SAS High-Performance Analytics to access a Kerberos-authenticated Hadoop environment or have other suggestions, please share them in the comment area below.

About Author

Stuart Rogers

Architecture and Security Lead

Stuart Rogers is an Architecture and Security Lead in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. His areas of focus include the SAS Middle Tier and security authentication.
