SAS Grid Manager for Hadoop is a brand new product released with SAS 9.4M3 this summer. It gives you the ability to co-locate your SAS Grid jobs on your Hadoop data nodes to let you further leverage your investment in your Hadoop infrastructure. This is possible because SAS Grid Manager for Hadoop is integrated with the native components, specifically YARN and Oozie, of your Hadoop ecosystem. Let's review the architecture of this new offering.
First of all, the official name– SAS Grid Manager for Hadoop– shows that it is a brand new product, not just an addition or a different configuration of the “classic” SAS Grid Manager – which I will subsequently refer to as “for Platform” to distinguish the two.
For an end user, grid usage and functionality remains the same, but an architect will notice that many components of the offering have changed. Describing these components will be the focus of the remainder of this post.
Let me start by showing a picture of a sample software architecture, so that it will be easier to recognize all the pieces with a visual schema in front of us. The following is one possible deployment architecture; there are other deployment choices.
Third party components
Just as SAS Grid Manager for Platform builds on top of third party software from Platform Computing (part of IBM), SAS Grid Manager for Hadoop requires Hadoop to function. There is a big difference, though.
SAS Grid Manager for Platform includes all of the required Platform Computing components, as they are delivered, installed and supported by SAS.
On the other side, SAS Grid Manager for Hadoop considers all of the Hadoop components (highlighted in yellow in the above diagram) as prerequisites. As such, customers are required to procure, install and support Hadoop before SAS gets installed.
Hadoop, as you know, includes many different components. The diagram lists the one that are needed for SAS Grid Manager:
- HDFS provides cluster-wide filessytem storage
- YARN is used for resource management
- Oozie is the scheduling service
- Hue is required, if the Oozie web GUI is surfaced through Hue.
- Hive is required at install time for the SAS Deployment Wizard to be able to access the required Hadoop configuration and jar files.
- Hadoop jars and config files need to be on every machine, including clients.
YARN Resource Manager, HDFS Name Node, Hive, and Oozie are not necessarily on the same machine. By default, the SAS grid control server needs to be on the machine that YARN Resource Manager is on.
SAS programming interfaces to grid have not changed, apart from the lower-level libraries to connect to the third party software. As such, SAS will deploy the traditional SAS grid control server, SAS grid nodes, SAS thin client (aka SASGSUB) or the full SAS client (SAS Display Manger).
In a typical SAS Grid deployment, a shared directory is used to share the installation and configuration directories between machines in the grid. With SAS Grid Manager for Hadoop, you can either use NFS to mount a shared directory on all cluster hosts or use the SAS Deployment Manager (SDM) to work with the cluster manager to distribute the deployment to the cluster hosts. The SDM has the ability to create Cloudera parcels and Ambari packages to enable the distribution of the installation and configuration directories from the grid control server to the grid nodes.
One notable missing component is the SAS Grid Manager plug-in for SAS Management Console. This management interface is tightly coupled with Platform Computing GMS, and cannot be used with Hadoop.
The Middle Tier
You will notice in the above diagram that the middle tier is faded. In fact, no middle tier components are included in SAS Grid Manager for Hadoop. Anyway, a middle tier will generally be included and deployed as part of other solutions licensed on top of SAS Grid Manager, so you will still be able to program using SAS Studio and monitor the SAS infrastructure using SAS Environment Manager.
Please note that I say “monitor the SAS infrastructure”, not “monitor the SAS grid.” There are no plug-ins or modules within SAS Environment Manager that are specific to SAS Grid Manager for Hadoop. This is by design because SAS is part of your overall Hadoop environment and therefore the SAS Grid workload can be monitored using your favorite Hadoop management tools.
Hadoop provides plenty of web interfaces to monitor, manage and configure its environment. As such, you will be able to use YARN Web UI to monitor and manage submitted SAS jobs, as well as Hue web UI to review scheduled workflows.
Discussing grid storage is never a quick task and could require a full blog post on its own. It is worth noting some architecture peculiarities related to SAS Grid Manager for Hadoop. HDFS can be used to store shared data, and is used to store scheduled jobs, workflows, logs. But, we still require a traditional, POSIX complaint filesystem for stuff such as SAS Work, SASGSUB, solution specific projects, etc.
SAS Grid Manager for Hadoop enables customers to co-locate their SAS Grid and all of the associated SAS workload on their existing Hadoop cluster. We have briefly discussed the key components that are included in – or are missing from – this new offering. I hope you found this post helpful. As always, any comments are welcome.