SAS Hadoop - A peek at the technology

Thanks for returning to learn more about this critical technology. Following yesterday’s overview post on the new SAS Hadoop support, we’ll dig a little deeper today and consider the following:

  • Under the Hood: A Peek at the Technology
  • SAS Hadoop Value Summary
  • A Note About the Future

Under the Hood: A Peek at the Technology

Bring the power of SAS® Analytics to Hadoop

The SAS/ACCESS Interface to Hadoop offers seamless and transparent access to Hadoop data via Hive. SAS users access Hive tables as if they were native SAS data sets. Analytic or data processes can be performed using SAS tools, with run-time execution optimized by running each step in the appropriate Hadoop or SAS environment.

The SAS/ACCESS Interface to Hadoop enables Hadoop users to tap into the power of SAS by extending support for the complete analytics life cycle to Hadoop, including discovery, data preparation, modeling and deployment. Of particular importance to many organizations is the ability to:

  • Visually analyze or explore data in Hadoop with SAS Visual Analytics Explorer as a precursor to more in-depth analytics.
  • Apply text mining and analytics to data stored in Hadoop.
  • Use SAS Metadata Server to create and manage metadata relating to data that is stored in Hadoop.

Technical Details

  • The LIBNAME statement makes Hive tables look like SAS data sets.
  • PROC SQL provides the ability to execute explicit HiveQL commands in Hadoop.
  • SAS procedures (including PROC FREQ, PROC RANK, PROC REPORT, PROC SORT, PROC SUMMARY, PROC MEANS and PROC TABULATE) are supported.
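
To make the three points above concrete, here is a minimal sketch of what this can look like in a SAS session. The server name, port, credentials, schema, and the table weblogs with its status_code column are placeholders assumed for illustration, not values from this post. The exact connection options depend on your SAS release and Hive setup.

```sas
/* Assumed connection details: replace the server, port and credentials */
/* with the values for your own Hive server.                            */
libname hdp hadoop server="hive.example.com" port=10000
        user=myuser password=mypwd schema=default;

/* A Hive table referenced through the libref like any SAS data set.    */
/* "weblogs" and "status_code" are hypothetical names.                  */
proc freq data=hdp.weblogs;
   tables status_code;
run;

/* Explicit pass-through: the HiveQL inside the parentheses is sent to  */
/* Hive as written, and the result set comes back to PROC SQL.          */
proc sql;
   connect to hadoop (server="hive.example.com" port=10000
                      user=myuser password=mypwd schema=default);

   select *
      from connection to hadoop
         (select status_code, count(*) as hits
            from weblogs
           group by status_code);

   disconnect from hadoop;
quit;
```

The libref form keeps the code in ordinary SAS syntax. The pass-through form is useful when you want to control exactly which HiveQL runs in the cluster.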

Leverage Hadoop’s Distributed Processing Capability

SAS Hadoop support allows execution of Hadoop functionality, enabling MapReduce programming, scripting support and the execution of HDFS commands from within the SAS environment. This complements SAS/ACCESS capabilities provided for Hive by extending support for Pig, MapReduce and HDFS commands.
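
As a rough illustration of the paragraph above, the sketch below uses PROC HADOOP to issue HDFS commands, a Pig script and a MapReduce job from one SAS step. All paths, the user name and password, the Pig script and the JAR are hypothetical. The class names assume the stock Hadoop WordCount example is what the JAR contains.

```sas
/* Fileref pointing at a Hadoop configuration file (hypothetical path). */
filename cfg "/home/myuser/hadoop/cluster-config.xml";

/* Fileref pointing at a Pig script (hypothetical path).                */
filename pigcode "/home/myuser/hadoop/clean_logs.pig";

proc hadoop options=cfg username="myuser" password="mypwd" verbose;
   /* HDFS commands: create a directory and copy a local file into it.  */
   hdfs mkdir="/user/myuser/demo";
   hdfs copyfromlocal="/home/myuser/data/weblogs.txt"
        out="/user/myuser/demo/weblogs.txt";

   /* Submit the Pig script referenced by the PIGCODE fileref.          */
   pig code=pigcode;

   /* Submit a MapReduce job from a JAR; the JAR path and class names   */
   /* below stand in for the stock Hadoop WordCount example.            */
   mapreduce input="/user/myuser/demo/weblogs.txt"
             output="/user/myuser/demo/wordcount_out"
             jar="/home/myuser/hadoop/jars/WordCount.jar"
             outputkey="org.apache.hadoop.io.Text"
             outputvalue="org.apache.hadoop.io.IntWritable"
             map="org.apache.hadoop.examples.WordCount$TokenizerMapper"
             combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
             reduce="org.apache.hadoop.examples.WordCount$IntSumReducer";
run;
```

In practice you would usually submit only the statements you need. They are combined here to show the three kinds of work in one place.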

More Technical Details

  • PROC HADOOP lets you submit MapReduce jobs, Pig scripts and HDFS commands from the SAS execution environment.
  • External file references are supported, which provides the ability for Hadoop files to be referenced from any SAS component. Parameters necessary to process the file, such as delimiters, are externalized, which makes it convenient to work with a Hadoop file.
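
The external file reference described above might look like the following sketch, which assumes the Hadoop access method of the FILENAME statement. The HDFS path, configuration file, credentials, delimiter and field layout are all made up for the example.

```sas
/* Fileref for a file in HDFS, using the Hadoop access method.          */
/* The HDFS path, configuration file and credentials are placeholders.  */
filename logs hadoop "/user/myuser/demo/weblogs.txt"
         cfg="/home/myuser/hadoop/cluster-config.xml"
         user="myuser" pass="mypwd";

/* Read the file with a DATA step. The delimiter is stated on the       */
/* INFILE statement; the three fields are a hypothetical layout.        */
data work.weblogs;
   infile logs dlm='|' dsd truncover;
   input host :$64. status_code bytes;
run;
```

Once the fileref is defined, the DATA step looks the same as one that reads a local delimited file.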

Augment Hadoop using SAS® Information Management

One of the issues plaguing Hadoop is the lack or relative immaturity of tools that can be used to develop and manage Hadoop deployments. SAS data management and analytics management offerings can help organizations quickly derive value from Hadoop using fewer resources. Some examples of this include an intuitive graphical user interface to develop Hadoop capability, the ability to create data management and analytic code and deploy it within Hadoop, or the ability to register and manage Hadoop files via the SAS Management Console. This makes it easy to work with Hadoop within SAS, and extends SAS metadata, data lineage, impact analysis and security capability to Hadoop environments.

Still More Technical Details

SAS® Data Integration Studio

  • SAS Data Integration Studio includes a set of standard transforms and a job flow builder that can be used with Hadoop data. The transforms support common functionality, such as the ability to load, unload, extract, reformat, read/write multiple files, reference external files, etc.
  • SAS Data Integration Studio provides the ability to integrate Hadoop code, including Pig, MapReduce and HDFS commands in-line with a data job flow.
  • SAS Data Integration Studio provides an editor for Pig and Hive, with visual editing capability and a syntax checker, for developing Pig and Hive code.
  • SAS Data Integration Studio can submit HiveQL via PROC SQL, a capability that is also available through Base SAS and other SAS components.
  • Since Hadoop is treated as a SAS data source, data quality capabilities that are provided by SAS and DataFlux can be leveraged to process data that is coming in or out of Hadoop.

Hadoop Function Support

  • SAS provides the ability to create user-defined functions (UDFs) that can be deployed within HDFS. This includes the ability to use SAS Enterprise Miner to take analytical scoring code and produce a UDF that can be deployed within HDFS. These UDFs can then be accessed by Hive, Pig or MapReduce code.

Metadata, Lineage & Security

  • Using Hadoop within SAS provides the benefit of data lineage (including impact analysis) and additional security. All SAS processing that is done with Hadoop is tracked, and the existing data lineage functionality can be used to better manage Hadoop usage.
  • You can register Hive Server using SAS Management Console so that any SAS capability can easily reference Hadoop (for example, via a FILENAME statement, by using parameters to better interact with Hadoop or by identifying delimiters so files can be parsed on the fly). This makes it possible for the entire SAS stack (BI, DI, SAS/STAT, etc.) to work with Hadoop data. Registration also makes it possible to track which tables are in Hadoop, and provides the basis for lineage.
  • SAS honors the underlying security provided by Hadoop. For instance, SAS will not bypass Hadoop security and allow a user to read data without the proper Hadoop permissions. In addition to the underlying security provided by Hadoop, SAS will allow you to further restrict access to Hadoop based on the standard SAS security capabilities.
  • The SAS Metadata Server, a component of Base SAS software, provides the ability to generate metadata based on data that is stored in Hadoop. SAS provides flexible parsing support that is not restricted to a preset data definition, allowing support for any custom definition. Once defined, the metadata can be used to optimize interaction with the data stored in Hadoop.

Environment Support

  • Support for popular Hadoop distributions, including Cloudera, Hortonworks and EMC Greenplum.

 

SAS® Hadoop Value Summary

The SAS approach marries the power of world-class analytics with Hadoop’s ability to leverage commodity-based storage and Hadoop’s ability to perform distributed processing.

The SAS Hadoop integration provides the following value to organizations looking to get the most from their big data assets:

  • SAS both simplifies and augments Hadoop. An ability to abstract the complexity of Hadoop by making it function as another data source brings the power of SAS and its well-established community to Hadoop implementations. This is critical, given the skills shortage and the complexity involved with Hadoop. In addition, boosting Hadoop with world-class analytics, along with metadata, security and lineage capabilities, helps ensure that Hadoop will be ready for enterprise expectations.
  • SAS provides total Hadoop leverage. Because SAS support for Hadoop spans the entire information management life cycle, SAS management supports metadata, lineage, monitoring, federation and security augmentation. These areas are pervasive through the entire data-to-decision life cycle.

How do enterprises benefit from the distinctive SAS Analytics and SAS Data Integration offerings?

  • SAS provides a robust, comprehensive, information management life cycle approach to Hadoop that includes data management and analytics management support. This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.
  • SAS delivers optimal solutions for each organization’s specific mix of technologies. SAS Data Integration supports Hadoop alongside other data storage and processing technologies. This offers greater flexibility than other vendor-specific products that only use Hadoop as a vehicle for landing more information on certain database or hardware platforms.

A Note About the Future

The exciting news is that this is just the start. We'll be discussing additional topics of interest to the SAS community, such as data governance and Hadoop, MDM and Hadoop, and SAS embedded processing on Hadoop nodes. Please check back to hear more about how to best build your information assets.

Let the SAS Hadoop hype continue!

About Author

Mark Troester

IT / CIO Thought Leader & Strategist

Mark Troester is the IT / CIO Thought Leader & Strategist for SAS. He oversees the company’s market strategy efforts for information management and for the overall CIO and IT vision. He began his career in IT and has worked in product management and product marketing for a number of Silicon Valley start-ups and established software companies. Twitter @mtroester

26 Comments

  1. Pingback: SAS: Big play for Hadoop - Information Architect

  2. Pingback: Introducing high-performance analytics for any environment - SAS Voices

    • Mark Troester

      Hello Bill - thanks for the question. Yes we do support the open source distribution of Hadoop along with those that are commercially supported by vendors like Cloudera, Hortonworks, MapR, etc.

  3. Pingback: Staying ahead of the big data curve - SAS Voices

  4. Does SAS offer a course on distributed computing in general, with emphasis on how to use SAS with Hadoop? Many of us are new to the concept of parallel processing and what Hadoop does in general, and would like to take a course on that.

    • Mark Troester

      Hello Sandi - At this point we do not offer a specific training class on Hadoop and distributed processing. We do have documentation that goes into the support that we provide for Hadoop and you can engage with your account team for additional professional services support. I'll pass your request along to the education team at SAS and let you know if courseware is created in the future. Thanks, Mark.

  5. Pingback: What’s new with in-memory computing? - SAS Voices

  6. Pingback: Another sneak peek for your SAS Global Forum 'agenda' - SAS Users Groups

  7. Pingback: The things I missed at #SASGF12 - Real BI for Real Users

  8. Seamus McKenna

    Hello.
    Enjoyed the Forum and the HADOOP/Big Data sessions.
    My question is related to how I should approach the customer with SAS Institute's HADOOP solution and other companies' solutions, e.g. IBM. Should I treat IBM's HADOOP solution as another source database to the SAS data warehouse environment via the SAS HADOOP access engine? Should we just recommend the open source Apache HADOOP, and is this appropriate for a large company?
    I am planning to test the SAS HADOOP solution ASAP but would like to have partners.
    Any advice on this "non-lazy comment"?

    • Mark Troester

      Hello Seamus - I'm not sure that I can completely answer your question without more detail, but the SAS solution will work with various distributions of Hadoop, based on a version number. So yes, you can leverage the IBM Hadoop distribution or other distributions as "another database" to SAS. The company's Hadoop expertise, the criticality of the system, their risk level and similar factors will determine whether they "go it alone" with Hadoop or whether they go with a vendor that provides Hadoop support like Hortonworks, MapR, Cloudera, etc. If you have any additional questions, please feel free to email me directly at mark.troester@sas.com. Thanks, Mark.

  9. Pingback: Big data answers for your industry and your role - SAS Voices

  10. Isn't this series of posts premature? I just called in to try to add SAS/Access interface to Hadoop to my contract, only to learn that it's not available on the Windows platform or on desktop versions until at least Q3. It seems SAS's "Hadoop Hype" is just that.

  11. We have a small Hadoop cluster, and would love to have our SAS researchers store their data in Hadoop when appropriate, as well as have students be able to access Hadoop data through SAS. What, if any, are the licensing issues for an educational institution?

    • Mark Troester

      Hello Norman - Glad to see you are interested in our support for Hadoop. Our support for Hadoop is tied to different products that we support. For example, the support provided as part of Base SAS is tied to that license. In addition, you will need to license SAS/ACCESS Interface to Hadoop. It would be best for you to contact your account rep for details on pricing for your institution. Thanks, Mark.

  12. Jagadish Kulkarni

    Hi Mark,
    I am a big fan of SAS and its technologies and glad to know that we have SAS and Hadoop connectivity. It is very important to have this, with big data being a buzzword and us living in the data age. I believe it is not big data that is important, but BIG ANALYTICS. Isn't it?
    What is your plan to bring the power of SAS (especially Analytics) to the Hadoop environment and thereby help SAS users build full-blown analytical solutions on Hadoop? I am looking forward to using SAS on Hadoop to execute analytical algorithms like OLS, logistic, mixed models and time series.
    Thanks,
    Jagadish
    India

    • Mark Troester

      Hello Jagadish –

      Thanks for your comment and I’m happy to hear that you are interested in our support for Hadoop.

      In addition to the capability that we have already released, additional work is planned for Hadoop that follows the path that we have taken for other databases like EMC Greenplum and Teradata. This includes support for High Performance Analytics (HPA) running on HDFS. There is a lot more information about HPA on the SAS website, but it basically involves running a number of analytic procedures on a high-performance architecture. These procedures support data preparation, exploration, dimension reduction and predictive modeling. The trick here is that SAS leverages an embedded analytics engine that provides the interprocess communication that is required for interactive analytics, low latency, predictive analytics, etc. This complements MapReduce, which is well suited for tasks that don’t require message passing between the nodes, like simple ETL data processing. Along with this capability, SAS is planning to extend support for our scoring accelerator to Hadoop. This will allow organizations to easily deploy analytic scoring code within Hadoop.

      Hope this helps – and good luck with your Hadoop implementation!

      Mark.

  13. Can we integrate mainframe-oriented SAS with Hadoop? It would be interesting, as many of my SAS mainframe jobs process very large amounts of data.

  14. Pingback: The best of SAS blogs for 2012 - SAS Voices

  15. If a SAS data set (.sas7bdat) is placed in HDFS, does Hadoop take care of distributing chunks of it across the nodes in the cluster?
    Does read response time on that data set go down linearly with the number of nodes in the Hadoop cluster?

  16. I have used PROC HADOOP to run MapReduce and to move some flat files around on the Hadoop system. What has been more effective is mounting the Hadoop system on my Linux server. The file system looks just like a regular Linux directory. The advantage here has been the ability to read prod logs from Hadoop with SAS code and copy flat files to and from the Hadoop system. For the copies, I have found it works to use a systask command with a wait and just do a simple cp to where I want the files. Overall it is a simple solution for using the space on Hadoop and accessing the prod logs.

  17. Pingback: New addition into Analytics DBMS – Hadoop as the ERP of web data and its Integration with SAS - Analytics Training Blog
