SAS Hadoop - A peek at the technology

Thanks for returning to learn more about this critical technology. Following yesterday’s overview post on the new SAS Hadoop support, we’ll dig a little deeper today and consider the following:

  • Under the Hood: A Peek at the Technology
  • SAS Hadoop Value Summary
  • A Note About the Future

Under the Hood: A Peek at the Technology

Bring the power of SAS® Analytics to Hadoop

The SAS/ACCESS Interface to Hadoop offers seamless and transparent data access to Hadoop via Hive. SAS users access Hive tables as if they were native SAS data sets. Analytic or data processes can be performed using SAS tools, with run-time execution optimized to take place in the appropriate environment, Hadoop or SAS.

The SAS/ACCESS Interface to Hadoop enables Hadoop users to tap into the power of SAS by extending support for the complete analytics life cycle to Hadoop, including discovery, data preparation, modeling and deployment. Of particular importance to many organizations is the ability to:

  • Visually analyze or explore data in Hadoop as the precursor to more in-depth analytics via SAS Visual Analytics Explorer capabilities.
  • Leverage text mining and analytics capability based on data stored in Hadoop.
  • Use SAS Metadata Server to create and manage metadata relating to data that is stored in Hadoop.

Technical Details

  • LIBNAME statement makes Hive tables look like SAS data sets.
  • PROC SQL provides the ability to execute explicit HiveQL commands in Hadoop.
  • SAS procedures (including PROC FREQ, PROC RANK, PROC REPORT, PROC SORT, PROC SUMMARY, PROC MEANS and PROC TABULATE) are supported.
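For readers who want a feel for the two access styles listed above, here is a minimal sketch of the general idea. It is written in Python, not SAS, and an in-memory SQLite database stands in for Hive. The helper names `freq` and `passthrough` and the `weblogs` table are hypothetical and used only for illustration: the first helper generates SQL on the caller's behalf (roughly the procedure-over-a-library style), while the second sends a hand-written query as is (roughly the explicit style).

```python
import sqlite3


def freq(conn, table, column):
    """Frequency count expressed as generated SQL, so the work runs where the data lives."""
    # Identifiers are trusted here for brevity; real code must validate them.
    sql = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY "{column}"'
    )
    return conn.execute(sql).fetchall()


def passthrough(conn, sql, params=()):
    """Run a caller-supplied query verbatim (the explicit style)."""
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Build a tiny in-memory table; SQLite is only a stand-in for Hive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weblogs (status INTEGER, bytes INTEGER)")
    conn.executemany(
        "INSERT INTO weblogs VALUES (?, ?)",
        [(200, 512), (200, 2048), (404, 128), (500, 64)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(freq(conn, "weblogs", "status"))
    print(passthrough(conn, "SELECT SUM(bytes) FROM weblogs WHERE status = ?", (200,)))
```

In both cases only the small result set travels back to the caller, which is the point of pushing work to the data source.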

Leverage Hadoop’s Distributed Processing Capability

SAS Hadoop support allows execution of Hadoop functionality, enabling MapReduce programming, scripting support and the execution of HDFS commands from within the SAS environment. This complements SAS/ACCESS capabilities provided for Hive by extending support for Pig, MapReduce and HDFS commands.
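If MapReduce is new to you, the following is a small, single-process sketch of the programming model in Python. It is not SAS or Hadoop code and all function names are illustrative; a real job runs the map and reduce steps in parallel across the cluster, and the framework performs the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word; each line is processed independently."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

Because no map call depends on another, the work can be split across many nodes without any communication between them.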

More Technical Details

  • PROC HADOOP lets you submit MapReduce jobs, Pig scripts and HDFS commands from the SAS execution environment.
  • External file references (via a FILENAME statement) are supported, so Hadoop files can be referenced from any SAS component. Parameters necessary to process the file, such as delimiters, are externalized, which makes it convenient to work with a Hadoop file.
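The idea of externalizing parsing parameters can be sketched in a few lines. This is a Python illustration under stated assumptions, not the SAS implementation: `FileRef`, `read_rows` and the sample path are hypothetical names. The file reference carries the delimiter and header setting, so any consumer can parse the file without hard-coding those details.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file reference: a location plus the parameters needed to parse it."""
    path: str
    delimiter: str = ","
    has_header: bool = True


def open_text(path):
    """Default opener; newline='' is what the csv module expects."""
    return open(path, newline="")


def read_rows(ref, opener=open_text):
    """Parse the referenced file using only the parameters stored on the reference."""
    with opener(ref.path) as handle:
        rows = list(csv.reader(handle, delimiter=ref.delimiter))
    if ref.has_header and rows:
        header, body = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in body]
    return rows


if __name__ == "__main__":
    sample = "id|name\n1|ann\n2|bob\n"
    ref = FileRef(path="/user/demo/people.txt", delimiter="|")
    # An in-memory buffer stands in for the remote file in this demo.
    print(read_rows(ref, opener=lambda _path: io.StringIO(sample)))
```

Changing the delimiter means changing the reference, not the code that reads it.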

Augment Hadoop using SAS® Information Management

One of the issues plaguing Hadoop is the lack or relative immaturity of tools that can be used to develop and manage Hadoop deployments. SAS data management and analytics management offerings can help organizations quickly derive value from Hadoop using fewer resources. Some examples of this include an intuitive graphical user interface to develop Hadoop capability, the ability to create data management and analytic code and deploy it within Hadoop, or the ability to register and manage Hadoop files via the SAS Management Console. This makes it easy to work with Hadoop within SAS, and extends SAS metadata, data lineage, impact analysis and security capability to Hadoop environments.

Still More Technical Details

SAS® Data Integration Studio

  • SAS Data Integration Studio includes a set of standard transforms and a job flow builder that can be used with Hadoop data. The transforms support common functionality, such as the ability to load, unload, extract, reformat, read/write multiple files, reference external files, etc.
  • SAS Data Integration Studio provides the ability to integrate Hadoop code, including Pig, MapReduce and HDFS commands, in-line with a data job flow.
  • SAS Data Integration Studio provides a visual editor, including a syntax checker, for developing Pig and Hive code.
  • SAS Data Integration Studio can submit HiveQL via PROC SQL, a capability that is also available through Base SAS and other SAS components.
  • Since Hadoop is treated as a SAS data source, data quality capabilities that are provided by SAS and DataFlux can be leveraged to process data that is moving into or out of Hadoop.

Hadoop Function Support

  • SAS provides the ability to create user-defined functions (UDFs) that can be deployed within HDFS. This includes using SAS Enterprise Miner to turn analytical scoring code into a UDF. These UDFs can then be accessed by Hive, Pig or MapReduce code.
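As a rough illustration of why scoring code fits the UDF model, here is a Python sketch of a scoring function. It is not generated by SAS Enterprise Miner, and the coefficients and field names are made up for the example. The key property is that the function is stateless and works on one record at a time, so it can be called from any node that holds a slice of the data.

```python
import math

# Hypothetical coefficients; real scoring code would come from a trained model.
INTERCEPT = -1.0
WEIGHTS = {"visits": 0.5, "purchases": 1.0}


def score(row):
    """Return a probability for one record using a logistic model."""
    z = INTERCEPT + sum(
        weight * float(row.get(name, 0.0)) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def score_partition(rows):
    """Score every record in one partition; partitions never need to communicate."""
    return [score(row) for row in rows]


if __name__ == "__main__":
    print(score_partition([{"visits": 2, "purchases": 0}, {"visits": 0, "purchases": 3}]))
```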

Metadata, Lineage & Security

  • Using Hadoop within SAS provides the benefit of data lineage (including impact analysis) and additional security. All SAS processing that is done with Hadoop is tracked, and the existing data lineage functionality can be used to better manage Hadoop usage.
  • A Hive server can be registered using SAS Management Console so that any SAS capability can easily reference Hadoop, for example by using a FILENAME statement, supplying parameters to better interact with Hadoop, or identifying delimiters so files can be parsed on the fly. This makes it possible for the entire SAS stack (BI, DI, SAS/STAT, etc.) to work with Hadoop data. Registration also makes it possible to track which tables are in Hadoop, and provides the basis for lineage.
  • SAS honors the underlying security provided by Hadoop. For instance, SAS will not bypass Hadoop security and allow a user to read data without the proper Hadoop permissions. In addition to the underlying security provided by Hadoop, SAS will allow you to further restrict access to Hadoop based on the standard SAS security capabilities.
  • The SAS Metadata Server, a component of Base SAS software, provides the ability to generate metadata based on data that is stored in Hadoop. SAS provides flexible parsing support that is not restricted to a preset data definition, allowing support for any custom definition. Once defined, the metadata can be used to optimize interaction with the data stored in Hadoop.
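The layered security described in the list above can be pictured as two checks that must both pass. The sketch below is a Python illustration with hypothetical names and data structures, not how SAS or Hadoop store permissions: the second layer can only narrow access, never widen it.

```python
def can_read(user, table, hadoop_acl, sas_acl):
    """Allow a read only if the Hadoop layer and the SAS layer both allow it."""
    hadoop_ok = user in hadoop_acl.get(table, set())
    # A table with no SAS-level rule carries no additional restriction.
    sas_ok = table not in sas_acl or user in sas_acl[table]
    return hadoop_ok and sas_ok


if __name__ == "__main__":
    hadoop_acl = {"sales": {"ann", "bob"}, "hr": {"ann"}}
    sas_acl = {"sales": {"ann"}}  # further restricts the sales table to ann
    for user in ("ann", "bob"):
        for table in ("sales", "hr"):
            print(user, table, can_read(user, table, hadoop_acl, sas_acl))
```

Note that a user denied by the underlying layer stays denied no matter what the second layer says.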

Environment Support

  • Support for popular Hadoop distributions such as Cloudera, Hortonworks, EMC Greenplum, etc.

 

SAS® Hadoop Value Summary

The SAS approach marries the power of world-class analytics with Hadoop’s ability to leverage commodity-based storage and Hadoop’s ability to perform distributed processing.

The SAS Hadoop integration provides the following value to organizations looking to get the most from their big data assets:

  • SAS both simplifies and augments Hadoop. An ability to abstract the complexity of Hadoop by making it function as another data source brings the power of SAS and its well-established community to Hadoop implementations. This is critical, given the skills shortage and the complexity involved with Hadoop. In addition, boosting Hadoop with world-class analytics, along with metadata, security and lineage capabilities, helps ensure that Hadoop will be ready for enterprise expectations.
  • SAS provides total Hadoop leverage. SAS support for Hadoop spans the entire information management life cycle, and SAS augments Hadoop with metadata, lineage, monitoring, federation and security capabilities. These areas are pervasive through the entire data-to-decision life cycle.

How do enterprises benefit from the distinctive SAS Analytics and SAS Data Integration offerings?

  • SAS provides a robust, comprehensive, information management life cycle approach to Hadoop that includes data management and analytics management support. This is a huge advantage over other products that focus primarily on moving data in and out of Hadoop.
  • SAS delivers optimal solutions for each organization’s specific mix of technologies. SAS Data Integration supports Hadoop alongside other data storage and processing technologies. This offers greater flexibility than other vendor-specific products that only use Hadoop as a vehicle for landing more information on certain database or hardware platforms.

A Note About the Future

The exciting news is that this is just the start – we'll be discussing additional topics, such as data governance and Hadoop, MDM and Hadoop, SAS embedded processing on Hadoop nodes and other topics of interest to the SAS community. Please check back to hear more about how to best build your information assets.

Let the SAS Hadoop hype continue!

tags: Data Integration Studio, high-performance analytics, information management, sas/access

15 Comments

  1. Posted March 10, 2012 at 1:51 am | Permalink

    SAS is a great company and I will recommend them to all!

  2. Bill Zanine
    Posted March 23, 2012 at 12:58 pm | Permalink

    Does SAS support open-source Apache distribution of Hadoop?

    • Posted March 31, 2012 at 12:22 pm | Permalink

      Hello Bill - thanks for the question. Yes we do support the open source distribution of Hadoop along with those that are commercially supported by vendors like Cloudera, Hortonworks, MapR, etc.

  3. Sandi
    Posted April 11, 2012 at 12:50 pm | Permalink

    Does SAS offer a course on distributed computing in general, with emphasis on how to use SAS with Hadoop? Many of us are new to the concept of parallel processing and what Hadoop does in general, and would like to take a course in that.

    • Posted May 1, 2012 at 10:14 pm | Permalink

      Hello Sandi - At this point we do not offer a specific training class on Hadoop and distributed processing. We do have documentation that goes into the support that we provide for Hadoop and you can engage with your account team for additional professional services support. I'll pass your request along to the education team at SAS and let you know if courseware is created in the future. Thanks, Mark.

  4. Seamus McKenna
    Posted April 27, 2012 at 2:26 pm | Permalink

    Hello.
    Enjoyed the Forum and the HADOOP/BIG Data sessions.
    My question is related to how I should approach the customer with SAS Institute's HADOOP solution and other companies' solutions, e.g. IBM. Should I treat IBM's HADOOP solution as another source database to the SAS data warehouse environment via the SAS HADOOP access engine? Should we just recommend the open source Apache HADOOP, and is this appropriate for a large company?
    I am planning to test the SAS HADOOP solution ASAP but would like to have partners.
    Any advice on this "non- lazy comment"?

    • Posted May 1, 2012 at 10:01 pm | Permalink

      Hello Seamus - I'm not sure that I can completely answer your question without more detail, but the SAS solution will work with various distributions of Hadoop, based on a version number. So yes, you can leverage the IBM Hadoop distribution or other distributions as "another database" to SAS. The company's Hadoop expertise, the criticality of the system, their risk level, etc., will determine whether they "go it alone" with Hadoop or go with a vendor that provides Hadoop support, like Hortonworks, MapR, Cloudera, etc. If you have any additional questions, please feel free to email me directly at mark.troester@sas.com. Thanks, Mark.

  5. Jeb Stone
    Posted May 18, 2012 at 11:15 am | Permalink

    Isn't this series of posts premature? I just called in to try to add SAS/Access interface to Hadoop to my contract, only to learn that it's not available on the Windows platform or on desktop versions until at least Q3. It seems SAS's "Hadoop Hype" is just that.

    • Posted July 9, 2012 at 10:22 pm | Permalink

      Hello Jeb -

      Sorry, but there may be some confusion. I'm assuming that you want to run the SAS/ACCESS module on Windows. Several versions of Windows are supported for the host environment; for a complete list of what is supported, please check here: http://www.sas.com/software/data-management/access/hadoop.html#section=5.

      Please contact me directly at mark.troester@sas.com if you need any help.

      Thanks, Mark.

    • Jeff
      Posted July 10, 2012 at 10:54 am | Permalink

      Hi Jeb,

      I am the product manager for SAS/ACCESS. The SAS/ACCESS Interface to Hadoop is in Limited Availability right now. Windows 64-bit is a supported platform. Call your rep again and have them contact me.

      Best wishes,
      Jeff

  6. Posted June 18, 2012 at 2:05 pm | Permalink

    We have a small Hadoop cluster and would love to have our SAS researchers store their data in Hadoop when appropriate, as well as have students be able to access Hadoop data through SAS. What, if any, are the licensing issues for an educational institution?

    • Posted July 9, 2012 at 10:14 pm | Permalink

      Hello Norman - Glad to see you are interested in our support for Hadoop. Our support for Hadoop is tied to different products that we support; for example, the support provided as part of Base SAS is tied to that license. In addition, you will need to license the SAS/ACCESS Interface to Hadoop. It would be best for you to contact your account rep for details on pricing for your institution. Thanks, Mark.

  7. Jagadish Kulkarni
    Posted August 2, 2012 at 8:13 am | Permalink

    Hi Mark,
    I am a big fan of SAS and its technologies, and I am glad to know that we have SAS and Hadoop connectivity. It is very important to have this, with big data being the buzzword and all of us living in the data age. I believe it is not big data that is important, but BIG ANALYTICS. Isn't it?
    What is your plan to bring the power of SAS (especially analytics) to the Hadoop environment and thereby help SAS users build full-blown analytical solutions on Hadoop? I am looking forward to using SAS on Hadoop to execute analytical algorithms like OLS, logistic, mixed models and time series.
    Thanks,
    Jagadish
    India

    • Posted August 6, 2012 at 12:23 pm | Permalink

      Hello Jagadish –

      Thanks for your comment and I’m happy to hear that you are interested in our support for Hadoop.

      In addition to the capability that we have already released, additional work is planned for Hadoop that follows the path that we have taken for other databases like EMC Greenplum and Teradata. This includes support for High Performance Analytics (HPA) running on HDFS. There is a lot more information about HPA on the SAS website, but it basically consists of a number of analytic procedures that run on a high-performance architecture. These procedures support data preparation, exploration, dimension reduction and predictive modeling. The trick here is that SAS leverages an embedded analytics engine that provides the interprocess communication that is required for interactive analytics, low latency, predictive analytics, etc. This complements MapReduce, which is well suited for tasks that don't require message passing between the nodes, like simple ETL data processing. Along with this capability, SAS is planning to extend support for our scoring accelerator to Hadoop. This will allow organizations to easily deploy analytic scoring code within Hadoop.

      Hope this helps – and good luck with your Hadoop implementation!

      Mark.

  8. Satya
    Posted October 31, 2012 at 5:34 am | Permalink

    Can we integrate Mainframe oriented SAS with Hadoop? It would be interesting as many of my SAS mainframe jobs process very high amounts of data.

