Why use Hadoop?

You may still believe that Hadoop is going to solve all of the world’s problems with big data. It won’t. Hadoop is a framework for storing and processing large-scale data, with both pros and cons for organizations.

Christopher Stevens of Greenplum explained that Hadoop is rapidly becoming the go-to platform for big data analytics. In his presentation at the 2012 SouthEast SAS Users Group (SESUG) conference, Stevens described what Hadoop is and noted some of the pros and cons of using it.

Hadoop is:

  • Inspired by Google’s architecture: MapReduce and GFS (a minimal MapReduce sketch follows this list).
  • Open source, a top-level Apache project.
  • Written in Java, plus a few shell scripts.
  • Accompanied by tool sets such as Pig, Hive and HBase.
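
To make the MapReduce model concrete, here is a minimal word-count sketch against Hadoop’s standard Java MapReduce API. It is an illustration only, not code from Stevens’ presentation: the mapper emits a (word, 1) pair for every word in its input split, the framework groups the pairs by key across the cluster, and the reducer sums the counts for each word.

```java
// Minimal word-count sketch using Hadoop's Java MapReduce API (illustrative only).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory, typically in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is packaged into a jar and launched with something along the lines of hadoop jar wordcount.jar WordCount <input dir> <output dir>; Hadoop splits the input across the cluster, runs map tasks where the data lives, and shuffles the intermediate pairs to the reducers.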

But Hadoop is not right for everyone. “We’ve seen customers go out and stand up Hadoop clusters, but they don’t need them,” Stevens said.

When Hadoop is right for an organization:

  • Your processing can easily be made parallel; it can then be effective for both unstructured and structured data.
  • You need to reduce query times for batch jobs.
  • You have access to lots of cheap hardware.

What Hadoop is not so good for:

  • Intense calculations with little data.
  • Processing that cannot easily be made parallel.

According to Stevens, the biggest challenge for organizations is finding analysts with the new skills Hadoop requires.

Read more about Hadoop and how SAS and Greenplum work together to analyze big data with Hadoop. (This paper and all SESUG 2012 papers will be available soon.)

About Author

Waynette Tubbs

Editor, Marketing Editorial

Waynette Tubbs is a seasoned technology journalist specializing in interviewing and writing about how leaders leverage advanced and emerging analytical technologies to transform their B2B and B2C organizations. In her current role, she works closely with global marketing organizations to generate content about artificial intelligence (AI), generative AI, intelligent automation, cybersecurity, data management, and marketing automation. Waynette has a master’s degree in journalism and mass communications from UNC Chapel Hill.

4 Comments

  1. Maybe my reply is too late and people already know this. The entire thing has two layers:
    1. Data layer (Hadoop HDFS) -- SAS accesses the HDFS file system using the Hadoop access engine (plus additional modules) to read and write files.
    2. Processing layer -- Depending on the modules that have been licensed, SAS can leverage the Hadoop nodes with SAS Embedded Processes (some procs can run in distributed fashion), use a native Hadoop call to distribute a Java MapReduce job and retrieve results, use a LASR cluster on top of the HDFS cluster to do in-memory computation, etc.

    If you just get the Hadoop access engine, the things you can do are limited. If you get other modules such as the Embedded Process, Data Loader or the Visual Analytics platform, you can do much more.
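    To make the data layer concrete: whatever sits on top (the SAS access engine, Pig, Hive, plain Java), it is ultimately reading and writing HDFS files. Here is a rough, hypothetical sketch of that layer using Hadoop's plain Java FileSystem API; the NameNode host and file path below are made up.

```java
// Minimal sketch of the "data layer": reading a file from HDFS with
// Hadoop's Java FileSystem API. Host, port and path are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the NameNode; normally this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sales/part-00000")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // every client ultimately streams HDFS blocks like this
      }
    }
  }
}
```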
