Why use Hadoop?

You may still believe that Hadoop is going to solve all of the world’s problems with big data. It won’t. Hadoop is a framework for storing and processing large-scale data, and it has both pros and cons for organizations.

Christopher Stevens of Greenplum says Hadoop is rapidly becoming the go-to platform for big data analytics. In his presentation at the 2012 SouthEast SAS Users Group conference, Stevens explained what Hadoop is and noted some of its pros and cons.

Hadoop is:

Inspired by Google’s architecture: MapReduce and GFS.
Open source, a top-level Apache project.
Written in Java, plus a few shell scripts.
Accompanied by tool sets such as Pig, Hive and HBase.

But Hadoop is not right for everyone. “We’ve seen customers go out and stand up Hadoop clusters, but they don’t need them,” Stevens said.

When Hadoop is right for an organization:

When your processing can easily be made parallel, whether the data is unstructured or structured.
When you need to reduce the query times for batch jobs.
When you have access to lots of cheap hardware.

What Hadoop is not so good for:

Intense calculations with little data.
Processing that cannot easily be made parallel.
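
To make "easily made parallel" a little more concrete, here is a minimal sketch in plain Python of the map-and-reduce pattern that Hadoop's MapReduce is modeled on. It is not Hadoop code and does not come from Stevens' presentation; the function names and the sample data are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, with no knowledge of other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(chunks, workers=4):
    # Each chunk is processed independently, so the map step can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # The reduce step merges the partial counts into a final answer.
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data, split into chunks the way a large file might be split.
chunks = [
    ["big data is big", "hadoop stores data"],
    ["hadoop processes data in parallel"],
]
totals = word_count(chunks)
print(totals["data"])    # 3
print(totals["hadoop"])  # 2
```

The job splits cleanly because no chunk needs anything from another chunk until the final merge. A calculation in which every step depends on the result of the step before it cannot be divided this way, which is the kind of processing the list above warns about.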

According to Stevens, the biggest challenge for organizations is that they need to find analysts with new skills.

Read more about Hadoop and how SAS and Greenplum work together to analyze big data with Hadoop. (This paper and all SESUG 2012 papers will be available soon.)

4 Comments

Subhash on May 27, 2015 11:24 am

Maybe my reply is too late and people already know this. The entire thing has two layers:
1. Data layer (Hadoop HDFS) -- SAS accesses the HDFS file system using the Hadoop access engine (plus additional modules) to read and write files.
2. Processing layer -- Depending on the modules that have been licensed, SAS can leverage Hadoop nodes with SAS Embedded Processes (some procs can run in distributed fashion), use a native Hadoop call to distribute a Java MapReduce job to retrieve results, or use a LASR cluster on top of the HDFS cluster to do in-memory computation, etc.

If you just get the Hadoop access engine, the things you can do are limited. If you get other modules such as the Embedded Process, Data Loader or Visual Analytics platform, you can do much more.

- Christina Harvey on May 27, 2015 11:34 am
  
  It's never too late to help clarify a topic on SAS Users.
  Thanks for commenting, Subhash.
  
Peter Crawford on October 25, 2012 3:40 am

Is it more than a non-SAS implementation of SAS-grid?

- Waynette Tubbs on October 26, 2012 10:49 am
  
  My understanding is that it is a distributed file system, whereas Grid is distributed processing.
