You may still believe that Hadoop is going to solve all of the world’s problems with big data. It won’t. Hadoop is a framework for storing and processing large-scale data, with both pros and cons for organizations.
Christopher Stevens of Greenplum explained that Hadoop is rapidly becoming the go-to platform for big data analytics. In his presentation at the 2012 SouthEast SAS Users Group (SESUG) conference, Stevens explained what Hadoop is and noted some of the pros and cons of using it:
- Inspired by Google’s architecture: MapReduce and the Google File System (GFS).
- Open source, a top-level Apache project.
- Written in Java, plus a few shell scripts.
- Includes tool sets such as Pig, Hive and HBase.
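To make the MapReduce model mentioned above concrete, here is a minimal word-count sketch in plain Python. This is not Hadoop’s actual Java API, and the function names are illustrative; it only shows the map, shuffle, and reduce phases that Hadoop distributes across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all counts for one word."""
    return key, sum(values)

def word_count(lines):
    # In Hadoop, each line could be mapped on a different node; here we loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_count(["big data", "big clusters"]))
# → {'big': 2, 'data': 1, 'clusters': 1}
```

Because the mappers never share state and each reducer handles one key independently, the work spreads naturally across many cheap machines, which is exactly the kind of "easily made parallel" processing Stevens describes below.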
But Hadoop is not right for everyone. “We’ve seen customers go out and stand up Hadoop clusters, but they don’t need them,” Stevens said.
When Hadoop is right for an organization:
- When your processing can easily be made parallel; it works for both unstructured and structured data.
- When there’s a need to reduce the query times for batch jobs.
- When you have access to lots of cheap hardware.
What Hadoop is not so good for:
- Compute-intensive calculations on small amounts of data.
- Processing that cannot easily be made parallel.
According to Stevens, the biggest challenge for organizations adopting Hadoop is finding analysts with the new skills it requires.
Read more about Hadoop and how SAS and Greenplum work together to analyze big data with Hadoop. (This paper and all SESUG 2012 papers will be available soon.)