Adventures in the Hadoop zoo

For some odd reason, the open-source Apache Hadoop ecosystem consists of cleverly named components that seem to have escaped from the Central Park Zoo.  You may be aware that the little yellow elephant named Hadoop was actually a stuffed toy belonging to the son of Doug Cutting, Hadoop's co-creator.  You may also know that the core Hadoop framework consists of two components: HDFS (a distributed file system) and MapReduce (a programming framework for processing the data stored in that file system).  What you may not know is that the Apache Hadoop ecosystem has a ton of ancillary components, including ZooKeeper (coordination services, to keep the animals at bay), query and scripting languages such as Pig and Hive, Mahout (an elephant driver; here, a machine-learning library), data stores like HBase and Cassandra, Oozie (workflow scheduling), and Sqoop and Flume (data integration).

Taming all that data in the wild

Many of the 75 vendors at the Strata + Hadoop World Conference in NYC (Oct 23 – 25) were working on tools and technologies to make Hadoop faster, as well as more robust, scalable, and approachable for enterprises.  Prior to the conference, most of the Hadoop + MapReduce scenarios I was familiar with were limited to recommendation engines (collaborative filtering), segmentation/classification (random forests), and text analytics.  During the conference, many presentations suggested that Hadoop + MapReduce is also used for risk modeling, churn analysis, point of sale (PoS) transaction analysis, threat/fraud analysis, ETL, network traffic analysis, search analysis, and data sandboxes for ad hoc analysis across many different industries.  Take a closer look at the mathematical techniques behind these use cases, though, and you'll find they typically rely on aggregations, summaries, and other non-iterative techniques.  Remember, not all analytics are created equal.  Just because someone says they do analytics doesn't mean they do advanced analytics, even though they may use the same words.
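To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python, no Hadoop cluster required) of the kind of single-pass aggregation that maps naturally onto MapReduce: a mapper emits key-value pairs, the framework groups them by key, and a reducer summarizes each group in one scan of the data.  The store IDs and sale amounts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical PoS transaction records: (store_id, sale_amount)
transactions = [
    ("store_a", 19.99), ("store_b", 5.49),
    ("store_a", 3.25),  ("store_b", 12.00),
]

def mapper(record):
    """Emit one (key, value) pair per input record."""
    store_id, amount = record
    yield store_id, amount

def reducer(key, values):
    """Combine all values for a key in a single pass -- no iteration
    over the full data set is ever repeated."""
    return key, round(sum(values), 2)

# The "shuffle" step: group mapper output by key, as the framework would.
grouped = defaultdict(list)
for record in transactions:
    for key, value in mapper(record):
        grouped[key].append(value)

results = dict(reducer(k, v) for k, v in grouped.items())
print(results)  # {'store_a': 23.24, 'store_b': 17.49}
```

Sums, counts, and averages all fit this one-pass shape; that is why they are the workloads Hadoop handles most comfortably.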

The bottom line is that many companies view Hadoop as an extremely scalable, lower-cost data storage option.  However, as noted in several sessions, Hadoop is not a panacea, and there is no shortage of challenges in a Hadoop implementation.  Many companies that have implemented Hadoop still use it alongside traditional relational database management systems and/or database appliances.  Common concerns: Hadoop is better suited to batch processing, queries tend to be slow, it requires a significant amount of custom coding, skilled people are in short supply, and you need a maniacal focus on data quality.

One company highlighted its struggle with a pure open-source approach to analytics.  It ran into a variety of issues while building a general linear model: data loading, the iterations required for the calculation, metadata management, handling missing values, and debugging.

How can you make the most of the data stored in Hadoop, relational database systems, and/or database appliances?  If your organization is going to spend a fair amount of time and money ingesting and storing data, don't limit the analysis to aggregations, summaries, and other non-iterative techniques.  It is the insights derived from the analysis that create the value and competitive advantage for an organization.

We’ll consider how individuals are leveraging their Hadoop infrastructure and making use of advanced analytics in a future post.  But until then, SAS is Hadoopin’ and can help you make the most of your investment if you’re lost amongst the animals in the zoo!

tags: Apache Hadoop, Hadoop