By now we have all heard how Yahoo uses Hadoop to optimize the user experience through ad and content targeting. We know that Hadoop is well suited for analysis or processing that can be distributed in parallel across multiple nodes. We know it’s great for managing big data. And we also know that Hadoop is ill-suited for transactional use cases since it lacks ACID support. But beyond the initial use of Hadoop by Yahoo, Google and other big web properties, how is Hadoop being used? How are you using Hadoop?
Given that Hadoop is relatively new to many organizations, it’s important to look at how the early adopters are using it. Although it is still early days for Hadoop, and many related projects that may expand its usage are gaining traction, usage patterns are starting to emerge. Let’s take a look at these use cases from two overlapping perspectives, technical and business:
Staging area for data warehouse / analytics – Using Hadoop as a vehicle to load data into a traditional data warehouse for data mining, OLAP, reporting, etc., and to load data into an analytical store for advanced analytics. Organizations can dump large amounts of data into a Hadoop cluster, use SQL-like querying with Hive to make sense of the data, aggregate it (depending on the use case), and export the data or the aggregate values into the warehouse or the store used for analytics. With proper design of the ETL process, Map/Reduce can be used to bring the data preparation tasks to the data, which keeps processing efficient for large data volumes. At this point, other than customers that are just kicking the tires, this is the predominant use case that we see.
Analytics sandbox – Using Hadoop as an ad-hoc analysis sandbox environment. This involves moving data temporarily into Hadoop from various sources, potentially even the enterprise data warehouse (EDW), and making it available to an analyst for open-ended, iterative analytics. This is often paired with BI tools that analyze the structured information residing in the EDW.
Unstructured / semi-structured content storage and analysis – Often related to the first use case, this one uses Hadoop as a storage mechanism for capturing unstructured or semi-structured content, and then uses Map/Reduce to parse it, tokenize it, and join it with other data.
Total data analysis – Leverage parallelism to overcome bandwidth and coordination issues relating to processing billions of records – for example, supporting a process that involves searching for something, looking for patterns, etc. This approach also relates to running analytics on larger data samples, or total data samples, vs. using sampling to simulate a larger data population.
Commodity-based storage – Using commodity hardware and Hadoop to store large amounts of data. Again, this relates to some of the previous use cases, and can involve the storage of transactional data, social media data, sensor data, scientific data, emails, etc.
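To make the parse, tokenize, and aggregate pattern from the use cases above more concrete, here is a minimal sketch in plain Python. It only imitates the map, shuffle, and reduce phases in a single process and does not use any Hadoop API. The log format, the sample lines, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical semi-structured input: "<user_id> <action> <free text>".
RAW_LINES = [
    "u1 click Red shoes on sale",
    "u2 view red Shoes",
    "u1 view blue hat",
    "malformed",
]


def map_phase(line):
    """Parse one raw line and emit (token, 1) pairs; skip lines that do not parse."""
    parts = line.split(maxsplit=2)
    if len(parts) < 3:
        return []  # not enough fields, so treat the line as unparseable
    _user_id, _action, text = parts
    return [(token.lower(), 1) for token in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single aggregate."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


token_counts = run_job(RAW_LINES)
```

In a staging-area setup, an aggregate such as `token_counts` is the kind of compact result that would be exported to the warehouse, while the raw lines stay in the cluster. On a real cluster the map and reduce functions would run in parallel on the nodes that hold the data, which is the point of bringing the processing to the data.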
For more on the technical side, check out David Loshin's blog post on The Data Roundtable.
On the business side, the emerging use cases include:
Behavioral analysis – Analyze the behavior of key business entities – customer churn, propensity to respond, etc.
Targeting marketing offers – Determine which marketing offers should be made to each relevant target – optimizing ad placement, optimizing marketing offers, etc.
Analyzing marketing effectiveness – Analyze marketing efforts after the fact – marketing ROI, campaign effectiveness, etc.
Root cause analysis – Determine the root cause of failures, issues, or defects – user session investigation, network log analysis, machine sensor analysis, etc.
Sentiment analysis – Analyze customer or prospect sentiment – social media analysis, email analysis, etc.
Fraud Analysis – Detect fraudulent behavior – clickstream or web analysis, mining, etc.
Risk Mitigation – Analyze potential market trends, understand future possibilities to mitigate risk – financial position or total portfolio asset analysis, compliance, return on capital, etc.
There are other ways to analyze Hadoop usage. For example: Are organizations looking at Hadoop as a replacement or an augmentation strategy? What technologies are being used to process data (Map/Reduce, Hive, Pig, etc.)? What sub-projects are being used to augment Hadoop (ZooKeeper, Sqoop, Flume, Oozie, Avro, Mahout, etc.)?
But these decisions should be driven by your major technical and business usage scenarios.