“Our corporate data is growing at a rate of 27% each year and we expect that to increase. It’s just getting too expensive to extend and maintain our data warehouse.”
“Don’t talk to us about our ‘big’ data. We’re having enough trouble getting our ‘small’ data processed and analyzed in a timely manner. First things first.”
“We have to keep our data for 7 years for compliance reasons, but we’d love to store and analyze decades of data - without breaking the machine and the bank.”
Do any of these scenarios ring a bell? If so, Hadoop may be able to help.
Contrary to popular belief, Hadoop is not just for big data. (For purposes of this discussion, "big data" simply refers to data that doesn't fit comfortably – or at all – into your existing relational systems.) Granted, Hadoop was originally developed to address the big data needs of web/media companies, but today, it's being used around the world to address a wider set of data needs, big and small, by practically every industry.
When the Apache Hadoop project was initially released, it had two primary components: a storage component called HDFS (the Hadoop Distributed File System), which runs on low-cost, commodity hardware; and a resource management and processing component called MapReduce.
Although MapReduce processing is lightning fast when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently. With the recent release of Hadoop 2.0, however, the resource management functionality has been split out of MapReduce into a separate component called YARN, so that MapReduce doesn’t get bottlenecked and can stay focused on what it does best – processing data.
Keeping these two Hadoop components – HDFS and MapReduce – in mind, let’s take a quick look at how Hadoop addresses the business scenarios above.
Data Staging Area
Today, many organizations have a traditional data warehouse setup:
- Application data, such as ERP or CRM, is captured in one or more relational databases
- ETL tools then extract, transform and load this data into a data warehouse ecosystem (EDW, data marts, operational data stores, analytic sandboxes, etc.)
- Users then interact with the data warehouse ecosystem via BI and analytical tools
What if you used Hadoop to handle your ETL processing? You could write MapReduce jobs to load the application data into HDFS, transform it and then send the transformed data to the data warehouse. The bonus? Because of the low cost of Hadoop storage, you could keep both versions of the data in HDFS: the “before” application data and the “after” transformed data. Your data would all be in one place, making it easier to manage, re-process, and possibly analyze at a later date.
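To make that handoff concrete, here’s a minimal sketch of the map and reduce steps for one such transformation – rolling up per-customer totals from raw application extracts. The record layout and field names are invented for illustration, and the shuffle phase is simulated in plain Python; on a real cluster this logic would run as a Hadoop Streaming or Java MapReduce job.

```python
# Sketch of a MapReduce-style ETL transform (hypothetical CRM records).
# The map/shuffle/reduce phases are simulated here in plain Python.
from collections import defaultdict

def map_phase(line):
    """Map: parse one raw extract line into a (customer_id, amount) pair."""
    customer_id, _date, amount = line.split(",")
    yield customer_id, float(amount)

def reduce_phase(customer_id, amounts):
    """Reduce: aggregate total spend per customer for the warehouse load."""
    return customer_id, sum(amounts)

def run_job(lines):
    shuffled = defaultdict(list)              # simulated shuffle/sort
    for line in lines:
        for key, value in map_phase(line):
            shuffled[key].append(value)
    return dict(reduce_phase(k, v) for k, v in shuffled.items())

raw = ["c1,2014-01-02,10.00", "c2,2014-01-03,5.50", "c1,2014-01-05,2.50"]
print(run_job(raw))   # {'c1': 12.5, 'c2': 5.5}
```

Note that both the raw lines and the aggregated output could be written back to HDFS, which is what makes keeping the “before” and “after” data together so cheap.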
This particular Hadoop use case was quite popular early on. Some went so far as to call Hadoop the “ETL killer,” putting ETL vendors at risk and on the defensive. Fortunately, many of these vendors quickly responded with new HDFS connectors, making it easier for organizations to leverage their ETL investments in this new Hadoop world.
This use case is a good alternative if you’re experiencing rapid application data growth and/or you're having trouble getting all your ETL jobs to finish in a timely manner. Consider handing off some of this work to Hadoop - using your ETL vendor’s Hadoop/HDFS connector or MapReduce – to get ahead of your data, not behind it.
Data Processing
This second use case is a simple one that I heard about a couple of years ago during a Facebook presentation.
Instead of using costly data warehouse resources to update data in the warehouse, why not send the necessary data to Hadoop, let MapReduce do its thing, and then send the updated data back to the warehouse? The example Facebook used was updating each user’s mutual friends list on a regular basis. As you can imagine, this is a very resource-intensive process involving a lot of data – a job that is easily handled by Hadoop.
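The mutual-friends update happens to be a textbook MapReduce computation, so it’s worth sketching. Below is a hedged, simplified simulation in plain Python (the friend lists are invented, and the shuffle is simulated in memory): the map step keys each friendship on the sorted pair of names and attaches the person’s full friend list, and the reduce step intersects the two lists that arrive for each pair.

```python
# Sketch of the mutual-friends computation, simulated in plain Python.
# A real job would run as MapReduce on the cluster over far more data.
from collections import defaultdict

friends = {                      # hypothetical friend lists
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
}

def map_phase(person, friend_set):
    # Emit one record per friendship, keyed on the sorted name pair,
    # carrying the person's full friend list as the value.
    for friend in friend_set:
        yield tuple(sorted((person, friend))), friend_set

def run_job(friends):
    shuffled = defaultdict(list)                 # simulated shuffle
    for person, friend_set in friends.items():
        for key, value in map_phase(person, friend_set):
            shuffled[key].append(value)
    # Reduce: a pair's mutual friends are the intersection of their lists.
    return {pair: set.intersection(*lists)
            for pair, lists in shuffled.items() if len(lists) == 2}

print(run_job(friends)[("alice", "bob")])   # {'carol'}
```

Because each pair’s records can be reduced independently, the work spreads naturally across cheap Hadoop nodes instead of tying up the warehouse.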
This use case applies not only to data stored in your data warehouse, but also to data in any of your operational or analytical systems. Take advantage of Hadoop’s low-cost processing power so that your relational systems are freed up to do what they do best.
Data Archive
This third use case is very popular and pretty straightforward. Because Hadoop runs on commodity hardware that scales easily and quickly, organizations can now store and archive a lot more data at a much lower cost.
For example, what if you didn’t have to destroy data after its regulatory life just to save on storage costs? What if you could easily and cost-effectively keep all your data? Or maybe it’s not just about keeping the data on hand, but rather about being able to analyze more of it. Why limit your analysis to the last three, five or seven years when you can easily store and analyze decades of data? Isn't this a data geek’s paradise?
Don’t fall into the trap of believing that Hadoop is a big-data-only solution. It’s much more than that. Hadoop is a powerful open source technology that is fully capable of supporting and managing one of your organization’s greatest assets: your data. Hadoop is ready for the challenge. Are you?
A final note: In a few weeks, SAS will be releasing my latest white paper, A Non-Geek’s Big Data Playbook: Hadoop and the Enterprise Data Warehouse. This white paper will highlight eight Hadoop use cases, including the three mentioned above, and show how Hadoop can be used to complement your existing data warehouse ecosystem. Be sure to check it out.