When we first started talking about big data a few years ago, there were a few questions I'd often ask people. For example: “What does big data mean to you?” Nine out of ten people would say it was the volume or amount of data to be analyzed. I would reply (because I can never keep my mouth shut), “What about the diversity of data from disparate sources being analyzed together with tools that help us find answers to questions we didn't know to ask?” Well, this concept was harder to understand.
How does Hadoop work with big data?
Hadoop enables massively distributed processing of unstructured data using commodity computer clusters where each node (think processor) has its own storage. One of the first programming models associated with processing large data sets on these clusters was MapReduce. I saw it used in many algorithms, especially those requiring analysis of customer information. But times are changing for big data.
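To make the MapReduce idea a bit more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any Hadoop API; the customer records and the function names (map_phase, shuffle, reduce_phase, run_job) are my own made-up illustrations. On a real cluster, the framework would run the map and reduce steps across many nodes and handle the grouping in between.

```python
from collections import defaultdict

# Hypothetical sample records: (customer_id, purchase_amount)
records = [
    ("alice", 30.0),
    ("bob", 12.5),
    ("alice", 20.0),
    ("carol", 7.5),
    ("bob", 2.5),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    customer_id, amount = record
    yield customer_id, amount


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(input_records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for record in input_records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = run_job(records)
print(totals)  # total purchase amount per customer
```

The point of the split is that each map call looks at one record and each reduce call looks at one key, so the work can be spread across nodes without the pieces needing to know about each other.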
These days there are many more factors to consider when talking about big data. Not only do we have to consider the volume of the data, but designing for timely delivery of the data is also important. Let’s take an example where data streams from a web portal into a Hadoop cluster. The date-time stamp design needs to accept the incoming data and support queries for both the first record and the latest record. Basically, it needs to be designed for real-time data, based on how quickly the consumer needs it.
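Here is one way the time stamp idea might look, as a small Python sketch. It is an illustration under my own assumptions, not a Hadoop feature: the Event and EventLog names are hypothetical, the events are invented, and a real design would persist the records rather than hold them in memory. It keeps track of the first and latest record by event time, even when records arrive out of order.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_time: datetime  # when the event happened on the portal (UTC)
    payload: str


class EventLog:
    """Tracks the earliest and latest events seen, regardless of arrival order."""

    def __init__(self) -> None:
        self.first: Optional[Event] = None
        self.latest: Optional[Event] = None
        self.count = 0

    def accept(self, event: Event) -> None:
        """Accept one incoming event and update the first/latest markers."""
        if self.first is None or event.event_time < self.first.event_time:
            self.first = event
        if self.latest is None or event.event_time >= self.latest.event_time:
            self.latest = event
        self.count += 1


log = EventLog()
# Hypothetical portal events, arriving slightly out of order.
for ts, payload in [
    (datetime(2024, 1, 15, 9, 0, 5, tzinfo=timezone.utc), "page_view"),
    (datetime(2024, 1, 15, 9, 0, 1, tzinfo=timezone.utc), "login"),
    (datetime(2024, 1, 15, 9, 0, 9, tzinfo=timezone.utc), "checkout"),
]:
    log.accept(Event(ts, payload))

print(log.first.payload, log.latest.payload, log.count)
```

Storing time stamps in UTC with an explicit time zone is a design choice I would lean toward here, since data from a web portal can come from users in many time zones.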
Due to the diversity in the types of data we call big data (examples include email, financial transactions and text documents), you can’t just get the data in. You must understand how this data will be consumed or presented for analysis and usage. Consider asking this question: How will this data be presented to my business users?
Load management will become increasingly challenging too, as more diverse data is loaded at peak times. Different initiatives may need different data, or they may need some data that we already have in Hadoop. We still need requirements, even in a fast-moving, cheaper environment.
Pick the right tools
If your organization has tools for data ingestion, consider a proof-of-concept to make sure these tools will work for your implementation. Base your analysis on the data consumption/usage requirements. I don’t think we want tons of lines of code to maintain in the future – and we need a way to understand what data is already in Hadoop, and how to use it.
New tools for reporting and analytics are becoming available almost daily. In fact, some of our standard BI reporting and analytic tools are ready to deploy on Hadoop now.
I'm seeing more and more complex data presented for our Hadoop platforms to consume and analyze. This complexity brings its own set of problems – problems related to cleansing, transformation and general data management principles. The tools available for our big data platforms evolve quickly and often.
People refer to a Hadoop data lake as a big storage device. But really, it's a data management platform that uses the Hadoop cluster to process and store the data. That statement implies tools to help us understand and manage this data. So if you are just dumping data into Hadoop storage, without following any data management principles, you may have a "fake lake." But if you're storing data in an organized fashion, with tools to ease the programming and management pain, you may indeed have a data lake.
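As a rough illustration of "storing data in an organized fashion," here is a short Python sketch that builds date-partitioned storage paths and records each data set in a simple catalog. Everything in it is an assumption made for the example: the /data/lake root, the partition_path and register helpers, and the catalog fields. The key=value directory naming follows a convention commonly used for partitioned data on Hadoop, but your tools may expect something different.

```python
from datetime import date
from pathlib import PurePosixPath

LAKE_ROOT = PurePosixPath("/data/lake")  # hypothetical root directory

# A very small in-memory catalog: (source, dataset) -> metadata.
catalog = {}


def partition_path(source, dataset, day):
    """Build a date-partitioned path such as .../year=2024/month=01/day=15."""
    return (
        LAKE_ROOT
        / source
        / dataset
        / f"year={day.year:04d}"
        / f"month={day.month:02d}"
        / f"day={day.day:02d}"
    )


def register(source, dataset, day, owner, description):
    """Record where a day's data lives and who is responsible for it."""
    path = partition_path(source, dataset, day)
    entry = catalog.setdefault(
        (source, dataset),
        {"owner": owner, "description": description, "partitions": []},
    )
    entry["partitions"].append(str(path))
    return path


# Hypothetical data set, registered for two days.
register("web_portal", "clicks", date(2024, 1, 15), "marketing", "Raw click events")
register("web_portal", "clicks", date(2024, 1, 16), "marketing", "Raw click events")

print(catalog[("web_portal", "clicks")]["partitions"])
```

Even a catalog this small answers the two questions that matter most later on: what data is already in Hadoop, and who can tell me how to use it.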
So, does Hadoop equal big data? The answer is yes. But don't forget all the other things you need to consider as you undertake each of your projects.
2 Comments
Joyce... From my point of view, the statement: "Hadoop enables massively distributed processing of unstructured data" tends to confuse people, making them think that Hadoop is only for unstructured data. There are a lot of people who say the same thing. But I think Hadoop is as good for structured data as it is for storing and analyzing unstructured data. Do you agree with me? Or is something wrong with my reasoning? Thanks for your blog. Sergio
Sergio - great comment! I am seeing that there are companies using it for both. I think structured data in a Hadoop environment is maturing faster than I can write a BLOG! LOL So I do agree with you, and consider design of how to use that data very, very important.
Thank you for your comment.