Analyzing the data lake

In my previous post, I used junk drawers to illustrate the downside of including more data in our analytics just in case it helps us discover more insights: we often end up with more flotsam than findings. In this post, I want to float some thoughts about a two-word concept that is becoming almost as prevalent as big data and sounds scarily close to dumping all enterprise data into a junk drawer. That concept is the data lake, which was the topic of the great Data Lake Debate blog series between Tamara Dull and Anne Buff, moderated by Jill Dyché.

Let’s start with a definition. According to Dull, a data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. Any and all data, therefore, can be captured and stored in a data lake, and data structure and business requirements do not have to be defined until the data is needed. In theory, a data lake enables the enterprise to store, process, and analyze all of its data, allowing the business to ask more questions and get better answers.
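To make the "define the structure only when the data is needed" part of that definition concrete, here is a minimal Python sketch of the idea, often called schema-on-read. It is only an illustration under stated assumptions: the sample records, the field names, and the `read_with_schema` helper are all hypothetical and do not come from the debate posts or from any particular data lake product.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (native format).
# Nothing was validated or reshaped when they were stored.
RAW_EVENTS = [
    '{"user": "a1", "action": "view", "ms": 120}',
    '{"user": "b2", "action": "purchase", "amount": "19.99"}',
    "free-text note: customer called about invoice",
]


def read_with_schema(raw_lines, fields):
    """Apply a structure at read time.

    Keeps only the JSON objects that contain every requested field and
    returns them trimmed to those fields. The raw lines are not modified.
    """
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; it simply stays in the lake as raw text
        if isinstance(record, dict) and all(f in record for f in fields):
            rows.append({f: record[f] for f in fields})
    return rows


# Two different questions, two different structures, same raw data.
purchases = read_with_schema(RAW_EVENTS, ["user", "amount"])
actions = read_with_schema(RAW_EVENTS, ["user", "action"])
```

Each call imposes its own structure on the same untouched raw records, so a new question only requires a new read, not a new load. The free-text line is ignored by both reads but is still there if someone later finds a use for it.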

As Dull described it, data management has evolved from an era in which we wanted to store and process any and all data but could not, because of cost and technology limitations, to the era of big data, in which technologies like Hadoop make doing so cost-effective but leave us questioning whether we should. The central question is whether collecting and storing data without a pre-defined business purpose is a good idea.

Buff rebuffed the idea by stating that collection without purpose is hoarding, and that creating a data lake could prove detrimental to enterprises that do so prematurely. She argued that the key characteristic of a data lake is that it is a storage repository, and storing data does not, on its own, provide business value. Data must be processed, packaged, delivered, and consumed before it can provide business value. (Read David Loshin's SAS Insights article about a "problem-solver" approach to data preparation.)

The advantage of bringing all this data into a lake is the business insights that can be derived from analyzing it, but before analysis must come integration. “Data brought into a data lake is co-located not integrated,” Buff emphasized. “The integration happens outside of the storage environment—on the banks of this beautiful data lake.”

Dull’s sharp counterpoint is that one of the values of co-locating any and all raw data in one place, like a data lake, is that we can “process it any way we want, and then store the processed results anywhere we want. And if we don’t like the results, or we have new data, or we have different questions, it’s no big deal to go back to the original, raw data and start over. A data lake is a newer storage alternative for organizations that want to mix-and-match their data so that they can analyze it and discover insights that they would never be able to find with existing, relational technologies.”

“In order for an organization to take full advantage of its data,” Buff concluded, “the organization must first develop a strategic, enterprise level understanding and use of the data it has.”

“An organization will be able to take full advantage of its data,” Dull concluded, “if there’s a way for them to bring it all together without breaking the bank. The data lake provides that opportunity.”

What say you?

I recommend reading all of the posts in the Data Lake Debate blog series for a deeper dive into the pros and cons of this concept. Please also share your perspective about the data lake by posting a comment below.
