Understanding data lakes

This month, Analytics: The Agile Way hits the shelves. In a nutshell, the book argues that organizations need to rethink the way they do analytics. Long gone are the days in which they can afford to take a year or more to capture data, store it, analyze it and then act upon it.

To this end, I came across the notion of a data lake in my research. In fact, two groups of students in my analytics capstone course at ASU worked with a data-lake company on developing ontologies. In this post, I'll briefly introduce the notion of a data lake and contrast it with traditional data warehouses.

Differences between data lakes and data warehouses

At a high level, traditional, hierarchical data warehouses and data marts store data in files or folders. That is, the data does not exist in its natural state. Rather, it typically arrives in one of these repositories via a batch extract, transform, load (ETL) process. Once there, the data conforms to a predefined, static schema – aka “schema on write.” (In many instances, this star or snowflake schema adapts poorly to changing business needs.)
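To make schema on write a bit more concrete, here is a minimal sketch in Python using the standard library's sqlite3 module as a stand-in for a warehouse. The orders table, the field names and the load_orders helper are all hypothetical and purely illustrative; a real warehouse and ETL tool would look quite different.

```python
import sqlite3

# Hypothetical source records; the field names are illustrative only.
raw_orders = [
    {"order_id": 1, "customer": "Acme", "amount": "19.99", "channel": "web"},
    {"order_id": 2, "customer": "Globex", "amount": "5.00", "coupon": "SPRING"},
]


def load_orders(conn, records):
    """Batch ETL: reshape each record to fit the fixed schema, then load it."""
    # The schema is decided up front, before any data arrives.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    # Transform step: cast types and keep only the columns the schema knows.
    rows = [(r["order_id"], r["customer"], float(r["amount"])) for r in records]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_orders(conn, raw_orders)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
print(columns)
```

Note that the channel and coupon fields never reach the table. If a business question about coupons comes up later, someone has to change the schema and rerun the load.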

The data lake is a relatively new concept. It has risen in lockstep with Hadoop, other distributed storage platforms and NoSQL databases. Unlike data warehouses, data lakes are vast data storage repositories that contain raw data in its native format. The data is:

Unaltered by ETL processes.
Not crammed into a preexisting schema that may make users’ reporting and analytics difficult.

What's more, the data remains in this raw state unless and until users need to access it. To this end, data lakes store data through the use of a much flatter architecture. As such, the data is more like a body of water in its natural state – hence the term data lake.

Unlike data warehouses, a data lake's schema and data requirements are not defined until the data is queried. Put differently, data lake schemas are on-read, not on-write. As Margaret Rouse writes:

Each data element in a lake inherits a unique identifier tagged with an extended set of metadata tags. When a business question arises, users can query the data lake for relevant data. The end goal: those users can analyze that smaller data set to help answer the question.
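Here is a rough sketch of what schema on read might look like, again in plain Python. It assumes a toy in-memory list standing in for the lake; the ingest and query helpers, the tags and the sample payloads are all hypothetical. Each element is stored untouched with a unique identifier and a set of metadata tags, and a schema is applied only when someone asks a question.

```python
import json
import uuid

lake = []  # toy stand-in for flat storage; a real lake would use object storage


def ingest(raw_text, tags):
    """Store the payload untouched, with a unique ID and metadata tags."""
    element = {"id": str(uuid.uuid4()), "tags": set(tags), "raw": raw_text}
    lake.append(element)
    return element["id"]


def query(required_tags, schema):
    """Select elements by tag, then apply a schema only at read time."""
    results = []
    for element in lake:
        if set(required_tags) <= element["tags"]:
            record = json.loads(element["raw"])
            # Keep and cast only the fields this particular question needs.
            results.append(
                {name: cast(record[name]) for name, cast in schema.items() if name in record}
            )
    return results


ingest('{"order_id": 1, "amount": "19.99", "channel": "web"}', ["sales", "orders"])
ingest('{"sensor": "t-7", "reading": "21.5"}', ["iot"])
ingest('{"order_id": 2, "amount": "5.00", "coupon": "SPRING"}', ["sales", "orders"])

# Two different questions, two different schemas, same raw data.
sales = query(["sales"], {"order_id": int, "amount": float})
by_channel = query(["sales"], {"order_id": int, "channel": str})
print(sales)
print(by_channel)
```

Because nothing was discarded at load time, the second query can still reach the channel field even though the first one ignored it.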

Data lake uses

With the requisite background out of the way, let's get more specific.

A data lake serves as an environment in which users can quickly and easily access vast amounts of raw data. They can develop and provide analytics models with the ultimate intent of moving them into production. Beyond that, data lakes can serve as analytics sandboxes – i.e., ways of doing data discovery without adversely affecting downstream systems. Finally, some organizations are using data lakes as enterprisewide catalogs that help users find data and link business terms with technical metadata.
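The catalog idea is easy to picture with a small sketch. The one below assumes a hypothetical CatalogEntry record that pairs technical metadata (a storage location and a file format) with the business terms people actually search for. The paths and terms are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    location: str          # technical metadata: where the raw data lives
    data_format: str       # technical metadata: how it is encoded
    business_terms: set = field(default_factory=set)  # vocabulary users know


# Hypothetical catalog contents.
catalog = [
    CatalogEntry("lake/raw/orders/", "json", {"revenue", "customer"}),
    CatalogEntry("lake/raw/clickstream/", "csv", {"web traffic"}),
    CatalogEntry("lake/raw/crm_export/", "parquet", {"customer", "churn"}),
]


def find_datasets(term):
    """Return the storage locations linked to a business term."""
    term = term.lower()
    return [entry.location for entry in catalog if term in entry.business_terms]


customer_sources = find_datasets("Customer")
print(customer_sources)
```

A user who knows only the word "customer" can find both relevant datasets without knowing where they live or how they are stored.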

Simon Says

Expect data lakes to continue to rise in popularity as demand for real-time data and analytics grows.

Feedback

What say you?
