Data integration considerations for the data lake: The need for metadata


A few of our clients are exploring the use of a data lake as both a landing pad and a repository for a collection of enterprise data sets. However, after probing a bit into what they expected to do with this data lake, I found that the simple use of the data lake appellation masks a large part of the complexity of the change required to transition from the conventional approach to data acquisition and storage to the data lake concept.

According to TechTarget, a data lake is “a large object-based storage repository that holds data in its native format until it is needed.” A data lake is often implemented using big data technologies, such as Hadoop, where the data sets are stored using the Hadoop Distributed File System (HDFS), which manages the distribution of and access to the data. Other alternatives include NoSQL data management environments that can scale through horizontal or vertical partitioning, or “sharding.”

Practically speaking, the concept of the data lake is a place for collecting data sets in their original format and making them available to different consumers, formatted in the ways that the data consumers need. Doing this requires a fundamental change in the way one looks at data storage, characterized by the difference between what is called schema-on-write versus schema-on-read.

Schema-on-write is the more conventional approach to data management, in which data sets are loaded into a predefined schema or structure with an organized data model, such as a data warehouse. The alternative approach, schema-on-read, is when the data is (presumably) stored in its original format and is reconfigured into a target data model when it is read. This approach contributes to the proposed benefits of the data lake, which center on the ability of multiple data consumers to use the same data sets for different purposes.
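The schema-on-read idea can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API: the raw file, the field names, and the consumer schemas below are all hypothetical. The point is that the raw data lands untouched, and each consumer applies its own target structure only at read time.

```python
import csv
import io

# Hypothetical raw file as it landed in the lake, in its native format.
RAW_ORDERS = """order_id,order_date,amount,customer
1001,2014-06-01,250.00,Acme Corp
1002,2014-06-02,75.50,Globex
"""

def read_with_schema(raw_text, schema):
    """Schema-on-read: structure is imposed when the data is read, not
    when it is written. Each consumer supplies its own schema, mapping
    output column name -> (source column, type converter)."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        rows.append({out: conv(record[src])
                     for out, (src, conv) in schema.items()})
    return rows

# Two consumers reconfigure the same raw data into different target models.
finance_schema = {"id": ("order_id", int), "revenue": ("amount", float)}
crm_schema = {"customer": ("customer", str), "last_order": ("order_date", str)}

finance_view = read_with_schema(RAW_ORDERS, finance_schema)
crm_view = read_with_schema(RAW_ORDERS, crm_schema)
```

Contrast this with schema-on-write, where the equivalent of `finance_schema` would have been baked into the warehouse table definition before the data was loaded, and the CRM consumer would be stuck with it.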

The first challenge of data integration using schema-on-read relates to the absence of a defined table structure. This means that to integrate the data as part of a reporting or analysis application, you need to already know what is in the data and then figure out how that data needs to be configured for your specific use. And, since anyone can use that data, everyone needs to know how to configure the data!

In other words, any prospective consumer of a data set must be aware of the contents of the file. Short of an encyclopedic memory, that implies the need for a way to capture the structural and semantic metadata associated with each data set. Any prospective user would then consult this metadata inventory before any use to determine the contents.
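Such a metadata inventory can be as simple as a registry that records, per data set, both the structural metadata (field names and types) and the semantic metadata (what the fields mean). The sketch below assumes a toy in-memory registry; the entry names, paths, and fields are illustrative, not drawn from any real catalog product.

```python
from dataclasses import dataclass, field

@dataclass
class DataSetMetadata:
    path: str                                      # where the raw set lands in the lake
    fmt: str                                       # native format: csv, json, avro, ...
    fields: dict = field(default_factory=dict)     # structural: field name -> type
    semantics: dict = field(default_factory=dict)  # semantic: field name -> meaning

inventory = {}

def register(name, meta):
    """Capture metadata at the moment a data set lands in the lake."""
    inventory[name] = meta

def describe(name):
    """What a prospective consumer consults before any use."""
    meta = inventory[name]
    return {"format": meta.fmt, "fields": meta.fields, "meaning": meta.semantics}

# Hypothetical entry for an orders file landed in its original CSV form.
register("orders", DataSetMetadata(
    path="/lake/landing/orders.csv",
    fmt="csv",
    fields={"order_id": "int", "order_date": "date", "amount": "decimal"},
    semantics={"amount": "gross order value in USD, tax included"},
))
```

The semantic entries matter as much as the structural ones: two consumers can agree that `amount` is a decimal and still disagree about whether it includes tax unless the inventory says so.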

Despite the increasing talk about data lakes, it's rare to hear about metadata being a core enabling technology. As we continue these discussions, it's becoming clear that the cost for the flexibility of schema-on-read must be paid for with an increased investment in metadata.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at
