In my last post, I started to look at the use of Hadoop in general and the data lake concept in particular as part of a plan for modernizing the data environment. There are surely benefits to the data lake, especially when it's deployed using a low-cost, scalable hardware platform. The significant issue we began to explore is this: the more prolific you become at loading data into the data lake, the greater the chance that entropy will overtake any attempt at proactive management.
Let's presume that you plan to migrate all corporate data to the data lake. And the idea of the data lake is to provide a resting place for raw data in its native format until it's needed. Now, let’s imagine what you need to know when you decide that the data truly is needed:
- You need to know that the data exists.
- You need to know that the data is in a file in the data lake.
- You need to know which of the files contain the data you need.
- You need to know the format, if any, of the data.
- You need to know details about the storage layout.
- You need to know what other information is in the file.
- You need to know the security and protection characteristics for the data.
- You might want to know who created the file and when it was added to the data lake.
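To make the list above concrete, here is a minimal sketch of what one catalog record might capture for a single file in the data lake. All of the field names and the example values are illustrative, not part of any particular catalog product:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogEntry:
    """Descriptive metadata for one file in the data lake (names illustrative)."""
    path: str                 # where the file lives in the lake
    data_format: str          # e.g. "csv", "parquet", "json", or "unknown"
    storage_layout: str       # details about how the data is laid out
    contents_summary: str     # what other information is in the file
    security_class: str       # security and protection characteristics
    created_by: str           # who created / loaded the file
    added_on: datetime        # when it was added to the data lake
    tags: list = field(default_factory=list)  # keywords to support browsing

# A hypothetical entry for one raw sales extract:
entry = CatalogEntry(
    path="/lake/raw/sales/orders.csv",
    data_format="csv",
    storage_layout="single file, comma-delimited, header row",
    contents_summary="order id, customer id, amount, order date",
    security_class="internal",
    created_by="etl_batch",
    added_on=datetime(2015, 6, 1),
    tags=["sales", "orders"],
)
```

Even a flat record like this answers most of the questions above; the harder part, as discussed next, is keeping such records granular and current.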
In other words, you need to know a lot about that data. And here is the most confusing part: you may not even know which data is the data you want! That is part of the promise of the data lake – data is kept around until someone needs it, and it's up to the data consumer to determine what data they need, when they need it.
Data catalogs for the data lake
In reality, the simplistic approach to the data lake just won’t work. You need a means for creating a catalog of the data in the data lake so that data consumers have a way to browse through the inventory of data assets to determine which are usable for a particular application or analysis.
This data catalog would have to capture all of the aforementioned metadata. In fact, the metadata should be much more granular than that list suggests. For example, if the file contains structured data, the metadata should show the structure, the data element names, and the kinds of data that are really in each data element – because sometimes the names and values don’t match up. Setting this up requires some forethought, and ongoing effort is needed to keep the data catalog up to date.
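The "names and values don't match up" problem can be caught mechanically. The sketch below, with purely illustrative heuristics and sample data, classifies the observed values in each data element so a declared name like "zip_code" can be flagged when the contents turn out to be free text:

```python
def infer_kind(values):
    """Classify a column's observed values as integer, decimal, text, or mixed."""
    def kind(v):
        try:
            int(v)
            return "integer"
        except ValueError:
            try:
                float(v)
                return "decimal"
            except ValueError:
                return "text"
    kinds = {kind(v) for v in values}
    return kinds.pop() if len(kinds) == 1 else "mixed"

# Hypothetical sample: the element named "zip_code" actually holds free text.
observed = {"customer_id": ["1001", "1002"], "zip_code": ["see notes", "n/a"]}
profile = {name: infer_kind(vals) for name, vals in observed.items()}
# profile records "integer" for customer_id but "text" for zip_code,
# exposing the mismatch between the element's name and its contents.
```

A real profiler would look at far more (patterns, value distributions, null rates), but the principle is the same: record what the data actually is, not just what it is called.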
Data preparation tools can help
One approach is to use existing technologies to help keep this metadata catalog up to date. Data preparation tools can use profiling to infer data types of managed files, while a collaborative metadata repository can document how different data consumers have been using different files for a variety of applications. Automated surveying tools can be used to compare the content of different files, determine when there are multiple copies of the same data, or even where there is a discernable lineage trail linking one early version of a file or spreadsheet to a number of later versions. Search tools can be used, among other things, to index all the phrases within each file.
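The duplicate-detection idea in particular is simple to automate. As a minimal sketch (not any specific surveying tool), files can be fingerprinted by a content hash and grouped, so byte-identical copies of the same data surface immediately:

```python
import hashlib
from collections import defaultdict

def fingerprint(path, chunk=65536):
    """Content hash of a file, read in chunks so large files are safe."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    """Group paths whose contents are byte-identical copies of each other."""
    groups = defaultdict(list)
    for p in paths:
        groups[fingerprint(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Tracing a lineage trail between an early file and later, modified versions is harder – it needs fuzzier comparison than an exact hash – but exact-copy detection alone can shrink the inventory the catalog has to describe.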
Effective modernization cannot rely solely on ungoverned population of the data lake. There must be a way to map each file to its respective metadata, documenting format and content. If you consider this before opening the “data chute” into the data lake, it will prevent a lot of headaches down the road.