In my last post, I noted that the flexibility of the schema-on-read paradigm typical of a data lake has to be tempered with a metadata repository, so that anyone wanting to use the data can figure out what is really in it. There are a few other implications of integrating data using the schema-on-read approach.
First, because the data is captured in its original form, no decisions are made about data standardization or transformation when a data set is acquired and initially stored. Instead, the data consumer imposes those decisions when the data is read. In fact, one benefit of this deferral is that different data consumers can apply different standardization and transformation rules based on their own requirements, which loosens some of the constraints imposed when a single set of standardizations and transformations is applied in the conventional schema-on-write approach.
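As a minimal sketch of what that deferral can look like in practice, consider a hypothetical raw CSV landed in the lake (the path and column names here are assumptions for illustration). Each consumer reads the same untouched file and applies only the standardization it needs at read time, using pandas:

```python
# Schema-on-read sketch: the raw file is stored as-is, and each consumer
# imposes its own standardization when the data is read.
# (File path and column names are hypothetical.)
import pandas as pd

RAW_PATH = "lake/raw/customers.csv"  # hypothetical landing location

def read_for_marketing(path: str = RAW_PATH) -> pd.DataFrame:
    """Marketing wants normalized emails and a two-letter country code."""
    df = pd.read_csv(path, dtype=str)              # no schema imposed at load time
    df["email"] = df["email"].str.strip().str.lower()
    df["country"] = df["country"].str.upper().str[:2]
    return df

def read_for_finance(path: str = RAW_PATH) -> pd.DataFrame:
    """Finance needs numeric balances and parsed dates, but leaves email alone."""
    df = pd.read_csv(path, dtype=str)
    df["balance"] = pd.to_numeric(df["balance"], errors="coerce")
    df["opened"] = pd.to_datetime(df["opened"], errors="coerce")
    return df
```

The point of the sketch is simply that the two readers can disagree about how the same raw bytes should be interpreted, and neither decision is baked into the stored data.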
There are a few potential drawbacks to this deferral of standardization and transformation:
- First, a naïve application that accesses the same data set multiple times may apply the same set of standardizations and transformations each time the data is accessed. You have to hope that the application developers are aware of this and ensure the work is done only once (see the sketch after this list).
- Second, different data consumers may believe their uses are distinct, yet all apply the same standardizations and transformations to the data for each application. In that case, there is bound to be some duplication of computation.
- Third, when the transformations are not the same, the consuming applications may assume their results are consistent when in fact the variant transformations create incoherence among different reports and analytical applications.
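One way to limit the repeated and duplicated work described above is to materialize the standardized result once in a curated zone and let downstream readers reuse that copy. The sketch below is one possible pattern, not a prescribed one; the paths and the standardize helper are hypothetical, and writing Parquet assumes an engine such as pyarrow is installed:

```python
# Materialize the standardized data once, then reuse it on later reads
# instead of re-running the same transformations against the raw file.
import os
import pandas as pd

RAW_PATH = "lake/raw/customers.csv"              # hypothetical paths
CURATED_PATH = "lake/curated/customers.parquet"

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """The agreed-upon standardization, applied in exactly one place."""
    df["email"] = df["email"].str.strip().str.lower()
    df["balance"] = pd.to_numeric(df["balance"], errors="coerce")
    return df

def load_customers() -> pd.DataFrame:
    """Read the curated copy if it exists; otherwise build it once from raw."""
    if os.path.exists(CURATED_PATH):
        return pd.read_parquet(CURATED_PATH)
    df = standardize(pd.read_csv(RAW_PATH, dtype=str))
    os.makedirs(os.path.dirname(CURATED_PATH), exist_ok=True)
    df.to_parquet(CURATED_PATH, index=False)
    return df
```

Because every consumer calls the same load function, the transformations run once and every downstream report sees the same standardized view, which also addresses the incoherence problem in the third bullet.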
The benefits and the drawbacks must be balanced against the expected value of the data lake. The ability to provide a high-capacity yet low-cost platform for generalized storage is very appealing, but the details of making that data accessible and usable should not be ignored. Big data integration via the data lake holds a lot of promise. As adoption grows, we must be aware of ways to optimize how standardizations and transformations are applied, so that we do not duplicate work or introduce incoherence, and so that all derivative downstream analyses share a consistent view.
SAS is a leader in the Gartner Magic Quadrant for data integration tools for the fifth consecutive year.