Data integration considerations for the data lake: Standardization and transformation


In my last post, I noted that the flexibility of the schema-on-read paradigm typical of a data lake must be tempered with the use of a metadata repository, so that anyone wanting to use the data can figure out what is really in it. The schema-on-read approach to data integration has a few other implications as well.

First, because the data is captured in its original form, no decisions about standardization or transformation are made when a data set is acquired and initially stored. Instead, the data consumer imposes those decisions when the data is read. In fact, one benefit of this deferral is that different data consumers can apply different standardization and transformation rules based on their own requirements, which loosens some of the constraints imposed when a single set of standardizations and transformations is applied in the conventional schema-on-write approach.
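To make the deferral concrete, here is a minimal sketch of how two consumers might read the same raw records and apply their own rules at read time. The record fields, function names and rules are hypothetical, chosen only to illustrate the pattern:

```python
# Hypothetical sketch: under schema-on-read, each consumer applies its own
# standardization when the raw record is read, not when it is stored.
raw_records = [
    {"cust_name": " Smith, JOHN ", "phone": "(301) 555-0142"},
    {"cust_name": "jane doe", "phone": "301.555.0199"},
]

def marketing_view(record):
    """Marketing wants display-ready names; phone format is irrelevant."""
    return {"name": record["cust_name"].strip().title()}

def compliance_view(record):
    """Compliance wants a canonical 10-digit phone string for matching."""
    digits = "".join(ch for ch in record["phone"] if ch.isdigit())
    return {"phone": digits}

# The same untouched raw data serves both consumers, each with its own rules.
marketing_data = [marketing_view(r) for r in raw_records]
compliance_data = [compliance_view(r) for r in raw_records]
```

Note that neither view alters the stored data; the lake keeps the original form, and each consumer's schema exists only at read time.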

There are a few potential drawbacks of this deferral of standardization and transformations:

  • First, a naïve application that accesses the same data set multiple times may reapply the same standardizations and transformations on every access. You have to hope the application developers are aware of this and ensure the work is done only once.
  • Second, different data consumers may believe their uses are distinct, yet all apply the same standardizations and transformations for each application use. In this case, there is bound to be some duplication of computation.
  • Third, when the transformations are not the same, consuming applications may believe their uses are consistent when, in fact, the variant transformations introduce incoherence among different reports and analytical applications.
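The first two drawbacks share a mitigation: cache the result of a standardization step so repeated reads of the same raw value do the transformation work only once. A minimal sketch, using Python's standard `functools.lru_cache` and a hypothetical phone-standardization rule:

```python
from functools import lru_cache

# Hypothetical sketch: memoize a standardization step so a naive consumer
# that reads the same raw value repeatedly pays the transformation cost once.
calls = {"count": 0}

@lru_cache(maxsize=None)
def standardize_phone(raw: str) -> str:
    calls["count"] += 1  # counts how many times the real work is done
    return "".join(ch for ch in raw if ch.isdigit())

# Three reads, but only two distinct raw values, so the transformation
# executes twice; the repeated value is served from the cache.
reads = ["(301) 555-0142", "(301) 555-0142", "301.555.0142"]
results = [standardize_phone(r) for r in reads]
```

Caching within one application addresses the first drawback; sharing the standardized output (or the rule itself) across applications, for example via the metadata repository, addresses the second.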

The benefits and the drawbacks must be balanced against the expected value the data lake provides. The ability to offer a high-capacity yet low-cost platform for generalized storage is very appealing, but the details of making that data accessible and usable should not be ignored. Big data integration via the data lake holds a lot of promise. As adoption grows, we must be aware of ways to optimize the application of standardizations and transformations so as to avoid duplicating work, prevent incoherence and maintain a consistent view for all derivative downstream analysis.

SAS is a leader in Gartner Magic Quadrant for data integration tools for the fifth consecutive year.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at
