Big data modeling: An iterative approach - Part 1


In my prior two posts, I explored some of the issues associated with data integration for big data, particularly the conceptual data lake, in which source data sets are accumulated and stored, awaiting access from interested data consumers. One of the distinctive features of this approach is the transition from schema-on-write (in which ingested data is stored in a predefined representation) to schema-on-read (in which the data consumer imposes structure and semantics on the data as it is accessed).

It is important to note that data lake data is not “schema-less” – unless the data is truly a mish-mash of unstructured content, there is some structure or schema to the data, most likely the format that the originating source provided. I would also note that consuming the data requires the imposition of some schema for interpretation and use. On the other side, since different users may have different applications for the data, there may be multiple imposed schemas associated with the same source data set.
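To make the schema-on-read idea concrete, here is a minimal sketch in Python. The raw feed, the field names and both consumer "schemas" are hypothetical, invented purely for illustration; the point is that the same raw bytes sit in the lake untyped, and each consumer imposes its own structure only at the moment of access.

```python
import csv
import io

# Hypothetical raw feed as it might land in a data lake: no declared types,
# no agreed semantics, just the structure the originating source provided.
raw = """order_id,ts,amount
1001,2015-08-11,19.99
1002,2015-08-12,5.00
"""

# Consumer A imposes a "sales" schema at read time: ids as integers,
# amounts as floats, ready for revenue analysis.
sales = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
    for r in csv.DictReader(io.StringIO(raw))
]

# Consumer B imposes a different schema on the very same bytes:
# a daily event log, where the amount is irrelevant.
events = [
    {"day": r["ts"], "kind": "order"}
    for r in csv.DictReader(io.StringIO(raw))
]

print(sales[0]["amount"])  # 19.99
print(events[0]["day"])    # 2015-08-11
```

Neither consumer changed what was stored; each simply read the untyped source through its own lens, which is exactly why multiple imposed schemas can coexist over one data set.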

From a practical standpoint, this means there are multiple users simultaneously devising data models for consumption. And because of the focus on the utilization of the data instead of its initial storage, in some cases the process of data modeling for big data is somewhat different than the traditional entity/relationship modeling that data practitioners might be accustomed to.

The approach to modeling is an iterative one, driven by a combination of the data access processes and the analyst’s perception of utility. This can be illustrated by a simple thought experiment. As an analyst, you are presented with a data set for use. The first thing to do is to load the data into an analysis tool, review the data, and see whether the pressing business questions can be asked and whether the results adequately answer those questions.

If the answer is “yes,” then you are ready to analyze the data. However, if the answer is “no,” you must re-examine the way the data was accessed and how it was configured for use to see what changes can be made to improve its utility. Once those changes have been considered, reload the data into the revised consumption schema, then review the data again and see whether the pressing business questions can be asked and whether the results adequately answer those questions. For the next step, jump back to the beginning of this paragraph.
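The loop described above can be sketched in a few lines of Python. Everything here is a stand-in: `load`, `answers_business_questions` and `revise` are hypothetical placeholders for the analyst's tooling and judgment, and the tiny data set exists only so the loop has something to run against. It is a sketch of the control flow, not of any particular tool.

```python
def load(dataset, schema):
    # Stand-in for loading the raw data into an analysis tool under
    # the consumption schema the analyst has currently imposed.
    return [schema(record) for record in dataset]

def answers_business_questions(view):
    # Stand-in for the analyst's judgment: do the results adequately
    # answer the pressing business questions? Here the "question"
    # simply requires a usable numeric "amount" field.
    return all("amount" in row for row in view)

def revise(schema):
    # Stand-in for reworking how the data is accessed and configured:
    # the next schema derives a typed "amount" from the raw "amt" text.
    return lambda r: {**r, "amount": float(r.get("amt", 0))}

dataset = [{"amt": "19.99"}, {"amt": "5.00"}]
schema = lambda r: dict(r)  # first pass: take the data exactly as provided

# Review, revise, reload -- repeat until the view is fit for purpose.
while not answers_business_questions(load(dataset, schema)):
    schema = revise(schema)

view = load(dataset, schema)
```

The termination condition is the analyst's perception of utility, which is why the same raw data can exit this loop wearing different schemas for different analysts.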

As you can see, the process repeats until the analyst settles on a suitable data model for utilization. This process benefits the analyst by eliminating the drawbacks of a single conformed data warehouse data model. At the same time, it forces the analyst to take on a set of skills and responsibilities not previously required. And there is one further drawback: if many analysts are devising their own models, how can we ensure consistency in interpreting results? More on this in my next post…



About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at
