In my prior two posts, I explored some of the issues associated with data integration for big data, particularly the conceptual data lake in which source data sets are accumulated and stored, awaiting access from interested data consumers. One of the distinctive features of this approach is the transition from schema-on-write (in which ingested data is stored in a predefined representation) to schema-on-read (where the data consumer imposes the structure and semantics on the data as it is accessed).

It is important to note that data in the data lake is not “schema-less” – unless the data is truly a mish-mash of unstructured content, there is some structure or schema to the data, most likely the format that the originating source provided. I would also note that consuming the data requires some imposition of a schema for interpretation and use. On the other side, since different users may have different applications for the data, there may be multiple imposed schemas associated with the same source data set.
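To make that distinction concrete, here is a minimal sketch of schema-on-read using Python and pandas. The file contents, column names, and types are my own illustrative assumptions, not drawn from any particular source system; the point is simply that the same raw bytes can be read under different imposed schemas.

```python
# A minimal schema-on-read sketch; the raw data and column types are
# illustrative assumptions, not from any real source system.
import io
import pandas as pd

# The "raw" data as it landed in the lake: just delimited text, no types.
raw = io.StringIO(
    "cust_id,txn_ts,amount,region\n"
    "1001,2017-03-01T10:15:00,25.50,NE\n"
    "1002,2017-03-01T11:42:00,7.25,SW\n"
)

# Consumer A imposes a schema suited to financial analysis:
# numeric amounts, parsed timestamps, region as a category.
consumer_a = pd.read_csv(
    raw,
    dtype={"cust_id": "int64", "amount": "float64", "region": "category"},
    parse_dates=["txn_ts"],
)

# Consumer B reads the very same bytes but imposes a different schema:
# everything as strings, because B only needs to count distinct values.
raw.seek(0)
consumer_b = pd.read_csv(raw, dtype=str)

print(consumer_a.dtypes)
print(consumer_b.dtypes)
```

Same source data set, two consumption schemas – which is exactly the situation described above.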
From a practical standpoint, this means there are multiple users simultaneously devising data models for consumption. And because the focus is on the utilization of the data rather than its initial storage, in some cases the process of data modeling for big data differs from the traditional entity/relationship modeling that data practitioners might be accustomed to.
The approach to modeling is an iterative one, driven by a combination of the data access processes and the analyst’s perception of utility. This can be illustrated by a simple thought experiment. As an analyst, you are presented with a data set for use. The first thing to do is to load the data into an analysis tool, review it, see whether the pressing business questions can be asked, and check whether the results adequately answer those questions.
If the answer is “yes,” then you are ready to analyze the data. However, if the answer is “no,” you must re-examine the way the data was accessed and how it was configured for use to see what changes can be made to improve its utility. Once those changes have been made, reload the data into the revised consumption schema, then review the data, see whether the pressing business questions can be asked, and check whether the results adequately answer those questions. For the next step, jump back to the beginning of this paragraph.
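Sketched as code, that loop might look something like the following. This is a hedged illustration only: the file name, the schema structure, and the acceptance check are hypothetical stand-ins for the analyst’s own tooling and judgment, not anyone’s prescribed method.

```python
# A sketch of the iterate-and-refine loop described above. The helpers,
# file name, and schema structure are hypothetical placeholders.
from pathlib import Path
import pandas as pd

# Stand-in for raw data already sitting in the lake.
Path("transactions.csv").write_text(
    "cust_id,txn_ts,amount\n1001,2017-03-01T10:15:00,25.50\n"
)

def load_with_schema(path: str, schema: dict) -> pd.DataFrame:
    """Impose the candidate consumption schema while reading the raw data."""
    return pd.read_csv(path, dtype=schema.get("dtypes"),
                       parse_dates=schema.get("dates", []))

def answers_business_questions(df: pd.DataFrame) -> bool:
    """Stand-in for the analyst's judgment: here, the pressing questions can
    only be asked once transaction timestamps are usable as real datetimes."""
    return pd.api.types.is_datetime64_any_dtype(df["txn_ts"])

def revise_schema(schema: dict) -> dict:
    """Stand-in for rethinking how the data is accessed and configured."""
    revised = dict(schema)
    revised.setdefault("dates", []).append("txn_ts")  # e.g. parse timestamps
    return revised

schema = {"dtypes": {"cust_id": "int64", "amount": "float64"}}
for attempt in range(5):                               # bound the loop
    df = load_with_schema("transactions.csv", schema)
    if answers_business_questions(df):
        break                                          # ready to analyze
    schema = revise_schema(schema)                     # reload under a revised schema
```

The first pass fails the check, the schema is revised, and the second pass succeeds – a toy version of the review-revise-reload cycle.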
As you can see, the process repeats itself until the analyst settles on a suitable data model for utilization. This process benefits the analyst by eliminating the drawbacks of a single conformed data warehouse data model. On the other hand, it forces the analyst to take on a set of skills and responsibilities not previously required. And there is one remaining drawback: if many analysts are devising their own models, how can we ensure consistency in interpreting results? More on this in my next post…
SAS is a leader in the Gartner Magic Quadrant for Data Integration Tools for the fifth consecutive year.