I am currently iterating through a schema-on-read data modeling process for one of my clients. I have been presented with a data set and asked to consider how that data can best be analyzed using a graph-based data management system. My process is to load the data, examine whether I have created the right graph representation, execute a few queries, and then revise the model. I think I am almost done, except that as I continue to manipulate the model for analysis, I keep noticing one more thing about the data that I need to tweak before I can really start the analysis.
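To make that loop concrete, here is a minimal sketch in Python, using networkx as a stand-in graph library; the records, node labels, and property names are invented for illustration and are not my client's actual data or tooling.

```python
# A stand-in sketch of the load / inspect / query / revise loop described above.
import networkx as nx

# 1. Load: raw records read "as is" -- no upfront schema.
records = [
    {"customer": "acme", "product": "widget-a", "region": "east", "qty": 12},
    {"customer": "acme", "product": "widget-b", "region": "east", "qty": 3},
    {"customer": "zenith", "product": "widget-a", "region": "west", "qty": 7},
]

# 2. First-cut model: customers and products as nodes, purchases as edges.
g = nx.MultiDiGraph()
for r in records:
    g.add_node(r["customer"], kind="customer", region=r["region"])
    g.add_node(r["product"], kind="product")
    g.add_edge(r["customer"], r["product"], kind="purchased", qty=r["qty"])

# 3. Query: which products are bought in more than one region?
regions_per_product = {}
for c, p, attrs in g.edges(data=True):
    regions_per_product.setdefault(p, set()).add(g.nodes[c]["region"])
print({p: regs for p, regs in regions_per_product.items() if len(regs) > 1})

# 4. Revise: the query suggests "region" deserves to be its own node type
#    rather than a customer property, so the model gets tweaked and reloaded.
for c, attrs in list(g.nodes(data=True)):
    if attrs.get("kind") == "customer":
        g.add_node(attrs["region"], kind="region")
        g.add_edge(c, attrs["region"], kind="located_in")
```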
As soon as my modeling process is done, I would like to share the results with others interested in analyzing the same data. While they may want to adjust the model somewhat, the fact that I have already spent the time assessing what is in the data set and how it can be represented in the graph model means that most other analysts won't need to duplicate my effort.
If you recall my previous post, I suggested that assigning the role of data modeler to a data analyst may be a drawback to the use of a schema-less data lake unless there is some way to collaborate on the modeling process. There are two aspects to creating this collaboration. The first is knowledge capture, in which the entities, characteristics, and relationships of the consumer's data model are documented within a metadata repository shared among the user community. This complements the use of a metadata repository for source data information (see my earlier post on the use of metadata).
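To give a sense of what such a shared entry might capture, here is a hypothetical sketch; the field names, the placeholder source path, and the in-memory "repository" are purely illustrative and not tied to any particular metadata product.

```python
# A hypothetical metadata-repository entry documenting one consumer's graph model.
# A real repository would also track versions, owners, and lineage back to the
# source-data metadata mentioned above.
consumer_model = {
    "name": "customer_purchases_v1",
    "author": "analyst_1",
    "source_dataset": "sales_landing_zone/raw",   # placeholder path
    "entities": {
        "Customer": {"characteristics": ["name", "region"]},
        "Product":  {"characteristics": ["sku"]},
        "Region":   {"characteristics": ["code"]},
    },
    "relationships": [
        {"from": "Customer", "type": "PURCHASED",  "to": "Product",
         "characteristics": ["qty"]},
        {"from": "Customer", "type": "LOCATED_IN", "to": "Region"},
    ],
    "notes": "Region promoted to its own entity after multi-region purchase analysis.",
}

# A shared, searchable catalog of such entries (here just an in-memory dict).
model_repository = {}
model_repository[consumer_model["name"]] = consumer_model
```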
The second is the practical implementation of shared models using data virtualization technology. By exposing the developed consumer data models through a data virtualization interface, other data users can experiment with existing renderings of the data to see if any already meet their analytics needs. If not, any of the documented models can be used as a starting point, or, if all else fails, the analyst can start from scratch and develop his or her own model.
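As an illustrative sketch of that idea (not the API of any particular data virtualization product), a registry of named renderings over the raw data, which analysts can reuse or fork, might look something like this:

```python
# A hypothetical registry of shared "renderings": each view is a function from
# raw records to a modeled structure, published under the name documented in
# the metadata repository so other analysts can discover and reuse it.
from typing import Callable, Dict, List

views: Dict[str, Callable[[List[dict]], object]] = {}

def register_view(name: str):
    """Publish a rendering for reuse by other data users."""
    def wrap(fn):
        views[name] = fn
        return fn
    return wrap

@register_view("customer_purchases_v1")
def customer_purchases(records: List[dict]) -> dict:
    """A simplified rendering of the consumer model documented above."""
    by_customer: dict = {}
    for r in records:
        by_customer.setdefault(r["customer"], []).append(r["product"])
    return by_customer

# Another analyst can start from the existing rendering...
raw = [{"customer": "acme", "product": "widget-a"}]
print(views["customer_purchases_v1"](raw))
# ...or register a variant under a new name if none of the existing views fit.
```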
This suggests that there is hope for layering some standards, and consequently some governance, over the consumption of schema-on-read data from a big data platform. By enabling a collaborative framework for model design and implementation, a governance team can impose standards for consistency in use and interpretation across the enterprise.
2 Comments
Data modeling has never achieved a high level of precision, and most data models represent a single view of the data. Data models lack semantic precision due to the lack of metadata design discipline. Schema on read provides an opportunity for further chaos by allowing data consumers to use data with even less regard for the semantics. Manipulating the data is easy with today's tools; explaining the approach and the proof behind the manipulation is not so easy. Schema on read is not data modeling, as it lacks even the basics of data modeling practices, which themselves are lacking in discipline.
Richard, I think your post shows that you agree with what I am saying. The attempt to create a single "model" of the data will no longer satisfy the needs of users who are trying to solve their problems with little or no practical background in data management. That being said, any attempt by a data user to impose some structure is a model, even if it is not one that demonstrates the elegance of a pure design provided by an experienced data practitioner. In the end, people are going to do what they can within the constraints of data policies, but they will certainly bypass those policies if they impede their ability to progress. Perhaps the best way to proceed is to have direct engagement between the data practitioners and the business data users, with collaborative guidance in superimposing models that actually lead to increased consistency with accepted semantics.