Big data model convergence: Combining metadata and data virtualization as a collaboration tool - Part 2


I am currently cycling through a schema-on-read data modeling process on a specific task for one of my clients. I have been presented with a data set and asked to consider how that data can best be analyzed using a graph-based data management system. My process is to load the data, examine whether I have created the right graph representation, execute a few queries, and then revise the model. I think I am almost done with this process, except that each time I manipulate the model for analysis, I notice yet one more thing about the data that needs tweaking before I can really start the analysis.
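
To make that load-examine-query-revise loop concrete, here is a minimal sketch of one iteration, assuming a Neo4j instance and its official Python driver; the node labels, properties, CSV layout, and queries are purely illustrative stand-ins for the client data set, not the actual model.

```python
# Illustrative sketch of one load-examine-query-revise iteration.
# Assumes Neo4j and its official Python driver; all labels, properties,
# and the CSV layout are hypothetical stand-ins for the real data.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load(tx, row):
    # First-pass representation: each record becomes a pair of nodes
    # joined by a relationship carrying the transaction detail.
    tx.run(
        "MERGE (c:Customer {id: $cust}) "
        "MERGE (p:Product {id: $prod}) "
        "MERGE (c)-[:PURCHASED {amount: $amt}]->(p)",
        cust=row["customer_id"], prod=row["product_id"], amt=float(row["amount"]),
    )

with driver.session() as session:
    with open("sample.csv", newline="") as f:
        for row in csv.DictReader(f):
            session.execute_write(load, row)

    # Examine: does this graph shape support the intended analysis?
    # If not, revise the model and reload.
    result = session.run(
        "MATCH (c:Customer)-[r:PURCHASED]->(:Product) "
        "RETURN c.id AS customer, count(r) AS purchases "
        "ORDER BY purchases DESC LIMIT 5"
    )
    for record in result:
        print(record["customer"], record["purchases"])

driver.close()
```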

As soon as I am done with my modeling process, I would like to share the results with others interested in analyzing the same data. While they may want to adjust the model somewhat, the fact that I have already invested the time to assess what is in the data set and how it can be represented in the graph model means that most other analysts will not need to duplicate my effort.

If you recall my previous post, I suggested that assigning the role of data modeler to a data analyst may be a drawback to the use of a schema-less data lake unless there were some way to collaborate on the modeling process. There are two aspects to creating this collaboration. The first is knowledge capture, in which the entities, characteristics and relationships of the consumer's data model are documented within a metadata repository that is shared among the user community. This complements the use of a metadata repository for source data information (see my earlier post on the use of metadata).
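
As one hypothetical illustration of that knowledge capture, a consumer model could be registered in the shared repository as a structured record of its entities, attributes, and relationships. The field names and the JSON file standing in for the repository are assumptions for the sketch, not a reference to any particular metadata product.

```python
# Hypothetical sketch of registering a consumer data model in a shared
# metadata repository, here simulated with a JSON file. Entity and
# relationship names are illustrative only.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ConsumerModel:
    name: str
    author: str
    description: str
    entities: dict = field(default_factory=dict)        # entity -> attributes
    relationships: list = field(default_factory=list)   # (from, type, to)

model = ConsumerModel(
    name="purchase-graph-v1",
    author="analyst@example.com",
    description="Customers and products linked by PURCHASED edges",
    entities={
        "Customer": ["id", "segment"],
        "Product": ["id", "category"],
    },
    relationships=[("Customer", "PURCHASED", "Product")],
)

# "Publishing" to the shared repository so other analysts can discover
# the model instead of re-deriving it from the raw data.
with open("metadata_repository.json", "a") as repo:
    repo.write(json.dumps(asdict(model)) + "\n")
```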

The second is the practical implementation of shared models using data virtualization technology. By exposing the developed consumer data models through a data virtualization interface, other data users can experiment with existing renderings of the data to see if any already meet their analytics needs. If not, any of the documented models can be used as a starting point – or, if all else fails, the analyst can start from scratch to develop his or her own model.
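
The reuse-or-fork pattern behind those shared renderings can be thought of along the following lines: each documented model is exposed as a named, stored query that an analyst can run as-is or copy and adjust. This is a sketch of the pattern under assumed names, not the API of any actual data virtualization product.

```python
# Sketch of the reuse-or-fork pattern behind shared virtual views.
# The registry and query text are illustrative; a real data
# virtualization layer would manage these views itself.
RENDERINGS = {}

def register_rendering(name, query):
    """Publish a named rendering (a stored query over the shared model)."""
    RENDERINGS[name] = query

def fork_rendering(existing, new_name, transform):
    """Start from a documented model and adjust it, rather than
    starting from scratch against the raw data."""
    register_rendering(new_name, transform(RENDERINGS[existing]))

# The original analyst publishes a rendering of the graph model.
register_rendering(
    "top_customers",
    "MATCH (c:Customer)-[r:PURCHASED]->(:Product) "
    "RETURN c.id, count(r) AS purchases ORDER BY purchases DESC",
)

# A second analyst forks it, narrowing the view to one product category.
fork_rendering(
    "top_customers",
    "top_customers_by_category",
    lambda q: q.replace("(:Product)", "(p:Product {category: $cat})"),
)
```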

This suggests that there is hope for layering standards, and consequently governance, over the consumption of schema-on-read data from a big data platform. By enabling a collaborative framework for model design and implementation, a governance team can impose standards for consistency of use and interpretation across the enterprise.
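
Purely as an illustration of such a standard, a governance gate could verify that any newly published model uses only entity names from an approved business glossary before accepting it into the shared repository; the glossary contents below are assumed.

```python
# Illustrative governance gate: reject a published model whose entities
# fall outside the approved glossary. The glossary is a made-up example.
APPROVED_ENTITIES = {"Customer", "Product", "Order", "Supplier"}

def validate_model(model_entities):
    """Return the entity names that violate the naming standard."""
    return set(model_entities) - APPROVED_ENTITIES

violations = validate_model({"Customer", "Product", "Prospect"})
if violations:
    print(f"Model rejected; unapproved entities: {sorted(violations)}")
```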


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.

2 Comments

  1. Richard Ordiwich

    Data modeling has never achieved a high level of precision, and most data models represent a single view of the data. Data models lack semantic precision due to the lack of metadata design discipline. Schema on read provides an opportunity for further chaos by allowing data consumers to use data with even less regard for the semantics. Manipulating the data is easy with today's tools; explaining the approach and proof behind the manipulation is not so easy. Schema on read is not data modeling, as it lacks even the basics of data modeling practices, which themselves are lacking in discipline.

  2. David Loshin

    Richard, I think your post shows that you agree with what I am saying. The attempt to create a single "model" of the data will no longer satisfy the needs of users who are trying to solve their problems with little or no practical background in data management. That being said, any attempt by a data user to impose some structure is a model, even if it is not one that demonstrates the elegance of a pure design provided by an experienced data practitioner. In the end, people are going to do what they can within the constraints of data policies, but they will certainly bypass those policies if they impede their ability to progress. Perhaps the best way to proceed is to have direct engagement between the data practitioners and the business data users, providing collaborative guidance in superimposing models that actually lead to increased consistency with accepted semantics.
