As we discussed in an earlier post in this series, one of the intents of data federation and virtualization is to layer some degree of opacity over accessing heterogeneous data sources by using a canonical model and a semantic layer for user queries. There are two additional benefits we expect from federation and virtualization: improved performance and improved data quality.
Federation is achieved by developing a semantically consistent mapping between the canonical model and the underlying sources. Performance, in turn, comes from two key mechanisms. The first is prefetching: anticipating data accesses and launching them early to reduce the latency between the request for data and its delivery. The second is simulated caching, in which data that has already been accessed from an originating source is copied locally and reused unless it has changed in the original source.
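To make those two mechanisms a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `Source` and `CachingAccessor` classes, the integer version counter used to detect changes, and the thread pool used for prefetching are all hypothetical and are not drawn from any particular federation product.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Source:
    """Hypothetical stand-in for an originating data source."""

    def __init__(self, rows):
        self._rows = dict(rows)          # key -> (version, value)
        self._lock = threading.Lock()
        self.reads = 0                   # counts full (expensive) reads

    def version(self, key):
        # Assumed cheap change check; a real source might expose a
        # timestamp or change log instead of a version counter.
        return self._rows[key][0]

    def read(self, key):
        with self._lock:
            self.reads += 1
        return self._rows[key]

    def update(self, key, value):
        version, _ = self._rows[key]
        self._rows[key] = (version + 1, value)


class CachingAccessor:
    """Keeps local copies and reuses them until the source changes."""

    def __init__(self, source):
        self.source = source
        self._cache = {}                 # key -> (version, value)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def get(self, key):
        cached = self._cache.get(key)
        if cached is not None and cached[0] == self.source.version(key):
            return cached[1]             # local copy is still current
        version, value = self.source.read(key)
        self._cache[key] = (version, value)
        return value

    def prefetch(self, keys):
        # Launch anticipated reads in the background so that a later
        # get() can be answered from the local copy.
        return [self._pool.submit(self.get, key) for key in keys]
```

A caller that expects to need keys `"a"` and `"b"` could call `prefetch(["a", "b"])` ahead of time; later `get` calls are then served locally, and a full read of the source happens again only after the source's version for that key changes.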
Despite the benefit of increased performance, you may still be at the mercy of the original sources to guarantee quality. This is where we see the benefit of the master data index: it can support both prefetching and caching while also providing a means for improving data quality as the data is prepared for delivery to the requesting application.
First, any request for a specific entity is resolved through the master index to locate the sources that hold instances representing that entity. In turn, those sources can be accessed simultaneously to stream the entity data into the virtual cache. At that point, the business rules specific to the data quality and usability expectations of the particular users can be applied to the data pulled from the original sources. In other words, this allows for the delivery of materialized master records with an assurance of consistency across all facets of the data presented for analysis:
- Structural – the types and formats of the target data elements are compatible with and across those of the sources as well as the desired target;
- Semantic – the presumed meanings of the target data concepts are assured to be consistent with and across those of the sources; and
- Temporal – the currency of the data is synchronized across the sources and then again with the target, providing a materialized version that is up-to-date with the selected data sources.
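The flow described above (resolve the entity through the master index, pull from the sources in parallel, then apply rules before delivery) can be sketched roughly as follows. Everything named here is an assumption made for illustration: the `MASTER_INDEX`, `SOURCES`, and `FIELD_MAP` dictionaries, the sample customer data, and the "most recently updated source wins" rule are hypothetical, and real business rules would depend on the users' own expectations.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical master index: entity id -> (source name, local key) pairs.
MASTER_INDEX = {"cust-42": [("crm", "C-0042"), ("billing", "88017")]}

# Hypothetical sources, each with its own field names and formatting.
SOURCES = {
    "crm": {"C-0042": {"email_addr": " Ada@Old-Mail.example ",
                       "modified": "2024-03-01"}},
    "billing": {"88017": {"contact_email": "ADA@EXAMPLE.COM",
                          "last_change": "2024-05-15"}},
}

# Semantic rule: map each source's field names to the canonical model.
FIELD_MAP = {
    "crm": {"email_addr": "email", "modified": "updated"},
    "billing": {"contact_email": "email", "last_change": "updated"},
}


def fetch(location):
    """Read one instance and rename its fields to canonical names."""
    source_name, local_key = location
    raw = SOURCES[source_name][local_key]
    record = {FIELD_MAP[source_name][k]: v for k, v in raw.items()}
    record["source"] = source_name
    return record


def normalize(record):
    """Structural rule: one type and format per canonical element."""
    return {
        "source": record["source"],
        "email": record["email"].strip().lower(),
        "updated": date.fromisoformat(record["updated"]),
    }


def materialize(entity_id):
    """Build a master record for one entity from all indexed sources."""
    locations = MASTER_INDEX[entity_id]
    # Access the sources simultaneously rather than one after another.
    with ThreadPoolExecutor(max_workers=len(locations)) as pool:
        records = list(pool.map(fetch, locations))
    cleaned = [normalize(r) for r in records]
    # Temporal rule (assumed): the most recently updated source wins.
    newest = max(cleaned, key=lambda r: r["updated"])
    return {
        "entity_id": entity_id,
        "email": newest["email"],
        "as_of": newest["updated"],
        "sources": sorted(r["source"] for r in cleaned),
    }
```

Calling `materialize("cust-42")` in this toy setup yields a single record that carries the billing system's email address, because billing holds the more recent update. The renaming, normalization, and survivorship steps correspond loosely to the semantic, structural, and temporal facets listed above.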
In short, by taking advantage of inverted indexing back to the original sources of record, MDM can serve as a key enabling capability for the improved performance and quality expected from any enterprise canonical data model serving semantically consistent data via a virtualized and federated environment.