Last time we discussed two different models for syndicating master data. One model was replicating copies of the master data and pushing them out to the consuming applications, while the other was creating a virtual layer on top of the master data in its repository and funneling access through a data virtualization framework.
The benefit of the replication model is that it can scale to meet the performance needs of all the downstream consumers, at the risk of introducing asynchrony and inconsistency. The benefit of the virtualization approach is synchronization and consistency, but at the risk of creating a data access bottleneck. Either may be satisfactory for certain types of applications, but neither is optimal for all applications.
There is, however, a hybrid model that blends the two approaches: selectively replicating the master repository, maintaining a consistent view via change data capture, and enabling federated access via a virtualization layer on top of the replicas. In this approach, the repository can be replicated to one or more high-performance platforms (such as a Hadoop cluster), with each instance intended to support a limited number of simultaneous client applications.
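To make the change data capture piece concrete, here is a minimal sketch of how captured changes from the master repository might be forwarded to the replicas. It assumes an in-memory change log and replicas that accept versioned upserts; all class and function names are illustrative, not tied to any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ChangeEvent:
    """One captured change from the master repository's change log."""
    key: str
    record: dict
    version: int


class Replica:
    """An in-memory stand-in for a replicated master data store."""

    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, dict] = {}
        self.versions: Dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> None:
        # Apply the change only if it is newer than what the replica holds,
        # so replaying the change log stays idempotent.
        if event.version > self.versions.get(event.key, -1):
            self.data[event.key] = event.record
            self.versions[event.key] = event.version


def forward_changes(change_log: List[ChangeEvent], replicas: List[Replica]) -> None:
    """Push each captured change to every replica within the sync window."""
    for event in change_log:
        for replica in replicas:
            replica.apply(event)


if __name__ == "__main__":
    replicas = [Replica("hadoop-1"), Replica("hadoop-2")]
    log = [
        ChangeEvent(key="cust-42", record={"name": "Acme Corp"}, version=1),
        ChangeEvent(key="cust-42", record={"name": "Acme Corporation"}, version=2),
    ]
    forward_changes(log, replicas)
    print(replicas[0].data["cust-42"])  # {'name': 'Acme Corporation'}
```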
The virtualization layer can manage access to multiple replicas and provide elasticity in balancing the requests to different replicas as the load increases or decreases. Updates can be channeled through the source master environment, as any changes will be forwarded to the replicas within a well-defined time window.
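As a rough illustration of that routing behavior, the sketch below shows a federated access layer that balances reads round-robin across the replicas while channeling every write through the master. It uses plain dictionaries as stand-ins for the stores; the names and the round-robin policy are assumptions for the example, not a prescribed design.

```python
from itertools import cycle
from typing import Dict, Iterable


class FederatedAccessLayer:
    """Routes reads to replicas and writes to the master repository."""

    def __init__(self, master: Dict[str, dict], replicas: Iterable[Dict[str, dict]]):
        self.master = master
        self.replicas = list(replicas)
        self._next_replica = cycle(self.replicas)

    def read(self, key: str) -> dict:
        # Balance read load by rotating through the available replicas.
        replica = next(self._next_replica)
        return replica.get(key, {})

    def write(self, key: str, record: dict) -> None:
        # Channel every update through the master; change data capture
        # (not shown here) would propagate it to the replicas afterwards.
        self.master[key] = record


if __name__ == "__main__":
    master = {"cust-42": {"name": "Acme Corporation"}}
    replica_a, replica_b = dict(master), dict(master)
    layer = FederatedAccessLayer(master, [replica_a, replica_b])
    print(layer.read("cust-42"))
    layer.write("cust-99", {"name": "Globex"})
```

Adding or removing replicas from the pool is how the layer provides elasticity as request load rises or falls.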
Sounds good. The next step? Making sure all the pieces work together. Do the data virtualization tools provide seamless access to Hadoop-based systems? And how easy is it to replicate in a controlled manner? These are among the next questions to ask when considering master data integration at the enterprise level.