What really happens when a new entity is added to a master data environment? First of all, this happens only after determining that the entity is not already known within the system. Abstractly, a significant amount of the process involves adding the new entity into the master index. At the very least, this involves creating a new master repository record to hold the identifying data element values and generating a unique identifier based on those values.
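As a rough sketch of that registration step (the record structure, the hash-based identifier, and the field names are illustrative assumptions, not drawn from any particular MDM product), the core operation might look something like this:

```python
import hashlib
import json

def register_new_entity(master_index, identifying_values):
    """Create a master record for an entity not already in the index.

    identifying_values: dict of identifying data elements, e.g.
    {"given_name": "Ann", "surname": "Lee", "birth_date": "1980-04-02"}
    """
    # Derive a unique identifier from the identifying data element values.
    # (Hashing the normalized values is one illustrative choice; many
    # systems assign a surrogate key instead.)
    canonical = json.dumps(identifying_values, sort_keys=True).lower()
    entity_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

    # The new master repository record holds, at minimum, the identifier
    # and the identifying data element values.
    master_index[entity_id] = dict(identifying_values)
    return entity_id
```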
At the same time, though, adding the record may trigger additional processing. For example, as new data element values enter the environment, the frequencies with which particular values occur may change. That alters the probabilities of matching a given set of values to known entities, which in turn may require adjusting the tuning parameters and weights used by the identity resolution algorithms. Because these adjustments depend on overall value frequencies, it might seem sensible to wait until multiple new entities are queued up before updating the master index and the identity resolution parameters.
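To see why value frequencies matter to matching, here is a minimal sketch, loosely in the spirit of frequency-based (Fellegi-Sunter style) scoring rather than any specific product's algorithm; the probabilities and the log-weight formulation are assumptions for illustration only:

```python
import math
from collections import Counter

def agreement_weight(value, value_counts, total_records, u_floor=1e-6):
    """Weight contributed by two records agreeing on `value` for one field.

    Rarer values are stronger evidence of a true match, so as new records
    shift the frequency distribution, the weights shift with them.
    """
    # u: chance that two unrelated records agree on this value by coincidence,
    # approximated here by the value's relative frequency.
    u = max(value_counts[value] / total_records, u_floor)
    m = 0.95  # assumed probability that records for the same entity agree
    return math.log2(m / u)

# Illustrative frequencies: "Smith" is common, "Xanthopoulos" is rare.
surnames = Counter({"Smith": 5000, "Jones": 3000, "Xanthopoulos": 2})
total = sum(surnames.values())
print(agreement_weight("Smith", surnames, total))         # modest weight
print(agreement_weight("Xanthopoulos", surnames, total))  # much larger weight
```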
On the other hand, waiting until a batch of new entities has been collected means that the master index (and repository) will lag behind the operational systems that use the master data environment by the length of the update cycle. In that case, if an individual is added to an operational system's data set and the record is queued for later insertion into the master index, that individual remains "hidden" from the enterprise until the next synchronization point. That suggests that perhaps every time a new entity record is created, the identifying information should be registered in the master index immediately!
The challenge, then, is to balance the completeness and currency of the master index ("full synchronization"), which requires a significant amount of processing to rebalance probabilities, against a less complete master index ("periodic synchronization"), which may lead to seemingly missing entities as well as the need to screen out duplicate entries added to the insertion queue via different process paths. The right approach, of course, will depend on your applications' requirements, which we will explore in my next post.
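As a crude illustration of the periodic-synchronization side of that trade-off (the queue structure and the exact-value deduplication key are assumptions for the sketch, not a prescribed design), a batch update might screen the queue before touching the index:

```python
import hashlib
import json

def flush_queue(master_index, pending_queue):
    """Apply queued entity additions to the master index in one batch.

    Because the same real-world entity can be queued via different process
    paths, duplicates are screened out before insertion; a real system would
    also retune its identity resolution weights once the batch is applied.
    """
    for record in pending_queue:
        # Screen duplicates on the normalized identifying values (exact-value
        # screening here; actual matching would be probabilistic).
        canonical = json.dumps(record, sort_keys=True).lower()
        entity_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        if entity_id not in master_index:
            master_index[entity_id] = dict(record)
    pending_queue.clear()

# Entities added to operational systems accumulate here and remain "hidden"
# from the enterprise until the next flush.
pending_queue = [
    {"given_name": "Ann", "surname": "Lee", "birth_date": "1980-04-02"},
    {"given_name": "Ann", "surname": "Lee", "birth_date": "1980-04-02"},  # second process path
]
master_index = {}
flush_queue(master_index, pending_queue)
print(len(master_index))  # 1: the duplicate was screened out
```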