A coordinator that collects data from numerous sources, integrates it, and then aggregates individual line-item statistics needs more firepower than just sweeping up data sets and dumping them into a single database. One approach is to employ master data management (MDM) techniques, first for entity identification and then for identity resolution.
Clearly, the same individuals will be represented in different ways across the different data sets. Entity identification can be used to scan the data fields that hold individuals' names, parse out the relevant tokens in those name strings, and then standardize the representations.
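To make the parse-and-standardize step concrete, here is a minimal sketch in Python. The function name `standardize_name`, the title and suffix lists, and the output fields are all assumptions made for illustration. A real MDM tool would rely on far richer parsing rules and reference data.

```python
import re
import unicodedata

# Hypothetical reference lists; a production tool would use much larger ones.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def _tokens(text):
    """Split lowercase text into alphabetic tokens, dropping punctuation."""
    return [t for t in re.split(r"[^a-z]+", text) if t]


def standardize_name(raw):
    """Parse a free-form personal name into a standardized representation."""
    # Strip accents and fold case so "José" and "JOSE" compare equal.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()

    head, comma, tail = text.partition(",")
    head_tokens, tail_tokens = _tokens(head), _tokens(tail)

    # "Smith, John A." is reordered to given-name-first, but a tail that is
    # only a suffix ("John Smith, Jr.") is left in place.
    if comma and tail_tokens and not all(t in SUFFIXES for t in tail_tokens):
        tokens = tail_tokens + head_tokens
    else:
        tokens = head_tokens + tail_tokens

    suffix = [t for t in tokens if t in SUFFIXES]
    core = [t for t in tokens if t not in TITLES and t not in SUFFIXES]

    return {
        "given": core[0] if core else "",
        "middle": " ".join(core[1:-1]),
        "family": core[-1] if len(core) > 1 else "",
        "suffix": suffix[0] if suffix else "",
    }
```

With these assumed rules, "Dr. John A. Smith" and "SMITH, John A." reduce to the same standardized record. Hyphenated names, initials versus full middle names, and culture-specific name orders would all need additional handling.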
Identity resolution is then applied to match the standardized representations against a master index and determine whether each entity is already logged there. If so, the matched record(s) can be linked to that unique identity and cached for later aggregation. If not, there are two options. The obvious one is to create a new entry. The more thoughtful one is to use the MDM techniques again to find the closest matches, and then have a data practitioner work with the data publishers to decide whether any of those candidates is a true match.
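The decision flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the `MasterIndex` class, the 0.85 review threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, and real identity resolution tools typically compare several attributes with tuned or probabilistic scoring.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class MasterIndex:
    # Assumed cutoff: scores at or above this go to a human for review.
    review_threshold: float = 0.85
    entities: dict = field(default_factory=dict)      # entity_id -> standardized name
    links: dict = field(default_factory=dict)         # entity_id -> [(source, record_id)]
    review_queue: list = field(default_factory=list)  # near matches awaiting a practitioner

    def resolve(self, source, record_id, std_name):
        """Return (status, entity_id) for one standardized source record."""
        # 1. Already in the master index: link the record to that identity.
        for entity_id, name in self.entities.items():
            if name == std_name:
                self.links[entity_id].append((source, record_id))
                return ("linked", entity_id)

        # 2. No exact match: look for close candidates worth a manual review.
        scored = sorted(
            ((SequenceMatcher(None, std_name, name).ratio(), entity_id)
             for entity_id, name in self.entities.items()),
            reverse=True,
        )
        close = [(score, eid) for score, eid in scored if score >= self.review_threshold]
        if close:
            self.review_queue.append((source, record_id, std_name, close))
            return ("review", close[0][1])

        # 3. Nothing close: create a new master entry.
        entity_id = f"E{len(self.entities) + 1:04d}"
        self.entities[entity_id] = std_name
        self.links[entity_id] = [(source, record_id)]
        return ("created", entity_id)
```

In this sketch, a record held for review is not linked to anything until a practitioner and the publisher confirm or reject the candidate match, which mirrors the "more thoughtful" option rather than the obvious one.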
At the same time, it is in the coordinator’s best interests to communicate the shared index of entities in a managed way to the organizations providing data. I say “managed” because different situations will be constrained by different data policies. For example, there may be a presumption that one company’s enumeration of individuals will not be shared directly with any other company; publishing a master list might violate this presumption.
One quick thought: the coordinator must manage the master index somehow, so could that coordinator also provide an identity resolution service to help standardize the entity representations across the community? If so, this defines a de facto standard of representation that can be communicated (and hopefully deployed) within each of the organizations within that community, which is an example of how managed services define federated governance policies.
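One way to picture such a service, while still respecting the "managed" constraint, is an interface that answers lookups one name at a time instead of publishing the master list. The sketch below is purely illustrative: the `ResolutionService` class, its method names, and the choice of HMAC-derived per-publisher tokens are assumptions, not something the approach above prescribes.

```python
import hashlib
import hmac


class ResolutionService:
    """Hypothetical coordinator-side service; the index itself is never exported."""

    def __init__(self, secret):
        self._secret = secret  # bytes key known only to the coordinator
        self._index = {}       # standardized name -> internal entity number

    def register(self, std_name):
        """Add a standardized name to the master index if it is not there yet."""
        return self._index.setdefault(std_name, len(self._index) + 1)

    def lookup(self, publisher, std_name):
        """Return an opaque token for a known entity, or None if it is unknown."""
        internal = self._index.get(std_name)
        if internal is None:
            return None
        # The token is derived per publisher, so two publishers cannot compare
        # tokens to learn that they hold the same individual; only the
        # coordinator, which holds the secret, can relate them.
        message = f"{publisher}:{internal}".encode()
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()[:16]
```

Whether per-publisher tokens are appropriate depends entirely on the data policies in force. A community that permits open sharing could simply return the same identifier to everyone.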
Next time: some other thoughts about the operational model.
1 Comment
If you could effectively create a "glossary" defining the entity representation (business semantic definition + syntactic definition + context definition), could you then communicate that representation to the community? It would seem logical enough. Translate that representation into a canonical model and you get something that systems can deal with.
It becomes a target of sorts that can be used as the "common language" for data movement across systems, and the resolution service becomes a universal translator... kind of a BabbleFish for data.