Employing master data techniques for data integration and aggregation in a managed way

Certainly, enabling any coordinator that collects data from numerous sources to integrate and then aggregate individual line-item statistics requires more firepower than just sweeping up data sets and dumping them into a single database. One approach is to employ master data management (MDM) techniques for entity identification followed by identity resolution.

Clearly, the same individuals will be represented in different ways across the different data sets. Entity identification can be used to scan the data fields that represent individual names, parse out the relevant tokens in those name strings, and then standardize the representations.
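
To make that concrete, here is a minimal sketch in Python of what that parsing and standardization step might look like. The nickname and suffix tables and the field layout are illustrative assumptions, not a production-grade name parser.

```python
# Minimal sketch of entity identification for person names. The nickname
# and suffix tables are illustrative; a real parser would be far richer.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def standardize_name(raw: str) -> dict:
    """Parse a raw name string into a standardized token representation."""
    # Reorder "Last, First Middle" into "First Middle Last"
    if "," in raw:
        last, rest = raw.split(",", 1)
        raw = f"{rest} {last}"
    # Tokenize, strip punctuation, and lowercase
    tokens = [t.strip(".,").lower() for t in raw.split() if t.strip(".,")]
    suffix = tokens.pop() if tokens and tokens[-1] in SUFFIXES else ""
    first = NICKNAMES.get(tokens[0], tokens[0]) if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])
    return {"first": first, "middle": middle, "last": last, "suffix": suffix}

print(standardize_name("Loshin, David"))      # first 'david', last 'loshin'
print(standardize_name("Bill R. Smith Jr."))  # nickname normalized to 'william'
```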

Identity resolution is applied to match the standardized representations against a master index and determine whether each entity is already logged within that index. If so, the matched record(s) can be linked to that unique identity and cached for later aggregation. If not, there are two things to consider. The obvious one is to create a new entry, while the more thoughtful one is to again apply MDM matching techniques to find the closest candidates and have a data practitioner work with the data publishers to determine any potential matches.
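
As a rough illustration, the matching logic might look something like the sketch below, assuming the standardized form from the previous step. The similarity measure (Python's stdlib difflib) and the thresholds are placeholder choices, not recommendations.

```python
# Sketch of identity resolution against a master index. Scores at or above
# the review floor but below an exact match are routed to a data
# practitioner rather than linked automatically.
import difflib
import uuid

master_index = {}  # entity_id -> standardized key, e.g. "david|loshin"

def resolve(std: dict, review_floor: float = 0.85):
    key = f"{std['first']}|{std['last']}"
    best_id, best_score = None, 0.0
    for entity_id, existing in master_index.items():
        score = difflib.SequenceMatcher(None, key, existing).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score == 1.0:
        return best_id, "linked"    # already logged: link and cache for aggregation
    if best_score >= review_floor:
        return best_id, "review"    # close match: refer to a data practitioner
    new_id = str(uuid.uuid4())      # no plausible match: create a new entry
    master_index[new_id] = key
    return new_id, "created"
```

A record flagged "review" carries the closest candidate's identifier, so the practitioner starts from the best guess rather than from scratch.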

At the same time, it is in the coordinator’s best interests to communicate the shared index of entities to the organizations providing data in a managed way. I say “managed” because different situations will be constrained by different data policies. For example, there may be a presumption that one company’s enumeration of individuals may not be directly shared with any other company; publishing a master list outright might violate that presumption.

One quick thought: the coordinator must manage the master index somehow, so could that coordinator also provide an identity resolution service to help standardize the entity representations across the community? If so, this defines a de facto standard of representation that can be communicated (and hopefully deployed) within each of the organizations within that community, which is an example of how managed services define federated governance policies.
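
Here is a sketch of what that shared service might look like, building on the two functions above; the class and method names are hypothetical. Note that callers receive back a standard form and a shared identifier, but never the master list itself, which keeps the service consistent with the sharing constraints described earlier.

```python
# Hypothetical coordinator-run service wrapping standardization and
# resolution behind a single call, so every member organization receives
# the same canonical representation and shared identifier.
class IdentityResolutionService:
    def resolve_entity(self, raw_name: str) -> dict:
        std = standardize_name(raw_name)   # enforce the de facto standard form
        entity_id, status = resolve(std)   # match against the coordinator's index
        return {"entity_id": entity_id, "standard_form": std, "status": status}

service = IdentityResolutionService()
print(service.resolve_entity("Loshin, David"))
```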

Next time: some other thoughts about the operational model.

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, via the expert channel at b-eye-network.com and through numerous books, white papers, and web seminars. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003), has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.

1 Comment

  1. If you could effectively create a "glossary" defining the entity representation (business semantic definition + syntactic definition + context definition), could you then communicate that representation to the community? Would seem logical enough. Translate that representation into a canonical model and you get something that can be dealt with by systems.
    It becomes a target of sorts that can be used as the "common language" in data movement across systems, and the resolution service becomes a universal translator...kind of a Babel Fish for data.
