Entity resolution and master data management


Master data management is an application framework comprising a number of different information management practices and services. At the core of most party-oriented (e.g., customer, employee, vendor) master data management systems is some mechanism for entity resolution, which is fundamentally intended to identify data instances that refer to the same entity and link those data instances together.

I am being somewhat careful with my description of entity resolution for a few reasons. In a recent conversation with a marketing manager from a company developing an entity resolution product, I was repeatedly corrected to say “entity resolution” when I used the term “identity resolution” to refer to what their product did. In other situations, customers of ours who are implementing an MDM solution always refer to the same process as “identity resolution.”

Searching Wikipedia for the phrase “entity resolution” takes you to a page titled “Record Linkage.” When I went to Google and searched for “entity resolution” and Wikipedia, the first result pointed me to a Wikipedia page titled “Identity Resolution.” After reading all the definitions, though, I get the feeling there is little precision in differentiating these terms, which are used more or less interchangeably, depending on how old you are, what companies you've worked for, and the recency of the articles you're reading in the technical media.

What all the descriptions have in common is a focus on examining the attribute contents of sets of records to determine, with some level of confidence, when subsets of the records refer to the same real-world thing. There are two practical approaches to record linkage: deterministic and probabilistic. In the deterministic approach, two records are said to match if a particular number of their corresponding attribute values match.
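
As a rough illustration of the deterministic approach, here is a minimal Python sketch. The attribute names, the sample records, and the light normalization (lowercasing and collapsing whitespace) are my own assumptions for the example, not part of any particular product.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def deterministic_match(rec_a, rec_b, attributes, min_matches):
    """Return True when at least `min_matches` of the listed attributes agree exactly."""
    agreements = 0
    for attr in attributes:
        a, b = rec_a.get(attr), rec_b.get(attr)
        if a is None or b is None:
            continue  # a missing value cannot count as agreement
        if normalize(a) == normalize(b):
            agreements += 1
    return agreements >= min_matches


# Hypothetical records and attribute list, for illustration only.
ATTRS = ["name", "dob", "zip", "phone"]
rec1 = {"name": "Ann Lee", "dob": "1980-02-01", "zip": "20901", "phone": "301-555-0100"}
rec2 = {"name": "ann  lee", "dob": "1980-02-01", "zip": "20901", "phone": "301-555-0199"}

print(deterministic_match(rec1, rec2, ATTRS, min_matches=3))
print(deterministic_match(rec1, rec2, ATTRS, min_matches=4))
```

With these sample records, name, date of birth, and ZIP code agree while the phone numbers differ, so the pair matches when three agreeing attributes are required and fails when all four are required.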

In the probabilistic approach, each attribute is assigned a weight, and similarity-scoring methods produce per-attribute scores that are combined with the weights and aggregated into an overall match score. If the score is above an upper threshold, the pair of records is said to match. If the score is below a lower threshold, the pair is not a match. Scores in between the two thresholds require further inspection to determine whether the pair matches or not.
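
The weighted scoring and two-threshold decision described above can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the weights, the threshold values, the attribute names, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure. Production tools typically derive their weights from match and non-match probabilities rather than picking them by hand.

```python
from difflib import SequenceMatcher

# Hypothetical weights and thresholds, chosen only for this example.
WEIGHTS = {"name": 0.5, "dob": 0.3, "zip": 0.2}
UPPER_THRESHOLD = 0.85  # at or above this score: match
LOWER_THRESHOLD = 0.60  # below this score: non-match


def similarity(x, y):
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, x.strip().lower(), y.strip().lower()).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-attribute similarity scores."""
    total_weight = sum(weights.values())
    weighted = sum(w * similarity(rec_a[attr], rec_b[attr]) for attr, w in weights.items())
    return weighted / total_weight


def classify(rec_a, rec_b, lower=LOWER_THRESHOLD, upper=UPPER_THRESHOLD):
    """Return 'match', 'non-match', or 'review' for scores between the thresholds."""
    score = match_score(rec_a, rec_b)
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "review"


# Hypothetical records: same name and date of birth, completely different ZIP code.
rec1 = {"name": "Ann Lee", "dob": "1980-02-01", "zip": "20901"}
rec2 = {"name": "Ann Lee", "dob": "1980-02-01", "zip": "33333"}

print(round(match_score(rec1, rec2), 2), classify(rec1, rec2))
```

In this sketch the sample pair agrees on the two most heavily weighted attributes but not on ZIP code, so its score lands between the two thresholds and the pair is routed to manual review rather than being linked or rejected automatically.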

Clearly, any MDM system will need some kind of entity resolution technique to link records with a known identity. And unsurprisingly, most of the vendors selling MDM solutions have either built their solutions around the entity resolution capability, or have acquired companies whose main product is an entity resolution application. Here is a question, though: must entity resolution solely live in the MDM world? Or can entity resolution survive in isolation?


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.

1 Comment

  1. Hi David,

    Great posting as always. You're absolutely right. There is quite a bit of confusion out there, and people should clarify what they mean. I actually have my own definition, which I'm using in my upcoming book on Multi-domain MDM.

    To me, entity resolution is the process to recognize a specific entity and properly represent it uniquely, completely, accurately, and consistently from a data point of view.

    With that said, identity resolution and record linkage are steps within entity resolution.

    Identity resolution is the step to determine what data elements are used to uniquely identify an entity. For example, one company might use SSN only (notice, this is NOT a recommendation LOL), while another company might use a combination of several data elements (Name, DOB, Address, etc.).

    Record linkage is the step where identity resolution is applied across multiple sources to solve for a certain entity, which ultimately results in a unique entity.

