Determining reference data set isomorphisms

In my last post we started talking about the tasks associated with data harmonization; the topic of this week’s post is determining that two reference data sets refer to the same conceptual domain.

First, let’s review some definitions:

  • A value item is a representation of a specific value meaning in a value domain.
  • A value domain is a collection of value items.
  • A conceptual domain represents the meanings of the permissible values in a value domain.
  • A value meaning is a relation between a concept in a conceptual domain and a value item.

As an example, the “states of the United States” is a conceptual domain. Its members are concepts: the geopolitical entities referred to as “states,” such as “Alaska” and “California.” A value domain for states might consist of a collection of two-character uppercase abbreviations, such as “NY” or “OH.” Another value domain for states might consist of two-digit codes such as “01” or “23.”

We can say that value domains A and B are isomorphic if:

  • The cardinality of A is equal to the cardinality of B (they have the same number of values).
  • Both A and B are associated with a conceptual domain C with the same cardinality as A and B.
  • There is an enumeration of value meanings in A that is a 1-1 mapping to concepts in C.
  • There is an enumeration of value meanings in B that is a 1-1 mapping to concepts in C.

In other words, within each value domain, each value maps to one and only one concept in conceptual domain C, and each concept is represented by exactly one value. Consequently, every value in reference data set A maps to the same concept in C as exactly one value in reference data set B. Those two values are equivalent in meaning, so the two reference data sets can be trivially transformed into each other.
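The conditions above can be sketched in a few lines of Python. This is only an illustration under some assumptions: each value domain is represented as a dictionary from value item to concept (standing in for its value meanings), the function names `is_one_to_one_onto` and `are_isomorphic` are hypothetical, and the sample data is a three-state subset with made-up two-digit codes.

```python
def is_one_to_one_onto(value_meanings, concepts):
    """True if every value maps to a distinct concept and every concept is covered."""
    mapped = list(value_meanings.values())
    return len(set(mapped)) == len(mapped) and set(mapped) == set(concepts)


def are_isomorphic(meanings_a, meanings_b, concepts):
    """Check the four conditions for value domains A and B over conceptual domain C."""
    same_cardinality = len(meanings_a) == len(meanings_b) == len(set(concepts))
    return (same_cardinality
            and is_one_to_one_onto(meanings_a, concepts)
            and is_one_to_one_onto(meanings_b, concepts))


# Illustrative three-state subset; the two-digit codes are made up for this sketch.
concepts = {"Alaska", "California", "New York"}
abbreviations = {"AK": "Alaska", "CA": "California", "NY": "New York"}
codes = {"01": "Alaska", "02": "California", "03": "New York"}
# Two codes for one concept, and no code for New York.
broken_codes = {"01": "Alaska", "02": "California", "03": "California"}

print(are_isomorphic(abbreviations, codes, concepts))         # True
print(are_isomorphic(abbreviations, broken_codes, concepts))  # False
```

The second call fails even though the cardinalities match, which shows why equal counts alone are not enough to establish isomorphism.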

To continue the example: suppose reference data set A consists of 50 two-character uppercase abbreviations, reference data set B consists of 50 two-digit codes, and each value in either set maps directly to one and only one of the 50 states of the United States (conceptual domain C). Then A and B are isomorphic and essentially equivalent in representative power. In turn, the two data sets can be harmonized by transforming every instance of a value in reference data set A to the corresponding value in reference data set B.
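As a rough sketch of that transformation, the translation from A to B can be derived by following each value in A to its concept and then to the value in B that represents the same concept. The helper name `build_translation`, the dictionary representation of the value meanings, and the sample codes are all assumptions made for illustration.

```python
def build_translation(meanings_a, meanings_b):
    """Derive an A-to-B lookup from two value-item -> concept mappings."""
    # Invert B so that each concept points back to its value in B.
    concept_to_b = {concept: value for value, concept in meanings_b.items()}
    if len(concept_to_b) != len(meanings_b):
        raise ValueError("two values in B share a concept; B is not 1-1")
    translation = {}
    for value_a, concept in meanings_a.items():
        if concept not in concept_to_b:
            raise ValueError(f"no value in B represents {concept!r}")
        translation[value_a] = concept_to_b[concept]
    return translation


# Illustrative three-state subset; the two-digit codes are made up for this sketch.
abbreviations = {"AK": "Alaska", "CA": "California", "NY": "New York"}
codes = {"01": "Alaska", "02": "California", "03": "New York"}

translation = build_translation(abbreviations, codes)
records = ["NY", "AK", "NY"]
harmonized = [translation[value] for value in records]
print(harmonized)  # ['03', '01', '03']
```

Because the mapping runs through the shared concepts, swapping the two arguments produces the reverse translation from B to A.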

Reference data set isomorphism is a good starting point for harmonizing reference data sets. However, two reference data sets are not always completely isomorphic, and in those cases deciding whether and how they can be harmonized requires additional insight.

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.