In my last post we started talking about the tasks associated with data harmonization; the topic of this week’s post is determining that two reference data sets refer to the same conceptual domain.
First, let’s review some definitions:
- A value item is a representation of a specific value meaning in a value domain.
- A value domain is a collection of value items.
- A conceptual domain represents the meanings of the permissible values in a value domain.
- A value meaning is a relation between a concept in a conceptual domain and a value item.
As an example, “states of the United States” is a conceptual domain. Its members are concepts: the geopolitical entities referred to as “states,” such as “Alaska” and “California.” A value domain for states might consist of two-character uppercase abbreviations, such as “NY” or “OH.” Another value domain for states might consist of two-digit codes, such as “01” or “23.”
We can say that value domains A and B are isomorphic if:
- The cardinality of A is equal to the cardinality of B (they have the same number of values).
- Both A and B are associated with a conceptual domain C with the same cardinality as A and B.
- There is an enumeration of value meanings in A that is a 1-1 mapping to concepts in C.
- There is an enumeration of value meanings in B that is a 1-1 mapping to concepts in C.
In other words, in each value domain, each value is mapped to one and only one concept in the conceptual domain C, and each concept is represented by exactly one value. In this scenario, every value in reference data set A is mapped to the same concept in conceptual domain C as exactly one value in reference data set B. Those two values are effectively equivalent in meaning, so the reference data sets can be trivially transformed into each other.
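The conditions above can be sketched as a small check. This is only an illustration, not a definitive implementation: it assumes each value domain is held as a dictionary from value item to concept (standing in for the value meanings), and the function name `is_isomorphic`, the four-state subset, and the numeric codes are all made up for the example.

```python
def is_isomorphic(meanings_a, meanings_b, concepts):
    """Check the isomorphism conditions for value domains A and B.

    meanings_a / meanings_b: dict mapping each value item to its concept.
    concepts: the concepts that make up conceptual domain C.
    """
    concepts = set(concepts)

    def maps_one_to_one(meanings):
        # Same cardinality as C, and every concept in C is covered.
        # Together these mean no concept is shared by two values.
        return (len(meanings) == len(concepts)
                and set(meanings.values()) == concepts)

    return maps_one_to_one(meanings_a) and maps_one_to_one(meanings_b)


# Hypothetical four-state subset, just to keep the example short.
states = {"Alaska", "California", "New York", "Ohio"}
abbreviations = {"AK": "Alaska", "CA": "California", "NY": "New York", "OH": "Ohio"}
# Made-up two-digit codes, not taken from any real code list.
codes = {"01": "Alaska", "02": "California", "03": "New York", "04": "Ohio"}

# Two codes that point at the same state break the 1-1 mapping.
bad_codes = {"01": "Alaska", "02": "Alaska", "03": "New York", "04": "Ohio"}

print(is_isomorphic(abbreviations, codes, states))      # True
print(is_isomorphic(abbreviations, bad_codes, states))  # False
```

Because dictionary keys are unique, comparing the number of entries with the number of concepts, and the set of mapped concepts with C itself, is enough to establish the 1-1 mapping in this representation.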
To continue the example, suppose reference data set A consists of 50 two-character uppercase abbreviations, reference data set B consists of 50 two-digit codes, and there is a direct 1-1 mapping from the values in each set to the 50 states of the United States (conceptual domain C). Then A and B are isomorphic and are essentially equivalent in representative power. In turn, the two data sets could be harmonized by transforming every instance of a value in reference data set A to the corresponding value in reference data set B.
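As a rough sketch of that transformation, the shared concept can serve as the join key between the two sets. The helper names `build_crosswalk` and `harmonize`, the dictionary representation, and the numeric codes below are assumptions made for illustration only.

```python
def build_crosswalk(meanings_a, meanings_b):
    """Map each value in A to the value in B that shares its concept."""
    # Invert B so that a B value can be looked up by concept.
    concept_to_b = {concept: value for value, concept in meanings_b.items()}
    if len(concept_to_b) != len(meanings_b):
        raise ValueError("two values in B share a concept; B is not 1-1")
    # A KeyError here means A refers to a concept that B does not cover.
    return {a_value: concept_to_b[concept]
            for a_value, concept in meanings_a.items()}


def harmonize(values, crosswalk):
    """Rewrite a sequence of A values as the corresponding B values."""
    return [crosswalk[value] for value in values]


# Hypothetical four-state subset with made-up two-digit codes.
abbreviations = {"AK": "Alaska", "CA": "California", "NY": "New York", "OH": "Ohio"}
codes = {"01": "Alaska", "02": "California", "03": "New York", "04": "Ohio"}

crosswalk = build_crosswalk(abbreviations, codes)
print(crosswalk)                                 # {'AK': '01', 'CA': '02', 'NY': '03', 'OH': '04'}
print(harmonize(["NY", "OH", "NY"], crosswalk))  # ['03', '04', '03']
```

When the two sets are isomorphic, the crosswalk is itself a 1-1 mapping, so the same approach works in the other direction by swapping the arguments.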
Reference data set isomorphism is a good starting point for harmonizing reference data sets. However, two reference data sets are not always completely isomorphic, and in those cases deciding whether and how to harmonize them requires some additional insight.