In my last post I discussed isomorphisms among reference data sets, where we looked at some ideas for determining that two reference data sets completely matched. In that situation, there was agreement about the meaning of every value in each of the data sets and that there was a one-to-one mapping of values from one data set to the other.
But what happens when you have two reference data sets that are almost isomorphic, but not exactly? In this case, you might have 100 values in data set A, 102 values in data set B, and 95 of those values map to identical value meanings in a common conceptual domain. These two reference data sets are close to being isomorphic except for the values that lie outside their intersection. Suppose we set a threshold for the ratio of the intersection size to the cardinality of each set. If both ratios meet that threshold (here, 95/100 for A and 95/102 for B), we can say that the two reference data sets are “congruent” or “almost equivalent.”
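The threshold test described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names, the placeholder values, and the 0.9 default threshold are hypothetical, and the values in each set are assumed to have already been mapped to shared value meanings.

```python
def overlap_ratios(a, b):
    """Return the intersection size divided by each set's cardinality."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("both reference data sets must be non-empty")
    common = len(a & b)
    return common / len(a), common / len(b)


def is_congruent(a, b, threshold=0.9):
    """True when the overlap ratio meets the threshold for BOTH sets.

    The 0.9 default is an arbitrary illustrative choice.
    """
    ratio_a, ratio_b = overlap_ratios(a, b)
    return ratio_a >= threshold and ratio_b >= threshold


# Rebuild the example from the text: 100 values in A, 102 in B, 95 shared.
# The value names are placeholders standing in for value meanings.
shared = {f"m{i}" for i in range(95)}
set_a = shared | {f"a{i}" for i in range(5)}
set_b = shared | {f"b{i}" for i in range(7)}

ratio_a, ratio_b = overlap_ratios(set_a, set_b)
print(f"A: {ratio_a:.3f}, B: {ratio_b:.3f}")
print(is_congruent(set_a, set_b))
```

Checking the ratio against both cardinalities matters: a small set that is fully contained in a much larger one would pass a one-sided test even though most of the larger set has no counterpart.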
Can these two reference data sets be harmonized? The underlying question is whether the two reference data sets refer to the same concept, even in the presence of discrepancies between them. And, as often happens, the answer is that it depends on a few qualitative notions, such as:
- Can the discrepancies be attributed to an error in the definition or enumeration process?
- To what extent do the discrepant values differ?
- Is it clear that the two reference data sets truly refer to the same conceptual domain?
In many cases the answers to these questions depend on who is making the decision and for what purpose. As an example, the United States Postal Service (USPS) enumeration of two-character state codes largely overlaps with the FIPS enumeration of two-character state codes. However, the USPS enumeration includes “AA,” “AE,” and “AP” for armed forces delivery, but there are no corresponding values in the FIPS code list. So the USPS and the FIPS reference data sets for states are congruent, but not isomorphic. Can they be harmonized? Let’s look at answers to our questions:
- The discrepancies are not attributable to an error; they were defined via authoritative sources.
- There are a small number of discrepancies, but most of the values overlap and are mapped to the same value meanings.
- If both state code sets are being used for location purposes, then the two reference data sets may be harmonized. If the USPS reference set is being used for parcel delivery and the FIPS reference set is being used for location, then the data sets cannot be harmonized.
In this case, it all boils down to the intent. Even in a case where there are very few discrepancies, the upshot is that the determination for harmonization is based on the semantics, not the format and value enumeration. This highlights the need for managing a broad swath of semantic metadata about reference data as a means for driving the harmonization process.
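As a rough sketch of how recorded intent might drive that decision, the snippet below attaches a hypothetical `intent` field to each reference data set and compares it before anything else. The class, field, and function names are assumptions for illustration, and the three state codes are a small sample rather than the full enumerations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceSet:
    name: str
    intent: str        # hypothetical semantic-metadata field, e.g. "location"
    codes: frozenset


def discrepancies(a, b):
    """Return the codes found only in a and the codes found only in b."""
    return a.codes - b.codes, b.codes - a.codes


def can_harmonize(a, b):
    """Illustrative rule: harmonize only when the recorded intents match."""
    return a.intent == b.intent


# Small illustrative sample of state codes, not the complete lists.
sample_states = frozenset({"CA", "NY", "TX"})
armed_forces = frozenset({"AA", "AE", "AP"})

usps_delivery = ReferenceSet("USPS", "parcel delivery", sample_states | armed_forces)
usps_location = ReferenceSet("USPS", "location", sample_states | armed_forces)
fips_location = ReferenceSet("FIPS", "location", sample_states)

only_usps, only_fips = discrepancies(usps_location, fips_location)
print(sorted(only_usps), sorted(only_fips))
print(can_harmonize(usps_location, fips_location))
print(can_harmonize(usps_delivery, fips_location))
```

The value enumerations are identical in both USPS cases; only the semantic metadata changes the outcome, which is the point of the example.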