In one of my prior posts, I briefly mentioned harmonization of reference data sets: determining when two reference sets refer to the same conceptual domain, and then blending the two data sets into a single conformed standard domain. In some cases this may be simple, especially if there is a single authoritative source for the data set. In that case, all you really need to do is align each copy of the reference data set with the authoritative source.
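For that simple case, aligning a copy with the authoritative source amounts to comparing two lists of values. Here is a minimal sketch in Python, assuming both lists are already loaded into memory; the function name `align_to_authoritative` and the sample codes are made up for illustration.

```python
def align_to_authoritative(copy_codes, authoritative_codes):
    """Report how a local copy differs from the authoritative value domain."""
    copy_set = set(copy_codes)
    auth_set = set(authoritative_codes)
    return {
        # Values the authoritative source has but the copy lacks.
        "missing": sorted(auth_set - copy_set),
        # Values the copy holds that the authoritative source does not.
        "unexpected": sorted(copy_set - auth_set),
    }


# Illustrative subset only, not a complete list of state codes.
authoritative = ["CA", "NY", "TX", "WA"]
# Hypothetical local copy with one missing value and one stray value.
local_copy = ["CA", "NY", "TX", "XX"]

report = align_to_authoritative(local_copy, authoritative)
print(report)
```

An empty report means the copy is aligned. Anything else is a list of values to add to the copy or to investigate.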
However, I can’t imagine that it is often that simple, for a number of reasons, such as:
- Multiple authoritative sources – There may be more than one source for the reference data. An example is two-character US state codes: there is one version from the United States Postal Service (USPS) and another from the National Institute of Standards and Technology (NIST). The USPS list includes postal codes for military mail delivery that are not included in the NIST data set.
- Temporal dependence – There may be differences in versions of reference data sets over time. A good example is USPS ZIP codes. As the population in a ZIP code area grows beyond a specific point, the Postal Service may split the area in two, assigning the original ZIP code to one part and a new ZIP code to the other. A copy of USPS ZIP code data that is not synchronized with the original source will become misaligned with the copies that are synchronized.
- Different semantics – The same value domain (that is, list of values) may be interpreted differently depending on the consumer. The conceptual domain of US states is modified or qualified in numerous ways, even by US government laws, regulations, and agencies. In some cases the concept of a state includes certain territories but not others, and the set of included locations differs from one case to the next. The underlying reason is that the relevance of the inclusion differs depending on the business context (e.g., citizenship vs. taxation).
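One way to bring these differences to the surface is to compare the value domains directly and to record which domain each business context uses. The sketch below is illustrative only: the code sets are small subsets rather than the complete published lists, and the function names and the context names (`mailing`, `statistical_reporting`) are hypothetical.

```python
# Illustrative subsets only, not the complete published lists.
# AA, AE and AP are the USPS codes used for military mail delivery.
usps_codes = {"CA", "NY", "TX", "AA", "AE", "AP"}
nist_codes = {"CA", "NY", "TX"}


def compare_domains(first, second):
    """Split two value domains into shared and source-specific values."""
    return {
        "shared": sorted(first & second),
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }


# Hypothetical business contexts, each bound to the domain it accepts.
context_domains = {
    "mailing": usps_codes,
    "statistical_reporting": nist_codes,
}


def is_valid(code, context):
    """Check a code against the value domain agreed for a given context."""
    return code in context_domains[context]


diff = compare_domains(usps_codes, nist_codes)
print(diff["only_in_first"])
print(is_valid("AE", "mailing"), is_valid("AE", "statistical_reporting"))
```

Making the context an explicit key means the same code can be valid for one consumer and invalid for another. The same comparison can be run between two dated snapshots of a single source to detect drift over time.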
These are just a few of the issues, but they point out some difficulties that must be overcome when attempting to resolve the differences across more than one version of what is believed to be the same conceptual domain. And this underscores what needs to be kept in mind when managing reference data: the beliefs of the data consumers carry significant weight in reducing variation among different representations of conceptual domains. In turn, that reinforces the need for formal methods of management, as well as governance and stewardship, for shared reference domains.