We have looked at two reference data sets whose code values are distinct yet equivalently map to the same conceptual domain. We have also looked at two reference data sets whose value sets largely overlap, though not equivalently. Lastly, we began the discussion about the guidelines for determining when reference data sets can be harmonized. In this last post of this month’s series, let’s look at some practical steps for harmonization.
In this case, we have two reference data sets and have decided to harmonize them into a single surviving reference data set. Here are some basic steps to take:
- Choose a target data set: Determine the level of authority of the two reference domains. Identify the authoritative sources that provided the enumerations. If one source is “more authoritative” (your mileage may vary) than the other, choose that reference data set. Otherwise, use corporate guidance to select one of the data sets as the survivor, thereby designating the other data set as retired.
- Develop transformation: Create the transformation mapping from the values in the retired data set to the survivor data set.
- Retired data values: For each value in the retired data set that is not in the survivor set, determine how to transform the value into one that is in the survivor set.
- Transform data values: For every instance of use of the retired data set, apply the transformation.
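To make the steps above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the customer-type code sets, the `build_mapping` and `transform` helper names, the idea of matching values by their descriptions, and the manual override that folds "municipal" into "government" are all hypothetical and do not come from any particular tool or from the earlier posts.

```python
# Hypothetical reference data sets: code -> description.
# SURVIVOR is the chosen target; RETIRED is the set being phased out.
SURVIVOR = {"RET": "Retail", "WHS": "Wholesale", "GOV": "Government"}
RETIRED = {"R": "retail", "W": "wholesale", "G": "government", "M": "municipal"}

# Manual decisions for retired values that have no direct counterpart.
# Assumption: the business has agreed that "municipal" folds into "GOV".
OVERRIDES = {"M": "GOV"}


def build_mapping(retired, survivor, overrides):
    """Map each retired code to a survivor code.

    Codes are matched on their descriptions (case-insensitive); an entry
    in `overrides` takes precedence. Returns the mapping plus the list of
    retired codes that could not be mapped.
    """
    by_meaning = {desc.lower(): code for code, desc in survivor.items()}
    mapping, unmapped = {}, []
    for code, desc in retired.items():
        target = overrides.get(code) or by_meaning.get(desc.lower())
        if target in survivor:
            mapping[code] = target
        else:
            unmapped.append(code)
    return mapping, unmapped


def transform(records, field, mapping):
    """Return copies of `records` with `field` rewritten to survivor codes."""
    return [{**rec, field: mapping[rec[field]]} for rec in records]


mapping, unmapped = build_mapping(RETIRED, SURVIVOR, OVERRIDES)
if unmapped:
    # Some retired values have no replacement: harmonization is in doubt.
    raise ValueError(f"No survivor value for retired codes: {unmapped}")

records = [{"id": 1, "customer_type": "R"}, {"id": 2, "customer_type": "M"}]
harmonized = transform(records, "customer_type", mapping)
```

The sketch refuses to transform anything while retired values remain unmapped, so the transformation is applied only once every retired value has an agreed replacement.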
If you cannot find an adequate and correct replacement for the retired values, it suggests that perhaps the data sets could not be harmonized after all. Note that the last step hides some additional complexity associated with reference data lineage, and I will address that in a future blog series.