A few weeks back I noted that one of the objectives of an inventory process for reference data was data harmonization, which meant determining when two reference sets refer to the same conceptual domain and harmonizing the contents into a conformed standard domain. Conceptually it sounds relatively straightforward, but as with most data management techniques, its apparent simplicity hides a significant amount of complexity.
First, let's reconsider what reference data “harmonization” really means: taking two distinct data sets (data set A and data set B) that overlap to some measurable extent and merging those two data sets into a single data set. That merging process can be performed in a number of ways, including:
- Taking all the values of data set A and eliminating the non-intersecting values from data set B.
- Taking all the values of data set B and eliminating the non-intersecting values from data set A.
- Taking all of the values from both data set A and data set B.
- Taking some (or all) of the values from data set A and some (or all) of the values from data set B.
- Not merging the sets at all.
Of course, the first four of these five alternatives are just the mechanical last steps of a more complex process of deciding which values ultimately belong to a unified set.
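To make the mechanics concrete, here is a minimal sketch of the alternatives listed above using Python sets. The two country-code sets, the `approved` list, and the function names are all hypothetical, made up purely for illustration; real reference data would carry descriptions and metadata, not just bare codes.

```python
# Hypothetical reference sets: two country-code lists from different systems.
set_a = {"US", "CA", "MX", "GB"}
set_b = {"US", "CA", "FR", "DE"}


def overlap_ratio(a, b):
    """Share of all distinct values that appear in both sets (Jaccard index)."""
    return len(a & b) / len(a | b)


def keep_all_of(primary, secondary):
    """Keep every value of `primary`; drop the values only `secondary` has."""
    return primary | (primary & secondary)  # always equal to `primary`


def take_everything(a, b):
    """Keep every value from both sets."""
    return a | b


def take_selected(a, b, approved):
    """Keep only the values from either set that someone has approved."""
    return (a | b) & approved


# Hypothetical decision made by a data steward, not derived from the data.
approved = {"US", "CA", "MX", "FR"}

merged_a = keep_all_of(set_a, set_b)
merged_b = keep_all_of(set_b, set_a)
merged_all = take_everything(set_a, set_b)
merged_selected = take_selected(set_a, set_b, approved)
```

The set operations themselves are trivial. The hard part is everything that feeds them: knowing that the two sets really describe the same thing, and deciding what belongs in `approved`.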
There are two challenges that must be overcome before we even get to this point. The first, determining that the value domains refer to the same conceptual domain, involves mapping the values in the reference data set to a set of valid value meanings. The second challenge involves identifying the authoritative definitions that guide the choice of valid values. We will examine both of these tasks in the next set of posts.
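As a rough preview of the first challenge, the sketch below maps the values of two code sets to value meanings and then pairs up the codes that share a meaning. The marital-status codes, the meanings, and the function names are hypothetical, and the sketch assumes the meanings have already been normalized to identical strings, which in practice is much of the work.

```python
# Hypothetical value-to-meaning mappings for two systems' marital-status codes.
meanings_a = {"S": "single", "M": "married", "D": "divorced"}
meanings_b = {"1": "single", "2": "married", "3": "divorced", "4": "widowed"}


def crosswalk(a, b):
    """Pair each code in `a` with the code in `b` that has the same meaning."""
    code_by_meaning_b = {meaning: code for code, meaning in b.items()}
    return {
        code: code_by_meaning_b[meaning]
        for code, meaning in a.items()
        if meaning in code_by_meaning_b
    }


def unmatched_meanings(a, b):
    """Meanings that appear in exactly one of the two sets."""
    return set(a.values()) ^ set(b.values())


pairs = crosswalk(meanings_a, meanings_b)
leftovers = unmatched_meanings(meanings_a, meanings_b)
```

If most codes pair up and only a few meanings are left over, that is reasonable evidence the two value domains refer to the same conceptual domain. The leftovers are the values that need a decision.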