The question in the title of this blog post seems sort of unnecessary, right? Since we all seem to tacitly agree that we know what reference data is, it seems silly to try to define it. Yet the assumption that everyone agrees to some abstract definition of reference data hides some of its biggest challenges.
When I wrote this post a few years ago, I focused on the challenge posed by the fact that few organizations had specific roles assigned for accountability over reference data. As I reflected on that recently, I decided to update this post and share some additional context.
The challenges of reference data
Let's look at three challenges of reference data:
- Harmonization of reference data domains.
- Oversight over reference data definitions.
- Governance and accountability for enterprise use.
Standardized reference data is critical to the enterprise. It provides the basic vocabulary that describes entities from core business domains and it's a framework for understanding data that’s shared among an organization’s data consumers. But when organizations fail to apply coordinated governance to reference data management, problems surface. Usually, a variety of teams and projects default to creating, redefining and reinterpreting the semantics of enumerations of values under the rubric of “reference data.”
The absence of coordination and enterprise control for reference data management complicates matters further when data sets from different sources are brought together. This is especially the case when differences in reference data values that were presumed to be minimal end up posing difficulties in data integration and consolidation.
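To make the integration problem concrete, here is a minimal sketch of harmonizing two sources that encode the same state differently. The source names ("billing" and "crm"), the crosswalk tables, and the harmonize function are all hypothetical and chosen only for illustration; they do not describe any particular tool.

```python
# Canonical reference data domain agreed on for the enterprise (illustrative excerpt).
CANONICAL_STATE_CODES = {"US-MD", "US-VA"}

# Hypothetical crosswalks: each source's local values mapped to canonical codes.
CROSSWALKS = {
    "billing": {"MD": "US-MD", "VA": "US-VA"},
    "crm": {"Maryland": "US-MD", "Virginia": "US-VA"},
}


def harmonize(source: str, value: str) -> str:
    """Translate a source-specific value into the canonical code.

    Raises ValueError when the value has no agreed mapping, so that
    unmapped values are surfaced instead of silently passing through.
    """
    code = CROSSWALKS[source].get(value.strip())
    if code not in CANONICAL_STATE_CODES:
        raise ValueError(f"no canonical mapping for {value!r} from source {source!r}")
    return code


# "MD" and "Maryland" do not match as raw strings, but they do once harmonized.
same_state = harmonize("billing", "MD") == harmonize("crm", "Maryland")
```

The point of the sketch is that the crosswalk is itself something that needs an owner: without one, each integration project tends to build its own mapping.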
Understanding the definition of reference data
To assert control, we might start with a definition of reference data that segregates those data sets from ones that are directly “owned” within an operational domain or business function. Malcolm Chisholm defines reference data like this:
"Reference data is any kind of data that is used solely to categorize other data found in a database, or solely for relating data in a database to information beyond the boundaries of the enterprise."[1]
Wikipedia has a definition as well, although it is a little less precise:
“Reference data is data that defines the set of permissible values to be used by other data fields. Reference data gains in value when it is widely reused and widely referenced. Typically, it does not change overly much in terms of definition (apart from occasional revisions).”
As an example, consider the ISO 3166-2 list of codes for US states. This reference data list categorizes one aspect of a geolocation hierarchy associated with records in a data set. The code “US-MD” represents the geopolitical entity known as the “state of Maryland,” which is one of the “states of the United States.” The use of the code “US-MD” either categorizes the location of a described entity or somehow relates a database record to the concept of the “state of Maryland.” To extend this example, the code can be used to categorize the location of a street address, the issuer of a driver’s license, or the state in which an individual is registered to vote.
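The Maryland example can be sketched in a few lines of code. The lookup table below is a three-entry excerpt for illustration, and the record layout and function names are assumptions made for this sketch rather than part of any standard.

```python
# Illustrative excerpt of state codes; a real list would come from the
# authoritative source rather than being typed in by hand.
US_STATE_CODES = {
    "US-MD": "Maryland",
    "US-NY": "New York",
    "US-VA": "Virginia",
}


def state_name(code: str) -> str:
    """Relate a code to the concept it stands for; reject non-permissible values."""
    try:
        return US_STATE_CODES[code]
    except KeyError:
        raise ValueError(f"{code!r} is not a permissible state code") from None


# Hypothetical records: the same code categorizes different kinds of data.
records = [
    {"kind": "street_address", "state": "US-MD"},
    {"kind": "drivers_license_issuer", "state": "US-MD"},
    {"kind": "voter_registration", "state": "US-VA"},
]


def group_by_state(rows):
    """Categorize records by the state their code refers to."""
    groups = {}
    for row in rows:
        groups.setdefault(state_name(row["state"]), []).append(row["kind"])
    return groups
```

Here the code list does both jobs from the definition: it categorizes the records, and it relates each one to a concept defined outside the enterprise.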
Malcolm’s definition focuses on the use of reference data while the Wikipedia definition starts out focusing on origin (“defines a set of permissible values”) and then talks about reuse. Interestingly, the Wikipedia definition discusses the relative static-ness of reference data, which is a characteristic I would have pointed out as well.
Characteristics of reference data
A few years ago, I tried to assemble a revised definition that built on these two definitions as starting points. In retrospect, it might be more helpful to list the relevant characteristics of reference data:
- Reference data is any kind of data set defining a set of permissible values for other data elements.
- The collection of values within a reference data domain has contextual meaning and (in some cases situational) semantics.
- Reference data is used to categorize other data found in a database.
- Reference data is used to relate data in a database to information beyond the boundaries of the enterprise.
- Reference data sets are accepted as standards.
- Reference data sets are widely reused and referenced.
- Reference data sets are shared among a community of data consumers.
- Reference data sets change slowly.
- Revisions to reference data must be made under the authority of a reference data steward.
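Several of these characteristics can be expressed as a small data structure. The following is only a sketch under stated assumptions: the ReferenceDataSet class, its field names, and the version counter are hypothetical, and real stewardship involves review and approval processes that a simple name check cannot capture.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ReferenceDataSet:
    """A shared set of permissible values with a single accountable steward."""

    name: str
    steward: str              # the person or role with authority over revisions
    values: Dict[str, str]    # code -> meaning within this domain
    version: int = 1          # bumped on each revision; changes are expected to be rare

    def is_permissible(self, code: str) -> bool:
        """Check whether another data element may take this value."""
        return code in self.values

    def revise(self, requested_by: str, code: str, meaning: str) -> None:
        """Add or redefine a value, but only under the steward's authority."""
        if requested_by != self.steward:
            raise PermissionError(
                f"{requested_by!r} is not the steward of {self.name!r}"
            )
        self.values[code] = meaning
        self.version += 1


# Hypothetical usage: one steward, many consumers.
states = ReferenceDataSet(
    name="us_states",
    steward="reference-data-steward",
    values={"US-MD": "Maryland"},
)
states.revise("reference-data-steward", "US-VA", "Virginia")
```

Keeping the steward and version alongside the values makes the ownership question explicit instead of leaving it to whichever team last edited the list.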
Combining and then “teasing apart” these two definitions allows us to add detail about context and semantics. It also provides a starting point for reference data management and, ultimately, reference data governance.
Read another post on the topic of reference data governance.
[1] Chisholm, Malcolm, “The Foundation of Successful Reference Data Management”