One aspect of high-quality information is consistency. We often think about consistency in terms of consistent values. A large portion of the effort expended on “data quality dimensions” essentially focuses on data value consistency. For example, when we describe accuracy, what we often mean is consistency with a defined source of record.
You might say that consistency of data values provides a single-dimensional perspective on data usability. On the one hand, consistent values enable consistency in query results – we always want to make sure that our JOINs are executing correctly. Yet as more diverse data sets are added to the mix, we continue to see scenarios where the issue of consistency extends way beyond the realm of the values themselves.
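To make the value-consistency point concrete, here is a minimal sketch (with hypothetical tables and values) of how an exact-match join silently drops rows when two sources encode the same fact differently:

```python
# Hypothetical example: two sources record the same state inconsistently,
# so an exact-match lookup/join silently loses a row.
orders = [
    {"order_id": 1, "state": "NY"},
    {"order_id": 2, "state": "CA"},
]
# One source uses the full state name, the other an abbreviation.
tax_rates = {"New York": 0.08875, "CA": 0.0725}

matched = [o for o in orders if o["state"] in tax_rates]
unmatched = [o for o in orders if o["state"] not in tax_rates]
print(len(matched), len(unmatched))  # 1 1 -- order 1 is lost: "NY" != "New York"
```

The query "executes correctly" in the mechanical sense, which is exactly why the dropped row is easy to miss.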
This growing issue is what I call “interpretive consistency,” or the need for consistent definitions of data elements in data sources that originate from different places. Imagine a scenario where we examine data that comes from different sources. Even if there are metadata descriptions for the different data elements, there has probably been no collaboration on the scope or specifics of those definitions, or on additional facets of semantics such as the time frames in which the definitions are valid.
As an example, consider time-series analyses associated with geographic regions differentiated by ZIP code. A retail company might want to analyze types of product sales within specific ZIP code regions as demographic characteristics have changed over time. However, over the time period being analyzed there may have been events that changed the details of how the US Postal Service (USPS) maps ZIP code designations to geographic regions. Simply put, the USPS sometimes splits ZIP codes and assigns new numbers. So if you're looking at the demographics associated with a specific ZIP code area, check to make sure the area has not been split during the time frame of the data you're examining.
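That check can be automated. The sketch below assumes a hypothetical split log (real boundary changes would come from a ZIP code crosswalk file, not from this invented data) and flags any ZIP code that was split inside the analysis window:

```python
from datetime import date

# Hypothetical USPS-style split log: (parent ZIP, new ZIP, effective date).
# These entries are illustrative only.
zip_splits = [
    ("10001", "10118", date(2015, 7, 1)),
    ("94110", "94158", date(2018, 1, 1)),
]

def splits_during(zip_code, start, end):
    """Return the splits affecting zip_code within [start, end]."""
    return [s for s in zip_splits
            if s[0] == zip_code and start <= s[2] <= end]

# An analysis window covering 2014-2016 includes the 2015 split of 10001,
# so demographics for that ZIP before and after are not directly comparable.
print(splits_during("10001", date(2014, 1, 1), date(2016, 12, 31)))
```

Running this kind of guard before a time-series comparison turns a silent interpretive error into an explicit warning.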
This is just a simple example. It gets more complex when the analysis focuses on relationships of data values in data elements that seem to mean the same thing but in reality have slight differences. Recognizing the existence of these differences is one thing – attempting to resolve them by reaching out to the owners of the data sources is another. In many cases, neither is easy to do. And in some cases, getting an accurate definition from the source owner is impossible. As a result, the analyst is left to interpret what each data element means.
More specifically, the data analysts are left to independently provide their own interpretations. And that's when we get into trouble, because one analyst's interpretation might differ slightly (or widely) from another's.
This suggests the need to balance self-service data discovery and analysis with controls for collaborating on and harmonizing the semantics of the data sets under review. Introducing stewardship controls is one way to do this. Such controls encourage analysts to log their definitions, compare them with others’ definitions, and work together to reconcile interpreted semantics across the enterprise. They also limit situations where variations in inferred meanings lead to analyses with conflicting results. Introducing processes for semantic harmonization provides some oversight over “free” (that is, “unfettered”) data science.
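A stewardship log of this kind can be as simple as a list of (element, analyst, definition) records. The sketch below, with hypothetical element names and definitions, shows how logged definitions make conflicts mechanically detectable:

```python
from collections import defaultdict

# Minimal sketch of a stewardship log. Each analyst records the definition
# they inferred for a data element; all names here are hypothetical.
definition_log = [
    ("customer_since", "alice", "date of first purchase"),
    ("customer_since", "bob", "date the account was created"),
    ("zip_code", "alice", "5-digit USPS ZIP at time of order"),
]

def conflicting_elements(log):
    """Return data elements for which analysts logged differing definitions."""
    defs = defaultdict(set)
    for element, _analyst, definition in log:
        defs[element].add(definition)
    return sorted(e for e, d in defs.items() if len(d) > 1)

print(conflicting_elements(definition_log))  # ['customer_since']
```

The point is not the tooling but the process: once definitions are logged side by side, the conflicting ones surface automatically and can be routed to stewards for harmonization.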