In 2013, I wrote a series of four blog posts that began with a review of a section of Lewis Carroll’s “Through the Looking-Glass” in which the White Knight tries to explain to Alice the details of a song. The selection, which centered on the White Knight’s differentiation of the name of the song, what the song is called, what the name of the song is called, and what the song “was,” is quite humorous from a semanticist’s perspective. In the post, I noted the similarity between the literary example and taxonomies and hierarchical inheritance.
Now that a handful of years have passed, I thought it would be worth revisiting that blog series. At the time I was trying to illustrate two points. First, from a data consumer’s point of view, there are different ways to deconstruct the meanings of content and align those meanings within known definition hierarchies, but each interpretation will be limited by the consumer’s own biases. Second, the meanings a consumer infers may be disconnected from the intent of the original data producer.
It's interesting that between the time I wrote the original blog post and today, there has been a general recognition of the challenges in inferring the meaning of data when there is little provided context. It seems that a whole industry of data preparation tools has evolved around empowering communities of data consumers to:
- Profile acquired data sets.
- Apply protocols for data organization and assignment of meaning.
- Configure the data into information packets that the consumers can comfortably make sense of.
At the same time, organizations have become more focused on introducing data governance techniques that set boundaries on how “creative” a data analyst might be in attempting to interpret a mash-up of acquired or managed data sets. This is done by formalizing a standard vocabulary that embraces data element concepts and standardized definitions for commonly used terms and concepts. Enforcing the use of these types of governed standards can at the very least align different data consumers’ interpretations – so their analytical results are less likely to conflict due to differing semantics.
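As a rough illustration of what a governed vocabulary can look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the glossary terms, their definitions, the aliases and the `align_columns` helper are hypothetical and do not come from any particular governance tool. It only shows the idea of resolving the names different consumers encounter to one agreed term, and flagging the names that have no governed meaning yet.

```python
# Hypothetical governed vocabulary: each standard term carries one agreed
# definition plus the aliases that have been seen in acquired data sets.
GLOSSARY = {
    "customer_id": {
        "definition": "Unique identifier assigned to a customer at account creation.",
        "aliases": {"cust_id", "customerid", "client_no"},
    },
    "order_total": {
        "definition": "Order amount including tax, excluding shipping.",
        "aliases": {"total", "order_amt", "amount"},
    },
}


def build_alias_index(glossary):
    """Map every standard term and every alias to its standard term."""
    index = {}
    for term, entry in glossary.items():
        index[term] = term
        for alias in entry["aliases"]:
            index[alias] = term
    return index


def align_columns(columns, glossary=GLOSSARY):
    """Resolve column names against the glossary.

    Returns a (mapping, unresolved) pair: mapping goes from the original
    column name to the governed term, and unresolved lists the columns
    that have no governed definition and need a steward's attention.
    """
    index = build_alias_index(glossary)
    mapping = {}
    unresolved = []
    for col in columns:
        key = col.strip().lower()  # compare case-insensitively
        if key in index:
            mapping[col] = index[key]
        else:
            unresolved.append(col)
    return mapping, unresolved


if __name__ == "__main__":
    mapping, unresolved = align_columns(["Cust_ID", "amount", "region"])
    print(mapping)
    print(unresolved)
```

The useful part is the unresolved list. Rather than letting each analyst guess what an unfamiliar column means, the gap is surfaced so that a definition can be agreed once and reused.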
So what has changed over these past four years? If anything, I'd suggest that the challenges of standardizing inferred semantics have become more complex. There has been an increase in the use of non-relational, NoSQL-style databases that do not impose strict schema-on-write constraints. The emergence of the data lake as an enterprise data management framework has allowed greater accumulation of a wide variety of data assets that are not subjected to significant pre-storage controls. And the growth of the role of the algorithmic data scientist has created a large community of data consumers who do not necessarily have what might be called “classical training” in data management.
Altogether, this suggests that even with increased data governance and scrutiny, it's likely that creative license will still be applied to data analysis and the creation of value-added information products. As a result, data management professionals must remain even more vigilant in operationalizing data governance. Increased data awareness, collaboration and measurable compliance with defined data policies will help ensure consistency in analysis and interpretation of results.