The data quality and data governance community has a somewhat disconcerting habit of appending the word “quality” to every phrase containing the word “data.” So it is no surprise that the growing use of the phrase “big data” has been duly followed by claims of the need for “big data quality” and “big data governance.”
Using internal data sources allows you to institute data quality monitoring and measurement within the data production flow. These inspections can alert data stewards when a process is detected introducing data errors. At that point, the data steward can initiate remedial action to address the flaw and ensure the production of high-quality information.
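As a rough illustration, here is a minimal sketch of what such an in-flow inspection might look like. The field names, validation rules, threshold, and alerting hook are all hypothetical, invented for the example rather than taken from any particular pipeline:

```python
# Hypothetical validation rules for an internal record feed; field names
# and thresholds are illustrative only.
RULES = {
    "customer_id": lambda v: v is not None and str(v).strip() != "",
    "email": lambda v: isinstance(v, str) and "@" in v,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}

def inspect(record: dict) -> list[str]:
    """Return the names of fields that fail their validation rule."""
    return [f for f, rule in RULES.items() if not rule(record.get(f))]

def process_batch(records: list[dict], alert_threshold: float = 0.05) -> float:
    """Run inspections inside the production flow and alert the data
    steward if the batch's failure rate exceeds a threshold."""
    failures = 0
    for record in records:
        bad_fields = inspect(record)
        if bad_fields:
            failures += 1
            # A real pipeline might route these to a quarantine queue.
            print(f"record {record.get('customer_id')!r} failed: {bad_fields}")
    rate = failures / max(len(records), 1)
    if rate > alert_threshold:
        notify_steward(rate)
    return rate

def notify_steward(rate: float) -> None:
    # Placeholder: a real implementation might email or page the steward.
    print(f"ALERT: batch failure rate {rate:.1%} exceeds threshold")
```

The point is that the check sits inside the production flow, so a flaw can be traced back to the process that introduced it while there is still someone with the authority to fix it.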
However, the characteristics of the data sets and data streams used for big data analytics projects differ from those of typical data sources created as a byproduct of internal transaction processing or operational systems. These applications often absorb massive data sets from external sources whose creation points are far removed from their repurposed uses, well beyond the administrative authority of anyone within the company.
That means that the traditional mantras of the data quality experts (such as “validate the data at the source” and “eliminate data defects”) do not apply. There are no places within the data production stream to institute inspection and monitoring, nor does the data steward have any means of influencing the quality of the data production process. Essentially, what you see is what you get.
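To make “what you see is what you get” concrete: about the only move left is to profile the data as it arrives and report its observed quality, rather than enforce quality at a source you cannot touch. A minimal sketch, with an invented file name and field list:

```python
import csv
from collections import Counter

def profile_feed(path: str, expected_fields: list[str]) -> dict:
    """Profile an external feed we don't control: measure, don't enforce.

    Returns per-field completeness (share of non-empty values); the feed
    itself is accepted as-is, since there is no upstream process to fix.
    """
    non_empty = Counter()
    total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for field in expected_fields:
                if (row.get(field) or "").strip():
                    non_empty[field] += 1
    return {field: non_empty[field] / max(total, 1) for field in expected_fields}

# Example: annotate a (hypothetical) third-party feed with observed quality
# so downstream analysts can judge its fitness for use.
scores = profile_feed("vendor_feed.csv", ["customer_id", "email", "country"])
for field, completeness in scores.items():
    print(f"{field}: {completeness:.1%} complete")
```

Note what this sketch does not do: it fixes nothing upstream. It only characterizes what arrived, which is the most a consumer of external data can honestly claim.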
So what does it really mean to advocate for “big data governance”? You might say that any scenario in which you cannot exercise control over the production process is not truly amenable to governance, since there is no way to eliminate the root causes of any data failures. On the other hand, perhaps governance, oversight and stewardship need to be redefined for the concepts to be meaningful in the big data context.
1 Comment
I would agree. Part of the excitement in big data is its lack of governance, and the skill (value add) organisations can provide is in managing this inherent lack of governance to extract value. However, is this [lack of data governance] something new? Finance, Meteorology and Engineering have for years had to appraise the quality of multiple data feeds (internal and external to their organisation) to run their business.