In my last post, we started to look at some of the issues with the concept of “big data governance,” especially when a large part of governance is intended to prevent the introduction of errors into data sets. Many big data analytics applications focus on the intake of numerous, varied data sets acquired from external sources. By the time the data has been brought into the organization, it is too late to have any impact on the data creation process, so preventing errors is out the window.
In fact, the problem is much worse than that, for two reasons. First, in many cases the data sets being used are not only created by parties outside the administrative domain, but the internal users may also have no idea where the data originally came from. For example, the public US federal transparency data sets published at www.data.gov are created solely for the purpose of posting the data to the web site. The values populating those data sets, however, may have come from numerous internal applications designed and implemented to support specific business functions, so the resulting data sets are effectively created without any of the original context.
That means that the actual details of the originating system are completely lost, often including the technical/structural metadata (such as data types and lengths) as well as the more important business metadata such as data element definitions and reference data domains. The user of the data is compelled to manufacture the semantics based on intuition and context, but not much else.
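To make that concrete, here is a minimal sketch of what "manufacturing" the structural metadata looks like in practice: profiling a delimited file to guess each column's type and length from the values alone. The sample data, the column names, and the `infer_type` and `profile` helpers are all hypothetical and invented for illustration. They are not drawn from any real data.gov data set.

```python
import csv
import io


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for name, caster in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"


def profile(csv_text):
    """Return guessed structural metadata for each column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text), restval="")
    rows = list(reader)
    report = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        report[column] = {
            "inferred_type": infer_type(values),
            "max_length": max((len(v) for v in values), default=0),
            "distinct_values": len(set(values)),
        }
    return report


# Hypothetical extract with no accompanying data dictionary.
SAMPLE = "agency_code,amount,state\n012,1500.50,MD\n7,200,VA\n012,,MD\n"

report = profile(SAMPLE)
for column, facts in report.items():
    print(column, facts)
```

Notice that `agency_code` comes back as an integer, even though the leading zero in `012` hints that it is really an identifier drawn from some reference domain. The profiler can recover plausible types and lengths, but it cannot recover the definition of the data element, which is exactly the business metadata that was lost.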
The second reason is that attempting to change the data means that you are potentially introducing biases into the data sets you are about to subject to analysis. This means that your hands are tied when it comes to performing any activity that potentially changes the meaning of the data, such as “cleansing” the data or eliminating duplicate records.
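A small sketch shows how even an innocent-looking cleanup can move the numbers. The records and the `drop_exact_duplicates` helper below are hypothetical. Without the original context, there is no way to tell whether the two identical rows are one transaction recorded twice or two genuine transactions, yet each choice produces a different answer.

```python
from statistics import mean

# Hypothetical records acquired from an external source.
records = [
    {"name": "J. Smith", "zip": "20901", "amount": 100.0},
    {"name": "J. Smith", "zip": "20901", "amount": 100.0},
    {"name": "A. Jones", "zip": "22030", "amount": 40.0},
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row whose fields all match."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


raw_mean = mean(r["amount"] for r in records)
dedup_mean = mean(r["amount"] for r in drop_exact_duplicates(records))

print(raw_mean)    # 80.0
print(dedup_mean)  # 70.0
```

Neither figure is self-evidently the right one. Eliminating the duplicate is a decision about what the data means, and the analyst is making that decision on the data producer's behalf.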
So you cannot control the creation of the data and are limited in correcting the data. But aren’t those the main objectives of data governance? Seems like we have a little bit of a conflict here…