In my last post, I noted two key issues that arise when there is a desire to impose governance over large-scale data sets imported from outside the organization: the absence of control and the absence of semantics. Of course, we cannot just throw up our hands and say that the data is ungovernable. Rather, we have to examine what governance is intended to accomplish in light of these constraints.
One approach is to reframe the question, which leads to some alternatives for governance. Instead of treating governance as a way of controlling the creation of data and the processes that touch it within the production cycle, consider governance as a means of managing expectations regarding the consumption and usability of the data.
This is a more practical approach, especially considering that in most cases, reports and analyses driven by big data are not likely to be slowed or halted by questions raised about the processes used to create the source data. In addition, many big data environments may be designed to stream data from real-time semi-structured or unstructured sources that either have no predefined metadata or are subject to structural changes rapid enough to limit our ability to presuppose rules about format and structure.
The orientation I am suggesting in this post covers two facets of data utilization. Consumption looks at the business scenarios in which the big data environment is used and what the expectations are from a high-level functional perspective. Usability refers to the degree to which the expected outcomes are skewed by data issues and the users’ level of tolerance for that skew.
We can compare two different business applications. One uses numerous data sources to develop customer profiles for marketing purposes. With a large enough data set and a plethora of attribute variables in the analysis, there is some tolerance for missing or incorrect values because the ultimate results are still usable. Even if some customers are classified incorrectly, the marketing lift can still largely be achieved.
On the other hand, a big data analytics system for identifying fraud in real time must be much more sensitive to missing or incorrect data. Flagging legitimate transactions as fraudulent and denying them can have a negative impact on customer satisfaction, especially if blocking the transaction causes the customer inconvenience or hardship.
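To make the contrast concrete, here is a minimal sketch, in Python, of how a consumer's tolerance for skew might be expressed as a threshold on a measurable quality dimension such as completeness. The field names and threshold values are purely illustrative assumptions, not recommendations.

```python
# Illustrative sketch: a consumer's tolerance for "skew" expressed as a
# threshold on a measurable quality dimension (here, completeness).
# Field names and thresholds are hypothetical.

def completeness(records, field):
    """Fraction of records with a non-missing value for the given field."""
    if not records:
        return 0.0
    populated = sum(1 for r in records if r.get(field) not in (None, ""))
    return populated / len(records)

# A marketing team might tolerate a fair amount of missing data;
# a real-time fraud team far less.
MARKETING_THRESHOLD = 0.85
FRAUD_THRESHOLD = 0.99

records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
]

score = completeness(records, "email")
print(f"completeness={score:.2f}")
print(f"usable for marketing profiling? {score >= MARKETING_THRESHOLD}")
print(f"usable for real-time fraud detection? {score >= FRAUD_THRESHOLD}")
```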
In this vein, governance can be interpreted as the process of identifying the usage scenarios, engaging the data consumers, and understanding their expectations in ways that can be asserted as measures of data usability. Continuous measurement and monitoring of those assertions can alert the business users when the quality of the data dips below their expectations. Even if the data stewards would not necessarily change the data, they can inform the business users about the risks of using results that may be affected by data flaws.
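As a rough illustration of what such continuous measurement and monitoring might look like, the sketch below (again in Python, with hypothetical names such as UsabilityAssertion and notify_consumers) evaluates agreed-upon usability assertions against each incoming batch and advises the consumers when a score falls below their tolerance, without altering the data itself.

```python
# Sketch of monitoring usability assertions and alerting data consumers
# when measured quality dips below their stated expectations.
# All names and thresholds here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class UsabilityAssertion:
    name: str                          # e.g. "email completeness for marketing"
    measure: Callable[[list], float]   # computes a score between 0 and 1
    threshold: float                   # the consumer's agreed tolerance
    consumers: List[str]               # who to advise when the assertion fails

def notify_consumers(assertion: UsabilityAssertion, score: float) -> None:
    # In practice this might feed a dashboard or send an email;
    # note that the data itself is not changed.
    for consumer in assertion.consumers:
        print(f"ALERT to {consumer}: '{assertion.name}' scored {score:.2f}, "
              f"below the agreed threshold of {assertion.threshold:.2f}; "
              f"use downstream results with caution.")

def monitor(assertions: List[UsabilityAssertion], batch: list) -> None:
    """Evaluate each assertion against the latest batch of records."""
    for assertion in assertions:
        score = assertion.measure(batch)
        if score < assertion.threshold:
            notify_consumers(assertion, score)

# Example: one assertion tied to the marketing scenario described above.
assertions = [
    UsabilityAssertion(
        name="email completeness for marketing profiles",
        measure=lambda recs: (sum(1 for r in recs if r.get("email")) / len(recs)) if recs else 0.0,
        threshold=0.85,
        consumers=["marketing-analytics"],
    ),
]

batch = [{"email": "a@example.com"}, {"email": None}, {"email": "b@example.com"}]
monitor(assertions, batch)   # completeness is 0.67, so an advisory is printed
```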