Data governance can encompass a wide spectrum of practices, many of which focus on the development, documentation, approval and deployment of policies associated with data management and use. I distinguish the "operational" facet of data governance from the broader practice in order to focus specifically on the tasks data stewards and data quality practitioners perform to ensure compliance with defined data policies.
The life cycle for data quality policies includes the determination of data validity rules, their scope of implementation, the method of measurement, the institution of monitoring and the associated stewardship procedures. Those stewardship duties include:
- Evaluating the source data to identify any potential data quality rules.
- Deploying the means for validating the data against the defined rules (a minimal sketch of this step follows the list).
- Investigating the root causes of any identified data flaw.
- Alerting the accountable individuals when an issue appears in the data flow.
- Managing the workflow to ensure that the root cause is addressed.
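To make those duties a bit more concrete, here is a minimal sketch of how defined validity rules might be deployed against source records, with any failures routed to stewardship review. The rule names, fields and thresholds are hypothetical examples for illustration, not the output of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidityRule:
    name: str
    check: Callable[[dict], bool]  # returns True when the record passes

# Hypothetical data validity rules a steward might define for a customer feed
RULES = [
    ValidityRule("customer_id_present", lambda r: bool(r.get("customer_id"))),
    ValidityRule("age_in_range", lambda r: 0 <= r.get("age", -1) <= 120),
    ValidityRule("state_code_known", lambda r: r.get("state") in {"NY", "CA", "TX"}),
]

def validate(records):
    """Return (record, failed_rule_names) pairs for records that violate any rule."""
    exceptions = []
    for record in records:
        failed = [rule.name for rule in RULES if not rule.check(record)]
        if failed:
            exceptions.append((record, failed))
    return exceptions

# Each exception feeds the stewardship workflow: alert the accountable owner,
# investigate the root cause and track the issue to resolution.
sample = [
    {"customer_id": "C001", "age": 34, "state": "NY"},
    {"customer_id": "", "age": 250, "state": "ZZ"},
]
for record, failures in validate(sample):
    print(record, "failed:", failures)
```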
As data volumes grow and the data sources become more diverse, we need to scale these processes for three key reasons:
- The effort involved in manual review of the source data sets will overwhelm analysts, preventing a timely assessment of source data quality rules.
- The need to hand-code validation routines demands database practitioners skilled in data organization, data access and extraction, and data quality.
- The challenge of investigating each potential issue is compounded when there is little or no traceability regarding its manifestation.
In other words, as data volume and diversity increase, the ability to institute control and stewardship diminishes. However, if we reconsider what data validation means, it provides the framework upon which a more comprehensive set of procedures can be layered.
Data validation defined
So what do we mean by data validation? One thing to recall is that those managing a reporting and analytics environment are not the owners of the data that's loaded into the data warehouse. Rather, they marshal the data from the sources into the analytical data warehouse, from which the downstream consumers execute queries and run reports. The challenge is that if any of those data consumers perceive a problem with the data, they blame the data warehouse – even if the data flaw appeared in the original source.
Data validation gives data warehouse owners a way to accomplish two critical goals. First, it provides a means for identifying whether any data flaws have been introduced by the data integration process. By placing validation filters at strategic points along the data lineage, from the data acquisition point to delivery into the data warehouse, you can flag inconsistencies or otherwise unexpected data values. Second, it provides a way to flag data issues that propagate from the original data source. Cataloging an inventory of source data issues has two key benefits: you can effectively communicate to your data provider that the flaws exist, and you can assure your downstream consumers that you did not introduce them.
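As a rough illustration of that idea, the sketch below applies the same validity rule at two checkpoints: once on the records as acquired from the source and once on the records produced by the integration process. Violations can then be attributed either to the source or to the integration steps in between. The checkpoint placement, field names and rule are assumptions made for the example only.

```python
def count_violations(records, rule):
    """Count how many records fail a single validity rule."""
    return sum(1 for r in records if not rule(r))

# One hypothetical rule; a real catalog would hold the full set of defined rules.
rules = {"non_negative_amount": lambda r: r.get("amount", 0) >= 0}

# Records observed at two checkpoints along the lineage (illustrative data).
source_records = [{"amount": 120}, {"amount": -5}]                      # as acquired from the provider
warehouse_records = [{"amount": 120}, {"amount": -5}, {"amount": -2}]   # after integration, before load

acquired = {name: count_violations(source_records, rule) for name, rule in rules.items()}
loaded = {name: count_violations(warehouse_records, rule) for name, rule in rules.items()}

for name in rules:
    if acquired[name] > 0:
        print(f"{name}: {acquired[name]} violation(s) propagated from the source")
    if loaded[name] > acquired[name]:
        print(f"{name}: {loaded[name] - acquired[name]} violation(s) introduced during integration")
```

Comparing the counts at each checkpoint is what lets you tell your data provider about flaws that arrived with the source, while assuring downstream consumers about flaws you did not introduce.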
In practice, data validation is a form of insurance: it ensures that data quality issues can be identified, tracked and managed in a controlled way. The challenge, which we will discuss in our next post, is how to scale your data validation program as the volume and diversity of your input data increase.