Data validation as an operational data governance best practice, Part 1

Data governance can encompass a wide spectrum of practices, many of which focus on the development, documentation, approval and deployment of policies associated with data management and utilization. I distinguish the facet of “operational” data governance from that broader practice to focus specifically on the operational tasks data stewards and data quality practitioners perform to ensure compliance with defined data policies.

The life cycle for data quality policies includes the determination of data validity rules, their scope of implementation, the method of measurement, institution of monitoring and the affiliated stewardship procedures. Those stewardship duties include:

  • Evaluating the source data to identify any potential data quality rules.
  • Deploying the means for validating the data against the defined rules (a minimal sketch follows this list).
  • Investigating the root causes of any identified data flaw.
  • Alerting any accountable individuals that an issue has appeared within the data flow.
  • Managing the workflow to ensure that the root cause is addressed.
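
To make the first two duties above concrete, the following is a minimal sketch of what rule-driven validation might look like. It is illustrative only: the rules, field names and sample records are hypothetical, and it assumes records arrive as simple Python dictionaries rather than through any particular data integration tool.

# Illustrative sketch of rule-driven data validation (hypothetical rules, field names and records).
# Each rule is a named predicate; records that fail any rule are collected for stewardship review.

from datetime import date

RULES = {
    "customer_id is populated": lambda r: bool(r.get("customer_id")),
    "birth_date is not in the future": lambda r: r.get("birth_date") is None or r["birth_date"] <= date.today(),
    "country_code is two letters": lambda r: isinstance(r.get("country_code"), str) and len(r["country_code"]) == 2,
}

def validate(records):
    """Apply every rule to every record; return a list of violations for steward review."""
    violations = []
    for index, record in enumerate(records):
        for rule_name, predicate in RULES.items():
            if not predicate(record):
                violations.append({"record_index": index, "rule": rule_name, "record": record})
    return violations

if __name__ == "__main__":
    sample = [
        {"customer_id": "C001", "birth_date": date(1980, 5, 1), "country_code": "US"},
        {"customer_id": "", "birth_date": date(2999, 1, 1), "country_code": "USA"},
    ]
    for violation in validate(sample):
        # In a real deployment, this is where alerting or workflow management would be triggered.
        print(f"Record {violation['record_index']} failed rule: {violation['rule']}")

The specific checks matter less than the separation of rule definition from rule execution; that separation is what the stewardship duties above, and the scaling discussion below, depend on.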

As data volumes grow and the data sources become more diverse, we need to scale these processes for three key reasons:

  • The effort involved in manual review of the source data sets will overwhelm analysts, preventing a timely assessment of source data quality rules.
  • The need to hand-code validation routines demands database practitioners skilled in data organization, data access and extraction, and data quality.
  • The challenge of investigating each potential issue is compounded when there is little or no traceability regarding its manifestation.

In other words, as data volume and diversity increase, the ability to institute control and stewardship diminishes. However, if we reconsider what data validation means, it provides the wireframe upon which a more comprehensive set of procedures can be layered.

Data validation defined

So what do we mean by data validation? One thing to recall is that those managing a reporting and analytics environment are not the owners of the data that's loaded into the data warehouse. Rather, they marshal the data from the sources into the analytical data warehouse, from which the downstream consumers execute queries and run reports. The challenge is that if any of those data consumers perceive a problem with the data, they blame the data warehouse – even if the data flaw appeared in the original source.

Data validation helps data warehouse owners accomplish two critical goals. First, it provides a means for identifying whether any data flaws have been introduced by the data integration process. By placing validation filters at strategic points along the data lineage, from the data acquisition point to its delivery into the data warehouse, you can flag any inconsistencies or otherwise unexpected data values. Second, it provides a way to flag any data issues that propagate from the original data source. Cataloging an inventory of source data issues has two key benefits: you are able to effectively communicate to your data provider the existence of data flaws, and you have a means of assuring your downstream consumers that you did not introduce those flaws.
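
As a rough illustration of both goals, the sketch below applies the same check at two hypothetical points in the flow: once when data is acquired from the source and once after integration. The stage names, the rule and the sample records are assumptions made for illustration; the point is that recording where a violation was first observed distinguishes flaws inherited from the source from flaws introduced by the integration process.

# Illustrative sketch: validation checkpoints along the data lineage (stage names are hypothetical).
# A violation first seen at "source_acquisition" is attributed to the data provider; one that
# appears only at "post_integration" points to the integration process itself.

def non_negative_balance(record):
    """Example rule: an account balance should never be negative."""
    return record.get("balance", 0) >= 0

def run_checkpoint(stage, records, issue_log):
    """Validate records at a named stage, logging each violation with the stage of first observation."""
    for record in records:
        if not non_negative_balance(record):
            key = (record.get("account_id"), "non-negative balance")
            issue_log.setdefault(key, stage)  # keep only the earliest stage at which the issue appeared

issue_log = {}

# Checkpoint 1: data as acquired from the source
source_records = [{"account_id": "A1", "balance": 100}, {"account_id": "A2", "balance": -50}]
run_checkpoint("source_acquisition", source_records, issue_log)

# Checkpoint 2: data after a (hypothetically faulty) integration step that flips A1's sign
integrated_records = [{"account_id": "A1", "balance": -100}, {"account_id": "A2", "balance": -50}]
run_checkpoint("post_integration", integrated_records, issue_log)

for (account_id, rule), stage in issue_log.items():
    print(f"{account_id} violated '{rule}', first observed at: {stage}")
# A2's flaw traces back to the source; A1's appears only after integration.

Because the log keeps the earliest stage at which each flaw appeared, the catalog of source data issues described above falls out of it directly.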

In practice, data validation is a means of ensuring that data quality issues can be identified, tracked and managed in a controlled way. The challenge, which we will discuss in our next post, is how to scale your data validation program as the volume and diversity of your input data increase.
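
To suggest what “identified, tracked and managed in a controlled way” might look like operationally, here is a small, hypothetical sketch of an issue record that a stewardship workflow could maintain for each flagged violation. The fields and status values are assumptions, not a prescribed model.

# Hypothetical sketch of a stewardship workflow record for a flagged data quality issue.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DataQualityIssue:
    rule: str                   # which validity rule was violated
    first_observed_stage: str   # where in the data flow the flaw first appeared
    accountable_owner: str      # who is alerted and responsible for remediation
    status: str = "open"        # open -> root_cause_identified -> resolved
    root_cause: str = ""
    opened_at: datetime = field(default_factory=datetime.now)

    def record_root_cause(self, description):
        self.root_cause = description
        self.status = "root_cause_identified"

    def resolve(self):
        self.status = "resolved"

# Example lifecycle for a single issue
issue = DataQualityIssue(
    rule="non-negative balance",
    first_observed_stage="post_integration",
    accountable_owner="integration team",
)
issue.record_root_cause("sign flipped by a currency conversion step")
issue.resolve()
print(issue.status)  # resolved

Whether such records live in a purpose-built data quality tool or an ordinary issue tracker matters less than the fact that each flagged violation carries an accountable owner, a status and a root cause.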


Download a white paper about big data governance.

About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, via the expert channel at b-eye-network.com and numerous books, white papers and web seminars. His book Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book Master Data Management has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
