Automating operational data validation, Part 2

In my last post, we explored the operational facet of data governance and data stewardship. We focused on the challenges of providing a scalable way to assess incoming data sources, identify data quality rules and define enforceable data quality policies.

As the number of acquired data sources increases, it becomes difficult to specify data quality rules in a timely manner, let alone manage their implementation and enforcement – that is, unless you have a means of automating various aspects of the data stewardship procedures. Fortunately, you can use existing data tools to automate five critical duties of the data steward.

1. Proactive profiling

The proactive profiling process, which should be done for all new incoming data sources, uses objective data profiling (i.e., statistical analysis and reporting). This analysis of the source is intended to do several things:

Provide an understanding of what's coming in from the source.
Capture and document the source data set metadata.
Classify the source data attributes in terms of their contribution to the downstream data products.
Identify any potential issues with the data before it's brought into the environment.

The outcome of this process is an objective assessment of the data values in the source data set. This, in turn, can be used to infer data quality specification rules for usability.
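To make this concrete, here is a minimal sketch of what objective profiling and rule inference might look like using only the Python standard library. The function names (profile_column, infer_rules), the sample columns and the max_domain_size threshold are all hypothetical; a commercial profiling tool would compute far richer statistics, and any inferred rule should be treated as a candidate for the data steward to review.

```python
from collections import Counter

def profile_column(values):
    """Return simple objective statistics for one column.

    Assumes non-null values in a column are mutually comparable
    (for example, all strings or all numbers).
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(3),
    }

def infer_rules(profile, max_domain_size=10):
    """Propose candidate data quality rules from a column profile."""
    rules = []
    if profile["count"] > 0 and profile["null_count"] == 0:
        rules.append("not_null")
    if 0 < profile["distinct_count"] <= max_domain_size:
        rules.append("in_observed_domain")
    if profile["count"] > 0 and profile["distinct_count"] == profile["count"]:
        rules.append("unique")
    return rules

# Hypothetical sample of an incoming source.
rows = [
    {"customer_id": "C1", "state": "NY"},
    {"customer_id": "C2", "state": "CA"},
    {"customer_id": "C3", "state": ""},
    {"customer_id": "C4", "state": "NY"},
]
profiles = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
candidate_rules = {col: infer_rules(p) for col, p in profiles.items()}
```

In this sample, customer_id has no nulls and no repeats, so it is proposed as a not-null, unique attribute, while state has one missing value and is only proposed as a small value domain.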

2. Input validation

Once data quality rules are defined, you can use them to configure an assessment control that reviews incoming data and flags any anomalies that would affect downstream consumers. Automatically scanning the input before loading it into the data warehouse allows the tool to log each data exception and, at the same time, notify both the data steward and the source owner. Data profiling tools can be configured with predefined assertions that will be used for the input validation.
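As an illustration, an input validation control could be sketched as follows. Everything here is an assumption made for the example: the RULES list, the column names, and the notify callback, which stands in for whatever alerting mechanism your environment provides.

```python
import logging

logger = logging.getLogger("input_validation")

# Hypothetical rule set: (rule name, column, predicate on the value).
RULES = [
    ("customer_id_present", "customer_id", lambda v: v not in (None, "")),
    ("state_is_two_letters", "state",
     lambda v: isinstance(v, str) and len(v) == 2 and v.isalpha()),
]

def validate_input(rows, rules, notify):
    """Scan rows before loading; log each exception and notify both parties."""
    exceptions = []
    for row_number, row in enumerate(rows, start=1):
        for name, column, predicate in rules:
            value = row.get(column)
            if not predicate(value):
                exc = {"row": row_number, "rule": name,
                       "column": column, "value": value}
                logger.warning("data exception: %s", exc)
                exceptions.append(exc)
    if exceptions:
        # The steward and the source owner receive the same exception list.
        for recipient in ("data_steward", "source_owner"):
            notify(recipient, exceptions)
    return exceptions

# Usage with a stand-in notifier that just records who was told what.
sent = []
incoming = [
    {"customer_id": "C1", "state": "NY"},
    {"customer_id": "", "state": "New York"},
]
found = validate_input(incoming, RULES,
                       lambda who, excs: sent.append((who, len(excs))))
```

The second row fails both rules, so two exceptions are logged and both recipients are notified once with the full list.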

3. Process control validation

You can insert data validation rules at selected hand-off points within the data integration and preparation processes to monitor gross-level receipt control metrics. Examples of such metrics are:

The number of records handed off, or a checksum value computed using selected column values.
Completeness metrics (making sure all the data values that were supposed to be provided are not null).
Selected reasonability tests.

These metrics can be used to demonstrate that the processes are executing the transformations they're supposed to perform, or to indicate that there's a deviation from expectations. Code templates for the gross-level hand-offs can be developed and parameterized, while the completeness and reasonability rules can be configured using data profiling tools.
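A parameterized hand-off template might look something like the sketch below. The function names and the choice of checksum (a sum of truncated SHA-256 row hashes, which does not depend on row order) are assumptions for illustration, not a prescribed method.

```python
import hashlib

def control_totals(rows, key_columns):
    """Compute a record count and an order-independent checksum."""
    total = 0
    for row in rows:
        encoded = "|".join(str(row[c]) for c in key_columns).encode("utf-8")
        row_hash = hashlib.sha256(encoded).digest()
        # Summing modulo 2**64 keeps the result independent of row order.
        total = (total + int.from_bytes(row_hash[:8], "big")) % (2 ** 64)
    return {"record_count": len(rows), "checksum": total}

def check_handoff(sent_rows, received_rows, key_columns):
    """Return the names of control metrics that differ across a hand-off."""
    before = control_totals(sent_rows, key_columns)
    after = control_totals(received_rows, key_columns)
    return [metric for metric in before if before[metric] != after[metric]]

# Hypothetical hand-off between two steps in the preparation process.
upstream = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
reordered = [upstream[1], upstream[0]]
dropped = upstream[:1]
altered = [{"id": 1, "amount": 100}, {"id": 2, "amount": 205}]

ok = check_handoff(upstream, reordered, ["id", "amount"])
lost = check_handoff(upstream, dropped, ["id", "amount"])
changed = check_handoff(upstream, altered, ["id", "amount"])
```

The same template can be reused at every hand-off point by changing only the list of key columns.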

4. End product validation

For each data product, data validity and quality rules are defined for each of the data attributes, and for combinations of the data attributes, across each of the end product tables. This includes completeness validation to verify that all output values are present. Conformance to consistency and reasonableness rules will also be assessed. Data profiling tools can be configured with predefined assertions that will be used for the end product validation.
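The sketch below shows one way these three kinds of rules (completeness, reasonableness and cross-attribute consistency) could be expressed for a single end product table. The table layout, the column names and the validate_product function are hypothetical; amounts are kept in integer cents to avoid floating-point comparison problems.

```python
REQUIRED_COLUMNS = ["quantity", "unit_price_cents", "total_cents"]

# Single-attribute rules: (rule name, column, check on the value).
ATTRIBUTE_RULES = [
    (f"{col}_present", col, lambda v: v is not None) for col in REQUIRED_COLUMNS
] + [
    # Reasonableness: nulls are left to the completeness rules above.
    ("quantity_positive", "quantity", lambda v: v is None or v > 0),
]

def total_is_consistent(row):
    """Consistency across attributes: total = quantity * unit price."""
    values = [row.get(col) for col in REQUIRED_COLUMNS]
    if None in values:
        return True  # already reported by a completeness rule
    quantity, unit_price, total = values
    return total == quantity * unit_price

ROW_RULES = [("total_matches_quantity_times_price", total_is_consistent)]

def validate_product(rows, attribute_rules, row_rules):
    """Return (row index, rule name) for every failed assertion."""
    failures = []
    for index, row in enumerate(rows):
        for name, column, check in attribute_rules:
            if not check(row.get(column)):
                failures.append((index, name))
        for name, check in row_rules:
            if not check(row):
                failures.append((index, name))
    return failures

orders = [
    {"quantity": 2, "unit_price_cents": 500, "total_cents": 1000},
    {"quantity": 1, "unit_price_cents": 300, "total_cents": None},
    {"quantity": 2, "unit_price_cents": 500, "total_cents": 900},
]
failures = validate_product(orders, ATTRIBUTE_RULES, ROW_RULES)
```

Here the second row fails completeness and the third fails the consistency rule, while the first row passes every assertion.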

5. Target source validation

The source validation process incorporates rules and directives for comparing data in the end products against data that came from the source. The objective is to verify that the resulting data product accurately reflects the source information. Note that because numerous transformations may have been applied from the time of acquisition to the final product, there may be some subtlety in specifying the rules for consistency. Data profiling tools can be configured with predefined assertions that will be used for this procedure as well.
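As a simple example of that subtlety, suppose the end product holds one total per customer while the source holds individual transactions. The consistency rule then has to replay the aggregation before comparing. The sketch below assumes hypothetical column names (customer_id, amount_cents, total_cents) and a hypothetical reconcile function.

```python
from collections import defaultdict

def reconcile(source_rows, target_rows):
    """Compare per-customer totals in the target against the source."""
    # Re-apply the transformation (sum by customer) to the source data.
    expected = defaultdict(int)
    for row in source_rows:
        expected[row["customer_id"]] += row["amount_cents"]
    actual = {row["customer_id"]: row["total_cents"] for row in target_rows}

    issues = []
    for key in sorted(expected.keys() | actual.keys()):
        if key not in actual:
            issues.append((key, "missing_in_target"))
        elif key not in expected:
            issues.append((key, "not_in_source"))
        elif expected[key] != actual[key]:
            issues.append((key, "total_mismatch"))
    return issues

source = [
    {"customer_id": "C1", "amount_cents": 100},
    {"customer_id": "C1", "amount_cents": 250},
    {"customer_id": "C2", "amount_cents": 400},
    {"customer_id": "C3", "amount_cents": 50},
]
target = [
    {"customer_id": "C1", "total_cents": 350},
    {"customer_id": "C2", "total_cents": 450},
    {"customer_id": "C4", "total_cents": 10},
]
issues = reconcile(source, target)
```

Customer C1 reconciles cleanly; the other three customers each illustrate a different kind of discrepancy between the source and the end product.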

One interesting common thread is the use of data profiling technology, albeit in different modes for different purposes. This demonstrates that data profiling is an indispensable resource that plays a critical role in operational data governance.

