The third part of my data governance primer series addresses data quality analysis. Don’t even start a data quality analysis until you have completed the first two steps of your root cause analysis: investigate and prioritize any potential causative factors, then start your metadata assessment. Otherwise, you may be misled by your findings.
Data quality means that data is complete, accurate and ready for business consumption. Sources of poor data quality may include:
- Lack of data entry rules
- Unclear data element definitions
- Inconsistent metadata definitions for field type, format or intent
- Breakdowns in data transformation processes as data flows between systems or applications
Poor data quality results in bad business decisions, and it contributes to major problems in using data effectively. More importantly, bad data costs companies millions of dollars a year in terms of rework and inefficiency. Data quality, in combination with robust metadata definitions, is part of the foundation of good data governance. By cleaning up your data, you can find that diamond in the rough that can have a big impact on your operations.
Data quality analysis
A data quality management process lets a business area start with a simple approach and mature over time into one that is more proactive and comprehensive. Initially, investigation may focus on single data elements or events. As patterns, data commonalities and other relationships appear, the data quality management process will grow to support complete business processes.
A mature data quality management process will not just resolve individual issues; it will also track relationships between data elements. With a process established, you can ensure that business rules are consistent, and you can generate statistical analyses that monitor previously addressed issues to confirm that data quality remains stable. Those analyses also give the data governance program an early warning system.
Initial data quality analysis process
I. Define data scope
i. Determine data elements that are associated with or are direct results of the reported issue.
ii. Check that all metadata definitions are present and current.
iii. Enlist the involvement of the Data SME or Data Stewards.
iv. Identify all source systems where the data originates, is entered or is derived.
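As a rough illustration of the metadata check in step ii, the sketch below flags in-scope data elements whose definitions are missing, incomplete or not recently reviewed. The catalog layout, the required field names, the `last_reviewed` date and the one-year staleness limit are all assumptions made for this example; they do not come from any particular metadata tool.

```python
from datetime import date

# Hypothetical set of attributes every metadata definition should carry.
REQUIRED_FIELDS = ("definition", "field_type", "format", "steward")

def metadata_gaps(elements, catalog, today, max_age_days=365):
    """Return {element: [problems]} for in-scope elements with metadata issues."""
    gaps = {}
    for element in elements:
        entry = catalog.get(element)
        if entry is None:
            gaps[element] = ["no metadata entry"]
            continue
        # A definition is incomplete if any required attribute is absent or empty.
        problems = [f"missing {f}" for f in REQUIRED_FIELDS if not entry.get(f)]
        reviewed = entry.get("last_reviewed")
        if reviewed is None or (today - reviewed).days > max_age_days:
            problems.append("definition not reviewed recently")
        if problems:
            gaps[element] = problems
    return gaps

# Illustrative catalog; element names and values are made up.
catalog = {
    "customer_state": {
        "definition": "Two-letter state code of the billing address",
        "field_type": "char", "format": "AA", "steward": "Sales Ops",
        "last_reviewed": date(2024, 1, 15),
    },
    "order_total": {
        "definition": "Order amount including tax",
        "field_type": "decimal", "format": "", "steward": "Finance",
        "last_reviewed": date(2021, 6, 1),
    },
}
scope = ["customer_state", "order_total", "ship_date"]
gaps = metadata_gaps(scope, catalog, today=date(2024, 6, 1))
for element, problems in gaps.items():
    print(element, "->", ", ".join(problems))
```

Elements that appear in the output are candidates to raise with the Data SME or Data Stewards before any data is extracted.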
II. Extract and profile the data
i. Extract the relevant data from all key source systems.
ii. Design the profile. A profile will consist, at a minimum, of total record counts, min/max values, frequency of unique values, and frequency of invalid values (if defined) for each data element profiled.
iii. Profile the data to determine key characteristics that are contributing to the issue, such as:
a. Wrong values
b. Missing values
c. Corrupt transformation processes
d. Incorrect business rules
e. Incorrect usage rules
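The minimum profile described in step II.ii can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name, the treatment of `None` and empty strings as missing, and the sample state codes are assumptions, and a real profile would usually run inside a profiling tool or against the source database.

```python
from collections import Counter

def profile_column(values, is_valid=None):
    """Profile one data element: record counts, min/max, value frequencies.

    `is_valid` is an optional rule; when supplied, invalid values are counted.
    Assumes all non-missing values are mutually comparable (same type).
    """
    # Treat None and empty strings as missing values (an assumption).
    present = [v for v in values if v is not None and v != ""]
    profile = {
        "total_records": len(values),
        "missing": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "value_frequency": Counter(present),
    }
    if is_valid is not None:
        profile["invalid"] = sum(1 for v in present if not is_valid(v))
    return profile

# Illustrative extract of a state-code field; values are made up.
states = ["OH", "OH", "TX", "", "ZZ", None]
allowed = {"OH", "TX", "CA"}
result = profile_column(states, is_valid=lambda v: v in allowed)
for key, value in result.items():
    print(key, "=", value)
```

In this sample the profile reports two missing values and one invalid value ("ZZ"), which are the kinds of characteristics listed in step II.iii.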
III. Analyze Data Profile Results
i. Summarize the key findings from the profile detail.
ii. Determine what key drivers are contributing to the impact.
iii. Determine accountability for the data quality issue.
iv. Involve other Data Stewards in troubleshooting and designing the data quality solution.
IV. Design the Corrective Action Plan
Two types of plans should be developed to address known data quality issues: a corrective action plan to fix the immediate source of the identified problem, and a preventive action plan for ongoing monitoring, in which thresholds are set and metrics are routinely collected and reported to data stakeholders. This monitoring process should scale with the number of data elements being tracked.
i. Corrective Action Plan
- Does the scope of the problem warrant change in metadata definitions, business practices or data entry rules?
- Does the scope of the problem warrant a data governance standard?
- Does the corrective action plan include details on how to fix the source of the problem as well as ways to correct historical data in the system?
ii. Preventive Action Plan
- This plan will be designed to minimize the probability that data quality issues recur.
- Determine "early warning triggers" based on designated thresholds. These thresholds should reflect the business tolerance for inaccurate data (e.g., is 95% accuracy acceptable?).
- If data latency is the source of a data quality issue, then latency thresholds should be included in the monitoring plan.
- Determine how frequently results of the monitoring plan will be reported to data stakeholders or governance oversight committees.
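The early warning triggers in the preventive action plan can be expressed as a simple threshold check. The sketch below is one possible shape, not a prescribed design: the metric names, the 95% accuracy floor and the 24-hour latency ceiling are placeholder assumptions that each business area would replace with its own tolerances.

```python
def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric that breaches its threshold.

    `thresholds` maps a metric name to (kind, limit), where kind is
    "min" (value must not fall below limit) or "max" (must not exceed it).
    """
    alerts = []
    for name, value in metrics.items():
        rule = thresholds.get(name)
        if rule is None:
            continue  # No threshold defined for this metric.
        kind, limit = rule
        breached = value < limit if kind == "min" else value > limit
        if breached:
            alerts.append(f"{name}: {value} breaches {kind} threshold {limit}")
    return alerts

# Hypothetical thresholds reflecting business tolerance.
thresholds = {
    "accuracy_pct": ("min", 95.0),   # at least 95% of values valid
    "latency_hours": ("max", 24),    # data no more than 24 hours old
}
# Hypothetical metrics from the latest monitoring run.
metrics = {"accuracy_pct": 93.5, "latency_hours": 6}

alerts = check_thresholds(metrics, thresholds)
for alert in alerts:
    print("EARLY WARNING:", alert)
```

Running a check like this on the agreed reporting schedule gives data stakeholders or governance oversight committees a consistent list of breaches to review.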