How to measure data quality


How data quality is defined impacts how data quality is measured and how those measurements are perceived by your organization. Data quality can be defined as either real-world alignment or fitness for the purpose of use. Both definitions provide important insight, which is why most organizations employ both.

  • Real-world alignment. When you define data quality in terms of real-world alignment, you'll assess whether data's source is a trusted provider for all uses. This is how you measure the potential business impact of data quality, and it drives many enterprise data management initiatives. Initiatives such as enterprise data warehousing (EDW) and master data management (MDM) take this approach to provide a consolidated repository of trusted data.
  • Fitness for the purpose of use. Another way to define data quality is in terms of its fitness for the purpose of use. With this approach, you'll assess data's application to a specific business use. This is how you measure the actual business impact of data quality. This approach acknowledges that most data has multiple uses, such as the many applications of analytics, each with its own relative business context for data quality.

Metrics to help measure data quality

Below are the data quality metrics (also known as data quality dimensions) that I include in all of my implementations. It's important to note that it's rare to attain or maintain 100% on these metrics, and some big data uses don't even require it. (After all, perfect data quality is an unrealistic and self-defeating goal.) Therefore, acceptable thresholds for these metrics will vary by data and use, as the brief sketch below illustrates.
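
As a minimal illustration of what "varies by data and use" can look like in practice, here's a small sketch in plain Python (not SAS Data Quality syntax); the use names, metric names and threshold values are hypothetical assumptions, not recommendations.

  # Hypothetical per-use thresholds: each consuming use declares the minimum
  # acceptable score (0.0 to 1.0) for each data quality metric it cares about.
  QUALITY_THRESHOLDS = {
      "regulatory_reporting": {"completeness": 0.99, "validity": 0.99, "accuracy": 0.97},
      "marketing_segmentation": {"completeness": 0.90, "validity": 0.95, "timeliness": 0.80},
      "ad_hoc_analytics": {"completeness": 0.80, "validity": 0.85},
  }

  def meets_thresholds(scores, use):
      """Return True if every metric a given use cares about meets its threshold."""
      required = QUALITY_THRESHOLDS[use]
      return all(scores.get(metric, 0.0) >= minimum for metric, minimum in required.items())

  # The same measured scores can be fit for one purpose but not for another.
  scores = {"completeness": 0.93, "validity": 0.96, "accuracy": 0.90, "timeliness": 0.85}
  print(meets_thresholds(scores, "marketing_segmentation"))  # True
  print(meets_thresholds(scores, "regulatory_reporting"))    # False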

Currency and timeliness

These two metrics measure the time-related quality of data, which is often overlooked or assumed. Even when other data quality metrics are at or above their defined thresholds (and different uses often have different thresholds), it’s crucial to correlate them with these metrics.

  • Currency measures whether data is current with the real world that it models and how up-to-date it is given possible modifications over time. The change rates of data can vary significantly, which is why currency often requires rule-based scheduled refreshes/updates to continuously ensure this aspect of data quality. Currency can also be complicated by time-variant data, such as when sales revenue is recognized. Forecasting and predictive analytics are also complicated by lagging data that represents the current impact of past events (e.g., the unfortunate deaths of those infected by COVID-19 6 to 8 weeks ago).
  • Timeliness measures the lag between when data is expected and when it's readily available for use – or, more simply, whether data is accessible when it's needed. Sometimes timeliness involves a trade-off: occasionally it's preferable to use lower-quality data right now, as it is, instead of waiting for the processing that's required to assess and improve its quality. A minimal sketch of how both of these time-based metrics might be computed follows this list.
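
Here's that sketch, written in plain Python rather than SAS Data Quality; the timestamps, the one-day refresh window and the expected-delivery time are hypothetical assumptions you'd replace with your own data and service-level expectations.

  from datetime import datetime, timedelta

  def currency_score(last_updated_timestamps, now, max_age=timedelta(days=1)):
      """Fraction of records refreshed within an assumed acceptable age window.
      max_age stands in for a rule-based refresh interval; real change rates vary."""
      if not last_updated_timestamps:
          return 0.0
      current = sum(1 for ts in last_updated_timestamps if now - ts <= max_age)
      return current / len(last_updated_timestamps)

  def timeliness_lag(expected_at, available_at):
      """Lag between when data was expected and when it actually became available for use."""
      return max(available_at - expected_at, timedelta(0))

  # Usage with hypothetical timestamps:
  now = datetime(2020, 7, 1, 9, 0)
  print(currency_score([datetime(2020, 6, 30, 22, 0), datetime(2020, 6, 25, 9, 0)], now))  # 0.5
  print(timeliness_lag(datetime(2020, 7, 1, 6, 0), datetime(2020, 7, 1, 8, 30)))           # 2:30:00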

Comprehensiveness and completeness

These two metrics are a way to measure the availability of data, specifically what data might be missing.

  • Comprehensiveness measures availability compared to the total data universe or population of interest – that is, missing rows/records and/or tables/files. Data isn't always omitted intentionally or for other questionable reasons, but its absence can, for example, significantly skew analytics and their interpreted results. We've all heard the cliché that you can make statistics say whatever you want. At the root of this oft-cited (and true) statement is the intentional omission of inconvenient data that doesn't support the desired assertion. Ironically, some omit such data from the processes that assess and report data quality issues, then claim not to have poor-quality data.
  • Completeness measures availability as the presence of an actual data value within a column/field, excluding NULL values and any non-NULL values that indicate missing data (e.g., character spaces). Completeness can also measure the absence of sub-values that would make a data value complete (e.g., a US telephone number that's missing its area code). Some debate whether a column/field that's not included in a table/file definition should be measured by comprehensiveness or by completeness; I prefer to measure it with completeness. A minimal sketch of how both availability metrics might be computed follows this list.
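
Here's that sketch, again in plain Python rather than SAS Data Quality; the expected record count and the set of non-NULL values treated as missing are assumptions you'd tailor to your own data.

  def comprehensiveness(actual_row_count, expected_row_count):
      """Share of the expected population of rows/records that's actually present."""
      return actual_row_count / expected_row_count if expected_row_count else 0.0

  # Values that count as missing even though they aren't NULL (assumed list).
  MISSING_SENTINELS = {None, "", "N/A", "UNKNOWN"}

  def completeness(column_values):
      """Fraction of values in a column/field that are actually populated."""
      if not column_values:
          return 0.0
      populated = sum(
          1 for v in column_values
          if (v.strip() if isinstance(v, str) else v) not in MISSING_SENTINELS
      )
      return populated / len(column_values)

  # Usage: 100 customer records expected but only 80 loaded; a phone column with blanks.
  print(comprehensiveness(80, 100))                      # 0.8
  print(completeness(["555-0100", "  ", None, "N/A"]))   # 0.25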

Validity and accuracy

These two metrics measure the correctness of data values, which is what most people focus on when discussing data quality.

  • Validity measures a data value's correctness within a limited context. For example, the context might be a defined range or list of valid values (like those incorporated into data entry mechanisms), or verification by an authoritative reference. A good example of the latter is US postal address validation certified by the US Postal Service (USPS), which verifies that a US postal address correctly maps to a real-world location that can receive mailed correspondence. Validity measures the real-world alignment of data in isolation from its use, i.e., the theoretical correctness of data values outside of their practical application. Validity is relatively easy to measure, attain and maintain.
  • Accuracy measures a valid data value's correctness within an associated context, including other data as well as business processes. For example, postal address validation doesn't verify whether the location is an accurate home or work address for a customer. Accuracy measures the combination of the real-world alignment of data and its fitness for the purpose of use in practical applications. Accuracy is therefore more difficult to measure, attain and maintain. This is why accuracy – which relies on validity as its prerequisite and foundation – often has to be assumed until proven otherwise. The most accurate data is usually the result of rigorous quality control during data creation, and a commitment to continual data quality assessment and improvement. A minimal sketch contrasting validity and accuracy follows this list.
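
Here's that sketch in plain Python; it performs no real USPS verification, and the reference address list and customer record are hypothetical. The point is only the distinction: validity checks a value against an authoritative reference in isolation, while accuracy also checks it against the associated context in which it's used.

  # Hypothetical reference of deliverable addresses (a stand-in for a USPS-certified service).
  KNOWN_ADDRESSES = {"500 SAS CAMPUS DR, CARY, NC 27513"}

  def is_valid_address(address):
      """Validity: the value maps to a real-world deliverable address, regardless of use."""
      return address.strip().upper() in KNOWN_ADDRESSES

  def is_accurate_home_address(address, customer):
      """Accuracy: the valid value is also correct in context, i.e. it really is this
      customer's current home address (confirmed against another trusted source)."""
      return is_valid_address(address) and address.strip().upper() == customer.get("confirmed_home_address")

  # Usage with a hypothetical customer record:
  customer = {"id": 42, "confirmed_home_address": "12 ELM ST, SPRINGFIELD, IL 62701"}
  address = "500 SAS Campus Dr, Cary, NC 27513"
  print(is_valid_address(address))                     # True: a real, deliverable address
  print(is_accurate_home_address(address, customer))   # False: not this customer's home address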

About Author

Jim Harris

Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.
