One of the most common questions I get asked by our members on Data Quality Pro is, “Can you do more articles on data quality dimensions?”
Part of the reason for this request is that when people first get involved with data quality, they invariably buy data quality books and start to notice differences of opinion among authors about data quality dimensions.
Another challenge you’ll face is that you may hire data quality practitioners who have been following a data quality dimensions approach that differs from the one you want to take in your organisation. This is one of the reasons why my friend and colleague Nicola Askham helped initiate a DAMA UK focus group that has been working on creating an initial working set of well-defined, clearly specified dimensions. You can download their results here.
A lot of the work in these kinds of efforts draws heavily on published text. I’ve spoken to many authors, and it’s interesting (and entertaining!) to sometimes hear how much of a divide there is between the experts who have helped shape the direction of this industry.
What I’ve come to realise is that if expert authors can’t agree on data quality dimensions, then there is really only one course of action you can take: name and define data quality dimensions that are meaningful to your organisation.
For example, at one of the first organisations I worked with, we realised a lot of our data quality issues stemmed from not tracking data as it flowed from the initial supplier, into the organisation, across the various teams and then out of the organisation via our information products.
In data quality parlance, we wanted to create a measure of how well a piece of information was being recorded in terms of its "data lineage" – another classic term that often causes disagreement.
In our situation we created the concept of a "data passport." The idea being that for data to move from one location to another it needed to have its "passport" stamped. This was just a simple log in a spreadsheet that showed the source and destination of the data. This allowed us to quickly find all the places a "data subject" had travelled to and quickly inform each location of any data quality defects that had been found. We could have used the terms "lineage" or "provenance," but I doubt it would have resonated with our staff.
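To make the idea concrete, here is a minimal sketch of a "data passport" log in Python. The class name, the `stamp`/`locations` methods and the "customer-123" subject are all hypothetical; in our case the log was literally a spreadsheet, and an in-memory list stands in for it here.

```python
class DataPassport:
    """A minimal 'data passport': every time data moves between
    locations, the move is 'stamped' with its source and destination."""

    def __init__(self):
        self.stamps = []  # (subject, source, destination) tuples

    def stamp(self, subject, source, destination):
        """Record one hop of a data subject between locations."""
        self.stamps.append((subject, source, destination))

    def locations(self, subject):
        """All locations a data subject has travelled through, in order."""
        visited = []
        for s, src, dst in self.stamps:
            if s == subject:
                for loc in (src, dst):
                    if loc not in visited:
                        visited.append(loc)
        return visited


passport = DataPassport()
passport.stamp("customer-123", "Supplier Feed", "CRM")
passport.stamp("customer-123", "CRM", "Billing")

# When a defect is found, every stamped location can be notified.
print(passport.locations("customer-123"))
# → ['Supplier Feed', 'CRM', 'Billing']
```

The point isn't the implementation; it's that the log answers one question quickly: "where has this data been?"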
Data accuracy is another dimension that causes a huge amount of disagreement within the data quality community. Jim Harris recently wrote a great article sharing Jack Olson’s take on accuracy. Yet even with these regular insights, if you ask 10 different practitioners how they interpret the accuracy dimension, you’ll still find considerable variation.
So, what should you do? Once again, you need to make data quality work for you. Drop the textbooks and focus on what YOU need from your data to make YOUR business work more effectively.
For example, if you run a utilities company, then you may choose to create a dimension called "on-plan, on-site." The idea being that if a piece of infrastructure is labeled on a plan, then it should exist on a site. If it doesn’t, the "on-plan, on-site" dimension should record a defect and work can commence to find the root cause and fix the issue.
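An "on-plan, on-site" check is essentially a set difference between what the plan records and what exists on site. Here's a small sketch; the asset IDs and function name are made up for illustration:

```python
def on_plan_on_site_defects(plan_assets, site_assets):
    """Assets recorded on the plan but not found on site:
    each one is a defect under the 'on-plan, on-site' dimension."""
    return sorted(set(plan_assets) - set(site_assets))


plan = {"PUMP-01", "VALVE-07", "METER-12"}  # what the plan says is there
site = {"PUMP-01", "METER-12"}              # what the site survey found

print(on_plan_on_site_defects(plan, site))
# → ['VALVE-07']  — root-cause work can now begin on this asset
```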
But why not just use the term "accuracy" when referring to site plant equipment instead of "on-plan, on-site"?
Here is how three different professionals within the same organisation could form conflicting views of the accuracy dimension:
- A data person may interpret accuracy as the validity of the equipment data, or whether it can be accurately compared against another trusted source of information (perhaps a local plan).
- An electrical engineer or chemist may interpret it as the accuracy of the gas flow recording or the power consumption rating of the unit – not the unit’s physical location.
- A quantity surveyor may confuse accuracy with their threshold of correctness when calculating the overall quantity of equipment on site.
Different workers will have different perceptions of accuracy, so you need to create definitions that are meaningful to all.
These differences stem from their training and previous working environments, so pick a term that focuses on what you want the data to do for you instead. You can still use data quality dimensions as a starting point to help you refine what criteria to measure, but feel free to adapt and name them as you see fit.
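One way to act on this advice is to treat each organisation-specific dimension as a named rule applied to your records, with the name written in your organisation's language rather than a textbook's. This is a hypothetical sketch; the dimension names, record fields and `measure` helper are all invented for illustration:

```python
# Example records, with deliberately dubious values in the second one.
RECORDS = [
    {"id": 1, "meter_reading": 532, "site": "A"},
    {"id": 2, "meter_reading": -4, "site": None},
]

# Dimensions named in *your* organisation's terms, each a simple rule.
DIMENSIONS = {
    "plausible-reading": lambda r: r["meter_reading"] >= 0,
    "site-known": lambda r: r["site"] is not None,
}


def measure(records, dimensions):
    """Count defects per dimension across a record set."""
    return {
        name: sum(not rule(r) for r in records)
        for name, rule in dimensions.items()
    }


print(measure(RECORDS, DIMENSIONS))
# → {'plausible-reading': 1, 'site-known': 1}
```

Renaming or adapting a dimension is then just a matter of changing a key and a rule, which keeps the measurements aligned with what your business actually cares about.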
What do you think about this approach? Have you been creative with dimension naming in the past? Do you get frustrated with the lack of agreement over dimensions? Please share your views below.