Data quality - one dimension at a time

I was recently asked what I would focus on given limited funds and resources to kickstart a data quality initiative.

This is a great question, because I’m sure many readers will find themselves in this position at some point in their career.

My answer is to become ruthlessly focused on managing one data quality dimension - completeness.

Why completeness?

Firstly, completeness is perhaps the easiest dimension to understand and explain.

You don’t have to get bogged down in all the philosophical arguments that surround the accuracy dimension. Consistency and synchronization are always open to misinterpretation, integrity can take on too much of a database administration viewpoint, and duplication (or uniqueness) applies only to a subset of your data.

For these reasons, if I had a limited budget, completeness would be my starting point.

Secondly, completeness rules are generally quite easy to start discovering, assessing and improving if you approach them in the right way.

The wrong way to approach completeness

A big mistake people make when measuring the completeness dimension is to assess one column of data at a time, looking only for NULL or empty values. The problem, of course, is that a value may be populated yet still incomplete. It is very common to find "inferred NULL" values such as "TBC," "TBA," "Unknown," "Pending" or any number of variants. There is a lot of low-hanging fruit in simply building up rules that cast these non-empty values into a flag indicating an incomplete record.
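
To make that concrete, here is a minimal sketch of the idea in Python, assuming pandas and a purely illustrative list of placeholder tokens; the column name is hypothetical and the token list should really come from profiling your own data.

```python
import pandas as pd

# Hypothetical placeholder tokens that often signal an "inferred NULL";
# the exact list is an assumption and should come from profiling your own data.
INFERRED_NULLS = {"tbc", "tba", "unknown", "pending", "n/a", ""}

def incomplete_flags(series: pd.Series) -> pd.Series:
    """True where a value is genuinely missing or only looks populated."""
    as_text = series.astype("string").str.strip().str.lower()
    return series.isna() | as_text.isin(INFERRED_NULLS)

# Illustrative column name; any text column works the same way.
df = pd.DataFrame({"ni_number": ["QQ123456C", "TBC", None, "Unknown"]})
incomplete = incomplete_flags(df["ni_number"])
print(f"Complete records: {100 * (1 - incomplete.mean()):.0f}%")  # -> 25%
```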

Another mistake in tackling completeness from a column-by-column perspective is that the analyst ignores the context of the actual data being assessed.

For example, in a healthcare patient administration system (PAS) it may be perfectly valid for a patient to have no national identification (NI) number if they are under a certain age. But if they are over a certain age, you would expect to see a valid NI number.
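
A conditional rule like that is still only a few lines of logic. The sketch below assumes pandas, made-up column names and an age threshold of 16 chosen purely for illustration.

```python
import pandas as pd

# Hypothetical patient extract; the column names and the age threshold of 16
# are assumptions made for this example, not values taken from any real PAS.
patients = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "age":        [8, 42, 67],
    "ni_number":  [None, "QQ123456C", "TBC"],
})

ni_incomplete = patients["ni_number"].isna() | patients["ni_number"].isin(["TBC", "TBA", "Unknown"])
rule_applies  = patients["age"] >= 16            # completeness only required for adults
violations    = patients[rule_applies & ni_incomplete]

print(violations[["patient_id", "age", "ni_number"]])   # only patient 103 breaks the rule
```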

Applying focus

By profiling your data and discovering these completeness rules you can start to control the quality of your most critical business data. If you're really strapped for cash and resources, I would recommend focusing on the 20 percent of products or services that are driving 80 percent of the revenue, profits, costs, churn or delays in your organisation. It doesn't matter what the 80 percent metric is as long as someone cares about it!

The key point here is that aging systems in particular have long since drifted out of sync with their originally documented business rules and standard operating procedures. So it is really easy for incompleteness to hit the bottom line directly, as well as frustrate the hell out of workers and customers alike.

Building the foundation

Just by building up a simple library of completeness rules, sharing these with the right communities and getting some ownership in place, you’ve set the foundation for future data quality improvement.
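
As an illustration of how small such a library can be to begin with, here is a rough sketch, again assuming pandas; the rule names, columns and sample data are invented for the example.

```python
import pandas as pd

# A minimal sketch of a completeness rule "library": rule names mapped to check
# functions that return True per record when the rule is satisfied. The rule
# names, columns and sample data are invented for illustration.
RULES = {
    "NI number populated for adults": lambda df: ~(
        (df["age"] >= 16)
        & (df["ni_number"].isna() | df["ni_number"].isin(["TBC", "TBA", "Unknown"]))
    ),
    "Postcode populated": lambda df: df["postcode"].notna() & (df["postcode"].str.strip() != ""),
}

def completeness_report(df: pd.DataFrame) -> pd.Series:
    """Percentage of records passing each completeness rule."""
    return pd.Series({name: round(100 * rule(df).mean(), 1) for name, rule in RULES.items()})

patients = pd.DataFrame({
    "age": [8, 42, 67],
    "ni_number": [None, "QQ123456C", "TBC"],
    "postcode": ["LS1 4AP", " ", "SW1A 1AA"],
})
print(completeness_report(patients))   # each rule passes for 2 of 3 records (66.7%)
```

Nothing here requires specialist tooling; the point is simply that each rule gets a name the business recognises and a measure that can be tracked and shared over time.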

Once everyone understands the basic process of data quality management you can start to introduce more elaborate rules and data quality dimensions. Greater investment in tools and staff can be made as you increase revenue and decrease costs.

You have to start somewhere, and given limited funding my preference would be to kick off with the completeness dimension.

What do you think? Where would you start with limited data quality funding? I welcome your views in the comments below.

About Author

Dylan Jones

Founder, Data Quality Pro and Data Migration Pro

Dylan Jones is the founder of Data Quality Pro and Data Migration Pro, popular online communities that provide a range of practical resources and support to their respective professions. Dylan has an extensive information management background and is a prolific publisher of expert articles and tutorials on all manner of data related initiatives.
