How to extend the completeness dimension


If you’re involved in some way with data quality management then you will no doubt have had to deal with the completeness dimension.

This is often one of the starting points for organisations tackling data quality because it is easily understood and (fairly) easy to assess. Conventional wisdom has teams looking for missing values.

However, there is a problem with the way many practitioners calculate the completeness of their datasets and it relates to an over-dependence on the default metrics provided by software. By going a little further you can deliver far more value to the business and make it easier to prioritise any long-term prevention measures required.

The approach I follow relates to task-based completeness rules as opposed to the simple attribute completeness rules that are most commonly adopted.

With an attribute completeness rule you are effectively aggregating all of the data in an entire recordset (or table) and assessing whether there are any empty values. You may go a step further and create lookup tables or functions that flag values which are populated but clearly still represent missing data. Common examples are placeholders like "TBC" (to be confirmed), "??", "TBA" (to be announced) and so on.
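As a minimal sketch, an attribute completeness rule with a placeholder lookup might look like this (the `PSEUDO_MISSING` set, field names and data are illustrative assumptions, not taken from any particular tool):

```python
# Placeholder values that are populated but clearly represent missing data.
# Illustrative assumption: extend this set to match your own datasets.
PSEUDO_MISSING = {"", "tbc", "tba", "??", "n/a", "unknown"}

def is_missing(value):
    """Treat NULLs, empty strings and known placeholder values as missing."""
    if value is None:
        return True
    return str(value).strip().lower() in PSEUDO_MISSING

def attribute_completeness(records, field):
    """Return the fraction of records where `field` holds a real value."""
    if not records:
        return 1.0
    populated = sum(1 for r in records if not is_missing(r.get(field)))
    return populated / len(records)

customers = [
    {"name": "Ann", "phone": "0151 496 0000"},
    {"name": "Bob", "phone": "TBC"},   # placeholder, counted as missing
    {"name": "Cal", "phone": None},
]
print(attribute_completeness(customers, "phone"))  # 1 of 3 populated
```

Without the placeholder lookup, Bob's "TBC" would have counted as populated and inflated the score.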

All of this is great, but what if you have a customer database that holds data for both adults and children? Many banks actively target children from a young age via school initiatives. In this case you may end up with empty national insurance identifiers or other information that you would typically associate with your adult customers.

You can of course create a completeness rule that is contextual so you would have an "Adult Customer NI Completeness" test that filters out the junior accounts. This would make your statistics far more relevant and actionable.
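A contextual rule like that is just the attribute check filtered down to the records it actually applies to. A hedged sketch, with assumed field names (`account_type`, `ni_number`) for illustration:

```python
def contextual_completeness(records, field, applies_to):
    """Completeness of `field`, measured only over records where the rule applies."""
    relevant = [r for r in records if applies_to(r)]
    if not relevant:
        return 1.0
    populated = sum(
        1 for r in relevant
        if r.get(field) is not None and str(r[field]).strip() != ""
    )
    return populated / len(relevant)

customers = [
    {"account_type": "adult",  "ni_number": "QQ123456C"},
    {"account_type": "junior", "ni_number": None},  # expected to be empty
    {"account_type": "adult",  "ni_number": None},  # genuinely missing
]

# "Adult Customer NI Completeness": junior accounts are filtered out,
# so only the adult gap counts against the score.
score = contextual_completeness(
    customers, "ni_number", lambda r: r["account_type"] == "adult")
print(score)  # 0.5
```

Without the filter the same data would score 1/3 and the junior account would look like a defect, which is exactly the kind of misleading statistic this approach avoids.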

This is a great start and one you should definitely take, but I prefer to go one further by mapping what a particular task or process needs in terms of data completeness.

For example, you may have a billing process that requires certain sets of data to be complete across multiple systems in order for a transaction to successfully complete in the future. Task based completeness means that you have to look ‘horizontally’ across multiple attributes, entities and even systems to ascertain the data completeness of a given task.
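One way to sketch a task-based rule is to let the task declare which fields it needs from which systems, then judge a record "task complete" only when every requirement is met. The system and field names below are illustrative assumptions, not a prescribed schema:

```python
# A task's data requirements, expressed 'horizontally' across systems.
# System and field names are hypothetical examples.
BILLING_TASK = {
    "crm":     ["name", "billing_address"],
    "finance": ["payment_method", "account_status"],
}

def task_complete(task, records_by_system):
    """True only if every required field in every required system is populated."""
    for system, fields in task.items():
        record = records_by_system.get(system, {})
        for field in fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                return False
    return True

customer = {
    "crm":     {"name": "Ann", "billing_address": "1 High St"},
    "finance": {"payment_method": "DD", "account_status": ""},  # gap here
}
print(task_complete(BILLING_TASK, customer))  # False: finance data incomplete
```

The payoff of this shape is that a failure tells you which business process is at risk, not just which column has blanks, which makes prioritising prevention work much easier.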

Sounds complicated? It can be, but the payoffs are much greater than with simple completeness checks. You can also start to build up a library of more complex task-based rules, such as synchronisation, dependency and formatting rules. What you will find is that these rules start to relate to each other. For example, by analysing the format you may find that a field contains a blank space instead of a NULL. Having identified that hidden NULL, you then realise that your dependency rules are invalidated, and so on.
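That chain of related rules can be sketched in a few lines: a normalisation step turns whitespace-only values back into NULLs, so a downstream dependency rule sees the gap it would otherwise have missed. The field names and the "postcode required for postal delivery" rule are illustrative assumptions:

```python
def normalise(value):
    """Collapse whitespace-only strings to None so later rules see the gap."""
    if value is None or str(value).strip() == "":
        return None
    return value

def dependency_holds(record):
    """Hypothetical dependency rule: postcode is required for postal delivery."""
    if normalise(record.get("delivery_method")) == "post":
        return normalise(record.get("postcode")) is not None
    return True

order = {"delivery_method": "post", "postcode": "   "}  # blank space, not NULL
print(dependency_holds(order))  # False once the blank space is normalised
```

A naive check for `postcode is not None` would have passed this record; it is the formatting rule feeding the dependency rule that exposes the defect.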

Completeness rules are often the 'bedrock' of your data quality dimensions, so they are important to get right. Try to go beyond the standard attribute completeness rules, though, if you want to deliver truly great value to the business.

What do you think? How do you typically manage the data quality completeness dimension? I welcome your views below.


About Author

Dylan Jones

Founder, Data Quality Pro and Data Migration Pro

Dylan Jones is the founder of Data Quality Pro and Data Migration Pro, popular online communities that provide a range of practical resources and support to their respective professions. Dylan has an extensive information management background and is a prolific publisher of expert articles and tutorials on all manner of data related initiatives.

