The adoption of data analytics in organisations is widespread these days. With lower costs of ownership and ever easier deployment, there are few real barriers for any organisation wishing to get more from its data. This of course presents a challenge, because the pace of analytics adoption has not always been matched by sound data management practices for ensuring quality along the analytics information chain.
In the past, many organisations adopted the data warehouse model to enable analytics, meaning there was at least some kind of buffer between operational systems and the management reporting layer. But many organisations skip the data warehouse approach and pull data straight from live, operational systems in order to perform regular analytics or ad hoc analysis of specific issues.
This is where issues can creep in – because the temptation to analyse poor quality data is ever present.
Several years ago, I performed a small data audit on data that a small utilities installation firm was receiving each week from an external utilities supplier. The firm used the information to forecast its equipment purchases each quarter. Only upon measuring the quality of the supplied data did they realise that one of the power rating fields was being routinely sent blank by the data supplier.
Originally this data wasn't deemed important, but then one manager saw value in that attribute and began using it in the forecasting calculation. The problem was that the forecast was frequently inaccurate due to this missing data that no one had previously spotted. Because the data was aggregated and reported each quarter, the missing values were obscured, so the problem continued.
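A quick sketch makes the mechanism concrete. The rows, field names, and figures below are invented for illustration; the point is that an aggregate computed over only the non-blank values looks perfectly plausible unless a completeness measure is reported alongside it:

```python
# Hypothetical weekly supplier feed; the power_rating field sometimes arrives blank.
weekly_rows = [
    {"week": 1, "units": 40, "power_rating": 2.5},
    {"week": 2, "units": 35, "power_rating": None},  # blank field slips through
    {"week": 3, "units": 50, "power_rating": None},  # blank field slips through
]

# A quarterly aggregate that skips blanks looks plausible on its own...
ratings = [r["power_rating"] for r in weekly_rows if r["power_rating"] is not None]
avg_rating = sum(ratings) / len(ratings)
print(f"avg power rating: {avg_rating}")       # 2.5, with no hint anything is missing

# ...unless completeness is reported next to it.
completeness = len(ratings) / len(weekly_rows)
print(f"completeness: {completeness:.0%}")     # 33%
```

Two of three weeks contribute nothing to the average, yet the headline number alone gives no warning.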
This is how data analytics often grows in organisations. Small pockets of insight develop organically and more and more data finds its way into the analytics universe. The aggregated nature of analytics processing often means that small, insidious errors can get lost in the noise.
The basis of all analytics, of course, is trust, and one small error can break the trust management holds in the analytics environment. The error doesn't even need to have a substantial impact on the analysis; the mere fact that an issue crept through undetected is enough for people to ask: what else have we missed? Where else is my decision making likely to be flawed?
So I believe your starting point for any analytics effort has to be lineage, then quality:
- What is the life history of this data?
- Who supplied it?
- Can I trust that source?
- Is it a reliable channel and well managed?
In terms of quality, apply rudimentary checks first to spot obvious gaps that mean the data should be dropped from analytical scope. If the data has value, then build more complex data quality rules, ideally at source, though you may also be able to use your analytics toolkit to measure the data within your own analytical environment.
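The rudimentary first pass described above can be as simple as a completeness profile. The sketch below assumes list-of-dict records; the field names and the 90% threshold are illustrative choices, not anything prescribed by a particular toolkit:

```python
def completeness(rows, field):
    """Fraction of rows where `field` is present and non-blank."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows) if rows else 0.0

def profile(rows, fields):
    """Rudimentary pass: measure completeness for each field of interest."""
    return {f: completeness(rows, f) for f in fields}

# Hypothetical supplier records, echoing the blank power rating example.
rows = [
    {"site": "A", "power_rating": 2.5},
    {"site": "B", "power_rating": ""},    # sent blank
    {"site": "C", "power_rating": None},  # sent blank
]

report = profile(rows, ["site", "power_rating"])
# Drop fields from analytical scope when they fail the rudimentary check.
in_scope = [f for f, c in report.items() if c >= 0.9]
```

Here `power_rating` scores one in three and falls out of scope, which is exactly the decision point: either exclude it or go back to the source and build firmer rules around it.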
Importantly, this must be a continuous process. You may build the integration architecture and analytical plumbing once, but as my earlier example shows, data manufacturing is not a uniform process and defect causes are ever present, so build continuity into the management of your analytics data by improving and monitoring it regularly.
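One way to build that continuity in is to re-run the same rule against every delivery and keep a history, so a sudden degradation stands out. This is a hedged sketch only; the alert threshold and the sample batches are assumptions for illustration:

```python
def completeness(rows, field):
    """Fraction of rows where `field` is present and non-blank."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows) if rows else 0.0

# Hypothetical weekly deliveries: a healthy week, then the supplier sends blanks.
weekly_batches = [
    [{"power_rating": 2.5}, {"power_rating": 3.0}],
    [{"power_rating": None}, {"power_rating": None}],
]

THRESHOLD = 0.95  # illustrative; tune to what the forecast can tolerate
history = []
for week, batch in enumerate(weekly_batches, start=1):
    rate = completeness(batch, "power_rating")
    history.append((week, rate))  # retained so trends are visible over time
    if rate < THRESHOLD:
        print(f"week {week}: quality alert, power_rating completeness {rate:.0%}")
```

Run against each file as it lands, a check like this would have flagged the blank power ratings in week two instead of letting them hide in the quarterly aggregate.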