Thoughts on dynamic data provenance

We've explored data provenance and the importance of data lineage before on the Data Roundtable (see here). If you work in a regulated sector such as banking, insurance or healthcare, it is especially important right now: it is one of the essential elements of data quality that regulators look for in their assessments.

Data lineage is vital to data quality management because we need to know where data originates from in order to build data quality rules to measure, monitor and improve it, ideally at source.

Achieving a mature level of data lineage is not an easy task. Most established firms have a complex data landscape, with many silos and poorly documented data processes, making it difficult to navigate the minefield of information pathways across the organisation.

One mistake I often see is the over-simplification of data lineage, with lineage assigned by entity or table. Someone enters a table name and then assigns a business owner, a technical owner and the source that feeds the table. The problem here is a lack of granularity: many sources can feed the same table over time, so you need greater flexibility in how the design may evolve.
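To make the problem concrete, here is a rough sketch of what a typical table-level lineage entry looks like and why it struggles once more than one source feeds the table. The field and feed names are illustrative only, not taken from any particular tool:

    # A typical table-level lineage entry: one owner, one source, no history.
    # Field names are illustrative, not any product's schema.
    table_lineage = {
        "table": "CUSTOMER_ORDER",
        "business_owner": "Head of Online Retail",
        "technical_owner": "Order Management Team",
        "source": "WEB_ORDERS_FEED",   # only room for a single source
    }

    # In practice several feeds populate the same table over time, so a single
    # 'source' attribute cannot say where any individual record actually came from.
    additional_feeds = ["CALL_CENTRE_FEED", "IN_STORE_APP_FEED"]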

I believe the only way to manage data lineage is by tagging each record of information with dynamic provenance data (aka, metadata).

I'll discuss a tagging approach later, but first let's explore a typical customer transaction. Imagine that you place an order with a grocery supplier to deliver your weekly shopping to your home.

Here's a simplified master record:

  • Name: Jane Marshall
  • Address: 1 Flower Meadow Lane, Chiswick, London, SW1 1BA
  • Order ID: 23141132
  • Delivery Date: 10th December 2014
  • Requested Time: 9am-10am
  • Total Cost: £97.21
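In code, that master record might look something like the rough sketch below (a hypothetical structure, not any real system's schema). Notice that nothing in it records where any of these values came from:

    from dataclasses import dataclass
    from datetime import date
    from decimal import Decimal

    # A hypothetical master record: it holds the business data,
    # but says nothing about where any of the values originated.
    @dataclass
    class DeliveryOrder:
        name: str
        address: str
        order_id: int
        delivery_date: date
        requested_time: str
        total_cost: Decimal

    order = DeliveryOrder(
        name="Jane Marshall",
        address="1 Flower Meadow Lane, Chiswick, London, SW1 1BA",
        order_id=23141132,
        delivery_date=date(2014, 12, 10),
        requested_time="9am-10am",
        total_cost=Decimal("97.21"),
    )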

This one simple example demonstrates how quickly data lineage becomes complicated. Population of the table could come from a variety of information pathways (see the sketch after this list):

  • Customer goes into a local grocery store and books a delivery using an in-store app created by the grocery firm
  • Customer has no internet access so calls up the grocery call centre to place their order for home delivery
  • Customer uses the internet application to place their order
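The simplest piece of provenance is to record which of these channels created the record at the point of capture. A minimal sketch, with channel names I've invented for illustration:

    from enum import Enum

    # Hypothetical channel codes for the pathways listed above.
    class SourceChannel(Enum):
        IN_STORE_APP = "in_store_app"
        CALL_CENTRE = "call_centre"
        WEB = "web"

    # The same order looks identical in the table whichever pathway created it;
    # capturing the channel at the point of entry is where lineage starts to differ.
    order_source = SourceChannel.CALL_CENTRE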

Now take the order detail record, which holds information on all the items delivered to the consumer; this also has several sources of information. The consumer can be the cause of changes and updates as they revise their order multiple times in the run-up to delivery. The grocery driver may also update the order at the point of delivery as they take back items the consumer no longer requires, with that information coming from a handheld device used on the doorstep.

We can see that each table in the system has multiple points of lineage, so you need to track provenance down to at least record level.

To achieve provenance at record level, there is a wealth of information you can tag (one way to model this is sketched after the list):

  • Who made changes to the data?
  • What application function made changes to the data?
  • Where was the record of data fed from?
  • When was the change made?
  • How was the change made e.g. manual entry vs. migrated?
  • Why was the change made e.g. new record creation, user update?
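Below is one way such a tag could be modelled. This is a minimal sketch; the class and field names are my own for illustration, not any product's schema:

    from dataclasses import dataclass
    from datetime import datetime

    # One provenance tag per change, answering the who/what/where/when/how/why
    # questions listed above.
    @dataclass
    class ProvenanceTag:
        who: str        # user or system account that made the change
        what: str       # application function that made the change
        where: str      # upstream source or feed the data arrived from
        when: datetime  # timestamp of the change
        how: str        # e.g. "manual entry", "migrated", "interface load"
        why: str        # e.g. "new record creation", "user update"

    tag = ProvenanceTag(
        who="driver_4821",
        what="proof_of_delivery_app.amend_order",
        where="handheld_pod_device",
        when=datetime(2014, 12, 10, 9, 42),
        how="manual entry",
        why="user update",
    )

Each change to a record would generate one such tag, whichever person or application made it.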

As you can see from this simple example, you cannot store all of this tagging information against each record itself, because multiple people can make changes. If the record is a master record, e.g. an equipment master, then scores of people and applications around the organisation may change it over time. You therefore need to track this tagging data over time in a separate location that maps tag information to the record in question and gives you the flexibility to add new tags and functionality as your approach matures.
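A rough sketch of what that separate, append-only store could look like (the function name, key format and tag fields are hypothetical):

    from collections import defaultdict
    from datetime import datetime

    # A separate, append-only provenance log: tags live outside the business
    # record and are keyed against it, so a record can accumulate any number
    # of tags over time and new tag fields can be added as the approach matures.
    provenance_log: dict[str, list[dict]] = defaultdict(list)

    def record_change(record_key: str, **tags) -> None:
        """Append a provenance tag for a record without touching the record itself."""
        entry = {"when": datetime.now(), **tags}
        provenance_log[record_key].append(entry)

    # The same order picks up tags from different people and applications over time.
    record_change("ORDER:23141132", who="jane.marshall", what="web_app.create_order",
                  where="web", how="manual entry", why="new record creation")
    record_change("ORDER:23141132", who="driver_4821", what="pod_app.amend_order",
                  where="handheld_pod_device", how="manual entry", why="user update")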

At this point, a range of technical arguments emerge:

  • Surely this is overkill and will slow down transaction performance
  • What about the cost of storing all this additional data?
  • Who will make use of all this 'noise'?
  • This is a major re-design; COTS products do not support this type of functionality

Clearly, this approach is technically challenging. However, if you want accurate data provenance and a deeper understanding of how your most critical data is sourced, modified and utilised, then you need to move from simple spreadsheets of endless tables and attribute mappings to a more dynamic, contextual tagging system.

With big data and cloud storage solutions this type of provenance is achievable and, what's more, it provides incredibly rich detail: not only data quality root causes but far deeper insight into the way your business and underlying architecture are performing.
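To illustrate the kind of insight this unlocks, here is a small, hypothetical query over such a provenance log: which application function most often introduced the records that failed a data quality rule?

    from collections import Counter

    # Continuing the hypothetical provenance log sketched earlier; once every
    # change carries its tags, root-cause analysis becomes a query rather than
    # an investigation. The data below is illustrative only.
    provenance_log = {
        "ORDER:23141132": [{"what": "pod_app.amend_order", "where": "handheld_pod_device"}],
        "ORDER:23141187": [{"what": "call_centre.create_order", "where": "call_centre"}],
    }
    failed_keys = ["ORDER:23141132", "ORDER:23141187"]  # records failing a DQ rule

    root_causes = Counter(
        tag["what"] for key in failed_keys for tag in provenance_log.get(key, [])
    )
    print(root_causes.most_common())  # which functions most often introduce bad data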


About Author

Dylan Jones

Founder, Data Quality Pro and Data Migration Pro

Dylan Jones is the founder of Data Quality Pro and Data Migration Pro, popular online communities that provide a range of practical resources and support to their respective professions. Dylan has an extensive information management background and is a prolific publisher of expert articles and tutorials on all manner of data related initiatives.
