Beyond provenance: your route to increased trust in statistical decision making

We often hear that companies are striving to manage their data as a “trusted asset”, but what does this really mean? Trust is like culture: seemingly hard to define and measure, but when it’s lacking, it becomes all too apparent.

Nowhere is trust more critical than in the world of statistics. With seemingly no limit to the processing and storage capacity available to even modest companies, the amount of data fuelling statistical decision making is clearly set to rise.

Clearly, the goal of any such statistical exercise is to create an advantage, either for the company or its clients. Companies like Google have been deploying statistical algorithms for years to enhance every aspect of their service.

Should the search screen have 10 results or 20? What effect does color have on clickthroughs? If we change the font size will people find this section on the page more easily?

Millions of data points confirm or refute these assumptions, and the statistical conclusions help Google to organically improve its services (and bottom line). The problem comes, however, when we start to ignore one of the foundations of trusted decision making: data provenance.

If you walk into the average business and ask them if they trust their data, chances are you’ll get a positive response. Ask the same person if they can provide provenance of their data and you’ll most likely get a blank stare.

So what is data provenance? Time for a definition:

The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody, and places of storage. The practice has a particular value in helping authenticate objects. Comparative techniques, expert opinions, and the results of scientific tests may also be used to these ends, but establishing provenance is essentially a matter of documentation. (Wikipedia)

So data provenance effectively means “understand where the stuff came from.”

In data terms, this means “data lineage,” “information lifecycle management” or “information chain management,” and it’s often the missing component in ensuring trust in statistical decision making.

Back in 1992 we faced this problem in a “Big Data Statistical Engine” of our own. We would take millions of records of census, geospatial and retail data across Europe and run a statistical process that calculated the optimal store location for car dealerships.

By implementing a provenance framework (basically tagging all data from supplier to point of processing) we were able to shave weeks off the production lead times, because we could trace defects far more quickly and began to introduce controls to check data at each hand-off point.
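The framework itself doesn’t need to be elaborate. As a minimal sketch (the ProvenanceTag structure, supplier name and stage labels below are illustrative, not the 1992 implementation), tagging a batch of records might look like this:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceTag:
    """Minimal lineage metadata carried alongside every batch of records."""
    supplier: str                              # who delivered the data
    source_file: str                           # original extract or feed name
    received_at: str                           # when we took custody of it
    hops: list = field(default_factory=list)   # every hand-off point it has passed through

def add_hop(tag: ProvenanceTag, stage: str) -> ProvenanceTag:
    """Record a hand-off so a defect found later can be traced back to the stage that produced it."""
    tag.hops.append({"stage": stage, "at": datetime.now(timezone.utc).isoformat()})
    return tag

# Illustrative use: tag a census extract as it moves towards the statistical engine
tag = ProvenanceTag(supplier="national census bureau", source_file="census_extract.csv",
                    received_at=datetime.now(timezone.utc).isoformat())
add_hop(tag, "standardisation")
add_hop(tag, "store-location model input")
```

With a tag like this attached at every hand-off, a defect found at the end of the chain points straight back to the stage, and the supplier, that introduced it.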

So simply understanding how your data moves from point of collection to point of processing can have huge benefits, but you can now go further.

Data doesn’t simply move from point to point; it is transformed, combined, aggregated, standardised and manipulated in myriad ways. These transformations are bound by logic such as:

  • “If the product sub-category is CHILD APPAREL then set category group to CLOTHING”
  • “Merge Geospatial coordinates for all ZIP Code entries”
  • “Extract digit number 10 from the Vehicle Identification Number and add a new Model Year entry”

Documenting these rules as part of our provenance process is not enough. We need to transform these data rules into data quality rules that are continuously monitored.
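To make that concrete, here is a minimal sketch of how a couple of the rules above might be expressed and then re-used as a quality check. The category mapping, the partial VIN year-code table and the field names are assumptions for illustration only, not rules from any real project:

```python
# Illustrative only: partial sample of the standard VIN model-year table
VIN_YEAR_CODES = {"N": 1992, "P": 1993, "R": 1994}

def derive_category_group(sub_category: str) -> str:
    """Transformation rule: CHILD APPAREL rolls up to the CLOTHING category group."""
    return "CLOTHING" if sub_category == "CHILD APPAREL" else "OTHER"

def derive_model_year(vin: str) -> int | None:
    """Transformation rule: character 10 of the VIN encodes the model year."""
    if len(vin) != 17:
        return None
    return VIN_YEAR_CODES.get(vin[9])  # 10th character, zero-indexed

def check_vehicle_record(record: dict) -> list[str]:
    """Data quality rule derived from the transformation above, run on every batch."""
    issues = []
    vin = record.get("vin", "")
    if len(vin) != 17:
        issues.append("VIN is not 17 characters long")
    elif derive_model_year(vin) is None:
        issues.append("Model Year could not be derived from VIN character 10")
    return issues
```

The point is not the Python; it is that the same logic that transforms the data also defines what “good” looks like, so it can be re-checked on every delivery rather than documented once and forgotten.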

For example, several years ago on a data quality assessment project we unpacked a mainframe data feed incorrectly on one field. A single character led the team to make a whole raft of assumptions, because the data suddenly pointed to a conclusion that fitted our own “mental model” of how the business should behave. We were wrong, and it was a stark lesson in how statistical decision making can be undermined by even the smallest flaw.

In our example, we found that creating a validation rule on the unpacked data helped us trap the flaw before it passed down the line. We had created a specification for quality that was continuously monitored along the entire information chain.
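A validation rule of that kind can be very small. The sketch below assumes a hypothetical single-character status field with an agreed domain (the original feed and field aren’t named here); the idea is simply to check the unpacked value before it moves downstream:

```python
# Hypothetical field and domain: placeholders, not the actual mainframe feed.
VALID_STATUS_CODES = {"A", "C", "T"}  # agreed single-character domain for the unpacked field

def validate_unpacked_record(record: dict) -> list[str]:
    """Trap unpacking defects before the record passes further down the information chain."""
    issues = []
    status = record.get("status_code", "")
    if len(status) != 1:
        issues.append(f"status_code length {len(status)}, expected 1 (field unpacked incorrectly?)")
    elif status not in VALID_STATUS_CODES:
        issues.append(f"status_code '{status}' is outside the agreed domain {sorted(VALID_STATUS_CODES)}")
    return issues
```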

We had combined data quality and data provenance. The result was greatly increased trust in any subsequent statistical decision making.

So think about your own statistical processes. What about their data provenance? Can you pull out a single item of data and trace its information chain back to the point of capture? Is the source reliable? Does the supplier publish quality metrics as part of a Service Level Agreement?

That's a good starting point, but can you measure the quality of that data across its lifecycle against agreed rules? Can you do it continuously? Can your data quality process spot defects instantly and alert others so they understand the risks to any subsequent decision making?
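If not, a small amount of wiring goes a long way. As a rough sketch under assumed names (the placeholder rule and batch below stand in for the agreed rules at each hand-off point in your own chain), continuous monitoring with alerting can start as simply as this:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("data-quality-monitor")

def example_rule(record: dict) -> list[str]:
    """Placeholder rule: in practice, plug in the agreed rules for each hand-off point."""
    return [] if record.get("zip") else ["missing ZIP code"]

def monitor_batch(records: list[dict], rules: list) -> int:
    """Run every agreed rule over every record and alert downstream users to any defect."""
    defects = 0
    for i, record in enumerate(records):
        for rule in rules:
            for issue in rule(record):
                defects += 1
                log.warning("record %d failed %s: %s", i, rule.__name__, issue)
    return defects

# Run on every delivery, not as a one-off assessment
batch = [{"zip": "75001"}, {"zip": ""}]
print(monitor_batch(batch, [example_rule]))  # logs one warning and prints 1
```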

Documenting data provenance is good. Managing lifecycle data quality, now that's your goal.


About Author

Dylan Jones

Founder, Data Quality Pro and Data Migration Pro

Dylan Jones is the founder of Data Quality Pro and Data Migration Pro, popular online communities that provide a range of practical resources and support to their respective professions. Dylan has an extensive information management background and is a prolific publisher of expert articles and tutorials on all manner of data related initiatives.
