After working in the data quality industry for a number of years, I have realized that most practitioners tend to have a rather rigid view of assertions about data quality: either a data set conforms to the data quality criteria and is deemed acceptable, or it fails to meet the levels of acceptability and is deemed flawed.
I suspect that our practice of designating a data set as being of “acceptable quality” (in relation to a discrete assessment) is an artifact of data warehousing, in which a data set is extracted, transformed and loaded as a single, static unit. Quality characteristics are measured en masse to provide an overall score for a static collection of records representative of the underlying data model.
Our data quality rules are typically defined in relation to the underlying data model, with the assumption that all of a modeled entity’s attributes will already have been completed. In the data warehouse scenario, this is definitely true, since the data set is extracted after transaction processing has already finished. In retrospect, it seems that a similar perception exists at the time the entity is modeled in the first place: completeness of the entity record is a necessary assumption; it wouldn't make sense to design an entity model with data attributes to which values are never assigned!
However, as data quality practitioners seek to extend the application of data validation rules to transactional and operational processing, it occurs to me that there may be some flaws in our thinking that could lead us to draw mistaken conclusions about the quality of data. The gap in the reasoning is that our data quality rules presume a static representation, while the data supporting transactional and operational processing is dynamic.
Here is an example: we say that no sales transaction can be processed unless the customer has a valid customer account. While this may be a valid assertion by the time the records of the day’s transactions are loaded into the data warehouse, it might not be 100% true while the actual sales process is executing.
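To make the static form of that rule concrete, here is a minimal sketch of how it might be checked over a batch of transactions at load time. The field names, sample records and set of valid accounts are all invented for illustration; this is not drawn from any particular tool.

```python
# Hypothetical batch check: flag any sales transaction whose customer
# does not have a valid account at load time. Field names are illustrative.

valid_accounts = {"C001", "C002", "C003"}  # customers already vetted by credit/finance

transactions = [
    {"txn_id": "T100", "customer_id": "C001", "amount": 250.00},
    {"txn_id": "T101", "customer_id": "C999", "amount": 80.00},  # no valid account
]

def validate_batch(transactions, valid_accounts):
    """Apply the static rule: every transaction must reference a valid account."""
    return [t for t in transactions if t["customer_id"] not in valid_accounts]

if __name__ == "__main__":
    for t in validate_batch(transactions, valid_accounts):
        print(f"Rule violation: transaction {t['txn_id']} has no valid customer account")
```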
Consider this scenario: a new customer attempts to buy a product. At that point the customer does not have an account. A temporary account is created and the customer is assigned a provisional customer identifier so that the transaction can be recorded. At the same time, credit information is collected from the prospective customer and submitted for review by the credit and finance department. Once the prospective customer’s credit information has been vetted, the customer’s account is upgraded from provisional to valid.
At a later point, the recorded sales transaction is processed, and the customer’s identifier is looked up to determine whether the customer is still in provisional status or has been upgraded. If the customer is still provisional, the transaction is put on hold until the next time transactions are processed. If the customer has been upgraded, the transaction is processed and the order is sent to fulfillment.
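To see why a static check trips over this workflow, here is a rough sketch of the provisional-to-valid lifecycle. The statuses, function names and in-memory “tables” are hypothetical, invented purely to illustrate the sequence described above.

```python
from enum import Enum

class AccountStatus(Enum):
    PROVISIONAL = "provisional"   # temporary account created at the point of sale
    VALID = "valid"               # credit information has been vetted

# Illustrative in-memory "tables"; names and structure are hypothetical.
accounts = {}
pending_transactions = []

def record_sale(customer_id, amount):
    """Record a sale, creating a provisional account for a new customer."""
    if customer_id not in accounts:
        accounts[customer_id] = AccountStatus.PROVISIONAL
    pending_transactions.append({"customer_id": customer_id, "amount": amount})

def vet_credit(customer_id):
    """Credit and finance upgrade the account once the review is complete."""
    accounts[customer_id] = AccountStatus.VALID

def process_transactions():
    """Process pending sales; hold any whose customer is still provisional."""
    still_pending = []
    for txn in pending_transactions:
        if accounts[txn["customer_id"]] is AccountStatus.VALID:
            print(f"Fulfilling order for {txn['customer_id']}")
        else:
            still_pending.append(txn)   # hold until the next processing run
    pending_transactions[:] = still_pending

# Between record_sale and vet_credit, the pending transaction references a
# provisional account; a batch rule like the one sketched earlier would flag
# it, even though that intermediate state is expected and not an error.
record_sale("C-NEW-1", 120.00)
process_transactions()          # held: account is still provisional
vet_credit("C-NEW-1")
process_transactions()          # now fulfilled
```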
What this means is that at some point in time there will be sales transactions for customers that do not yet have a valid customer account, and that is not an error. Rather, the records are still changing, and the assertions of quality cannot be applied until the workflow process has completed and the data has been resolved from its dynamic state to a static one.
In other words, the models may be static, but the data is dynamic. But if the data is dynamic, how can data quality rules be applied to the data in process? More next time…