In my last post, I pointed out that we data quality practitioners want to apply data quality assertions to data instances to validate data in process, but that the dynamic nature of data conflicts with our assumptions about how quality measures are applied to static records. In practice, the data used in conjunction with a business process may not be “fully formed” until the business process completes. This means that records may exist within the system that would be designated as invalid after the fact, yet from a practical standpoint remain valid at different points in time until the process completes.
In other words, the quality characteristics of a data instance are temporal in nature and dependent on the context of the process. For example, we might say that a record representing a patient’s hospital admittance and stay is valid if it has an admission date and a discharge date, and that the discharge date must be later than the admission date. However, while the patient is still in the hospital, the record is valid even though it is missing a discharge date.
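To make that temporal dependence concrete, here is a minimal sketch in Python. The record structure, the field names (admission_date, discharge_date), and the discharged flag are assumptions made for illustration, not a prescribed implementation:

```python
from datetime import date

def validate_stay_record(record: dict, discharged: bool) -> list:
    """Return validation failures for a hospital-stay record.

    The discharge-date assertions apply only once the patient has
    actually been discharged; before that, a missing discharge date
    is not a quality failure.
    """
    failures = []
    if record.get("admission_date") is None:
        failures.append("admission_date is required")
    if discharged:
        if record.get("discharge_date") is None:
            failures.append("discharge_date is required after discharge")
        elif (record.get("admission_date") is not None
              and record["discharge_date"] <= record["admission_date"]):
            failures.append("discharge_date must be later than admission_date")
    return failures

# While the patient is still in the hospital, the incomplete record is valid:
open_stay = {"admission_date": date(2024, 3, 1), "discharge_date": None}
print(validate_stay_record(open_stay, discharged=False))  # []
print(validate_stay_record(open_stay, discharged=True))   # ['discharge_date is required after discharge']
```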
This means that if you want to apply data quality rules embedded within a process, the rules themselves must be attributed as to their “duration of validity.” More simply, if you want to integrate a data quality/validation rule within a process, the rule must be applicable at that point in time; if it is not, the test itself is not valid. The rules have to reflect the quality assertions that hold at that point in the process, not expectations about the data after the process is complete. (For more on data quality in a big data world, read my white paper: "Understanding Big Data Quality for Maximum Information Usability.")
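One way to attribute rules with a duration of validity is to tag each rule with the earliest process stage at which its assertion must hold, and have the validation step evaluate only the rules applicable at the current stage. The stage names and Rule structure below are hypothetical, intended only as a sketch of the idea:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical process stages for a hospital-stay record.
STAGES = ["admitted", "in_care", "discharged", "billed"]

@dataclass
class Rule:
    name: str
    applies_from: str            # first stage at which the assertion must hold
    check: Callable[[dict], bool]

RULES = [
    Rule("has admission date", "admitted",
         lambda r: r.get("admission_date") is not None),
    Rule("has discharge date", "discharged",
         lambda r: r.get("discharge_date") is not None),
    Rule("discharge after admission", "discharged",
         lambda r: r.get("discharge_date") is not None
                   and r.get("admission_date") is not None
                   and r["discharge_date"] > r["admission_date"]),
]

def applicable_rules(stage: str) -> list:
    """Select only the rules whose duration of validity covers this stage."""
    return [rule for rule in RULES
            if STAGES.index(stage) >= STAGES.index(rule.applies_from)]

def validate(record: dict, stage: str) -> list:
    """Return the names of applicable rules that the record fails at this stage."""
    return [rule.name for rule in applicable_rules(stage)
            if not rule.check(record)]
```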
The upside of this approach is that it forces the analyst to consider the lifecycle of the record and how the expectations may change across that lifecycle. A record that is incomplete at one point in the process may still be valid; at a later point, that same incompleteness is no longer permissible. From a practical standpoint, integrated data validation can then identify the process points at which previously permissible gaps are no longer allowed, creating the opportunity for proactive steps to be taken to ensure the record’s quality. The quality of the data is allowed to coalesce, or “gel,” over time until the point at which all the quality assertions must be true.
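Continuing the hypothetical sketch above (reusing STAGES, RULES, and validate), a small lookahead can report the first future stage at which a currently permissible gap would turn into a failure, which is the hook for taking those proactive steps:

```python
from datetime import date

def first_failing_stage(record: dict, current_stage: str):
    """Look ahead from the current stage and report the first future stage
    at which the record, as it stands today, would fail validation."""
    start = STAGES.index(current_stage)
    for stage in STAGES[start + 1:]:
        if validate(record, stage):
            return stage
    return None

# A patient still in care: the record is valid now, but the gap in
# discharge_date will become a failure at the "discharged" stage.
open_stay = {"admission_date": date(2024, 3, 1), "discharge_date": None}
print(validate(open_stay, "in_care"))            # []
print(first_failing_stage(open_stay, "in_care")) # 'discharged'
```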