Data integration teams often find themselves in the middle of discussions where the quality of their data outputs is called into question. Without proper governance procedures in place, though, it's hard to respond to those accusations in a reasonable way. Here's why.
First, many operational data integration projects are not configured to include data quality assurance as part of the project. Believe it or not, during a meeting I heard a representative from a project contractor suggest that ensuring the quality of the data output was not part of their statement of work. I've also reviewed project plans and contracts in which the service provider specifically states that they make no guarantees about the quality or usability of their work.
Second, without defined data assessment processes, it's challenging to pinpoint the source of a data flaw. That's especially true in the context of data integration, which typically ingests data from different sources, applies some transformations, and reformulates the data into target data models in preparation for consumption by a downstream user or application. The data might already have been flawed before it was ingested (in which case it's not the fault of the data integration team). It might have been corrupted as part of the integration process (in that case, it is their fault!). Or it might have been defined in a way that was misinterpreted by the consumer (yet again, not their fault).
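To make those three origin points concrete, here is a minimal sketch of a typical integration flow. The file names, column names and join logic are purely illustrative assumptions, not a reference implementation:

```python
import pandas as pd

def integrate(customers_csv: str, orders_csv: str) -> pd.DataFrame:
    # 1. Ingest: a flaw here (say, a malformed email already present in the
    #    source extract) originated upstream, before integration ever ran.
    customers = pd.read_csv(customers_csv)
    orders = pd.read_csv(orders_csv)

    # 2. Transform: a flaw introduced here (a join that silently drops or
    #    duplicates rows, for example) belongs to the integration process.
    merged = customers.merge(orders, on="customer_id", how="left")

    # 3. Reformulate into the target model: a column such as "order_total"
    #    may be technically correct yet misinterpreted downstream (gross vs.
    #    net), which is a definition problem rather than an integration defect.
    target = merged.rename(columns={"amount": "order_total"})[
        ["customer_id", "email", "order_total"]
    ]
    return target
```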
Takeaways
There are two lessons to be learned from this – they're especially important because of the growing reliance on data to drive business.
- First, it's no longer acceptable for any team to abdicate accountability when it comes to ensuring the quality of processes that result in data products.
- Second, data integration teams should not have to shoulder the responsibility for flawed data when the flaws are beyond their control.
In essence, this means there must be some method for identifying the source of a data flaw. If a data integration professional can determine that a data quality issue originated at a particular external source, they can communicate that to the data owner. Conversely, if the issue originated during data integration processing, a data steward can use validation processes to try to pinpoint where the errors occurred, as in the sketch below.
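One simple way to do that is to run the same validation rules against the raw source extract and again against the integrated output: a rule that already fails at the source points back to the data owner, while a rule that passes at the source but fails afterward points at the integration process. The rules and column names below are hypothetical examples, assuming the pipeline sketched earlier:

```python
import pandas as pd

def emails_valid(df: pd.DataFrame) -> bool:
    """Example rule: every email value contains an '@'."""
    return df["email"].astype(str).str.contains("@").all()

def no_duplicate_keys(df: pd.DataFrame) -> bool:
    """Example rule: customer_id is unique."""
    return not df["customer_id"].duplicated().any()

CHECKS = {"emails_valid": emails_valid, "no_duplicate_keys": no_duplicate_keys}

def attribute_flaws(source_df: pd.DataFrame, integrated_df: pd.DataFrame) -> dict:
    """Run each rule before and after integration and attribute any failure."""
    attribution = {}
    for name, check in CHECKS.items():
        if not check(source_df):
            attribution[name] = "source data owner"       # flawed on arrival
        elif not check(integrated_df):
            attribution[name] = "integration processing"  # introduced in-flight
        else:
            attribution[name] = "ok"
    return attribution
```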
This approach provides a trace-back mechanism for monitoring data quality accountability. Having a “data quality audit report” that demonstrates that a data warehouse issue is traceable to flawed source data maintains the perception of integrity for the data integration, preparation and loading processes. At the same time, it gives downstream business data consumers the information they need to talk with source data owners about ways to ensure that input data meets their expectations.
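Such a report can be as simple as a tabular summary of the checkpoint results, stamped and shared with both the data owners and the consumers. A minimal sketch, continuing from the hypothetical attribution function above:

```python
from datetime import datetime, timezone
import pandas as pd

def build_audit_report(attribution: dict, dataset_name: str) -> pd.DataFrame:
    """Turn checkpoint results into a simple, shareable audit record."""
    audited_at = datetime.now(timezone.utc).isoformat()
    rows = [
        {"dataset": dataset_name, "check": check, "finding": finding,
         "audited_at": audited_at}
        for check, finding in attribution.items()
    ]
    return pd.DataFrame(rows)

# Example usage (names are assumptions from the earlier sketches):
# report = build_audit_report(
#     attribute_flaws(raw_customers, warehouse_customers),
#     dataset_name="customer_dim",
# )
# report.to_csv("dq_audit_report.csv", index=False)
```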
The implication is a deeper link between source data quality and the needs of the reporting and analytics consumers whose applications are driven off the data warehouse. Enforcing conformance with those users’ data quality expectations requires more formal methods of exchange that are defined, communicated and monitored using data quality policies. And that is the topic of my next post.