Data quality to "DI" for

There is a time and a place for everything, but no one ever seems quite sure about the time and place for data quality (DQ) in data integration (DI) efforts. I have previously blogged about the dangers of waiting until the middle of DI to consider, or be forced to consider, DQ. In hindsight, telling yourself you should have had forethought about DQ amounts to little more than someone telling you “I told you so.” Of course, many a DQ professional, myself included, has told you so. But I digress.

Part of the problem of finding the right time and place for DQ is that DI was historically associated with extract, transform and load (ETL) processes, where data is extracted from multiple source systems and transformed into the structures of a single target system into which the data is loaded. ETL as DI invited a source/target mentality where DQ was assumed to be something that happened before DI or after DI. Ignorance of DQ issues being bliss, source data was copied as-is to the target, relying on, or transforming data into, the target’s data type for format consistency (although a lot ended up loaded into free-form text fields).

While DI is intended to bring data together to make it more valuable, ETL often simply related data by natural keys (e.g., customer number, account number, product number). Deduplication and consolidation were usually frowned upon because then the number of records extracted from the source would not match the number of records loaded to the target. Data cleansing, standardization, matching and enrichment were DQ functions left to be performed either by source systems before ETL extracts, or by downstream applications after ETL loads.

Master data management (MDM) tools are perhaps the best example of DQ playing a key role in the middle of DI. MDM incorporates standardization, matching (far beyond natural keys), deduplication and consolidation on the multiple disparate definitions of master data entities (parties, products, locations, assets) held in various source systems in order to create and maintain a single version of the truth.
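To make "matching far beyond natural keys" concrete, here is a minimal Python sketch of standardization, matching and consolidation on a name field. It is an illustration under stated assumptions, not how any particular MDM tool works: the record layout, the function names (standardize, is_match, consolidate) and the 0.85 similarity threshold are all hypothetical, and the standard library's difflib stands in for the far richer matching logic a real product would use.

```python
from difflib import SequenceMatcher


def standardize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same entity if their standardized forms are similar enough.

    The threshold is an assumed value for illustration only.
    """
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio() >= threshold


def consolidate(records, threshold: float = 0.85):
    """Greedy clustering: a record joins the first cluster whose first member it matches."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if is_match(rec["name"], cluster[0]["name"], threshold):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Hypothetical source records: the natural keys differ, so a key-based join finds nothing.
records = [
    {"source": "crm", "customer_no": "C-1001", "name": "Acme Corp."},
    {"source": "billing", "customer_no": "88231", "name": "ACME Corp"},
    {"source": "support", "customer_no": "A-17", "name": "Globex, Inc."},
]

clusters = consolidate(records)
for cluster in clusters:
    print([f"{r['source']}:{r['customer_no']}" for r in cluster])
```

The two Acme records share no natural key, yet they land in the same cluster once their names are standardized, which is exactly the kind of relationship a key-only ETL join would miss.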

However, to achieve DQ to DI for (pardon that bad pun on “data quality to die for”), we cannot think of the time and place for DQ as before, after or even in the middle of DI. The real problem is not poor integration due to data duplication (and other DQ issues). The real problem is duplicated DQ efforts. Different systems, applications and business units are either using different tools or custom-coding their own bespoke DQ processes. DQ functionality for profiling, parsing, standardizing, matching, consolidating and enriching data has to be delivered as reusable services that can be embedded into batch, real-time and streaming processes whenever and wherever data is integrated.
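As a rough sketch of what "reusable" could mean in practice, the Python below defines one cleansing rule set and wraps it for three calling styles: a single record (real-time), a list (batch) and a lazy iterator (streaming). Everything here is assumed for illustration: the field names, the COUNTRY_ALIASES table and the function names are hypothetical, and a real service would sit behind an API or library boundary rather than in one script.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]

# Hypothetical lookup table used by the shared standardization rule.
COUNTRY_ALIASES = {"USA": "US", "UNITED STATES": "US", "U.S.": "US"}


def cleanse(record: Record) -> Record:
    """The single shared rule set: trim whitespace, lowercase email, standardize country."""
    out = {key: value.strip() for key, value in record.items()}
    if "email" in out:
        out["email"] = out["email"].lower()
    if "country" in out:
        country = out["country"].upper()
        out["country"] = COUNTRY_ALIASES.get(country, country)
    return out


def cleanse_batch(records: List[Record]) -> List[Record]:
    """Batch entry point: cleanse a whole extract at once."""
    return [cleanse(r) for r in records]


def cleanse_stream(records: Iterable[Record]) -> Iterator[Record]:
    """Streaming entry point: cleanse records lazily, one at a time, as they arrive."""
    for r in records:
        yield cleanse(r)


raw = [
    {"email": " Jane@Example.COM ", "country": "usa"},
    {"email": "bob@example.com", "country": " United States "},
]

single = cleanse(raw[0])                   # real-time: one record per call
batch = cleanse_batch(raw)                 # batch: the full list
streamed = list(cleanse_stream(iter(raw))) # streaming: consumed from an iterator
print(single, batch == streamed)
```

Because all three entry points delegate to the same cleanse function, a rule change is made once and every integration path picks it up, instead of each system maintaining its own bespoke copy.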