Throughout my long career of building and implementing data quality processes, I've consistently been told that data quality could not be implemented within data sources, because doing so would disrupt production systems. Therefore, source data was often copied to a central location – a staging area – where it was cleansed, transformed, deduplicated, restructured and loaded into new applications, such as an enterprise data warehouse or master data management hub.
This paradigm of dragging data from where it lives through data quality processes that exist elsewhere (and whose results are stored elsewhere) had its advantages. But one of its biggest disadvantages was the boundary it created – original data lived in its source, but quality data lived someplace else.
These boundaries multiply with the number of data sources an enterprise has. That's why a long-stated best practice has been to implement data quality processes as close to the data source as possible. That way, the someplace else where quality data lives is at least a short commute from the source data.
Now, the time has come to give data quality the shortest commute possible. It’s time for data quality to work from home, so to speak. We need to push data quality beyond boundaries to where data lives. Data quality processes need to be implemented within data sources, whether that means in-database, in-cluster or in-memory. There will likely be initial disruptions caused by these implementations, but that’s true of any implementation. I've blogged before about the benefits of managing data where it is. Bringing quality home to its sources will deliver benefits that will more than make up for any temporary disruptions to production systems.
That being said, you don’t want each data source implementing its own data quality logic. The enterprise needs standard, repeatable methods for maintaining high-quality data. You need to prioritize the reuse of data quality techniques as deployable services that bring a consistent implementation of data quality to wherever data lives.
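To make that idea a little more concrete, here is a minimal sketch of one shared rule deployed into two separate sources, where it then runs in-database. Everything in it is an assumption for illustration: the `standardize_phone` rule (a US-style 10-digit format), the `deploy_rules` and `make_source` helpers, and the two in-memory SQLite databases standing in for real production sources. It relies on the standard `sqlite3` module's `create_function`, which registers a Python function so that SQL running inside the database can call it.

```python
import sqlite3


def standardize_phone(raw):
    """Hypothetical shared rule: reduce a phone number to 10 digits.

    Returns None when the value cannot be standardized.
    """
    if raw is None:
        return None
    digits = "".join(ch for ch in raw if ch.isdigit())
    # Drop a leading country code of 1 (an assumption of this sketch).
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def deploy_rules(conn):
    """Register the shared rule inside a source so SQL can call it."""
    conn.create_function("standardize_phone", 1, standardize_phone)


def make_source(table, phones):
    """Build an in-memory stand-in for a production data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, phone TEXT)")
    conn.executemany(
        f"INSERT INTO {table} (phone) VALUES (?)", [(p,) for p in phones]
    )
    deploy_rules(conn)
    return conn


# Two sources with different tables, one data quality rule.
crm = make_source("customers", ["(919) 555-0100", "555-0100"])
billing = make_source("accounts", ["+1 919.555.0100", None])

# The rule runs where the data lives; nothing is copied to a staging area.
crm.execute("UPDATE customers SET phone = standardize_phone(phone)")
billing.execute("UPDATE accounts SET phone = standardize_phone(phone)")

crm_phones = [r[0] for r in crm.execute("SELECT phone FROM customers ORDER BY id")]
billing_phones = [r[0] for r in billing.execute("SELECT phone FROM accounts ORDER BY id")]
print(crm_phones)      # ['9195550100', None]
print(billing_phones)  # ['9195550100', None]
```

Both sources end up with the same standardized values because they share one definition of the rule rather than each writing its own. For brevity, this sketch overwrites values it cannot standardize with NULL; a real process would more likely flag them for review.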