Throughout my long career of building and implementing data quality processes, I've consistently been told that data quality could not be implemented within data sources, because doing so would disrupt production systems. Therefore, source data was often copied to a central location – a staging area – where it was cleansed, transformed, deduplicated, restructured and loaded into new applications, such as an enterprise data warehouse or master data management hub.
This paradigm of dragging data from where it lives through data quality processes that exist elsewhere (and whose results are stored elsewhere) had its advantages. But one of its biggest disadvantages was the boundary it created – original data lived in its source, but quality data lived someplace else.
These boundaries multiply with the number of data sources an enterprise has. That's why a long-stated best practice has been to implement data quality processes as close to the data source as possible. This way the someplace else where quality data lives is a place that is at least a short commute from the source data.
Now, the time has come to give data quality the shortest commute possible. It’s time for data quality to work from home, so to speak. We need to push data quality beyond boundaries to where data lives. Data quality processes need to be implemented within data sources, whether that means in-database, in-cluster or in-memory. There will likely be initial disruptions caused by these implementations, but that’s true of any implementation. I've blogged before about the benefits of managing data where it is. Bringing quality home to its sources will deliver benefits that will more than make up for any temporary disruptions to production systems.
That being said, you don’t want each data source implementing its own data quality logic. The enterprise needs standard, repeatable methods for maintaining high-quality data. You need to prioritize the reuse of data quality techniques as deployable services that bring a consistent implementation of data quality to wherever data lives.
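To make the idea of reusable data quality services a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the rule functions, the `STANDARD_RULES` list and the sample records are my own assumptions, not any particular product's API. The point is simply that the rules are defined once and the same list is applied wherever the data lives.

```python
from typing import Callable, Dict, List

# A record is a simple field-name -> value mapping (an assumption for this sketch).
Record = Dict[str, str]
Rule = Callable[[Record], Record]


def trim_whitespace(record: Record) -> Record:
    """Strip leading and trailing whitespace from every field value."""
    return {field: value.strip() for field, value in record.items()}


def normalize_email(record: Record) -> Record:
    """Lowercase the email field, if the record has one."""
    cleaned = dict(record)
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


# One shared, ordered rule set, defined once and reused by every source.
STANDARD_RULES: List[Rule] = [trim_whitespace, normalize_email]


def apply_rules(record: Record, rules: List[Rule] = STANDARD_RULES) -> Record:
    """Run a record through each rule in order and return the cleaned copy."""
    for rule in rules:
        record = rule(record)
    return record


# Two hypothetical sources holding the same contact in different shapes.
crm_row = {"name": "  Ada Lovelace ", "email": " Ada@Example.COM "}
billing_row = {"name": "Ada Lovelace", "email": "ada@example.com"}

# Both sources end up with the same standardized record.
print(apply_rules(crm_row) == apply_rules(billing_row))
```

Because each source calls the same `apply_rules` function rather than writing its own cleansing logic, a change to a rule is made in one place and shows up consistently everywhere.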
I also know that if you are transferring data from one source to another, mistakes can happen. I have seen this all too often: a team member insists the same data will make it to the new platform, then we go in and look for a particular contact (one that we knew had all of the correct data fields and didn't need to be touched) and it's gone. Oi! So much for data cleansing! Sometimes we call it "data destroying"! Always, always, always have a backup! Keep the original and don't let anyone tell you "you no longer need it." You need it until you can confirm that all of your contacts made it across.
Great point, Jackie. Even when we shorten data's commute we still need to make sure it reached its destination. Although it may have been modified during its journey (i.e., data quality issues get fixed), the data's arrival should be confirmed, any modifications should be documented, and a backup should always be kept just in case we need to go back to where we started.
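As a rough illustration of those three checks (confirm arrival, document modifications, keep a backup), here is a small Python sketch. The `reconcile` function, the contact keys and the sample data are all hypothetical, and it assumes each record has a stable key that survives the move.

```python
import copy
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]
FieldChange = Tuple[Optional[str], Optional[str]]  # (value before, value after)


def reconcile(
    source: Dict[str, Record], target: Dict[str, Record]
) -> Tuple[List[str], Dict[str, Dict[str, FieldChange]]]:
    """Compare source and target by key.

    Returns the keys that never arrived, plus a per-record log of
    every field whose value differs between source and target.
    """
    missing = sorted(key for key in source if key not in target)
    changes: Dict[str, Dict[str, FieldChange]] = {}
    for key, before in source.items():
        after = target.get(key)
        if after is None:
            continue  # already reported as missing
        diff = {
            field: (before.get(field), after.get(field))
            for field in sorted(set(before) | set(after))
            if before.get(field) != after.get(field)
        }
        if diff:
            changes[key] = diff
    return missing, changes


# Hypothetical contacts in the original system.
source = {
    "c1": {"email": "Ada@Example.COM"},
    "c2": {"email": "grace@example.com"},
}

# Keep an untouched copy of the original before anything is moved.
backup = copy.deepcopy(source)

# Hypothetical result of the move: c1 was cleansed, c2 was lost.
target = {"c1": {"email": "ada@example.com"}}

missing, changes = reconcile(source, target)
print("Missing:", missing)
print("Modified:", changes)
```

Only once the missing list is empty and every logged modification has been reviewed would it be reasonable to start talking about retiring the backup.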