In my previous post I discussed the practice of putting data quality processes as close to data sources as possible. Historically this meant data quality happened during data integration, in preparation for loading quality data into an enterprise data warehouse (EDW) or a master data management (MDM) hub. Nowadays, however, there’s a lot of source data available to the enterprise, and a significant amount of it doesn’t pass through an EDW or MDM process before it’s used. This raises the question: Where should data quality happen?
Wherever there’s data, there should be a warning label
Every few weeks I make chili in my slow cooker (I call it “wimpy chili” since I don’t like really spicy food); it’s a simple recipe. One of the main ingredients is ground beef. Grocery stores offer a wide variety of options, including ground chuck, ground round and ground sirloin. All ground beef contains a mixture of meat and fat, so it is labeled to warn consumers about the amount of fat it contains. Perhaps primed by the Pareto principle, when I make chili I purchase ground beef that’s 80% meat and 20% fat.
The amount of fat in ground beef is analogous to the amount of poor-quality data that business users consume. You shouldn’t consume ground beef without knowing its fat content. Likewise, you shouldn’t consume data without knowing its quality. So, wherever there’s data (in Hadoop, in-stream, in-memory, in-database, in-cloud), there should be a warning label.
A warning label for data quality could be implemented as a series of yes/no or pass/fail flags appended to all data structures and populated by callable services that perform some simple data quality checks. Examples include:
- Whether all critical fields were complete (i.e., populated with non-NULL data values).
- Whether fields were populated with a valid format (e.g., social security number, date of birth, postal code).
- Whether fields were populated with a valid value (e.g., gender code, state abbreviation, country code).
- Whether a duplicate check was performed when a customer or product record was created.
These simple checks wouldn’t guarantee high-quality data, but they would warn business users about how much poor-quality data they might be consuming. They might also provide a rapid assessment of whether additional data quality processing is needed before the data can be put to specific business uses.
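To make the warning label idea concrete, here is a minimal Python sketch of a callable check that returns one pass/fail flag per bullet above. Everything named in it is an assumption made for illustration: the field names, the short list of state codes, the US-style postal code pattern, the year-month-day date format, and the flag names themselves. A real service would draw these rules from the enterprise's own standards.

```python
import re
from datetime import datetime

# Hypothetical field names and reference values, chosen only for illustration.
CRITICAL_FIELDS = ("customer_id", "name", "postal_code")
VALID_STATES = {"CA", "NY", "TX"}
POSTAL_CODE_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4 (assumed)


def is_valid_date(value):
    """Return True if value parses as a calendar date in year-month-day order."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def warning_label(record, duplicate_check_performed=False):
    """Return a dict of pass/fail flags for one record (a plain dict)."""
    postal_code = record.get("postal_code")
    format_ok = (
        isinstance(postal_code, str)
        and POSTAL_CODE_PATTERN.match(postal_code) is not None
        and is_valid_date(record.get("date_of_birth"))
    )
    return {
        # All critical fields are populated with non-NULL values.
        "dq_complete": all(record.get(f) is not None for f in CRITICAL_FIELDS),
        # Postal code and date of birth match the expected formats.
        "dq_valid_format": format_ok,
        # State abbreviation is one of the allowed values.
        "dq_valid_value": record.get("state") in VALID_STATES,
        # The caller reports whether a duplicate check ran at record creation.
        "dq_duplicate_checked": duplicate_check_performed,
    }


customer = {
    "customer_id": 1,
    "name": "Ann",
    "postal_code": "27513",
    "date_of_birth": "1980-02-14",
    "state": "NY",
}
# Append the flags to the record so the label travels with the data.
labeled = {**customer, **warning_label(customer)}
```

Because the flags are appended to the record itself, whoever consumes it downstream can read the label without rerunning the checks.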
Data quality shouldn’t happen more than once
No matter where you think data quality should happen, make sure it doesn’t happen more than once. The enterprise needs standard, repeatable methods for maintaining high-quality data. Data quality functions (profiling, parsing, standardizing, matching, consolidating, enriching) should be reusable services embedded into batch, real-time and streaming processes. This way, wherever data happens to be, data quality happens to be there as well.
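As a rough illustration of "define once, reuse everywhere," the Python sketch below writes a single standardizing function and calls it from both a batch-style and a streaming-style runner. The function, the runner names and the `state` field are hypothetical stand-ins, not a reference to any particular product or API.

```python
def standardize_state(record):
    """Return a copy of the record with the state code trimmed and upper-cased."""
    cleaned = dict(record)
    state = cleaned.get("state")
    if isinstance(state, str):
        cleaned["state"] = state.strip().upper()
    return cleaned


def run_batch(records, service):
    """Batch style: apply the service to a whole collection at once."""
    return [service(record) for record in records]


def run_stream(records, service):
    """Streaming style: apply the same service one record at a time."""
    for record in records:
        yield service(record)


rows = [{"state": " ny "}, {"state": "tx"}]

batch_result = run_batch(rows, standardize_state)
stream_result = list(run_stream(iter(rows), standardize_state))
```

Both runners call the same `standardize_state` function, so the standardizing rule lives in one place and a change to it takes effect in every process that uses it.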