Data preparation is the process of blending, shaping and cleansing data to get it ready for analytics or other business purposes. Among the many tasks involved in data preparation, data quality plays an important role in refining the data being prepared.
One of the most common definitions of data quality is fitness for the purpose of use. Most data has multiple uses and multiple users, and data of sufficient quality for one use or user may not be of sufficient quality for another. These multiple, and often conflicting, requirements can seem irrelevant to an individual user, who simply needs quality data to support their own business activities.
Let's use customer data as an example. Marketing may use it for a campaign that targets customers in certain age groups with rebate offers sent via e-mail, whereas finance uses it to bill customers with an outstanding account balance via their postal address. Marketing cares about the quality of the date of birth and e-mail address columns, whereas finance cares about the quality of the account balance and postal address columns. Not only are these different uses for the same data, they also have different tolerances for poor data quality. Missing, incorrect or redundant information may make the marketing campaign inefficient or ineffective, but similar issues in the finance data prevent the organization from collecting money it's owed.
Build once, then reuse
Even though different purposes have different fitness requirements, this doesn't mean data quality has to be an isolated, independent process. The rules applied to individual data columns can still be reused for other purposes. Regardless of what use the data is put to, the logic to validate a postal address and date of birth, and to verify an e-mail address and account balance, should be built once and shared. This ensures that even when the same data is prepared for different high-level purposes, low-level data quality checks are easily repeatable and consistently performed.
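To make this concrete, here is a minimal sketch of what such shared, column-level checks might look like. It assumes Python with only the standard library; the function names and the US-style postal code pattern are illustrative assumptions, not a prescribed implementation.

```python
# Illustrative, build-once validation rules shared by every preparation job.
# Assumes Python standard library only; names and patterns are hypothetical.
import re
from datetime import date, datetime

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # rough syntax check only
US_ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")            # assumes US-style postal codes


def validate_email(value) -> bool:
    """Syntactic check; true verification would need a mailbox or domain lookup."""
    return isinstance(value, str) and bool(EMAIL_PATTERN.match(value))


def validate_date_of_birth(value) -> bool:
    """Accept ISO-formatted dates in the past that imply a plausible age."""
    try:
        dob = datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False
    today = date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return dob <= today and age <= 120


def validate_postal_code(value) -> bool:
    """Pattern check only; full address validation would use a postal reference file."""
    return isinstance(value, str) and bool(US_ZIP_PATTERN.match(value.strip()))


def validate_account_balance(value) -> bool:
    """Balances must be numeric; negative values are allowed (credits)."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False
```

Marketing's campaign preparation and finance's billing preparation can then both import these functions instead of re-implementing the same checks.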
The extent to which data quality functions such as validation, deduplication and enrichment can be performed during data preparation is often determined by the ability to reuse such components from other efforts. The point is: Don't reinvent the wheel, reuse what's already there. Ideally, the enterprise should strive to turn its data quality components into a library of functions and a repository of rules that can be reused to cleanse data. The more reusable your data quality processes are, the more you can enable self-service data preparation. This makes business users less reliant on IT to build custom processes, and makes the entire organization more productive while extracting the most value from data assets across the enterprise.
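As a sketch of what such a library and rule repository could look like in a self-service setting, the example below registers the rules once and lets each team apply only the checks relevant to its purpose. It assumes pandas, and the names RULE_REPOSITORY and apply_rules are invented for illustration.

```python
# Illustrative rule repository reused across preparation jobs (hypothetical names).
from typing import Callable, Dict, List

import pandas as pd

# Central registry: column name -> validation rule. Built once, reused everywhere.
RULE_REPOSITORY: Dict[str, Callable[[object], bool]] = {
    "email":           lambda v: isinstance(v, str) and "@" in v,
    "date_of_birth":   lambda v: pd.notna(pd.to_datetime(v, errors="coerce")),
    "postal_code":     lambda v: isinstance(v, str) and len(v.strip()) >= 5,
    "account_balance": lambda v: pd.notna(pd.to_numeric(v, errors="coerce")),
}


def apply_rules(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Flag, per row, whether each requested column passes its registered rule."""
    flags = pd.DataFrame(index=df.index)
    for col in columns:
        flags[f"{col}_valid"] = df[col].map(RULE_REPOSITORY[col])
    return flags


customers = pd.DataFrame({
    "email": ["a@example.com", "not-an-email"],
    "date_of_birth": ["1985-04-12", "someday"],
    "postal_code": ["27513", "1"],
    "account_balance": ["120.50", "n/a"],
})

# Marketing and finance reuse the same repository, each checking only the
# columns that matter for its own purpose.
marketing_flags = apply_rules(customers, ["date_of_birth", "email"])
finance_flags = apply_rules(customers, ["account_balance", "postal_code"])
print(marketing_flags)
print(finance_flags)
```

Because the rules live in one shared place, a business user can prepare a new dataset by selecting the relevant columns and rules rather than asking IT to build a custom process.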
Download the paper "5 Data Management for Analytics Best Practices."