Many organizations are developing analytics practices that combine the data they generate (originating from within the enterprise system landscape) with data sets that originate externally. There's a wide range of controls asserted around these external data sets. Some of the data sets are acquired through arrangements with trusted partners, and others are open data sets taken from known sources (such as government data). These data sets may have had some prescriptive data validation prior to their release. Other data is subjected to much less attention, such as scraped data from collections of websites, uncurated data sets from unknown third parties, or data collected from data streams such as social media channels.
No matter what, your data most likely needs some kind of preparation prior to being used for analytics. Of course, everyone aspires to achieving optimal results from their analyses. And knowing the truth behind the concept of “garbage in, garbage out,” many people opt to transform the data in hopes that those transformations will improve the quality of the results. (Note that I used the word “preparation” rather than “cleaning” when I referred to using data for analytics.) I hope my rationale for that choice of terms will become clear after you read the following list of conundrums – that's my fancy word for fundamental data-philosophical problems about what we have traditionally referred to as data cleaning.
Conundrum 1: Should you be encouraged to clean data?
There are a number of different aspects of data cleanliness. One familiar aspect involves standardization – mapping from known variations to a standard representation. An example is standardizing address terms, like mapping Street, Str, and ST to ST to represent the term “street.” This type of standardization is generally benign, since it is based on a defined standard that many people agree to. The issue occurs when performing mappings and transformations based on presumed standards, such as mapping individual nicknames to a standard form. An example would be mapping “Rick,” “Rich,” “Ricky,” “Richie” and “Dick” to the name “Richard.” It is true that all of these character strings represent variations of the name Richard – but they could also be shortened forms of other names, or they could represent an individual’s full name rather than a nickname.
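To make the risk concrete, here is a minimal sketch of nickname standardization. The mapping table is hypothetical, not a real standard – and that is exactly the problem: once the map is applied, the original value is gone and the assumption baked into the map is invisible.

```python
# Hypothetical nickname-to-standard-form map. Applying it collapses variants
# to "Richard" -- including "Dick" for someone whose legal name is Dick.
NICKNAME_MAP = {
    "Rick": "Richard",
    "Rich": "Richard",
    "Ricky": "Richard",
    "Richie": "Richard",
    "Dick": "Richard",
}

def standardize_name(name: str) -> str:
    """Return the presumed standard form, or the name unchanged if unmapped."""
    return NICKNAME_MAP.get(name, name)

# The mapping is applied uniformly, whether or not it is correct for a record.
print(standardize_name("Dick"))   # Richard
print(standardize_name("Maria"))  # Maria
```

A safer design keeps the original value in its own field and records the standardized form alongside it, so the transformation can be reviewed or reversed.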
Other types of cleaning are more sophisticated, using multiple data sets as sources of truth against which individual records are compared. Records that cannot be validated are then often removed from the data set and excluded from the analysis.
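A sketch of that validation pattern, with hypothetical data: records are checked against an assumed reference set, but flagged rather than silently dropped, so the decision to exclude them stays with the analyst.

```python
# Hypothetical "source of truth": customer IDs known to the reference system.
reference_customer_ids = {"C001", "C002", "C003"}

records = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "C999", "amount": 75.0},   # cannot be validated
]

# Flag each record instead of deleting it outright.
for rec in records:
    rec["validated"] = rec["customer_id"] in reference_customer_ids

validated = [r for r in records if r["validated"]]
unvalidated = [r for r in records if not r["validated"]]
print(len(validated), len(unvalidated))  # 1 1
```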
Cleaning data may simplify analytics. For example, hierarchical roll-ups by location are more accurate when standard locations are used (think about sales reports by neighborhood/ZIP code/town/county/state). But here is the dilemma: Although cleaning the data might make the analysis work better, when you change the data you may skew the result.
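The roll-up problem is easy to demonstrate. In this sketch (town names and sales figures are made up), variant spellings fragment the totals until a standardization step merges them:

```python
from collections import defaultdict

# Hypothetical sales records with inconsistent town names.
sales = [("Brooklyn", 100), ("BKLYN", 50), ("brooklyn ", 25), ("Queens", 80)]

def standardize_town(town: str) -> str:
    """Normalize case/whitespace, then apply an assumed abbreviation map."""
    t = town.strip().upper()
    return {"BKLYN": "BROOKLYN"}.get(t, t)

raw_totals = defaultdict(int)
std_totals = defaultdict(int)
for town, amount in sales:
    raw_totals[town] += amount                    # fragments into 4 buckets
    std_totals[standardize_town(town)] += amount  # merges into 2 buckets

print(len(raw_totals), len(std_totals))  # 4 2
print(std_totals["BROOKLYN"])            # 175
```

The standardized roll-up is clearly more useful here – and yet the same transformation, applied to a data set where the variants were meaningful, would quietly skew the result.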
Conundrum 2: Does “selected excision” introduce bias?
When I use the term “selected excision,” I mean removing certain data instances or objects from the analysis as directed by a predefined set of selection criteria. Examples include eliminating duplicate records, records that are deemed to be outliers, records that appear to be deliberately falsified or records with missing information. In some cases (such as when developing customer behavior models), this means eliminating transaction records that cannot be positively linked to a known individual. The problem is that even if these records do not contain complete or accurate information, they do include some information that might be relevant to configuring prototypical behavior profiles. Removing those records means pulling out information that might be relevant as part of a behavior model – and consequently, the models are skewed (at best) or wrong (at worst).
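Here is a small illustration with made-up transaction data: excising the records that cannot be linked to a known customer visibly shifts the aggregate the behavior model would be built from.

```python
# Hypothetical transactions; one cannot be linked to a known customer.
transactions = [
    {"customer_id": "C001", "amount": 20.0},
    {"customer_id": None,   "amount": 500.0},  # unlinked, but real behavior
    {"customer_id": "C002", "amount": 30.0},
]

# Selected excision: keep only records linked to a known individual.
linked = [t for t in transactions if t["customer_id"] is not None]

avg_all = sum(t["amount"] for t in transactions) / len(transactions)
avg_linked = sum(t["amount"] for t in linked) / len(linked)
print(round(avg_all, 2), avg_linked)  # 183.33 25.0
```

The excised record carried no customer ID, but it did carry behavioral signal – and removing it cut the average transaction size by an order of magnitude.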
Conundrum 3: Who dictates what is meant by “clean”?
What is good for one analyst may be garbage for another. I often use the case of fraud analysis as an example. Individuals that are perpetrating fraud often try to mask their behavior – and that means using aliases, providing variant data and selectively omitting information when requested. Identity resolution techniques can be used to link records that could possibly be related to the same individual. But that does not mean that the records should be scrubbed so that they all share the same values – that would erase the evidence of the attempts to mask the fraudulent behavior!
Abstractly, this suggests that data scrubbing might be valuable in one business context (such as linking information requests together for sales processing) but would ruin the analysis in another (e.g., fraud detection). In other words, different data consumers have different ideas about what is meant by “clean.”
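One way to serve both consumers is to link records without overwriting them. In this sketch (the records and the naive matching rule are purely illustrative), identity resolution attaches a shared resolved ID while leaving the variant values – the evidence of aliasing – intact:

```python
# Hypothetical records that may refer to the same individual under aliases.
records = [
    {"record_id": 1, "name": "Rick Smith",    "ssn_last4": "1234"},
    {"record_id": 2, "name": "Richard Smith", "ssn_last4": "1234"},
    {"record_id": 3, "name": "Jane Doe",      "ssn_last4": "9876"},
]

# Cluster on a match key instead of rewriting the name fields.
# (Matching on ssn_last4 alone is far too naive for real use.)
clusters = {}
for rec in records:
    key = rec["ssn_last4"]
    rec["resolved_id"] = clusters.setdefault(key, len(clusters) + 1)

print([r["resolved_id"] for r in records])  # [1, 1, 2]
# The original names survive, so the fraud analyst still sees the aliases.
```

The sales process can aggregate on `resolved_id`; the fraud analyst can still examine the unscrubbed variants within each cluster.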
These are just a few of the issues associated with the challenge of ensuring quality analytics. In my upcoming post, I'll look at ways to address these issues to enable high-quality results without compromising the integrity of the original sources of information.

Download a paper: 5 Data Management for Analytics Best Practices