Perhaps the core component of any master data management system is identity resolution. Identity resolution methods process a collection of records to find groups of records that are presumed to represent the same real-world entity despite data variations (such as misspellings, miskeyed data or incorrect values). The simplest form of identity resolution links records that share the exact same corresponding values for selected identifying attributes (such as last name, first name and Social Security Number for individuals). This approach will find some duplicate records, but not all of them. It's not sophisticated enough to find record pairs that represent the same entity but whose values do not exactly match across all identifying attributes.
More complex methods of linkage take the variation into account. These methods use similarity scoring to compare record pairs to see if their values are close enough to deem them as a match. The naive approach to matching the records in data set A to data set B would require that each record in data set A be compared with every record in data set B. In most cases this would incur a significant amount of computation. For example, if data set A has 500,000 records and data set B has 4,000,000 records, the matching analysis would require 2,000,000,000,000 comparisons (2 trillion!).
To address this, those who develop matching applications have traditionally sought to limit the total number of record-to-record comparisons performed. They do this by blocking the data into smaller sets of records with a higher likelihood of matching. Blocking is generally done by selecting a combination of data element values that are likely to be shared among records that represent the same entity and then grouping the records that share those values into subsets. An example of a blocking key would be the first three characters of a last name field concatenated with the first three digits of the ZIP code. All records sharing that same key would be grouped together, and only those records that were in the same block would be compared.
To continue our example, if we were to create 1,000 blocks of data set A (each with 500 records) and 1,000 blocks of data set B (each with 4,000 records), then each pair of corresponding blocks requires 2,000,000 comparisons. For 1,000 block pairs, this requires 2,000,000,000 (2 billion) comparisons. That's still a lot, but it's 1,000 times smaller than the naïve approach.
The complexity of big data distribution
That being said, shouldn’t a big data platform make the need for optimization moot? Many people seem to think that a Hadoop or Spark cluster (where the records are distributed and there are multiple parallel processing units) should easily be able to handle the full-scale set of comparisons. These people are misguided because they may not understand the complexity of how data distribution across a distributed file system impacts application performance. There are three potential issues:
- The need to shuffle the data. Let's say we want to distribute the computations across the set of processing units. Recall that matching each record in data set A against each record in data set B means that all the records in both sets have to be accessible by all the processors. You can break up the computation – but at some point all the records have to be broadcast, if not shuffled around, to each processor. So even if this were a reasonable alternative it would require a significant amount of data transfer, incurring a lot of latency.
- The need to reduce the results. Even if you could effectively distribute the computation, at best you would be computing partial sets of matches that would still need to be resolved against each other. In other words, you might find some sets of record matches in your isolated groups, but those matched sets would also need to be analyzed for matches with records in other isolated groups. This suggests that attempting to distribute the computation almost doesn't make sense unless you've figured out a way to maintain a running log of your equivalence classes of matched records.
- There is still a lot of computation. The example we have (500,000 and 4,000,000 records) is not even such a big example. Imagine that you had 10,000,000 records and wanted to look for duplicates – that would need 1x1014. That would tax any but the most serious massive clusters.
Simply put, a naïve approach to identity resolution probably will still not scale on a big data platform, especially as the data set volumes increase. In my next post, we'll dig a little deeper into this issue.Download a paper about bringing the power of SAS to Hadoop