My last post suggested it's unrealistic to assume that a naïve approach to identity resolution – comparing each record in one data set against all records in the data set – would easily scale on a parallel processing cluster. Instead, we must approach the issue of big data identity resolution the same way similar issues have been handled for decades: the data must be broken up into smaller subsets of records that are more likely to match.
The interesting question, then: How do you do this with a big data programming and execution environment? The key is to understand the nature of data parallelism and how computation is optimized within the execution environment.
Let’s take Apache Spark as an example. Spark programming is based on the concept of resilient distributed data sets, or RDDs. An RDD is basically a representation of a data set; and two types of operations are performed on an RDD:
- Transformation, which effectively applies some computation to one RDD to produce another RDD.
- Action, which returns a value.
If we use an example of the set of all integers between 1 and 100 stored as an RDD, then a transformation would be adding 2 to each value (resulting in a new RDD with the integers between 3 and 103). And an action would ask for the sum of all the values in the RDD.
The part about Spark that speeds performance is lazy computation. When you write your Spark program, all transformations are cached up, and they're not executed until there is an action. At that point the collected transformations can be applied in parallel to each data element in the RDD, and the data and the parallel computations can be distributed across the different nodes in the cluster.
Now – to address the big data identity resolution challenges – you have to be able to take the data, break it up into its smaller groups, create sets of record pairs for comparison and then apply the similarity comparison functions. If you were searching for approximate duplicates in a single file, in Spark pseudo-code it looks something like this:
|Load the records.||Read the data set into an RDD.||Linear in relation to the number of records.|
|Break it up into blocks using a blocking key.||Use GroupByKey using the defined blocking key.||This action incurs a shuffle of data!|
|Create record pairs.||Use itertools.product to create a cross product of the records in each block.
Take the result and turn it into a list.
|This creates a new RDD with a list of sets of record pairs.|
|Flatten the list of sets.||Use the RDD flatMap.||Pulls each record pair in all the sets into a single list.|
|Apply similarity comparison to each record pair.||Use the RDD map of your similarity function to each record pair in the RDD.||Create a new RDD where each element is a list consisting of the record pair and the similarity score.|
|Filter out those record pairs whose score is above the match threshold.||Use the RDD filter function using the matching threshold as the filtering condition.||Creates a new RDD with all record pairs whose similarity score is above the matching threshold.|
I'll leave the details as an exercise for readers. The upshot, though, is that this process for big data identity resolution still requires a shuffle (using the GroupByKey). But once that is applied, it does reduce the space of record pairs to be compared, and it does take advantage of data parallelism. The only caveat is that the GroupByKey might need to pull all the data to a single node, and this will prove to be a performance bottleneck.
I'll suggest that there might be some other ways of doing this that are affected less by an approach to blocking. As another exercise for readers, let me pose this question: What are some alternatives for reducing or eliminating the need for this bottleneck with big data identity resolution?Download a paper about bringing the power of SAS to Hadoop