I've been doing some investigation into Apache Spark, and I'm particularly intrigued by the concept of the resilient distributed dataset, or RDD. According to the Apache Spark website, an RDD is “a fault-tolerant collection of elements that can be operated on in parallel.” Two aspects of the RDD are particularly interesting to me from a programming standpoint.
The first is the operational model, in which there are two types of operations: a transformation and an action. The programmer applies a transformation to an existing RDD to create a new RDD – and when an action is requested, values are returned from the RDD after running a computation on the data set.
The second is referred to as “lazy execution” for transformations. Conceptually, each transformation transforms one RDD into another. But the reality is quite different. In effect, each RDD represents a series of “cached” transformations to be applied to each data element in the RDD. The results are not actually computed until there is an action. At that point, the computations associated with all the cached transformations are distributed among the processing nodes and executed. The ability to cache sets of transformations but not execute them until their results are requested is similar to a programming model I learned about a long time ago called “continuations.” The continuations model had the same method of caching operations until there was a request for results.
This model allows for pipelined parallel execution, which means interim results can be returned even though the processes are still executing. In many cases this is manifested as streaming results – much faster than waiting for the entire computation to complete. Some claim that applications running on Spark have ten-fold or even hundred-fold speedups from comparable applications running on MapReduce.
So what does this have to do with data quality?
In the big data world, a powerful meme involves capturing data sets in their raw state and allowing users to analyze them in their own ways. But that means no data quality transformations would be applied to the data when it is acquired – it would happen only when the data sets are used. That's actually a good thing when one user’s needs differ from another’s, and when neither person wants the other one's data quality rules applied to the raw data.
From my perspective, the continuation programming paradigm actually provides the application developer with a lot of flexibility when it comes to applying data validation, standardization and cleansing transformations targeted at a wide range of downstream data consumers. With Spark, each business user has his or her own RDD view of the data, and each user’s set of data quality rules can be captured as a series of RDD transformations.
The result is a balance: between the capture and management of raw data sources, and competing opinions about the right set and sequence of data transformations for data quality assurance. Different users can craft their own views without affecting others' views. Although this is dramatically different from our traditional approach of enforcing one set of rules as the data set is captured and stored, it aligns with more modern approaches to consumer-based data usability.
2 Comments
very interesting article!
Thanks Kim - good to see that you thought it was interesting!