Big data quality with continuations


I've been doing some investigation into Apache Spark, and I'm particularly intrigued by the concept of the resilient distributed dataset, or RDD. According to the Apache Spark website, an RDD is “a fault-tolerant collection of elements that can be operated on in parallel.” Two aspects of the RDD are particularly interesting to me from a programming standpoint.

The first is the operational model, which distinguishes two types of operations: transformations and actions. The programmer applies a transformation to an existing RDD to create a new RDD; when an action is requested, Spark runs the computation over the data set and returns values.
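As a minimal sketch of the distinction (assuming a local PySpark installation; the data and app name here are mine, purely illustrative): map is a transformation that only defines a new RDD, while reduce is an action that actually runs the computation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")    # local mode; app name is arbitrary

numbers = sc.parallelize([1, 2, 3, 4])       # base RDD
doubled = numbers.map(lambda n: n * 2)       # transformation: just defines a new RDD
total = doubled.reduce(lambda a, b: a + b)   # action: runs the computation and returns a value
print(total)                                 # -> 20
```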

The second is what the Spark documentation refers to as “lazy evaluation” of transformations. Conceptually, each transformation transforms one RDD into another. But the reality is quite different: in effect, each RDD represents a series of “cached” transformations to be applied to each data element, and the results are not actually computed until there is an action. At that point, the computations associated with all the cached transformations are distributed among the processing nodes and executed. The ability to cache sets of transformations but not execute them until their results are requested is similar to a programming model I learned about a long time ago called “continuations,” which had the same method of caching operations until there was a request for results.
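The laziness can be made visible with a traced function, a hedged sketch of my own: the trace fires only when an action forces execution. (In local mode the print appears in the driver's console; on a real cluster it would land in executor logs.)

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def traced_double(n):
    # Fires only when an action forces execution.
    print("computing", n)
    return n * 2

pending = sc.parallelize([1, 2, 3]).map(traced_double)  # nothing printed yet
print("transformation defined; nothing computed yet")
print(pending.collect())   # action: only now do the "computing ..." lines appear
```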

This model allows for pipelined parallel execution: interim results can be returned while the processes are still executing. In many cases this manifests as streaming results, which is much faster than waiting for the entire computation to complete. Some claim that applications running on Spark see ten-fold or even hundred-fold speedups over comparable applications running on MapReduce.
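One concrete way this shows up in the RDD API is toLocalIterator(), which returns results roughly one partition at a time, so a consumer can begin reading early values before the whole job finishes. A rough sketch, with sizes chosen arbitrarily:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

big = sc.parallelize(range(1_000_000), 100)   # 100 partitions
squares = big.map(lambda n: n * n)

# Unlike collect(), which waits for everything, toLocalIterator() lets the
# driver consume early values while later partitions have not yet been computed.
for value in squares.toLocalIterator():
    if value >= 10_000:
        break   # later partitions may never be computed at all
```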

So what does this have to do with data quality?

In the big data world, a powerful meme involves capturing data sets in their raw state and allowing users to analyze them in their own ways. But that means no data quality transformations would be applied to the data when it is acquired – it would happen only when the data sets are used. That's actually a good thing when one user’s needs differ from another’s, and when neither person wants the other one's data quality rules applied to the raw data.

From my perspective, the continuation programming paradigm actually provides the application developer with a lot of flexibility when it comes to applying data validation, standardization and cleansing transformations targeted at a wide range of downstream data consumers. With Spark, each business user has his or her own RDD view of the data, and each user’s set of data quality rules can be captured as a series of RDD transformations.
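A hedged sketch of that idea, with invented records and deliberately simplistic quality rules: two consumers derive independent views over the same untouched raw RDD, each view being its own chain of cached transformations.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical raw feed; in practice this might be sc.textFile() over a
# landing zone, but inline data keeps the sketch self-contained.
raw = sc.parallelize(["alice, 42", "bob,17", "carol,unknown"])

def parse(line):
    return [field.strip() for field in line.split(",")]

records = raw.map(parse)   # shared parse step; still lazy

def valid_age(rec):
    return len(rec) == 2 and rec[1].isdigit()

# User A's view: strict validation; unparseable records are discarded.
view_a = records.filter(valid_age)

# User B's view: lenient standardization; bad ages are defaulted to "-1".
view_b = records.map(lambda r: r if valid_age(r) else [r[0], "-1"])

# Each chain is just cached transformations over the same untouched raw data;
# nothing runs until each consumer invokes an action of their own.
print(view_a.count(), view_b.count())   # -> 2 3
```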

The result is a balance between the capture and management of raw data sources and the competing opinions about the right set and sequence of transformations for data quality assurance. Different users can craft their own views without affecting anyone else's. Although this is dramatically different from our traditional approach of enforcing one set of rules as the data set is captured and stored, it aligns with more modern approaches to consumer-based data usability.


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author on data management best practices, writing via the expert channel at b-eye-network.com and in numerous books, white papers and web seminars. His book Business Intelligence: The Savvy Manager's Guide (June 2003) has been hailed as a resource that allows readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book Master Data Management has been endorsed by data management industry leaders, and his MDM insights can be reviewed at mdmbook.com. David is also the author of The Practitioner's Guide to Data Quality Improvement. He can be reached at loshin@knowledge-integrity.com.
