A double take on sampling


My previous post made the point that the question is not whether sampling is good for you, but how good your sample is. The comments on that post raised two different, and equally valid, perspectives on sampling. These viewpoints reflect two different use cases for data, which I'll address in this post.

Preventing duplicates

“I certainly do believe in sampling for simplicity,” Prashanta Chandramohan commented, “but the speed is the main criteria here. If we can analyze a larger data set in a fairly decent amount of time, why not take that route? I have been working on data matching and de-duplication process for a long time now and it used to take days to run a matching process on a fairly small set of data (10 million or so records). But now, I do matching with much higher accuracy in a few hours, and I am not even talking about running a matching engine on Hadoop using the MapReduce framework. We still have a long way to go, but these disruptive technologies are making it clear that we don’t need to worry about volume anymore.”

Like Chandramohan, I have spent a large portion of my career working on data matching, especially for what is now referred to as master data management (MDM). A common challenge is developing, testing and tuning a match algorithm for de-duplicating existing master data and preventing future duplicates in the MDM system.

In this case, the total input volume might be billions of records. For the efficiency of rapid prototyping, a small representative sample was often used for the initial development of the match algorithm. To Chandramohan’s point, even without (and before) Hadoop, technological advancements have enabled the use of a lot more data, without requiring more processing time, during the development phase of a match algorithm.
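As a minimal sketch of the prototyping step described above, here is one way to draw a reproducible simple random sample from a large record set. The record layout and sample fraction are illustrative assumptions, not anything specific to a particular MDM product:

```python
import random

random.seed(7)  # fixed seed so the prototyping sample is reproducible

# Hypothetical master records; in production this could be billions of rows.
records = [{"id": i, "name": f"Customer {i}"} for i in range(1_000_000)]

# Draw a 1% simple random sample for rapid match-algorithm prototyping.
sample = random.sample(records, k=len(records) // 100)

print(len(sample))  # 10000 records: small enough to iterate on quickly
```

Sampling without replacement (as `random.sample` does) keeps the prototype data free of artificial duplicates, which matters when the algorithm being tuned is itself a duplicate detector.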

However, representative samples are still needed for the interactive review of the match results with business users and subject matter experts. Here the amount of data used is not constrained by processing efficiency, but by the time constraints of manual review (as well as the reality that there is a cognitive limit to the amount of data humans can look at during a given timeframe). Therefore, relatively small data samples attempting to represent common business scenarios are used during user acceptance testing.
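A sample built for manual review usually needs to be stratified so that every business scenario appears in front of the reviewers, rather than left to chance. The sketch below assumes each match result has already been tagged with the scenario it exercises; the scenario names and quota are hypothetical:

```python
import random
from collections import Counter

random.seed(3)

# Hypothetical match results, each tagged with the business scenario it exercises.
scenarios = ["exact_name_match", "nickname_variation", "typo_in_surname",
             "same_household", "changed_address"]
match_results = [{"pair_id": i, "scenario": random.choice(scenarios)}
                 for i in range(100_000)]

# Stratified sample: a fixed quota per scenario keeps the review set small
# while guaranteeing every scenario gets in front of the business users.
review_set = []
for s in scenarios:
    stratum = [r for r in match_results if r["scenario"] == s]
    review_set.extend(random.sample(stratum, k=20))

print(len(review_set))  # 100 pairs: reviewable within the time available
```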

Of course, testing and verifying the feedback on the match results requires processing progressively larger representative samples, eventually processing the full volume of data before the match algorithm is approved for production.

In this MDM scenario, therefore, sampling is used, but eventually we must use all the data.

Predicting trends

Preventing duplicates, however, is a very different use case than predicting trends, a category of use cases more commonly associated with big data analytics. “Even if enabling technologies for big data will allow you now to take all the data for an analysis,” Stefan Ahrens commented, “there still might be good reasons for applying an intelligent sampling strategy. For example, for a predictive modeling project where the modeled outcome is binary (i.e., customer response vs. no response) and the distribution is very skewed in the underlying population (i.e., many more non-responders than responders). I’d rather spend some time finding the appropriate approach of over-sampling (for the less frequent target outcome level) than just using all the data that are available, even if I’m no longer limited by technology to do so. Oversampling — done the right way — usually improves the predictive power of the model. Yes, I might not make use of all the information that is available in the data source. However, what is lost in terms of information is more than made up for by better model quality (predictive performance).”
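Ahrens's point about oversampling a rare target outcome can be sketched in a few lines. This is a simplified illustration with made-up class proportions, using plain random oversampling with replacement (one of several valid approaches, and not necessarily the one Ahrens had in mind):

```python
import random

random.seed(42)

# Hypothetical skewed population: 2% responders (1), 98% non-responders (0).
population = [1] * 200 + [0] * 9_800

responders = [y for y in population if y == 1]
non_responders = [y for y in population if y == 0]

# Oversample the rare class: draw responders with replacement until the
# training set is balanced 50/50 against the non-responders.
oversampled = non_responders + random.choices(responders, k=len(non_responders))
random.shuffle(oversampled)

print(sum(oversampled) / len(oversampled))  # 0.5: a balanced training set
```

The model trained on the balanced set sees enough of the rare outcome to learn its signal; the trade-off is that predicted probabilities must later be recalibrated back to the true population rate.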

“Statisticians have developed valid techniques for sampling,” Rick Wicklin commented. “Ignoring this research is lazy and leads to biased results.” Wicklin blogged about estimating popularity based on Google searches, which is a great example of how “just use all of the data” is not always a great approach. Significant biases exist in (and about) the big data sets now being utilized for such things as analyzing customer sentiment or predicting flu trends.
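Wicklin's warning about bias can be made concrete with a toy simulation. The numbers below are invented for illustration: suppose dissatisfied customers are much more likely to post online, so "all the data" from the online channel is still a biased sample, while a small random sample of the full customer base is not:

```python
import random

random.seed(1)

# Hypothetical population: 100,000 customers, 20% truly dissatisfied.
population = [i < 20_000 for i in range(100_000)]

# Dissatisfied customers post online 60% of the time; satisfied ones 10%.
# The online channel is "big data," but it is still a biased sample.
online_posts = [d for d in population
                if random.random() < (0.6 if d else 0.1)]

true_rate = sum(population) / len(population)        # exactly 0.20
online_rate = sum(online_posts) / len(online_posts)  # ~0.60: wildly inflated

# A small simple random sample of the full population does far better.
sample = random.sample(population, k=1_000)
sample_rate = sum(sample) / len(sample)              # close to 0.20
```

No amount of volume fixes the online estimate, because the bias is in how the data were generated, not in how many of them there are.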

In these analytics scenarios, therefore, properly performed sampling is more useful than using all of the data.

Do a double take on sampling

The next time you hear someone say sampling is a good or bad approach, do a double take. Determine what type of problem they are applying data and analytics to by taking a sample of the goals they are trying to achieve before you advise them to sample or not.


About Author

Jim Harris

Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management, and business intelligence.
