In my previous post, I discussed sampling error (i.e., when a randomly chosen sample doesn’t reflect the underlying population, aka margin of error) and sampling bias (i.e., when the sample isn’t randomly chosen at all), both of which big data advocates often claim can, and should, be overcome by using all the data.
In this post, I want to counter that bias against sampling by explaining why it is still sensible to sample: it works when it is done well (as opposed to done poorly, as in the example of Mozart for Babies).
It’s not all-or-nothing
Megan Thee-Brenan opened her recent New York Times article with the old survey researcher joke: “If you don’t believe in random sampling, the next time you have a blood test, tell the doctor to take it all.”
Blood testing is not the only example of how analytics need not be all-or-nothing. Public opinion polling, the focus of Thee-Brenan’s article, is another. This is an undying staple of political reporting and election forecasting. The hip or hyped, depending on your perspective, big data version of public opinion polling is sentiment analysis.
Thee-Brenan explained that if — and this, she noted, is the big if — the correct methods are employed, a randomly selected sample of a population can be used to estimate the views of the entire population. “Every member of the population has to have an equal or at least a known chance of being chosen, called probability sampling.”
Phoning it in
Telephone polls, for example, use probability sampling and random digit dialing. “A random sample is taken from a pool of all possible telephone numbers,” Thee-Brenan explained. “But if a person has two phone numbers, he or she has a higher chance of being contacted for a survey, so their responses are weighted to adjust for that. Pollsters are confident they can interview about 1,000 people to measure the views of a nation of over 300 million” with a margin of error (aka sampling error) of plus or minus three percentage points.
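The weighting Thee-Brenan describes can be sketched as inverse-probability weighting: a respondent reachable on two phone lines is twice as likely to be sampled, so their response counts half as much. This is a minimal toy sketch with hypothetical data, not any pollster's actual method.

```python
# Toy inverse-probability weighting: each respondent's weight is
# 1 / (number of phone lines), offsetting their higher chance of selection.
respondents = [
    {"answer": "yes", "phone_lines": 1},
    {"answer": "yes", "phone_lines": 2},  # reachable on two numbers
    {"answer": "no",  "phone_lines": 1},
]

weighted_yes = sum(1 / r["phone_lines"] for r in respondents if r["answer"] == "yes")
total_weight = sum(1 / r["phone_lines"] for r in respondents)

# Unweighted, "yes" would look like 2 out of 3; weighted, the dual-line
# respondent counts only half, giving a weighted "yes" share of 0.6.
print(round(weighted_yes / total_weight, 2))
```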
Larger samples (even the really big ones used by big data) decrease, but never eliminate, the margin of error.
Sometimes using a larger sample is not worth the effort and cost. In telephone polls, increasing the sample size to 2,000 people will decrease the margin of error to plus or minus two percentage points, but it considerably increases the cost, in time and money, of conducting the poll. This is why most pollsters stick to samples of 1,000 people.
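The diminishing returns described above follow from the standard formula for the margin of error of a proportion, which shrinks only with the square root of the sample size. A quick sketch (assuming simple random sampling and the worst-case proportion p = 0.5 at 95% confidence):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Doubling the sample from 1,000 to 2,000 only trims the margin from
# roughly +/-3 points to roughly +/-2 points; even a million respondents
# leaves a nonzero margin of error.
for n in (1_000, 2_000, 1_000_000):
    print(n, round(100 * margin_of_error(n), 2))
```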
Survey says . . .
The bottom line is that it is not a matter of whether it is good for you to use samples. It is a matter of how good the sample you are using is. You need a sound sampling methodology and a sound mind that accepts the fact that no matter how big your sample is, its margin of error never disappears — despite what some big data surveys say.
A growing trend is to use statistical software to "mine the web." People count the number of web pages that contain some term and the number that contain some other term, and then draw a conclusion about which term is more popular. Inherent in this approach is the assumption that the number of web pages is a valid proxy for the popularity of that term in the population. It is not, and I published an article, "Estimating popularity based on Google searches," that gives several reasons why this is a biased and inaccurate sample.
As you point out, statisticians have developed valid techniques for sampling. Ignoring this research is lazy and leads to biased results.
Thanks for your comment and the link to your excellent article, Rick. It reminded me of my lament for the lazy "journalism" that has become sadly popular these days, namely the use of status updates from Twitter and Facebook and Google search results to "report" on the public sentiment about current events. These biased and inaccurate samples skew public perception. I fear the hype of big data analytics could cause similar issues within corporate environments when sound statistical techniques are not followed.
Great post, Jim. The point on blood sampling in particular made me laugh out loud (literally).
I certainly do believe in sampling for simplicity, but speed is the main criterion here. If we can analyze a larger data set in a fairly decent amount of time, why not take that route?
I have been working on data matching and de-duplication processes for a long time now, and we used to take days to run a matching process on a fairly small data set (10 million or so records). Now I can do the matching with much higher accuracy in a few hours, and that is without even running the matching engine on Hadoop using the MapReduce framework. We still have a long way to go, but these disruptive technologies are making it clear that we no longer need to worry about volume.
Very good post presenting the need for sampling. Even if enabling technologies for big data now allow you to take all the data for an analysis, there still might be good reasons for applying an intelligent sampling strategy. For example, in a predictive modeling project where the modeled outcome is binary (e.g., customer response vs. no response) and the distribution is very skewed in the underlying population (i.e., many more non-responders than responders), I'd rather spend some time finding the appropriate approach to over-sampling (for the less frequent target outcome level) than just use all the data that are available, even if I'm no longer limited by technology. Oversampling, done the right way, usually improves the predictive power of the model.
Yes, I might not make use of all the information that is available in the data source. However, what is lost in terms of information is more than made up for by better model quality (predictive performance).
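The oversampling strategy the commenter describes can be sketched roughly as follows: keep every record with the rare outcome and draw a matching random subset of the common outcome, so the modeling data set is balanced. The 2% response rate and record counts here are hypothetical.

```python
import random

random.seed(0)  # reproducible draw for this sketch

# Hypothetical skewed population: 200 responders (1) among 10,000 records.
population = [1] * 200 + [0] * 9_800

responders = [y for y in population if y == 1]
non_responders = [y for y in population if y == 0]

# Keep all 200 responders; randomly sample 200 non-responders to match.
sample = responders + random.sample(non_responders, len(responders))
random.shuffle(sample)

print(len(sample), sum(sample) / len(sample))  # 400 records, 50% responders
```

Note that after training on an oversampled set like this, predicted probabilities are biased toward the rare class and are typically adjusted back toward the true population rate before scoring.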