In my previous post, I discussed sampling error (i.e., when a randomly chosen sample doesn’t reflect the underlying population, aka margin of error) and sampling bias (i.e., when the sample isn’t randomly chosen at all), both of which big data advocates often claim can, and should, be overcome by using all the data.
In this post, I want to balance that bias against sampling by explaining why sampling is still sensible: it works when it is done well (as opposed to when it is done poorly, as in the Mozart for Babies example).
It’s not all-or-nothing
Megan Thee-Brenan opened her recent New York Times article with the old survey researcher joke: “If you don’t believe in random sampling, the next time you have a blood test, tell the doctor to take it all.”
Blood testing is not the only example showing that analytics is not all-or-nothing. Public opinion polling, the focus of Thee-Brenan’s article, is another, and it remains an undying staple of political reporting and election forecasting. The hip or hyped (depending on your perspective) big data version of public opinion polling is sentiment analysis.
Thee-Brenan explained that if — and this, she noted, is the big if — the correct methods are employed, a randomly selected sample of a population can be used to estimate the views of the entire population. “Every member of the population has to have an equal or at least a known chance of being chosen, called probability sampling.”
Phoning it in
Telephone polls, for example, use probability sampling and random digit dialing. “A random sample is taken from a pool of all possible telephone numbers,” Thee-Brenan explained. “But if a person has two phone numbers, he or she has a higher chance of being contacted for a survey, so their responses are weighted to adjust for that. Pollsters are confident they can interview about 1,000 people to measure the views of a nation of over 300 million” with a margin of error (aka sampling error) of plus or minus three percentage points.
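To make the weighting idea concrete, here is a minimal sketch in Python. It assumes the simplest possible adjustment: a respondent reachable on k phone numbers is k times as likely to be dialed, so his or her answer is given a weight of 1/k. The function name weighted_share and the four-person sample are made up for illustration; real pollsters weight on many more factors than this.

```python
def weighted_share(responses):
    """Weighted share of 'yes' answers.

    responses: list of (answered_yes, phone_numbers) pairs.
    Each respondent is weighted by 1 / phone_numbers to offset the
    higher chance of being dialed (an illustrative simplification).
    """
    weights = [1 / phones for _, phones in responses]
    yes_weight = sum(w for (yes, _), w in zip(responses, weights) if yes)
    return yes_weight / sum(weights)


# Hypothetical sample: two people with one number, two people with two numbers.
sample = [(True, 1), (False, 1), (True, 2), (True, 2)]

unweighted = sum(1 for yes, _ in sample if yes) / len(sample)
print(f"Unweighted share of yes: {unweighted:.2f}")
print(f"Weighted share of yes:   {weighted_share(sample):.2f}")
```

In this toy sample the two-number respondents all said yes, so the unweighted figure overstates the yes share and the weighting pulls it back down.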
Larger samples (even the really big ones used by big data) decrease, but never eliminate, the margin of error.
Sometimes using a larger sample is not worth the effort and cost. In telephone polls, increasing the sample size to 2,000 people will decrease the margin of error to roughly plus or minus two percentage points, but it considerably increases the cost, in time and money, of conducting the poll. Doubling the work buys only about one point of precision, which is why most pollsters stick to samples of 1,000 people.
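The numbers above can be checked with the standard textbook approximation for the margin of error of a proportion. The sketch below assumes a simple random sample, a 95 percent confidence level (z = 1.96), and the worst case of an evenly split population (p = 0.5). The function name margin_of_error is mine, not something from Thee-Brenan’s article, and real polls adjust for weighting and design effects that this ignores.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes a simple random sample of size n, a true proportion p
    (0.5 is the worst case), and a z-score for the confidence level
    (1.96 corresponds to roughly 95 percent).
    """
    return z * math.sqrt(p * (1 - p) / n)


for n in (1_000, 2_000, 10_000, 1_000_000):
    points = 100 * margin_of_error(n)
    print(f"n = {n:>9,}: +/- {points:.2f} percentage points")
```

Under these assumptions a sample of 1,000 comes out at about plus or minus 3.1 points and a sample of 2,000 at about 2.2. Because the margin shrinks with the square root of the sample size, even a sample of a million still carries a small, nonzero margin of error.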
Survey says . . .
The bottom line is that the question is not whether you should use samples, but how good your sample is. You need a sound sampling methodology and a sound mind that accepts that, no matter how big your sample is, its margin of error never disappears, despite what some big data surveys say.