In his recent Financial Times article, Tim Harford explained that the big data that interests many companies is what we might call found data – the digital exhaust from our web searches, our status updates on social networks, our credit card purchases and our mobile devices pinging the nearest cellular or WiFi network.
Not only are these datasets big, Harford explained, “but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of datapoints collected for disparate purposes and they can be updated in real time. As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.”
Tripping on the trap of correlation
“Found data underpin the new internet economy,” Harford explained, “as companies such as Google, Facebook and Amazon seek new ways to understand our lives through our data exhaust.” Harford shared his concerns, however, about putting too much faith in what we find in found data, including some of the points I covered in previous posts about big data hubris and the accuracy of data storytelling.
Harford also cautioned that theory-free, data-rich models, which merely search for statistical patterns in data and care about correlation rather than causation, are inevitably fragile. “If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster and cheaper these days – but we must not pretend that the traps have all been made safe. They have not.”
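The fragility Harford describes is easy to reproduce. One classic illustration (a sketch I am adding here, not Harford’s own example) is that two completely independent random walks – series with no causal connection whatsoever – routinely show strong correlation, which then breaks down in fresh data for exactly the reason he gives:

```python
import random

random.seed(1)

def random_walk(n):
    """A cumulative sum of independent Gaussian steps."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(x[-1] + random.gauss(0, 1))
    return x

def corr(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Correlate many pairs of independent walks: despite the absence of
# any causal link, large correlations are routine, not rare.
rs = sorted(abs(corr(random_walk(100), random_walk(100)))
            for _ in range(300))
print(f"median |r| = {rs[150]:.2f}, share with |r| > 0.7 = "
      f"{sum(r > 0.7 for r in rs) / len(rs):.0%}")
```

Because nothing real lies behind these correlations, there is no way to know when they will break down – a model built on them fails without warning.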
Being biased about sampling bias
A great example of a trap that many believe big data has made safe is sampling. Many big data advocates argue that sampling error, which occurs when a randomly chosen sample doesn’t reflect the underlying population, can be overcome by using a larger random sample to shrink the margin of error. However, as Harford explained, “sampling error has a far more dangerous friend: sampling bias. Sampling bias is when the sample isn’t randomly chosen at all.” One classic example Harford cited was the large (2.4 million) seemingly random sample of voters that failed to predict the winner of the 1936 U.S. presidential election. The sample wasn’t truly random because it was compiled from automobile registrations and telephone directories that, during the Great Depression, represented a disproportionately prosperous subset of voters.
Big data advocates often counter by saying they overcome sampling error and sampling bias by not sampling at all. Instead they claim to use all of the data, not just a sample. While that sounds good, even if we assume for the moment that getting all of the data is possible, such an approach can still suffer from a systematic bias.
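The distinction between the two traps can be shown with a small simulation. In the sketch below (the numbers are illustrative, not figures from the 1936 election), a population supports a candidate at 60%, but prosperous voters, who support the candidate less, are accepted into the sample three times as readily – much as car and phone ownership skewed the 1936 sampling frame. Growing the sample shrinks the margin of error, yet the biased estimate converges on the wrong answer:

```python
import random

random.seed(0)

TRUE_SUPPORT = 0.60  # overall support in the population (illustrative)

def one_voter():
    """Return (is_prosperous, supports_candidate) for a random voter."""
    prosperous = random.random() < 0.30
    # Prosperous voters support the candidate less often; the two
    # rates average out to TRUE_SUPPORT across the whole population.
    p = 0.40 if prosperous else 0.686
    return prosperous, random.random() < p

def random_sample(n):
    """Unbiased estimate: a simple random sample of n voters."""
    return sum(one_voter()[1] for _ in range(n)) / n

def biased_sample(n):
    """Biased estimate: prosperous voters enter the sample three
    times as readily as everyone else."""
    votes = []
    while len(votes) < n:
        prosperous, vote = one_voter()
        if random.random() < (1.0 if prosperous else 1 / 3):
            votes.append(vote)
    return sum(votes) / n

for n in (1_000, 100_000):
    print(f"n={n:>7}  random: {random_sample(n):.3f}  "
          f"biased: {biased_sample(n):.3f}  (true: {TRUE_SUPPORT})")
```

With the biased frame, the larger sample merely produces a tighter estimate around the wrong value (about 0.53 here rather than 0.60); more data cannot repair a non-random sample.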
Sentiment analysis, for example, could record and analyze every message on Twitter to draw conclusions about the public mood. However, Twitter users are not representative of the population as a whole. In fact, research has shown that Twitter users in the U.S. are disproportionately young and urban. As Harford explained, it is a seductive illusion to believe big data sets are comprehensive.
Lack of bias trumps lots of data, but just as we can never truly get all of the data, we can also never truly eliminate all of the biases we bring to what we find in found data.