“Correlation does not imply causation” is a saying commonly heard in science and statistics emphasizing that a correlation between two variables does not necessarily imply that one variable causes the other.
One example of this is the relationship between rain and umbrellas. People buy more umbrellas when it rains, which establishes a strong correlation between rainy days and umbrella sales. This does not imply, however, that buying an umbrella causes it to rain; obviously it does not. Nor does it necessarily imply that rain is the cause of every umbrella sale. Yes, being caught unprepared by a rainstorm can cause you to buy an umbrella. But preparedness-minded people buy umbrellas in advance so that they are ready for a rainy day, and people also buy umbrellas on, or in preparation for, sunny days to protect themselves from blinding and skin-burning sunlight.
The point is that correlations are easy to find and causes are difficult to prove. As a result, what triggers us to take what we consider to be a data-driven action is more often a correlation than a proven cause.
While it’s easy to criticize extrapolating weak correlations from small data sets, it’s also easy to believe that the correlations we will find in big data sets will be stronger, and therefore more reliably actionable. “With enough data, the numbers speak for themselves,” Chris Anderson famously argued. By contrast, Geoffrey Bowker argued that “raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care.”
In their paper “Critical Questions for Big Data,” danah boyd and Kate Crawford explained that “big data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets. Too often, big data enables the practice of apophenia: seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions.”
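The mechanism behind apophenia can be sketched with nothing but random numbers. The following is a minimal illustration, not anything taken from boyd and Crawford’s paper: it assumes a short random “target” series and a pile of equally random, unrelated “candidate” series, and it reports the strongest correlation a search across the pile turns up. The function names pearson and strongest_spurious_correlation, the series length of 20, and the seed are all hypothetical choices made for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_candidates, n_points=20, seed=0):
    """Largest |r| between one random target and n_candidates unrelated series.

    Every series is independent Gaussian noise (an assumption made for
    illustration), so any correlation found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # The more unrelated series we search, the stronger the best "match".
    for n in (1, 10, 100, 1000, 10000):
        print(n, round(strongest_spurious_correlation(n), 2))
```

Because the seed is fixed, each larger search includes every series from the smaller ones, so the strongest correlation found can only stay the same or grow as more candidates are cross-referenced. None of the series has anything to do with the target; the apparent pattern comes purely from the size of the search.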
An excellent example boyd and Crawford cited, drawn from David Leinweber’s work on data mining, is a strong but spurious correlation between changes in the S&P 500 stock market index and butter production in Bangladesh. This serves as a creamy cautionary tale for big data analytics. Beware the trap of correlation. Otherwise you may believe that as the butter churns in Bangladesh, so turn the financial fortunes of the world.