When legendary travelling folk singer-songwriter Woody Guthrie summarized his approach to organizing workers, he said, “Take it easy, but take it.” Wise words in any case, and especially whenever we put big data on the back burner to talk statistics instead. In the context of big data, I would say: take it easy by not using a lot of heavy statistical jargon when proposing your big data solutions, but do take statistics, do vigilantly educate others in the value it brings, and do continue to bring a sound statistical perspective to the whole big data hype.
This is no small task. Big data is trending, now and most likely for the foreseeable future. It is the it thing right now! Evidence can be found in coverage of the IT industry:
- The European Commission is investing €2.5 billion in IT, whatever "IT" is.
- IT is definitely going to bring at least 100,000 new jobs to Europe by 2020.
- IT is difficult to value, so it is impossible to insure.
- IT can predict who you are, based on your postcode.
- And maybe IT can help stop the spread of Ebola.
It's important to realize that IT is challenging and requires patience and humility, but also determination. And big data requires statistics.
Without statistics, big data is never going to reach its full potential. Perhaps even more importantly, without informed consumers, big data can be used to reach misleading and downright dangerous conclusions.
Unfortunately, many people are privately dealing with a kind of statistical post-traumatic stress syndrome. Stats courses are all too often viewed as dry and abstract "trials of life" to be endured, rather than as the formative introduction to a core and vibrant skill set vital to every functioning citizen.
I quote H.G. Wells, from his book Mankind in the Making (1903), on the growing importance of statistical thinking in society:
The great body of physical science, a great deal of the essential fact of financial science, and endless social and political problems are only accessible and only thinkable to those who have had a sound training in mathematical analysis, and the time may not be very remote when it will be understood that for complete initiation as an efficient citizen of one of the new great complex worldwide States that are now developing, it is as necessary to be able to compute, to think in averages and maxima and minima, as it is now to be able to read and write.
Wells wrote that more than a century ago. In the meantime, the ways in which the peoples of the world interact have not gotten any less complex. And the amount of data available to analyze those interactions is exploding.
No question, big data is out there. However, BIG is not just about the data, but also about the scope of the proposed analysis. This can be a scary proposition for many. It is therefore up to us in the analytics community to sensitize the world to sound statistical thinking. This does not just mean gradually upping the statistical ante in the analysis itself; even more, it means vigilantly making those analyses transparent and digestible. In essence, we need to encourage real, and often critical, analytic thinking throughout the analysis and the presentation of results.
When answering big business questions, the statistically minded yet communicative data scientist is needed first, to translate the business problem into the required analytic approach in terms of both the available data and the algorithms required. Typical questions include:
- What data is available to answer the question?
- Do we need all of the data to answer the question?
- Does it require a complex algorithm with lots of variables?
- How often and how quickly does the analysis need updates?
- How ‘up-to-date’ do the inputs need to be?
- How complete are the inputs in predicting a particular event?
- How much variance remains unaccounted for?
Ockham’s Razor needs to be applied here, so that the simplest approach that still answers the question is the one chosen. That choice then needs to be explained to, and monitored by, all invested parties.
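To make the "variance unaccounted for" question concrete, here is a minimal sketch that fits a one-predictor straight line by least squares and reports the share of variance it leaves unexplained (1 minus R-squared). The function name `unexplained_variance`, the variable names, and the numbers are all made up for illustration; they do not come from any real analysis.

```python
def unexplained_variance(x, y):
    """Share of the variance in y NOT explained by a straight-line fit on x (1 - R^2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must both vary")
    # Ordinary least-squares line: y = intercept + slope * x
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares divided by total sum of squares
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / sst


# Hypothetical inputs, invented purely for this example
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

share = unexplained_variance(ad_spend, sales)
print(f"Unexplained share of variance: {share:.1%}")
```

If a single predictor already leaves very little unexplained, that is an argument for stopping there rather than reaching for a more complex algorithm.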
And that’s where the statistician needs to take it easy:
- Start with the results, so the audience has a clear view on the outcome
- Proceed to explain the analysis simply and with a minimum of statistical jargon
- Describe what an algorithm does, not the specifics of your killer algo
- Visualize the inputs (e.g., a correlation matrix showing an ‘influence heat map’)
- Visualize the process (e.g., a regression line on a chief predictor variable)
- Visualize the results (e.g., a lift chart to show how much the analysis is improving results)
- Always, always tie each step back to the business challenge
- Always be open to questions and feedback
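As one illustration of the "visualize the results" point, the numbers behind a lift chart can be sketched in a few lines. This is a simplified version under stated assumptions: the function name `lift_by_bucket`, the scores, and the outcomes are hypothetical, and a real lift chart would normally be drawn from far more records, often split into deciles.

```python
def lift_by_bucket(scores, outcomes, n_buckets=5):
    """Rank records by model score and return each bucket's response rate
    divided by the overall response rate (the 'lift')."""
    n = len(scores)
    if n != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    if n_buckets < 1 or n < n_buckets:
        raise ValueError("need at least one record per bucket")
    overall_rate = sum(outcomes) / n
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")
    # Outcomes ordered from highest score to lowest
    ranked = [o for _, o in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    lifts = []
    for i in range(n_buckets):
        # Integer arithmetic spreads records as evenly as possible over buckets
        start = i * n // n_buckets
        end = (i + 1) * n // n_buckets
        bucket = ranked[start:end]
        lifts.append((sum(bucket) / len(bucket)) / overall_rate)
    return lifts


# Hypothetical model scores and observed outcomes (1 = responded, 0 = did not)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

for rank, lift in enumerate(lift_by_bucket(scores, outcomes), start=1):
    print(f"Bucket {rank}: lift {lift:.2f}")
```

A lift above 1 in the top buckets means the model finds responders faster than picking records at random, which is a statement a business audience can follow without any statistical jargon.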
In my next blog post, I’ll look at some real-life business case examples where statistics can help tame those big data dragons and give some further insights about visualization and statistical sensitization.
I’ll close this one with another Woodyism: “Any fool can make something complicated. It takes a genius to make it simple.”
Dare to be genius!