Statistics in the era of big data and the data scientist

Depending on whether you are a half-full or a half-empty kind of person, the "big data" revolution is either a tremendous windfall for the career of a statistician, or the makings of a real existential crisis. As with most things, it’s probably a bit of both.

On the one hand, the Harvard Business Review calls data scientist the sexiest job of the 21st century. Since at least some statisticians seem well qualified to fill that role, and the well-paying jobs that come with it, statisticians look ready to cash in on a very rewarding career indeed.

Yet on the other hand, statisticians are taught from the very beginning, from the first five minutes of Stats 101, to value rigorous experimental design, to make sure the sample is representative, and above all else, to never, ever jump to premature conclusions. In the "big analytics" age of running every analysis on all of the data, statisticians often find these basic premises challenged.

However, as big data begins to permeate all aspects of life, I believe we need even more statistical thinking, and not just from the statisticians. Let me explain my reasoning.

The field of statistics began as an inferential scientific approach, arising largely from the aspiration to generalize phenomena observed in a small group to some larger group. In other words, statistics arose out of the need to draw careful conclusions from incomplete but representative data.

Three hundred years ago, the beginnings of the statistical approach were developed to estimate the population of London during the plague without counting every person (which was impossible with the resources available at the time, not to mention a health risk).

The concept of a representative sample was used, and with it a very rigorous statistical methodology. Statistics consisted (and still does) of evaluating what the data actually say (e.g., counts, averages, correlations, trends, regressions), but at least as important was the strict analysis of whether the sample was representative of the greater population.

At first glance, with today's ability to cost-effectively store and explore ever-growing amounts of data, the need to sample seems to have been buried for good. But does that mean that sampling and generalization considerations are passé in a big data world?

Let’s consider for a moment: Is any dataset really complete?

A tax authority can in theory complete analyses on all taxpayers, but:
- Do they have all relevant data on taxpayer assets, expenses, socio-economics, etc.?
- Do they have duplicates?
- How about missing profiles?
- How about comparisons with other regions, countries, provinces?
- How about a complete set of fraud profiles?

An organization can, in theory, do analysis on all customers, but:
- Is their customer base representative of their target base?
- Do they have more data on loyal customers?
- What about competitors?
- And again, do they have all relevant data on social media buzz, customer center feedback, customer networks, etc.?

An oil company can, in theory, do analysis on all sensors in an oil rig, but:
- Is that rig representative of all rigs?
- Is sensor data enough?
- Do we need to enrich with weather data?
- As an oil reserve empties, does the past rig behavior constitute a good predictor of future rig behavior?

Even music services like Spotify can evaluate all listeners’ behavior, but:
- Can they really predict individual preferences based solely on co-occurrences?
- Are ratings universally applicable and consistent?
- Can they convert their data into value for the customer, getting customers more of the music they like?

The only correct answer to any of these questions will always start with, "It depends."

And that’s exactly why statisticians and critically minded analysts still have a crucial role in big data analysis and in the innovative organization. In my opinion, statistics is a lot more than a rigid doctrine. Statistics, or being statistically inclined, means having a critical mindset about what role numbers can play in describing the real world.

Just because my data set is bigger than yours doesn’t mean I’ll get more information out of it: only the potential for learning is greater with bigger data and bigger analysis. It will still take analytically minded thinkers to draw the appropriate conclusions.
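To make that concrete, here is a minimal Python sketch of the idea, not a real analysis. Everything in it is an assumption made up for illustration: the customer "spend" population, the function name `compare_samples`, and the selection rule that makes the big data set over-represent high spenders. It compares the error of a small simple random sample with the error of a much larger but non-representative one.

```python
import random
import statistics


def compare_samples(seed=0, population_size=200_000):
    """Compare a small random sample with a much larger biased one."""
    rng = random.Random(seed)

    # Hypothetical population: yearly spend per customer (made-up numbers).
    population = [rng.gauss(100, 30) for _ in range(population_size)]
    true_mean = statistics.fmean(population)

    # Small but representative: a simple random sample of 1,000 customers.
    small_random = rng.sample(population, 1_000)

    # Big but biased: every customer spending more than 100 is recorded,
    # while only about 30% of the remaining customers make it into the data.
    big_biased = [x for x in population if x > 100 or rng.random() < 0.3]

    return {
        "true_mean": true_mean,
        "small_n": len(small_random),
        "small_error": abs(statistics.fmean(small_random) - true_mean),
        "big_n": len(big_biased),
        "big_error": abs(statistics.fmean(big_biased) - true_mean),
    }


if __name__ == "__main__":
    for name, value in compare_samples().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, the biased data set should overshoot the true mean no matter how many records it holds, because selection bias does not average out, while the error of the random sample shrinks as the sample grows.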

So, hold on to your critical edge, fellow statisticians. The potential of big data is truly awesome, but it’s up to us to make it work. And if we get paid for it, well, that’s just a nice side effect of a job well done.

3 Comments

Longhow lam on October 9, 2014 3:27 pm

Andrew,

Nice blog post, I have two related quotes:

Those who ignore statistics are condemned to reinvent it.
Statistics is the science of learning from experience. (2006) Bradley Efron

An old joke among statisticians: if you don't believe in random sampling, then the next time you have a blood test, ask your doctor to take all your blood!

Randy on October 11, 2014 2:08 pm

This is a well written blog. The data set examples at the end illustrate the need for statistical inference in addition to mathematics.

We are discussing this on LinkedIn: About Data Analysis, as well.

Ron Kenett on October 14, 2014 3:52 pm

good points but, ironically, reflecting the weakness of statistical methodology.

"having a critical mindset about what role numbers can play in describing the real world." is simply not good enough. one needs a conceptual framework and methodology for achieving this.

to address this need, we have been working on the concept of information quality (InfoQ). For papers and presentations related to InfoQ see https://sites.google.com/site/datainfoq/papers

For a "life cycle" view of statistics see: http://ssrn.com/abstract=2315556
