Pondering statistical significance in a world without .05

The latest edition of The American Statistician is dedicated to proposals for living in a post p<.05 world, a world in which a statistical test with a result of “p<.05” is no longer given the importance it has previously been afforded.

Decent, honorable researchers once discussed such things only in whispers. Now, a crowd of serious scientists is celebrating the near-death of p<.05 like the end of a long, terrible war.

What happened to p<.05? The American Statistician presents many viewpoints on why we need to move to a post p<.05 world, but there are some reasons that are not fully explored.

A circle map of .05 (Just barely above significance?)

P<.05 has earned a bad reputation

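Even people who have never worked with data or calculated any statistical test know the phrase “p<.05,” and they understand it to signify “statistical significance,” loosely taken to mean that an observed difference or association is unlikely to be due to chance alone. (Strictly speaking, the p-value is the probability of seeing a result at least that extreme if there were no real effect.)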

The non-statistician has not become familiar with p<.05 through great strides in the teaching of statistics or research methodology. Rather, they know it because it has come to be associated with the endless junk science stories that are popularized by the press.

Small studies with big conclusions. Studies that cannot be replicated. Studies that were given a press release before they were given a peer review. Studies whose dramatic p<.05 findings seem to hinge on the use of a set of particular covariates with just this particular group of subjects under just these particular conditions.

The general public reads the dramatic implications printed in newspapers, published on the web, or featured in the lifestyle montage on the TV news. The public hears the lofty statements about how this one study changes the meaning of human life or predicts the fate of the universe and concludes, “That just doesn’t make sense that you get there from here.” They may not understand probability theory, but they understand bunkum. Consequently, over time, p<.05 has taken on a different meaning for them.

Why does the popular press seem to publish these kinds of pop-science stories without hesitation? Print publishers that were once staffed with an army of trained journalists now operate with a skeleton crew, having lost most of their advertising revenue to online outlets. Too often, those publishers will now gladly reprint a press release from a reputable university as if it were news. Online news aggregators are hungry for any content but need a headline that gets clicks and, if the study is not interesting enough by itself, will write a more provocative title. And TV news just follows the clickstream of trending links.

Why do people read these stories? Because they are interesting. People like to read interesting stories, whether or not they are true. Personally speaking, I miss Bigfoot. Was Bigfoot the work of a huckster with a set of modified snow shoes and a Nikon camera with Vaseline on the lens? Maybe. But Bigfoot never gave anyone a false diagnosis of cancer or misguided Greece from a tough economy to an economic disaster. Bigfoot would never try to make a 100-year-forward prediction based on only 100 years of data. Bigfoot would never do such things.

Of course, in practice, very few people were relying on these pop-science studies to guide important decision-making. But the weight of all of that research has taken a toll on the average person’s understanding of science and the meaning of p<.05.

To the public, p<.05 is no longer a statement about the reliability of a model effect; it is an insult, a catchphrase for science trolls. Given that Probability Theory doesn’t have a very good publicist, the reputation of p<.05 may never recover.

Changes in the way we do science

The lead editorial in The American Statistician cautions researchers against over-reliance on p<.05, but it doesn’t spend much time discussing how so many researchers became so dependent on it.

How did these kinds of studies ever get published in academic journals? Because publish-or-perish tenure-track research jobs require publishing anything that can be published? If you have to run the study with 11 slightly different variations or build the statistical model using something akin to a Monte Carlo assortment of covariates to get to p<.05, is that just what you have to do?

I tend to disagree that p-hacking or tenure-tracking are the main causes of the kinds of research that have given a bad name to p<.05. In the physical sciences, there is straight-up fraud and, in the social sciences, jiggle-the-handle discoveries that will never be replicated by anyone anywhere. Open data (raw data with all code transformations) and open methods (all of your statistical modeling including anything you tested) could fix those problems immediately. Fraud and honest mistakes could be zeroed out quickly.

I am reminded of when I was a regular reviewer for an academic journal twenty years ago. I received paper copies of papers for review. There was a method to manually reviewing statistical tables to verify degrees of freedom were calculated correctly, critical values precise, and all of that. In essence, you were reverse engineering the statistical results starting with only the experimental design parameters and the final output. I cannot imagine doing that now. A complete end-to-end review of all code and statistical modeling starting with the raw data is standard operating procedure in the private sector world of analytics. This should be standard practice for all academic peer reviews…if you can find the people to do it.
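As a rough illustration of that kind of table check, the sketch below recomputes the degrees of freedom for a one-way between-subjects ANOVA from the design alone and compares them with the reported values. The design, the reported numbers, and the function name `expected_anova_df` are all hypothetical; a real review involved many more checks than this.

```python
def expected_anova_df(n_groups: int, n_total: int) -> tuple[int, int]:
    """Degrees of freedom (between, within) for a one-way ANOVA."""
    return n_groups - 1, n_total - n_groups


# Hypothetical manuscript: 3 conditions, 30 subjects, reported as F(2, 28).
reported_df = (2, 28)
expected_df = expected_anova_df(n_groups=3, n_total=30)

if reported_df == expected_df:
    print(f"df consistent with the design: F{expected_df}")
else:
    print(f"df mismatch: reported F{reported_df}, design implies F{expected_df}")
```

With the full code and raw data in hand, a reviewer would not need to infer any of this from the printed table.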

Another factor could be that the tenure system, itself, is collapsing. Universities are replacing tenure track positions with temporary teaching or adjunct faculty positions, a phenomenon called the “Adjunct Crisis.” Thus, the pool of people eager to pursue research or serve in the kinds of roles peer-reviewed science requires is drying up. It is possible that the number and quality of reviewers to support the peer review process has declined and consequently more papers of questionable merit can slip through the process. And that doesn’t even begin to account for the so-called “open” internet journals that have a back-end, pay-to-publish business model.

Another reason could be the shift away from pure empiricism to theory-making and social impact statements. This has surely driven much of the problem in the social sciences, which are now in the midst of what is called the “Replication Crisis.” Well-regarded academic journals that once would only publish studies in which the authors themselves had replicated findings multiple times with different samples now seem more interested in the Discussion than the Results. To meet that demand, researchers now develop broad theories around small-scale sets of facts and then expound upon those theories with real-world implications and predictions of social impact.

That trend, itself, may be due to the fact that universities have pushed researchers to identify ways to commercialize their research, encouraging entrepreneurial opportunities to monetize basic science. For studies without commercial possibilities, researchers are encouraged to emphasize the social impact and ways in which each analysis will make the world a better place. That is certainly a part of the problem.

Over the past few decades, researchers have tried to improve our understanding of p<.05. Demands for effect size calculations reveal the weakness in a statistically significant finding, and meta-analytic reviews summarize multiple statistically significant (and non-significant) findings, forcing a close look at the generalizability of individual studies. The new push for replication studies has done the same in a more direct manner. But none of those efforts will alleviate the broader systemic challenges facing the support system for basic science research.

Do the results mean anything different if you report them in a beautiful graph?

The availability of big data, open data

Yet another factor affecting how we regard p<.05 is the availability of big, real-world open data sets. My undergraduate and graduate school research was mostly confined to carefully designed experimental studies and relatively small correlational studies. An n of 30 human beings was standard in experiments. A thousand subjects in a longitudinal study counted as “big” data. In that world, p<.05 had a well-defined meaning.

When I first had access to mainframe data with 30,000 respondents for a survey, I could not help but start with a correlational analysis of all of the variables. But I was confused by the output. I saw cell after cell of .01 and .01 and could not figure out where the correlation was in the output. I eventually realized, to my surprise, that the correlation was .01 and the p value was .01 and my understanding of statistics was never the same.
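A rough sketch of why output like that is possible: for a fixed, tiny correlation, the p-value is driven almost entirely by sample size. The Python below uses the Fisher z approximation for testing a correlation against zero. The function name `approx_p_value` and the sample sizes are illustrative assumptions, not anything from the original analysis, and exact figures depend on the test used.

```python
import math
from statistics import NormalDist


def approx_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: rho = 0, via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


r = 0.01  # a correlation that explains r**2 = 0.01% of the variance
for n in (30, 1_000, 30_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  r={r}  r^2={r * r:.4f}  p={approx_p_value(r, n):.4f}")
```

The effect size is identical in every row; only the p-value moves as n grows.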

One of the functions of probability theory is to estimate the degree to which a sample is representative of a population. What is the likelihood that this sample could have been obtained from this population? These days, in so many domains, we have ready access to exceptionally large real-world data sets. We know the “sample” could have come from that larger “population” because we extracted the sample ourselves. There is less need to rely on p<.05 to support a generalization because we have the “population” estimates. Of course, big data are not all of the data and even exceptionally large data sets can be biased, but the question as to whether the sample is like a “population” is easier to answer when you have the “population” to test it.

These problems are compounded by high-speed automated analytics, where tens of thousands of regression models are calculated to re-price every SKU in a department store chain overnight, and by machine learning models that generate countless hidden nodes in a neural network. In the context of production analytics, p<.05 has virtually no meaning and there is no simple Bonferroni adjustment for that.
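To make the scale of the problem concrete, here is a minimal simulation sketch. It assumes every null hypothesis is true, in which case p-values are uniformly distributed between 0 and 1 and can be drawn directly. The count of 10,000 tests and the variable names are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

n_tests = 10_000  # hypothetical number of models fit overnight
alpha = 0.05

# Under a true null hypothesis, a p-value is uniform on [0, 1).
p_values = [random.random() for _ in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_alpha = alpha / n_tests
bonferroni_hits = sum(p < bonferroni_alpha for p in p_values)

print(f"'significant' at p<{alpha}: {naive_hits} of {n_tests}")
print(f"Bonferroni threshold: {bonferroni_alpha:.1e}")
print(f"'significant' after Bonferroni: {bonferroni_hits}")
```

Roughly 5% of the tests clear p<.05 by chance alone, while the Bonferroni threshold becomes so strict that real but modest effects would likely be missed as well.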

A post p<.05 world

The American Statistician articles on a post p<.05 world should be required reading for anyone working with statistics. Although I can see many reasons why the current dependence on p<.05 has hurt science, both in the actual work of real scientists and in the eyes of a skeptical public, I am not ready to abandon it. Used wisely, it will always be useful.

The American Statistician proposes many changes in reporting practices that will help immediately. Despite the enormous challenges presented here and summarized in The American Statistician articles, I am optimistic that basic science research will still be able to make good use of probability theory. P-values will absolutely continue to be an important metric for separating signal from noise. But researchers and data scientists will need to think carefully about a framework for research that emphasizes reliability and validity without that simple notation.

Before we venture into this brave new world, we should probably take a moment to pack a few peanut-butter-and-jelly sandwiches, find a decent Spotify playlist, and charge up the car battery. There’s no telling how long it will take us to get there.

2 Comments

Steve Sconfienza on June 27, 2019 10:06 am

A couple of thoughts:
1. Big data are very often not a sample but a population. There is no test of statistical significance for populations: it is what it is. What is needed in that case is a test of importance, which may be a qualitative description or any number of table-based tests of importance (without the erroneous chi-square significance test).
2. P<.05 means that one in 20 samples produces an error. SAS makes it very simple to test this by repeated samples from a random number distribution. How about an air traffic system where one-in-twenty successful arrivals is the accepted standard (STOP that snickering about a certain Boeing product!).

So, tests for significance are often incorrectly used where what is required is a test of importance; P<.05 should never have been the standard to begin with.
- Michael Gilvary on August 16, 2019 12:43 pm
  
  1. Given the "guaranty" of finding something(!) significant (p<.05) when exploring a large enough dataset, one step should be to save a "test" dataset to check some of the "significant findings" of the shotgun exploration of the rest of the data. Then, you are ready to consider the reasonableness and the Materiality of the "statistically significant" findings.
  2. P=.05 relates to one UN-successful flight in 20, not 19 failures. Let us pray for a flight system that is even better than p=.05!
