A group of over 800 scientists have signed their names to an article published in Nature, explaining why statistical significance shouldn’t be relied on so heavily as a measure of the success of an experiment. We asked statistics buff Andrew Steele to explain.
The standard in scientific papers is to test any given result and find out if its p-value is less than 0.05, which makes the result what’s known as ‘statistically significant’. What this actually means is that there’s a less than 5% chance of getting results this extreme if the thing being tested actually had no effect.
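To make that 5% concrete, here’s a quick simulation in Python (the group size and number of trials are just made-up illustrative values): run thousands of pretend experiments where the treatment genuinely does nothing, and roughly one in twenty of them still sneaks under p < 0.05 purely by chance.

```python
# A minimal sketch of what p < 0.05 means: if we run lots of experiments
# where the treatment genuinely does nothing, roughly 5% of them will
# still come out 'statistically significant' just by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 10_000
n_patients = 50          # hypothetical group size, chosen for illustration
false_positives = 0

for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: there is no real effect.
    drug_group = rng.normal(loc=0.0, scale=1.0, size=n_patients)
    placebo_group = rng.normal(loc=0.0, scale=1.0, size=n_patients)
    _, p_value = stats.ttest_ind(drug_group, placebo_group)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' results despite no real effect: "
      f"{false_positives / n_experiments:.1%}")   # comes out around 5%
```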
For example, say you’re testing a new drug against a placebo, and most of the people taking it get better. You crunch the stats and find a p-value of 0.01: there was only a 1% chance of getting a result that extreme if the drug didn’t work. That means p is less than the required 0.05 (5%), and boom! Write it up for a big-name journal, plaudits all round. Conversely, if fewer people had got better and p came out at 0.06, the result falls foul of the ‘significance’ threshold, and maybe your new drug doesn’t work.
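To see where a number like that comes from, here’s a hedged sketch of the calculation with invented figures (say 40 of 60 patients improving on the drug versus 25 of 60 on the placebo), using a standard off-the-shelf test:

```python
# The drug-vs-placebo example above, with invented numbers:
# 40 of 60 patients improved on the drug, 25 of 60 on the placebo.
from scipy.stats import fisher_exact

improved_on_drug, total_drug = 40, 60
improved_on_placebo, total_placebo = 25, 60

contingency_table = [
    [improved_on_drug, total_drug - improved_on_drug],
    [improved_on_placebo, total_placebo - improved_on_placebo],
]

# Fisher's exact test: how likely is a split at least this lopsided
# if the drug actually did nothing?
_, p_value = fisher_exact(contingency_table)
print(f"p = {p_value:.3f}  ('significant' if below 0.05)")
```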
The first problem is that there are all kinds of biases caused by looking at things this way. If p = 0.06, for example, maybe you can come up with some excuse to exclude a few of the patients who didn’t get better because they were ‘outliers’ for some reason; or perhaps you can try looking at some other test result, like whether their blood pressure got lower even if it didn’t help them lose weight; and, eventually, by ‘fishing’, find a p < 0.05 and get that coveted publication.
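Here’s a rough simulation of why that fishing works so well: if the drug does nothing at all but you check twenty unrelated outcomes, the chance that at least one of them sneaks under p < 0.05 is about 1 - 0.95^20, or roughly 64%. (The numbers of outcomes, patients and trials here are all invented for illustration.)

```python
# A rough illustration of 'fishing': test enough unrelated outcomes
# (blood pressure, weight, cholesterol, ...) on a drug that does nothing,
# and the odds of *some* p < 0.05 turning up get uncomfortably high.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulated_trials = 2_000
n_outcomes = 20          # how many different measurements get fished through
n_patients = 50

trials_with_a_hit = 0
for _ in range(n_simulated_trials):
    found_significant = False
    for _ in range(n_outcomes):
        drug = rng.normal(size=n_patients)       # the null is true: no effect
        placebo = rng.normal(size=n_patients)
        _, p = stats.ttest_ind(drug, placebo)
        if p < 0.05:
            found_significant = True
            break
    trials_with_a_hit += found_significant

print(f"Trials reporting at least one 'significant' finding: "
      f"{trials_with_a_hit / n_simulated_trials:.0%}")   # around 64%
```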
The other problem is that statistical significance is very different to real-world significance. If you have a large enough number of patients, many tests will come out with p < 0.05 even though the difference between the group on the drug and the placebo is really really tiny. What p < 0.05 means in this case is that there almost certainly is a difference, but perhaps it’s so small that no doctors or patients would actually care.
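Again, a quick sketch with invented numbers makes the point: give an absurdly large trial a vanishingly small true effect and the p-value still comes out tiny, even though the difference between the groups is too small for anyone to notice in the clinic.

```python
# A sketch of 'statistically significant but clinically meaningless':
# a tiny true effect plus a huge trial still gives a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_patients = 200_000                 # an unrealistically enormous trial
tiny_effect = 0.02                   # a 0.02 standard-deviation improvement

drug = rng.normal(loc=tiny_effect, scale=1.0, size=n_patients)
placebo = rng.normal(loc=0.0, scale=1.0, size=n_patients)

_, p_value = stats.ttest_ind(drug, placebo)
print(f"p = {p_value:.2g}")          # far below 0.05...
print(f"...but the difference is only about "
      f"{drug.mean() - placebo.mean():.3f} standard deviations")
```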
For those problems, amongst others, lots of scientists are calling for a better test of ‘significance’ than simply asking whether p < 0.05. Treated as a pass/fail mark, that threshold can create weird and misleading outcomes which distort science.
The answer is probably something called Bayesian statistics: recognising that scientific claims aren’t just ‘significant’ or ‘non-significant’, but sit on a continuum of probability between true and false. But that’s a whole separate blog post!
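As a teaser, here’s a minimal Bayesian sketch using the same invented drug-trial numbers as above (a beta-binomial model with a flat prior, which is just one simple choice among many). Instead of a yes/no ‘significant’ verdict, it gives a whole distribution of plausible improvement rates, from which we can read off the probability that the drug beats the placebo at all.

```python
# A minimal Bayesian sketch with invented numbers: 40/60 patients improved
# on the drug, 25/60 on the placebo. With a flat Beta(1, 1) prior, the
# posterior for each group's improvement rate is another Beta distribution.
import numpy as np

rng = np.random.default_rng(7)

posterior_drug = rng.beta(1 + 40, 1 + 20, size=100_000)      # 40 improved, 20 didn't
posterior_placebo = rng.beta(1 + 25, 1 + 35, size=100_000)   # 25 improved, 35 didn't

# Compare the two sets of posterior samples: how often does the drug come out ahead?
prob_drug_better = (posterior_drug > posterior_placebo).mean()
print(f"P(drug improves more patients than placebo) = {prob_drug_better:.3f}")
```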
For a fun introduction to Bayesian statistics, watch Matt Parker & Hannah Fry’s latest YouTube video.