
COMMENT
ILLUSTRATION BY DAVID PARKINS

Retire statistical significance


Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories
call for an end to hyped claims and the dismissal of possibly crucial effects.

21 March 2019 | Vol 567 | Nature | 305
© 2019 Springer Nature Limited. All rights reserved.

When was the last time you heard a seminar speaker claim there was ‘no difference’ between two groups because the difference was ‘statistically non-significant’?

If your experience matches ours, there’s a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)¹. Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

We have some proposals to keep scientists from falling prey to these misconceptions.

PERVASIVE PROBLEM

Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs². Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association”, when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see ‘Beware false conclusions’).

BEWARE FALSE CONCLUSIONS
Studies currently dubbed ‘statistically significant’ and ‘statistically non-significant’ need not be contradictory, and such designations might cause genuine effects to be dismissed.
[Figure: a ‘significant’ study (low P value) and a ‘non-significant’ study (high P value) plotted on an axis running from decreased effect, through no effect, to increased effect. The observed effect (or point estimate) is the same in both studies, so they are not in conflict, even if one is ‘significant’ and the other is not.]
SOURCE: V. AMRHEIN ET AL.

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half (see ‘Wrong interpretations’ and Supplementary Information).

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”³. Another article⁴ with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories, all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a “surgical strike against thoughtless testing of statistical significance” and “an opportunity to register your voice in favour of better scientific practices”.

We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way, that is, to decide whether a result refutes or supports a scientific hypothesis⁵.

QUIT CATEGORIZING

The trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different⁶–⁸. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is ‘real’ has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs, thereby invalidating conclusions.

The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan⁹. This occurs even with the best of intentions.

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures, only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.

One reason to avoid such ‘dichotomania’ is that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving P < 0.05, it would not be very surprising for one to obtain P < 0.01 and the other P > 0.30.
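The claim about two replication studies can be checked with a short calculation. The sketch below is not the authors' own working. It assumes that each study's test statistic is normally distributed with unit variance, that the test is two-sided, and that the two studies are independent; the function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (assumed test statistic)


def expected_z_for_power(power: float, alpha: float = 0.05) -> float:
    """Mean of the z statistic that yields the requested power for a
    two-sided test at level alpha (the far tail is ignored here)."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def prob_p_below(p_cut: float, mean_z: float) -> float:
    """Probability that the two-sided P value falls below p_cut
    when z ~ Normal(mean_z, 1)."""
    z_cut = STD_NORMAL.inv_cdf(1 - p_cut / 2)
    upper = 1 - STD_NORMAL.cdf(z_cut - mean_z)  # z above +z_cut
    lower = STD_NORMAL.cdf(-z_cut - mean_z)     # z below -z_cut
    return upper + lower


mean_z = expected_z_for_power(0.80)        # genuine effect, 80% power
p_small = prob_p_below(0.01, mean_z)       # one study reaches P < 0.01
p_large = 1 - prob_p_below(0.30, mean_z)   # one study ends with P > 0.30
p_split = 2 * p_small * p_large            # either study may be the "small" one

print(f"P(P < 0.01) = {p_small:.3f}")
print(f"P(P > 0.30) = {p_large:.3f}")
print(f"P(one of each) = {p_split:.3f}")
```

Under these assumptions, a single study has roughly a 59% chance of reaching P < 0.01 and roughly a 4% chance of ending with P > 0.30, so a pair of perfect replications splits this way in roughly 1 case in 20.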


Whether a P value is small or large, caution is warranted.

WRONG INTERPRETATIONS
An analysis of 791 articles across 5 journals* found that around half mistakenly assume non-significance means no effect.
Appropriately interpreted: 49%. Wrongly interpreted: 51%.
*Data taken from: P. Schatz et al. Arch. Clin. Neuropsychol. 20, 1053–1059 (2005); F. Fidler et al. Conserv. Biol. 20, 1539–1544 (2006); R. Hoekstra et al. Psychon. Bull. Rev. 13, 1033–1037 (2006); F. Bernardi et al. Eur. Sociol. Rev. 33, 1–15 (2017).
SOURCE: V. AMRHEIN ET AL.

We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits. In doing so, they should remember that all the values between the interval’s limits are reasonably compatible with the data, given the statistical assumptions used to compute the interval⁷,¹⁰. Therefore, singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.

We’re frankly sick of seeing such nonsensical ‘proofs of the null’ and claims of non-association in presentations, research articles, reviews and instructional materials. An interval that contains the null value will often also contain non-null values of high practical importance. That said, if you deem all of the values inside the interval to be practically unimportant, you might then be able to say something like ‘our results are most compatible with no important effect’.

When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. In fact, values just outside the interval do not differ substantively from those just inside the interval. It is thus wrong to claim that an interval shows all possible values.

Second, not all values inside are equally compatible with the data, given the assumptions. The point estimate is the most compatible, and values near it are more compatible than those near the limits. This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval. For example, the authors above could have written: ‘Like a previous study, our results suggest a 20% increase in risk of new-onset atrial fibrillation in patients given the anti-inflammatory drugs. Nonetheless, a risk difference ranging from a 3% decrease, a small negative association, to a 48% increase, a substantial positive association, is also reasonably compatible with our data, given our assumptions.’ Interpreting the point estimate, while acknowledging its uncertainty, will keep you from making false declarations of ‘no difference’, and from making overconfident claims.

Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. It is based on the false idea that there is a 95% chance that the computed interval itself contains the true value, coupled with the vague feeling that this is a basis for a confident decision. A different level can be justified, depending on the application. And, as in the anti-inflammatory-drugs example, interval estimates can perpetuate the problems of statistical significance when the dichotomization they impose is treated as a scientific standard.

Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty⁷,⁸,¹⁰. Make these assumptions as clear as possible and test the ones you can, for example by plotting your data and by fitting alternative models, and then reporting all results.

Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.

The objection we hear most against retiring statistical significance is that it is needed to make yes-or-no decisions. But for the choices often required in regulatory, policy and business environments, decisions based on the costs, benefits and likelihoods of all potential consequences always beat those made based solely on statistical significance. Moreover, for decisions about whether to pursue a research idea further, there is no simple connection between a P value and the probable results of subsequent studies.

What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them, for example by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests. When P values are reported, they will be given with sensible precision (for example, P = 0.021 or P = 0.13), without adornments such as stars or letters to denote statistical significance and not as binary inequalities (P < 0.05 or P > 0.05). Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.

Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go. ■

Valentin Amrhein is a professor of zoology at the University of Basel, Switzerland. Sander Greenland is a professor of epidemiology and statistics at the University of California, Los Angeles. Blake McShane is a statistical methodologist and professor of marketing at Northwestern University in Evanston, Illinois. For a full list of co-signatories, see Supplementary Information.
e-mail: v.amrhein@unibas.ch

1. Fisher, R. A. Nature 136, 474 (1935).
2. Schmidt, M. & Rothman, K. J. Int. J. Cardiol. 177, 1089–1090 (2014).
3. Wasserstein, R. L., Schirm, A. & Lazar, N. A. Am. Stat. https://doi.org/10.1080/00031305.2019.1583913 (2019).
4. Hurlbert, S. H., Levine, R. A. & Utts, J. Am. Stat. https://doi.org/10.1080/00031305.2018.1543616 (2019).
5. Lehmann, E. L. Testing Statistical Hypotheses 2nd edn 70–71 (Springer, 1986).
6. Gigerenzer, G. Adv. Meth. Pract. Psychol. Sci. 1, 198–218 (2018).
7. Greenland, S. Am. J. Epidemiol. 186, 639–645 (2017).
8. McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Am. Stat. https://doi.org/10.1080/00031305.2018.1527253 (2019).
9. Gelman, A. & Loken, E. Am. Sci. 102, 460–465 (2014).
10. Amrhein, V., Trafimow, D. & Greenland, S. Am. Stat. https://doi.org/10.1080/00031305.2018.1543137 (2019).

Supplementary information accompanies this article; see go.nature.com/2tc5nkm
