Understanding P-Values: Beyond the Threshold
When researchers announce findings with "p < 0.05," most people hear "definitely true." But that's fundamentally wrong, and this misunderstanding shapes how we interpret health research.
Ronald Fisher introduced p-values in the 1920s as a practical tool for researchers. He defined a p-value as the probability of observing data as extreme as (or more extreme than) what you actually found, assuming the null hypothesis is true. Notice what that definition doesn't say: it's not the probability that your hypothesis is correct.
Think of it this way. Imagine you flip a coin 20 times and get 17 heads. The p-value answers: "If this coin is actually fair, how likely am I to see 17 or more heads in 20 flips?" The answer here is small (about 0.0013), meaning the result would be rare under the fair-coin assumption. But it doesn't prove the coin is biased; it just says the data would be surprising if the coin were fair.
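To make this concrete, here is a minimal sketch in Python (standard library only) that computes the exact one-sided p-value for the coin example; the function name is our own, not part of any library.

```python
from math import comb

def binom_p_value(k, n, p=0.5):
    """Exact one-sided p-value: P(X >= k) for X ~ Binomial(n, p),
    i.e., the chance of k or more heads in n flips if the null
    hypothesis (heads probability p) is true."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 17 or more heads in 20 flips of a fair coin
print(binom_p_value(17, 20))  # ~0.0013
```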
The 0.05 threshold originated as an arbitrary convenience in 1920s statistics, not a magical truth boundary. Fisher himself warned against rigid adherence to it. Yet the "significance threshold" hardened into dogma: a study reporting p = 0.049 gets published and celebrated, while one reporting p = 0.051 often languishes unpublished, even though the two results are nearly identical as evidence.
Common misinterpretations abound. Researchers often say "there's a 5% chance the null hypothesis is true" when they report p < 0.05. This is backwards: a p-value is computed assuming the null is true, so you can't extract the probability of the hypothesis from it alone. For that you need Bayesian methods, which combine the data with a prior. Another mistake: treating p = 0.001 as if the effect were 50 times "more true" than p = 0.05 (the ratio of the two values). Both numbers just indicate rarity under the null, and with a large enough sample even a trivially small effect can produce a tiny p-value.
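A toy Bayesian calculation shows why the p-value is not the probability of the null. Suppose, purely for illustration, that the coin is either fair or biased toward heads at 0.8, with a 50/50 prior between the two; both the alternative hypothesis and the prior here are our assumptions, not anything the data dictates.

```python
from math import comb

def likelihood(k, n, p):
    # P(exactly k heads in n flips | heads probability p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 17, 20
prior_fair = 0.5                     # assumed prior, chosen for illustration
like_fair = likelihood(k, n, 0.5)    # null: fair coin
like_biased = likelihood(k, n, 0.8)  # assumed alternative: heads prob 0.8

# Bayes' rule: P(fair | data) = P(data | fair) * P(fair) / P(data)
posterior_fair = like_fair * prior_fair / (
    like_fair * prior_fair + like_biased * (1 - prior_fair))
print(posterior_fair)  # ~0.005, not the one-sided p-value (~0.0013)
```

Change the prior or the assumed alternative and the posterior changes too, which is exactly why a p-value alone cannot tell you the probability that the null hypothesis is true.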
The American Statistical Association addressed this in a 2016 statement, cautioning that p-values don't measure effect size, practical importance, or the validity of a model. By itself, the statement noted, a p-value is not a good measure of evidence for a model or hypothesis. None of this was radical; it largely restated Fisher's original intent. But everyday statistical practice had drifted far from it.
Better practice combines p-values with effect sizes and confidence intervals. A 95% confidence interval gives a range of effect sizes consistent with the data; an effect size measure (such as Cohen's d) shows the magnitude of the difference. Together they give a fuller picture: not just whether an effect is detectable, but how big it is and how precisely it has been estimated.
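As a sketch of what reporting all three looks like in practice, here is a Python example using NumPy and SciPy on simulated data; the group sizes, means, and spreads are made up for illustration.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(42)
treatment = rng.normal(1.0, 2.0, size=50)  # simulated outcomes, illustration only
control = rng.normal(0.0, 2.0, size=50)

# p-value from a standard two-sample t-test (equal variances assumed)
t_stat, p_value = stats.ttest_ind(treatment, control)

# 95% confidence interval for the mean difference, built from the
# pooled standard error and the t distribution
na, nb = len(treatment), len(control)
diff = np.mean(treatment) - np.mean(control)
pooled_var = ((na - 1) * np.var(treatment, ddof=1) +
              (nb - 1) * np.var(control, ddof=1)) / (na + nb - 2)
se = np.sqrt(pooled_var * (1 / na + 1 / nb))
t_crit = stats.t.ppf(0.975, df=na + nb - 2)

print(f"p = {p_value:.4f}")
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
print(f"95% CI for the difference: ({diff - t_crit * se:.2f}, {diff + t_crit * se:.2f})")
```

Reporting all three together tells a reader not only that an effect was detected, but how large it is and how much uncertainty surrounds the estimate.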