Magnitude Over Statistical Significance
A probiotic trial enrolls 500 IBS patients. Probiotic group shows bloating reduction of 0.2 points on a 10-point severity scale; placebo shows 0.1-point reduction. Difference = 0.1 points (p = 0.04, statistically significant). Researchers publish: "Probiotic significantly reduces bloating."
Readers celebrate. But wait: is a 0.1-point difference perceptible to patients? Some patients perceive 1-point differences; others ignore 2-point changes. The minimum clinically important difference (MCID) for IBS bloating is plausibly 1-2 points. A 0.1-point difference, though statistically significant in a 500-person trial, is clinically meaningless.
Effect size quantifies magnitude. For continuous outcomes, Cohen's d measures difference in standard deviations. Formula: d = (treatment mean - control mean) / pooled standard deviation. Benchmarks: d = 0.2 (small), 0.5 (medium), 0.8 (large). A d = 0.01 is trivial; d = 0.8 is substantial.
In the probiotic example, if the standard deviation is 2 points, then d = 0.1/2 = 0.05 (tiny). Strictly, detecting d = 0.05 at p = 0.04 would take several thousand patients per arm, not 250; either way the lesson holds: large enough samples flag even trivial effects as statistically significant, and the effect size reveals the true magnitude.
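The arithmetic is simple enough to script. A minimal sketch, assuming 250 patients per arm and equal SDs in both groups (neither detail is stated above):

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen's d from summary statistics, using the pooled SD."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Assumed split: 250 patients per arm, SD = 2 points in both arms.
# Reductions of 0.2 (probiotic) vs 0.1 (placebo) points:
d = cohens_d(0.2, 0.1, 2.0, 2.0, 250, 250)
print(f"d = {d:.3f}")  # d = 0.050 -- trivial, whatever the p-value
```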
For binary outcomes (success/failure), the odds ratio (OR) and relative risk (RR) serve as effect sizes. OR = 2 suggests a 2-fold increase in odds (approximating a 2-fold risk increase only when the outcome is rare); OR = 1.05 suggests trivial change. For categorical data, Cramér's V provides an effect size (0 = no association, 1 = perfect association).
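A sketch computing all three from a 2x2 table; the counts are hypothetical, and for a 2x2 table Cramér's V reduces to the absolute phi coefficient:

```python
import math

def binary_effect_sizes(a, b, c, d):
    """Effect sizes from a 2x2 table:
    a, b = treatment successes/failures; c, d = control successes/failures."""
    rr = (a / (a + b)) / (c / (c + d))   # relative risk
    odds_ratio = (a * d) / (b * c)       # odds ratio
    n = a + b + c + d
    # Cramer's V via the 2x2 chi-square; for a 2x2 table V equals |phi|
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    v = math.sqrt(chi2 / n)
    return rr, odds_ratio, v

# Hypothetical counts: 60/40 responders in treatment, 45/55 in control
rr, or_, v = binary_effect_sizes(60, 40, 45, 55)
print(f"RR = {rr:.2f}, OR = {or_:.2f}, V = {v:.2f}")  # RR = 1.33, OR = 1.83, V = 0.15
```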
Critically, p-values don't convey magnitude. P = 0.001 (very small) could accompany tiny effect (d = 0.1) in large samples or large effect (d = 0.8) in small samples. Statistical significance ≠ practical significance.
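A quick simulation makes this concrete: a tiny true effect in a huge trial and a large true effect in a small trial can both produce impressive p-values. The sample sizes below are illustrative choices, not figures from the probiotic example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_p(n_per_arm, true_d):
    """One simulated two-arm trial with true standardized effect true_d."""
    treated = rng.normal(true_d, 1.0, n_per_arm)  # SD = 1, so mean shift = d
    control = rng.normal(0.0, 1.0, n_per_arm)
    return stats.ttest_ind(treated, control).pvalue

# Tiny effect, huge trial vs. large effect, small trial
print(f"d = 0.1, n = 5000/arm: p = {simulated_p(5000, 0.1):.2g}")
print(f"d = 0.8, n =   40/arm: p = {simulated_p(40, 0.8):.2g}")
```

Both lines print small p-values; only the effect sizes tell the two situations apart.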
The American Psychological Association (APA) and the CONSORT statement (for randomized trials) mandate effect size reporting. Despite the mandates, many journals still print effect sizes without emphasis, and many authors omit them entirely, reporting p-values alone. Such reports are incomplete: readers cannot assess practical significance without magnitude information.
Confidence intervals complement effect sizes. A 95% CI of (0.1 to 0.3) for Cohen's d indicates the plausible range of the true effect. Wide intervals (0.1 to 2.0) signal uncertainty; narrow intervals (0.7 to 0.9) signal precision. CIs convey direction, magnitude, and precision at once.
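One way to attach a CI to Cohen's d is the common large-sample approximation to its standard error (Hedges & Olkin); a sketch, reusing the assumed 250-per-arm probiotic numbers:

```python
import math

def ci_for_d(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d using the common
    large-sample standard error (Hedges & Olkin)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

low, high = ci_for_d(0.05, 250, 250)
print(f"95% CI: ({low:.2f}, {high:.2f})")  # (-0.13, 0.23): spans zero,
# so 250 per arm cannot actually resolve an effect this small
```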
Microbiome research rarely prioritizes effect sizes. Studies report that a probiotic shifts the Firmicutes-to-Bacteroidetes ratio from 3.2 to 2.8 (p < 0.05) without reporting Cohen's d or the raw effect magnitude, leaving readers struggling to assess clinical meaningfulness. Better practice: report baseline means ± SD for both groups, allowing readers to calculate effect sizes independently.
Effect size interpretation varies by context. d = 0.3 might represent clinically meaningful improvement in depression (symptom reduction) but trivial benefit in cancer survival (percentage improvement). Domain expertise dictates interpretation.
Interpretive caution: small effect sizes can be important when effects accumulate (a 1% annual benefit compounds over decades) or when targeting populations with limited options. Conversely, large effect sizes mean little when they come from small, underpowered studies (and are inflated by the winner's curse).
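The compounding claim is easy to check:

```python
# A "mere" 1% annual benefit, compounded over 30 years:
print(f"{1.01 ** 30:.2f}x")  # 1.35x -- roughly a 35% cumulative gain
```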
Why does effect size matter more than p-value? P-values answer: "Is there an effect?" (yes/no). Effect sizes answer: "How big is the effect?" The latter is clinically actionable. Interventions with small effect sizes might not justify costs, harms, or burden. Those with large effects usually do.
Better interpretation framework: (1) Check effect size first. (2) If effect is trivial (d < 0.2), stop there regardless of p-value. (3) If effect is meaningful, examine p-value and confidence interval. (4) Consider clinical context and patient preferences. This inverts conventional p-value-first thinking but yields more sensible conclusions.
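A minimal sketch of this triage as code; the thresholds (d < 0.2, p < 0.05) are the conventions named above, and mcid_d is a clinical threshold in standardized units that the reader must supply from domain knowledge:

```python
def triage(d, p, ci, mcid_d):
    """Effect-size-first reading of a trial result (a sketch, not a rule book).
    ci is (low, high) for d; mcid_d is a clinical threshold in d units,
    supplied from domain expertise."""
    if abs(d) < 0.2:
        return "Trivial effect: stop here, regardless of p-value."
    if p >= 0.05 or (ci[0] <= 0 <= ci[1]):
        return "Meaningful point estimate, but too uncertain to act on."
    if abs(d) < mcid_d:
        return "Statistically solid, yet below the clinical threshold."
    return "Large, precise, and clinically relevant: weigh costs, harms, preferences."

print(triage(d=0.05, p=0.04, ci=(-0.13, 0.23), mcid_d=0.5))
# -> "Trivial effect: stop here, regardless of p-value."
```

Note how the probiotic result exits at step one: the p-value never gets a vote.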