The Replication Crisis: Why Many Studies Don't Hold Up

The Open Science Collaboration found that 36% of 100 psychology replications reached statistical significance; Begley & Ellis reported that only 11% (6 of 53) of landmark preclinical cancer studies could be confirmed. Causes: low power, analytical flexibility, incentive structures. Registered reports are designed to improve replicability.


Science's Credibility Problem

The Open Science Collaboration's 2015 project shook the scientific community. Researchers selected 100 psychology studies from three leading journals and attempted to replicate them. Only 36% of the replications produced statistically significant results, compared with 97% of the original studies. Across the replications, effect sizes averaged roughly half of those originally reported. A field thought to be rigorous discovered that many of its published findings were less reliable than assumed.

Cancer research fared even worse. Begley and Ellis reported an attempt to confirm 53 "landmark" preclinical (lab) cancer studies published in top journals. Only 6 (11%) produced results matching the original findings. The remaining 89% could not be confirmed, even though the replication teams tried to follow the original experiments closely. Some failed on the first attempt; others could be partially reproduced only after substantial troubleshooting.

Why don't studies replicate? Multiple factors compound. (1) Low statistical power: with small samples, a larger share of the significant results that do appear are false positives, and the true effects that reach significance are overestimated (the winner's curse). (2) Analytical flexibility: running multiple analyses, stopping data collection once p < 0.05 is reached (optional stopping), and reporting only the analyses that worked. (3) Publication bias: positive findings publish readily; failed replications languish unpublished. (4) Incentive structure: scientists gain career advancement through novel positive findings, not confirmations. (5) Context-dependent effects: some phenomena are situational and not robust across settings or populations.
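The link between power and false positives in point (1) can be made concrete with a small calculation. The sketch below computes the positive predictive value (PPV), the share of significant results that reflect a true effect. The function name and the prior of 1 true hypothesis in 10 are illustrative assumptions, not figures from either replication project.

```python
def positive_predictive_value(power: float, alpha: float, prior: float) -> float:
    """Share of statistically significant results that reflect a true effect.

    power: probability of detecting an effect that is really there
    alpha: significance threshold (false-positive rate when there is no effect)
    prior: assumed fraction of tested hypotheses that are actually true
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Assumption for illustration: 1 in 10 tested hypotheses is true.
ASSUMED_PRIOR = 0.10

for power in (0.8, 0.5, 0.2):
    ppv = positive_predictive_value(power=power, alpha=0.05, prior=ASSUMED_PRIOR)
    print(f"power={power:.1f}  PPV={ppv:.2f}")
```

Under these assumptions, the significance threshold stays at 0.05 throughout, yet the fraction of significant findings that are real drops as power falls. The exact numbers depend entirely on the assumed prior.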

The replicability of microbiome research remains uncertain. Few microbiome findings have been formally replicated by independent groups. The field measures high-dimensional data (thousands of microbial taxa) with high inter-individual variability. Small sample sizes are common (as described in Entry 66). Conditions are ripe for replication failure.
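To see why high-dimensional data raises the risk, consider what happens if each taxon is tested separately. The sketch below assumes independent tests, no true effects, and no multiple-comparison correction; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant results when no taxon truly differs."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative numbers of taxa tested one by one at alpha = 0.05.
for n_taxa in (1, 20, 1000):
    expected = expected_false_positives(n_taxa)
    p_any = prob_at_least_one_false_positive(n_taxa)
    print(f"taxa={n_taxa:5d}  expected false positives={expected:6.1f}  P(any)={p_any:.3f}")
```

Real taxa are correlated and careful analyses correct for multiple testing, so this is an upper-bound intuition rather than a description of any particular study.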

Consider a hypothetical microbiome study: researchers measure gut microbiota in 50 IBS patients before and after probiotic treatment, finding a significant reduction in symptom severity and an increase in Faecalibacterium abundance (both p < 0.05). The effect sizes are large (Cohen's d = 0.9). Publication follows.

A replication attempt with 100 IBS patients (a larger sample with more statistical power) finds a smaller symptom improvement (Cohen's d = 0.3) and no significant change in Faecalibacterium. The original study's effect size was inflated; the replication shows a modest benefit that may not be clinically meaningful. This pattern, in which original studies overestimate effects, recurs across large replication projects.
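A short simulation shows how this inflation arises when only significant results get published. The sketch below assumes a true standardized effect of d = 0.3, a paired design with 50 patients, and normally distributed differences; it uses an approximate critical t-value for 49 degrees of freedom. All of these are assumptions chosen to echo the hypothetical study, not estimates from real data.

```python
import math
import random
import statistics

TRUE_D = 0.3      # assumed true standardized effect
N_PATIENTS = 50   # patients per simulated study (paired before/after design)
N_STUDIES = 2000  # number of simulated studies
T_CRIT = 2.01     # approximate two-sided 5% critical t for 49 degrees of freedom


def simulate_significant_effects(true_d, n, n_studies, t_crit, seed=1):
    """Return the observed Cohen's d of every simulated study with |t| > t_crit."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_studies):
        # Before/after differences in standard-deviation units.
        diffs = [rng.gauss(true_d, 1.0) for _ in range(n)]
        mean = statistics.fmean(diffs)
        sd = statistics.stdev(diffs)
        t_stat = mean / (sd / math.sqrt(n))
        if abs(t_stat) > t_crit:
            significant.append(mean / sd)  # observed Cohen's d for paired data
    return significant


sig = simulate_significant_effects(TRUE_D, N_PATIENTS, N_STUDIES, T_CRIT)
print(f"share of studies reaching significance: {len(sig) / N_STUDIES:.2f}")
print(f"true d: {TRUE_D}")
print(f"mean observed d among significant studies: {statistics.fmean(sig):.2f}")
```

Because the significance filter keeps only the studies whose estimates happened to land high, the average published effect exceeds the true one. Lowering `N_PATIENTS` makes the gap wider.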

Registered reports are designed to tackle these problems at the source. Researchers submit manuscripts describing research questions, methods, and analysis plans before data collection. Peer review occurs before any data exist, not after the results are known. The journal commits to publishing the results regardless of direction (positive or null), provided the methods are sound. This greatly reduces the incentive for p-hacking or selective reporting.

Studies in registered report format show higher rates of null findings than traditional manuscripts, consistent with realistic statistical power. Because their hypotheses and analyses are fixed in advance, their results are expected to replicate more reliably. The findings appear less dramatic but are more likely to be genuine.

Replication initiatives have emerged. The Reproducibility Project (Open Science Collaboration) coordinates large-scale replications. The Many Labs project tests effects across dozens of laboratories simultaneously. These initiatives reveal which findings are robust versus idiosyncratic.

Microbiome community initiatives are lagging. Few microbiome studies are formally replicated. Consortia like the International Human Microbiome Consortium collect data across sites, partially addressing replication through large-scale pooling. But systematic replication of published findings remains rare.

Publish-or-perish culture perpetuates replication failure. Confirmation studies (replicating others' findings) are viewed as less prestigious than novel discoveries, and career advancement depends on novel positive findings. These incentives favor unreplicable novelty over robust science. Reforming them by valuing replication, publishing null findings, and rewarding methodological rigor is essential for improving scientific credibility.

When reading microbiome research, treat novel findings with skepticism until independent replication appears. Effect sizes matter more than p-values. Registered reports carry more credibility than traditional manuscripts.
