
Cherry-Picking Data: How Studies Can Prove Anything

Selective outcome reporting, data dredging, and subgroup fishing allow researchers to find p < 0.05 for almost any hypothesis. Pre-registration and pre-specified analyses combat this ubiquitous bias.

How this entry is structured
Definitions first, then mechanisms, then “so what?”. If you are in a hurry, skim the headings and callouts.
Not medical advice
Educational content only. If symptoms are severe, persistent, or worrying, see a clinician.

When Researchers Find What They're Looking For

A microbiome intervention trial measures 50 different outcomes: bacterial taxa abundance, metabolites, inflammatory markers, symptom scales, and secondary quality-of-life measures. Only 8 outcomes show statistically significant improvement (p < 0.05). Researchers write up these 8 victories in the abstract, burying the 42 null results in supplementary tables. Readers conclude the intervention works; most never see that 84% of outcomes failed.
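To build intuition for how many "victories" chance alone produces, here is a minimal simulation sketch (hypothetical numbers, not the trial above): 50 outcomes measured in a trial where the intervention truly does nothing.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_outcomes, n_per_arm = 50, 40  # illustrative sizes, not from any real trial

# Null trial: treatment and control come from the same distribution,
# so every outcome that reaches p < 0.05 is a false positive.
p_values = [
    ttest_ind(rng.normal(size=n_per_arm), rng.normal(size=n_per_arm)).pvalue
    for _ in range(n_outcomes)
]

hits = sum(p < 0.05 for p in p_values)
print(f"{hits} of {n_outcomes} truly null outcomes reached p < 0.05")
# Expectation: 0.05 * 50 = 2.5 chance findings, ready for a selective abstract.
```

Report only those two or three hits in the abstract and pure noise reads as signal.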

This selective outcome reporting is rampant. Studies often measure far more outcomes than they pre-register as primary, and secondary outcomes expand analytical flexibility. With enough flexibility, researchers can usually find p < 0.05 somewhere. Andrew Gelman and Eric Loken call this the "garden of forking paths": each analytical choice (which confounders to adjust for, which outliers to exclude, which subgroups to examine) creates a fork, and with enough forks, some path yields p < 0.05 by chance.
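The forks multiply fast. A short sketch (the choice names are invented for illustration) that simply counts the analyses implied by four everyday decisions:

```python
from itertools import product

# Hypothetical analytic choices; each tuple below is one path through the garden.
confounder_sets = ["none", "age+sex", "age+sex+BMI"]
outlier_rules = ["keep all", "drop beyond 3 SD"]
subgroups = ["everyone", "men", "women", "under 50", "50 and over"]
transforms = ["raw outcome", "log outcome"]

paths = list(product(confounder_sets, outlier_rules, subgroups, transforms))
print(f"{len(paths)} distinct analyses")  # 3 * 2 * 5 * 2 = 60 forks
```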

Subgroup analyses exemplify this problem. A trial of a cholesterol drug tests its primary hypothesis, overall cholesterol reduction, and the result is negative (p = 0.08). But the researchers examine 20 subgroups: men vs. women, age brackets, baseline cholesterol levels, smoking status, and so on. One subgroup (men aged 45-55 with baseline cholesterol 200-250 mg/dL) shows p = 0.02, and the abstract now claims benefit in that subgroup. This is the Texas sharpshooter fallacy: fire at the barn first, then paint the target around the bullet holes. With 20 independent subgroup tests at the 0.05 threshold, the chance of at least one false positive is 1 - 0.95^20 ≈ 64% (not 0.05 × 20 = 100%; probabilities of independent events do not simply add).
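That 64% figure can be checked directly, both with the closed-form family-wise error rate and by simulating null p-values (a sketch assuming 20 independent tests):

```python
import numpy as np

alpha, k = 0.05, 20

# Closed form: P(at least one false positive) = 1 - (1 - alpha)^k
print(f"Analytic FWER: {1 - (1 - alpha) ** k:.2f}")  # ~0.64

# Monte Carlo check: under the null, p-values are uniform on [0, 1].
rng = np.random.default_rng(1)
null_p = rng.uniform(size=(100_000, k))  # 100k trials of 20 subgroup tests each
print(f"Simulated FWER: {(null_p < alpha).any(axis=1).mean():.2f}")
```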

Data dredging refers to the broader practice of analyzing data repeatedly until p < 0.05 emerges. Modern computing makes this trivial: test thousands of microbiome taxa as predictors of disease and spurious associations will appear, as the sketch after this paragraph shows. Simmons, Nelson, and Simonsohn showed that combining just a few undisclosed "researcher degrees of freedom" (flexible stopping rules, optional covariates, multiple dependent variables) pushes the false-positive rate above 60%; in their famous demonstration, that flexibility "proved" the impossible claim that listening to a song made participants chronologically younger.
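Here is the microbiome version of that demonstration as a sketch (synthetic data; the variable names are invented): screen 1,000 pure-noise "taxa" against a pure-noise "disease score" and count the spurious hits.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_subjects, n_taxa = 100, 1000

# Pure noise on both sides: no taxon has any real relationship to disease.
taxa = rng.normal(size=(n_subjects, n_taxa))
disease_score = rng.normal(size=n_subjects)

spurious = 0
for j in range(n_taxa):
    _, p = pearsonr(taxa[:, j], disease_score)
    if p < 0.05:
        spurious += 1

print(f"{spurious} of {n_taxa} null taxa 'predict' disease at p < 0.05")
# Roughly 50 expected (0.05 * 1000); every one is dredged, none is real.
```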

Post-hoc rationalization disguises all of this. A study finds unexpected associations and retroactively rewrites its hypotheses to match, presenting them as if they had been predicted from the start. Kerr termed this HARKing (Hypothesizing After the Results are Known). Without seeing a pre-registration, readers cannot distinguish true predictions from post hoc storytelling.

The multiple comparisons problem compounds these issues. Test 100 truly null hypotheses at the p < 0.05 threshold and about 5 false positives are expected by chance. The Bonferroni correction controls this by dividing alpha by the number of comparisons: with 100 hypotheses, p < 0.05/100 = 0.0005 becomes the significance bar. This reduces false positives at the cost of more false negatives, since each individual test is left underpowered.
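A sketch of that arithmetic: apply the naive threshold and the Bonferroni-adjusted threshold to the same 100 null p-values.

```python
import numpy as np

rng = np.random.default_rng(3)
m, alpha = 100, 0.05
p_values = rng.uniform(size=m)  # 100 hypotheses, all truly null

uncorrected = (p_values < alpha).sum()     # ~5 expected by chance
bonferroni = (p_values < alpha / m).sum()  # threshold 0.05/100 = 0.0005

print(f"Uncorrected false positives: {uncorrected}")
print(f"Bonferroni false positives:  {bonferroni}")  # almost always 0
```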

Pre-registration is the solution. Before data collection, researchers file detailed analysis plans on platforms like the Open Science Framework or ClinicalTrials.gov, specifying primary outcomes, secondary outcomes, planned statistical adjustments, and subgroup analyses. Any post-hoc deviation is flagged as exploratory, which lets readers separate hypothesis-confirming findings from hypothesis-generating ones.

Microbiome research is beginning to adopt pre-registration. Studies examining how fecal microbiota transplantation affects Crohn's disease outcomes increasingly pre-register on ClinicalTrials.gov and open-science registries. This transparency reduces analytical flexibility and increases credibility.

When reading research, look for a pre-registration. If there is none, scrutinize how many outcomes were measured versus how many were reported; large discrepancies are a red flag. Ask whether the primary outcome was pre-specified and whether the analyses were planned before the data were examined.

