Strategy Autopsies

Twelve Hypotheses, Zero Survivors

A post-mortem of twelve BTC on-chain hypothesis families run through the Judge gates on real daily data. None survived out-of-sample (OOS) evaluation or the Deflated Sharpe Ratio (DSR) gate, which is the system working as designed.

In June 2026 we ran twelve popular on-chain hypothesis families through the Judge's statistical gates on real Bitcoin daily data. Twelve entered. Zero were promoted. This document is the autopsy.

It is worth stating the conclusion before the evidence: a zero-survivor outcome is not a malfunction. The Judge exists to refuse promotion when the statistics do not support it. On this dataset, at this frequency, they did not.

Case file

  • Subject. Twelve hypothesis families drawn from widely circulated on-chain narratives, spanning valuation ratios, holder dormancy and coin-age, exchange flows, network activity, miner behavior, accumulation behavior, cross-asset macro overlays, and composite blends of the above. We name the categories and nothing more; the reason is given below.
  • Data. Real BTC on-chain daily metrics plus daily price history, ingested through the data engine with recorded source lineage and aligned to the common window where both price and metrics exist. No synthetic series.
  • Procedure. Each family was composed as an Alpha Canvas workflow — data, factor construction, signal compilation, validation — and run end to end at 1, 3, 5, and 10-day forecast horizons. Every variant evaluated was registered by design — with the accounting caveat noted below.
  • Gates. Out-of-sample evaluation on purged splits (embargo is supported by the engine but was not engaged in this run); Deflated Sharpe Ratio against the registered trial count, with false-discovery-rate control across signals; a single-holdout walk-forward report; a three-state verdict (promote, research lead, reject). The wider gate set described in Getting Started — PBO, capacity, independent review — was not part of this run.
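The purging step in the gate list can be sketched as follows. This is a minimal illustration, not the engine's implementation: the function name `purged_split`, its parameters, and the convention that the label on row i depends on days i+1 through i+horizon are all assumptions made for the example. Training rows whose label window overlaps the test block are dropped, and an optional embargo drops further rows after the block.

```python
def purged_split(n_obs, test_start, test_end, horizon, embargo=0):
    """Return (train_idx, test_idx) for one contiguous test block.

    Hypothetical helper. Assumes the label on row i is built from
    days i+1 .. i+horizon, so labels near the block boundary overlap it.
    The test block is the half-open range [test_start, test_end).
    """
    test_idx = list(range(test_start, test_end))
    train_idx = []
    for i in range(n_obs):
        if test_start <= i < test_end:
            continue  # row belongs to the test block
        if i < test_start and i + horizon >= test_start:
            continue  # purge: this training label reaches into the test block
        if i >= test_end and i < test_end + horizon + embargo:
            continue  # purge rows covered by test labels, plus the embargo
        train_idx.append(i)
    return train_idx, test_idx


# Example: 100 daily rows, test block on rows 40..59, 5-day horizon, no embargo.
train_idx, test_idx = purged_split(100, 40, 60, horizon=5)
```

Setting `embargo=0` mirrors the run described above, where embargo was available but not engaged. Longer horizons purge more rows, so the 10-day variants train on slightly less data than the 1-day variants.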

The search was not small. Registered breadth ran into the thousands of trials per family. That number matters because it is the denominator of honesty: every gate that follows is conditioned on how hard we looked.

What the gates recorded

In sample, the narratives looked alive. Several families — holder-behavior measures in particular: dormancy and coin-age, accumulation, miner activity — produced in-sample information coefficients that, reported in isolation, would read as findings. This is the part of the report that would normally get published.

Out of sample, none survived. Across all twelve families and all horizons, out-of-sample predictive performance was uniformly negative. Not marginally insignificant — negative. The families with the strongest in-sample fit showed the widest in-sample-to-out-of-sample gap, which is the textbook signature of selection on noise rather than of a weak-but-real effect.
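The link between selection and the in-sample-to-out-of-sample gap can be reproduced on pure noise. The sketch below is a toy simulation, not the study's procedure: it assumes the information coefficient is a Pearson correlation, uses Gaussian noise for both signals and returns, and every function name is hypothetical. It picks the signal with the best in-sample IC and then scores that same signal out of sample. It demonstrates only the gap; noise alone yields an out-of-sample IC near zero and does not reproduce the negative sign recorded in this run.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def best_noise_signal(n_signals, n_is, n_oos, rng):
    """Select the noise signal with the highest in-sample IC.

    Returns (in-sample IC, out-of-sample IC) of the selected signal.
    """
    total = n_is + n_oos
    returns = [rng.gauss(0.0, 1.0) for _ in range(total)]
    best_is, best_oos = -1.0, 0.0
    for _ in range(n_signals):
        signal = [rng.gauss(0.0, 1.0) for _ in range(total)]
        ic_is = pearson(signal[:n_is], returns[:n_is])
        if ic_is > best_is:
            best_is = ic_is
            best_oos = pearson(signal[n_is:], returns[n_is:])
    return best_is, best_oos


def average_gap(n_reps=10, n_signals=100, n_is=250, n_oos=250, seed=0):
    """Average the selected signal's IS and OOS IC over repeated searches."""
    rng = random.Random(seed)
    runs = [best_noise_signal(n_signals, n_is, n_oos, rng) for _ in range(n_reps)]
    mean_is = sum(r[0] for r in runs) / n_reps
    mean_oos = sum(r[1] for r in runs) / n_reps
    return mean_is, mean_oos


mean_is, mean_oos = average_gap()
print(f"mean best in-sample IC: {mean_is:.3f}, same signals out of sample: {mean_oos:.3f}")
```

Raising `n_signals` pushes the winning in-sample IC higher while leaving the out-of-sample IC centred on zero, which is why a wider search widens the gap.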

After the trial tax, nothing was left to argue about. DSR deflates each family's best result by the breadth of the search that produced it. With thousands of registered trials per family, the deflated bar sits several standard errors above zero. No candidate cleared it. Survivors after deflation: zero.
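For readers who want the deflation step in concrete form, here is a minimal sketch of the Deflated Sharpe Ratio following the published Bailey and López de Prado formulation. It is not the Judge's code. The function names are hypothetical, Sharpe ratios are per period (not annualized), `kurtosis` is the raw value (3 for a normal distribution), and the variance of Sharpe ratios across trials is an input the caller must estimate.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(trial_sharpe_variance, n_trials):
    """Expected best per-period Sharpe among n_trials trials with no skill."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    z_a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z_b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    weight = (1.0 - EULER_GAMMA) * z_a + EULER_GAMMA * z_b
    return sqrt(trial_sharpe_variance) * weight


def deflated_sharpe_ratio(sharpe, trial_sharpe_variance, n_trials, n_obs,
                          skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe exceeds the search-adjusted bar."""
    benchmark = expected_max_sharpe(trial_sharpe_variance, n_trials)
    # Standard error term of the Sharpe estimate under non-normal returns.
    denom = sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


# Illustrative inputs only (assumptions, not the study's statistics):
# an annualized Sharpe of 0.7 on a 365-day calendar, 3,000 daily
# observations, 1,000 registered trials, and a cross-trial Sharpe
# variance of 1/3000 (the null sampling variance at that sample size).
example_dsr = deflated_sharpe_ratio(
    sharpe=0.7 / sqrt(365),
    trial_sharpe_variance=1.0 / 3000,
    n_trials=1000,
    n_obs=3000,
)
print(f"DSR under these assumptions: {example_dsr:.3f}")
```

Under these assumed inputs the observed Sharpe sits below the search-adjusted benchmark, so the resulting probability is under one half. The benchmark rises with `n_trials`, so registering fewer trials than were actually run lowers the bar rather than raising it.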

Walk-forward split the field into rejections and leads. Roughly half the families were rejected outright; the rest were marked research lead — a weak flag recording possible structure that might merit a differently designed study. A research lead does not promote, and it does not trade.

A note on instrument calibration

A subsequent code audit of the trial accounting found cases where the registered count could understate the true search breadth. That error makes the DSR gate more lenient, not stricter: an undercounted trial total lowers the bar a candidate must clear. The verdicts therefore hold a fortiori: even with an undercounted tax, zero of twelve passed. The defect is tracked as an engineering issue; it is noted here because the direction of a measurement bias is itself evidence. A corrected count would only have raised the bar, so it could not have rescued any of the twelve.

What we are not publishing

This report omits the specific factor constructions and the exact out-of-sample statistics, deliberately.

Publishing the constructions would invite readers to re-derive and "improve" them against the same public history — a search this report would then have sponsored. Publishing the exact numbers would let the families be ranked and the least-dead ones mined further. Both amount to distributing overfitting as content. Selective reporting of backtests is one of the failure modes catalogued in Why Backtests Lie; an autopsy should not reproduce the disease it documents.

What we will state is the qualitative pattern: positive in sample, uniformly negative out of sample, zero after deflation — across every family.

Why zero is an unremarkable count

The result is less surprising than it reads. Consider the arithmetic of the setting: one asset, daily bars, a usable aligned history of a few thousand observations.

  • For small true Sharpe values, the per-period standard error of an estimated Sharpe ratio is approximately $1/\sqrt{T}$. As an illustration, an annualized Sharpe of 0.7 estimated over roughly 3,000 daily observations (365-day crypto calendar) sits about two standard errors from zero, before any correction for how many things were tried.
  • The expected maximum of $N$ independent null trials is bounded by $\sqrt{2 \ln N}$, an asymptotic ceiling of roughly 3.7 standard errors at a thousand trials and about 4.3 at ten thousand. Correlated trials reduce the effective $N$ (accounting for that reduction is part of applying DSR honestly), but the bar remains far above what a modest true effect can demonstrate in this sample.
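The two figures above can be checked in a few lines. This is a back-of-envelope sketch using only the approximations stated in the bullets; the function names are hypothetical.

```python
from math import log, sqrt


def sharpe_in_standard_errors(annual_sharpe, n_obs, periods_per_year=365):
    """Distance of a Sharpe estimate from zero, in standard errors.

    Uses the small-Sharpe approximation SE ~ 1/sqrt(T) on per-period
    Sharpe; the annual figure is de-annualized by sqrt(periods_per_year).
    """
    per_period_sharpe = annual_sharpe / sqrt(periods_per_year)
    standard_error = 1.0 / sqrt(n_obs)
    return per_period_sharpe / standard_error


def null_max_ceiling(n_trials):
    """Asymptotic bound sqrt(2 ln N) on the expected maximum of N null trials."""
    return sqrt(2.0 * log(n_trials))


print(sharpe_in_standard_errors(0.7, 3000))  # about 2.0
print(null_max_ceiling(1_000))               # about 3.7
print(null_max_ceiling(10_000))              # about 4.3
```

The gap between roughly 2 standard errors of evidence and a ceiling near 4 is the arithmetic behind the zero count.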

In other words: at daily frequency, on a single asset, with this much history, a true and economically plausible edge would struggle to prove itself, and a false one is nearly guaranteed to be manufactured by the search. A validation system that promoted several of these twelve would have been the alarming outcome.

This is also why the framing matters. Agents and canvases exist to propose candidates cheaply; the Judge exists to decide. A pipeline that returned twelve polished strategies from this exercise would have returned twelve liabilities.

Disposition

These dead ends did not stay in a notebook. Corrai's research memory, built in the weeks after this study and seeded with verdicts like these, records each conclusion with its reasons and registered search history. Future research sessions, human or agent, therefore inherit what was already tried: the same family, at the same frequency, on the same asset, is flagged before it gets re-explored from scratch and re-pitched as new. Negative results are an asset precisely because they are expensive to produce honestly and free to lose.

Scope and limits

The finding is narrow and should be read narrowly.

  • It applies to this asset (BTC), daily frequency, 1–10 day horizons, and this sample window.
  • It does not assert that on-chain data contains no information anywhere. Cross-sectional designs, other frequencies, and pooled multi-asset panels are different studies with different statistical power.
  • It does not assert that these narratives are false as descriptions of market structure. It asserts that they did not support a tradeable forecast at standards we are willing to sign.

If you want to see the gates these hypotheses faced, start with Getting Started. For the reasoning behind why the gates are this strict, read Why Backtests Lie and the Deflated Sharpe Ratio explainer.