Why Backtests Lie
Four mechanisms that make backtests systematically overstate performance: selection bias, look-ahead leakage, ignored costs, and regime dependence.
A backtest is not an experiment. It is a simulation conditioned on choices: which variants were tried, which data was used, how fills were modeled, which years happened to be in sample. Each of those choices introduces a bias, and once the best-looking result is the one that gets reported, every one of those biases points in the same direction: upward. The Sharpe ratio a backtest reports is not an unbiased estimate of live performance. It is closer to an upper bound, and often not a tight one.
This page catalogs the four mechanisms responsible. None of them require dishonesty. They are the default behavior of the standard research workflow, which is why they have to be addressed structurally rather than by good intentions.
1. Selection under multiplicity
The most damaging mechanism is also the least visible: running many variants and reporting the best one.
Consider a deliberately sterile setup. One hundred strategies, each with a true Sharpe ratio of exactly zero: pure noise. Backtest each on one year of daily returns. For a strategy with zero-mean excess returns, the standard error of the annualized Sharpe estimate over 252 daily observations is approximately 1.0 (Lo, 2002), so each estimated Sharpe behaves roughly like a draw from a standard normal distribution. The expected maximum of one hundred independent draws from a standard normal distribution is roughly 2.5. So the best of the hundred will show, in expectation, an in-sample annualized Sharpe of about 2.5 (a number most researchers would take seriously) while its true Sharpe remains, by construction, zero. Selecting on noise does not change the underlying process; it only changes which noise you are looking at. If the variants are correlated, the effective number of independent trials is smaller and the inflation milder, but the qualitative result stands.
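The sterile setup can be checked with a small simulation. The sketch below is illustrative only: it assumes Gaussian daily returns with an arbitrary 1% daily volatility and fully independent strategies, and the function names are invented for this example.

```python
import math
import random

TRADING_DAYS = 252

def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio of a list of daily excess returns."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    variance = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(TRADING_DAYS)

def best_of_n_sharpe(n_strategies, rng, n_days=TRADING_DAYS):
    """Highest in-sample Sharpe among n_strategies zero-mean noise strategies."""
    return max(
        annualized_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
        for _ in range(n_strategies)
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
# Repeat the "pick the best of 100" experiment several times and average.
winners = [best_of_n_sharpe(100, rng) for _ in range(20)]
average_winner = sum(winners) / len(winners)
print(f"average best-of-100 in-sample Sharpe: {average_winner:.2f}")
```

Every strategy here has a true Sharpe of zero, so whatever the printed average is (it should land in the neighborhood of 2.5), it is pure selection effect.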
Three details make this worse in practice:
- The multiplicity is the search, not the report. Every hyperparameter combination, every abandoned signal, every "quick check" counts toward N, whether or not it appears in the final notebook. Unreported trials are the quantitative analogue of the file-drawer problem.
- The researcher is part of the search. Iterating on a strategy after seeing its backtest — adjusting a threshold, swapping a lookback — is sequential selection. The trial count grows with every glance at the equity curve.
- N is rarely known after the fact. Once the search history is gone, no correction can be computed honestly. This is the argument for registering trials before results exist, not reconstructing them afterward.
The standard corrections are the Deflated Sharpe Ratio, which discounts an observed Sharpe by the number and correlation of the trials behind it (with further corrections for skewed, fat-tailed returns), and the Probability of Backtest Overfitting, which estimates how often the in-sample winner ranks below the median of its peers out of sample. Both require an honest accounting of the search — which is why it must be recorded, not recalled.
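As a rough illustration of the first correction, the sketch below follows the commonly cited Bailey and López de Prado formulation of the Deflated Sharpe Ratio. It is a simplified reading, not a reference implementation: all Sharpe inputs are assumed to be per-period (not annualized), `sharpe_std` is the standard deviation of Sharpe estimates across the recorded trials, `kurt` is raw (non-excess) kurtosis, and the function names are chosen for this example.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_norm = NormalDist()

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials zero-skill trials.

    Requires n_trials >= 2.
    """
    a = _norm.inv_cdf(1.0 - 1.0 / n_trials)
    b = _norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(observed, n_obs, n_trials, sharpe_std, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe beats the best-of-N noise benchmark."""
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    # Standard error term adjusted for skewness and fat tails.
    denom = math.sqrt(1.0 - skew * observed + (kurt - 1.0) / 4.0 * observed ** 2)
    z = (observed - benchmark) * math.sqrt(n_obs - 1) / denom
    return _norm.cdf(z)

# An annualized Sharpe of 2.5 on 252 daily bars, converted to per-day units.
daily_sharpe = 2.5 / math.sqrt(252)
daily_sharpe_std = 1.0 / math.sqrt(252)

dsr_after_100_trials = deflated_sharpe(daily_sharpe, 252, 100, daily_sharpe_std)
dsr_after_2_trials = deflated_sharpe(daily_sharpe, 252, 2, daily_sharpe_std)
print(f"DSR with 100 trials: {dsr_after_100_trials:.2f}")
print(f"DSR with 2 trials:   {dsr_after_2_trials:.2f}")
```

The same observed Sharpe is judged very differently depending on `n_trials`, which is why the trial count has to come from a record rather than from memory.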
2. Information from the future
Look-ahead leakage means the simulation, at time t, uses information that did not exist at time t. Unlike selection bias, leakage can manufacture performance that looks not merely good but suspiciously stable. Three concrete forms recur:
- Backward fill. Filling a missing value at time t with the next observed value (bfill) imports the future into the past by construction. Forward fill carries a known value forward; backward fill answers a question with data that had not yet been printed. It is a one-character difference in most data libraries and a structural difference in what the backtest measures.
- The unclosed bar. A signal computed from a bar's close, high, or low is only knowable once the bar has closed. A backtest that evaluates the signal on bar t and fills within bar t assumes the researcher acted on a number before it existed. Live systems see a provisional bar; backtests see the final one.
- Missing publication lag. Aggregated data — on-chain daily metrics are a clean example — is typically stamped with the period it describes, not the time it became available. A metric describing March 1 is computable only after March 1 ends, and often arrives later, after indexing and processing; some providers additionally revise history. A backtest that trades the metric on its stamp date is trading hours to days ahead of any system that could exist.
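Two of the leaks above are easy to see on toy data. The following is a minimal sketch in plain Python rather than any particular data library; the helper names and the one-period lag are assumptions made for the example.

```python
def forward_fill(values):
    """Replace each None with the most recent earlier observation (causal)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def backward_fill(values):
    """Replace each None with the next later observation (uses the future)."""
    return forward_fill(values[::-1])[::-1]

def apply_publication_lag(values, lag):
    """Shift a series so the value stamped t is first usable at t + lag."""
    n = len(values)
    lag = min(max(lag, 0), n)
    return [None] * lag + list(values[: n - lag])

prices = [100.0, None, None, 130.0]
causal = forward_fill(prices)   # gaps carry the last known price
leaky = backward_fill(prices)   # gaps already "know" the later price of 130

daily_metric = [5.0, 6.0, 7.0, 8.0]  # stamped with the day each value describes
tradable = apply_publication_lag(daily_metric, lag=1)

print(causal)
print(leaky)
print(tradable)
```

In the backward-filled series, the 130 print is visible two steps before it occurred. In the lagged series, each value only becomes usable one period after the day it describes; the right lag for a real provider has to come from its documented availability, not from this example.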
Two broader relatives belong in the same category: survivorship bias (defining today's universe and applying it to history, which silently removes everything that failed) and data revisions (testing against a corrected series that live trading never saw). All of these are properties of the dataset, not the strategy code — which is why data must carry version fingerprints and declared availability semantics, and why a clean-looking strategy on a dirty dataset proves nothing.
3. The frictionless fill
Most backtesting frameworks default to a world without friction: no fees, no slippage, and fills at whatever price the signal was computed from. Each omission inflates results; together they can account for the entirety of an apparent edge.
- The same-bar fill illusion. If a signal uses bar t's close, a fill at bar t's close assumes execution at the instant the information came into existence. The earliest honest fill is the next bar — which is why next-bar execution is a floor, not a conservative flourish. Any backtest filling on the signal bar is measuring a strategy no one can run.
- Fees and slippage scale with turnover. As an order-of-magnitude illustration: a strategy that turns over its full book daily at 10 basis points of round-trip cost surrenders roughly 25 percentage points of annual return to friction (about 252 trading days × 0.10% per day). High-turnover strategies are precisely the ones whose gross backtests look best and whose net results degrade most.
- Costs are a declaration, not a footnote. A result is only interpretable if the fee, slippage, and fill assumptions are stated alongside it. "Sharpe 1.8" is not a finding; "Sharpe 1.8 at zero cost, same-bar fills" is a different claim from "Sharpe 0.9 at declared costs, next-bar fills."
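The fill and cost assumptions can be made explicit parameters instead of framework defaults. The sketch below is one possible convention, not a standard API: `forward_returns[t]` is assumed to be the return from the close of bar t to the close of bar t + 1, `signals[t]` is the target position decided at the close of bar t, and the function names are hypothetical.

```python
def net_returns(forward_returns, signals, fill_delay=1, cost_per_turnover=0.0):
    """Per-bar net returns under a declared fill delay and cost.

    fill_delay=0 fills on the signal bar (the same-bar illusion);
    fill_delay=1 fills one bar later. cost_per_turnover is charged on
    every unit of absolute position change.
    """
    n = len(forward_returns)
    # Position actually held over each bar, after the fill delay.
    held = [0.0] * fill_delay + list(signals[: n - fill_delay])
    net, previous = [], 0.0
    for position, bar_return in zip(held, forward_returns):
        turnover = abs(position - previous)
        net.append(position * bar_return - turnover * cost_per_turnover)
        previous = position
    return net

def annual_cost_drag(round_trips_per_day, round_trip_cost_bps, trading_days=252):
    """Annual return given up to friction, as a fraction (0.25 means 25 points)."""
    return trading_days * round_trips_per_day * round_trip_cost_bps / 10_000

forward_returns = [0.01, -0.02, 0.03, -0.01]
signals = [1.0, 1.0, 0.0, 1.0]

same_bar_free = sum(net_returns(forward_returns, signals, fill_delay=0))
next_bar_free = sum(net_returns(forward_returns, signals, fill_delay=1))
next_bar_net = sum(net_returns(forward_returns, signals, 1, cost_per_turnover=0.0005))
drag = annual_cost_drag(round_trips_per_day=1.0, round_trip_cost_bps=10)

print(same_bar_free, next_bar_free, next_bar_net, drag)
```

The toy numbers carry no meaning on their own. The point is that the three totals differ, so each one is only interpretable alongside the `fill_delay` and `cost_per_turnover` it was computed with.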
4. One regime, one observation
Regime dependence is not, by itself, an upward bias — it is an extrapolation failure. It becomes one when combined with selection: of all the variants tried, the ones that survive are those best fitted to the regimes that happened to be in sample. A backtest over a single market regime is, in the statistical sense that matters, close to a single observation. A long-only strategy fitted to a sustained bull market measures exposure to that rally, not skill; the relevant question — does the edge survive when the regime changes — was never asked of the data.
The deeper issue is that regimes are few. Ten years of daily bars is thousands of rows but perhaps three or four distinct macro environments, and the strategy's true sample size for the question "does this generalize" is the latter number. Walk-forward evaluation and regime-split analysis do not solve this — nothing does — but they at least surface whether performance is concentrated in one slice of history, which a single full-sample Sharpe actively conceals.
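A regime-split view takes only a few lines. The sketch below assumes regime labels are supplied by the researcher (how to define regimes is a separate problem), and the returns are a deterministic toy series constructed so that the full-sample figure looks healthy while one regime loses money.

```python
import math
import statistics
from collections import defaultdict

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period excess returns."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def sharpe_by_regime(returns, regime_labels, periods_per_year=252):
    """Sharpe ratio computed separately within each labeled regime."""
    grouped = defaultdict(list)
    for value, label in zip(returns, regime_labels):
        grouped[label].append(value)
    return {
        label: annualized_sharpe(values, periods_per_year)
        for label, values in grouped.items()
        if len(values) >= 2  # stdev needs at least two observations
    }

# Toy data: 500 "bull" days averaging +0.2%, then 250 "bear" days averaging -0.2%.
bull_days = [0.012, -0.008] * 250
bear_days = [0.008, -0.012] * 125
returns = bull_days + bear_days
labels = ["bull"] * len(bull_days) + ["bear"] * len(bear_days)

full_sample = annualized_sharpe(returns)
per_regime = sharpe_by_regime(returns, labels)
print(f"full sample: {full_sample:.2f}")
for label, value in per_regime.items():
    print(f"{label}: {value:.2f}")
```

The full-sample Sharpe is positive even though the strategy loses steadily in the second regime, which is the concentration a single number hides.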
The structural answer
These mechanisms compound. A best-of-N strategy, on leaky data, with frictionless fills, fitted to one regime, can report a Sharpe several multiples of anything achievable live — and each contributing bias is invisible in the final number. Inspection of the result cannot recover what the process discarded.
The conclusion this site is built around is that the corrections must be enforced by the workflow, not remembered by the researcher. Corrai is built to enforce them structurally: every completed run lands in an append-only search ledger and is taxed accordingly, validation respects time (purged splits with embargo), costs and fill rules are declared on every run, and data carries its lineage and availability semantics. In Corrai's framing, agents and researchers propose candidates; the Judge applies these gates and issues the verdict, with reasons attached. For what that looks like applied to real candidates — including the unglamorous outcome where nothing survives — see the autopsy Twelve Hypotheses, Zero Survivors, or start with Getting Started.
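For readers unfamiliar with purged splits and embargo, the idea can be sketched generically. This is not Corrai's implementation or API; it is a simplified illustration that assumes each sample's label looks `label_horizon` bars ahead, with function and parameter names chosen for the example.

```python
def purged_kfold(n_samples, n_folds, label_horizon=0, embargo=0):
    """Yield (train_indices, test_indices) with contiguous test folds.

    Training samples are dropped when their label window overlaps the test
    fold's label windows (purging), plus `embargo` further bars after it.
    """
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        stop = n_samples if k == n_folds - 1 else start + fold_size
        # Samples from here on have labels that reach into the test fold.
        blocked_from = start - label_horizon
        # Test labels extend label_horizon bars past the fold; then the embargo.
        blocked_to = stop + label_horizon + embargo
        train = [i for i in range(n_samples) if i < blocked_from or i >= blocked_to]
        test = list(range(start, stop))
        yield train, test

splits = list(purged_kfold(n_samples=100, n_folds=5, label_horizon=3, embargo=2))
for train, test in splits:
    print(f"test {test[0]}-{test[-1]}: {len(train)} training samples kept")
```

With these settings, the fold that tests bars 40 to 59 trains on nothing between bars 37 and 64, so no training label shares information with the test window.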