The Deflated Sharpe Ratio

How the Deflated Sharpe Ratio corrects observed Sharpe for selection bias, non-normal returns, and the number of trials — and why N must be honest.

A backtested Sharpe ratio is not a fact about a strategy. It is a statistic computed from a finite sample — and, almost always, the best such statistic selected from a search. Both properties inflate it. The Deflated Sharpe Ratio (Bailey & López de Prado, 2014) is the standard correction: it restates an observed Sharpe as the probability that the true Sharpe exceeds zero, after accounting for sample length, non-normal returns, and the number of trials it took to find it.

This page covers the three pieces in order: the sampling distribution of the Sharpe estimator, the behavior of the maximum under multiple testing, and the DSR itself — followed by the operational weakness every DSR implementation inherits, and how Corrai addresses it.

The Sharpe ratio is an estimate

The estimator SR_hat = mean(r) / std(r) has a sampling distribution like any other statistic. Under IID normal returns, Lo (2002) derives its standard error as approximately sqrt((1 + SR²/2) / n). For daily data and realistic Sharpe levels, the SR²/2 term is small, and a useful first-order rule emerges: the standard error of an annualized Sharpe is roughly 1 / sqrt(years of data).

The consequence is uncomfortable. With two years of daily data, the standard error is about 0.7. An observed annualized Sharpe of 1.0 then carries a 95% interval spanning roughly −0.4 to +2.4 — an estimate consistent with both a genuinely good strategy and a money-losing one.
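The arithmetic behind that interval can be sketched in a few lines. This is a minimal illustration of Lo's IID-normal formula, not a reference implementation; the 252-day calendar and the function name `sharpe_standard_error` are assumptions made for the example.

```python
import math

def sharpe_standard_error(sr_per_period: float, n_obs: int) -> float:
    """Lo (2002) IID-normal standard error of a per-period Sharpe estimate."""
    return math.sqrt((1.0 + 0.5 * sr_per_period ** 2) / n_obs)

periods_per_year = 252            # assumption: equity-style trading calendar
n_obs = 2 * periods_per_year      # two years of daily returns
sr_annual = 1.0
sr_daily = sr_annual / math.sqrt(periods_per_year)

# Compute the standard error at the native (daily) frequency, then annualize it.
se_daily = sharpe_standard_error(sr_daily, n_obs)
se_annual = se_daily * math.sqrt(periods_per_year)

lower = sr_annual - 1.96 * se_annual
upper = sr_annual + 1.96 * se_annual
print(f"SE ~ {se_annual:.2f}, 95% interval ~ [{lower:.2f}, {upper:.2f}]")
```

The SR²/2 term barely moves the result at daily frequency, which is why the 1 / sqrt(years) shortcut works.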

Non-normality widens this further. Mertens (2002) extends the standard error to third and fourth moments:

$$\widehat{SE}(\widehat{SR}) \;\approx\; \sqrt{\frac{1 \;-\; \gamma_3\,\widehat{SR} \;+\; \frac{\gamma_4 - 1}{4}\,\widehat{SR}^2}{n - 1}}$$

where $\gamma_3$ is skewness and $\gamma_4$ is kurtosis ($\gamma_4 = 3$ for a normal distribution). For a positive observed Sharpe, negative skew increases the variance of the estimator (the $-\gamma_3\,\widehat{SR}$ term turns positive), and fat tails increase it through the kurtosis term. Strategy returns, particularly short-volatility and carry-like profiles, routinely exhibit both. The strategies whose return distributions look worst are precisely the ones whose Sharpe estimates deserve the least trust.
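A short sketch of the non-normal standard error, comparing a Gaussian return profile with a negatively skewed, fat-tailed one. The per-period Sharpe of 0.10 and the moment values (skewness −1.5, kurtosis 10) are hypothetical inputs chosen only to show the direction of the effect.

```python
import math

def sharpe_se_nonnormal(sr_hat: float, skew: float, kurt: float, n_obs: int) -> float:
    """Standard error of a per-period Sharpe estimate with skewness and kurtosis terms.

    kurt is raw kurtosis (3.0 for a normal distribution), not excess kurtosis.
    """
    variance = (1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2) / (n_obs - 1)
    return math.sqrt(variance)

sr_daily = 0.10   # hypothetical per-period Sharpe
n_obs = 730       # two years of daily returns on a 365-day calendar

se_normal = sharpe_se_nonnormal(sr_daily, skew=0.0, kurt=3.0, n_obs=n_obs)
se_short_vol = sharpe_se_nonnormal(sr_daily, skew=-1.5, kurt=10.0, n_obs=n_obs)

print(f"normal returns:            SE = {se_normal:.4f}")
print(f"negative skew + fat tails: SE = {se_short_vol:.4f}")
```

With skewness 0 and kurtosis 3 the expression collapses to Lo's 1 + SR²/2 numerator, which is a useful sanity check on any implementation.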

What selection does to the maximum

The second inflation is structural. If a researcher runs N independent backtests on configurations with no true skill, each null Sharpe estimate is approximately Gaussian around zero — but the researcher does not report a random draw. They report the maximum.

The expected maximum of N independent standard normal variables grows on the order of sqrt(2 ln N). In standard-error units:

| Trials (N) | Expected maximum (approx.) |
|---|---|
| 10 | ~1.5 SE |
| 100 | ~2.5 SE |
| 500 | ~3.0 SE |
| 10,000 | ~3.9 SE |
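The table can be reproduced approximately with the Euler–Mascheroni approximation to the expected maximum (the same form the DSR benchmark uses further down this page), and cross-checked by brute-force simulation. The function names and the simulation size are choices made for this sketch.

```python
import math
import random
from statistics import NormalDist, fmean

EULER_GAMMA = 0.5772156649015329

def expected_max_z(n_trials: int) -> float:
    """Approximate E[max] of n_trials independent standard normals (n_trials >= 2)."""
    inv = NormalDist().inv_cdf
    return ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))

def simulated_max_z(n_trials: int, n_sims: int = 2000, seed: int = 0) -> float:
    """Monte Carlo average of the best of n_trials pure-noise draws."""
    rng = random.Random(seed)
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n_trials)) for _ in range(n_sims)]
    return fmean(maxima)

for n_trials in (10, 100, 500, 10_000):
    print(f"N={n_trials:>6}: approx {expected_max_z(n_trials):.2f} SE")

# Cross-check one row by simulation (kept small so it runs quickly).
check = simulated_max_z(100)
print(f"N=   100: simulated {check:.2f} SE")
```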

Two properties of this growth matter in practice. It is slow — going from 500 to 10,000 trials moves the bar by less than one standard error — which tempts researchers to dismiss it. And it never stops. Any search process that explores parameters, features, universes, or rebalancing rules accumulates trials continuously, and the expected best-of-search Sharpe rises with it, with no skill required anywhere. The selected backtest is a biased estimator of out-of-sample performance by construction; Why Backtests Lie treats this failure mode alongside the others.

One refinement: trials are rarely independent. Fifty variants of the same momentum signal are closer to one trial than fifty. Bailey & López de Prado capture this through the cross-trial variance of the Sharpe estimates — clustered, correlated trials produce similar Sharpes, which lowers that variance and, with it, the deflation benchmark.
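A small simulation makes the clustering effect visible. It is a toy model under stated assumptions: every trial's returns share one common noise factor with correlation `rho`, no trial has any skill, and the function name is hypothetical.

```python
import math
import random
from statistics import fmean, pstdev, pvariance

def cross_trial_sharpe_variance(n_trials: int, n_obs: int, rho: float, seed: int = 0) -> float:
    """Variance of per-period Sharpe estimates across trials with pairwise correlation rho."""
    rng = random.Random(seed)
    common = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]      # shared factor
    w_common, w_own = math.sqrt(rho), math.sqrt(1.0 - rho)
    sharpes = []
    for _ in range(n_trials):
        # Each trial mixes the shared factor with its own independent noise.
        returns = [w_common * c + w_own * rng.gauss(0.0, 1.0) for c in common]
        sharpes.append(fmean(returns) / pstdev(returns))
    return pvariance(sharpes)

var_independent = cross_trial_sharpe_variance(n_trials=50, n_obs=250, rho=0.0)
var_clustered = cross_trial_sharpe_variance(n_trials=50, n_obs=250, rho=0.9)

print(f"independent trials: V[SR] = {var_independent:.5f}")
print(f"clustered trials:   V[SR] = {var_clustered:.5f}")
```

The clustered case yields a much smaller cross-trial variance, so the same nominal trial count produces a lower benchmark.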

The correction: PSR, then DSR

The Probabilistic Sharpe Ratio (Bailey & López de Prado, 2012) converts an observed Sharpe into the probability that the true Sharpe exceeds a benchmark $SR^{*}$, using the non-normal standard error above:

$$PSR(SR^{*}) \;=\; \Phi\!\left( \frac{\big(\widehat{SR} - SR^{*}\big)\,\sqrt{n - 1}}{\sqrt{1 - \gamma_3\,\widehat{SR} + \frac{\gamma_4 - 1}{4}\,\widehat{SR}^2}} \right)$$
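A direct transcription of the PSR formula, as a minimal sketch. All inputs are per-period quantities; the example Sharpe and moment values are hypothetical.

```python
import math
from statistics import NormalDist

def probabilistic_sharpe_ratio(sr_hat: float, sr_benchmark: float, n_obs: int,
                               skew: float = 0.0, kurt: float = 3.0) -> float:
    """PSR: Phi of the observed-minus-benchmark Sharpe, scaled by its standard error.

    sr_hat and sr_benchmark are per-period Sharpe ratios; kurt is raw kurtosis.
    """
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    z = (sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)

sr_daily, n_obs = 0.10, 730   # hypothetical observed daily Sharpe, two years of data

psr_normal = probabilistic_sharpe_ratio(sr_daily, 0.0, n_obs)
psr_skewed = probabilistic_sharpe_ratio(sr_daily, 0.0, n_obs, skew=-1.5, kurt=10.0)
print(f"PSR(0), normal returns: {psr_normal:.4f}")
print(f"PSR(0), skewed returns: {psr_skewed:.4f}")
```

When the benchmark equals the observed Sharpe the function returns exactly 0.5, which is a quick unit test for sign errors.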

PSR against $SR^{*} = 0$ answers "is this Sharpe significantly positive?" but ignores selection. The Deflated Sharpe Ratio closes that gap by setting the benchmark to the expected maximum Sharpe under the null across the $N$ trials actually performed:

$$SR_0 \;=\; \sqrt{V\big[\widehat{SR}_k\big]}\,\left( (1 - \gamma)\,\Phi^{-1}\!\Big(1 - \tfrac{1}{N}\Big) \;+\; \gamma\,\Phi^{-1}\!\Big(1 - \tfrac{1}{N e}\Big) \right), \qquad DSR \;=\; PSR(SR_0)$$

where $V[\widehat{SR}_k]$ is the variance of Sharpe estimates across trials and $\gamma \approx 0.5772$ is the Euler–Mascheroni constant. The interpretation: DSR is the probability that the true Sharpe exceeds zero, once we account for the fact that the observed one is the survivor of a search of size N, estimated from n non-normal observations. A conventional acceptance threshold is DSR ≥ 0.95.
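Putting the two formulas together gives a compact DSR calculation. This is a sketch under stated assumptions, not a reference implementation: the function names are hypothetical, the cross-trial variance is set to its idealized null value of 1/n rather than measured from real trials, and the observed daily Sharpe of 0.15 is an illustrative input.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()

def expected_max_sharpe(sharpe_variance: float, n_trials: int) -> float:
    """SR_0: expected best per-period Sharpe among n_trials skill-free trials (n_trials >= 2)."""
    inv = _STD_NORMAL.inv_cdf
    z_max = ((1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
             + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e)))
    return math.sqrt(sharpe_variance) * z_max

def probabilistic_sharpe_ratio(sr_hat, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """PSR against an arbitrary per-period benchmark; kurt is raw kurtosis."""
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    return _STD_NORMAL.cdf((sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom)

def deflated_sharpe_ratio(sr_hat, n_obs, n_trials, sharpe_variance, skew=0.0, kurt=3.0):
    """DSR = PSR evaluated at the expected-maximum benchmark SR_0."""
    sr_0 = expected_max_sharpe(sharpe_variance, n_trials)
    return probabilistic_sharpe_ratio(sr_hat, sr_0, n_obs, skew, kurt)

n_obs, n_trials = 730, 500
sr_hat = 0.15                                   # hypothetical observed daily Sharpe
psr_zero = probabilistic_sharpe_ratio(sr_hat, 0.0, n_obs)
dsr = deflated_sharpe_ratio(sr_hat, n_obs, n_trials, sharpe_variance=1.0 / n_obs)

print(f"PSR against zero: {psr_zero:.4f}")
print(f"DSR (N = {n_trials}):    {dsr:.4f}")
```

The same observed Sharpe that looks overwhelmingly significant against zero falls short of the 0.95 threshold once the benchmark reflects the search.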

A practical note: the formulas operate at the native frequency of the returns. Deflate first, in per-period units; annualize afterward. Mixing annualized Sharpes into the PSR statistic is a common implementation error.
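The unit mix-up is easy to demonstrate. The sketch below assumes a 365-day calendar and an annualized Sharpe of 1.0, then evaluates PSR once with the per-period Sharpe and once with the annualized figure mistakenly paired with a daily observation count.

```python
import math
from statistics import NormalDist

def psr(sr_hat, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probabilistic Sharpe Ratio; every Sharpe input must be per-period."""
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    return NormalDist().cdf((sr_hat - sr_benchmark) * math.sqrt(n_obs - 1) / denom)

periods_per_year = 365    # assumption: crypto-style calendar
n_obs = 730               # daily observations
sr_annual = 1.0
sr_daily = sr_annual / math.sqrt(periods_per_year)

psr_correct = psr(sr_daily, 0.0, n_obs)   # per-period Sharpe with per-period n
psr_wrong = psr(sr_annual, 0.0, n_obs)    # annualized Sharpe mixed with daily n

print(f"correct units: {psr_correct:.4f}")
print(f"mixed units:   {psr_wrong:.4f}")

# Annualize only for reporting, after the test statistic has been computed.
sr_reported = sr_daily * math.sqrt(periods_per_year)
```

The mixed-unit version overstates the test statistic by roughly sqrt(periods per year), so nearly any strategy appears certain.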

A conservative illustration

Take two years of daily data (n = 730, on a 365-day crypto calendar) and a search of N = 500 registered trials — modest numbers for an automated research process.

Under the null, the standard error of a single daily Sharpe estimate is about 1/sqrt(730). The expected best of 500 such trials sits roughly three standard errors above zero, which annualizes to an observed Sharpe in the neighborhood of 2 — produced by pure noise, under idealized IID-normal assumptions. That is the expected maximum; for the DSR to declare significance at conventional confidence, the observed Sharpe must clear this bar with room to spare, not merely touch it. The required value is substantially higher than what would impress anyone reading a single backtest, and negative skew or excess kurtosis pushes it higher still.
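The illustration can be checked numerically. The sketch below keeps the idealized assumptions from the text (IID-normal returns, null standard error of 1/sqrt(n) used as the cross-trial dispersion) and solves by bisection for the observed Sharpe at which DSR reaches 0.95; variable names are choices made for the example.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
std_normal = NormalDist()

n_obs, n_trials, periods_per_year = 730, 500, 365

# Expected best-of-N in standard-error units, then in daily Sharpe units.
z_max = ((1.0 - EULER_GAMMA) * std_normal.inv_cdf(1.0 - 1.0 / n_trials)
         + EULER_GAMMA * std_normal.inv_cdf(1.0 - 1.0 / (n_trials * math.e)))
sr_0_daily = z_max / math.sqrt(n_obs)   # assumption: null SE of a daily Sharpe ~ 1/sqrt(n)

def dsr_normal(sr_hat: float) -> float:
    """DSR for normal returns (skewness 0, kurtosis 3) against sr_0_daily."""
    z = (sr_hat - sr_0_daily) * math.sqrt(n_obs - 1) / math.sqrt(1.0 + 0.5 * sr_hat ** 2)
    return std_normal.cdf(z)

# Bisection: DSR is increasing in sr_hat on this bracket.
lo, hi = sr_0_daily, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if dsr_normal(mid) < 0.95:
        lo = mid
    else:
        hi = mid
required_daily = hi

annualize = math.sqrt(periods_per_year)
noise_bar_annual = sr_0_daily * annualize
required_annual = required_daily * annualize
print(f"expected best-of-{n_trials} from noise: {noise_bar_annual:.2f} annualized")
print(f"Sharpe needed for DSR >= 0.95:    {required_annual:.2f} annualized")
```

Negative skew or excess kurtosis would enlarge the denominator and raise the required value further.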

This is not a defect of the DSR. It is the honest arithmetic of short samples and wide searches, and it explains why well-run validation kills most candidates — see Twelve Hypotheses, Zero Survivors for a worked instance.

The denominator must be honest

The DSR has one operational weakness, and it is fatal if ignored: every term is observable except N, which is self-reported.

Each parameter tweak, each feature variant, each re-run after inspecting results is a trial. If the search performed 500 trials but the ledger records 12, the DSR is computed against the wrong null and silently inverts its purpose — it now lends statistical authority to an overfit result. This is the multiple-testing analogue of the file-drawer problem, and no formula can recover trials that were never recorded.
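The inversion is easy to see with numbers. The sketch below evaluates the same hypothetical result (daily Sharpe 0.15 over 730 observations) against a ledger that records 12 trials and against the 500 trials actually run. The cross-trial variance is held at the idealized 1/n in both cases, an assumption made to isolate the effect of N.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
std_normal = NormalDist()

def dsr(sr_hat, n_obs, n_trials, sharpe_variance, skew=0.0, kurt=3.0):
    """Deflated Sharpe Ratio from per-period inputs (n_trials >= 2, raw kurtosis)."""
    z_max = ((1.0 - EULER_GAMMA) * std_normal.inv_cdf(1.0 - 1.0 / n_trials)
             + EULER_GAMMA * std_normal.inv_cdf(1.0 - 1.0 / (n_trials * math.e)))
    sr_0 = math.sqrt(sharpe_variance) * z_max
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    return std_normal.cdf((sr_hat - sr_0) * math.sqrt(n_obs - 1) / denom)

n_obs = 730
sr_hat = 0.15   # hypothetical observed daily Sharpe

dsr_reported = dsr(sr_hat, n_obs, n_trials=12, sharpe_variance=1.0 / n_obs)
dsr_actual = dsr(sr_hat, n_obs, n_trials=500, sharpe_variance=1.0 / n_obs)

print(f"DSR with N = 12 (as recorded): {dsr_reported:.3f}")
print(f"DSR with N = 500 (as run):     {dsr_actual:.3f}")
```

Under the recorded count the result clears the 0.95 threshold; under the true count it does not. Nothing in the formula can flag the discrepancy.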

This is why Corrai treats trial accounting as structural rather than voluntary. Every run that completes is recorded in an append-only search ledger at the engine's single choke point — the ledger only grows; runs can be added but never removed — and the deflation is computed from the recorded search scope, not from whatever subset the researcher chooses to present. Agents propose, the Judge decides — and the Judge counts. PBO attacks selection bias from a complementary angle, using the within-run trial matrix rather than the cross-run count.

Where the DSR sits in the pipeline

The DSR corrects for selection and non-normality. It does not detect lookahead leakage, mis-modeled costs, or regime fragility — those require purged cross-validation with embargo, explicit cost declaration, and walk-forward evaluation respectively. In Corrai's Judge it is one gate among several; a candidate that survives the DSR has answered one question, not all of them. For the full gate sequence, see Getting Started.

A high backtested Sharpe is a claim. The Deflated Sharpe Ratio is one of the cross-examinations it must survive.

References

  • Bailey, D. H., & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality." Journal of Portfolio Management, 40(5).
  • Bailey, D. H., & López de Prado, M. (2012). "The Sharpe Ratio Efficient Frontier." Journal of Risk, 15(2).
  • Lo, A. W. (2002). "The Statistics of Sharpe Ratios." Financial Analysts Journal, 58(4).
  • Mertens, E. (2002). "Comments on Variance of the IID Estimator in Lo (2002)." Working paper, University of Basel.