Probability of Backtest Overfitting (PBO)

Probability of Backtest Overfitting (PBO): the probability that an in-sample winner falls below the out-of-sample median, estimated via CSCV.

The Probability of Backtest Overfitting (PBO) is the probability that the configuration selected as best in-sample (IS) performs below the median of all tested configurations out-of-sample (OOS). It is estimated with the combinatorially symmetric cross-validation (CSCV) framework of Bailey, Borwein, López de Prado, and Zhu.

The object being measured deserves emphasis: PBO is not a property of a single strategy. It is a property of a selection process — the research loop that searched a space of configurations and picked a winner.

How CSCV estimates it

Arrange the performance series of all N tested configurations into a T × N matrix — one column per configuration, one row per time observation.
Partition the T rows into S blocks of equal size (S must be even, since the next step pairs half against half), preserving temporal order within each block.
For each of the C(S, S/2) ways to choose half the blocks, treat the chosen half as IS and the complement as OOS.
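In each split, identify the configuration with the best IS performance, then record its relative rank $\omega \in (0, 1)$ in the OOS performance distribution of all N configurations (for example, its OOS rank divided by $N + 1$, which keeps the logit in the next step finite).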
PBO is the fraction of splits in which that rank falls strictly below the OOS median — equivalently, the probability mass of the logit $\lambda = \ln\!\big(\omega / (1 - \omega)\big)$ below zero.
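
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes NumPy is available, that the performance matrix holds per-period returns, and that a per-period Sharpe ratio (mean over standard deviation) is the performance metric. The names `cscv_pbo` and `sharpe` are hypothetical, and trailing rows are dropped when T is not divisible by S.

```python
from itertools import combinations

import numpy as np


def sharpe(returns):
    """Per-column mean / std. A stand-in metric (assumption); assumes std > 0."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def cscv_pbo(perf, n_blocks=8, metric=sharpe):
    """Estimate PBO from a T x N performance matrix via CSCV.

    Returns (pbo, logits), where logits holds one value per IS/OOS split.
    """
    perf = np.asarray(perf, dtype=float)
    t, n = perf.shape
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks (S) must be a positive even integer")
    rows = t - t % n_blocks  # drop trailing rows so all blocks are equal-sized
    if rows == 0:
        raise ValueError("need at least n_blocks rows")
    blocks = np.split(perf[:rows], n_blocks)  # contiguous, time-ordered blocks

    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        # Block indices are ascending, so stacking preserves temporal order.
        is_perf = metric(np.vstack([blocks[b] for b in is_ids]))
        oos_perf = metric(np.vstack([blocks[b] for b in oos_ids]))

        best = int(np.argmax(is_perf))  # the IS winner for this split
        # OOS rank of the IS winner: 1 = worst, N = best.
        rank = 1 + int(np.sum(oos_perf < oos_perf[best]))
        omega = rank / (n + 1)  # relative rank, strictly inside (0, 1)
        logits.append(np.log(omega / (1 - omega)))

    logits = np.array(logits)
    # Fraction of splits where the IS winner lands strictly below the OOS median.
    return float(np.mean(logits < 0)), logits


# Illustrative run on pure noise (synthetic data, not a real backtest).
rng = np.random.default_rng(0)
noise = rng.normal(size=(400, 20))
pbo, logits = cscv_pbo(noise, n_blocks=8)
print(f"splits: {len(logits)}, PBO estimate: {pbo:.2f}")
```

Passing `metric` as an argument keeps the split-and-rank logic independent of how performance is scored, so any column-wise statistic can be swapped in.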

Because every observation serves in IS and OOS equally often across the combinations, the estimate is symmetric: it does not hinge on one arbitrary holdout split, and it reuses the backtests you already ran rather than demanding fresh data.
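
The symmetry claim can be checked by counting. The sketch below enumerates the splits for a hypothetical S = 8 and tallies how often each block lands in the IS half; the helper name `is_membership_counts` is illustrative.

```python
from itertools import combinations
from math import comb


def is_membership_counts(n_blocks):
    """Count how many CSCV splits place each block in the in-sample half."""
    counts = [0] * n_blocks
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        for b in is_ids:
            counts[b] += 1
    return counts


n_blocks = 8  # hypothetical choice of S
n_splits = comb(n_blocks, n_blocks // 2)
counts = is_membership_counts(n_blocks)
print(n_splits, counts)  # 70 [35, 35, 35, 35, 35, 35, 35, 35]
```

Each block is IS in exactly half of the 70 splits and OOS in the other half. The number of splits grows quickly with S: S = 16 already yields C(16, 8) = 12,870 combinations, so S trades off estimate granularity against compute and block length.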

What a high PBO means

The readings split three ways.

PBO well below 0.5: the IS ranking carries genuine information about the OOS ranking.
PBO near 0.5: the IS ranking carries no information at all; the selection process has been rewarding noise, not signal.
PBO above 0.5: stronger evidence still. The IS winner systematically underperforms its peers out of sample, the signature of a search wide enough, or a candidate set correlated enough, for the maximum to be a pure sampling artifact.

The mechanics of how this happens are covered in Why Backtests Lie.

A low PBO does not certify the winner; it only says the selection process preserved rank information from IS to OOS. The winner itself still has to survive scrutiny of its own performance estimate.

On thresholds

The literature does not fix a universal cutoff, and any single number applied across asset classes, sample lengths, and candidate-set sizes would be arbitrary. The defensible reading is graded: PBO near 0.5 indicates a selection process indistinguishable from noise, and elevated PBO should trigger heavier penalties on the entire candidate family — not a one-line pass/fail on the surviving strategy.

PBO and DSR are complements

| | Deflated Sharpe Ratio | PBO |
| --- | --- | --- |
| Object examined | the single best candidate | the selection process as a whole |
| Question asked | is this Sharpe explainable by chance, given the number of trials? | does IS ranking predict OOS ranking at all? |
| Failure mode caught | a lucky maximum | a noise-driven search loop |

A candidate can pass the Deflated Sharpe Ratio (DSR) test while the process that produced it shows high PBO, which is evidence that the search itself is unreliable even if this particular maximum looks defensible. Both quantities require an honest accounting of every trial run, which is why Corrai records every completed run in an append-only search ledger and feeds the recorded search history, not self-reported counts, into the Judge's verdict.