Purged Cross-Validation and Embargo

Why random K-fold leaks in financial time series, and how purging and embargo remove label overlap and serial correlation from cross-validation.

Cross-validation estimates out-of-sample performance by holding data out. The estimate is only as trustworthy as the independence between the held-out set and the training set. In financial time series that independence fails in a specific, mechanical way: labels are constructed from forward-looking windows, so observations overlap in time, and a random K-fold split routes test-period information into the training set. Purging and embargo — formalized by Marcos López de Prado in Advances in Financial Machine Learning (2018) — are the standard repair. This page explains what breaks and what the repair does.

Why random K-fold fails on financial labels

Shuffled K-fold assigns observations to folds at random and treats each as an independent draw. (Contiguous, unshuffled K-fold fares little better: every middle fold trains on data from its own future, and the fold boundaries leak in both directions.) Two properties of financial data violate the i.i.d. premise.

Labels span intervals, not points. A supervised observation in financial ML is typically a pair: features $x_t$ observed at time $t$, and a label $y_t$ computed from prices over a forward window $(t,\, t+h]$, for example a fixed-horizon forward return or the first barrier touched in a triple-barrier scheme. The observation is indexed at $t$, but its label lives on the interval $(t,\, t+h]$.

Nearby observations share label content. With daily bars and a 10-day label horizon, the observations at $t$ and $t+1$ have label windows that share nine of ten days. Their labels are largely the same outcome, sampled twice.
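The nine-of-ten figure generalizes to any lag between two observations. A minimal sketch, assuming fixed-horizon labels on evenly spaced bars; the function name `label_overlap_fraction` is illustrative and does not come from any library:

```python
def label_overlap_fraction(lag: int, horizon: int) -> float:
    """Share of the window (t, t+horizon] that also lies in (t+lag, t+lag+horizon]."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    shared_bars = max(horizon - abs(lag), 0)
    return shared_bars / horizon


# Daily bars, 10-day label horizon (the example from the text).
adjacent = label_overlap_fraction(lag=1, horizon=10)      # 0.9
one_horizon_apart = label_overlap_fraction(lag=10, horizon=10)  # 0.0
```

Only observations at least one full horizon apart have labels with no shared price data.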

Random assignment ignores both facts. When $t$ lands in the training fold and $t+1$ lands in the test fold, the model is trained on a label that is, in substance, the test label. The cross-validated score then measures partly memorization of the test outcome, not generalization. Serial correlation in the features compounds the problem: rolling volatilities, moving averages, and similar trailing-window constructions are strongly correlated across adjacent timestamps, so train and test rows near a fold boundary are near-duplicates on both sides of the supervised pair.

The direction of the bias matters. This leakage inflates the apparent skill of every model, and it inflates it most for the models best at memorizing local noise — which means random K-fold does not merely overstate performance, it preferentially selects the most overfit candidate. That selection pressure is one of the failure modes catalogued in Why Backtests Lie.

Purging

Purging removes the label-overlap channel directly.

For each test fold, drop from the training set every observation whose label window intersects the label window of any test observation. Formally: let training observation $i$ have label window $[t_{i,0},\, t_{i,1}]$ and test observation $j$ have label window $[t_{j,0},\, t_{j,1}]$. Observation $i$ is purged if the intervals overlap for some test observation $j$, that is, if $t_{i,0} \le t_{j,1}$ and $t_{j,0} \le t_{i,1}$. This single condition covers every case: $i$'s window ends inside $j$'s, starts inside $j$'s, lies within it, or envelops it entirely.

In practice, with contiguous test folds, purging removes a band of training observations adjacent to each test boundary, of width comparable to the label horizon $h$. After purging, no training label is computed from price data that also enters a test label. The information that remains in the training set is information the model could legitimately have had.
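The purge rule can be sketched as a direct interval-overlap check. This is an illustrative implementation, not López de Prado's code: the function name and the representation of each label window as a closed interval of integer time steps are assumptions made for the example.

```python
def purged_training_positions(label_start, label_end, test_positions):
    """Return the positions still usable for training after purging.

    label_start[i] and label_end[i] are the first and last time steps that
    enter observation i's label (a closed interval, assumed for this sketch).
    """
    test_set = set(test_positions)
    kept = []
    for i in range(len(label_start)):
        if i in test_set:
            continue  # test rows are never training rows
        # Two closed intervals overlap when each starts before the other ends.
        overlaps_test = any(
            label_start[i] <= label_end[j] and label_start[j] <= label_end[i]
            for j in test_positions
        )
        if not overlaps_test:
            kept.append(i)
    return kept


# Toy example: 30 daily observations, label window (t, t+3], test block 10..14.
horizon = 3
starts = [t + 1 for t in range(30)]      # first day inside (t, t+h]
ends = [t + horizon for t in range(30)]  # last day inside (t, t+h]
train = purged_training_positions(starts, ends, range(10, 15))
# Positions 8, 9 (before the test block) and 15, 16 (after it) are purged.
```

The pairwise loop is quadratic and chosen for readability; with contiguous test folds the same result follows from comparing each training window against the span of the test block's label windows.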

Purging is a statement about labels. It says nothing about features — that gap is what embargo addresses.

Embargo

Purging closes the label channel on both sides of the test interval — including the band just after it, where training label windows still overlap the windows of late test observations. But consider a training observation just beyond that post-test purged band. Its label window no longer overlaps any test label, so purging leaves it in place. Its features are another matter: a 60-day rolling volatility computed shortly after the test interval is still built largely from test-period prices. Serial dependence in returns and in feature constructions outlives the label horizon, so observations shortly after the test set still carry test-period information into training.

Embargo handles this residual channel: beyond the post-test purged band, drop an additional band of training observations before training resumes. López de Prado suggests an embargo on the order of 1% of total observations as a starting point; the appropriate width ultimately depends on how persistent your features are — long trailing windows and slow-moving signals warrant wider embargoes.
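A minimal sketch of the embargo step, assuming the training positions are sorted and have already been purged. The helper names are hypothetical, and the 1% default only mirrors the starting point quoted above; it is not a recommendation for any particular feature set.

```python
def embargo_width(n_obs, fraction=0.01):
    """Embargo length in observations, as a fraction of the sample size."""
    return round(n_obs * fraction)


def apply_embargo(train_positions, last_test_position, n_embargo):
    """Drop the first n_embargo surviving training positions after the test block.

    Because train_positions is already purged, the first surviving position
    after the test block sits just beyond the post-test purged band.
    Positions before the test block are left untouched.
    """
    before = [i for i in train_positions if i < last_test_position]
    after = [i for i in train_positions if i > last_test_position]
    return before + after[n_embargo:]


# Toy example: 1000 observations, test block 400..599, purged bands 391..399
# and 600..608 already removed from the training positions.
purged_train = list(range(0, 391)) + list(range(609, 1000))
width = embargo_width(1000)
train = apply_embargo(purged_train, last_test_position=599, n_embargo=width)
# Training now resumes at position 619 instead of 609.
```

For features with long trailing windows, a width tied to the longest lookback may be a more defensible choice than a fixed fraction of the sample.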

Note the asymmetry. The embargo applies only after the test interval. Leakage from before the test set travels through forward-looking label windows, and purging has already removed it. Leakage after the test set travels through backward-looking features, which purging cannot see.

A timeline view

time ─────────────────────────────────────────────────────────────▶
 
[ train ........ | purged | test ........ | purged | embargo | train ..... ]
 
purged  : training observations whose label windows overlap test label
          windows — a band of width ≈ label horizon h on each side
embargo : a further band beyond the post-test purged zone, dropped to
          absorb serial correlation in features that purging cannot see

Reading left to right: training data is used up to the point where its label windows would reach into the test set; that band is purged. The test interval is evaluated. The band just after the test interval is purged for the same reason — its label windows overlap those of late test observations — and a further band is embargoed because its features are computed on test-period data. Only then does training data resume.
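For fixed-horizon labels on evenly spaced bars, the timeline reduces to index arithmetic. The generator below is a sketch under those assumptions (label window $(t,\, t+h]$, contiguous folds, embargo counted in observations); its name and signature are illustrative rather than any library's API.

```python
def purged_kfold_splits(n_obs, n_splits, horizon, n_embargo):
    """Yield (train, test) position lists for contiguous folds with purge and embargo."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1 bar")
    fold_size, remainder = divmod(n_obs, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + fold_size + (1 if k < remainder else 0)
        first_test, last_test = start, stop - 1
        train = [
            i for i in range(n_obs)
            # Before the test block: label window (i, i+h] must end before
            # the first test label window begins.
            if i <= first_test - horizon
            # After the test block: skip the purged band (h - 1 rows),
            # then skip the embargo band (n_embargo rows).
            or i >= last_test + horizon + n_embargo
        ]
        yield train, list(range(start, stop))
        start = stop


splits = list(purged_kfold_splits(n_obs=100, n_splits=5, horizon=5, n_embargo=3))
train, test = splits[1]
# test covers 20..39; training uses 0..15 and resumes at 47.
```

In this example positions 16 to 19 and 40 to 43 are purged, and 44 to 46 are embargoed.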

Relation to walk-forward and CPCV

Walk-forward is the conservative limiting case: train strictly before test, advance chronologically, never look back across the boundary. It is honest about direction but produces a single backtest path and uses data inefficiently, so the resulting performance estimate has high variance.
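A sketch of an expanding-window walk-forward split, assuming fixed-horizon labels so that the last few training rows before each test block are dropped to keep their label windows out of the test period. The function and parameter names are hypothetical.

```python
def walk_forward_splits(n_obs, n_windows, min_train, horizon=1):
    """Yield (train, test) position lists; training always precedes testing."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1 bar")
    test_size = (n_obs - min_train) // n_windows
    for k in range(n_windows):
        first_test = min_train + k * test_size
        # The final window absorbs any leftover observations.
        end_test = first_test + test_size if k < n_windows - 1 else n_obs
        # Keep training rows i with i + horizon <= first_test.
        train_end = max(first_test - horizon + 1, 0)
        yield list(range(train_end)), list(range(first_test, end_test))


windows = list(walk_forward_splits(n_obs=100, n_windows=4, min_train=20, horizon=5))
```

No post-test purge or embargo is needed here, because nothing after a test block is ever used to train for it. The cost is visible in the output: every observation is tested at most once, and the early windows train on little data.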

Combinatorial purged cross-validation (CPCV) goes the other way: partition the sample into $N$ groups, take every combination of $k$ groups as a test set, and apply purging and embargo at each train/test boundary. The result is many backtest paths rather than one: a distribution of out-of-sample performance instead of a point estimate. Running all trials through combinatorial splits also yields the kind of in-sample/out-of-sample performance matrix that estimation of the probability of backtest overfitting (PBO) consumes, via the closely related combinatorially symmetric cross-validation (CSCV) procedure.
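The combinatorics are easy to enumerate. The sketch below assumes the usual CPCV setup of $N$ groups with $k$ of them held out per split; it only counts splits and paths and leaves out the purging and embargo that each boundary would still need. Function names are illustrative.

```python
from itertools import combinations
from math import comb


def cpcv_test_group_sets(n_groups, n_test_groups):
    """Every choice of n_test_groups groups to hold out as the test set."""
    return list(combinations(range(n_groups), n_test_groups))


def cpcv_path_count(n_groups, n_test_groups):
    """Number of complete backtest paths.

    Each group is tested in comb(N - 1, k - 1) splits, which equals
    (k / N) * comb(N, k), so that many full paths can be assembled.
    """
    return comb(n_groups - 1, n_test_groups - 1)


test_sets = cpcv_test_group_sets(n_groups=6, n_test_groups=2)
n_paths = cpcv_path_count(n_groups=6, n_test_groups=2)
# 6 groups with 2 held out: 15 splits, 5 backtest paths.
```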

What purging and embargo do not fix

It is worth being precise about scope. Purged cross-validation with embargo removes mechanical leakage within a single train/test evaluation. It does not correct for selection across evaluations. A researcher who runs a thousand variants through perfectly purged CV and reports the best one has a clean per-trial estimate and a heavily biased selected estimate. Multiplicity is a separate disease with separate instruments — the deflated Sharpe ratio and explicit trial accounting. The case study in Twelve Hypotheses, Zero Survivors shows these corrections in practice.
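The selection effect can be illustrated with a small simulation, which is not a correction method. The sketch assumes every candidate has zero true edge and i.i.d. Gaussian daily returns; the function name, volatility, and sample sizes are arbitrary choices for the example.

```python
import random
import statistics


def best_of_n_null_strategies(n_trials, n_days, seed=0):
    """Highest per-day Sharpe ratio among n_trials strategies with no true edge."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_trials):
        # Zero-mean returns: any positive Sharpe here is pure sampling noise.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = statistics.mean(returns) / statistics.stdev(returns)
        best = max(best, sharpe)
    return best


best = best_of_n_null_strategies(n_trials=200, n_days=252)
```

Each individual estimate is unbiased, yet the reported maximum is reliably positive. Leakage-free splits do not change that; accounting for the number of trials does.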

How Corrai applies this

Corrai does not offer random K-fold for time-series labels — the split scheme is chronological by construction. Candidate strategies are evaluated on a purged chronological train/test split with embargo support, and multi-window walk-forward evaluation is the primary promotion gate on top of it. The split scheme, purge horizon, embargo width, and label horizon are recorded with each run, together with the data lineage, so a run can be audited later. Agents and researchers propose candidates; the Judge decides on this evidence. For the full set of gates a candidate faces, see Getting Started.