Backtest Overfitting Checklist
A checklist for detecting backtest overfitting: multiple testing, look-ahead leakage, same-bar fills, ignored costs, regime dependence, and missing trial history.
This backtest overfitting checklist is a practical companion to Why Backtests Lie. It is written for researchers who need to detect backtest overfitting, validate quant strategies, check backtests for leakage, recognize the signs of an overfit trading strategy, and apply risk controls to AI-driven backtesting.
The checklist is intentionally conservative. A strategy does not need to fail every item to be rejected. One severe failure can invalidate the evidence.
1. Was every trial recorded?
Ask whether the reported strategy is the only strategy tested or the winner of a larger search. Count:
- parameter sweeps
- feature variants
- universe changes
- rebalance frequencies
- risk filters
- execution assumptions
- rejected agent suggestions
- reruns after viewing the equity curve
If the trial history is missing, the observed Sharpe ratio cannot be deflated honestly, because the size of the correction depends on how many trials were run. The result may still be interesting as a lead, but it is not strong evidence.
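To make the dependence on trial count concrete, here is a minimal sketch of one common approximation for the best Sharpe ratio that pure luck would produce across a set of skill-less trials. It assumes independent trials with normally distributed Sharpe estimates, which real parameter sweeps rarely satisfy, so treat the output as a rough benchmark. The function name and the `sharpe_std` parameter are illustrative and do not come from this article.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Approximate the best Sharpe expected from n_trials strategies with no skill.

    sharpe_std is the standard deviation of the Sharpe estimates across trials,
    in the same units (for example annualized) as the Sharpe being judged.
    Assumes independent, normally distributed estimates.
    """
    if n_trials < 2:
        return 0.0  # a single trial carries no selection effect to correct for
    z = NormalDist()
    a = z.inv_cdf(1.0 - 1.0 / n_trials)
    b = z.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sharpe_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


# The luck-only benchmark rises with the number of trials.
for n in (1, 10, 100, 1000):
    print(n, round(expected_max_sharpe(n, sharpe_std=1.0), 2))
```

An observed Sharpe that does not clearly exceed this benchmark for the true trial count is hard to distinguish from selection. Without the trial count, the benchmark cannot be computed at all.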
2. Is the split time-aware?
Random K-fold is usually wrong for financial time series. It mixes future and past observations, and overlapping labels leak information across folds. Use chronological walk-forward validation or purged cross-validation with embargo.
Red flags:
- shuffled train/test split
- random K-fold on rolling features
- labels computed from forward windows without purging
- test windows selected after viewing results
- one final holdout repeatedly reused during research
For the mechanics, see Purged Cross-Validation and Embargo.
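As a rough illustration of purging and embargo, the sketch below builds contiguous test blocks and drops the training rows that sit too close to each block. It assumes row `i` carries a label computed from the next `label_horizon` rows. The function and parameter names are hypothetical, and the rule for rows after the test block is deliberately conservative.

```python
from typing import Iterator, List, Tuple


def purged_kfold_indices(
    n_samples: int, n_splits: int, label_horizon: int, embargo: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for chronologically ordered rows."""
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = n_samples if k == n_splits - 1 else test_start + fold_size
        test_idx = list(range(test_start, test_end))

        # Purge: keep only earlier rows whose forward label window
        # ends before the test block starts.
        before = range(0, max(0, test_start - label_horizon))

        # Purge plus embargo: skip rows right after the test block so that
        # overlapping labels and rolling features built on test-period data
        # do not leak into training.
        after_start = min(n_samples, test_end + label_horizon + embargo)
        after = range(after_start, n_samples)

        yield list(before) + list(after), test_idx


for train, test in purged_kfold_indices(100, 5, label_horizon=3, embargo=2):
    print(f"test {test[0]}-{test[-1]}, train rows kept: {len(train)}")
```

Each fold trains on fewer rows than plain K-fold would. That lost data is the price of removing the overlap.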
3. Could the strategy know the data at the time?
A point-in-time failure can make an impossible strategy look excellent. Check every field used by the signal:
- Was the value available before the trade decision?
- Was publication lag modeled?
- Were revised values replaced with as-of values, meaning the figures known at decision time?
- Was any missing value backward-filled?
- Was the universe defined using future survival?
- Were delisted assets included?
If the answer is unclear, the data is not ready for validation.
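One way to enforce the first three checks is to store every release with its publication date and look values up as of the decision date. The sketch below assumes a simple in-memory list of releases. The `Release` type, the `value_as_of` helper, and the sample figures are hypothetical.

```python
from bisect import bisect_left
from datetime import date
from typing import List, NamedTuple, Optional


class Release(NamedTuple):
    published: date  # when this figure became public
    value: float


def value_as_of(releases: List[Release], decision_date: date) -> Optional[float]:
    """Return the latest value published strictly before decision_date."""
    ordered = sorted(releases, key=lambda r: r.published)
    dates = [r.published for r in ordered]
    pos = bisect_left(dates, decision_date)  # rows before pos were published earlier
    return ordered[pos - 1].value if pos > 0 else None


# Illustrative data: a figure first published in February, revised in May.
earnings = [
    Release(date(2024, 2, 15), 1.00),
    Release(date(2024, 5, 10), 0.80),
]

print(value_as_of(earnings, date(2024, 2, 1)))  # nothing published yet
print(value_as_of(earnings, date(2024, 3, 1)))  # first release only
print(value_as_of(earnings, date(2024, 6, 1)))  # revised figure
```

A signal dated March that uses the revised figure would be reading data that did not exist yet. Returning `None` instead of filling the gap keeps missing history visible.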
4. Are execution assumptions realistic?
Many attractive backtests disappear after costs. Review:
- same-bar fills versus next-bar fills
- exchange fees and funding costs
- spread and slippage
- turnover and capacity
- leverage and borrowing constraints
- order timing relative to signal availability
The phrase "Sharpe 2.0" is incomplete. The useful claim is "Sharpe 2.0 under these explicit cost and execution assumptions." For more detail, see Cost-Aware Backtesting.
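The effect of fill timing and costs can be shown with a small sketch. It assumes `signals[t]` is a target position decided with data through the close of bar `t`, and it charges a flat cost per unit of turnover. The function name, the cost model, and the toy series are illustrative assumptions, not a full execution simulator.

```python
from typing import List


def net_returns(
    signals: List[float],
    bar_returns: List[float],
    cost_per_unit_turnover: float,
    fill_lag: int = 1,
) -> List[float]:
    """Per-bar strategy returns after costs.

    fill_lag=1: the position held during bar t is signals[t-1] (next-bar fill).
    fill_lag=0: same-bar fill, which assumes trading before the signal existed.
    """
    out = []
    prev_pos = 0.0
    for t in range(len(bar_returns)):
        src = t - fill_lag
        pos = signals[src] if src >= 0 else 0.0
        cost = cost_per_unit_turnover * abs(pos - prev_pos)
        out.append(pos * bar_returns[t] - cost)
        prev_pos = pos
    return out


# Toy series in which the signal happens to match the same bar's return sign.
signals = [1.0, -1.0, 1.0, -1.0]
bar_returns = [0.01, -0.01, 0.01, -0.01]

same_bar_gross = net_returns(signals, bar_returns, 0.0, fill_lag=0)
next_bar_net = net_returns(signals, bar_returns, 0.001, fill_lag=1)
print(sum(same_bar_gross), sum(next_bar_net))
```

On this toy series the same-bar, zero-cost version is profitable, while the next-bar version with costs loses money. The signal is identical in both runs. Only the assumptions changed.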
5. Does performance survive outside the best regime?
A strategy can look robust while being carried by one market episode. Break results by:
- bull and bear markets
- volatility regimes
- liquidity regimes
- exchange or instrument
- pre-event and post-event windows
- training and walk-forward test segments
The question is not whether every slice is positive. The question is whether the entire result depends on one historical accident.
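A simple way to ask that question is to total the returns per regime and then look at what is left once the best regime is removed. The sketch below assumes each bar already has a regime label. How those labels are assigned is outside its scope, and the helper names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def regime_summary(returns: List[float], regimes: List[str]) -> Dict[str, Dict[str, float]]:
    """Total return, mean return, and bar count for each regime label."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for r, label in zip(returns, regimes):
        totals[label] += r
        counts[label] += 1
    return {
        label: {
            "total": totals[label],
            "mean": totals[label] / counts[label],
            "bars": counts[label],
        }
        for label in totals
    }


def total_without_best_regime(summary: Dict[str, Dict[str, float]]) -> float:
    """Sum of regime totals after dropping the single best regime."""
    best = max(summary, key=lambda label: summary[label]["total"])
    return sum(v["total"] for label, v in summary.items() if label != best)


returns = [0.05, 0.04, -0.01, 0.00, -0.02, 0.01]
regimes = ["bull", "bull", "bear", "bear", "sideways", "sideways"]

summary = regime_summary(returns, regimes)
print(summary)
print(total_without_best_regime(summary))
```

In this illustrative data the whole profit comes from the bull bars, and the remainder is negative once they are removed. That is the pattern this item is meant to catch.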
6. Was the metric selected after the fact?
Overfitting can happen at the reporting layer. A candidate may fail on Sharpe but pass on Sortino, fail on total return but pass on hit rate, or fail after costs but pass gross. If the chosen metric changed after seeing the results, that metric choice is another trial.
Predeclare the primary metric and use secondary metrics as diagnostics, not as replacements for the failed objective.
7. Did an AI agent tune the result after seeing it?
Agentic research introduces a subtle risk. An AI agent can iterate quickly through thresholds, filters, prompt variants, and explanatory narratives. That work is useful only if every iteration remains visible. If the agent hides the path and presents only the best survivor, the workflow has recreated the file-drawer problem at machine speed.
Corrai's answer is structural: the AI Research Feed records the path, Alpha Canvas runs the workflow, and the Judge reads the evidence before promotion.
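Independently of any particular product, the requirement that every iteration remains visible can be sketched as a log whose report always carries the full trial count alongside the winner. This is a generic illustration, not a description of how the AI Research Feed stores its data. The `Trial` and `TrialLog` names, fields, and sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Trial:
    trial_id: int
    params: Dict[str, Any]
    metric: float
    note: str = ""


@dataclass
class TrialLog:
    trials: List[Trial] = field(default_factory=list)

    def record(self, params: Dict[str, Any], metric: float, note: str = "") -> Trial:
        """Append one trial. Nothing is ever overwritten or removed."""
        trial = Trial(len(self.trials) + 1, dict(params), metric, note)
        self.trials.append(trial)
        return trial

    def report(self) -> str:
        """Report the best trial together with the full search it came from."""
        if not self.trials:
            raise ValueError("no trials recorded")
        best = max(self.trials, key=lambda t: t.metric)
        return json.dumps(
            {
                "n_trials": len(self.trials),
                "best": asdict(best),
                "all_metrics": [t.metric for t in self.trials],
            },
            sort_keys=True,
        )


log = TrialLog()
log.record({"lookback": 20}, metric=0.4)
log.record({"lookback": 50}, metric=1.1, note="agent suggestion")
log.record({"lookback": 50, "vol_filter": True}, metric=0.7, note="rerun after viewing curve")
print(log.report())
```

Because the report always includes `n_trials` and every metric, the best survivor cannot be presented without the search that produced it.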
Verdict
Passing this checklist does not prove future profitability. It means the backtest is no longer failing the obvious evidence tests. That is the starting line for serious alpha validation, not the finish line.