Point-in-Time Data Lineage

Point-in-time data lineage for quant research: availability timestamps, publication lag, revisions, survivorship bias, schema fingerprints, and leakage prevention.

Point-in-time data lineage answers a simple question: what did the strategy know at the moment it made the decision?

The backtest result is only as trustworthy as the data timeline behind it.

Event time, source time, ingestion time, availability time

Many datasets have more than one timestamp:

Event time: the time the market event or measured period refers to.
Source time: the timestamp assigned by the provider or exchange.
Ingestion time: when the research system collected the record.
Availability time: the earliest time a live strategy could have used the value.

Backtests often misuse event time as availability time. That is a look-ahead error. A daily metric describing Monday may only be computed after Monday closes. A revised fundamentals field may describe a past quarter but become available weeks later. A corrected candle may not be the candle a live system saw at the time.
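The distinction can be sketched as an as-of lookup keyed on availability time rather than event time. This is a minimal illustration, not a reference implementation: the Observation class, the latest_known helper, and the timestamps are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    event_time: datetime         # period the value describes
    availability_time: datetime  # earliest moment a live system could use it
    value: float


def latest_known(observations, decision_time):
    """Return the observation that most recently became available
    at or before decision_time, or None if nothing was available yet."""
    ordered = sorted(observations, key=lambda o: o.availability_time)
    keys = [o.availability_time for o in ordered]
    idx = bisect_right(keys, decision_time)
    return ordered[idx - 1] if idx else None


# Hypothetical example: Monday's daily metric is computed after Monday closes.
monday_metric = Observation(
    event_time=datetime(2024, 1, 1),
    availability_time=datetime(2024, 1, 2, 0, 30),
    value=1.25,
)

known_monday_noon = latest_known([monday_metric], datetime(2024, 1, 1, 12, 0))
known_tuesday_1am = latest_known([monday_metric], datetime(2024, 1, 2, 1, 0))
```

A decision made at Monday noon sees nothing, even though the record's event time is Monday. The value only becomes visible once its availability time has passed.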

Publication lag

Publication lag is common in alternative data, on-chain analytics, macro data, fundamentals, and some vendor-normalized market data. If the backtest does not model the lag, it may trade before the data existed.

The fix is not to apply one universal lag to every field. The fix is to record availability semantics per dataset and, where necessary, per field. Some values are available immediately after a bar closes. Some arrive minutes later. Some arrive after batch processing. Some are revised.
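One way to record availability semantics per field is a small lookup table that the backtest must consult before using a value. The sketch below assumes fixed lags for simplicity; the field names and lag values are invented for illustration and would in practice come from provider documentation or measured arrival times.

```python
from datetime import datetime, timedelta

# Hypothetical per-field publication lags. These numbers are placeholders,
# not statements about any real data source.
PUBLICATION_LAG = {
    "ohlcv.close": timedelta(0),                     # usable once the bar closes
    "onchain.active_addresses": timedelta(hours=6),  # arrives after batch processing
    "macro.cpi": timedelta(days=14),                 # published weeks after the period
}


def availability_time(field, period_end):
    """Earliest time a value for the period ending at period_end is usable."""
    if field not in PUBLICATION_LAG:
        # Fail loudly: an unknown field must not default to zero lag.
        raise KeyError(f"no availability rule recorded for field {field!r}")
    return period_end + PUBLICATION_LAG[field]


def is_usable(field, period_end, decision_time):
    """True if the value could have been known at decision_time."""
    return availability_time(field, period_end) <= decision_time
```

Raising on an unknown field is a deliberate choice in this sketch: a missing rule silently treated as zero lag would reintroduce the look-ahead error the table is meant to prevent.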

Revisions and snapshots

Many datasets are not immutable. Providers correct errors, restate history, remap symbols, or change calculation methods. A strategy tested on the final revised history may appear to have known facts that were not known in the original record.
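Revisions can be handled by storing every published vintage of a value and querying the history as of the decision date. The following is a rough sketch with made-up numbers; VINTAGES and value_as_known are hypothetical names, not part of any vendor API.

```python
from datetime import date

# (period, published_on, value): hypothetical vintages of one quarterly figure.
VINTAGES = [
    ("2023Q4", date(2024, 2, 1), 2.10),   # first print
    ("2023Q4", date(2024, 3, 15), 1.85),  # later restatement
]


def value_as_known(vintages, period, as_of):
    """Return the latest vintage of `period` published on or before `as_of`,
    or None if nothing had been published yet."""
    known = [
        (published_on, value)
        for per, published_on, value in vintages
        if per == period and published_on <= as_of
    ]
    # max() on (published_on, value) tuples picks the most recent publication.
    return max(known)[1] if known else None
```

A backtest dated mid-February would see the first print, while one that reads only the final history would see the restated figure on every date.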

Point-in-time lineage requires either historical snapshots or explicit version fingerprints. At minimum, a run should record the exact dataset version it used so the result can be reproduced and challenged later.
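A version fingerprint can be as simple as a hash over a canonical encoding of the rows, stored alongside the run. This sketch assumes the rows are JSON-serializable dictionaries; content_fingerprint and run_manifest are illustrative names, and a production pipeline would more likely hash files or partitions.

```python
import hashlib
import json


def content_fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the rows.
    Rows are sorted after encoding so the result ignores row order."""
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def run_manifest(dataset_name, snapshot_label, rows):
    """Minimal record of which data a backtest run actually used."""
    return {
        "dataset": dataset_name,
        "snapshot": snapshot_label,
        "row_count": len(rows),
        "sha256": content_fingerprint(rows),
    }
```

If a provider later restates a single value, the fingerprint changes, and a stored manifest makes it clear that the earlier result was produced from different data.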

Survivorship bias

Survivorship bias is a lineage problem. If the universe is defined using today's tradable assets and then applied backward through history, failed, delisted, merged, illiquid, or discontinued instruments disappear. The backtest no longer measures the strategy on the universe a researcher would have faced at the time.

For equity research, this means delisted securities matter. For crypto research, it means exchange listings, symbol migrations, dead tokens, contract changes, and liquidity disappearance matter. A strategy that only sees survivors is often a strategy that only sees winners.
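A point-in-time universe can be sketched as a membership table with listing and delisting dates, filtered by the simulation date. The symbols and dates below are invented, and the universe_on helper is a hypothetical name.

```python
from datetime import date

# (symbol, listed_on, delisted_on); None means still listed. Hypothetical data.
LISTINGS = [
    ("AAA", date(2015, 1, 1), None),
    ("BBB", date(2016, 6, 1), date(2020, 3, 31)),  # delisted, must stay in history
    ("CCC", date(2021, 1, 1), None),
]


def universe_on(listings, day):
    """Symbols that were tradable on `day`, including ones that later died."""
    return sorted(
        symbol
        for symbol, listed_on, delisted_on in listings
        if listed_on <= day and (delisted_on is None or day <= delisted_on)
    )


universe_2019 = universe_on(LISTINGS, date(2019, 1, 1))
universe_2022 = universe_on(LISTINGS, date(2022, 1, 1))
```

Here the 2019 universe contains the later-delisted BBB and excludes CCC, which did not exist yet. Applying the 2022 universe to 2019 would get both wrong.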

Schema fingerprints

A reproducible backtest should know the shape of its data. Schema fingerprints help detect when a field was added, renamed, rescaled, or normalized differently. This matters because a feature pipeline may silently change behavior when the upstream schema changes.
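A schema fingerprint can be sketched as a hash of column names, types, and units, with a small diff to explain what changed. The schema layout and the column metadata below are assumptions made for illustration.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash a mapping of column name -> metadata (dtype, unit, ...)."""
    canonical = json.dumps(columns, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_diff(old, new):
    """Report columns that were added, removed, or whose metadata changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }


# Hypothetical upstream change: volume is rescaled and a new field appears.
schema_v1 = {
    "close": {"dtype": "float64", "unit": "USD"},
    "volume": {"dtype": "float64", "unit": "contracts"},
}
schema_v2 = {
    "close": {"dtype": "float64", "unit": "USD"},
    "volume": {"dtype": "float64", "unit": "USD"},
    "vwap": {"dtype": "float64", "unit": "USD"},
}
diff = schema_diff(schema_v1, schema_v2)
```

Including units in the fingerprint matters here: the volume column keeps its name and dtype, so a check on names alone would miss the rescaling.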

Corrai's data engine treats source, schema, content fingerprint, and availability semantics as evidence artifacts. The Judge can then inspect whether a candidate was tested on data that matches the information timeline.

Common leakage examples

Watch for these failures:

backward-filling missing values, which copies a later observation into earlier rows
using final revised values on historical dates
trading on a metric before publication lag has elapsed
selecting a universe based on future survival
joining datasets on event date rather than availability date
computing a signal from a bar's close and filling the order at that same close
treating vendor-normalized history as if it were the live feed
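Two of the failures above, backward-filling and same-close fills, are easy to show on toy data. The sketch below uses plain lists and hypothetical helper names. It illustrates the mechanics only and is not a complete leakage detector.

```python
def forward_fill(values):
    """Fill gaps with the last value seen so far (uses only the past)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def backward_fill(values):
    """Fill gaps with the next value in the series (uses the future)."""
    return forward_fill(values[::-1])[::-1]


def next_bar_fills(signals, opens):
    """Pair the signal formed at bar t's close with bar t+1's open,
    so no order is filled at a price that produced its own signal."""
    return list(zip(signals[:-1], opens[1:]))


series = [None, 10.0, None, None, 13.0]
safe = forward_fill(series)
leaky = backward_fill(series)
fills = next_bar_fills(["buy", "hold", "sell"], [100.0, 101.0, 102.0])
```

In the backward-filled series, the third row holds 13.0, a value that was not observed until two rows later. The forward-filled series holds 10.0 there, which is what a live system would have carried.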

Point-in-time lineage does not make a strategy profitable. It removes one major source of false evidence.