The Trap of Comparing Walk-Forward Results

⚠️ Personal research and trading journal — not investment advice. The author does not provide licensed advisory services.

When I first started comparing different versions of my trading systems, I made a mistake that took months to catch. I would run two backtests, look at the summary statistics, and conclude that one version was better than the other. The numbers supported it. The analysis looked clean.

The problem: the two backtests covered different time periods.

The Setup That Seems Obvious Until You're In It

Say you test System A on 2018-2023 and get a Sharpe of 0.85. You then run System B on 2020-2025 and get a Sharpe of 0.72. You conclude System A is better.

But you're not comparing the same thing. System A's period includes 2018-2019, two years of mostly trending markets that System B never trades. System B's period includes 2024-2025, which System A never sees. Only 2020-2023 is shared, and that overlap contains the 2022-2023 stretch with a significant bear market in growth stocks. The market regimes are different. A fair comparison would require both systems to run on the same dates.
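A minimal sketch of that check in Python, assuming each backtest is described by an inclusive (start, end) pair and that the year labels mean January 1 through December 31. The function name `common_range` is hypothetical, not part of any backtesting library.

```python
from datetime import date

def common_range(range_a, range_b):
    """Return the (start, end) overlap of two inclusive date ranges, or None."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return (start, end) if start <= end else None

# Ranges from the example above (exact day boundaries are an assumption).
system_a = (date(2018, 1, 1), date(2023, 12, 31))
system_b = (date(2020, 1, 1), date(2025, 12, 31))

overlap = common_range(system_a, system_b)
print("shared period:", overlap)
```

Only the shared period supports a like-for-like Sharpe comparison. Anything outside it is one system being graded on a market the other never traded.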

This seems obvious when stated directly. In practice, it's easy to miss.

The reason: walk-forward backtests are complex. They have start dates, end dates, lookback periods, training windows, out-of-sample windows. When you're iterating on a system — tweaking a parameter here, adding a filter there — you often just re-run the test and look at the new Sharpe. If the underlying date range shifted because you changed a data source, updated a price file, or adjusted the minimum lookback, you're now comparing different animals.

What I Found When I Checked

I ran a walk-forward comparison of two system variants over 2020-2025, with 6 separate out-of-sample windows. Summary stats showed version 2 outperforming version 1 by 0.12 Sharpe. I was about to adopt version 2.

Then I verified the metadata on each backtest: start date, end date, number of completed windows, total trades per window.

The backtests had different start dates because one variant needed an extra 50 bars of warmup for a new indicator. The first window in version 2 started 2.5 months later than in version 1. That 2.5-month difference happened to exclude a losing period in late 2020 that version 1 captured. The comparison was flawed.

When I aligned both tests to the same start and end date (trimming the warmup period differently), version 2's advantage disappeared. The Sharpe difference went from 0.12 to 0.02, too small to distinguish from noise over six windows.
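A sketch of the alignment step, not the actual backtest code. It assumes each run is a dict of daily returns keyed by date, a zero risk-free rate, and 252 periods per year. The return values are made up so that both versions are identical on every shared day, which isolates the effect of the extra early dates.

```python
import math
from datetime import date
from statistics import mean, stdev

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)

def trim_to_common_range(returns_a, returns_b):
    """Keep only the dates between the later start and the earlier end."""
    start = max(min(returns_a), min(returns_b))
    end = min(max(returns_a), max(returns_b))

    def keep(series):
        return {d: r for d, r in series.items() if start <= d <= end}

    return keep(returns_a), keep(returns_b), start, end

# Made-up daily returns. Version 2 starts two bars later and matches
# version 1 exactly on every shared day.
days = [date(2020, 1, d) for d in (2, 3, 6, 7, 8, 9)]
v1 = dict(zip(days, [-0.02, -0.01, 0.01, 0.00, 0.02, -0.01]))
v2 = dict(zip(days[2:], [0.01, 0.00, 0.02, -0.01]))

raw_gap = sharpe(list(v2.values())) - sharpe(list(v1.values()))
a, b, start, end = trim_to_common_range(v1, v2)
aligned_gap = sharpe(list(b.values())) - sharpe(list(a.values()))

print(f"unaligned Sharpe gap: {raw_gap:.2f}")
print(f"aligned {start}..{end} Sharpe gap: {aligned_gap:.2f}")
```

On this toy data the unaligned gap is positive and the aligned gap is exactly zero: version 2 looks better only because it skipped two losing bars. That is the extreme form of the 0.12 to 0.02 collapse described above.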

Why Walk-Forward Comparisons Are Especially Fragile

A single-period backtest is easy to align: same start, same end. Done.

Walk-forward backtests are harder because they have multiple sources of date slippage:

Training window length: if you change how many bars you use to calibrate the system before trading each window, the out-of-sample periods shift.

Lookback indicators: if version 2 adds a 200-day moving average that requires 200 bars to compute, its first valid trade date is ~200 trading days (roughly nine to ten calendar months) later than version 1's.

Data availability: if version 2 uses an additional data source (say, relative strength (RS) ratings) that only exists from 2010 onward, version 2 cannot start before 2010, so your comparison starts later even if both versions theoretically run from 2005.

Market gaps: Thai markets have data gaps around holidays and circuit-breaker suspensions that differ from US market calendars. A system that trades both markets in walk-forward will have different "valid window" counts depending on which calendar drives the splits.

Each of these shifts the actual comparison period without changing the labels. The dates in the summary table might say "2020-2025" for both, but the actual out-of-sample trades cover different calendar days.
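To make the warmup effect concrete, here is a small sketch that computes the first bar on which an indicator has a full window. The Monday-to-Friday calendar is a stand-in that ignores holidays and suspensions (which, as noted above, matter on real exchange calendars), and the 20-bar and 200-bar lookbacks are illustrative, not the actual system settings.

```python
from datetime import date, timedelta

def weekday_calendar(start, n_bars):
    """Hypothetical trading calendar: Monday-Friday only, holidays ignored."""
    days, d = [], start
    while len(days) < n_bars:
        if d.weekday() < 5:
            days.append(d)
        d += timedelta(days=1)
    return days

def first_valid_date(calendar, lookback_bars):
    """First bar on which an indicator needing `lookback_bars` bars is defined."""
    if not 1 <= lookback_bars <= len(calendar):
        raise ValueError("lookback must be between 1 and the calendar length")
    return calendar[lookback_bars - 1]

calendar = weekday_calendar(date(2020, 1, 1), 300)
v1_start = first_valid_date(calendar, 20)    # e.g. a 20-bar indicator
v2_start = first_valid_date(calendar, 200)   # adds a 200-bar moving average

gap_days = (v2_start - v1_start).days
print(f"v1 can trade from {v1_start}, v2 from {v2_start}")
print(f"gap: {gap_days} calendar days")
```

Both runs would still be labeled with the same start year, yet version 2's first possible trade sits months after version 1's. Logging this date per version is cheap and makes the shift visible.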

The Protocol

Before comparing any two walk-forward results:

1. Print the metadata for every window: start date, end date, n_trades, n_wins, and which market regime each window covers (bull/bear/flat). This takes 5 minutes of logging code to add.

2. Verify window counts match: if version A has 12 windows and version B has 11, they're not the same test. The missing window is not neutral — it's a different exposure to some market period.

3. Check the first and last trade date: the summary stats might say the same range, but the actual first trade in version B might be 3 months later if a new indicator added warmup requirements.

4. Compare by window, not in aggregate: rather than comparing "overall Sharpe 0.85 vs 0.72," compare window-by-window: Window 1: A got 0.8, B got 0.7. Window 2: A got 1.2, B got 1.1. This forces you to notice if windows are missing.

5. Run both on the exact same date range as a sanity check: even if your final test uses the full available history, run a constrained version where both share the same strict start date and confirm the comparison holds.
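Steps 1 through 4 can be sketched as a small metadata check. This is an illustration under assumptions: the `Window` fields, the function names, and the 30-day tolerance on first and last trade dates are all hypothetical, and the dates and trade counts are made up. The per-window Sharpe values reuse the 0.8/0.7 and 1.2/1.1 example from step 4.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Window:
    """Metadata for one out-of-sample window (field names are hypothetical)."""
    start: date
    end: date
    first_trade: date
    last_trade: date
    n_trades: int
    n_wins: int
    sharpe: float

def check_comparable(run_a, run_b, max_trade_gap_days=30):
    """Return reasons the two runs should not be compared in aggregate."""
    problems = []
    # Step 2: window counts must match.
    if len(run_a) != len(run_b):
        problems.append(f"window count: {len(run_a)} vs {len(run_b)}")
    # Step 1: each paired window must cover the same dates.
    for i, (a, b) in enumerate(zip(run_a, run_b), start=1):
        if (a.start, a.end) != (b.start, b.end):
            problems.append(
                f"window {i} dates: {a.start}..{a.end} vs {b.start}..{b.end}")
    # Step 3: first and last actual trades should be close to each other.
    if run_a and run_b:
        for label, da, db in (
            ("first trade", run_a[0].first_trade, run_b[0].first_trade),
            ("last trade", run_a[-1].last_trade, run_b[-1].last_trade),
        ):
            if abs((da - db).days) > max_trade_gap_days:
                problems.append(f"{label}: {da} vs {db}")
    return problems

def by_window(run_a, run_b):
    """Step 4: pair Sharpe by window dates; None marks a missing window."""
    a = {(w.start, w.end): w.sharpe for w in run_a}
    b = {(w.start, w.end): w.sharpe for w in run_b}
    return [(k, a.get(k), b.get(k)) for k in sorted(set(a) | set(b))]

# Made-up metadata: version 2's first window starts late because of warmup.
v1 = [
    Window(date(2020, 7, 1), date(2020, 12, 31),
           date(2020, 7, 6), date(2020, 12, 28), 14, 6, 0.8),
    Window(date(2021, 1, 1), date(2021, 6, 30),
           date(2021, 1, 5), date(2021, 6, 25), 11, 6, 1.2),
]
v2 = [
    Window(date(2020, 9, 15), date(2020, 12, 31),
           date(2020, 9, 21), date(2020, 12, 29), 9, 5, 0.7),
    Window(date(2021, 1, 1), date(2021, 6, 30),
           date(2021, 1, 6), date(2021, 6, 24), 12, 7, 1.1),
]

problems = check_comparable(v1, v2)
rows = by_window(v1, v2)
for p in problems:
    print("MISMATCH", p)
for (start, end), sa, sb in rows:
    print(start, end, sa, sb)
```

Keying the per-window table on actual dates rather than on window number is the design choice that matters here: a shifted window shows up as two half-empty rows instead of silently pairing with its neighbor.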

This protocol adds 30 minutes to any comparison. It has saved me from at least three incorrect "this version is better" conclusions.

The Broader Lesson

Walk-forward testing is the most honest way to evaluate a trading system. It's also the most complex, and complexity creates surface area for subtle errors.

The error I'm describing isn't about bad intentions or sloppy work. It's about the gap between what the analysis labels say and what the data actually covers. The labels say "2020-2025." The data might cover "2020-2025 excluding the first 5 months for version B." These look the same on paper and produce results that are not comparable.

Verify the metadata before trusting the summary. The summary is an aggregate of windows you might not have counted correctly.

Track. Study. Wait. Strike.

MOEasymmetry

Draft 2026-06-12. Source: 6-window WF comparison on Thai paper system, 2020-2025. Version 2 advantage (0.12 Sharpe) disappeared after aligning start dates. Pattern holds broadly: any WF comparison requires window-level metadata verification before conclusions are trusted.