Why My Backtest Number Got Smaller When I Made the Test Harder

⚠️ Personal research and trading journal — not investment advice. The author does not provide licensed advisory services.

One of the counterintuitive lessons from quantitative testing is that a smaller number from a harder test is more trustworthy than a bigger number from a softer test.

I learned this specifically from the volume-pop study on Thai stocks.

The Setup

My base method is the contracting-base breakout — stocks that form tight bases with higher lows, then break through the pivot point. The volume-pop hypothesis is that requiring a 1.5× volume increase on the breakout day improves outcomes: if the stock breaks quietly, skip it; if real buying shows up, take it.
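The gate itself is simple enough to write down. Here is a minimal sketch, assuming the 1.5× multiple is measured against the average volume of the preceding 50 sessions. The baseline window and the function name `volume_pop` are illustrative assumptions; the text above only fixes the multiple.

```python
from statistics import mean

def volume_pop(volumes, breakout_idx, lookback=50, multiple=1.5):
    """Return True if breakout-day volume is at least `multiple` times
    the average volume of the preceding `lookback` sessions.

    ASSUMPTION: the 50-session baseline is chosen for illustration only.
    """
    start = breakout_idx - lookback
    if start < 0:
        return False  # not enough history to form a baseline
    baseline = mean(volumes[start:breakout_idx])  # excludes the breakout day
    return baseline > 0 and volumes[breakout_idx] >= multiple * baseline

# Toy series: 50 quiet sessions at 1,000,000 shares, then the breakout day.
quiet = [1_000_000] * 50
print(volume_pop(quiet + [1_600_000], breakout_idx=50))  # True: 1.6x baseline
print(volume_pop(quiet + [1_200_000], breakout_idx=50))  # False: 1.2x baseline
```

The breakout day is kept out of its own baseline so that a large print cannot raise the bar it is measured against.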

I described the core finding in an earlier article. Volume-pop improves performance in Thailand. It degrades performance in the US. The results in both directions are real.

But when a user asked me whether the Thai finding was believable — whether I could make the test harder and still see the result — I went back and ran additional stress tests.

The original pooled result: +3.60% improvement in mean forward return with the volume-pop gate applied.

That's the number that looked strong.

What Harder Tests Show

Test 1: Regime gate

I split the results by market condition. Volume-pop only works in a Confirmed Uptrend (SET index above 50d MA, 50d above 200d). In correction conditions, the improvement disappears. The Confirmed Uptrend confidence interval: [+0.10%, +1.60%]. The correction CI spans zero — no signal.
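The regime label can be sketched as follows, assuming simple moving averages on daily index closes. Whether the averages are simple or exponential is not stated above, and the names `sma` and `is_confirmed_uptrend` are hypothetical.

```python
def sma(values, window, end):
    """Simple moving average of the `window` values ending at index `end` (inclusive).
    Returns None when there is not enough history."""
    if end + 1 < window:
        return None
    return sum(values[end + 1 - window:end + 1]) / window

def is_confirmed_uptrend(index_closes, day):
    """Confirmed Uptrend: close above the 50d MA, and the 50d MA above the 200d MA.
    ASSUMPTION: simple (not exponential) moving averages."""
    ma50 = sma(index_closes, 50, day)
    ma200 = sma(index_closes, 200, day)
    if ma50 is None or ma200 is None:
        return False
    return index_closes[day] > ma50 > ma200

# Toy index series, not real SET data.
rising = [100 + 0.5 * i for i in range(250)]
falling = [225 - 0.5 * i for i in range(250)]
print(is_confirmed_uptrend(rising, 249))   # True
print(is_confirmed_uptrend(falling, 249))  # False
```

Each breakout would then be tagged with the regime on its entry date, and the gated versus ungated comparison run separately within each tag.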

The improvement is real, but it's regime-gated. The pooled +3.60% averaged a genuine signal in uptrends with zero signal in corrections, so the headline figure suggests an edge that applies everywhere when it is only available in one regime.
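One way intervals like the per-regime ones quoted above could be produced is a percentile bootstrap on the difference in mean forward return between gated and ungated trades. This is a sketch under that assumption, not necessarily the method behind the figures above, and the sample returns are invented for illustration.

```python
import random
from statistics import mean

def bootstrap_diff_ci(gated, ungated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(gated) - mean(ungated).
    Each group is resampled with replacement, independently of the other."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        g = rng.choices(gated, k=len(gated))
        u = rng.choices(ungated, k=len(ungated))
        diffs.append(mean(g) - mean(u))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented forward returns (%), one list per group within each regime.
uptrend_ci = bootstrap_diff_ci([3.0, 4.0, 5.0, 3.5, 4.5], [1.0, 2.0, 1.5, 2.5, 0.5])
correction_ci = bootstrap_diff_ci([-2.0, 2.0, -1.0, 1.0, 0.0], [-2.0, 2.0, -1.0, 1.0, 0.0])

for name, (lo, hi) in [("uptrend", uptrend_ci), ("correction", correction_ci)]:
    verdict = "spans zero" if lo <= 0 <= hi else "excludes zero"
    print(f"{name}: [{lo:+.2f}, {hi:+.2f}] {verdict}")
```

The reading is the same as in the text: an interval that excludes zero is a signal worth keeping, and one that spans zero is not.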

Test 2: Walk-forward

The pooled mean comes from all years combined. But some years are exceptional — 2009 (recovery), 2014 (SET bull run), 2020 (post-COVID surge). These years have strong momentum across the board, and the volume-pop signal is especially strong in them. They inflate the pooled average.

In walk-forward testing — where each out-of-sample window is evaluated independently — the story changes:

60% of out-of-sample years were positive (12 of 20, drawn from the 1990-2025 data)
Median annual improvement: +1.9% (not +3.60%)
The bulk of the pooled mean came from the top 3 years. Remove them: the result becomes marginal.
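The three summary figures above can be computed from a table of per-year improvements. This is a minimal sketch; the function name `walk_forward_summary` and the ten yearly values are made up to show the mechanics and are not the study's data.

```python
from statistics import mean, median

def walk_forward_summary(yearly_improvement, drop_top=3):
    """Summarize per-year out-of-sample improvements (in percentage points).

    yearly_improvement: dict mapping a year label to that year's improvement.
    drop_top: how many of the best years to remove for the trimmed mean.
    """
    values = sorted(yearly_improvement.values())
    trimmed = values[:-drop_top] if drop_top else values
    return {
        "share_positive": sum(v > 0 for v in values) / len(values),
        "mean": mean(values),
        "median": median(values),
        "mean_without_top": mean(trimmed),
    }

# Invented numbers: seven ordinary years and three exceptional ones.
toy = {"y1": -1.0, "y2": 2.0, "y3": 12.0, "y4": 1.0, "y5": -2.0,
       "y6": 3.0, "y7": 15.0, "y8": -0.5, "y9": 18.0, "y10": 2.5}

summary = walk_forward_summary(toy)
print(summary["share_positive"])              # 0.7
print(summary["mean"])                        # 5.0
print(summary["median"])                      # 2.25
print(round(summary["mean_without_top"], 2))  # 0.71
```

In the toy data the mean is more than twice the median, and dropping the three best years leaves a mean below one point. That is the same pattern described above, where a few exceptional years carry the pooled average.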

The walk-forward median is honest. The pooled mean was flattering.

Why the Smaller Number Is Better

Here's the unintuitive part: the +1.9% walk-forward (WF) median is more trustworthy than the +3.60% pooled mean. Not because it's bigger (it isn't), but because it survived:

A regime split that could have nullified it
20 independent out-of-sample windows that could have shown inconsistency
The removal of three exceptional years that inflated the pooled result

A result that is still positive after all of those tests has earned more trust than one that was never challenged. The drop from +3.60% to +1.9% is the test telling you that about 1.7 percentage points of the original number came from favorable conditions that won't always apply.

The 1.9% is what you should expect in a typical year. The 3.60% was what you got when you averaged together the typical years and the exceptional ones.

What This Means for Interpreting Backtests

There are two things you can do to a backtest number, and they push it in opposite directions:

Refine the method: Add filters, tune parameters, select lookback windows. This inflates the number by fitting the method to the historical data. The improvement may not repeat.

Test the fixed method harder: Run it through walk-forward windows. Split by regime. Remove the top-N outlier years. Re-test on out-of-sample (OOS) data. The number often shrinks, but what remains is more likely to represent real edge.

The first approach is how most retail traders improve their backtests. The second approach is what I try to do.

When a number shrinks under harder testing, that is not a failure. That is the test working correctly — filtering out the lucky-sample component of the original estimate and leaving the replicable core.

The volume-pop improvement on Thai stocks went from +3.60% to +1.9% under harder tests. I trust the +1.9% more. It's the number I'm willing to rely on for capital decisions.

The Practical Rule

Before citing any backtest result, I now ask: what would this number look like under walk-forward? Regime-split? Drop-top-3?

If the result collapses under those tests, the original number was flattering noise.

If the result shrinks but survives — as volume-pop did — you have something. It's smaller than it looked, but it's real.