⚠️ Personal research and trading journal — not investment advice. The author does not provide licensed advisory services.
One of the counterintuitive lessons from quantitative testing is that a smaller number from a harder test is more trustworthy than a bigger number from a softer test.
I learned this specifically from the volume-pop study on Thai stocks.
The Setup
My base method is the contracting-base breakout — stocks that form tight bases with higher lows, then break through the pivot point. The volume-pop hypothesis is that requiring a 1.5× volume increase on the breakout day improves outcomes: if the stock breaks quietly, skip it; if real buying shows up, take it.
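As a minimal sketch of the gate described above: the text does not say what the 1.5× is measured against, so this example assumes the average volume of the prior 50 sessions. The function name, the 50-session lookback, and the sample volumes are illustrative assumptions, not the code behind the study.

```python
from statistics import mean

def passes_volume_pop(volumes, breakout_idx, multiple=1.5, lookback=50):
    """Return True if breakout-day volume is at least `multiple` times
    the average volume of the preceding `lookback` sessions.

    The 50-session baseline is an assumption; the text only states 1.5x.
    """
    if breakout_idx < lookback:
        return False  # not enough history to form a baseline
    baseline = mean(volumes[breakout_idx - lookback:breakout_idx])
    return volumes[breakout_idx] >= multiple * baseline

# Illustrative data: 50 quiet sessions, then a breakout day.
quiet = [1_000_000] * 50
print(passes_volume_pop(quiet + [1_600_000], breakout_idx=50))  # True
print(passes_volume_pop(quiet + [1_200_000], breakout_idx=50))  # False
```

A different baseline (a 20-session average, a median, or the prior day) would change which breakouts pass, so the baseline is part of the hypothesis and should stay fixed across every test.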
I described the core finding in an earlier article. Volume-pop improves performance in Thailand. It degrades performance in the US. The results in both directions are real.
But when a user asked me whether the Thai finding was believable — whether I could make the test harder and still see the result — I went back and ran additional stress tests.
The original pooled result: +3.60% improvement in mean forward return with the volume-pop gate applied.
That's the number that looked strong.
What Harder Tests Show
Test 1: Regime gate
I split the results by market condition. Volume-pop only works in a Confirmed Uptrend (SET index above 50d MA, 50d above 200d). In correction conditions, the improvement disappears. The Confirmed Uptrend confidence interval: [+0.10%, +1.60%]. The correction CI spans zero — no signal.
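The Confirmed Uptrend rule above can be sketched as follows. This assumes simple moving averages on daily closes of the index; the function names and the "other" label for every non-uptrend state are illustrative choices, and the synthetic price series are made up for the demonstration.

```python
def sma(values, window):
    """Simple moving average of the last `window` values, or None if too short."""
    if len(values) < window:
        return None
    return sum(values[-window:]) / window

def market_regime(index_closes):
    """Label the latest session using the rule in the text:
    index close above its 50-day MA, and 50-day MA above the 200-day MA.
    Everything else is lumped into "other" (an assumption for this sketch).
    """
    ma50 = sma(index_closes, 50)
    ma200 = sma(index_closes, 200)
    if ma50 is None or ma200 is None:
        return "unknown"  # fewer than 200 closes available
    if index_closes[-1] > ma50 > ma200:
        return "confirmed_uptrend"
    return "other"

# Synthetic index series: one steadily rising, one steadily falling.
rising = [1000 + i for i in range(250)]
falling = [1500 - i for i in range(250)]
print(market_regime(rising))   # confirmed_uptrend
print(market_regime(falling))  # other
```

Each breakout would be tagged with the regime label on its breakout date, using only closes up to that date, before the gated and ungated returns are compared within each bucket.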
The improvement is real, but it's regime-gated. The pooled +3.60% figure blends a genuine signal in uptrends with no detectable signal in corrections, so a single pooled number hides the fact that the edge is only available part of the time.
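One way to check whether an interval "spans zero" within a regime is a percentile bootstrap on the difference in mean forward return. This is a sketch under assumptions: the text does not say how its confidence intervals were computed, so the bootstrap, the 95% level, and the sample returns below are all illustrative.

```python
import random
from statistics import mean

def bootstrap_ci_diff(gated, ungated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(gated) - mean(ungated)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_boot):
        g = rng.choices(gated, k=len(gated))      # resample with replacement
        u = rng.choices(ungated, k=len(ungated))
        diffs.append(mean(g) - mean(u))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

def spans_zero(ci):
    """True if the interval contains zero, i.e. no detectable effect."""
    lo, hi = ci
    return lo <= 0.0 <= hi

# Made-up forward returns (%) for trades in a single regime.
uptrend_gated = [4.1, -2.0, 6.3, 1.2, 3.8, -0.5, 5.0, 2.2]
uptrend_all = [1.0, -3.0, 2.5, 0.4, -1.2, 3.1, 0.0, 1.8]
ci = bootstrap_ci_diff(uptrend_gated, uptrend_all)
print(ci, spans_zero(ci))
```

Running the same function separately on uptrend trades and on correction trades gives one interval per regime, which is the comparison the regime split relies on.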
Test 2: Walk-forward
The pooled mean comes from all years combined. But some years are exceptional — 2009 (recovery), 2014 (SET bull run), 2020 (post-COVID surge). These years have strong momentum across the board, and the volume-pop signal is especially strong in them. They inflate the pooled average.
In walk-forward testing — where each out-of-sample window is evaluated independently — the story changes:
- 60% of out-of-sample years were positive (12 of the 20 yearly windows, drawn from the 1990-2025 sample)
- Median annual improvement: +1.9% (not +3.60%)
- The bulk of the pooled mean came from the top 3 years. Remove them: the result becomes marginal.
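The three statistics in the list above can be computed from a table of per-year improvements. The sketch below assumes one number per out-of-sample year (gated mean return minus ungated mean return, in percent); the function name and the five sample years are invented for illustration and are not the study's data.

```python
from statistics import mean, median

def walk_forward_summary(yearly_improvement, drop_top=3):
    """Summarize per-year improvements (in %).

    `yearly_improvement` maps year -> improvement for that out-of-sample year.
    `drop_top` is how many of the best years to remove for the trimmed mean.
    """
    values = sorted(yearly_improvement.values())
    trimmed = values[:-drop_top] if drop_top else values
    return {
        "share_positive": sum(v > 0 for v in values) / len(values),
        "median": median(values),
        "mean": mean(values),
        "mean_without_top": mean(trimmed),
    }

# Made-up example: four ordinary years and one exceptional year.
sample = {2001: -1.0, 2002: 0.5, 2003: 1.0, 2004: 2.0, 2005: 12.0}
print(walk_forward_summary(sample, drop_top=1))
```

In the made-up sample, the single exceptional year lifts the mean well above the median, and dropping it pulls the mean back down. That is the same pattern the walk-forward results show at a larger scale.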
The walk-forward median is honest. The pooled mean was flattering.
Why the Smaller Number Is Better
Here's the unintuitive part: the +1.9% WF median is more trustworthy than the +3.60% pooled mean. Not because it's bigger — it isn't — but because it survived:
- A regime split that could have nullified it
- 20 independent out-of-sample windows that could have shown inconsistency
- The removal of three exceptional years that inflated the pooled result
A result that is still positive after all of those tests has earned more trust than one that was never challenged. The drop from 3.60% to 1.9% is the test telling you that roughly 1.7 percentage points of the original number came from favorable conditions that won't always apply.
The 1.9% is what you should expect in a typical year. The 3.60% was what you got when you averaged together the typical years and the exceptional ones.
What This Means for Interpreting Backtests
There are two ways to work on a backtest number, and they push it in opposite directions:
Making the method more refined: Add filters, tune parameters, select lookback windows. This inflates the number by fitting the method to the historical data. The improvement may not repeat.
Harder tests on the fixed method: Run it through walk-forward windows. Split by regime. Remove the top-N outlier years. Re-test on OOS data. The number often shrinks — but what remains is more likely to represent real edge.
The first approach is how most retail traders improve their backtests. The second approach is what I try to do.
When a number shrinks under harder testing, that is not a failure. That is the test working correctly — filtering out the lucky-sample component of the original estimate and leaving the replicable core.
The volume-pop improvement on Thai stocks went from +3.60% to +1.9% under harder tests. I trust the +1.9% more. It's the number I'm willing to rely on for capital decisions.
The Practical Rule
Before citing any backtest result, I now ask: what would this number look like under walk-forward? Regime-split? Drop-top-3?
If the result collapses under those tests, the original number was flattering noise.
If the result shrinks but survives — as volume-pop did — you have something. It's smaller than it looked, but it's real.