TL;DR

Not a single strategy from my 80-persona competition passed my standard validation gates. The best Sharpe was 0.38 - my existing portfolio starts at 0.64. Before running the finals, I needed to understand why. The answer was uncomfortable.

Time for an uncomfortable truth.

I just ran 28 strategies through the gauntlet. Klarman won at 11.1% CAGR. Sounds great. Except:

Not a single competition strategy passed my standard validation gates.

My minimum Sharpe threshold is 0.50. The best competition Sharpe was 0.38. My existing portfolio - the one built through 45 sessions of “biased” directed research - contains strategies with Sharpe ratios of 0.64, 0.73, 0.87. The best strategy in this elaborate 80-persona competition wouldn’t even qualify for the bench.

Before running the finals, I needed to understand why. A mid-competition autopsy.

The Gap

My existing portfolio:

| Strategy | What It Does | OOS Sharpe |
|---|---|---|
| S310 | Quality stocks + regime timing + put selling | 0.87 |
| S347 | Credit signal + dispersion + quality stocks | 0.83 |
| S319 | Similar to S310, sector-diversified | 0.73 |
| S341 | Dispersion timing + sector pairs | 0.64 |

The competition leaders:

| Strategy | What It Does | OOS Sharpe |
|---|---|---|
| Klarman | Cheap stocks with quality floor | 0.38 |
| Zweig | Yield curve + momentum timer | 0.37 |
| Tudor Jones | 200-day trend multi-asset | 0.35 |
| Muller | Sector mean-reversion | 0.33 |

Every validated strategy has 2-3 layers. Every competition strategy has 1. That’s the pattern.

Why Everybody Was Bad

Seven root causes. In order of how much they hurt.

1. The Prompt Said “Design ONE Strategy”

This is the original sin.

When I told 80 personas to design their “single best strategy,” they each picked one approach: pure value, or pure momentum, or pure macro timing. Nobody submitted a strategy combining a stock selection layer with a regime timing layer with an options overlay - even though that’s exactly what my best strategies do.

S310 - my top performer at 0.87 Sharpe - is literally “Buffett’s stock picks + Zweig’s timing logic + put selling on top.” It took 45 sessions and 310 iterations to discover that composite. I asked 80 experts to reinvent it in one shot. Of course they couldn’t.

The edge isn’t in the idea. It’s in the composition. A 0.30 Sharpe selection layer plus a 0.30 Sharpe timing layer plus a 0.20 Sharpe options layer can compose to 0.65+. But only if someone thinks to stack them.
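The arithmetic behind layer stacking is worth making explicit. Here's a toy sketch, assuming equal-volatility, equal-weight layers with a uniform pairwise correlation (all numbers illustrative): merely uncorrelated layers with those Sharpes combine to roughly 0.46, and getting to 0.65+ requires the layers to partially hedge each other - plausible, since a regime timing layer pays off exactly when the selection layer draws down.

```python
import numpy as np

def combined_sharpe(sharpes, corr):
    """Sharpe of an equal-weight stack of unit-volatility return streams.

    sharpes : per-layer Sharpe ratios (unit vol => mean return == Sharpe)
    corr    : uniform pairwise correlation between layers (assumed)
    """
    s = np.asarray(sharpes, dtype=float)
    n = len(s)
    cov = np.full((n, n), corr)
    np.fill_diagonal(cov, 1.0)
    mu = s.sum()                                   # stacked mean return
    vol = np.sqrt(np.ones(n) @ cov @ np.ones(n))   # stacked volatility
    return mu / vol

layers = [0.30, 0.30, 0.20]          # selection, timing, options layers
print(round(combined_sharpe(layers, 0.0), 2))     # uncorrelated: ~0.46
print(round(combined_sharpe(layers, -0.25), 2))   # mild hedging: ~0.65
```

The point survives the simplification: no single layer clears a 0.50 Sharpe gate, but the stack does - and only if someone thinks to build the stack.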

2. Nobody Used Options

My three best strategies all sell cash-secured puts on quality stocks. This harvests the variance risk premium - implied volatility systematically exceeds realized volatility. It’s worth roughly 0.15-0.25 Sharpe on top of whatever the equity layer delivers.

Zero competition strategies used options. Zero.

The prompt warned that multi-leg options don’t work on the platform. Apparently that discouraged all exploration - nobody thought to try single-leg puts, which is my best edge. I emphasized the limitation rather than the opportunity.
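The variance risk premium is easy to see in a toy Black-Scholes simulation: price a one-month ATM put at a 20% implied vol, then let the stock realize only 16% vol. The premium collected systematically exceeds the average payoff. All numbers here are illustrative assumptions, not taken from the live strategies.

```python
import numpy as np
from math import erf, log, sqrt, exp

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_put(S, K, T, sigma, r=0.0):
    """Black-Scholes European put price."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return K * exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)

rng = np.random.default_rng(0)
S0, K, T = 100.0, 100.0, 30 / 365
iv, rv = 0.20, 0.16                  # implied vol > realized vol (the VRP)
premium = bs_put(S0, K, T, iv)       # what the put seller collects

# simulate terminal prices under the *realized* volatility (zero drift)
z = rng.standard_normal(200_000)
ST = S0 * np.exp(-0.5 * rv**2 * T + rv * sqrt(T) * z)
payoff = np.maximum(K - ST, 0.0)

edge = premium - payoff.mean()       # expected P&L per put from the IV-RV gap
print(f"premium={premium:.3f}  avg payoff={payoff.mean():.3f}  edge={edge:.3f}")
```

A ~4-point implied-vs-realized gap turns into roughly half a dollar of expected profit per $100-strike put per month - small, steady, and exactly the kind of layer that adds 0.15-0.25 Sharpe on top of an equity book.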

3. Translation Loss

Klarman’s design actually had three layers. The core thesis: identify stocks experiencing forced selling - prices down but fundamentals intact. Brilliant. But QuantConnect’s fundamental filter doesn’t expose price history, so the entire forced-selling logic was dropped during implementation. What I actually tested was a generic cheapness screen.

The dumbed-down version still delivered 11.1% CAGR. What could the real one have done?

Every strategy lost nuance in translation from natural language to code. In my normal workflow, I design and code interactively - if the API can’t do something, I immediately pivot. The competition’s one-directional pipeline had no feedback loop.

4. Wrong Scoring Metric

I ranked by CAGR. My real portfolio optimizes for Sharpe, then applies leverage to hit the return target.

A 0.80 Sharpe strategy at 2x leverage is far superior to a 0.40 Sharpe strategy that happens to produce the same raw CAGR. The first is a repeatable edge. The second got lucky. By ranking on CAGR, I told 80 experts to swing for the fences. Swinging for the fences is how you get five futures strategies that lose 100%+.
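The leverage arithmetic is simple but worth spelling out. With toy numbers (assumed for illustration, ignoring financing costs): leverage scales excess return and volatility together, so Sharpe is the lever-invariant quantity - you can always buy more CAGR with a high-Sharpe strategy, but you can't buy more Sharpe.

```python
def levered(excess_return, vol, leverage):
    """Leverage multiplies excess return and volatility together."""
    return excess_return * leverage, vol * leverage

# Strategy A: Sharpe 0.80 (8% excess on 10% vol), run at 2x leverage
ret_a, vol_a = levered(0.08, 0.10, 2.0)

# Strategy B: Sharpe 0.40 (16% excess on 40% vol), unlevered
ret_b, vol_b = 0.16, 0.40

print(ret_a, ret_b)                   # identical 16% excess return...
print(ret_a / vol_a, ret_b / vol_b)   # ...but Sharpe 0.80 vs 0.40
```

Ranked by CAGR, A and B tie. Ranked by Sharpe, A wins at any return target - it reaches B's return with half the volatility.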

5. No Iteration

My validated strategies emerged from ~450 attempts. A 14% hit rate. The path to any good strategy looked like: test signal in isolation (Sharpe 0.25) -> add timing layer (0.35) -> swap to quality stocks (0.50) -> add options overlay (0.65) -> tune thresholds (0.70+). Each step built on the last.

The competition gave each persona one attempt. That's like asking a sculptor to produce a masterpiece by describing it in words, having someone else carve it, and judging the result with no revisions.

6. Futures Were Platform-Broken

Five of six futures strategies exploded. These aren’t bad strategies - Carver literally wrote the textbook on trend following, Dennis’s Turtles made hundreds of millions. The problem is QuantConnect’s data normalization across contract rolls creates discontinuities that leveraged strategies amplify into catastrophe. I lost 20% of the field to platform issues, not strategy issues.
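The roll discontinuity failure mode is mechanical. A hypothetical illustration (not QuantConnect's actual data, just the shape of the problem): if a futures series is spliced across a roll without back-adjustment, the price gap between contracts shows up as a fake one-day "return" - and leverage turns that artifact into a drawdown.

```python
import numpy as np

# Hypothetical contracts: the next contract trades ~3% below the expiring one
front = np.array([100.0, 101.0, 102.0])
back  = np.array([ 99.0,  99.5, 100.0])
spliced = np.concatenate([front, back])   # naive splice, no roll adjustment

rets = np.diff(spliced) / spliced[:-1]
worst = rets.min()                        # the roll gap as a fake return
print(f"fake roll return: {worst:.2%}, at 5x leverage: {5 * worst:.2%}")
```

A ~3% contango gap becomes a ~15% hit at 5x leverage, several times a year, on a signal that never existed. No trend model survives that, regardless of how sound the underlying strategy is.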

7. Fundamental Universe Was Compromised

Seven strategies used dynamic stock universes. The first one timed out after 21 minutes. I had to reduce candidate pools from 300 stocks to 30-50 to get them running - compromising the selection logic. Two strategies (Buffett, Fama) still crashed.

The Uncomfortable Punchline

Here’s what this competition actually revealed:

The bias wasn’t the problem. The bias was the solution.

I kept gravitating toward macro timing + stock selection + options because that’s where the composable edges actually are. The competition confirmed it - by failing to find them any other way.

Forty-five sessions of “biased” directed research produced strategies with Sharpe ratios north of 0.60. Eighty “unbiased” independent designs produced zero that passed validation.

The difference isn’t knowledge - every persona had the right ideas. It’s process:

  • Iterative refinement beats one-shot design
  • Multi-layer composition beats single signals
  • Platform-aware implementation beats platform-agnostic design
  • Sharpe optimization beats direct CAGR targeting

What Happens Now

The finals are running. But I’m not running them the way I originally planned. The original design - let the top 12 iterate once, re-test - would produce marginal improvements at best.

Instead, I’m restructuring. Sharing everything I learned with the personas. Explicitly telling them: layer stacking works. Options overlays work. Sharpe matters more than CAGR. Let the top performers compose their ideas with each other.

Klarman’s stock selection + Zweig’s timing + a put-selling overlay = ???

That’s the real experiment now. And whatever comes out of it informs how I design Tournament II - a full redesign of the competition format, built on everything the first one taught me.

The competition didn’t find the alpha. It found something more valuable: it found why my process works, and exactly how to make the next one better.

The Grand Strategy Competition finals are underway. Results coming soon.