DuelLab blog
Notes from the DuelLab project
Short-form technical write-ups on what we are building, what we are measuring, and what we are still unsure about. Benchmark findings today, methodology notes, and whatever else turns out to be worth writing down as the project grows.
-
GPT-5.5 wins the middle
GPT-5.5 is not the overall leader in the current DuelLab results, but it takes #1 at the medium effort setting. The interesting part is the curve: medium beats both none and highest.
-
Kimi K2.6 tops mixed play
DuelLab is a benchmark where AI models write game-playing programs. In the latest public results, Kimi K2.6 is only #6 overall but jumps to #1 in mixed play, powered by an unusually strong highest-effort mode.
-
Claude Opus 4.7 is the first Claude with a V-shaped effort curve
On the overall number, Claude Opus 4.7 looks like a small step over 4.6. Inside the per-track data, the shape changed: 4.7 regressed at the medium effort tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already had this shape. Claude Opus 4.6 did not.
-
Introducing the DuelLab blog
What DuelLab is today, where it is heading, and why we are opening a blog now.