GPT-5.5 wins the middle

DuelLab recently benchmarked ChatGPT 5.5, which appears in its result tables as GPT-5.5. The headline is not that it is the overall leader: in the current results, GPT-5.5 sits at #3 overall. The interesting part is the shape of its curve. It is #1 at the medium setting while sitting lower at both none and highest. Those three settings control roughly how much extra thinking time the model is allowed to use.

TL;DR.

1. GPT-5.5 is #3 overall in the current DuelLab public results, with an average relative per-game score of 66.1.

2. On the medium setting it jumps to #1, scoring 75.0. Its no-extra-thinking score is 60.3 and its highest-setting score is 62.9.

3. That is a different curve from GPT-5.4 and Claude Opus 4.7, where the highest setting is much stronger. GPT-5.5's best result is the middle.

The numbers

Scores below are from the DuelLab results generated on 2026-05-09 at 04:36 UTC. Each score converts match results into a 0-100 number using a rating system similar to chess ratings, then averages those per-game numbers across the hidden game suite. Higher is better.

| Model | overall | none | medium | highest |
| --- | --- | --- | --- | --- |
| GPT-5.4 | 72.3 | 75.7 | 58.0 | 83.3 |
| Claude Opus 4.7 | 71.6 | 76.1 | 60.8 | 77.9 |
| GPT-5.5 | 66.1 | 60.3 | 75.0 | 62.9 |
| GPT-5.2 | 64.9 | 57.9 | 69.1 | 67.6 |
| Kimi K2.6 | 64.5 | 63.4 | 61.6 | 68.6 |
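DuelLab does not publish the exact rating formula, only that it resembles chess ratings. As a rough sketch, here is what an Elo-style conversion from match results to a 0-100 relative score could look like; the K-factor, starting rating, and min-max scaling are all assumptions, not DuelLab's actual method.

```python
# Hypothetical Elo-like scoring sketch. DuelLab's real system is unpublished;
# this assumes plain Elo updates followed by min-max scaling to 0-100.

def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update for player A. score_a is 1.0 (win), 0.5 (draw), 0.0 (loss)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    return rating_a + k * (score_a - expected_a)

def run_matches(players, matches, start=1500.0, k=32):
    """Play a list of (a, b, score_a) matches and return final ratings."""
    ratings = {p: start for p in players}
    for a, b, score_a in matches:
        ra, rb = ratings[a], ratings[b]
        ratings[a] = elo_update(ra, rb, score_a, k)
        ratings[b] = elo_update(rb, ra, 1.0 - score_a, k)
    return ratings

def to_0_100(ratings):
    """Min-max scale ratings to 0-100, a stand-in for a relative per-game score."""
    lo, hi = min(ratings.values()), max(ratings.values())
    span = (hi - lo) or 1.0
    return {p: round(100.0 * (r - lo) / span, 1) for p, r in ratings.items()}
```

For example, if A beats B twice and B beats C once, the scaled output puts A at 100, C at 0, and B in between. The per-game 0-100 numbers would then be averaged to get the overall column.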

The useful comparison is not just score versus score. It is how the scores move across the three settings. GPT-5.4 is strong at none, dips at medium, and then jumps at highest. Claude Opus 4.7 shows a similar dip-and-recover shape. GPT-5.5 does the opposite: its peak is the middle setting.
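That dip-versus-peak distinction can be made mechanical. A minimal sketch, using the per-setting scores copied from the table above:

```python
# Classify each model's curve across the none/medium/highest settings.
# Scores (none, medium, highest) are copied from the DuelLab table above.
scores = {
    "GPT-5.4":         (75.7, 58.0, 83.3),
    "Claude Opus 4.7": (76.1, 60.8, 77.9),
    "GPT-5.5":         (60.3, 75.0, 62.9),
    "GPT-5.2":         (57.9, 69.1, 67.6),
    "Kimi K2.6":       (63.4, 61.6, 68.6),
}

def shape(none, medium, highest):
    """Label the curve by where the medium setting sits relative to the ends."""
    if medium < none and medium < highest:
        return "dip at medium"
    if medium > none and medium > highest:
        return "peak at medium"
    return "monotonic"

for model, s in scores.items():
    print(f"{model}: {shape(*s)}")
```

Run on these numbers, GPT-5.4, Claude Opus 4.7, and Kimi K2.6 all come out as "dip at medium," while GPT-5.5 (and, more narrowly, GPT-5.2) come out as "peak at medium."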

Chart 1: The medium peak

[Chart: line chart of benchmark score (0-100) by setting for GPT-5.5, GPT-5.4, Claude Opus 4.7, and GPT-5.2 across none, medium, and highest. GPT-5.5 peaks at medium, while GPT-5.4 and Claude Opus 4.7 are stronger at highest than at medium.]
Chart 1. GPT-5.5 does not follow the same V-shaped pattern as GPT-5.4 and Claude Opus 4.7. Its best score is at the middle setting.

Where GPT-5.5 was strong

The medium setting is not a one-game fluke. GPT-5.5 scores high on several hidden games at medium: Game 01, Game 02, Game 04, Game 05, and Game 08 all come in above 78. The strongest individual medium score is Game 08 at 95.3.

| Game alias | none | medium | highest |
| --- | --- | --- | --- |
| Game 01 | 67.9 | 89.2 | 94.3 |
| Game 02 | 84.5 | 78.2 | 64.3 |
| Game 03 | 54.7 | 64.6 | 37.9 |
| Game 04 | 79.0 | 92.3 | 87.3 |
| Game 05 | 59.3 | 83.4 | 37.7 |
| Game 06 | 34.4 | 41.0 | 43.0 |
| Game 07 | 49.0 | 56.2 | 64.5 |
| Game 08 | 53.3 | 95.3 | 74.2 |

The weak side is just as visible. Game 06 stays low across all three settings, and the highest setting drops hard on Game 03 and Game 05. So this is not simply "more model thinking makes GPT-5.5 better." In these results, the middle setting is where the generated players look strongest.
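One quick way to summarize the per-game table is to count which setting wins each game. A small sketch over the numbers above:

```python
# Per-game GPT-5.5 scores copied from the table above: (none, medium, highest).
games = {
    "Game 01": (67.9, 89.2, 94.3),
    "Game 02": (84.5, 78.2, 64.3),
    "Game 03": (54.7, 64.6, 37.9),
    "Game 04": (79.0, 92.3, 87.3),
    "Game 05": (59.3, 83.4, 37.7),
    "Game 06": (34.4, 41.0, 43.0),
    "Game 07": (49.0, 56.2, 64.5),
    "Game 08": (53.3, 95.3, 74.2),
}

SETTINGS = ("none", "medium", "highest")

def best_setting(scores):
    """Return the setting with the highest score for one game."""
    return max(zip(SETTINGS, scores), key=lambda pair: pair[1])[0]

wins = {s: 0 for s in SETTINGS}
for game, scores in games.items():
    wins[best_setting(scores)] += 1

print(wins)  # → {'none': 1, 'medium': 4, 'highest': 3}
```

Medium is the best single setting on 4 of the 8 games, and even where highest wins (Game 01, Game 06, Game 07) the medium score is usually close behind; that per-game spread is what produces the medium peak in the averages.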

What we think this means

The simple takeaway is that GPT-5.5 has a very strong middle setting for this kind of task, where a model writes a program and that program has to play games. That matters because medium is the kind of setting many product teams actually use: high enough to ask for useful thought, but not always the most expensive or slowest option.

A cautious guess is that higher settings do not always mean better play. Giving the model more room to think can change the kind of program it writes. One setting may produce a simpler player that wins more often; another may produce a more ambitious player that looks clever but performs worse once the games are actually played.