# GPT-5.5 wins the middle
DuelLab recently benchmarked ChatGPT 5.5, which appears in the benchmark tables as
GPT-5.5. The headline is not that it is the overall leader. In the current results,
GPT-5.5 is #3 overall. The interesting part is the shape of its results: it is #1 at the
medium setting, while sitting lower at both none and
highest. Those three settings control roughly how much extra thinking time
the model is allowed to use.
## TL;DR
1. GPT-5.5 is #3 overall in the current DuelLab public results, with an average relative per-game score of 66.1.
2. On the medium setting it jumps to #1, scoring 75.0. Its no-extra-thinking score is 60.3 and its highest-setting score is 62.9.
3. That is a different curve from GPT-5.4 and Claude Opus 4.7, for which the highest setting is the strongest of the three. GPT-5.5's best result is the middle one.
## The numbers
Scores below are from the DuelLab results generated on 2026-05-09 at 04:36 UTC. Each score turns match results into a 0-100 number using a rating system similar to chess ratings, then combines results across the hidden game suite. Higher is better.
| Model | overall | none | medium | highest |
|---|---|---|---|---|
| GPT-5.4 | 72.3 | 75.7 | 58.0 | 83.3 |
| Claude Opus 4.7 | 71.6 | 76.1 | 60.8 | 77.9 |
| GPT-5.5 | 66.1 | 60.3 | 75.0 | 62.9 |
| GPT-5.2 | 64.9 | 57.9 | 69.1 | 67.6 |
| Kimi K2.6 | 64.5 | 63.4 | 61.6 | 68.6 |
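DuelLab does not publish its exact formula, but "a rating system similar to chess ratings" points at something Elo-like. The sketch below is our illustration only, not DuelLab's implementation: the 1500 starting rating, the K-factor of 16, and the min-max rescale to 0-100 are all assumptions.

```python
# Illustrative Elo-style scoring pass; NOT DuelLab's actual method.
# The 1500 anchor, K-factor, and min-max rescale are assumptions.

def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def rate(matches, k: float = 16.0) -> dict[str, float]:
    """matches: iterable of (a, b, score_a) with score_a in {0.0, 0.5, 1.0}."""
    ratings: dict[str, float] = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, 1500.0)
        rb = ratings.setdefault(b, 1500.0)
        ea = expected(ra, rb)                 # expected score for a
        ratings[a] = ra + k * (score_a - ea)  # over-performing raises the rating
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings

def to_0_100(ratings: dict[str, float]) -> dict[str, float]:
    """Min-max rescale ratings to a 0-100 score (one plausible mapping)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    return {m: 100.0 * (r - lo) / (hi - lo) for m, r in ratings.items()}
```

On this reading, each per-game score would come from matches within a single game, and the overall column would then combine results across settings and the hidden game suite.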
The useful comparison is not just score versus score; it is how the scores move across
the three settings. GPT-5.4 is strong at none, dips at
medium, and then jumps at highest. Claude Opus 4.7 has a
similar dip-and-recover shape. GPT-5.5 does the opposite: the middle setting is its
peak.
*Chart 1: The medium peak.*
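To make the shape concrete, here is a small check over the overall table above. The shape labels are ours, not DuelLab's.

```python
# (none, medium, highest) scores from the overall table above.
scores = {
    "GPT-5.4":         (75.7, 58.0, 83.3),
    "Claude Opus 4.7": (76.1, 60.8, 77.9),
    "GPT-5.5":         (60.3, 75.0, 62.9),
    "GPT-5.2":         (57.9, 69.1, 67.6),
    "Kimi K2.6":       (63.4, 61.6, 68.6),
}

for model, (none, medium, highest) in scores.items():
    if medium > none and medium > highest:
        shape = "middle peak"
    elif medium < none and medium < highest:
        shape = "dip and recover"
    else:
        shape = "roughly monotonic"
    print(f"{model:16} {shape}")
```

Run on these numbers, it tags GPT-5.4, Claude Opus 4.7, and Kimi K2.6 as dip-and-recover, and GPT-5.5 as a middle peak, with GPT-5.2 a milder middle peak.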
## Where GPT-5.5 was strong
The medium peak is not a one-game fluke. GPT-5.5 scores high on several hidden games at
medium: Game 01, Game 02, Game 04, Game 05, and Game 08 all come in above 78.
The strongest individual medium score is Game 08, at 95.3.
| Game alias | none | medium | highest |
|---|---|---|---|
| Game 01 | 67.9 | 89.2 | 94.3 |
| Game 02 | 84.5 | 78.2 | 64.3 |
| Game 03 | 54.7 | 64.6 | 37.9 |
| Game 04 | 79.0 | 92.3 | 87.3 |
| Game 05 | 59.3 | 83.4 | 37.7 |
| Game 06 | 34.4 | 41.0 | 43.0 |
| Game 07 | 49.0 | 56.2 | 64.5 |
| Game 08 | 53.3 | 95.3 | 74.2 |
The weak side is just as visible. Game 06 stays low across all three settings, and the highest setting drops hard on Game 03 and Game 05. So this is not simply "more model thinking makes GPT-5.5 better." In these results, the middle setting is where the generated players look strongest.
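One quick way to see both sides at once is to rank GPT-5.5's games by how much the highest setting gains or loses relative to medium, straight from the table above:

```python
# GPT-5.5 per-game scores from the table above: (none, medium, highest).
games = {
    "Game 01": (67.9, 89.2, 94.3),
    "Game 02": (84.5, 78.2, 64.3),
    "Game 03": (54.7, 64.6, 37.9),
    "Game 04": (79.0, 92.3, 87.3),
    "Game 05": (59.3, 83.4, 37.7),
    "Game 06": (34.4, 41.0, 43.0),
    "Game 07": (49.0, 56.2, 64.5),
    "Game 08": (53.3, 95.3, 74.2),
}

# Sort by (highest - medium): the biggest losses from extra thinking first.
for game, (none, medium, highest) in sorted(
    games.items(), key=lambda kv: kv[1][2] - kv[1][1]
):
    print(f"{game}: highest - medium = {highest - medium:+.1f}")
```

The bottom of that list is Game 05 (-45.7) and Game 03 (-26.7); the top is Game 07 (+8.3). Only three of the eight games improve when moving from medium to highest.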
## What we think this means
The simple takeaway is that GPT-5.5 has a very strong middle setting for this kind of
task, where a model writes a program and that program has to play games. That matters
because medium is the kind of setting many product teams actually use:
high enough to ask for useful thought, but not always the most expensive or slowest
option.
A cautious guess is that higher settings do not always mean better play. Giving the model more room to think can change the kind of program it writes. One setting may produce a simpler player that wins more often; another may produce a more ambitious player that looks clever but performs worse once the games are actually played.