# GPT-5.5 wins the middle
DuelLab recently benchmarked ChatGPT 5.5, which appears in the benchmark tables as
GPT-5.5. The headline is not that it is the overall leader. In the current results,
GPT-5.5 is #3 overall. The interesting part is the shape of its results: it is #1 at the
medium setting, while sitting lower at both none and
highest. Those three settings control roughly how much extra thinking time
the model is allowed to use.
## TL;DR
1. GPT-5.5 is #3 overall in the current DuelLab public results, with an average relative per-game score of 66.1.
2. On the medium setting it jumps to #1, scoring 75.0. Its no-extra-thinking score is 60.3 and its highest-setting score is 62.9.
3. That is a different curve from GPT-5.4 and Claude Opus 4.7, for which the highest setting is the strongest of the three. GPT-5.5's best result is the middle one.
## The numbers
Scores below are from the DuelLab results generated on 2026-05-09 at 04:36 UTC. Each score turns match results into a 0-100 number using a rating system similar to chess ratings, then combines results across the hidden game suite. Higher is better.
| Model | overall | none | medium | highest |
|---|---|---|---|---|
| GPT-5.4 | 72.3 | 75.7 | 58.0 | 83.3 |
| Claude Opus 4.7 | 71.6 | 76.1 | 60.8 | 77.9 |
| GPT-5.5 | 66.1 | 60.3 | 75.0 | 62.9 |
| GPT-5.2 | 64.9 | 57.9 | 69.1 | 67.6 |
| Kimi K2.6 | 64.5 | 63.4 | 61.6 | 68.6 |
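DuelLab does not publish its exact formula, but "a rating system similar to chess ratings" points at something Elo-like. The sketch below is our illustration only, not DuelLab's implementation: the 1500 starting rating, the K-factor of 16, and the min-max rescale to 0-100 are all assumptions.

```python
# Illustrative Elo-style scoring pass; NOT DuelLab's actual method.
# The 1500 anchor, K-factor, and min-max rescale are assumptions.

def expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def rate(matches, k: float = 16.0) -> dict[str, float]:
    """matches: iterable of (a, b, score_a) with score_a in {0.0, 0.5, 1.0}."""
    ratings: dict[str, float] = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, 1500.0)
        rb = ratings.setdefault(b, 1500.0)
        ea = expected(ra, rb)                 # expected score for a
        ratings[a] = ra + k * (score_a - ea)  # over-performing raises the rating
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings

def to_0_100(ratings: dict[str, float]) -> dict[str, float]:
    """Min-max rescale ratings to a 0-100 score (one plausible mapping)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    return {m: 100.0 * (r - lo) / (hi - lo) for m, r in ratings.items()}
```

On this reading, each per-game score would come from matches within a single game, and the overall column would then combine results across settings and the hidden game suite.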
The useful comparison is not just score versus score; it is how the scores move across
the three settings. GPT-5.4 is strong at none, dips at
medium, and then jumps at highest. Claude Opus 4.7 has a
similar dip-and-recover shape. GPT-5.5 does the opposite: the middle setting is its
peak.
*Chart 1: The medium peak.*
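To make the shape concrete, here is a small check over the overall table above. The shape labels are ours, not DuelLab's.

```python
# (none, medium, highest) scores from the overall table above.
scores = {
    "GPT-5.4":         (75.7, 58.0, 83.3),
    "Claude Opus 4.7": (76.1, 60.8, 77.9),
    "GPT-5.5":         (60.3, 75.0, 62.9),
    "GPT-5.2":         (57.9, 69.1, 67.6),
    "Kimi K2.6":       (63.4, 61.6, 68.6),
}

for model, (none, medium, highest) in scores.items():
    if medium > none and medium > highest:
        shape = "middle peak"
    elif medium < none and medium < highest:
        shape = "dip and recover"
    else:
        shape = "roughly monotonic"
    print(f"{model:16} {shape}")
```

Run on these numbers, it tags GPT-5.4, Claude Opus 4.7, and Kimi K2.6 as dip-and-recover, and GPT-5.5 as a middle peak, with GPT-5.2 a milder middle peak.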
## Where GPT-5.5 was strong
The medium peak is not a one-game fluke. GPT-5.5 scores high on several hidden games at
medium: Game 01, Game 02, Game 04, Game 05, and Game 08 all come in above 78.
The strongest individual medium score is Game 08, at 95.3.
| Game alias | none | medium | highest |
|---|---|---|---|
| Game 01 | 67.9 | 89.2 | 94.3 |
| Game 02 | 84.5 | 78.2 | 64.3 |
| Game 03 | 54.7 | 64.6 | 37.9 |
| Game 04 | 79.0 | 92.3 | 87.3 |
| Game 05 | 59.3 | 83.4 | 37.7 |
| Game 06 | 34.4 | 41.0 | 43.0 |
| Game 07 | 49.0 | 56.2 | 64.5 |
| Game 08 | 53.3 | 95.3 | 74.2 |
The weak side is just as visible. Game 06 stays low across all three settings, and the highest setting drops hard on Game 03 and Game 05. So this is not simply "more model thinking makes GPT-5.5 better." In these results, the middle setting is where the generated players look strongest.
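One quick way to see both sides at once is to rank GPT-5.5's games by how much the highest setting gains or loses relative to medium, straight from the table above:

```python
# GPT-5.5 per-game scores from the table above: (none, medium, highest).
games = {
    "Game 01": (67.9, 89.2, 94.3),
    "Game 02": (84.5, 78.2, 64.3),
    "Game 03": (54.7, 64.6, 37.9),
    "Game 04": (79.0, 92.3, 87.3),
    "Game 05": (59.3, 83.4, 37.7),
    "Game 06": (34.4, 41.0, 43.0),
    "Game 07": (49.0, 56.2, 64.5),
    "Game 08": (53.3, 95.3, 74.2),
}

# Sort by (highest - medium): the biggest losses from extra thinking first.
for game, (none, medium, highest) in sorted(
    games.items(), key=lambda kv: kv[1][2] - kv[1][1]
):
    print(f"{game}: highest - medium = {highest - medium:+.1f}")
```

The bottom of that list is Game 05 (-45.7) and Game 03 (-26.7); the top is Game 07 (+8.3). Only three of the eight games improve when moving from medium to highest.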
## What we think this means
The simple takeaway is that GPT-5.5 has a very strong middle setting for this kind of
task, where a model writes a program and that program has to play games. That matters
because medium is the kind of setting many product teams actually use:
high enough to ask for useful thought, but not always the most expensive or slowest
option.
A cautious guess is that higher settings do not always mean better play. Giving the model more room to think can change the kind of program it writes. One setting may produce a simpler player that wins more often; another may produce a more ambitious player that looks clever but performs worse once the games are actually played.