Claude Opus 4.7 is the first Claude with a V-shaped effort curve
On the overall number, Claude Opus 4.7 looks like a small step over 4.6. Inside the
per-track data, the shape changed. Across the three effort tiers we run —
none, medium, highest — Claude 4.7 regressed at
the medium tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already
had this shape. Claude 4.6 did not.
A note on terminology. Throughout this post,
effort tier means one of the three operating points the harness runs each
model at: none, medium, highest. They
correspond to provider knobs for how much extra inference compute a model can spend
per request — Anthropic exposes them via extended-thinking budget plus
verbosity for Claude 4.6 / 4.7, OpenAI via reasoning.effort,
Gemini via thinking-mode policies. We say *effort* rather than
*reasoning* because `none` is a real point on this axis (no extra
inference effort), not the absence of one. Provider-specific mappings are summarized in
the methodology caveats below.
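To make the mapping concrete, here is a minimal sketch of how a harness might encode the three tiers as per-provider request parameters. The field names and budget values below are illustrative assumptions loosely modeled on current provider APIs, not the benchmark's actual configuration:

```python
# Illustrative only: one way a harness could map its three effort tiers onto
# per-provider request parameters. All field names and numbers are placeholder
# assumptions, not the benchmark's real configuration.
EFFORT_TIERS = {
    "anthropic": {
        # extended-thinking budget plus verbosity (placeholder values)
        "none":    {"thinking": {"type": "disabled"}},
        "medium":  {"thinking": {"type": "enabled", "budget_tokens": 8_192}},
        "highest": {"thinking": {"type": "adaptive"}, "verbosity": "max"},
    },
    "openai": {
        # a single reasoning-effort knob
        "none":    {"reasoning": {"effort": "none"}},
        "medium":  {"reasoning": {"effort": "medium"}},
        "highest": {"reasoning": {"effort": "high"}},
    },
    "gemini": {
        # thinking-mode policy (placeholder field names)
        "none":    {"thinking_config": {"mode": "off"}},
        "medium":  {"thinking_config": {"mode": "auto"}},
        "highest": {"thinking_config": {"mode": "max"}},
    },
}

def request_params(provider: str, tier: str) -> dict:
    """Look up the extra request parameters for one provider/tier pair."""
    return EFFORT_TIERS[provider][tier]
```

The point of the sketch is only that "one tier" fans out into three differently shaped request payloads, which is why the harness has to define the tiers per model family.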
TL;DR.
1. Claude Opus 4.7’s per-track scores form a V across
none → medium → highest: 72.6 → 63.0 → 73.2. Claude 4.6's scores rose and then
plateaued (64.3 → 71.8 → 71.9).
2. GPT-5.4 and Gemini 3.1 Pro Preview already sit on that same V shape. Claude 4.7 now joins them.
3. Chinese frontier open-weights models are competitive with Western frontier models at the medium tier, but still clearly trail them at the highest tier.
The per-track numbers
Scores below are from the public DuelLab benchmark. Each score is a 0–100 conversion of an Elo-style rating, aggregated across the hidden game suite within a given effort tier. Higher is better.
| Model | none | medium | highest |
|---|---|---|---|
| Claude Opus 4.7 | 72.6 | 63.0 | 73.2 |
| Claude Opus 4.6 | 64.3 | 71.8 | 71.9 |
| GPT-5.4 | 76.4 | 65.3 | 83.0 |
| Gemini 3.1 Pro Preview | 83.1 | 67.7 | 74.1 |
| Claude Sonnet 4.6 | 73.5 | 61.1 | 58.7 |
On the overall top line, Claude Opus 4.7 looks like a small improvement over 4.6
because the aggregate hides the shape change. Split by track, the new behavior is
clear: at medium, 4.7 drops ~9 points below 4.6, while at
none and highest it moves ~8 and ~1 points above 4.6
respectively.
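The deltas quoted above fall directly out of the table; a quick check in Python:

```python
# Per-tier scores for the two Opus generations, copied from the table above.
opus_47 = {"none": 72.6, "medium": 63.0, "highest": 73.2}
opus_46 = {"none": 64.3, "medium": 71.8, "highest": 71.9}

# 4.7 minus 4.6 at each tier: positive means 4.7 improved over 4.6.
deltas = {tier: round(opus_47[tier] - opus_46[tier], 1) for tier in opus_47}
print(deltas)  # {'none': 8.3, 'medium': -8.8, 'highest': 1.3}
```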
Chart 1: The V-curve
Theory (speculation)
We don’t know why the shape changed. The simplest story we can tell that fits the data is this, and we want to be clear that it is speculation:
- The `medium` tier is the default-looking reasoning level most people will use in products. Optimizing the thinking-token budget at that tier is probably the single biggest serving-cost lever a frontier lab has.
- Adaptive-thinking policies (where the model decides its own thinking budget based on request hints rather than a fixed reasoning-effort parameter) make that optimization implementable without publishing a spec change. From the API surface, nothing obviously shifts; internally, the model is allowed to spend less time thinking at the default tier.
- GPT-5.4 and Gemini 3.1 Pro Preview likely made an analogous move in earlier releases. If so, the pattern is economic, not architectural.
We are not claiming labs are deliberately degrading their default tier. The pattern could also be explained by policy changes that optimize for latency and token economy in ways that happen to hurt this specific benchmark’s measure of strategic program quality. Either way, the behavior is visible from outside, and a user comparing 4.6 vs 4.7 at the default tier will see the regression.
Where Chinese frontier models land
The V-curve observation makes a second pattern easy to see. At
medium, Chinese frontier open-weights models are essentially level with
(or slightly ahead of) the Western frontier. At highest, they fall back
by a clear margin.
| Model | medium | highest |
|---|---|---|
| Kimi K2.5 (Moonshot) | 74.2 | 49.9 |
| Qwen3.6 Plus | 73.4 | 64.3 |
| GLM-5 | 73.3 | 48.5 |
| GLM-5.1 | 72.1 | 63.1 |
| Claude Opus 4.7 | 63.0 | 73.2 |
| GPT-5.4 | 65.3 | 83.0 |
| Gemini 3.1 Pro Preview | 67.7 | 74.1 |
This is consistent with the story from Chart 1: if the frontier labs have optimized
their default tier for cost/latency, their best models are only really showing their
full capability at highest. That is where the gap to Chinese frontier
models widens most visibly — and it is also the tier that most product UX will not
light up by default.
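One coarse way to quantify that claim from the table is to compare the best score in each group at each tier. This summary statistic is our own, not part of the benchmark:

```python
# (medium, highest) scores per model, copied from the table above.
chinese = {"Kimi K2.5": (74.2, 49.9), "Qwen3.6 Plus": (73.4, 64.3),
           "GLM-5": (73.3, 48.5), "GLM-5.1": (72.1, 63.1)}
western = {"Claude Opus 4.7": (63.0, 73.2), "GPT-5.4": (65.3, 83.0),
           "Gemini 3.1 Pro Preview": (67.7, 74.1)}

def best(scores: dict, tier_index: int) -> float:
    """Best score in a group at one tier (0 = medium, 1 = highest)."""
    return max(s[tier_index] for s in scores.values())

# At medium the best Chinese model leads; at highest the best Western model does.
medium_lead  = round(best(chinese, 0) - best(western, 0), 1)  # 74.2 - 67.7
highest_lead = round(best(western, 1) - best(chinese, 1), 1)  # 83.0 - 64.3
print(medium_lead, highest_lead)  # 6.5 18.7
```

A 6.5-point lead at medium versus an 18.7-point deficit at highest is the same asymmetry the prose describes.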
Reliability note
On this harness, Claude Opus 4.7 reliably produces plugin code that compiles and plays — its error rate is visibly lower than the previous Claude generation on the same tracks. We are explicitly not putting cross-model error-rate numbers next to it here: the harness has changed enough since older models were first measured that comparing those rates apples-to-apples would be misleading.
Methodology caveats
- Scores are 0–100 projections of Elo-style ratings, aggregated across a hidden, growing suite of games. Full methodology is on the benchmark methodology page.
- “Highest” is implemented per model family on the harness, not as a single literal API parameter. For Claude 4.6 / 4.7 it maps to the adaptive-thinking + `verbosity: "max"` shape; for providers using `reasoning.effort`, it uses that provider’s maximum effort setting.
- The sample size is small, but previous runs suggest it is still large enough to give a strong signal.
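The exact Elo-to-score conversion is not published here. Purely as an illustration of the kind of projection involved, a clamped min-max rescale of an Elo-style rating onto 0–100 might look like this (the rating bounds are made-up placeholders, not the benchmark's actual parameters):

```python
# Illustrative only: project an Elo-style rating onto a 0-100 scale by
# clamping to an assumed rating range and rescaling linearly. The bounds
# here are placeholders; the benchmark's real conversion is not published.
def project(rating: float, lo: float = 800.0, hi: float = 1600.0) -> float:
    """Clamp an Elo-style rating into [lo, hi] and rescale to 0-100."""
    clamped = max(lo, min(hi, rating))
    return round(100.0 * (clamped - lo) / (hi - lo), 1)
```

Any monotone projection like this preserves rankings within a tier, which is all the comparisons in this post rely on.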
More data
This post is drawn from the public leaderboard.
Objections welcome
If you think we are misreading the V, or if you have evidence about whether the medium-tier policy change is what we think it is, tell us.
Reach us at [email protected], @DuelLab_, or Discord.