Claude Opus 4.7 is the first Claude with a V-shaped effort curve

On the headline number, Claude Opus 4.7 looks like a small step over 4.6. Inside the per-track data, the shape changed. Across the three effort tiers we run — none, medium, highest — Claude 4.7 regressed at the medium tier and moved up at both ends. GPT-5.4 and Gemini 3.1 Pro Preview already had this shape. Claude 4.6 did not.

A note on terminology. Throughout this post, effort tier means one of the three operating points the harness runs each model at: none, medium, highest. They correspond to provider knobs for how much extra inference compute a model can spend per request — Anthropic exposes them via extended-thinking budget plus verbosity for Claude 4.6 / 4.7, OpenAI via reasoning.effort, and Gemini via thinking-mode policies. We say effort rather than reasoning because none is a real point on this axis (no extra inference effort), not the absence of one. Provider-specific mappings are summarized in the methodology caveats below.
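As a concrete illustration, here is how a harness might translate these tiers into per-provider request parameters. This is a sketch, not the actual DuelLab mapping: the field names follow the public Anthropic, OpenAI, and Gemini APIs at the time of writing, but the specific token budgets are invented for illustration.

```python
# Illustrative sketch only: the real harness mapping is not published.
# All budget numbers below are assumptions, not DuelLab's values.

def effort_params(provider: str, tier: str) -> dict:
    """Map an abstract effort tier to provider-specific request parameters."""
    if provider == "anthropic":
        budgets = {"medium": 8_192, "highest": 32_768}  # assumed budgets
        if tier == "none":
            return {"thinking": {"type": "disabled"}}
        return {"thinking": {"type": "enabled", "budget_tokens": budgets[tier]}}
    if provider == "openai":
        # reasoning.effort takes named levels rather than a token budget
        effort = {"none": "minimal", "medium": "medium", "highest": "high"}[tier]
        return {"reasoning": {"effort": effort}}
    if provider == "gemini":
        # thinkingBudget is in tokens; 0 disables thinking on supported models
        budgets = {"none": 0, "medium": 8_192, "highest": 24_576}
        return {"thinkingConfig": {"thinkingBudget": budgets[tier]}}
    raise ValueError(f"unknown provider: {provider}")
```

The point is only that the same abstract tier fans out into quite different provider-level knobs, which is why the methodology caveats below describe the mapping per model family.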

TL;DR.

1. Claude Opus 4.7’s per-track scores form a V across none → medium → highest: 72.6 → 63.0 → 73.2. Claude 4.6’s scores rose and then plateaued (64.3 → 71.8 → 71.9).

2. GPT-5.4 and Gemini 3.1 Pro Preview already sit on that same V shape. Claude 4.7 now joins them.

3. Chinese frontier open-weights models are competitive with Western frontier models at the medium tier, but still clearly trail them at the highest tier.

The per-track numbers

Scores below are from the public DuelLab benchmark. Each score is a 0–100 conversion of an Elo-style rating, aggregated across the hidden game suite within a given effort tier. Higher is better.
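The benchmark does not publish the exact Elo-to-score transform, so here is a hedged sketch of one common choice: a linear min-max projection of each rating against the rating pool's range. Any monotone mapping would preserve the orderings discussed below.

```python
# Hypothetical sketch: the exact DuelLab projection is not published.
# This shows one simple way an Elo-style rating could land on a 0-100 scale.

def project_to_0_100(rating: float, pool_min: float, pool_max: float) -> float:
    """Linearly map a rating into [0, 100] relative to the pool's range."""
    if pool_max == pool_min:
        return 50.0  # degenerate pool: every model rated equal
    return 100.0 * (rating - pool_min) / (pool_max - pool_min)

# e.g. a 1620 rating in a pool spanning 1400-1800 projects to 55.0
```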

Model none medium highest
Claude Opus 4.7 72.6 63.0 73.2
Claude Opus 4.6 64.3 71.8 71.9
GPT-5.4 76.4 65.3 83.0
Gemini 3.1 Pro Preview 83.1 67.7 74.1
Claude Sonnet 4.6 73.5 61.1 58.7

On the overall top line, Claude Opus 4.7 looks like a small improvement over 4.6 because the aggregate hides the shape change. Split by track, the new behavior is clear: at medium, 4.7 drops ~9 points below 4.6, while at none and highest it moves ~8 and ~1 points above 4.6 respectively.
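The ~9, ~8, and ~1 point figures can be recomputed directly from the table above:

```python
# Per-tier deltas between Claude Opus 4.7 and 4.6, using the table's values.
scores = {
    "Claude Opus 4.7": {"none": 72.6, "medium": 63.0, "highest": 73.2},
    "Claude Opus 4.6": {"none": 64.3, "medium": 71.8, "highest": 71.9},
}

deltas = {
    tier: round(scores["Claude Opus 4.7"][tier] - scores["Claude Opus 4.6"][tier], 1)
    for tier in ("none", "medium", "highest")
}
print(deltas)  # {'none': 8.3, 'medium': -8.8, 'highest': 1.3}
```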

Chart 1: The V-curve

Chart description: benchmark score by effort tier for four frontier models. Line chart showing four models — Claude Opus 4.7, Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview — across three effort tiers (none, medium, highest), score on a 0–100 axis; the 4.7 dip at medium is annotated. Claude Opus 4.7, GPT-5.4, and Gemini 3.1 Pro Preview all form a V shape with a dip at the medium tier. Claude Opus 4.6 instead rises from 64 at none to roughly 72 at medium and highest.
Chart 1. Per-track benchmark score across the three effort tiers. The V shape shared by Claude 4.7, GPT-5.4, and Gemini 3.1 Pro Preview is the behavior Claude 4.6 did not exhibit.

Theory (speculation)

We don’t know why the shape changed. The simplest story we can tell that fits the data is this, and we want to be clear that it is speculation:

  • The medium tier is the default-looking reasoning level most people will use in products. Optimizing the thinking-token budget at that tier is probably the single biggest serving-cost lever a frontier lab has.
  • Adaptive-thinking policies (where the model decides its own thinking budget based on request hints rather than a fixed reasoning-effort parameter) make that optimization implementable without publishing a spec change. From the API surface, nothing obviously shifts; internally, the model is allowed to spend less time thinking at the default tier.
  • GPT-5.4 and Gemini 3.1 Pro Preview likely made an analogous move in earlier releases. This argues the pattern is economic, not architectural.
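To make the adaptive-thinking bullet concrete, here is a toy sketch of such a policy. It is pure speculation, and every threshold in it is invented for illustration; the property that matters is that the budget is chosen from request hints, so lowering one internal cap reduces spend at the default tier without any visible API change.

```python
# Toy illustration of an adaptive-thinking policy (speculative; all
# thresholds invented). The serving-cost lever is the internal cap.

def adaptive_budget(prompt_tokens: int, tool_use: bool, default_cap: int = 4096) -> int:
    """Choose a thinking-token budget from coarse request hints."""
    budget = 1024                   # cheap baseline for short requests
    if prompt_tokens > 2000:
        budget = 4096               # longer context earns more thinking
    if tool_use:
        budget = max(budget, 2048)  # tool-using requests get a floor
    return min(budget, default_cap) # lowering default_cap cuts spend silently
```

Under this sketch, tightening default_cap from 4096 to 2048 would change every long request's thinking spend while the request schema stays identical, which is exactly the kind of move that would be invisible from the API surface.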

We are not claiming labs are deliberately degrading their default tier. The pattern could also be explained by policy changes that optimize for latency and token economy in ways that happen to hurt this specific benchmark’s measure of strategic program quality. Either way, the behavior is visible from outside, and a user comparing 4.6 vs 4.7 at the default tier will see the regression.

Where Chinese frontier models land

The V-curve observation makes a second pattern easy to see. At medium, Chinese frontier open-weights models are essentially level with (or slightly ahead of) the Western frontier. At highest, they fall back by a clear margin.

Model medium highest
Kimi K2.5 (Moonshot) 74.2 49.9
Qwen3.6 Plus 73.4 64.3
GLM-5 73.3 48.5
GLM-5.1 72.1 63.1
Claude Opus 4.7 63.0 73.2
GPT-5.4 65.3 83.0
Gemini 3.1 Pro Preview 67.7 74.1
Chart description: Chinese frontier models compared to Western frontier at medium and highest. Grouped bar chart, score on a 0–100 axis. Four Chinese models (Kimi K2.5, Qwen3.6 Plus, GLM-5, GLM-5.1) and three Western frontier models (Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro Preview) are each shown with two bars: one for the medium effort tier, one for the highest effort tier. Chinese models are competitive with the Western frontier at medium, but clearly trail at highest.
Chart 2. Medium vs highest effort, seven models. At medium, Chinese frontier models sit inside or above the Western cluster. At highest, they fall back by roughly 10–30 points depending on the model.

This is consistent with the story from Chart 1: if the frontier labs have optimized their default tier for cost/latency, their best models are only really showing their full capability at highest. That is where the gap to Chinese frontier models widens most visibly — and it is also the tier that most product UX will not light up by default.

Reliability note

On this harness, Claude Opus 4.7 reliably produces plugin code that compiles and plays — its error rate is visibly lower than the previous Claude generation on the same tracks. We are explicitly not putting cross-model error-rate numbers next to it here: the harness has changed enough since older models were first measured that comparing those rates apples-to-apples would be misleading.

Methodology caveats

  • Scores are 0–100 projections of Elo-style ratings, aggregated across a hidden, growing suite of games. Full methodology is on the benchmark methodology page.
  • “Highest” is implemented per model family on the harness, not as a single literal API parameter. For Claude 4.6 / 4.7 it maps to the adaptive-thinking + verbosity: "max" shape; for providers using reasoning.effort, it uses that provider’s maximum effort setting.
  • The sample size is small, but previous tests indicate it is still large enough to yield a strong signal.

More data

This post is drawn from the public leaderboard.

Objections welcome

If you think we are misreading the V, or if you have evidence about what the medium-tier policy change is or is not, tell us.

Reach us at [email protected], @DuelLab_, or Discord.