Kimi K2.6 tops mixed play

DuelLab is a benchmark where AI models write programs that play games, and those programs are run against each other. In the latest public results, Kimi K2.6 is only #6 on the official overall leaderboard. But in mixed play - where high-, medium-, and no-reasoning entries can all face each other - Kimi K2.6 takes #1. The simplest reason is that its highest-effort version stays unusually strong even outside the normal like-for-like comparisons.

TL;DR.

1. Kimi K2.6 is #6 overall on the official DuelLab leaderboard, but its highest-effort version is #1 on the mixed-play page.

2. The mixed-play win is not a clean sweep across every setting. It comes mainly from one especially strong highest-effort row.

3. K2.6 may still end up looking weaker than K2.5 at the medium setting, but that part of the picture is not settled yet because K2.6 has fewer completed rows in the current results.

What "mixed play" means

DuelLab has two views that matter here. On the official leaderboard, models are compared inside fixed effort settings. The mixed page is different: it allows entries from different settings to face each other. That makes it a useful test of whether a model's best code still looks strong once you stop keeping the comparison perfectly matched.

Setting    K2.5 official    K2.6 official    K2.5 mixed    K2.6 mixed
none            63.6             55.7            36.1          49.0
medium          70.4             67.2            55.6          63.8
highest         49.3             76.8            42.8          75.3

The simplest read of the table is this: K2.6's big improvement over K2.5 is at the top end. On the official leaderboard, K2.5 still looks better at the published medium and none settings. But once we move to mixed play, K2.6 is ahead at all three settings, and its biggest advantage is still at highest.
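To make that read concrete, here is a minimal sketch that computes the K2.6-minus-K2.5 delta at each setting from the table above. This is purely illustrative arithmetic over the published numbers; nothing here is pulled from the benchmark itself.

```python
# Scores from the table above: (K2.5, K2.6) pairs per effort setting.
official = {"none": (63.6, 55.7), "medium": (70.4, 67.2), "highest": (49.3, 76.8)}
mixed    = {"none": (36.1, 49.0), "medium": (55.6, 63.8), "highest": (42.8, 75.3)}

for board_name, board in (("official", official), ("mixed", mixed)):
    for setting, (k25, k26) in board.items():
        delta = round(k26 - k25, 1)          # positive means K2.6 is ahead
        leader = "K2.6" if delta > 0 else "K2.5"
        print(f"{board_name:8} {setting:7}  K2.6 - K2.5 = {delta:+.1f}  ({leader} ahead)")
```

On the official board the delta is negative at none and medium and strongly positive at highest; on the mixed board it is positive at all three settings, with the biggest gap still at highest.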

Chart 1: K2.5 vs K2.6 by effort setting

The official numbers make the shift easy to see. K2.5 is better at the published medium setting. K2.6, however, has a much stronger top gear.

[Chart: official DuelLab score (0–100) by effort setting (none, medium, highest) for Kimi K2.5 and Kimi K2.6. K2.5 rises from none to medium, then falls sharply at highest; K2.6 rises steadily and ends far above K2.5 at highest.]
Chart 1. K2.6's defining change is not that every setting got better. It is that the highest setting became much stronger.

Why the mixed-play result matters

K2.6 does not lead the official overall board. It does not even lead the official highest-setting board. But its best row loses very little strength when moved into the mixed pool: 76.8 on the official highest page versus 75.3 on mixed play, a drop of only 1.5 points. That drop is much smaller than what we see from several other frontier models: GPT-5.4 falls from 80.8 to 71.0 (-9.8), Claude Opus 4.7 from 79.1 to 71.2 (-7.9), and Gemini 3.1 Pro Preview from 71.3 to 67.1 (-4.2).
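One way to quantify how well a best row "travels" is the raw official-to-mixed drop. A minimal sketch using only the scores quoted above (model names are just labels here; nothing is fetched from the benchmark):

```python
# (official highest score, mixed-play score) per model, from the text above.
scores = {
    "Kimi K2.6":              (76.8, 75.3),
    "GPT-5.4":                (80.8, 71.0),
    "Claude Opus 4.7":        (79.1, 71.2),
    "Gemini 3.1 Pro Preview": (71.3, 67.1),
}

# Points lost when the same entry moves from the official page to the mixed pool.
drops = {name: round(official - mixed_, 1)
         for name, (official, mixed_) in scores.items()}

# Smallest drop first: the model whose best row travels best leads this list.
for name, drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{name}: -{drop} points")
```

K2.6's drop is the smallest of the four by a wide margin, which is the arithmetic behind "its best version travels well."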

That is why K2.6 can end up #1 on mixed play even though it is only #6 overall on the main leaderboard. Its best version travels well.

The mixed-play win is also narrower than it sounds. K2.6 does not dominate the whole family of settings: on the full mixed board, K2.6 medium is #7 and K2.6 none is #32. The page is being won by one exceptional row, not by a clean sweep.

How much should we read into the medium setting?

It is tempting to read this as providers tuning down the default-looking middle setting while keeping a premium top gear. Maybe. We have seen hints of something like that before in other model families too. But these K2.6 results do not prove it cleanly.

The published official table does put K2.6 below K2.5 at the medium setting. At the same time, K2.6 has fewer completed rows in the current results, so that part of the comparison may still move around. The fair headline today is simpler: K2.6 has a much stronger top gear than K2.5, and that top gear is what wins mixed play.

One caution

K2.6 also looks volatile. In the current public results it scores 100 on one highest-setting game and 0 on one medium-setting game. So this does not look like a calm, across-the-board improvement. Right now it looks more like a model with a very high ceiling and some sharp failures.

It also has a thinner sample than K2.5 in today's public table: 19 official entries versus 50. That means some lower-setting comparisons may still shift as more rows are filled in.

More data

This post is drawn from the public leaderboard and the mixed-play page.

Objections welcome

If you think we are under-reading the medium-setting caveat, over-reading the mixed win, or missing a simpler explanation for why K2.6's best row holds up so well, tell us.

Reach us at [email protected], @DuelLab_, or Discord.