Introducing the DuelLab blog
DuelLab started as a public benchmark for AI models that generate game-playing programs, but what sits underneath it is bigger than the benchmark. The foundation is a programmable board-game laboratory: a deterministic engine that can simulate games reproducibly and at scale. The benchmark is one surface built on that foundation. This blog is where we will talk about that foundation, the surfaces we run on top of it, and where we think it can go next.
What DuelLab is today
A public benchmark at benchmarks.duellab.org: AI models receive a game specification and a player API and submit source code; the submitted programs are compiled and played against each other in match tournaments on a hidden, growing suite of unique games. Rankings and methodology are public; game identities stay private during evaluation.
The less visible — but in our view more important — part is the substrate: a deterministic engine, a rules system that describes games, and a simulation harness that can batch-run millions of games under reproducible seeds. The AI benchmark is built on that stack.
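DuelLab's engine API is not spelled out in this post, so as a purely hypothetical sketch, here is what "batch-run games under reproducible seeds" means in practice: every game outcome is a deterministic function of its seed, so an entire batch can be replayed bit-for-bit. The `play_game` toy below is a stand-in for a real engine call.

```python
import random
from collections import Counter

def play_game(seed: int) -> str:
    """Hypothetical stand-in for a deterministic game engine:
    the same seed always produces the same outcome."""
    rng = random.Random(seed)
    roll = rng.random()  # toy game decided by one biased draw
    if roll < 0.45:
        return "P1"
    elif roll < 0.90:
        return "P2"
    return "draw"

def run_batch(base_seed: int, n_games: int) -> Counter:
    """Batch-run n_games, each under its own derived, reproducible seed."""
    return Counter(play_game(base_seed + i) for i in range(n_games))

# Re-running the batch with the same base seed reproduces the tallies exactly.
assert run_batch(42, 10_000) == run_batch(42, 10_000)
```

The point of the design is the assertion on the last line: determinism is what lets a benchmark result, or a balance claim, be re-checked later.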
What DuelLab can become
We are deliberately not committing the project to a single product category; several directions are still open to us.
- A platform for inventing multiplayer strategy games with AI in the loop. A designer sketches a ruleset; AI helps propose rule variations, detects dominant strategies through mass simulation, and closes the loop between “change the rule” and “see whether the new game is actually interesting.” The hard part of game design is not generating candidates — it is testing them at scale. The simulation harness is built for exactly that.
- Broader testing surfaces for games themselves, not just for AI. Mass-simulated win/draw/first-player-advantage analysis. Dominant-strategy detection. Search-depth sensitivity. Replayability proxies. Balance drift as a single rule number changes. Tournament behavior of heuristic bots against AI-generated ones. These are useful to game designers, to researchers studying game complexity, and to anyone who wants to know whether a rule tweak actually changed the game or only felt like it did.
- Public AI benchmarking with more dimensions. Today there is one benchmark shape (code-generation → compile → tournament match play). The same substrate supports evaluating models as move-makers in live play, evaluating models as designers, evaluating models as teachers of other models, or hybrid setups. These are areas we intend to explore in the near future.
- A testbed for designers and researchers studying games as objects. Game-theory folks, combinatorial-game-theory folks, and people interested in how small rule changes propagate into large strategic differences need a substrate that is deterministic, scriptable, and not locked to a specific geometry or dimension. We would like DuelLab to be that substrate for people who do not want to build it themselves.
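To make the "balance drift as a single rule number changes" idea above concrete, here is a minimal, hypothetical sketch. The `simulate` function is an invented stand-in for an engine call whose outcome is deterministic in its seed; the toy rule parameter shifts the first-player win probability, and a seed sweep estimates the resulting first-player advantage.

```python
import random

def simulate(rule_value: float, seed: int) -> str:
    """Hypothetical engine call: one game outcome under a given rule
    parameter, deterministic in (rule_value, seed)."""
    rng = random.Random(hash((rule_value, seed)))
    # Toy model only: the rule number shifts the first-player win chance.
    p_first = 0.5 + 0.25 * (rule_value - 1.0)
    return "first" if rng.random() < p_first else "second"

def first_player_advantage(rule_value: float, n: int = 5_000) -> float:
    """Estimated first-player win rate minus 0.5, from n seeded games."""
    wins = sum(simulate(rule_value, s) == "first" for s in range(n))
    return wins / n - 0.5

# Balance drift: sweep one rule number and watch the advantage move.
for v in (0.8, 1.0, 1.2):
    print(v, round(first_player_advantage(v), 3))
```

The same sweep structure generalizes to draw rates, dominant-strategy detection, or search-depth sensitivity: fix the seeds, vary one knob, and compare.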
We are not promising all of these will ship. We are saying these are the natural directions from the foundation that already exists, and the blog will cover the ones we actually build.
Why open a blog now
We already publish the benchmark and the methodology behind it. What has been missing is a place to put the short-form technical write-ups that the work produces: a new frontier model drops and the numbers move in an odd way; we change something in the harness and need to say why; we discover a rule mechanic that produces a surprisingly robust game under simulation; we try an idea that does not work.
What this blog is
- Concrete findings from the work. Per-model benchmark findings, game-design observations from mass simulation, mechanic analyses, and honest write-ups of what the tooling actually revealed. We will name the models; we will name the mechanics.
- Methodology and engine notes. What the harness is measuring today, what the engine guarantees and does not, where the edges are, and what we changed since the last snapshot.
Soft-launch
DuelLab is live, but still soft-launching. Parts of the public benchmark are still in preview, and the methodology and tooling are still evolving. Over the next few weeks, we plan to add 4-8 new games and benchmark them with an improved harness.
For the most stable public reference, see the methodology page and the current published benchmark state.
Where to start
Read next: Claude Opus 4.7 is the first Claude with a V-shaped effort curve. Two inline charts, the per-track numbers, and an explicitly labeled theory about why the curve shape changed between 4.6 and 4.7.
Objections welcome
If you think a finding is misread, a chart is misleading, a methodology note is missing, or a caveat is too quiet — tell us. The benchmark is more useful when the community pushes back on it. Reach us at [email protected] or on Discord.