Programs scored by play
Models submit source code, which is compiled into a player and entered into match play. Leaderboards reflect tournament results between those programs, not a static review of the generated text.
DuelLab runs a public benchmark that evaluates the programs models generate to play games. Those outputs are compiled and run on a hidden, growing suite of unique games. We’re also building toward a lab for inventing games and analyzing them through reproducible simulation.
Benchmark live today. Public game tools in development.
Rankings are measured on a non-public set of unique games that keeps expanding.
New models and earlier outputs can be assessed on the same titles.
Why this matters
A model can generate code that appears correct without producing a program that can actually perform well. DuelLab evaluates generated programs under execution and match play: submissions are built, run, and assessed against other generated programs. The relevant question is not whether the output looks convincing in isolation, but whether it performs in competition.
The benchmark therefore emphasizes empirical behavior under execution rather than surface plausibility in the generated text or success in a narrow test setting.
How it works
The properties above all come from a single evaluation protocol. Here is how the public benchmark is structured.
Models receive a game specification and player API boundaries, then submit source code. DuelLab compiles each submission and runs match play between generated programs.
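The exact player API is not public; as a minimal sketch under assumed names (`Player`, `choose_action`, `run_match`, and the toy `Nim` game are all hypothetical, for illustration only), a submission implements a small interface and the harness drives a match between two such programs:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class Player(ABC):
    """Hypothetical API boundary a submission would implement."""
    @abstractmethod
    def choose_action(self, state: int, legal_actions: Sequence[int]) -> int: ...

class GreedyPlayer(Player):
    """A trivial submission: always take as many sticks as allowed."""
    def choose_action(self, state, legal_actions):
        return max(legal_actions)

class Nim:
    """Toy game spec: 21 sticks, take 1-3 per turn, taking the last stick wins."""
    def __init__(self, sticks: int = 21):
        self.sticks = sticks
        self.to_move = 0
    def legal_actions(self):
        return [n for n in (1, 2, 3) if n <= self.sticks]
    def step(self, action: int):
        self.sticks -= action
        self.to_move ^= 1
    def winner(self):
        # The player who just moved took the last stick and wins.
        return None if self.sticks > 0 else self.to_move ^ 1

def run_match(p0: Player, p1: Player, game: Nim) -> int:
    """Run one match between two compiled players; return the winner's index."""
    players = (p0, p1)
    while game.winner() is None:
        p = players[game.to_move]
        game.step(p.choose_action(game.sticks, game.legal_actions()))
    return game.winner()
```

The point of the shape, not the names: the model never sees the harness, only the game specification and the interface it must satisfy; everything else is executed on DuelLab's side.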
Per-game results feed Elo-style ratings; published scores incorporate uncertainty so the leaderboard stays conservative. Exact weighting and aggregation are documented on the methodology page.
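The published formulas live on the methodology page; purely to illustrate the idea of an uncertainty-penalized rating, here is a sketch (all constants and names are assumptions, not DuelLab's actual parameters) where each program carries an estimate `mu` and an uncertainty `sigma`, and the published score is `mu` minus a multiple of `sigma`, so unproven entrants rank conservatively:

```python
from dataclasses import dataclass

@dataclass
class Rating:
    mu: float = 1000.0      # estimated strength
    sigma: float = 200.0    # uncertainty of that estimate

    def published(self, k: float = 2.0) -> float:
        """Conservative leaderboard score: penalize uncertainty."""
        return self.mu - k * self.sigma

def expected(a: Rating, b: Rating) -> float:
    """Standard Elo expected score of a against b."""
    return 1.0 / (1.0 + 10 ** ((b.mu - a.mu) / 400.0))

def update(a: Rating, b: Rating, score_a: float, k_factor: float = 32.0) -> None:
    """One Elo-style update after a match (score_a: 1 win, 0.5 draw, 0 loss).
    Uncertainty shrinks as evidence accumulates, raising the published score."""
    e = expected(a, b)
    a.mu += k_factor * (score_a - e)
    b.mu += k_factor * ((1.0 - score_a) - (1.0 - e))
    for r in (a, b):
        r.sigma = max(50.0, r.sigma * 0.97)
```

Two programs with equal `mu` but different `sigma` then publish different scores: the one with more match evidence sits higher, even though the point estimates agree.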
Rankings, the methodology, and an explicit leaderboard version (the date the published data was cut) are public. The identities and full rule texts of games in the active suite remain undisclosed during evaluation to limit memorization and gaming.
New titles enter through a pipeline that does not rely solely on AI. Models may help test a game’s scope, but they are not asked to invent the games themselves.
What comes next
Beyond the public rankings, DuelLab is building tools to represent games as configuration rather than one-off engines, then simulate them at scale so rules, replays, and experiments stay reproducible.
The longer-term goal is to build a public environment for inventing, testing, and analyzing games, with AI as an optional tool.