Programs scored by play
Models submit source code, which is compiled into a player and entered into match play. Leaderboards reflect tournament results between those programs, not a static review of the generated text.
DuelLab runs a public benchmark that evaluates the programs models generate to play games. Those outputs are compiled and run on a hidden, growing suite of unique games. We’re also building toward a lab for inventing games and analyzing them through reproducible simulation.
Benchmark live today. Public game tools in development.
Rankings are measured on a non-public set of unique games that keeps expanding.
New models and earlier outputs can be assessed on the same titles.
Why this matters
A model can generate code that appears correct without producing a program that can actually perform well. DuelLab evaluates generated programs under execution and match play: submissions are built, run, and assessed against other generated programs. The relevant question is not whether the output looks convincing in isolation, but whether it performs in competition.
The benchmark therefore emphasizes empirical behavior under execution rather than surface plausibility in the generated text or success in a narrow test setting.
How it works
The properties above all come from a single evaluation protocol. Here is how the public benchmark is structured.
Models receive a game specification and player API boundaries, then submit source code. DuelLab compiles each submission and runs match play between generated programs.
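The exact player API is not public; as a minimal sketch under assumed names (`Player`, `choose_action`, `run_match`, and the toy `Nim` game are all hypothetical, for illustration only), a submission implements a small interface and the harness drives a match between two such programs:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class Player(ABC):
    """Hypothetical API boundary a submission would implement."""
    @abstractmethod
    def choose_action(self, state: int, legal_actions: Sequence[int]) -> int: ...

class GreedyPlayer(Player):
    """A trivial submission: always take as many sticks as allowed."""
    def choose_action(self, state, legal_actions):
        return max(legal_actions)

class Nim:
    """Toy game spec: 21 sticks, take 1-3 per turn, taking the last stick wins."""
    def __init__(self, sticks: int = 21):
        self.sticks = sticks
        self.to_move = 0
    def legal_actions(self):
        return [n for n in (1, 2, 3) if n <= self.sticks]
    def step(self, action: int):
        self.sticks -= action
        self.to_move ^= 1
    def winner(self):
        # The player who just moved took the last stick and wins.
        return None if self.sticks > 0 else self.to_move ^ 1

def run_match(p0: Player, p1: Player, game: Nim) -> int:
    """Run one match between two compiled players; return the winner's index."""
    players = (p0, p1)
    while game.winner() is None:
        p = players[game.to_move]
        game.step(p.choose_action(game.sticks, game.legal_actions()))
    return game.winner()
```

The point of the shape, not the names: the model never sees the harness, only the game specification and the interface it must satisfy; everything else is executed on DuelLab's side.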
Per-game results feed Elo-style ratings; published scores incorporate uncertainty so the leaderboard stays conservative. Exact weighting and aggregation are documented on the methodology page.
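The published formulas live on the methodology page; purely to illustrate the idea of an uncertainty-penalized rating, here is a sketch (all constants and names are assumptions, not DuelLab's actual parameters) where each program carries an estimate `mu` and an uncertainty `sigma`, and the published score is `mu` minus a multiple of `sigma`, so unproven entrants rank conservatively:

```python
from dataclasses import dataclass

@dataclass
class Rating:
    mu: float = 1000.0      # estimated strength
    sigma: float = 200.0    # uncertainty of that estimate

    def published(self, k: float = 2.0) -> float:
        """Conservative leaderboard score: penalize uncertainty."""
        return self.mu - k * self.sigma

def expected(a: Rating, b: Rating) -> float:
    """Standard Elo expected score of a against b."""
    return 1.0 / (1.0 + 10 ** ((b.mu - a.mu) / 400.0))

def update(a: Rating, b: Rating, score_a: float, k_factor: float = 32.0) -> None:
    """One Elo-style update after a match (score_a: 1 win, 0.5 draw, 0 loss).
    Uncertainty shrinks as evidence accumulates, raising the published score."""
    e = expected(a, b)
    a.mu += k_factor * (score_a - e)
    b.mu += k_factor * ((1.0 - score_a) - (1.0 - e))
    for r in (a, b):
        r.sigma = max(50.0, r.sigma * 0.97)
```

Two programs with equal `mu` but different `sigma` then publish different scores: the one with more match evidence sits higher, even though the point estimates agree.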
Rankings, the methodology, and an explicit leaderboard version (the date the published data was cut) are public. The identities and full rule texts of games in the active suite remain undisclosed during evaluation to limit memorization and gaming.
New titles enter through a pipeline that does not rely solely on AI. Models may help test a game’s scope, but they are not asked to invent the games themselves.
What comes next
Beyond the public rankings, DuelLab is building tools to represent games as configuration rather than one-off engines, then simulate them at scale so rules, replays, and experiments stay reproducible.
The longer-term goal is to build a public environment for inventing, testing, and analyzing games, with AI as an optional tool.