We just landed a bench suite where every test asks the same thing: write a complete, runnable implementation of a classic game in a single response. No tools. No follow-up turns. No retries. Then we run the programs and watch what happens.

Result: a public gallery at /merlin/arcade where visitors click Play and the model performs in real time. CLI games (tic-tac-toe, connect four, checkers, chess, go) run live in the browser through Pyodide, which is CPython compiled to WebAssembly, plus a JS bridge that pipes user keystrokes from xterm.js into the program’s input() calls. HTML games (snake, tetris, space invaders, asteroids) run in a sandboxed iframe with allow-scripts and nothing else. Either way, what you click on is exactly what the model produced.

The trick: Tier 1 said everyone passed chess

Until last week, the suite only had static checks. Every test prompted for a Python or HTML program, then the bench harness looked for the structural elements a working game would have — a def somewhere, a game-loop primitive, win-detection idioms, board-dimension constants. Lots of keywords_any and regex checks. Every shipped Ollama Cloud provider scored 100% on chess. So did GPT-4.1-mini.
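To see why keyword counting is so easy to satisfy, here is a minimal sketch of a Tier 1 style check. It is written in Python for illustration and is not the harness's own code; the function names, the keyword lists, and the sample program are all assumptions.

```python
import re


def keywords_any(source: str, keywords: list) -> bool:
    """Pass if at least one keyword appears anywhere in the source text."""
    return any(k in source for k in keywords)


def regex_check(source: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the source text."""
    return re.search(pattern, source, re.MULTILINE) is not None


def tier1_passes(source: str) -> bool:
    # Hypothetical chess checks: structure only, nothing is executed.
    return all([
        regex_check(source, r"^\s*def\s+\w+\s*\("),           # a def somewhere
        keywords_any(source, ["while True", "while not"]),     # game-loop primitive
        keywords_any(source, ["checkmate", "winner", "wins"]), # win-detection idiom
        regex_check(source, r"\b8\b"),                         # board-dimension constant
    ])


# Made-up program that has every structural element and still crashes
# on the first input.
BROKEN_CHESS = '''
def main():
    board = [["."] * 8 for _ in range(8)]
    while True:
        move = input("Move: ")
        print(board[99])  # IndexError as soon as a move is entered
        print("checkmate")

main()
'''

print(tier1_passes(BROKEN_CHESS))
```

The sample program scores a pass because every check looks at text, not behavior.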

Then we plugged in a real layer: Tier 2. A new CheckKind::Executes extracts the fenced code block, writes it to a tempfile, runs python -I against it with scripted stdin, and regex-matches the captured stdout. For tic-tac-toe it pipes a forced X-wins-top-row sequence (1, 5, 2, 8, 3) and demands “X wins” in the output. For chess it pipes e2 e4 and only demands that the program print a plausible next prompt. The checks are stricter for the games we can drive end-to-end and looser for the ones where output format varies too much to pin down without over-fitting the check.
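The four steps of an Executes-style check (extract, write, run, match) can be sketched in a few lines of Python. This is an illustration under assumptions, not the harness's CheckKind::Executes: the function names, the fence regex, and the 10-second timeout are all hypothetical.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional

# First fenced block in the response, with or without a language tag.
FENCE = re.compile(r"`{3}[a-zA-Z0-9_+-]*[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    match = FENCE.search(response)
    return match.group(1) if match else None


def executes(response: str, stdin_script: str, expect: str,
             timeout: float = 10.0) -> bool:
    """Run the model's program with scripted stdin and regex-match its stdout."""
    code = extract_code(response)
    if code is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "game.py"
        path.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(path)],  # -I: isolated mode
                input=stdin_script,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # a program that hangs counts as a failure
    # A crash still leaves whatever was printed before it in stdout,
    # so the regex decides the outcome.
    return re.search(expect, proc.stdout) is not None
```

With this shape, the tic-tac-toe check from the text would be a call like executes(response, "1\n5\n2\n8\n3\n", r"X wins"), and the looser chess check would pipe "e2 e4\n" and match any prompt-like pattern.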

What did Tier 2 catch?

  • ollama-gemma4 chess: 100% Tier 1, board renders upside down. White’s two starting ranks are swapped: rooks on rank 2 and pawns on rank 1, so a white rook sits at A2 and a white pawn at A1. The program runs, accepts moves, prints boards; they’re just wrong.
  • ollama-deepseek-v4-flash chess: 100% Tier 1, program crashes on the first input.
  • ollama-gemma4 checkers: 100% Tier 1, program crashes on launch.

You can’t catch any of that with keyword counting. You catch it by playing.

The public page is the third tier

Tier 1 is automated structure. Tier 2 is automated runtime. The arcade page is the third tier — humans watching. It’s the layer that catches failures no automated check thought to look for. Models that drew Breakout when asked for Space Invaders. Chess programs that draw Unicode pieces correctly but let bishops move like rooks. Checkers where captures fire backwards.

We put it on a public URL on purpose. The point of one-shot is to make a model perform with no safety net. Putting that in front of visitors makes the failures part of the record alongside the wins.

How the browser side works

GitHub Pages doesn’t serve Cross-Origin-Opener-Policy / Cross-Origin-Embedder-Policy headers, which means SharedArrayBuffer is unavailable, which means Pyodide’s Atomics-based blocking stdin is off the table. Our workaround was an AST-rewrite step that runs inside Pyodide on first load: every user def that transitively calls input() becomes async def, every input(...) becomes await __merlin_input(...), and every matching call site picks up an await. Dunders (__init__, __str__, etc.) are exempted: async constructors return coroutines, which breaks instantiation. The rewritten program runs through runPythonAsync, which handles top-level await natively. __merlin_input is a JS-bridge function that resolves to a line of user input from xterm.js whenever the user hits Enter.

It’s a hack, but it makes the demo work without any cross-origin gymnastics, and the AST pass means we don’t need to rewrite the model’s source by hand for each new game.

Try it

Click into the arcade. Try the tic-tac-toe forced X-win sequence — type 1, 5, 2, 8, 3 and see whether the model declares the right winner. Try the chess programs and see whose board is right-side up. Try the HTML games and see how many of them actually run. The whole thing reloads when a fresh bench sweep produces new artifacts.

If you find a particularly entertaining failure mode, send it our way. Tightening the Tier 2 checks is next.