live showcase · models perform in public
The one-shot arcade
Nine classic games. Each prompt is a single message to the model — "write a complete, runnable implementation" — no tools, no follow-up turns, no retries. What you click below is exactly what came out the other end. The CLI games run in a real Python interpreter (Pyodide); the arcade games run in a sandboxed iframe. You're playing the model.
Tic-Tac-Toe
Easiest of the CLI lineup. Win-detection on rows / cols / diagonals plus the draw case. Most one-shots forget the tie.
No runnable Python in the model's response.
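For reference, the win and draw detection described for this card can be sketched as below. This is a minimal illustration, not any model's output; the flat nine-cell board, the `" "` empty marker, and the names `WIN_LINES` and `game_result` are assumptions made for the example.

```python
from typing import List, Optional

# All eight winning lines on a 3x3 board, as index triples into a flat list.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def game_result(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' for a win, 'draw' for a full board, None otherwise.

    `board` is a flat list of 9 cells holding 'X', 'O', or ' ' (empty).
    """
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    # The tie case: no winner and no empty cell left.
    if " " not in board:
        return "draw"
    return None
```

Checking for a winner before checking for a full board matters: a winning move that also fills the last cell should count as a win, not a draw.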
Connect Four
Gravity, column-prompts, four-in-a-row including diagonals. Diagonal-check is the gate.
No runnable Python in the model's response.
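A rough sketch of the gravity and four-in-a-row checks named on this card, assuming a 6×7 grid stored as a list of rows with row 0 at the top and `"."` for an empty cell. The names `drop` and `has_four` are illustrative only.

```python
ROWS, COLS = 6, 7

# Step vectors: horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def drop(board, col, player):
    """Gravity: place `player` in the lowest empty cell of `col`.

    Returns the row used, or None if the column is full.
    """
    for r in range(ROWS - 1, -1, -1):
        if board[r][col] == ".":
            board[r][col] = player
            return r
    return None


def has_four(board, player):
    """True if `player` has four in a row in any of the four directions."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in DIRECTIONS:
                end_r, end_c = r + 3 * dr, c + 3 * dc
                # Skip runs that would leave the board.
                if not (0 <= end_r < ROWS and 0 <= end_c < COLS):
                    continue
                if all(board[r + i * dr][c + i * dc] == player for i in range(4)):
                    return True
    return False
```

Scanning with both `(1, 1)` and `(1, -1)` is what covers the two diagonal orientations; dropping either one is the kind of omission this card is looking for.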
Checkers
8×8 board, men + kings, forced captures, back-rank promotion. Few models get capture mechanics right one-shot.
No runnable Python in the model's response.
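The capture mechanics can be sketched as a jump finder. The representation here is an assumption for illustration: an 8×8 list of rows, `"b"`/`"w"` for men, `"B"`/`"W"` for kings, `"."` for empty, with `b` moving toward higher row numbers. The sketch finds single jumps only; multi-jump chains and promotion are left out.

```python
def jumps(board, player):
    """Return every single jump available to `player` as ((r, c), (tr, tc)).

    Men jump forward only; kings (uppercase) jump in all four diagonals.
    """
    opponent = "w" if player == "b" else "b"
    forward = 1 if player == "b" else -1
    found = []
    for r in range(8):
        for c in range(8):
            piece = board[r][c]
            if piece.lower() != player:
                continue
            row_steps = (1, -1) if piece.isupper() else (forward,)
            for dr in row_steps:
                for dc in (-1, 1):
                    mr, mc = r + dr, c + dc          # jumped-over square
                    tr, tc = r + 2 * dr, c + 2 * dc  # landing square
                    if not (0 <= tr < 8 and 0 <= tc < 8):
                        continue
                    if board[mr][mc].lower() == opponent and board[tr][tc] == ".":
                        found.append(((r, c), (tr, tc)))
    return found
```

Under the forced-capture rule, a move generator would return only these jumps whenever the list is non-empty and fall back to ordinary diagonal steps otherwise.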
Chess
The hardest CLI test. Full ruleset including castling and check detection. Pass at your own risk.
No runnable Python in the model's response.
Go (9×9)
Capture by liberty counting, and a game that ends after two consecutive passes. Liberty-based capture is the mechanic unique to Go that separates "drew a grid" from "implemented Go".
No runnable Python in the model's response.
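Liberty-count capture can be sketched as a flood fill over connected stones. The board format (a square list of rows with `"B"`, `"W"`, and `"."`) and the function names are assumptions for this example. Ko and suicide rules are deliberately not handled.

```python
def group_and_liberties(board, row, col):
    """Flood-fill the group containing (row, col).

    Returns (group, liberties): the set of connected same-colour stones
    and the set of empty points adjacent to any of them.
    """
    color = board[row][col]
    size = len(board)
    group, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < size and 0 <= nc < size):
                continue
            if board[nr][nc] == ".":
                liberties.add((nr, nc))
            elif board[nr][nc] == color:
                stack.append((nr, nc))
    return group, liberties


def remove_captured(board, color):
    """Remove every `color` group with zero liberties; return stones removed."""
    size = len(board)
    seen, captured = set(), set()
    for r in range(size):
        for c in range(size):
            if board[r][c] == color and (r, c) not in seen:
                group, libs = group_and_liberties(board, r, c)
                seen |= group
                if not libs:
                    captured |= group
    for r, c in captured:
        board[r][c] = "."
    return len(captured)
```

After a stone is placed, a typical order is to call `remove_captured` for the opponent's colour first, then check the placed stone's own group.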
Snake
Grid + keyboard + game loop. The "did the model produce a real game loop" warm-up.
No runnable HTML in the model's response.
Tetris
Piece rotation, line clearing, gravity. The first one where pretty code breaks at runtime.
No runnable HTML in the model's response.
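The arcade entries are HTML and JavaScript, but the two core Tetris operations are language-independent, so they are sketched here in Python. The grid encoding (lists of rows, `0` for empty) and the names `rotate_cw` and `clear_lines` are assumptions for illustration.

```python
def rotate_cw(piece):
    """Rotate a piece (list of rows) 90 degrees clockwise."""
    # Reverse the row order, then transpose.
    return [list(row) for row in zip(*piece[::-1])]


def clear_lines(grid):
    """Remove full rows and pad the top with empty rows.

    Returns (new_grid, number_of_rows_cleared).
    """
    width = len(grid[0])
    kept = [row for row in grid if any(cell == 0 for cell in row)]
    cleared = len(grid) - len(kept)
    empty_rows = [[0] * width for _ in range(cleared)]
    return empty_rows + kept, cleared
```

A real game also has to reject a rotation that would overlap a wall or a settled block. Skipping that check is one common way for tidy-looking code to break at runtime.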
Space Invaders
Waves, bullets, collision, lose-condition. Stacks more concurrent state than the model usually wants.
No runnable HTML in the model's response.
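Bullet-versus-invader collision is usually an axis-aligned bounding-box test. A minimal Python sketch of that idea follows; the `Rect` type and the rule that each bullet destroys at most one invader are assumptions for the example, not a description of any submitted game.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def overlaps(a, b):
    """Axis-aligned bounding-box test: True if the rectangles intersect."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def resolve_hits(bullets, invaders):
    """Return (surviving_bullets, surviving_invaders).

    Each bullet destroys at most one invader and is consumed by the hit.
    """
    remaining = list(invaders)
    live = []
    for bullet in bullets:
        target = next((inv for inv in remaining if overlaps(bullet, inv)), None)
        if target is None:
            live.append(bullet)
        else:
            remaining.remove(target)
    return live, remaining
```

Building new lists instead of deleting from the lists being iterated avoids the skipped-element bugs that show up when bullets and invaders are removed mid-loop.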
Asteroids
Vector math, ship rotation, screen wrap. Hardest of the four arcade games: most one-shots produce shaky physics.
No runnable HTML in the model's response.
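The thrust and screen-wrap math can be sketched in a few lines of Python. The playfield size, the `accel` default, and the function name `step_ship` are assumptions for illustration; `angle` is in radians.

```python
import math

WIDTH, HEIGHT = 800, 600  # assumed playfield size in pixels


def step_ship(x, y, vx, vy, angle, thrusting, dt, accel=200.0):
    """Advance the ship by `dt` seconds and return (x, y, vx, vy).

    Thrust accelerates along the heading; position wraps at the edges.
    """
    if thrusting:
        vx += math.cos(angle) * accel * dt
        vy += math.sin(angle) * accel * dt
    # Python's % yields a non-negative result for a positive modulus,
    # so leaving by the left or top edge wraps correctly too.
    x = (x + vx * dt) % WIDTH
    y = (y + vy * dt) % HEIGHT
    return x, y, vx, vy
```

In JavaScript the `%` operator keeps the sign of the left operand, so the equivalent wrap is usually written as `((x % w) + w) % w`.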
How this works
Every iframe is served from the same origin as this page, but the sandbox="allow-scripts" attribute removes same-origin privileges and disables form submission, top-level navigation, and storage. Each game runs as if it were on a stranger's domain — it can't read cookies, it can't touch the parent page, it can't navigate anywhere. If you'd rather inspect the source directly before running it, hit Download .html on any card.
Static checks (the percentage on each card) come from benchmarks/suites/oneshot_arcade.toml. They look for things like a real requestAnimationFrame loop, collision detection in Space Invaders, screen-wrap math in Asteroids — the bits weak one-shots forget. The score tells you the response had the right shape; whether it actually plays is what your eyes are for.
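As an illustration of what a static check of this kind might look like, here is a small Python sketch that pattern-matches a response for expected features and reports a percentage. The contents of `oneshot_arcade.toml` are not shown on this page, so the check names, the regular expressions, and the scoring rule below are all hypothetical.

```python
import re

# Hypothetical checks; the real suite's names and patterns may differ.
CHECKS = {
    "raf_loop": re.compile(r"requestAnimationFrame\s*\("),
    "keyboard_input": re.compile(r"addEventListener\s*\(\s*['\"]key(down|up)['\"]"),
    "canvas": re.compile(r"<canvas\b", re.IGNORECASE),
}


def static_score(html):
    """Return (percent_passed, per_check_results) for a response's HTML."""
    results = {name: bool(pattern.search(html)) for name, pattern in CHECKS.items()}
    percent = round(100 * sum(results.values()) / len(results))
    return percent, results
```

Pattern matching like this can only confirm that the expected pieces are present in the text, which is why a high score says nothing about whether the game runs.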
The T2 ✓ badge on HTML cards is the runtime check: bun arcade:runtime loads each game in headless Chromium, listens for console errors, and verifies that the game paints non-blank pixels within the first 2.5 seconds. A green T2 means the program at least starts; a red T2 means it threw on load or rendered nothing. Subtle gameplay bugs (gemma4's upside-down chess, asteroids without thrust) still need your eyes. Tier 2 is the floor, not the ceiling.
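The "non-blank pixels" part of such a check can be sketched independently of the browser. The Python function below assumes a frame has already been captured as raw RGBA bytes and treats it as blank when every pixel matches the first one. How the actual runtime check captures and compares frames is not described here, so this is an assumption about the general approach, and `is_blank` is a hypothetical name.

```python
def is_blank(rgba, tolerance=0):
    """True if every pixel matches the first pixel within `tolerance`.

    `rgba` is a bytes object of 4-byte RGBA pixels; its length is
    assumed to be a multiple of 4.
    """
    if len(rgba) < 4:
        return True
    first = rgba[:4]
    for i in range(4, len(rgba), 4):
        if any(abs(rgba[i + k] - first[k]) > tolerance for k in range(4)):
            return False
    return True
```

Comparing against the first pixel rather than a fixed colour means a solid black canvas and a solid white canvas both count as blank.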