live showcase · models perform in public

The one-shot arcade

Nine classic games. Each prompt is a single message to the model — "write a complete, runnable implementation" — no tools, no follow-up turns, no retries. What you click below is exactly what came out the other end. The CLI games run in a real Python interpreter (Pyodide); the arcade games run in a sandboxed iframe. You're playing the model.

9 games · 50 cli runs · 40 html runs · 30/35 tier 2 ✓ · 10 models

Last bench sweep: 2026-05-20

Tic-Tac-Toe

Easiest of the CLI lineup. Win-detection on rows / cols / diagonals plus the draw case. Most one-shots forget the tie.
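
To make the tie case concrete, here is a minimal sketch of the outcome check described above. It is not taken from any model's submission; the flat nine-cell board and the names `LINES` and `outcome` are assumptions chosen for illustration.

```python
from typing import Optional

# All eight winning lines on a 3x3 board stored as a flat list of 9 cells.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def outcome(board: list[str]) -> Optional[str]:
    """Return 'X' or 'O' for a win, 'draw' for a full board, None if play continues."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    # The step one-shots tend to skip: a full board with no winner is a draw.
    if " " not in board:
        return "draw"
    return None
```

Checking wins before the full-board test matters: a winning move that also fills the last cell should count as a win, not a draw.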

🥈 ollama-deepseek-v4-flash
100%
deepseek-v4-flash · 4.4 s · 2.1 KB
ollama-kimi-thinking
0%
kimi-k2-thinking · 308 ms · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved

Connect Four

Gravity, column-prompts, four-in-a-row including diagonals. Diagonal-check is the gate.
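
A rough sketch of the two mechanics named above, gravity and a four-in-a-row check that covers both diagonals. The 6×7 board of single-character strings and the function names are assumptions for illustration, not the benchmark's reference solution.

```python
ROWS, COLS = 6, 7

# Horizontal, vertical, and the two diagonals. Each direction is scanned both ways.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def drop(board, col, piece):
    """Gravity: place `piece` in the lowest empty cell of `col`.

    Returns the row used, or None when the column is full.
    """
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = piece
            return row
    return None


def wins(board, row, col):
    """True if the piece at (row, col) is part of a run of four or more."""
    piece = board[row][col]
    for dr, dc in DIRECTIONS:
        count = 1
        for sign in (1, -1):
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == piece:
                count += 1
                r += sign * dr
                c += sign * dc
        if count >= 4:
            return True
    return False
```

Checking only around the last-placed piece keeps the test cheap; callers are expected to pass the coordinates returned by `drop`, never an empty cell.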

🥇 ollama-deepseek-v4-flash
100%
deepseek-v4-flash · 3.7 s · 2.1 KB
ollama-kimi-thinking
0%
kimi-k2-thinking · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved

Checkers

8×8 board, men + kings, forced captures, back-rank promotion. Few models get capture mechanics right one-shot.
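
As an illustration of the forced-capture rule, the sketch below lists single jumps for one side and reports whether that side must capture. The board encoding ("w"/"b" men, "W"/"B" kings, "." empty, white moving up the board) is an assumption, and multi-jump chains and promotion are deliberately left out.

```python
SIZE = 8


def move_dirs(piece):
    """Kings move in all four diagonals; men only move forward."""
    if piece.isupper():
        return [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    forward = -1 if piece == "w" else 1  # assumption: white moves toward row 0
    return [(forward, -1), (forward, 1)]


def captures(board, player):
    """Return every single jump ((from_row, from_col), (to_row, to_col)) for `player`."""
    jumps = []
    for r in range(SIZE):
        for c in range(SIZE):
            piece = board[r][c]
            if piece.lower() != player:
                continue
            for dr, dc in move_dirs(piece):
                land_r, land_c = r + 2 * dr, c + 2 * dc
                if not (0 <= land_r < SIZE and 0 <= land_c < SIZE):
                    continue
                victim = board[r + dr][c + dc]
                is_enemy = victim != "." and victim.lower() != player
                if is_enemy and board[land_r][land_c] == ".":
                    jumps.append(((r, c), (land_r, land_c)))
    return jumps


def must_capture(board, player):
    """Forced capture: if any jump exists, simple moves are not legal this turn."""
    return bool(captures(board, player))
```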

ollama-deepseek-v4-flash
100%
deepseek-v4-flash · 1.4 m · 7.9 KB
ollama-qwen-coder-next
85.7%
qwen3-coder-next · 17.1 s · 5.5 KB
ollama-kimi-thinking
0%
kimi-k2-thinking · 328 ms · 0 B

No runnable Python in the model's response.

No artifact saved

Chess

The hardest CLI test. Full ruleset including castling and check detection. Pass at your own risk.

🥇 ollama-qwen-coder
100%
qwen3-coder:480b · 34.9 s · 17.4 KB
🥈 ollama-qwen-coder-next
100%
qwen3-coder-next · 47.8 s · 16.1 KB
openai
66.7%
gpt-4.1-mini · 40.6 s · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-kimi-thinking
0%
kimi-k2-thinking · 275 ms · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-deepseek-v4-flash
0%
deepseek-v4-flash · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-gemma4
0%
gemma4:31b · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved

Go (9×9)

Liberty-count capture, two-pass game end. The unique-to-Go mechanic that separates "drew a grid" from "implemented Go".
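
The liberty count is usually done with a flood fill over a connected group. The following is a minimal sketch under assumed conventions ("B", "W", and "." cells on a 9×9 list of lists); it ignores suicide and ko rules, which a full implementation would also need.

```python
SIZE = 9


def neighbors(r, c):
    """Yield the on-board orthogonal neighbors of (r, c)."""
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < SIZE and 0 <= nc < SIZE:
            yield nr, nc


def group_and_liberties(board, row, col):
    """Flood-fill the group containing (row, col); return (stones, liberties)."""
    color = board[row][col]
    group, liberties = set(), set()
    stack = [(row, col)]
    while stack:
        r, c = stack.pop()
        if (r, c) in group:
            continue
        group.add((r, c))
        for nr, nc in neighbors(r, c):
            if board[nr][nc] == ".":
                liberties.add((nr, nc))
            elif board[nr][nc] == color:
                stack.append((nr, nc))
    return group, liberties


def remove_captured(board, row, col):
    """After a stone lands on (row, col), remove adjacent enemy groups with no liberties.

    Returns the number of stones removed.
    """
    color = board[row][col]
    removed = 0
    for nr, nc in neighbors(row, col):
        if board[nr][nc] in (".", color):
            continue
        group, liberties = group_and_liberties(board, nr, nc)
        if not liberties:
            for r, c in group:
                board[r][c] = "."
            removed += len(group)
    return removed
```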

ollama-deepseek-v4-flash
100%
deepseek-v4-flash · 22.3 s · 6.5 KB
ollama-kimi-thinking
0%
kimi-k2-thinking · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved
ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

No artifact saved

Snake

Grid + keyboard + game loop. The "did the model produce a real game loop" warm-up.

ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

No artifact saved

Tetris

Piece rotation, line clearing, gravity. The first one where pretty code breaks at runtime.
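
For reference, rotation and line clearing both reduce to small list operations. The arcade entries themselves are HTML and JavaScript; this sketch uses Python only to keep one language across the examples on this page, and the names `rotate_cw` and `clear_lines` are illustrative.

```python
def rotate_cw(piece):
    """Rotate a piece matrix 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*piece[::-1])]


def clear_lines(grid):
    """Drop every full row and pad the top with empty rows.

    Returns (new_grid, number_of_lines_cleared). Cells are 0 for empty and
    any truthy value for filled.
    """
    width = len(grid[0])
    kept = [row for row in grid if not all(row)]
    cleared = len(grid) - len(kept)
    return [[0] * width for _ in range(cleared)] + kept, cleared
```

A real game would also test the rotated piece against walls and the stack before accepting it, which is a common place for runtime breakage.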

openai
57.1%
gpt-4.1-mini · 51.2 s · 0 B

No runnable HTML in the model's response.

No artifact saved
ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

No artifact saved

Space Invaders

Waves, bullets, collision, lose-condition. Stacks more concurrent state than the model usually wants.
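
The collision part is typically an axis-aligned bounding-box test run for every bullet against every invader. A small sketch, with the `Rect` type and function names assumed for illustration (the actual entries are JavaScript, where the same comparisons apply):

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def overlaps(a, b):
    """Axis-aligned bounding-box test; rectangles that only touch do not overlap."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def resolve_hits(bullets, invaders):
    """Return (surviving_bullets, surviving_invaders).

    Each bullet destroys at most one invader and is consumed by the hit.
    """
    live_invaders = list(invaders)
    live_bullets = []
    for bullet in bullets:
        target = next((inv for inv in live_invaders if overlaps(bullet, inv)), None)
        if target is None:
            live_bullets.append(bullet)
        else:
            live_invaders.remove(target)
    return live_bullets, live_invaders
```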

ollama-minimax
0%
minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

No artifact saved
ollama-kimi-thinking
0%
kimi-k2-thinking · 2 m · 0 B

No runnable HTML in the model's response.

No artifact saved

Asteroids

Vector math, ship rotation, screen wrap. Hardest of the four arcade games: most one-shots produce shaky physics.
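
The core of the physics is one update step: thrust along the heading, integrate velocity, wrap the position. The sketch below assumes an 800×600 playfield and a hypothetical `step` function; the real entries are JavaScript, and Python is used only for consistency with the other examples here.

```python
import math

WIDTH, HEIGHT = 800, 600  # assumed playfield size


def step(x, y, vx, vy, heading, thrust, dt):
    """Advance the ship by `dt` seconds.

    `heading` is in radians and `thrust` is an acceleration along the heading.
    Returns the new (x, y, vx, vy) with the position wrapped onto the playfield.
    """
    vx += math.cos(heading) * thrust * dt
    vy += math.sin(heading) * thrust * dt
    # Python's % is non-negative for a positive modulus, so one % wraps both edges.
    x = (x + vx * dt) % WIDTH
    y = (y + vy * dt) % HEIGHT
    return x, y, vx, vy
```

In JavaScript the remainder operator keeps the sign of the left operand, so the equivalent wrap is usually written `((x % w) + w) % w`; forgetting that makes ships vanish off the left and top edges.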

ollama-kimi-thinking
0%
kimi-k2-thinking · 4 s · 0 B

No runnable HTML in the model's response.

No artifact saved

How this works

Every iframe is served from the same origin as this page, but the sandbox="allow-scripts" attribute (with no allow-same-origin token) gives the frame an opaque origin and blocks form submission, top-level navigation, and storage access. Each game runs as if it were on a stranger's domain: it can't read cookies, it can't touch the parent page, and it can't navigate the top-level window. If you'd rather inspect the source before running it, hit Download .html on any card.

Static checks (the percentage on each card) come from benchmarks/suites/oneshot_arcade.toml. They look for things like a real requestAnimationFrame loop, collision detection in Space Invaders, screen-wrap math in Asteroids — the bits weak one-shots forget. The score tells you the response had the right shape; whether it actually plays is what your eyes are for.
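
To show the general idea of a static check, here is a hedged sketch that scores a source string against a set of regex patterns. The check names and patterns are hypothetical and are not copied from oneshot_arcade.toml, whose actual format is not shown on this page.

```python
import re

# Hypothetical checks for an Asteroids entry; the real suite may differ.
ASTEROIDS_CHECKS = {
    "animation loop": r"requestAnimationFrame\s*\(",
    "keyboard input": r"addEventListener\s*\(\s*['\"]key(down|up)['\"]",
    "screen wrap": r"%\s*(canvas\.)?(width|height)",
    "rotation math": r"Math\.(cos|sin)\s*\(",
}


def static_score(source, checks):
    """Return (percentage of checks matched, names of the checks that missed)."""
    passed = [name for name, pattern in checks.items() if re.search(pattern, source)]
    missed = [name for name in checks if name not in passed]
    return round(100 * len(passed) / len(checks), 1), missed
```

Pattern checks like these only confirm that the expected pieces are present in the text; they cannot tell whether the pieces are wired together correctly.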

The T2 ✓ badge on HTML cards is the runtime check: bun arcade:runtime loads each game in headless Chromium, listens for console errors, and verifies that the first 2.5 seconds of painting produce non-blank pixels. A green T2 means the program at least starts; a red T2 means it threw on load or rendered nothing. Subtle gameplay bugs (gemma4's upside-down chess, Asteroids without thrust) still need your eyes. Tier 2 is the floor, not the ceiling.