live showcase · models perform in public

The one-shot arcade

Nine classic games. Each prompt is a single message to the model — "write a complete, runnable implementation" — no tools, no follow-up turns, no retries. What you click below is exactly what came out the other end. The CLI games run in a real Python interpreter (Pyodide); the arcade games run in a sandboxed iframe. You're playing the model.

games

cli runs

html runs

30/35

tier 2 ✓

models

Last bench sweep: 2026-05-20

Tic-Tac-Toe

Easiest of the CLI lineup. Win-detection on rows / cols / diagonals plus the draw case. Most one-shots forget the tie.
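For reference, here is a minimal sketch of the check this card describes. It assumes a flat 9-cell board with " " for empty squares; the name `game_status` is illustrative and is not taken from any model's output.

```python
from typing import List, Optional

# The eight winning lines on a 3x3 board stored as a flat list of 9 cells.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def game_status(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' for a win, 'draw' for a full board, None otherwise."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    # The tie case: only reached when no line is complete.
    if all(cell != " " for cell in board):
        return "draw"
    return None
```

Checking for a winner before checking for a full board matters: a winning move can also be the one that fills the last square.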

🥇 ollama-qwen-coder

100%

qwen3-coder:480b · 3.3 s · 1.7 KB

🥈 ollama-deepseek-v4-flash

100%

deepseek-v4-flash · 4.4 s · 2.1 KB

🥉 ollama-glm

100%

glm-4.7 · 7.9 s · 2.3 KB

ollama-qwen-coder-next

100%

qwen3-coder-next · 9.7 s · 2.4 KB

ollama-devstral

100%

devstral-2:123b · 10.1 s · 1.5 KB

openai

100%

gpt-4.1-mini · 11.2 s · 2.0 KB

ollama-gpt-oss

100%

gpt-oss:120b · 13 s · 2.3 KB

ollama-gemma4

100%

gemma4:31b · 24.9 s · 2.3 KB

ollama-kimi-thinking

kimi-k2-thinking · 308 ms · 0 B

No runnable Python in the model's response.

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

Connect Four

Gravity, column-prompts, four-in-a-row including diagonals. Diagonal-check is the gate.
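A sketch of the two mechanics named above, assuming a 6×7 grid of single-character strings with "." for empty. The helper names `drop` and `wins` are hypothetical; the diagonal check is just two more entries in the direction list.

```python
from typing import List

ROWS, COLS = 6, 7
# (row step, column step): horizontal, vertical, and both diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def new_board() -> List[List[str]]:
    return [["."] * COLS for _ in range(ROWS)]


def drop(board: List[List[str]], col: int, piece: str) -> int:
    """Gravity: place piece in the lowest empty cell of col.

    Returns the row used, or -1 if the column is full.
    """
    for row in range(ROWS - 1, -1, -1):
        if board[row][col] == ".":
            board[row][col] = piece
            return row
    return -1


def wins(board: List[List[str]], row: int, col: int) -> bool:
    """True if the piece at (row, col) is part of a run of four or more."""
    piece = board[row][col]
    for dr, dc in DIRECTIONS:
        count = 1
        for sign in (1, -1):  # walk both ways along the line
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < ROWS and 0 <= c < COLS and board[r][c] == piece:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= 4:
            return True
    return False
```

Only lines through the most recent move need checking, so `wins` is called with the row returned by `drop`.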

🥇 ollama-deepseek-v4-flash

100%

deepseek-v4-flash · 3.7 s · 2.1 KB

🥈 ollama-qwen-coder

100%

qwen3-coder:480b · 4.4 s · 2.1 KB

🥉 ollama-gemma4

100%

gemma4:31b · 5.7 s · 2.4 KB

ollama-qwen-coder-next

100%

qwen3-coder-next · 10.4 s · 2.9 KB

ollama-glm

100%

glm-4.7 · 11.9 s · 2.5 KB

openai

100%

gpt-4.1-mini · 14 s · 2.1 KB

ollama-devstral

83.3%

devstral-2:123b · 9 s · 1.7 KB

ollama-gpt-oss

83.3%

gpt-oss:120b · 12.5 s · 1.9 KB

ollama-kimi-thinking

kimi-k2-thinking · 2 m · 0 B

No runnable Python in the model's response.

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

Checkers

8×8 board, men + kings, forced captures, back-rank promotion. Few models get capture mechanics right one-shot.

🥇 ollama-qwen-coder

100%

qwen3-coder:480b · 21.3 s · 8.5 KB

🥈 ollama-devstral

100%

devstral-2:123b · 22.5 s · 7.6 KB

🥉 ollama-glm

100%

glm-4.7 · 24.3 s · 5.6 KB

ollama-gpt-oss

100%

gpt-oss:120b · 32.2 s · 5.3 KB

openai

100%

gpt-4.1-mini · 36.9 s · 7.6 KB

ollama-deepseek-v4-flash

100%

deepseek-v4-flash · 1.4 m · 7.9 KB

ollama-qwen-coder-next

85.7%

qwen3-coder-next · 17.1 s · 5.5 KB

ollama-gemma4

85.7%

gemma4:31b · 21.1 s · 8.2 KB

ollama-minimax

71.4%

minimax-m2.5 · 1.9 m · 5.8 KB

ollama-kimi-thinking

kimi-k2-thinking · 328 ms · 0 B

No runnable Python in the model's response.

Chess

The hardest CLI test. Full ruleset including castling and check detection. Pass at your own risk.

🥇 ollama-qwen-coder

100%

qwen3-coder:480b · 34.9 s · 17.4 KB

🥈 ollama-qwen-coder-next

100%

qwen3-coder-next · 47.8 s · 16.1 KB

🥉 ollama-glm

100%

glm-4.7 · 49.8 s · 13.6 KB

ollama-devstral

85.7%

devstral-2:123b · 32.3 s · 10.7 KB

ollama-gpt-oss

85.7%

gpt-oss:120b · 46.9 s · 11.9 KB

openai

66.7%

gpt-4.1-mini · 40.6 s · 0 B

No runnable Python in the model's response.

ollama-kimi-thinking

kimi-k2-thinking · 275 ms · 0 B

No runnable Python in the model's response.

ollama-deepseek-v4-flash

deepseek-v4-flash · 2 m · 0 B

No runnable Python in the model's response.

ollama-gemma4

gemma4:31b · 2 m · 0 B

No runnable Python in the model's response.

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

Go (9×9)

Liberty-count capture, two-pass game end. The unique-to-Go mechanic that separates "drew a grid" from "implemented Go".
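Liberty-count capture can be sketched as a flood fill over a connected group. The code below assumes a 9×9 grid of "B", "W" and "." characters, and it deliberately leaves out ko, suicide and scoring. The function names are illustrative.

```python
from typing import Iterator, List, Set, Tuple

SIZE = 9
Point = Tuple[int, int]


def neighbors(r: int, c: int) -> Iterator[Point]:
    """Yield the on-board orthogonal neighbours of (r, c)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < SIZE and 0 <= nc < SIZE:
            yield nr, nc


def group_and_liberties(board: List[List[str]], start: Point):
    """Flood-fill the group containing start; return (stones, liberties)."""
    color = board[start[0]][start[1]]
    group: Set[Point] = {start}
    liberties: Set[Point] = set()
    stack = [start]
    while stack:
        r, c = stack.pop()
        for n in neighbors(r, c):
            value = board[n[0]][n[1]]
            if value == ".":
                liberties.add(n)
            elif value == color and n not in group:
                group.add(n)
                stack.append(n)
    return group, liberties


def capture_around(board: List[List[str]], move: Point) -> int:
    """After a stone is placed at move, remove adjacent enemy groups
    that have no liberties left. Returns the number of stones removed."""
    color = board[move[0]][move[1]]
    removed = 0
    for n in neighbors(*move):
        value = board[n[0]][n[1]]
        if value != "." and value != color:
            group, libs = group_and_liberties(board, n)
            if not libs:
                for r, c in group:
                    board[r][c] = "."
                removed += len(group)
    return removed
```

The liberty count belongs to the whole group, not to a single stone, which is why the fill comes before the count.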

🥇 ollama-gemma4

100%

gemma4:31b · 13.3 s · 6.8 KB

🥈 ollama-qwen-coder

100%

qwen3-coder:480b · 16.7 s · 6.2 KB

🥉 openai

100%

gpt-4.1-mini · 17.4 s · 5.2 KB

ollama-devstral

100%

devstral-2:123b · 18.7 s · 5.1 KB

ollama-glm

100%

glm-4.7 · 19.7 s · 7.7 KB

ollama-deepseek-v4-flash

100%

deepseek-v4-flash · 22.3 s · 6.5 KB

ollama-gpt-oss

100%

gpt-oss:120b · 26.8 s · 4.5 KB

ollama-qwen-coder-next

100%

qwen3-coder-next · 46.7 s · 5.7 KB

ollama-kimi-thinking

kimi-k2-thinking · 2 m · 0 B

No runnable Python in the model's response.

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable Python in the model's response.

Snake

Grid + keyboard + game loop. The "did the model produce a real game loop" warm-up.
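The arcade entries are HTML, but the per-tick logic of a game loop is language-neutral. Here is a minimal Python sketch of one tick, assuming a square grid, a deque of (row, col) cells with the head at index 0, and a hypothetical `tick` function.

```python
from collections import deque
from typing import Deque, Tuple

Cell = Tuple[int, int]


def tick(snake: Deque[Cell], direction: Cell, food: Cell, size: int) -> str:
    """Advance the snake one step in place.

    Returns 'dead', 'ate' or 'moved'. snake[0] is the head.
    """
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    # Wall collision.
    if not (0 <= head[0] < size and 0 <= head[1] < size):
        return "dead"
    ate = head == food
    if not ate:
        # The tail vacates its cell first, so following it is legal.
        snake.pop()
    # Self collision.
    if head in snake:
        return "dead"
    snake.appendleft(head)
    return "ate" if ate else "moved"
```

A real loop would call something like this on a fixed timer, read the latest key press into `direction`, and redraw after each tick.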

🥇 ollama-gpt-oss

T2 ✓ 100%

gpt-oss:120b · 11 s · 4.0 KB

🥈 ollama-gemma4

T2 ✓ 100%

gemma4:31b · 12.6 s · 6.6 KB

🥉 ollama-devstral

T2 ✓ 100%

devstral-2:123b · 17.5 s · 6.4 KB

ollama-glm

T2 ✓ 100%

glm-4.7 · 24.8 s · 9.5 KB

openai

T2 ✓ 100%

gpt-4.1-mini · 28.2 s · 5.0 KB

ollama-qwen-coder

T2 ✓ 100%

qwen3-coder:480b · 29.6 s · 7.4 KB

ollama-qwen-coder-next

T2 ✗ 100%

qwen3-coder-next · 38.3 s · 10.8 KB

ollama-deepseek-v4-flash

T2 ✓ 100%

deepseek-v4-flash · 48.6 s · 12.1 KB

ollama-kimi-thinking

T2 ✓ 100%

kimi-k2-thinking · 1.6 m · 5.1 KB

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

Tetris

Piece rotation, line clearing, gravity. The first one where pretty code breaks at runtime.
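Two of the three mechanics fit in a few lines. This sketch assumes pieces and the playfield are lists of rows with 0 for empty cells; `rotate_cw` and `clear_lines` are illustrative names. Wall kicks and collision tests are omitted.

```python
from typing import List, Tuple

Grid = List[List[int]]


def rotate_cw(piece: Grid) -> Grid:
    """Rotate a piece 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(row) for row in zip(*piece[::-1])]


def clear_lines(grid: Grid) -> Tuple[Grid, int]:
    """Drop every full row and pad the top with empty rows.

    Returns the new grid and the number of rows cleared.
    """
    width = len(grid[0])
    kept = [row for row in grid if not all(row)]
    cleared = len(grid) - len(kept)
    empty = [[0] * width for _ in range(cleared)]
    return empty + kept, cleared
```

For example, rotating the L piece `[[1, 0], [1, 0], [1, 1]]` gives `[[1, 1, 1], [1, 0, 0]]`. A typical runtime failure is rotating without re-checking that the new shape still fits inside the well.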

🥇 ollama-devstral

T2 ✗ 100%

devstral-2:123b · 27.4 s · 8.6 KB

🥈 ollama-deepseek-v4-flash

T2 ✓ 100%

deepseek-v4-flash · 1.3 m · 18.3 KB

🥉 ollama-kimi-thinking

T2 ✓ 100%

kimi-k2-thinking · 1.5 m · 7.5 KB

ollama-glm

T2 ✓ 85.7%

glm-4.7 · 16 s · 11.9 KB

ollama-gpt-oss

T2 ✓ 85.7%

gpt-oss:120b · 21.3 s · 5.7 KB

ollama-gemma4

T2 ✓ 85.7%

gemma4:31b · 27.7 s · 9.3 KB

ollama-qwen-coder

T2 ✓ 85.7%

qwen3-coder:480b · 37 s · 9.2 KB

ollama-qwen-coder-next

T2 ✓ 85.7%

qwen3-coder-next · 58.1 s · 14.5 KB

openai

57.1%

gpt-4.1-mini · 51.2 s · 0 B

No runnable HTML in the model's response.

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

Space Invaders

Waves, bullets, collision, lose-condition. Stacks more concurrent state than the model usually wants.
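The collision part is usually an axis-aligned bounding-box test applied to every bullet against every invader. Below is a sketch in Python, assuming rectangles with a top-left origin; `Rect`, `overlaps` and `resolve_hits` are hypothetical names, not anything the models were required to produce.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float


def overlaps(a: Rect, b: Rect) -> bool:
    """Axis-aligned bounding-box test; touching edges do not count."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def resolve_hits(bullets: List[Rect],
                 invaders: List[Rect]) -> Tuple[List[Rect], List[Rect]]:
    """Return (surviving bullets, surviving invaders).

    Each bullet destroys at most one invader and is consumed by the hit.
    """
    live = list(invaders)
    remaining = []
    for bullet in bullets:
        hit = next((inv for inv in live if overlaps(bullet, inv)), None)
        if hit is None:
            remaining.append(bullet)
        else:
            live.remove(hit)
    return remaining, live
```

Building new lists instead of deleting from the originals during iteration avoids the skipped-element bug that tends to appear when several objects die in the same frame.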

🥇 ollama-gemma4

T2 ✓ 100%

gemma4:31b · 15.9 s · 8.5 KB

🥈 ollama-glm

T2 ✗ 100%

glm-4.7 · 21.8 s · 13.4 KB

🥉 ollama-gpt-oss

T2 ✓ 100%

gpt-oss:120b · 22.8 s · 5.6 KB

ollama-devstral

T2 ✓ 100%

devstral-2:123b · 26.4 s · 9.3 KB

ollama-qwen-coder

T2 ✓ 100%

qwen3-coder:480b · 30.6 s · 8.5 KB

ollama-qwen-coder-next

T2 ✓ 100%

qwen3-coder-next · 34.8 s · 15.2 KB

openai

T2 ✓ 100%

gpt-4.1-mini · 49.5 s · 9.8 KB

ollama-deepseek-v4-flash

T2 ✓ 100%

deepseek-v4-flash · 57.6 s · 13.7 KB

ollama-minimax

minimax-m2.5 · 2 m · 0 B

No runnable HTML in the model's response.

ollama-kimi-thinking

kimi-k2-thinking · 2 m · 0 B

No runnable HTML in the model's response.

Asteroids

Vector math, ship rotation, screen wrap. Hardest of the four — most one-shots produce shaky physics.
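Screen wrap and thrust reduce to a modulo and a little trigonometry. This is a sketch under assumed conventions (heading in radians, y growing downward, positions in pixels); `step` and `thrust` are illustrative names.

```python
import math
from typing import Tuple


def thrust(vx: float, vy: float, heading: float,
           accel: float, dt: float) -> Tuple[float, float]:
    """Add acceleration along the ship's heading to its velocity."""
    return (vx + math.cos(heading) * accel * dt,
            vy + math.sin(heading) * accel * dt)


def step(x: float, y: float, vx: float, vy: float, dt: float,
         width: float, height: float) -> Tuple[float, float]:
    """Move by velocity * dt and wrap around the screen edges."""
    return (x + vx * dt) % width, (y + vy * dt) % height
```

Python's `%` returns a non-negative result for a positive modulus, so `step(795, 10, 10, -20, 1, 800, 600)` gives `(5, 590)`. JavaScript's `%` keeps the sign of the left operand, so browser code typically needs something like `((v % n) + n) % n` to wrap off the left and top edges correctly.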

🥇 ollama-glm

T2 ✗ 100%

glm-4.7 · 23.8 s · 14.2 KB

🥈 ollama-qwen-coder

T2 ✗ 100%

qwen3-coder:480b · 24.3 s · 9.4 KB

🥉 ollama-gpt-oss

T2 ✓ 100%

gpt-oss:120b · 49.1 s · 7.1 KB

ollama-deepseek-v4-flash

T2 ✓ 100%

deepseek-v4-flash · 1.1 m · 10.8 KB

ollama-qwen-coder-next

T2 ✓ 100%

qwen3-coder-next · 1.1 m · 15.5 KB

openai

T2 ✓ 100%

gpt-4.1-mini · 1.4 m · 12.1 KB

ollama-gemma4

T2 ✓ 88.9%

gemma4:31b · 17.4 s · 10.1 KB

ollama-devstral

T2 ✓ 88.9%

devstral-2:123b · 32.2 s · 9.9 KB

ollama-minimax

T2 ✓ 88.9%

minimax-m2.5 · 1.6 m · 14.1 KB

ollama-kimi-thinking

kimi-k2-thinking · 4 s · 0 B

No runnable HTML in the model's response.

How this works

Every iframe is served from the same origin as this page, but the sandbox="allow-scripts" attribute removes same-origin privileges and disables form submission, top-level navigation, and storage. Each game runs as if it were on a stranger's domain — it can't read cookies, it can't touch the parent page, it can't navigate anywhere. If you'd rather inspect the source directly before running it, hit Download .html on any card.

Static checks (the percentage on each card) come from benchmarks/suites/oneshot_arcade.toml. They look for things like a real requestAnimationFrame loop, collision detection in Space Invaders, screen-wrap math in Asteroids — the bits weak one-shots forget. The score tells you the response had the right shape; whether it actually plays is what your eyes are for.
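To make the idea of a static score concrete, here is a hedged sketch of a pattern-based checker. The check names and regular expressions are assumptions made for illustration; the actual contents and format of oneshot_arcade.toml are not shown on this page.

```python
import re
from typing import Dict

# Hypothetical checks: each one is a regex that must match the source.
CHECKS: Dict[str, str] = {
    "animation loop": r"requestAnimationFrame\s*\(",
    "2d canvas context": r"getContext\(\s*[\"']2d[\"']\s*\)",
    "keyboard input": r"addEventListener\(\s*[\"']key(down|up)[\"']",
}


def static_score(source: str, checks: Dict[str, str] = CHECKS) -> float:
    """Return the percentage of checks whose pattern appears in source."""
    passed = sum(1 for pattern in checks.values() if re.search(pattern, source))
    return round(100 * passed / len(checks), 1)
```

Under this scheme a response that passes two of three checks scores 66.7, and six of seven would score 85.7. A pattern match only shows that the right kind of code is present, not that it works.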

The T2 ✓ badge on HTML cards is the runtime check: bun arcade:runtime loads each game in headless Chromium, listens for console errors, and verifies that something non-blank is painted during the first 2.5 seconds. A green T2 means the program at least starts; a red T2 means it threw on load or rendered nothing. Subtle gameplay bugs (gemma4's upside-down chess, Asteroids without thrust) still need your eyes. Tier 2 is the floor, not the ceiling.
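The non-blank test can be illustrated without a browser. The sketch below assumes the frame has already been captured as raw RGBA bytes and treats a frame as blank when every pixel matches the first one. It is a guess at the idea, not the logic that bun arcade:runtime actually runs.

```python
def is_blank(rgba: bytes, tolerance: int = 0) -> bool:
    """True if every RGBA pixel matches the first pixel.

    tolerance is the largest per-channel difference still counted as equal.
    """
    if len(rgba) < 4:
        return True  # no complete pixel to compare
    first = rgba[:4]
    for i in range(4, len(rgba) - 3, 4):
        if any(abs(rgba[i + k] - first[k]) > tolerance for k in range(4)):
            return False
    return True
```

A 4×4 all-black frame is blank; change one pixel to white and it is not. A small tolerance helps if the capture went through lossy compression.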