We publish our numbers.
Do they?
Real automated test results, not cherry-picked demos. 23 runs aggregated across 16 suites and 17 providers.
Results as of 2026-05-20
Latest Results
Pass rate is the share of tests that pass every check. Time is the total wall-clock time, summed across the suite's tests. Only the latest run per (suite, provider) pair is shown.
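As a rough illustration of how the Pass and Time columns could be derived, here is a minimal Python sketch. The `TestResult` type, its field names, and the helper functions are hypothetical and are not taken from merlin's code; only the "every check must pass" rule and the summed wall-clock time come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    """One test's outcome. Field names are hypothetical, for illustration only."""
    checks_passed: int
    checks_total: int
    seconds: float


def summarize(results: list[TestResult]) -> tuple[int, int, float, float]:
    """Return (passed, total, pass_rate_percent, total_seconds) for one suite run."""
    # A test counts as passed only if every one of its checks passed.
    passed = sum(1 for r in results if r.checks_passed == r.checks_total)
    total = len(results)
    rate = round(100 * passed / total, 1) if total else 0.0
    # Time is the sum of the per-test wall-clock durations.
    total_seconds = sum(r.seconds for r in results)
    return passed, total, rate, total_seconds


def format_pass(passed: int, total: int) -> str:
    """Render a Pass cell in the style of the tables, e.g. '6/7 (85.7%)'."""
    rate = round(100 * passed / total, 1) if total else 0.0
    # The 'g' format drops a trailing '.0', so 70.0 is rendered as '70'.
    return f"{passed}/{total} ({rate:g}%)"


if __name__ == "__main__":
    demo = [TestResult(3, 3, 1.0), TestResult(2, 3, 2.5), TestResult(1, 1, 0.5)]
    print(summarize(demo))
    print(format_pass(7, 10))
```

A test with two of three checks passing counts as a failure here, which matches the "every check" wording.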
nightmare_mode
10 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| ollama-qwen-coder-next | qwen3-coder-next | 7/10 (70%) | 4.7m |
| ollama-kimi | kimi-k2.5 | 6/10 (60%) | 6.3m |
| ollama-deepseek-v4-flash | deepseek-v4-flash | 5/10 (50%) | 6.6m |
| ollama-gpt-oss | gpt-oss:120b | 4/10 (40%) | 2.9m |
| openai | gpt-4.1-mini | 1/10 (10%) | 1.3m |
hard_mode
10 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| ollama-qwen-coder | qwen3-coder:480b | 9/10 (90%) | 3.2m |
| ollama-qwen-coder-next | qwen3-coder-next | 9/10 (90%) | 4.2m |
| ollama-gpt-oss | gpt-oss:120b | 9/10 (90%) | 4.7m |
| ollama-kimi | kimi-k2.5 | 9/10 (90%) | 8.6m |
| ollama-deepseek-v4-flash | deepseek-v4-flash | 7/10 (70%) | 5.2m |
| ollama | qwen3.5:397b | 6/10 (60%) | 12.1m |
| openai | gpt-4.1-mini | 4/10 (40%) | 1.4m |
oneshot_arcade
4 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 4/4 (100%) | 3.8m |
| ollama-deepseek-v4-flash | deepseek-v4-flash | 4/4 (100%) | 4.2m |
| ollama-glm | glm-4.7 | 3/4 (75%) | 1.4m |
| ollama-devstral | devstral-2:123b | 3/4 (75%) | 1.7m |
| ollama-gpt-oss | gpt-oss:120b | 3/4 (75%) | 1.7m |
| ollama-qwen-coder | qwen3-coder:480b | 3/4 (75%) | 2m |
| ollama-qwen-coder-next | qwen3-coder-next | 3/4 (75%) | 3.3m |
| ollama-gemma4 | gemma4:31b | 2/4 (50%) | 1.2m |
| ollama-kimi-thinking | kimi-k2-thinking | 2/4 (50%) | 5.1m |
claude_code_comparison
7 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| ollama-qwen-coder-next | qwen3-coder-next | 6/7 (85.7%) | 1.5m |
| ollama | qwen3.5:397b | 6/7 (85.7%) | 5.4m |
| ollama-qwen-coder | qwen3-coder:480b | 5/7 (71.4%) | 33.3s |
| ollama-gemma4 | gemma4:31b | 5/7 (71.4%) | 2.4m |
| ollama-kimi | kimi-k2.5 | 5/7 (71.4%) | 9.2m |
| openai | gpt-4.1-mini | 3/7 (42.9%) | 57.9s |
| ollama-devstral | devstral-2:123b | 3/7 (42.9%) | 3.1m |
| ollama-deepseek-v4-flash | deepseek-v4-flash | 3/7 (42.9%) | 3.2m |
oneshot_games
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| ollama-qwen-coder | qwen3-coder:480b | 5/5 (100%) | 1.3m |
| ollama-glm | glm-4.7 | 5/5 (100%) | 1.9m |
| ollama-qwen-coder-next | qwen3-coder-next | 4/5 (80%) | 2.2m |
| openai | gpt-4.1-mini | 4/5 (80%) | 2.7m |
| ollama-deepseek-v4-flash | deepseek-v4-flash | 4/5 (80%) | 3.9m |
| ollama-devstral | devstral-2:123b | 3/5 (60%) | 1.5m |
| ollama-gpt-oss | gpt-oss:120b | 3/5 (60%) | 2.2m |
| ollama-gemma4 | gemma4:31b | 3/5 (60%) | 3.1m |
reasoning
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 5/5 (100%) | 1.9s |
| openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 4.4s |
| ollama-gpt-oss | gpt-oss:120b | 5/5 (100%) | 13s |
| ollama-kimi | kimi-k2.5 | 5/5 (100%) | 50.3s |
| local | qwen3.5:cloud | 5/5 (100%) | 2m |
| openrouter-llama | meta-llama/llama-4-maverick | 4/5 (80%) | 3s |
| openrouter-deepseek | deepseek/deepseek-chat | 4/5 (80%) | 3.7s |
| openrouter | anthropic/claude-sonnet-4-6 | 4/5 (80%) | 13.8s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 3/5 (60%) | 4.6s |
hard_mode_augmented
10 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| ollama-qwen-coder-next | qwen3-coder-next | 8/10 (80%) | 4.9m |
| openai | gpt-4.1-mini | 7/10 (70%) | 1.6m |
| ollama | qwen3-coder:480b | 7/10 (70%) | 2.2m |
| plugins: python-run ×9, ts-check ×4, rust-check ×1, sql-run ×1 | | | |
| ollama-qwen-coder | qwen3-coder:480b | 6/10 (60%) | 2.3m |
| plugins: python-run ×9, sql-run ×4, ts-check ×4, rust-check ×1 | | | |
| ollama-kimi | kimi-k2.5 | 5/10 (50%) | 5.3m |
| plugins: python-run ×10 (3 failed), ts-check ×2, rust-check ×1, sql-run ×1 | | | |
nightmare_mode_augmented
10 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 6/10 (60%) | 1.8m |
| ollama-qwen-coder-next | qwen3-coder-next | 4/10 (40%) | 5.5m |
| ollama-gpt-oss | gpt-oss:120b | 3/10 (30%) | 2.9m |
| ollama-kimi | kimi-k2.5 | 3/10 (30%) | 5.1m |
context_management
7 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 5/7 (71.4%) | 1.1m |
| ollama-qwen-coder-next | qwen3-coder-next | 3/7 (42.9%) | 2.4m |
design
4 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openrouter-gemini | google/gemini-2.5-flash | 4/4 (100%) | 3.6s |
| openai | gpt-4.1-mini | 4/4 (100%) | 4.4s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 4/4 (100%) | 7.4s |
| openrouter-deepseek | deepseek/deepseek-chat | 4/4 (100%) | 12.9s |
| local | qwen3.5:cloud | 4/4 (100%) | 1.7m |
| openrouter-llama | meta-llama/llama-4-maverick | 3/4 (75%) | 7.5s |
| openrouter | anthropic/claude-sonnet-4-6 | 3/4 (75%) | 11.2s |
communication
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.9s |
| openai | gpt-4.1-mini | 5/5 (100%) | 4.5s |
| openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 13.1s |
| local | qwen3.5:cloud | 5/5 (100%) | 1.8m |
| openrouter | anthropic/claude-sonnet-4-6 | 4/5 (80%) | 7.6s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 4/5 (80%) | 7.8s |
| openrouter-llama | meta-llama/llama-4-maverick | 4/5 (80%) | 14.1s |
domain
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 2.9s |
| openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 3s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 3.6s |
| openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 3.8s |
| openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 4.3s |
| openai | gpt-4.1-mini | 5/5 (100%) | 6.9s |
| local | qwen3.5:cloud | 4/5 (80%) | 3.1m |
basic
3 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openrouter-gemini | google/gemini-2.5-flash | 3/3 (100%) | 1.6s |
| openrouter-llama | meta-llama/llama-4-maverick | 3/3 (100%) | 1.6s |
| openrouter | anthropic/claude-sonnet-4-6 | 3/3 (100%) | 2.2s |
| openai | gpt-4.1-mini | 3/3 (100%) | 2.3s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 3/3 (100%) | 2.4s |
| openrouter-deepseek | deepseek/deepseek-chat | 3/3 (100%) | 2.7s |
| ollama-kimi | kimi-k2.5 | 3/3 (100%) | 9.1s |
| local | qwen3.5:cloud | 3/3 (100%) | 27.1s |
coding
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 5/5 (100%) | 3.1s |
| openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.3s |
| openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 4.5s |
| openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 7.2s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 7.4s |
| openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 8.8s |
| ollama-gpt-oss | gpt-oss:120b | 5/5 (100%) | 9.4s |
| local | qwen3.5:cloud | 5/5 (100%) | 1.9m |
reliability
7 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openai | gpt-4.1-mini | 5/7 (71.4%) | 32.9s |
| ollama-qwen-coder-next | qwen3-coder-next | 5/7 (71.4%) | 1.1m |
tool_usage
5 tests

| Provider | Model | Pass | Time |
|---|---|---|---|
| openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.8s |
| openai | gpt-4.1-mini | 5/5 (100%) | 4.1s |
| openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 4.8s |
| openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 4.9s |
| openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 6.8s |
| openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 10.2s |
| ollama-kimi | kimi-k2.5 | 5/5 (100%) | 13.5s |
| local | qwen3.5:cloud | 5/5 (100%) | 35.3s |
Test Suites
Every suite is defined in TOML under benchmarks/suites/. Each test has prompts, expected behavior, and check assertions.
Code Analysis
Security vulnerabilities, concurrency bugs, performance analysis, and multi-function comprehension
Agent Tasks
Complex multi-step tasks simulating AI agent workflows: planning, error recovery, ambiguity handling
Claude Code Comparison
Head-to-head tasks comparing agent capabilities: code analysis, bug finding, refactoring, spec compliance, multi-step reasoning
Architecture
System design, distributed systems reasoning, and tradeoff analysis at staff+ engineer level
Context Management
Tests context retention, information synthesis across turns, and efficient tool usage — the skills that separate agents from chatbots
Oneshot Arcade
One-shot HTML+JS arcade games. Each test asks for a complete, single-file, browser-runnable arcade game — HTML + CSS + JS inline. Harder than the CLI suite because the model has to compose document structure, canvas rendering, an event-driven game loop, and input handling in one fence. Tier 1 here is static (regex + keyword checks). Tier 2 (Playwright headless render + screenshot non-blank + no console errors) is a follow-up — see issue #471.
Long Session
Long-session benchmark: measures whether an agent retains state across many turns and growing context, not single-call quality. The gap that shows up between a 6-task bench score and real 5-hour multi-PR sessions is almost always here: by turn 20, the model has spilled its earliest state and is re-reading files, contradicting earlier decisions, or inventing new conventions inconsistent with turn 3.

These tests intentionally don't time out faster on slow providers; the question is fidelity, not latency. A 480B local-cloud model that takes 3 minutes per turn but holds the thread for 30 turns beats a 200ms frontier API that drifts at turn 8.

Failure modes specifically targeted:
- Drift: convention set early is violated late
- Re-read: model re-fetches a file it already saw in this session
- Hallucinated reconciliation: when two earlier turns disagree, model invents a third option instead of asking
- Decision amnesia: model forgets a binding decision and re-litigates
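The Re-read failure mode is the easiest of the four to detect mechanically. The sketch below assumes a session log reduced to `(turn, path)` pairs for every file read; that log format and the `find_rereads` helper are hypothetical, not part of merlin.

```python
def find_rereads(read_log: list[tuple[int, str]]) -> list[tuple[int, str, int]]:
    """Return (turn, path, first_turn) for every read of an already-seen path.

    read_log is a hypothetical, chronologically ordered list of
    (turn_number, file_path) pairs, one per file read in the session.
    """
    first_seen: dict[str, int] = {}
    rereads: list[tuple[int, str, int]] = []
    for turn, path in read_log:
        if path in first_seen:
            # The model fetched a file it had already seen earlier in the session.
            rereads.append((turn, path, first_seen[path]))
        else:
            first_seen[path] = turn
    return rereads


if __name__ == "__main__":
    log = [(1, "src/lib.rs"), (3, "Cargo.toml"), (12, "src/lib.rs"), (20, "src/lib.rs")]
    for turn, path, first in find_rereads(log):
        print(f"turn {turn}: re-read {path} (first read at turn {first})")
```

This simple version treats every repeat as a failure. A real check would probably need to exempt files that were modified between the two reads.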
Coding
Code generation, analysis, and debugging
Oneshot Games
One-shot CLI game implementations. Each test asks for a complete, runnable Python program for a classic board game in a single response — no tools, no follow-up turns. Two layers of checks: Tier 1 is static (regex / keywords / min_length) and verifies the response is a substantial code block with the structural elements a working game needs. Tier 2 (executes) actually runs the model's Python under a sandboxed subprocess, pipes scripted input, and regexes stdout — it catches programs that look right on paper but crash, hang, or print the wrong thing at runtime.
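The two tiers described above can be sketched roughly as follows. This is an illustration under assumptions, not merlin's implementation: the function names are invented, and a plain `subprocess.run` with a timeout stands in for the sandbox, which it is not.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example self-contained


def extract_code(response: str) -> str:
    """Pull the first fenced code block out of a model response, or '' if none."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else ""


def tier1_static(code: str, required_patterns: list[str], min_length: int) -> bool:
    """Tier 1: the code is long enough and contains every required pattern."""
    return len(code) >= min_length and all(re.search(p, code) for p in required_patterns)


def tier2_executes(code: str, scripted_input: str, stdout_pattern: str,
                   timeout_s: float = 5.0) -> bool:
    """Tier 2: run the program, feed it scripted input, and regex its stdout.

    NOTE: a bare subprocess is NOT a sandbox; it only stands in for one here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=scripted_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # the program hung
    # A non-zero exit code means the program crashed or exited with an error.
    return proc.returncode == 0 and re.search(stdout_pattern, proc.stdout) is not None


if __name__ == "__main__":
    response = "Here you go:\n" + FENCE + "python\nname = input()\nprint('hello ' + name)\n" + FENCE
    code = extract_code(response)
    print(tier1_static(code, [r"input\(", r"print\("], min_length=10))
    print(tier2_executes(code, "world\n", r"hello world"))
```

The timeout is what separates a hang from a slow but correct program, so its value is a tuning decision. The 5-second default here is arbitrary.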
Tool Usage
Structured output, JSON generation, and schema adherence
Design
Architecture, system design, and tradeoff analysis
Advanced Reasoning
Hard logic, constraint satisfaction, and mathematical reasoning that demands multi-step chains
Nightmare Mode
Tests designed to break frontier models. Multi-step cascading logic, subtle off-by-one traps, adversarial framing, and problems where confident-sounding wrong answers are more natural than careful correct ones.
Basic
Basic provider capability tests
Nightmare Mode Augmented
Nightmare-mode tests with focused fledge plugins. Same adversarial problems, but models can use calc, python-run, ts-check etc. to verify answers rather than relying on pure mental computation.
Domain
Fledge plugin system, specsync, and domain-specific knowledge
Refusal
Adversarial input that tries to bypass merlin's safety stack at the prompt level. Tests the model's behavior when asked to do something destructive in obfuscated language, when fed a prompt injection, or when offered a forged authority. These are PROMPT-level decisions, not plugin-gate decisions — the sql-run gate, memory-delete confirm-token flow, and audit log are separate defenses already validated in #466. This suite checks whether the model itself refuses cleanly.
Reliability
Tests instruction following, format compliance, structured output, and self-verification — the reliability traits that make agents production-ready
Hard Mode Augmented
Same adversarial tests as hard_mode, but with focused fledge plugins — measures how much scoped tools close the gap with frontier models
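One way to read "close the gap" is as a per-provider change in pass rate between hard_mode and hard_mode_augmented. The sketch below computes that change in percentage points, using the pass counts from the Latest Results tables for the four providers that appear in both suites with the same model. The `ollama` row is left out because its model differs between the two suites. The helper name is hypothetical.

```python
# (passed, total) per provider, copied from the Latest Results tables.
hard_mode = {
    "ollama-qwen-coder-next": (9, 10),
    "ollama-qwen-coder": (9, 10),
    "ollama-kimi": (9, 10),
    "openai": (4, 10),
}
hard_mode_augmented = {
    "ollama-qwen-coder-next": (8, 10),
    "ollama-qwen-coder": (6, 10),
    "ollama-kimi": (5, 10),
    "openai": (7, 10),
}


def delta_points(base: dict, augmented: dict) -> dict[str, int]:
    """Pass-rate change in percentage points for providers present in both suites."""
    deltas = {}
    for provider, (b_pass, b_total) in base.items():
        if provider in augmented:
            a_pass, a_total = augmented[provider]
            deltas[provider] = round(100 * a_pass / a_total - 100 * b_pass / b_total)
    return deltas


if __name__ == "__main__":
    for provider, delta in delta_points(hard_mode, hard_mode_augmented).items():
        print(f"{provider}: {delta:+d} points")
```

On these figures only openai improves with plugins (+30 points), while the three ollama providers score lower. Each figure is a single latest run with no retries, so small differences may be noise.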
Hard Mode
Adversarial benchmarks targeting frontier model weak spots: false positive resistance, precise computation, type theory, and problems where the obvious answer is wrong
Multi Turn
Multi-turn conversations that build context across prompts and demand a substantive final answer.
Roleplaying
Persona consistency under specific framings. Tests whether the model honors a role (mentor, auditor, skeptic) for the duration of the response rather than dropping into a default helpful-assistant tone.
Communication
Conversation quality, conciseness, and persona adherence
Reasoning
Logic, math, and multi-step reasoning
Stress Test
Adversarial inputs, constraint-heavy prompts, and edge cases that break weaker models
Engineering
Realistic dev tasks: bug-finding, refactoring, debugging, code review. Closer to merlin's actual coding-agent use case than the games suite.
Expert
Expert-level tasks requiring Claude-caliber reasoning — multi-file code comprehension, proof construction, ambiguity resolution
Methodology
All benchmarks are automated and reproducible. Each test suite defines prompts, expected behaviors, and validation checks in TOML. Results are generated by running every provider against every test; the tables above keep only the most recent run per (suite, provider) pair. No retries, no warm-up.
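The "most recent run per (suite, provider) pair" rule amounts to a keyed reduction over the run history. Here is a minimal sketch, assuming each run is a record with `suite`, `provider`, and an ISO-8601 `finished_at` timestamp. Those field names and the sample records are made up for illustration.

```python
from datetime import datetime


def latest_per_pair(runs: list[dict]) -> dict[tuple[str, str], dict]:
    """Keep only the most recent run for each (suite, provider) pair."""
    latest: dict[tuple[str, str], dict] = {}
    for run in runs:
        key = (run["suite"], run["provider"])
        finished = datetime.fromisoformat(run["finished_at"])
        current = latest.get(key)
        if current is None or finished > datetime.fromisoformat(current["finished_at"]):
            latest[key] = run
    return latest


if __name__ == "__main__":
    # Made-up history records; run_id values are placeholders.
    history = [
        {"run_id": "a", "suite": "basic", "provider": "openai", "finished_at": "2026-05-18T02:00:00"},
        {"run_id": "b", "suite": "basic", "provider": "openai", "finished_at": "2026-05-20T02:00:00"},
        {"run_id": "c", "suite": "basic", "provider": "local", "finished_at": "2026-05-19T02:00:00"},
    ]
    for (suite, provider), run in latest_per_pair(history).items():
        print(suite, provider, run["run_id"])
```

Because older runs are dropped rather than averaged, a displayed pass rate reflects one run only, which is consistent with the "no retries" policy.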
Run locally
```sh
# Run all benchmark suites against all providers
cargo run -p merlin-cli -- bench

# Run against a specific provider
cargo run -p merlin-cli -- bench --provider openai

# View accumulated stats
cargo run -p merlin-cli -- bench --history
```