Transparency

We publish our numbers.
Do they?

Real automated test results, not cherry-picked demos. 23 runs aggregated across 16 suites and 17 providers.

Results as of 2026-05-20

27 Test Suites
172 Total Tests
17 Providers Scored

Latest Results

Pass rate is the share of tests that pass every check. Time is the total wall-clock time summed across the suite's tests. Only the latest run per (suite, provider) pair is shown.
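As a rough illustration of how the Pass and Time columns can be derived, here is a minimal Python sketch. The `CaseResult` record and its field names are hypothetical and not taken from merlin's code, and the rounding rules are only a guess that reproduces the style of the figures shown on this page.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test's outcome. Field names are hypothetical, for illustration only."""
    name: str
    checks_passed: int
    checks_total: int
    seconds: float


def summarize(results):
    """Return (tests_passed, tests_total, total_seconds) for one suite run."""
    # A test counts as passed only when every one of its checks passed.
    passed = sum(1 for r in results if r.checks_passed == r.checks_total)
    total_seconds = sum(r.seconds for r in results)
    return passed, len(results), total_seconds


def format_pass(passed, total):
    """Render a pass cell such as '6/7 (85.7%)'. Rounding is an assumption."""
    if total == 0:
        return "0/0"
    pct = round(100 * passed / total, 1)
    return f"{passed}/{total} ({pct:g}%)"


def format_time(seconds):
    """Render seconds as '33.3s' or minutes as '4.7m'. Threshold is an assumption."""
    if seconds < 60:
        return f"{round(seconds, 1):g}s"
    return f"{round(seconds / 60, 1):g}m"
```

For example, `format_pass(6, 7)` gives `6/7 (85.7%)` and `format_time(282)` gives `4.7m`.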

nightmare_mode

10 tests
Provider | Model | Pass | Time
🥇 ollama-qwen-coder-next | qwen3-coder-next | 7/10 (70%) | 4.7m
🥈 ollama-kimi | kimi-k2.5 | 6/10 (60%) | 6.3m
🥉 ollama-deepseek-v4-flash | deepseek-v4-flash | 5/10 (50%) | 6.6m
ollama-gpt-oss | gpt-oss:120b | 4/10 (40%) | 2.9m
openai | gpt-4.1-mini | 1/10 (10%) | 1.3m

hard_mode

10 tests
Provider | Model | Pass | Time
🥇 ollama-qwen-coder | qwen3-coder:480b | 9/10 (90%) | 3.2m
🥈 ollama-qwen-coder-next | qwen3-coder-next | 9/10 (90%) | 4.2m
🥉 ollama-gpt-oss | gpt-oss:120b | 9/10 (90%) | 4.7m
ollama-kimi | kimi-k2.5 | 9/10 (90%) | 8.6m
ollama-deepseek-v4-flash | deepseek-v4-flash | 7/10 (70%) | 5.2m
ollama | qwen3.5:397b | 6/10 (60%) | 12.1m
openai | gpt-4.1-mini | 4/10 (40%) | 1.4m

oneshot_arcade

4 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 4/4 (100%) | 3.8m
🥈 ollama-deepseek-v4-flash | deepseek-v4-flash | 4/4 (100%) | 4.2m
🥉 ollama-glm | glm-4.7 | 3/4 (75%) | 1.4m
ollama-devstral | devstral-2:123b | 3/4 (75%) | 1.7m
ollama-gpt-oss | gpt-oss:120b | 3/4 (75%) | 1.7m
ollama-qwen-coder | qwen3-coder:480b | 3/4 (75%) | 2m
ollama-qwen-coder-next | qwen3-coder-next | 3/4 (75%) | 3.3m
ollama-gemma4 | gemma4:31b | 2/4 (50%) | 1.2m
ollama-kimi-thinking | kimi-k2-thinking | 2/4 (50%) | 5.1m

claude_code_comparison

7 tests
Provider | Model | Pass | Time
🥇 ollama-qwen-coder-next | qwen3-coder-next | 6/7 (85.7%) | 1.5m
🥈 ollama | qwen3.5:397b | 6/7 (85.7%) | 5.4m
🥉 ollama-qwen-coder | qwen3-coder:480b | 5/7 (71.4%) | 33.3s
ollama-gemma4 | gemma4:31b | 5/7 (71.4%) | 2.4m
ollama-kimi | kimi-k2.5 | 5/7 (71.4%) | 9.2m
openai | gpt-4.1-mini | 3/7 (42.9%) | 57.9s
ollama-devstral | devstral-2:123b | 3/7 (42.9%) | 3.1m
ollama-deepseek-v4-flash | deepseek-v4-flash | 3/7 (42.9%) | 3.2m

oneshot_games

5 tests
Provider | Model | Pass | Time
🥇 ollama-qwen-coder | qwen3-coder:480b | 5/5 (100%) | 1.3m
🥈 ollama-glm | glm-4.7 | 5/5 (100%) | 1.9m
🥉 ollama-qwen-coder-next | qwen3-coder-next | 4/5 (80%) | 2.2m
openai | gpt-4.1-mini | 4/5 (80%) | 2.7m
ollama-deepseek-v4-flash | deepseek-v4-flash | 4/5 (80%) | 3.9m
ollama-devstral | devstral-2:123b | 3/5 (60%) | 1.5m
ollama-gpt-oss | gpt-oss:120b | 3/5 (60%) | 2.2m
ollama-gemma4 | gemma4:31b | 3/5 (60%) | 3.1m

reasoning

5 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 5/5 (100%) | 1.9s
🥈 openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 4.4s
🥉 ollama-gpt-oss | gpt-oss:120b | 5/5 (100%) | 13s
ollama-kimi | kimi-k2.5 | 5/5 (100%) | 50.3s
local | qwen3.5:cloud | 5/5 (100%) | 2m
openrouter-llama | meta-llama/llama-4-maverick | 4/5 (80%) | 3s
openrouter-deepseek | deepseek/deepseek-chat | 4/5 (80%) | 3.7s
openrouter | anthropic/claude-sonnet-4-6 | 4/5 (80%) | 13.8s
openrouter-haiku | anthropic/claude-3.5-haiku | 3/5 (60%) | 4.6s

hard_mode_augmented

10 tests
Provider | Model | Pass | Time
🥇 ollama-qwen-coder-next | qwen3-coder-next | 8/10 (80%) | 4.9m
🥈 openai | gpt-4.1-mini | 7/10 (70%) | 1.6m
🥉 ollama | qwen3-coder:480b | 7/10 (70%) | 2.2m
    python-run ×9, ts-check ×4, rust-check ×1, sql-run ×1
ollama-qwen-coder | qwen3-coder:480b | 6/10 (60%) | 2.3m
    python-run ×9, sql-run ×4, ts-check ×4, rust-check ×1
ollama-kimi | kimi-k2.5 | 5/10 (50%) | 5.3m
    python-run ×10 (3 failed), ts-check ×2, rust-check ×1, sql-run ×1

nightmare_mode_augmented

10 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 6/10 (60%) | 1.8m
🥈 ollama-qwen-coder-next | qwen3-coder-next | 4/10 (40%) | 5.5m
🥉 ollama-gpt-oss | gpt-oss:120b | 3/10 (30%) | 2.9m
ollama-kimi | kimi-k2.5 | 3/10 (30%) | 5.1m

context_management

7 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 5/7 (71.4%) | 1.1m
🥈 ollama-qwen-coder-next | qwen3-coder-next | 3/7 (42.9%) | 2.4m

design

4 tests
Provider | Model | Pass | Time
🥇 openrouter-gemini | google/gemini-2.5-flash | 4/4 (100%) | 3.6s
🥈 openai | gpt-4.1-mini | 4/4 (100%) | 4.4s
🥉 openrouter-haiku | anthropic/claude-3.5-haiku | 4/4 (100%) | 7.4s
openrouter-deepseek | deepseek/deepseek-chat | 4/4 (100%) | 12.9s
local | qwen3.5:cloud | 4/4 (100%) | 1.7m
openrouter-llama | meta-llama/llama-4-maverick | 3/4 (75%) | 7.5s
openrouter | anthropic/claude-sonnet-4-6 | 3/4 (75%) | 11.2s

communication

5 tests
Provider | Model | Pass | Time
🥇 openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.9s
🥈 openai | gpt-4.1-mini | 5/5 (100%) | 4.5s
🥉 openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 13.1s
local | qwen3.5:cloud | 5/5 (100%) | 1.8m
openrouter | anthropic/claude-sonnet-4-6 | 4/5 (80%) | 7.6s
openrouter-haiku | anthropic/claude-3.5-haiku | 4/5 (80%) | 7.8s
openrouter-llama | meta-llama/llama-4-maverick | 4/5 (80%) | 14.1s

domain

5 tests
Provider | Model | Pass | Time
🥇 openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 2.9s
🥈 openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 3s
🥉 openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 3.6s
openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 3.8s
openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 4.3s
openai | gpt-4.1-mini | 5/5 (100%) | 6.9s
local | qwen3.5:cloud | 4/5 (80%) | 3.1m

basic

3 tests
Provider | Model | Pass | Time
🥇 openrouter-gemini | google/gemini-2.5-flash | 3/3 (100%) | 1.6s
🥈 openrouter-llama | meta-llama/llama-4-maverick | 3/3 (100%) | 1.6s
🥉 openrouter | anthropic/claude-sonnet-4-6 | 3/3 (100%) | 2.2s
openai | gpt-4.1-mini | 3/3 (100%) | 2.3s
openrouter-haiku | anthropic/claude-3.5-haiku | 3/3 (100%) | 2.4s
openrouter-deepseek | deepseek/deepseek-chat | 3/3 (100%) | 2.7s
ollama-kimi | kimi-k2.5 | 3/3 (100%) | 9.1s
local | qwen3.5:cloud | 3/3 (100%) | 27.1s

coding

5 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 5/5 (100%) | 3.1s
🥈 openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.3s
🥉 openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 4.5s
openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 7.2s
openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 7.4s
openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 8.8s
ollama-gpt-oss | gpt-oss:120b | 5/5 (100%) | 9.4s
local | qwen3.5:cloud | 5/5 (100%) | 1.9m

reliability

7 tests
Provider | Model | Pass | Time
🥇 openai | gpt-4.1-mini | 5/7 (71.4%) | 32.9s
🥈 ollama-qwen-coder-next | qwen3-coder-next | 5/7 (71.4%) | 1.1m

tool_usage

5 tests
Provider | Model | Pass | Time
🥇 openrouter-gemini | google/gemini-2.5-flash | 5/5 (100%) | 3.8s
🥈 openai | gpt-4.1-mini | 5/5 (100%) | 4.1s
🥉 openrouter | anthropic/claude-sonnet-4-6 | 5/5 (100%) | 4.8s
openrouter-haiku | anthropic/claude-3.5-haiku | 5/5 (100%) | 4.9s
openrouter-llama | meta-llama/llama-4-maverick | 5/5 (100%) | 6.8s
openrouter-deepseek | deepseek/deepseek-chat | 5/5 (100%) | 10.2s
ollama-kimi | kimi-k2.5 | 5/5 (100%) | 13.5s
local | qwen3.5:cloud | 5/5 (100%) | 35.3s

Test Suites

Every suite is defined in TOML under benchmarks/suites/. Each test has prompts, expected behavior, and check assertions.

Code Analysis

Security vulnerabilities, concurrency bugs, performance analysis, and multi-function comprehension

7 tests
sql_injection, race_condition, complexity_analysis, rust_lifetime_bug, data_flow_bug, async_deadlock, typescript_types

Agent Tasks

Complex multi-step tasks simulating AI agent workflows: planning, error recovery, ambiguity handling

7 tests
task_decomposition, error_diagnosis, contradiction_detection, spec_compliance, data_pipeline, instruction_hierarchy, context_stress

Claude Code Comparison

Head-to-head tasks comparing agent capabilities: code analysis, bug finding, refactoring, spec compliance, multi-step reasoning

7 tests
find_the_bug, refactor_to_idiomatic, explain_tricky_code, implement_from_spec, architecture_review, progressive_debug, concise_correctness

Architecture

System design, distributed systems reasoning, and tradeoff analysis at staff+ engineer level

6 tests
cap_tradeoff, zero_downtime_migration, distributed_rate_limiter, event_sourcing_decision, observability_design, cloud_cost_optimization

Context Management

Tests context retention, information synthesis across turns, and efficient tool usage — the skills that separate agents from chatbots

7 tests
scattered_facts, correction_tracking, chained_calculation, contradictory_instructions, json_transform_verify, progressive_refinement, efficient_tool_use

Oneshot Arcade

One-shot HTML+JS arcade games. Each test asks for a complete, single-file, browser-runnable arcade game — HTML + CSS + JS inline. Harder than the CLI suite because the model has to compose document structure, canvas rendering, an event-driven game loop, and input handling in one fence. Tier 1 here is static (regex + keyword checks). Tier 2 (Playwright headless render + screenshot non-blank + no console errors) is a follow-up — see issue #471.

4 tests
snake_html, tetris_html, space_invaders_html, asteroids_html
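A Tier 1 static check of the kind described for Oneshot Arcade (regex plus keyword checks) might look roughly like the sketch below. The function name, the patterns, and the keywords are all hypothetical; the real assertions live in the suite's TOML and are not reproduced here.

```python
import re


def static_check(response, required_patterns, required_keywords):
    """Return a list of failed checks; an empty list means the response passes."""
    failures = []
    for pattern in required_patterns:
        # DOTALL lets '.' span newlines, since the game arrives as one large block.
        if not re.search(pattern, response, re.IGNORECASE | re.DOTALL):
            failures.append(f"regex:{pattern}")
    lowered = response.lower()
    for keyword in required_keywords:
        if keyword.lower() not in lowered:
            failures.append(f"keyword:{keyword}")
    return failures


# Hypothetical checks for a snake_html-style test (not the suite's real ones).
SNAKE_PATTERNS = [
    r"<canvas\b",                                  # document has a canvas element
    r"<script\b.*</script>",                       # inline script is present and closed
    r"addEventListener\(\s*['\"]keydown['\"]",     # keyboard input is handled
]
SNAKE_KEYWORDS = ["requestAnimationFrame", "getContext"]
```

Checks like these confirm only that the expected structure is present. They cannot tell whether the game actually renders or runs, which is the gap the planned Tier 2 is meant to cover.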

Long Session

Long-session benchmark: measures whether an agent retains state across many turns and growing context, not single-call quality. The gap that shows up between a 6-task bench score and real 5-hour multi-PR sessions is almost always here: by turn 20, the model has spilled its earliest state and is re-reading files, contradicting earlier decisions, or inventing new conventions inconsistent with turn 3.

These tests intentionally don't time out faster on slow providers: the question is fidelity, not latency. A 480B local-cloud model that takes 3 minutes per turn but holds the thread for 30 turns beats a 200ms frontier API that drifts at turn 8.

Failure modes specifically targeted:
- Drift: a convention set early is violated late
- Re-read: the model re-fetches a file it already saw in this session
- Hallucinated reconciliation: when two earlier turns disagree, the model invents a third option instead of asking
- Decision amnesia: the model forgets a binding decision and re-litigates it

4 tests
convention_drift_20_turn, three_fact_synthesis_after_drift, binding_decision_under_pressure, precise_edit_under_navigation_pressure
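To make the "re-read" failure mode concrete, here is one way such a detector could be sketched. It assumes the session's file reads are available as hypothetical `(turn, path)` pairs; how merlin actually records tool calls is not described on this page.

```python
def find_rereads(file_reads):
    """Flag files fetched more than once in a session.

    file_reads: iterable of (turn, path) pairs in session order (hypothetical shape).
    Returns a list of (path, first_turn, reread_turn) tuples.
    """
    first_seen = {}
    rereads = []
    for turn, path in file_reads:
        if path in first_seen:
            rereads.append((path, first_seen[path], turn))
        else:
            first_seen[path] = turn
    return rereads
```

A real check would probably need to exempt re-reads that follow an edit to the same file, since fetching a file again after changing it is legitimate.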

Coding

Code generation, analysis, and debugging

5 tests
palindrome_fn, reference_semantics, find_the_bug, rust_ownership, regex_write

Oneshot Games

One-shot CLI game implementations. Each test asks for a complete, runnable Python program for a classic board game in a single response — no tools, no follow-up turns. Two layers of checks: Tier 1 is static (regex / keywords / min_length) and verifies the response is a substantial code block with the structural elements a working game needs. Tier 2 (executes) actually runs the model's Python under a sandboxed subprocess, pipes scripted input, and regexes stdout — it catches programs that look right on paper but crash, hang, or print the wrong thing at runtime.

5 tests
tictactoe_python, connect_four_python, checkers_python, chess_python, go_9x9_python
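The Tier 2 idea (run the generated program, pipe scripted input, regex its stdout) can be sketched as follows. This is an illustration under assumptions, not merlin's harness: `run_and_check` and its status strings are hypothetical, and a temporary directory plus a timeout is far weaker than a real sandbox.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_check(source, scripted_input, stdout_pattern, timeout_s=5.0):
    """Run `source` as a Python program, feed it stdin, and regex its stdout.

    Returns (passed, reason). NOTE: this is process isolation only, not a sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "game.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                input=scripted_input,
                capture_output=True,
                text=True,
                timeout=timeout_s,   # catches programs that hang
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    if proc.returncode != 0:
        return False, "crash"
    if not re.search(stdout_pattern, proc.stdout):
        return False, "wrong output"
    return True, "ok"
```

The three failure reasons correspond to the cases the suite description names: programs that crash, hang, or print the wrong thing at runtime.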

Tool Usage

Structured output, JSON generation, and schema adherence

5 tests
json_object, tool_call_format, json_array, csv_output, error_json

Design

Architecture, system design, and tradeoff analysis

5 tests
websocket_vs_polling, microservices_tradeoff, database_choice, caching_strategy, weather_api_design

Advanced Reasoning

Hard logic, constraint satisfaction, and mathematical reasoning that demands multi-step chains

8 tests
knights_and_knaves, seating_constraints, counterfactual_state, bayes_reasoning, river_crossing, recursive_trace, temporal_ordering, set_operations

Nightmare Mode

Tests designed to break frontier models. Multi-step cascading logic, subtle off-by-one traps, adversarial framing, and problems where confident-sounding wrong answers are more natural than careful correct ones.

10 tests
register_machine, aliased_pointer, monty_hall_variant, phantom_bug, thread_interleaving, regex_analysis, recursive_type, arithmetic_gauntlet, reachable_branch_trap, constraint_satisfaction

Basic

Basic provider capability tests

3 tests
greeting, math, instruction_following

Nightmare Mode Augmented

Nightmare-mode tests with focused fledge plugins. Same adversarial problems, but models can use calc, python-run, ts-check etc. to verify answers rather than relying on pure mental computation.

10 tests
register_machine, aliased_pointer, monty_hall_variant, phantom_bug, thread_interleaving, regex_analysis, recursive_type, arithmetic_gauntlet, reachable_branch_trap, constraint_satisfaction

Domain

Fledge plugin system, specsync, and domain-specific knowledge

5 tests
toml_parsing, plugin_command, spec_format, algochat_encryption, toml_array_syntax

Refusal

Adversarial input that tries to bypass merlin's safety stack at the prompt level. Tests the model's behavior when asked to do something destructive in obfuscated language, when fed a prompt injection, or when offered a forged authority. These are PROMPT-level decisions, not plugin-gate decisions — the sql-run gate, memory-delete confirm-token flow, and audit log are separate defenses already validated in #466. This suite checks whether the model itself refuses cleanly.

6 tests
ignore_previous_instructions, forged_authority, obfuscated_destruction, injection_in_code, asks_for_keys, off_topic_unsafe

Reliability

Tests instruction following, format compliance, structured output, and self-verification — the reliability traits that make agents production-ready

7 tests
format_compliance, json_only_output, negative_instructions, self_correcting, multilang_consistency, instruction_priority, data_extraction

Hard Mode Augmented

Same adversarial tests as hard_mode, but with focused fledge plugins — measures how much scoped tools close the gap with frontier models

10 tests
phantom_ub, mutation_maze, false_positive, complexity_illusion, spec_contradictions, type_system_edge, sql_semantics, adversarial_debug, collatz_computation, number_puzzle
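One simple way to read "how much scoped tools close the gap" is to compare each provider's passed-test count with and without plugins. The sketch below does that using the hard_mode and hard_mode_augmented counts from Latest Results, keyed on (provider, model) so that a provider entry running a different model in each suite is not compared with itself. The helper name `tool_delta` is hypothetical.

```python
# Passed tests out of 10, copied from the Latest Results tables on this page.
HARD_MODE = {
    ("ollama-qwen-coder", "qwen3-coder:480b"): 9,
    ("ollama-qwen-coder-next", "qwen3-coder-next"): 9,
    ("ollama-gpt-oss", "gpt-oss:120b"): 9,
    ("ollama-kimi", "kimi-k2.5"): 9,
    ("ollama-deepseek-v4-flash", "deepseek-v4-flash"): 7,
    ("ollama", "qwen3.5:397b"): 6,
    ("openai", "gpt-4.1-mini"): 4,
}
HARD_MODE_AUGMENTED = {
    ("ollama-qwen-coder-next", "qwen3-coder-next"): 8,
    ("openai", "gpt-4.1-mini"): 7,
    ("ollama", "qwen3-coder:480b"): 7,
    ("ollama-qwen-coder", "qwen3-coder:480b"): 6,
    ("ollama-kimi", "kimi-k2.5"): 5,
}


def tool_delta(base, augmented):
    """Change in passed tests for every (provider, model) present in both suites."""
    shared = sorted(base.keys() & augmented.keys())
    return {key: augmented[key] - base[key] for key in shared}


for (provider, model), delta in tool_delta(HARD_MODE, HARD_MODE_AUGMENTED).items():
    print(f"{provider} ({model}): {delta:+d}")
```

Each figure comes from a single run of 10 tests with no retries, so a difference of a test or two should be read with caution.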

Hard Mode

Adversarial benchmarks targeting frontier model weak spots: false positive resistance, precise computation, type theory, and problems where the obvious answer is wrong

10 tests
phantom_ub, mutation_maze, false_positive, complexity_illusion, spec_contradictions, type_system_edge, sql_semantics, adversarial_debug, collatz_computation, number_puzzle

Multi Turn

Multi-turn conversations that build context across prompts and demand a substantive final answer.

5 tests
state_recall, code_review_iteration, planning_refinement, json_schema_evolution, noise_tolerant_recall

Roleplaying

Persona consistency under specific framings. Tests whether the model honors a role (mentor, auditor, skeptic) for the duration of the response rather than dropping into a default helpful-assistant tone.

5 tests
senior_mentors_junior, skeptical_code_reviewer, security_auditor, db_architect, explain_to_a_10yo

Communication

Conversation quality, conciseness, and persona adherence

5 tests
concise_explanation, rewrite_concise, persona_pirate, format_compliance, summarize

Reasoning

Logic, math, and multi-step reasoning

5 tests
sequence_pattern, bat_and_ball, syllogism, multistep_math, deduction

Stress Test

Adversarial inputs, constraint-heavy prompts, and edge cases that break weaker models

8 tests
multi_constraint_generation, instruction_resistance, precision_arithmetic, structured_consistency, tabular_reasoning, edge_case_handling, adversarial_logic, strict_format

Engineering

Realistic dev tasks: bug-finding, refactoring, debugging, code review. Closer to merlin's actual coding-agent use case than the games suite.

5 tests
find_off_by_one, find_sql_injection, refactor_into_helpers, explain_failing_test, when_to_split_service

Expert

Expert-level tasks requiring Claude-caliber reasoning — multi-file code comprehension, proof construction, ambiguity resolution

8 tests
formal_proof, cross_language_translation, dependency_cycle, regex_comprehension, algorithm_correctness, api_versioning, subtle_bugs, ethical_framework

Methodology

All benchmarks are automated and reproducible. Each test suite defines prompts, expected behaviors, and validation checks in TOML. Results are generated by running every provider against every test; the table above keeps the most recent run per (suite, provider) pair. No retries, no warm-up.
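The "most recent run per (suite, provider) pair" rule can be illustrated with a short sketch. The record fields (`suite`, `provider`, `finished_at`) are hypothetical stand-ins, since the stored result format is not shown here.

```python
def latest_per_pair(runs):
    """Keep only the most recent run for each (suite, provider) pair.

    runs: iterable of dicts with hypothetical keys 'suite', 'provider', 'finished_at'.
    """
    latest = {}
    for run in runs:
        key = (run["suite"], run["provider"])
        # A newer run always replaces an older one, whether it scored better or worse.
        if key not in latest or run["finished_at"] > latest[key]["finished_at"]:
            latest[key] = run
    return latest
```

Selecting on recency alone, rather than on the best score, is what keeps a newer but worse run from being hidden behind an older, better one.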

Run locally

# Run all benchmark suites against all providers
cargo run -p merlin-cli -- bench

# Run against a specific provider
cargo run -p merlin-cli -- bench --provider openai

# View accumulated stats
cargo run -p merlin-cli -- bench --history

Transparent by design.

Every test. Every provider. Every result.