Transparency

We publish our numbers.
Do they?

Real automated test results, not cherry-picked demos. 23 runs aggregated across 16 suites and 17 providers.

Results as of 2026-05-20

27 Test Suites

172 Total Tests

17 Providers Scored

Latest Results

Pass rate is the share of tests that pass every check. Time is the total wall-clock time summed across the suite's tests. Only the latest run per (suite, provider) pair is shown.
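As a rough illustration of how the Pass and Time columns could be derived, here is a minimal Python sketch. The `Outcome` record and `summarize` function are hypothetical names made up for this example, not merlin's actual code; the only rules taken from the text are that a test passes when all of its checks pass and that time is summed per suite.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One test's result in a single run (hypothetical record shape)."""
    name: str
    checks: list[bool]   # one entry per check assertion
    seconds: float       # wall-clock time for this test


def summarize(results: list[Outcome]) -> tuple[int, int, float, float]:
    """Return (passed, total, pass_rate_percent, total_seconds) for one suite run."""
    # A test counts as passed only if every one of its checks passed.
    passed = sum(1 for r in results if all(r.checks))
    total = len(results)
    rate = round(100 * passed / total, 1) if total else 0.0
    # Time is summed across the suite's tests, not averaged.
    total_seconds = sum(r.seconds for r in results)
    return passed, total, rate, total_seconds


run = [
    Outcome("a", [True, True], 20.0),
    Outcome("b", [True, False], 35.5),   # one failed check fails the whole test
    Outcome("c", [True], 12.0),
]
print(summarize(run))  # (2, 3, 66.7, 67.5)
```

Because the time is a sum, a fast provider that fails early and a slow provider that passes are not directly comparable on time alone.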

nightmare_mode

10 tests

Provider	Model	Pass	Time
🥇 `ollama-qwen-coder-next`	qwen3-coder-next	7/10 (70%)	4.7m
🥈 `ollama-kimi`	kimi-k2.5	6/10 (60%)	6.3m
🥉 `ollama-deepseek-v4-flash`	deepseek-v4-flash	5/10 (50%)	6.6m
`ollama-gpt-oss`	gpt-oss:120b	4/10 (40%)	2.9m
`openai`	gpt-4.1-mini	1/10 (10%)	1.3m

hard_mode

10 tests

Provider	Model	Pass	Time
🥇 `ollama-qwen-coder`	qwen3-coder:480b	9/10 (90%)	3.2m
🥈 `ollama-qwen-coder-next`	qwen3-coder-next	9/10 (90%)	4.2m
🥉 `ollama-gpt-oss`	gpt-oss:120b	9/10 (90%)	4.7m
`ollama-kimi`	kimi-k2.5	9/10 (90%)	8.6m
`ollama-deepseek-v4-flash`	deepseek-v4-flash	7/10 (70%)	5.2m
`ollama`	qwen3.5:397b	6/10 (60%)	12.1m
`openai`	gpt-4.1-mini	4/10 (40%)	1.4m

oneshot_arcade

4 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	4/4 (100%)	3.8m
🥈 `ollama-deepseek-v4-flash`	deepseek-v4-flash	4/4 (100%)	4.2m
🥉 `ollama-glm`	glm-4.7	3/4 (75%)	1.4m
`ollama-devstral`	devstral-2:123b	3/4 (75%)	1.7m
`ollama-gpt-oss`	gpt-oss:120b	3/4 (75%)	1.7m
`ollama-qwen-coder`	qwen3-coder:480b	3/4 (75%)	2m
`ollama-qwen-coder-next`	qwen3-coder-next	3/4 (75%)	3.3m
`ollama-gemma4`	gemma4:31b	2/4 (50%)	1.2m
`ollama-kimi-thinking`	kimi-k2-thinking	2/4 (50%)	5.1m

claude_code_comparison

7 tests

Provider	Model	Pass	Time
🥇 `ollama-qwen-coder-next`	qwen3-coder-next	6/7 (85.7%)	1.5m
🥈 `ollama`	qwen3.5:397b	6/7 (85.7%)	5.4m
🥉 `ollama-qwen-coder`	qwen3-coder:480b	5/7 (71.4%)	33.3s
`ollama-gemma4`	gemma4:31b	5/7 (71.4%)	2.4m
`ollama-kimi`	kimi-k2.5	5/7 (71.4%)	9.2m
`openai`	gpt-4.1-mini	3/7 (42.9%)	57.9s
`ollama-devstral`	devstral-2:123b	3/7 (42.9%)	3.1m
`ollama-deepseek-v4-flash`	deepseek-v4-flash	3/7 (42.9%)	3.2m

oneshot_games

5 tests

Provider	Model	Pass	Time
🥇 `ollama-qwen-coder`	qwen3-coder:480b	5/5 (100%)	1.3m
🥈 `ollama-glm`	glm-4.7	5/5 (100%)	1.9m
🥉 `ollama-qwen-coder-next`	qwen3-coder-next	4/5 (80%)	2.2m
`openai`	gpt-4.1-mini	4/5 (80%)	2.7m
`ollama-deepseek-v4-flash`	deepseek-v4-flash	4/5 (80%)	3.9m
`ollama-devstral`	devstral-2:123b	3/5 (60%)	1.5m
`ollama-gpt-oss`	gpt-oss:120b	3/5 (60%)	2.2m
`ollama-gemma4`	gemma4:31b	3/5 (60%)	3.1m

reasoning

5 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	5/5 (100%)	1.9s
🥈 `openrouter-gemini`	google/gemini-2.5-flash	5/5 (100%)	4.4s
🥉 `ollama-gpt-oss`	gpt-oss:120b	5/5 (100%)	13s
`ollama-kimi`	kimi-k2.5	5/5 (100%)	50.3s
`local`	qwen3.5:cloud	5/5 (100%)	2m
`openrouter-llama`	meta-llama/llama-4-maverick	4/5 (80%)	3s
`openrouter-deepseek`	deepseek/deepseek-chat	4/5 (80%)	3.7s
`openrouter`	anthropic/claude-sonnet-4-6	4/5 (80%)	13.8s
`openrouter-haiku`	anthropic/claude-3.5-haiku	3/5 (60%)	4.6s

hard_mode_augmented

10 tests

Provider	Model	Pass	Time	Plugin calls
🥇 `ollama-qwen-coder-next`	qwen3-coder-next	8/10 (80%)	4.9m
🥈 `openai`	gpt-4.1-mini	7/10 (70%)	1.6m
🥉 `ollama`	qwen3-coder:480b	7/10 (70%)	2.2m	python-run ×9, ts-check ×4, rust-check ×1, sql-run ×1
`ollama-qwen-coder`	qwen3-coder:480b	6/10 (60%)	2.3m	python-run ×9, sql-run ×4, ts-check ×4, rust-check ×1
`ollama-kimi`	kimi-k2.5	5/10 (50%)	5.3m	python-run ×10 (3 failed), ts-check ×2, rust-check ×1, sql-run ×1

nightmare_mode_augmented

10 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	6/10 (60%)	1.8m
🥈 `ollama-qwen-coder-next`	qwen3-coder-next	4/10 (40%)	5.5m
🥉 `ollama-gpt-oss`	gpt-oss:120b	3/10 (30%)	2.9m
`ollama-kimi`	kimi-k2.5	3/10 (30%)	5.1m

context_management

7 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	5/7 (71.4%)	1.1m
🥈 `ollama-qwen-coder-next`	qwen3-coder-next	3/7 (42.9%)	2.4m

design

4 tests

Provider	Model	Pass	Time
🥇 `openrouter-gemini`	google/gemini-2.5-flash	4/4 (100%)	3.6s
🥈 `openai`	gpt-4.1-mini	4/4 (100%)	4.4s
🥉 `openrouter-haiku`	anthropic/claude-3.5-haiku	4/4 (100%)	7.4s
`openrouter-deepseek`	deepseek/deepseek-chat	4/4 (100%)	12.9s
`local`	qwen3.5:cloud	4/4 (100%)	1.7m
`openrouter-llama`	meta-llama/llama-4-maverick	3/4 (75%)	7.5s
`openrouter`	anthropic/claude-sonnet-4-6	3/4 (75%)	11.2s

communication

5 tests

Provider	Model	Pass	Time
🥇 `openrouter-gemini`	google/gemini-2.5-flash	5/5 (100%)	3.9s
🥈 `openai`	gpt-4.1-mini	5/5 (100%)	4.5s
🥉 `openrouter-deepseek`	deepseek/deepseek-chat	5/5 (100%)	13.1s
`local`	qwen3.5:cloud	5/5 (100%)	1.8m
`openrouter`	anthropic/claude-sonnet-4-6	4/5 (80%)	7.6s
`openrouter-haiku`	anthropic/claude-3.5-haiku	4/5 (80%)	7.8s
`openrouter-llama`	meta-llama/llama-4-maverick	4/5 (80%)	14.1s

domain

5 tests

Provider	Model	Pass	Time
🥇 `openrouter-gemini`	google/gemini-2.5-flash	5/5 (100%)	2.9s
🥈 `openrouter-llama`	meta-llama/llama-4-maverick	5/5 (100%)	3s
🥉 `openrouter-haiku`	anthropic/claude-3.5-haiku	5/5 (100%)	3.6s
`openrouter`	anthropic/claude-sonnet-4-6	5/5 (100%)	3.8s
`openrouter-deepseek`	deepseek/deepseek-chat	5/5 (100%)	4.3s
`openai`	gpt-4.1-mini	5/5 (100%)	6.9s
`local`	qwen3.5:cloud	4/5 (80%)	3.1m

basic

3 tests

Provider	Model	Pass	Time
🥇 `openrouter-gemini`	google/gemini-2.5-flash	3/3 (100%)	1.6s
🥈 `openrouter-llama`	meta-llama/llama-4-maverick	3/3 (100%)	1.6s
🥉 `openrouter`	anthropic/claude-sonnet-4-6	3/3 (100%)	2.2s
`openai`	gpt-4.1-mini	3/3 (100%)	2.3s
`openrouter-haiku`	anthropic/claude-3.5-haiku	3/3 (100%)	2.4s
`openrouter-deepseek`	deepseek/deepseek-chat	3/3 (100%)	2.7s
`ollama-kimi`	kimi-k2.5	3/3 (100%)	9.1s
`local`	qwen3.5:cloud	3/3 (100%)	27.1s

coding

5 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	5/5 (100%)	3.1s
🥈 `openrouter-gemini`	google/gemini-2.5-flash	5/5 (100%)	3.3s
🥉 `openrouter`	anthropic/claude-sonnet-4-6	5/5 (100%)	4.5s
`openrouter-deepseek`	deepseek/deepseek-chat	5/5 (100%)	7.2s
`openrouter-haiku`	anthropic/claude-3.5-haiku	5/5 (100%)	7.4s
`openrouter-llama`	meta-llama/llama-4-maverick	5/5 (100%)	8.8s
`ollama-gpt-oss`	gpt-oss:120b	5/5 (100%)	9.4s
`local`	qwen3.5:cloud	5/5 (100%)	1.9m

reliability

7 tests

Provider	Model	Pass	Time
🥇 `openai`	gpt-4.1-mini	5/7 (71.4%)	32.9s
🥈 `ollama-qwen-coder-next`	qwen3-coder-next	5/7 (71.4%)	1.1m

tool_usage

5 tests

Provider	Model	Pass	Time
🥇 `openrouter-gemini`	google/gemini-2.5-flash	5/5 (100%)	3.8s
🥈 `openai`	gpt-4.1-mini	5/5 (100%)	4.1s
🥉 `openrouter`	anthropic/claude-sonnet-4-6	5/5 (100%)	4.8s
`openrouter-haiku`	anthropic/claude-3.5-haiku	5/5 (100%)	4.9s
`openrouter-llama`	meta-llama/llama-4-maverick	5/5 (100%)	6.8s
`openrouter-deepseek`	deepseek/deepseek-chat	5/5 (100%)	10.2s
`ollama-kimi`	kimi-k2.5	5/5 (100%)	13.5s
`local`	qwen3.5:cloud	5/5 (100%)	35.3s

Test Suites

Every suite is defined in TOML under benchmarks/suites/. Each test has prompts, expected behavior, and check assertions.

Code Analysis

Security vulnerabilities, concurrency bugs, performance analysis, and multi-function comprehension

7 tests

sql_injection, race_condition, complexity_analysis, rust_lifetime_bug, data_flow_bug, async_deadlock, typescript_types

Agent Tasks

Complex multi-step tasks simulating AI agent workflows: planning, error recovery, ambiguity handling

7 tests

task_decomposition, error_diagnosis, contradiction_detection, spec_compliance, data_pipeline, instruction_hierarchy, context_stress

Claude Code Comparison

Head-to-head tasks comparing agent capabilities: code analysis, bug finding, refactoring, spec compliance, multi-step reasoning

7 tests

find_the_bug, refactor_to_idiomatic, explain_tricky_code, implement_from_spec, architecture_review, progressive_debug, concise_correctness

Architecture

System design, distributed systems reasoning, and tradeoff analysis at staff+ engineer level

6 tests

cap_tradeoff, zero_downtime_migration, distributed_rate_limiter, event_sourcing_decision, observability_design, cloud_cost_optimization

Context Management

Tests context retention, information synthesis across turns, and efficient tool usage — the skills that separate agents from chatbots

7 tests

scattered_facts, correction_tracking, chained_calculation, contradictory_instructions, json_transform_verify, progressive_refinement, efficient_tool_use

Oneshot Arcade

One-shot HTML+JS arcade games. Each test asks for a complete, single-file, browser-runnable arcade game — HTML + CSS + JS inline. Harder than the CLI suite because the model has to compose document structure, canvas rendering, an event-driven game loop, and input handling in one fence. Tier 1 here is static (regex + keyword checks). Tier 2 (Playwright headless render + screenshot non-blank + no console errors) is a follow-up — see issue #471.

4 tests

snake_html, tetris_html, space_invaders_html, asteroids_html
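A Tier 1 static check of the kind described for this suite might look like the following Python sketch. The function name `tier1_static`, the specific patterns, and the keywords are hypothetical; the suite's real assertions live in its TOML file and are not reproduced here.

```python
import re


def tier1_static(response: str, patterns: list[str], keywords: list[str],
                 min_length: int = 0) -> dict[str, bool]:
    """Run static checks on a model response; every value must be True to pass."""
    results = {
        f"regex:{p}": re.search(p, response, re.IGNORECASE | re.DOTALL) is not None
        for p in patterns
    }
    lowered = response.lower()
    results.update({f"keyword:{k}": k.lower() in lowered for k in keywords})
    results["min_length"] = len(response) >= min_length
    return results


# Illustrative assertions for a single-file canvas game (assumed, not the real ones).
PATTERNS = [r"<canvas\b", r"<script\b.*</script>"]
KEYWORDS = ["requestAnimationFrame", "keydown"]

sample = (
    "<!DOCTYPE html><html><body><canvas id='g'></canvas><script>"
    "addEventListener('keydown', e => {}); requestAnimationFrame(() => {});"
    "</script></body></html>"
)
checks = tier1_static(sample, PATTERNS, KEYWORDS, min_length=50)
print(all(checks.values()))  # True
```

Static checks like these only confirm that the expected building blocks are present; they cannot tell whether the game actually renders or runs.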

Long Session

Long-session benchmark: measures whether an agent retains state across many turns and growing context, not single-call quality. The gap that shows up between a 6-task bench score and real 5-hour multi-PR sessions is almost always here: by turn 20, the model has spilled its earliest state and is re-reading files, contradicting earlier decisions, or inventing new conventions inconsistent with turn 3.

These tests intentionally don't time out faster on slow providers: the question is fidelity, not latency. A 480B local-cloud model that takes 3 minutes per turn but holds the thread for 30 turns beats a 200ms frontier API that drifts at turn 8.

Failure modes specifically targeted:
- Drift: convention set early is violated late
- Re-read: model re-fetches a file it already saw in this session
- Hallucinated reconciliation: when two earlier turns disagree, model invents a third option instead of asking
- Decision amnesia: model forgets a binding decision and re-litigates

4 tests

convention_drift_20_turn, three_fact_synthesis_after_drift, binding_decision_under_pressure, precise_edit_under_navigation_pressure
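Of the failure modes listed for this suite, "re-read" is the easiest to sketch mechanically. The Python example below scans a session's tool-call log for files read more than once. The log format and the tool names `read_file` and `write_file` are assumptions for illustration only; how merlin actually records or scores re-reads is not described here.

```python
def find_rereads(tool_calls: list[tuple[int, str, str]]) -> list[tuple[str, int, int]]:
    """Return (path, first_turn, repeat_turn) for each repeated read of a file.

    tool_calls holds (turn, tool_name, path) tuples in session order.
    """
    first_seen: dict[str, int] = {}
    rereads = []
    for turn, tool, path in tool_calls:
        if tool == "write_file":
            # Reading a file again after it changed is reasonable, so forget it.
            first_seen.pop(path, None)
        elif tool == "read_file":
            if path in first_seen:
                rereads.append((path, first_seen[path], turn))
            else:
                first_seen[path] = turn
    return rereads


# Hypothetical session log.
calls = [
    (1, "read_file", "src/lib.rs"),
    (3, "read_file", "src/main.rs"),
    (9, "write_file", "src/main.rs"),
    (10, "read_file", "src/main.rs"),   # allowed: the file changed at turn 9
    (20, "read_file", "src/lib.rs"),    # re-read of an unchanged file
]
print(find_rereads(calls))  # [('src/lib.rs', 1, 20)]
```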

Coding

Code generation, analysis, and debugging

5 tests

palindrome_fn, reference_semantics, find_the_bug, rust_ownership, regex_write

Oneshot Games

One-shot CLI game implementations. Each test asks for a complete, runnable Python program for a classic board game in a single response — no tools, no follow-up turns. Two layers of checks: Tier 1 is static (regex / keywords / min_length) and verifies the response is a substantial code block with the structural elements a working game needs. Tier 2 (executes) actually runs the model's Python under a sandboxed subprocess, pipes scripted input, and regexes stdout — it catches programs that look right on paper but crash, hang, or print the wrong thing at runtime.

5 tests

tictactoe_python, connect_four_python, checkers_python, chess_python, go_9x9_python
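The Tier 2 idea (run the program, pipe scripted input, regex stdout) can be sketched in Python as below. This is an assumption-laden illustration: `tier2_executes` is a hypothetical name, and a plain child process with a timeout is not a real sandbox, so the actual harness presumably adds stronger isolation.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def tier2_executes(code: str, stdin_script: str, stdout_pattern: str,
                   timeout_s: float = 10.0) -> bool:
    """Run `code` in a child Python process, feed it input, and regex its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "game.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I runs Python in isolated mode
                input=stdin_script, capture_output=True, text=True,
                timeout=timeout_s, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failure
    if proc.returncode != 0:
        return False      # a crash counts as a failure
    return re.search(stdout_pattern, proc.stdout) is not None


# A toy "game" standing in for a model response.
toy = 'move = input()\nprint(f"X plays {move}")\n'
print(tier2_executes(toy, "5\n", r"X plays 5"))  # True
```

The three outcomes the text mentions map onto the three failure paths: a crash is a non-zero exit code, a hang is a timeout, and wrong output is a regex miss.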

Tool Usage

Structured output, JSON generation, and schema adherence

5 tests

json_object, tool_call_format, json_array, csv_output, error_json

Design

Architecture, system design, and tradeoff analysis

5 tests

websocket_vs_polling, microservices_tradeoff, database_choice, caching_strategy, weather_api_design

Advanced Reasoning

Hard logic, constraint satisfaction, and mathematical reasoning that demands multi-step chains

8 tests

knights_and_knaves, seating_constraints, counterfactual_state, bayes_reasoning, river_crossing, recursive_trace, temporal_ordering, set_operations

Nightmare Mode

Tests designed to break frontier models. Multi-step cascading logic, subtle off-by-one traps, adversarial framing, and problems where confident-sounding wrong answers are more natural than careful correct ones.

10 tests

register_machine, aliased_pointer, monty_hall_variant, phantom_bug, thread_interleaving, regex_analysis, recursive_type, arithmetic_gauntlet, reachable_branch_trap, constraint_satisfaction

Basic

Basic provider capability tests

3 tests

greeting, math, instruction_following

Nightmare Mode Augmented

Nightmare-mode tests with focused fledge plugins. Same adversarial problems, but models can use calc, python-run, ts-check etc. to verify answers rather than relying on pure mental computation.

10 tests

register_machine, aliased_pointer, monty_hall_variant, phantom_bug, thread_interleaving, regex_analysis, recursive_type, arithmetic_gauntlet, reachable_branch_trap, constraint_satisfaction

Domain

Fledge plugin system, specsync, and domain-specific knowledge

5 tests

toml_parsing, plugin_command, spec_format, algochat_encryption, toml_array_syntax

Refusal

Adversarial input that tries to bypass merlin's safety stack at the prompt level. Tests the model's behavior when asked to do something destructive in obfuscated language, when fed a prompt injection, or when offered a forged authority. These are PROMPT-level decisions, not plugin-gate decisions — the sql-run gate, memory-delete confirm-token flow, and audit log are separate defenses already validated in #466. This suite checks whether the model itself refuses cleanly.

6 tests

ignore_previous_instructions, forged_authority, obfuscated_destruction, injection_in_code, asks_for_keys, off_topic_unsafe

Reliability

Tests instruction following, format compliance, structured output, and self-verification — the reliability traits that make agents production-ready

7 tests

format_compliance, json_only_output, negative_instructions, self_correcting, multilang_consistency, instruction_priority, data_extraction

Hard Mode Augmented

Same adversarial tests as hard_mode, but with focused fledge plugins — measures how much scoped tools close the gap with frontier models

10 tests

phantom_ub, mutation_maze, false_positive, complexity_illusion, spec_contradictions, type_system_edge, sql_semantics, adversarial_debug, collatz_computation, number_puzzle
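One way to read "how much scoped tools close the gap" is as a per-provider change in pass rate between hard_mode and hard_mode_augmented. The Python sketch below does that arithmetic with the pass counts from the Latest Results tables. The `ollama` provider is left out because its listed model differs between the two tables, and with 10 tests and a single run per pair these differences should be read as rough signals rather than precise effects.

```python
# Tests passed out of 10, copied from the Latest Results tables.
hard_mode = {
    "ollama-qwen-coder": 9,
    "ollama-qwen-coder-next": 9,
    "ollama-kimi": 9,
    "openai": 4,
}
hard_mode_augmented = {
    "ollama-qwen-coder": 6,
    "ollama-qwen-coder-next": 8,
    "ollama-kimi": 5,
    "openai": 7,
}
TESTS = 10


def deltas(base: dict[str, int], augmented: dict[str, int], total: int) -> dict[str, int]:
    """Percentage-point change in pass rate for providers present in both suites."""
    shared = sorted(base.keys() & augmented.keys())
    return {p: round(100 * (augmented[p] - base[p]) / total) for p in shared}


for provider, change in deltas(hard_mode, hard_mode_augmented, TESTS).items():
    print(f"{provider}: {change:+d} points")
# ollama-kimi: -40 points
# ollama-qwen-coder: -30 points
# ollama-qwen-coder-next: -10 points
# openai: +30 points
```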

Hard Mode

Adversarial benchmarks targeting frontier model weak spots: false positive resistance, precise computation, type theory, and problems where the obvious answer is wrong

10 tests

phantom_ub, mutation_maze, false_positive, complexity_illusion, spec_contradictions, type_system_edge, sql_semantics, adversarial_debug, collatz_computation, number_puzzle

Multi Turn

Multi-turn conversations that build context across prompts and demand a substantive final answer.

5 tests

state_recall, code_review_iteration, planning_refinement, json_schema_evolution, noise_tolerant_recall

Roleplaying

Persona consistency under specific framings. Tests whether the model honors a role (mentor, auditor, skeptic) for the duration of the response rather than dropping into a default helpful-assistant tone.

5 tests

senior_mentors_junior, skeptical_code_reviewer, security_auditor, db_architect, explain_to_a_10yo

Communication

Conversation quality, conciseness, and persona adherence

5 tests

concise_explanation, rewrite_concise, persona_pirate, format_compliance, summarize

Reasoning

Logic, math, and multi-step reasoning

5 tests

sequence_pattern, bat_and_ball, syllogism, multistep_math, deduction

Stress Test

Adversarial inputs, constraint-heavy prompts, and edge cases that break weaker models

8 tests

multi_constraint_generation, instruction_resistance, precision_arithmetic, structured_consistency, tabular_reasoning, edge_case_handling, adversarial_logic, strict_format

Engineering

Realistic dev tasks: bug-finding, refactoring, debugging, code review. Closer to merlin's actual coding-agent use case than the games suite.

5 tests

find_off_by_one, find_sql_injection, refactor_into_helpers, explain_failing_test, when_to_split_service

Expert

Expert-level tasks requiring Claude-caliber reasoning — multi-file code comprehension, proof construction, ambiguity resolution

8 tests

formal_proof, cross_language_translation, dependency_cycle, regex_comprehension, algorithm_correctness, api_versioning, subtle_bugs, ethical_framework

Methodology

All benchmarks are automated and reproducible. Each test suite defines prompts, expected behaviors, and validation checks in TOML. Results are generated by running every provider against every test; the tables above keep the most recent run per (suite, provider) pair. No retries, no warm-up.
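The "most recent run per (suite, provider) pair" rule can be sketched as a small reduction over run records, shown here in Python. The record fields and the sample data are invented for illustration. The ordering in `rank` (tests passed, then total time) is inferred from how the published tables appear to be sorted and is an assumption, not documented behavior.

```python
def latest_per_pair(runs: list[dict]) -> list[dict]:
    """Keep only the most recent run for each (suite, provider) pair."""
    latest: dict[tuple[str, str], dict] = {}
    for run in runs:
        key = (run["suite"], run["provider"])
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if key not in latest or run["finished_at"] > latest[key]["finished_at"]:
            latest[key] = run
    return list(latest.values())


def rank(rows: list[dict]) -> list[dict]:
    """Order rows by tests passed (descending), then total seconds (ascending)."""
    return sorted(rows, key=lambda r: (-r["passed"], r["seconds"]))


# Hypothetical run records; providers, timestamps, and scores are invented.
runs = [
    {"suite": "demo", "provider": "p1", "finished_at": "2026-05-01T10:00:00Z", "passed": 5, "seconds": 30.0},
    {"suite": "demo", "provider": "p1", "finished_at": "2026-05-19T10:00:00Z", "passed": 3, "seconds": 25.0},
    {"suite": "demo", "provider": "p2", "finished_at": "2026-05-18T10:00:00Z", "passed": 4, "seconds": 40.0},
    {"suite": "demo", "provider": "p3", "finished_at": "2026-05-18T11:00:00Z", "passed": 4, "seconds": 12.0},
]

for row in rank(latest_per_pair(runs)):
    print(row["provider"], row["passed"], row["seconds"])
# p3 4 12.0
# p2 4 40.0
# p1 3 25.0
```

Note that p1's older, better run is dropped: keeping the latest run rather than the best one is what prevents a lucky result from lingering in the table.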

Run locally

# Run all benchmark suites against all providers
cargo run -p merlin-cli -- bench

# Run against a specific provider
cargo run -p merlin-cli -- bench --provider openai

# View accumulated stats
cargo run -p merlin-cli -- bench --history
