Benchmarks

Merlin includes a built-in benchmark system that evaluates provider quality and latency across 14 test suites with 82 tests. Results accumulate over time, giving you historical data to make informed provider choices.

Running Benchmarks

merlin bench                         # run all suites, all providers
merlin bench --suite reasoning       # specific suite
merlin bench --provider openai       # specific provider
merlin bench --history               # accumulated historical stats

Results are additive: each run adds to the history, which is stored in .fledge/benchmarks/ as timestamped JSON files.
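To make "additive" concrete, here is a minimal Python sketch of how per-provider totals could be accumulated from those history files. The JSON schema used here (a `provider` string and a `results` list of objects with a boolean `passed`) is an assumption for illustration only, as are the helper names `load_runs` and `summarize`. Inspect a real file before relying on any of it; `merlin bench --history` remains the supported way to view accumulated stats.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_runs(directory=".fledge/benchmarks"):
    """Load every JSON file in the history directory."""
    # Sorting by file name assumes timestamped names sort chronologically.
    paths = sorted(Path(directory).glob("*.json"))
    return [json.loads(path.read_text()) for path in paths]


def summarize(runs):
    """Return {provider: (passed, total)} accumulated across all runs."""
    totals = defaultdict(lambda: [0, 0])
    for run in runs:
        # ASSUMED schema: {"provider": str, "results": [{"passed": bool}, ...]}
        for result in run["results"]:
            totals[run["provider"]][0] += int(result["passed"])
            totals[run["provider"]][1] += 1
    return {name: tuple(counts) for name, counts in totals.items()}
```

Under these assumptions, `summarize(load_runs())` would give one pass count and one test count per provider, and every new run simply adds to both.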

Provider Scorecard

Results from 250+ benchmark runs across all configured providers:

Provider	Model	Hard Suites Score	Avg Latency
`ollama-qwen-coder-next`	Qwen3 Coder Next	93%	40s
`ollama-devstral`	Devstral 2 123B	86%	8s
`openai`	GPT-4.1 Mini	84%	3.5s
`ollama`	Qwen 3.5 397B	90%*	94s
`ollama-qwen-coder`	Qwen3 Coder 480B	84%*	35s
`ollama-kimi`	Kimi K2.5	75%*	120s
`ollama-gemma4`	Gemma 4 31B	46%*	60s

* = affected by Ollama Cloud timeouts (524 errors)

These results reflect raw model capability on structured tasks, not end-to-end agent performance. See Interpreting Results for what the numbers mean and what they don’t.

Per-Suite Breakdown

Each suite tests a different capability. Here’s how providers perform across the hard suites:

Provider	Adv Reasoning	Code Analysis	Agent Tasks	Stress	Expert	Architecture
Qwen3 Coder Next	83%	100%	95%	83%	88%*	100%
Devstral 2	79%	95%	100%	79%	96%	67%*
GPT-4.1 Mini	73%	95%	95%	73%	—	—
Qwen 3.5	100%	100%	95%	75%*	75%*	—
Qwen3 Coder	83%	100%	—	83%*	71%*	—
Kimi K2.5	100%	86%*	38%*	67%*	75%*	83%*
Gemma 4	88%	86%*	0%*	8%*	96%	—

* = affected by Ollama Cloud timeouts

The per-suite breakdown reveals patterns invisible in aggregate scores. A provider scoring 95% overall might be failing every tool-usage test while acing everything else, which matters if your workflow is tool-heavy.
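The arithmetic behind that point can be sketched in a few lines of Python. The suite sizes are the test counts listed under Test Suites; the outcomes are invented for illustration (a hypothetical provider that fails all five `tool_usage` tests and passes everything else), and the function names are not part of Merlin.

```python
# Test counts per suite, taken from the Test Suites table.
SUITE_SIZES = {
    "basic": 3, "reasoning": 5, "coding": 5, "design": 5,
    "tool_usage": 5, "domain": 5, "communication": 5, "multi_turn": 5,
    "advanced_reasoning": 8, "code_analysis": 7, "agent_tasks": 7,
    "stress_test": 8, "expert": 8, "architecture": 6,
}


def aggregate(per_suite):
    """Overall pass rate from {suite: (passed, total)}."""
    passed = sum(p for p, _ in per_suite.values())
    total = sum(t for _, t in per_suite.values())
    return passed / total


def weak_suites(per_suite, threshold=0.5):
    """Names of suites whose pass rate falls below the threshold."""
    return [name for name, (p, t) in per_suite.items() if p / t < threshold]


# Hypothetical outcome: every tool_usage test fails, all others pass.
example = {
    name: (0 if name == "tool_usage" else size, size)
    for name, size in SUITE_SIZES.items()
}

print(f"overall: {aggregate(example):.0%}")   # 77 of 82 tests
print("weak suites:", weak_suites(example))
```

Here 77 of 82 tests pass, so the aggregate still rounds to 94% even though one suite sits at 0%. Only the per-suite view exposes the failure.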

Choosing a Provider

See the dedicated Choosing a Provider guide for practical recommendations based on your use case: speed, accuracy, cost, or self-hosting.

Test Suites

Suite	Tests	What it measures	Details
`basic`	3	Instruction following, format compliance	Greeting, math, list formatting
`reasoning`	5	Logic, math, deduction	Sequences, syllogisms, word problems
`coding`	5	Code generation, bug detection	Palindromes, ownership, regex, debugging
`design`	5	Architecture, trade-offs	WebSocket vs polling, caching strategies
`tool_usage`	5	Structured output, JSON	JSON objects, tool call format, CSV
`domain`	5	Domain knowledge	TOML, fledge commands, spec format
`communication`	5	Clarity, conciseness	Summarization, rewriting, persona
`multi_turn`	5	Multi-turn context, recall	State tracking, iterative refinement
`advanced_reasoning`	8	Hard logic, constraints	Knights & knaves, Bayes, river crossing
`code_analysis`	7	Security, concurrency, types	SQL injection, race conditions, Rust lifetimes
`agent_tasks`	7	Planning, diagnosis, compliance	Task decomposition, error diagnosis, spec compliance
`stress_test`	8	Adversarial, precision	Instruction resistance, arithmetic, tabular reasoning
`expert`	8	Proofs, translation, algorithms	Formal proofs, cross-language, dependency cycles
`architecture`	6	Distributed systems	CAP theorem, zero-downtime migration, rate limiters

Writing Custom Suites

Suites are TOML files in benchmarks/suites/. Each test specifies a prompt, expected behavior, and validation checks:

[[test]]
name = "fibonacci"
prompt = "Write a Rust function that returns the nth Fibonacci number"
system = "Write only code. No explanation."
max_tokens = 256
temperature = 0.0

[[test.checks]]
type = "contains"
value = "fn "

[[test.checks]]
type = "contains"
value = "fibonacci"

[[test.checks]]
type = "not_empty"

Run your custom suite:

merlin bench --suite my_custom_suite

See Methodology for the full list of check types and how scoring works.