Benchmarks
Merlin includes a built-in benchmark system that evaluates provider quality and latency across 14 test suites with 82 tests. Results accumulate over time, giving you historical data to make informed provider choices.
Running Benchmarks
```bash
merlin bench                     # run all suites, all providers
merlin bench --suite reasoning   # specific suite
merlin bench --provider openai   # specific provider
merlin bench --history           # accumulated historical stats
```
Results are additive: each run accumulates into the history stored in
.fledge/benchmarks/ as timestamped JSON files.
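Because the history is plain JSON on disk, it can be inspected without going through `merlin bench --history`. The following is a minimal sketch, not part of Merlin: `load_history` is a hypothetical helper, and it assumes the result files end in `.json` and that their timestamped names sort chronologically. The schema inside each file is not documented here, so the sketch only loads the files and reports their top-level type.

```python
import json
from pathlib import Path


def load_history(history_dir: str = ".fledge/benchmarks") -> list:
    """Return (filename, parsed JSON) pairs for every result file.

    Assumption: files use a .json extension and their timestamped names
    sort chronologically when sorted as plain strings.
    """
    directory = Path(history_dir)
    if not directory.is_dir():
        return []
    runs = []
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            runs.append((path.name, json.load(handle)))
    return runs


if __name__ == "__main__":
    # Print one line per stored run without assuming any field names.
    for name, payload in load_history():
        print(f"{name}: top-level JSON {type(payload).__name__}")
```

Sorting by filename keeps the sketch independent of whatever fields the files contain; if the names turn out not to sort chronologically, sort on file modification time instead.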
Provider Scorecard
Results from 250+ benchmark runs across all configured providers:
| Provider | Model | Hard Suites | Avg Latency |
|---|---|---|---|
| ollama-qwen-coder-next | Qwen3 Coder Next | 93% | 40s |
| ollama-devstral | Devstral 2 123B | 86% | 8s |
| openai | GPT-4.1 Mini | 84% | 3.5s |
| ollama | Qwen 3.5 397B | 90%* | 94s |
| ollama-qwen-coder | Qwen3 Coder 480B | 84%* | 35s |
| ollama-kimi | Kimi K2.5 | 75%* | 120s |
| ollama-gemma4 | Gemma 4 31B | 46%* | 60s |
*= affected by Ollama Cloud timeouts (524 errors)
These results reflect raw model capability on structured tasks, not end-to-end agent performance. See Interpreting Results for what the numbers mean and what they don’t.
Per-Suite Breakdown
Each suite tests a different capability. Here’s how providers perform across the hard suites:
| Provider | Adv Reasoning | Code Analysis | Agent Tasks | Stress | Expert | Architecture |
|---|---|---|---|---|---|---|
| Qwen3 Coder Next | 83% | 100% | 95% | 83% | 88%* | 100% |
| Devstral 2 | 79% | 95% | 100% | 79% | 96% | 67%* |
| GPT-4.1 Mini | 73% | 95% | 95% | 73% | — | — |
| Qwen 3.5 | 100% | 100% | 95% | 75%* | 75%* | — |
| Qwen3 Coder | 83% | 100% | — | 83%* | 71%* | — |
| Kimi K2.5 | 100% | 86%* | 38%* | 67%* | 75%* | 83%* |
| Gemma 4 | 88% | 86%* | 0%* | 8%* | 96% | — |
*= affected by Ollama Cloud timeouts
The per-suite breakdown reveals patterns that aggregate scores hide. A provider scoring 95% overall might be failing every tool-usage test while acing everything else, which matters if your workflow is tool-heavy.
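A short calculation makes the effect concrete. The sketch below uses the per-suite test counts from the Test Suites table and a hypothetical provider that passes everything except the `tool_usage` suite. It assumes every test carries equal weight in the overall score, which may not match how Merlin actually scores runs (see Methodology).

```python
# Test counts per suite, copied from the Test Suites table on this page.
SUITE_SIZES = {
    "basic": 3, "reasoning": 5, "coding": 5, "design": 5, "tool_usage": 5,
    "domain": 5, "communication": 5, "multi_turn": 5, "advanced_reasoning": 8,
    "code_analysis": 7, "agent_tasks": 7, "stress_test": 8, "expert": 8,
    "architecture": 6,
}


def pass_rates(passed: dict) -> tuple:
    """Return (overall pass rate, per-suite pass rates).

    Assumption: each test counts equally toward the overall rate.
    """
    total = sum(SUITE_SIZES.values())
    overall = sum(passed.get(suite, 0) for suite in SUITE_SIZES) / total
    per_suite = {suite: passed.get(suite, 0) / size
                 for suite, size in SUITE_SIZES.items()}
    return overall, per_suite


# Hypothetical provider: passes every test except the whole tool_usage suite.
hypothetical = dict(SUITE_SIZES)
hypothetical["tool_usage"] = 0

overall, per_suite = pass_rates(hypothetical)
print(f"overall: {overall:.1%}")                      # overall: 93.9%
print(f"tool_usage: {per_suite['tool_usage']:.0%}")   # tool_usage: 0%
```

The provider still scores above 90% overall even though one capability is completely broken, which is why the per-suite view is worth checking before committing to a provider.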
Choosing a Provider
See the dedicated Choosing a Provider guide for practical recommendations based on your use case: speed, accuracy, cost, or self-hosting.
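As a rough illustration of trading accuracy against speed, the sketch below picks the highest hard-suite score under a latency budget, using the figures from the Provider Scorecard. It is not Merlin's recommendation logic: `ScorecardRow` and `best_within_budget` are hypothetical names, and excluding timeout-affected rows by default is an assumption made here because those scores may understate the model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScorecardRow:
    provider: str
    hard_suite_score: int    # percent, from the "Hard Suites" column
    avg_latency_s: float     # seconds, from the "Avg Latency" column
    timeout_affected: bool   # True where the scorecard marks the score with *


# Values copied from the Provider Scorecard table.
SCORECARD = [
    ScorecardRow("ollama-qwen-coder-next", 93, 40, False),
    ScorecardRow("ollama-devstral", 86, 8, False),
    ScorecardRow("openai", 84, 3.5, False),
    ScorecardRow("ollama", 90, 94, True),
    ScorecardRow("ollama-qwen-coder", 84, 35, True),
    ScorecardRow("ollama-kimi", 75, 120, True),
    ScorecardRow("ollama-gemma4", 46, 60, True),
]


def best_within_budget(rows, max_latency_s: float,
                       include_timeout_affected: bool = False) -> Optional[ScorecardRow]:
    """Return the highest-scoring row whose average latency fits the budget."""
    candidates = [
        row for row in rows
        if row.avg_latency_s <= max_latency_s
        and (include_timeout_affected or not row.timeout_affected)
    ]
    # default=None covers the case where no provider fits the budget.
    return max(candidates, key=lambda row: row.hard_suite_score, default=None)


choice = best_within_budget(SCORECARD, max_latency_s=10)
print(choice.provider if choice else "no provider fits the budget")
```

With a 10-second budget this selects `ollama-devstral`; tightening the budget to 5 seconds leaves only `openai`. Cost and self-hosting constraints are not modeled here.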
Test Suites
| Suite | Tests | What it measures | Details |
|---|---|---|---|
| basic | 3 | Instruction following, format compliance | Greeting, math, list formatting |
| reasoning | 5 | Logic, math, deduction | Sequences, syllogisms, word problems |
| coding | 5 | Code generation, bug detection | Palindromes, ownership, regex, debugging |
| design | 5 | Architecture, trade-offs | WebSocket vs polling, caching strategies |
| tool_usage | 5 | Structured output, JSON | JSON objects, tool call format, CSV |
| domain | 5 | Domain knowledge | TOML, fledge commands, spec format |
| communication | 5 | Clarity, conciseness | Summarization, rewriting, persona |
| multi_turn | 5 | Multi-turn context, recall | State tracking, iterative refinement |
| advanced_reasoning | 8 | Hard logic, constraints | Knights & knaves, Bayes, river crossing |
| code_analysis | 7 | Security, concurrency, types | SQL injection, race conditions, Rust lifetimes |
| agent_tasks | 7 | Planning, diagnosis, compliance | Task decomposition, error diagnosis, spec compliance |
| stress_test | 8 | Adversarial, precision | Instruction resistance, arithmetic, tabular reasoning |
| expert | 8 | Proofs, translation, algorithms | Formal proofs, cross-language, dependency cycles |
| architecture | 6 | Distributed systems | CAP theorem, zero-downtime migration, rate limiters |
Writing Custom Suites
Suites are TOML files in benchmarks/suites/. Each test specifies a
prompt, expected behavior, and validation checks:
```toml
[[test]]
name = "fibonacci"
prompt = "Write a Rust function that returns the nth Fibonacci number"
system = "Write only code. No explanation."
max_tokens = 256
temperature = 0.0

[[test.checks]]
type = "contains"
value = "fn "

[[test.checks]]
type = "contains"
value = "fibonacci"

[[test.checks]]
type = "not_empty"
```
Run your custom suite:
```bash
merlin bench --suite my_custom_suite
```
See Methodology for the full list of check types and how scoring works.
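To build intuition for what the checks in the `fibonacci` example do, here is a minimal sketch of how `contains` and `not_empty` could be evaluated against a model's output. This is not Merlin's implementation: `run_check` and `passes` are hypothetical names, and the semantics are assumptions (case-sensitive substring match, whitespace-only output treated as empty, a test passing only when every check passes). Methodology remains the authority on the real behavior.

```python
def run_check(check: dict, output: str) -> bool:
    """Evaluate one check against the model output (assumed semantics)."""
    kind = check["type"]
    if kind == "contains":
        # Assumption: plain, case-sensitive substring match.
        return check["value"] in output
    if kind == "not_empty":
        # Assumption: whitespace-only output counts as empty.
        return bool(output.strip())
    raise ValueError(f"unsupported check type: {kind!r}")


def passes(test: dict, output: str) -> bool:
    """Assumption: a test passes only if all of its checks pass."""
    return all(run_check(check, output) for check in test["checks"])


# Mirrors the checks of the [[test]] entry named "fibonacci".
FIBONACCI_TEST = {
    "name": "fibonacci",
    "checks": [
        {"type": "contains", "value": "fn "},
        {"type": "contains", "value": "fibonacci"},
        {"type": "not_empty"},
    ],
}

rust_answer = (
    "fn fibonacci(n: u64) -> u64 { "
    "if n < 2 { n } else { fibonacci(n - 1) + fibonacci(n - 2) } }"
)
print(passes(FIBONACCI_TEST, rust_answer))        # True
print(passes(FIBONACCI_TEST, "def fib(n): ..."))  # False
```

Under these assumptions the example checks are lenient: any output containing `fn ` and the word `fibonacci` passes, whether or not the function is correct. Tighter checks are worth adding when the suite needs to discriminate between providers.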