Benchmarks

Merlin includes a built-in benchmark system that evaluates provider quality and latency across 14 test suites with 82 tests. Results accumulate over time, giving you historical data to make informed provider choices.

Running Benchmarks

merlin bench                         # run all suites, all providers
merlin bench --suite reasoning       # specific suite
merlin bench --provider openai       # specific provider
merlin bench --history               # accumulated historical stats

Results are additive: each run adds to the history stored in .fledge/benchmarks/ as timestamped JSON files.
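Because the history is plain JSON on disk, it can be inspected with a short script. The following Python sketch assumes only what is stated above: a .fledge/benchmarks/ directory containing .json files. The load_runs helper is hypothetical, not part of Merlin, and the schema inside each file is not documented on this page, so each file is loaded as an opaque object. It also assumes the timestamped file names sort chronologically.

```python
import json
from pathlib import Path


def load_runs(history_dir=".fledge/benchmarks"):
    """Load every JSON file in the history directory, sorted by file name.

    Assumption: file names sort chronologically (for example ISO-style
    timestamps). File contents are returned untouched because their
    schema is not specified in this document.
    """
    runs = []
    for path in sorted(Path(history_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            runs.append((path.name, json.load(handle)))
    return runs


if __name__ == "__main__":
    runs = load_runs()
    print(f"{len(runs)} benchmark run file(s) found")
    for name, _data in runs:
        print(" ", name)
```

For the supported view of the same data, merlin bench --history remains the primary interface.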

Provider Scorecard

Results from 250+ benchmark runs across all configured providers:

| Provider | Model | Hard Suites | Avg Latency |
|---|---|---|---|
| ollama-qwen-coder-next | Qwen3 Coder Next | 93% | 40s |
| ollama-devstral | Devstral 2 123B | 86% | 8s |
| openai | GPT-4.1 Mini | 84% | 3.5s |
| ollama | Qwen 3.5 397B | 90%* | 94s |
| ollama-qwen-coder | Qwen3 Coder 480B | 84%* | 35s |
| ollama-kimi | Kimi K2.5 | 75%* | 120s |
| ollama-gemma4 | Gemma 4 31B | 46%* | 60s |

*= affected by Ollama Cloud timeouts (524 errors)

These results reflect raw model capability on structured tasks, not end-to-end agent performance. See Interpreting Results for what the numbers mean and what they don’t.
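One way to read the scorecard is to fix a latency budget and rank what remains by score. The Python sketch below does that with the figures copied from the table above. The shortlist function and its filtering rule are illustrative assumptions, not Merlin's own recommendation logic; by default it leaves out rows marked as affected by timeouts, since those scores may understate the model.

```python
# Scorecard rows copied from the table above:
# (provider, hard-suite score %, avg latency in seconds, affected by timeouts)
SCORECARD = [
    ("ollama-qwen-coder-next", 93, 40.0, False),
    ("ollama-devstral", 86, 8.0, False),
    ("openai", 84, 3.5, False),
    ("ollama", 90, 94.0, True),
    ("ollama-qwen-coder", 84, 35.0, True),
    ("ollama-kimi", 75, 120.0, True),
    ("ollama-gemma4", 46, 60.0, True),
]


def shortlist(max_latency_s, include_timeout_affected=False):
    """Return providers within the latency budget, best score first."""
    rows = [
        row for row in SCORECARD
        if row[2] <= max_latency_s
        and (include_timeout_affected or not row[3])
    ]
    # Sort by score descending, then by latency ascending to break ties.
    return sorted(rows, key=lambda row: (-row[1], row[2]))


for provider, score, latency, _flagged in shortlist(max_latency_s=10):
    print(f"{provider}: {score}% at {latency}s")
# ollama-devstral: 86% at 8.0s
# openai: 84% at 3.5s
```

With a 10-second budget only two providers qualify. Raising the budget to 40 seconds brings ollama-qwen-coder-next to the top of the list.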

Per-Suite Breakdown

Each suite tests a different capability. Here’s how providers perform across the hard suites:

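| Provider | Adv Reasoning | Code Analysis | Agent Tasks | Stress | Expert | Architecture |
|---|---|---|---|---|---|---|
| Qwen3 Coder Next | 83% | 100% | 95% | 83% | 88%* | 100% |
| Devstral 2 | 79% | 95% | 100% | 79% | 96% | 67%* |
| GPT-4.1 Mini | 73% | 95% | 95% | 73% | | |
| Qwen 3.5 | 100% | 100% | 95% | 75%* | 75%* | |
| Qwen3 Coder | 83% | 100% | 83%* | 71%* | | |
| Kimi K2.5 | 100% | 86%* | 38%* | 67%* | 75%* | 83%* |
| Gemma 4 | 88% | 86%* | 0%* | 8%* | 96% | |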

*= affected by Ollama Cloud timeouts

The per-suite breakdown reveals patterns invisible in aggregate scores. A provider scoring 95% overall might be failing every tool-usage test while acing everything else, which matters if your workflow is tool-heavy.
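A quick calculation shows how this can happen. The Python sketch below uses the per-suite test counts from the Test Suites table later on this page and assumes the overall score is simply the fraction of all tests passed, with every test weighted equally. That weighting is an assumption made for illustration; Methodology describes how scoring actually works. The provider here is hypothetical.

```python
# Test counts per suite, taken from the Test Suites table on this page.
SUITE_SIZES = {
    "basic": 3, "reasoning": 5, "coding": 5, "design": 5,
    "tool_usage": 5, "domain": 5, "communication": 5, "multi_turn": 5,
    "advanced_reasoning": 8, "code_analysis": 7, "agent_tasks": 7,
    "stress_test": 8, "expert": 8, "architecture": 6,
}


def overall_pass_rate(passed_by_suite):
    """Fraction of all tests passed, assuming every test counts equally."""
    total = sum(SUITE_SIZES.values())
    passed = sum(passed_by_suite.get(name, 0) for name in SUITE_SIZES)
    return passed / total


# Hypothetical provider: perfect everywhere except tool_usage (0 of 5 passed).
passed = dict(SUITE_SIZES, tool_usage=0)

print(f"overall: {overall_pass_rate(passed):.1%}")
# overall: 93.9%
print(f"tool_usage: {passed['tool_usage'] / SUITE_SIZES['tool_usage']:.0%}")
# tool_usage: 0%
```

Under these assumptions, a provider that fails the entire tool_usage suite still passes 77 of 82 tests, so the aggregate alone would not reveal the gap.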

Choosing a Provider

See the dedicated Choosing a Provider guide for practical recommendations based on your use case: speed, accuracy, cost, or self-hosting.

Test Suites

| Suite | Tests | What it measures | Details |
|---|---|---|---|
| basic | 3 | Instruction following, format compliance | Greeting, math, list formatting |
| reasoning | 5 | Logic, math, deduction | Sequences, syllogisms, word problems |
| coding | 5 | Code generation, bug detection | Palindromes, ownership, regex, debugging |
| design | 5 | Architecture, trade-offs | WebSocket vs polling, caching strategies |
| tool_usage | 5 | Structured output, JSON | JSON objects, tool call format, CSV |
| domain | 5 | Domain knowledge | TOML, fledge commands, spec format |
| communication | 5 | Clarity, conciseness | Summarization, rewriting, persona |
| multi_turn | 5 | Multi-turn context, recall | State tracking, iterative refinement |
| advanced_reasoning | 8 | Hard logic, constraints | Knights & knaves, Bayes, river crossing |
| code_analysis | 7 | Security, concurrency, types | SQL injection, race conditions, Rust lifetimes |
| agent_tasks | 7 | Planning, diagnosis, compliance | Task decomposition, error diagnosis, spec compliance |
| stress_test | 8 | Adversarial, precision | Instruction resistance, arithmetic, tabular reasoning |
| expert | 8 | Proofs, translation, algorithms | Formal proofs, cross-language, dependency cycles |
| architecture | 6 | Distributed systems | CAP theorem, zero-downtime migration, rate limiters |

Writing Custom Suites

Suites are TOML files in benchmarks/suites/. Each test specifies a prompt, expected behavior, and validation checks:

[[test]]
name = "fibonacci"
prompt = "Write a Rust function that returns the nth Fibonacci number"
system = "Write only code. No explanation."
max_tokens = 256
temperature = 0.0

[[test.checks]]
type = "contains"
value = "fn "

[[test.checks]]
type = "contains"
value = "fibonacci"

[[test.checks]]
type = "not_empty"

Run your custom suite:

merlin bench --suite my_custom_suite

See Methodology for the full list of check types and how scoring works.