Changelog
The full, ongoing changelog lives at CHANGELOG.md in the
repository root (CorvidLabs internal). The high-level story below
is a hand-curated summary suitable for public reading.
Sprint 1: Make the agent competent
- Typed tool schemas replace the brittle args: string round-trip with proper JSON Schemas declared per command in plugin.toml.
- Real spec-aware planning: the agent reads relevant specs at task start and injects their constraints into the system prompt.
- Integration test harness with a scripted LLM provider gives the agent loop deterministic regression coverage.
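
As a rough illustration of the scripted-provider idea, the sketch below replays canned replies in order so that a loop built on top of it behaves identically on every run. The LlmProvider trait, ScriptedProvider type, and run_agent function are hypothetical names invented for this example; the actual harness and provider trait in the codebase may look quite different.

```rust
use std::collections::VecDeque;

/// Hypothetical provider interface (an assumption, not the project's real trait).
pub trait LlmProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

/// Replays canned responses in order and records every prompt it receives.
pub struct ScriptedProvider {
    responses: VecDeque<String>,
    pub prompts_seen: Vec<String>,
}

impl ScriptedProvider {
    pub fn new<I, S>(responses: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            responses: responses.into_iter().map(Into::into).collect(),
            prompts_seen: Vec::new(),
        }
    }
}

impl LlmProvider for ScriptedProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        self.prompts_seen.push(prompt.to_string());
        // Running out of script is an error, so a test fails loudly
        // if the loop makes more calls than expected.
        self.responses
            .pop_front()
            .ok_or_else(|| "script exhausted".to_string())
    }
}

/// Toy agent loop (hypothetical): keeps calling the provider until a reply
/// starts with "DONE" or `max_turns` is reached, and returns all replies.
pub fn run_agent(
    provider: &mut dyn LlmProvider,
    task: &str,
    max_turns: usize,
) -> Result<Vec<String>, String> {
    let mut transcript = Vec::new();
    let mut prompt = task.to_string();
    for _ in 0..max_turns {
        let reply = provider.complete(&prompt)?;
        let done = reply.starts_with("DONE");
        prompt = format!("previous reply: {}", reply);
        transcript.push(reply);
        if done {
            break;
        }
    }
    Ok(transcript)
}
```

A regression test would construct ScriptedProvider::new(["call tool: read_file", "DONE: patched"]), run the loop, and assert on both the returned transcript and prompts_seen. Because no network call is involved, the result does not depend on model behavior.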
Sprint 2: Make the agent feel alive
- Streaming output: text deltas render incrementally; structured events flush cleanly.
- Cancellation: Ctrl+C returns control immediately, dropping the in-flight LLM call; partial results are returned with a cancelled = true flag.
- /model slash command swaps providers mid-session.
- TaskResult.files_changed lists every file the task mutated.
- README rewrite with the real Rust setup.
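
The cancellation behavior described above can be sketched with a shared flag that the streaming loop checks between deltas. This is a minimal sketch under stated assumptions: the TaskResult fields other than cancelled and files_changed, the run_streaming function, and the use of an AtomicBool set by a Ctrl+C handler (not shown) are all hypothetical, and the real agent may well use an async cancellation mechanism instead.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical result shape. Only `cancelled` and `files_changed` are named
/// in the notes above; `output` is an assumption for this sketch.
#[derive(Debug, Default)]
pub struct TaskResult {
    pub output: String,
    pub files_changed: Vec<String>,
    pub cancelled: bool,
}

/// Consumes text deltas until the stream ends or `cancel` is set.
/// Whatever was accumulated before cancellation is kept in the result.
pub fn run_streaming<I>(deltas: I, cancel: &AtomicBool) -> TaskResult
where
    I: IntoIterator<Item = String>,
{
    let mut result = TaskResult::default();
    for delta in deltas {
        // Assumption: a signal handler elsewhere stores `true` on Ctrl+C.
        if cancel.load(Ordering::SeqCst) {
            result.cancelled = true;
            break; // drop the rest of the stream, keep the partial output
        }
        result.output.push_str(&delta);
    }
    result
}
```

The point of the design is that cancellation produces an ordinary result with a flag rather than an error, so callers can still show or persist the partial output.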
Sprint 4: Desktop panels + hard benchmarks
- 6 hard benchmark suites (48 tests): advanced_reasoning, code_analysis, agent_tasks, stress_test, expert, architecture. Designed to separate Claude-level models from smaller ones. Tests include formal proofs, constraint satisfaction, security audits, concurrency bugs, and distributed systems design.
- Total: 14 suites, 82 tests covering basic through expert-level evaluation across 10+ providers.
- 7 Ollama Cloud models benchmarked: Devstral 2, Kimi K2.5, Qwen 3.5, Qwen3 Coder, Qwen3 Coder Next, DeepSeek V4 Flash, Gemma 4. Qwen3 Coder Next is the top scorer at 93% on hard suites.
- 6 new desktop panels (19 total):
  - Test Runner: verify lane with per-step pass/fail tracking
  - Log Viewer: buffered, filterable log viewer with severity
  - Spec Viewer: browse module specs with drill-in detail
  - Git: branch status, changed files, branch list
  - Cost Tracker: per-session token usage + USD spend estimate
  - Plugin Manager: view installed plugins, commands, dependencies
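
For the Cost Tracker panel, the arithmetic behind a per-session USD estimate might look like the sketch below. The Pricing and CostTracker types are hypothetical, and the per-million-token pricing model is an assumption; the panel's actual data model and price table are not described in this summary.

```rust
/// Hypothetical per-million-token prices in USD. Real prices vary by
/// provider and model and are not taken from this changelog.
#[derive(Debug, Clone, Copy)]
pub struct Pricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

/// Accumulates token usage over a session (illustrative type).
#[derive(Debug, Default)]
pub struct CostTracker {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

impl CostTracker {
    /// Adds the token counts reported for one LLM call.
    pub fn record(&mut self, input: u64, output: u64) {
        self.input_tokens += input;
        self.output_tokens += output;
    }

    /// Estimated spend: tokens times price per token, summed over
    /// input and output, with prices quoted per million tokens.
    pub fn estimate_usd(&self, pricing: Pricing) -> f64 {
        let input_cost = self.input_tokens as f64 * pricing.input_per_mtok;
        let output_cost = self.output_tokens as f64 * pricing.output_per_mtok;
        (input_cost + output_cost) / 1_000_000.0
    }
}
```

The result is only an estimate, since providers may bill cached or batched tokens at different rates.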
Sprint 3: Production-ready providers
- Benchmark system: 7 test suites (32 tests) evaluate provider quality and latency. Results accumulate as JSON history. merlin bench --history shows the scorecard.
- Secure credential storage: merlin keys manages API keys in the OS keychain (macOS Keychain, Linux secret-service, Windows Credential Manager). Resolution chain: env var → .env → keychain.
- 17 pre-configured providers: Anthropic, OpenAI, 5 OpenRouter variants (Sonnet, Haiku, Gemini, DeepSeek, Llama), Groq, Together, 7 Ollama Cloud models. One OPENROUTER_API_KEY covers 5 of them.
- Live streaming validation: 8 integration tests hitting real provider APIs to verify streaming behavior end-to-end.
- Provider health checks: merlin health tests every configured provider with a real API call and reports latency/status.
- Session management: --resume, --sessions, --no-session flags. Sessions are cleaned up automatically based on a configurable TTL.
- Ollama temperature passthrough: temperature parameter now correctly forwarded to the Ollama API.
- Configurable chat path: chat_path field in provider config for non-standard API endpoints.
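
The credential resolution chain (env var → .env → keychain) amounts to a first-match lookup across three sources. The sketch below shows one way to express that order; resolve_key, parse_dotenv, and KeySource are hypothetical names, the .env parsing is deliberately simplistic, and the keychain is represented by a plain closure because the real OS keychain integration is not described here.

```rust
use std::collections::HashMap;

/// Where a resolved key came from (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeySource {
    EnvVar,
    DotEnv,
    Keychain,
}

/// Simplistic KEY=VALUE parser for the contents of a .env file.
/// Skips blank lines and `#` comments and strips surrounding double quotes.
pub fn parse_dotenv(contents: &str) -> HashMap<String, String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().trim_matches('"').to_string()))
        .collect()
}

/// Returns the first match in the order env var, .env, keychain.
/// `env` and `keychain` are injected lookups (assumptions for this sketch)
/// so the chain can be exercised without touching the real environment.
pub fn resolve_key(
    name: &str,
    env: &dyn Fn(&str) -> Option<String>,
    dotenv: &HashMap<String, String>,
    keychain: &dyn Fn(&str) -> Option<String>,
) -> Option<(String, KeySource)> {
    env(name)
        .map(|v| (v, KeySource::EnvVar))
        .or_else(|| dotenv.get(name).cloned().map(|v| (v, KeySource::DotEnv)))
        .or_else(|| keychain(name).map(|v| (v, KeySource::Keychain)))
}
```

In a real binary the env lookup could be &|n: &str| std::env::var(n).ok(). Returning the source alongside the value makes it easy to report which layer supplied a key without printing the key itself.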
Polish pass: CorvidLabs-style
agent.rs, plugin.rs, spec_loader.rs, and output.rs reorganized
with // MARK: - sections and doc comments. Agent::provider_info
returns a named ProviderInfo struct. Fluent setters return &mut Self. Public enums are #[non_exhaustive]. Magic numbers extracted
to constants. Helpers extracted from long methods.
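
A compact sketch of the conventions listed above follows. Only Agent::provider_info and ProviderInfo are named in this summary; the field names, the StopReason enum, the DEFAULT_MAX_TURNS constant and its value, and the setter names are hypothetical and chosen purely for illustration.

```rust
// MARK: - Types

/// Named struct returned by `Agent::provider_info` instead of a bare tuple.
/// Field names are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProviderInfo {
    pub name: String,
    pub model: String,
}

/// Hypothetical public enum. `#[non_exhaustive]` forces downstream crates to
/// include a wildcard arm, so adding a variant later is not a breaking change.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum StopReason {
    Completed,
    Cancelled,
    MaxTurns,
}

// MARK: - Constants

/// Extracted constant replacing a magic number (the value is illustrative).
pub const DEFAULT_MAX_TURNS: usize = 20;

// MARK: - Agent

pub struct Agent {
    provider: String,
    model: String,
    max_turns: usize,
}

impl Agent {
    pub fn new(provider: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            provider: provider.into(),
            model: model.into(),
            max_turns: DEFAULT_MAX_TURNS,
        }
    }

    // MARK: - Fluent setters

    /// Returns `&mut Self` so calls can be chained on an existing agent.
    pub fn set_model(&mut self, model: impl Into<String>) -> &mut Self {
        self.model = model.into();
        self
    }

    pub fn set_max_turns(&mut self, max_turns: usize) -> &mut Self {
        self.max_turns = max_turns;
        self
    }

    // MARK: - Accessors

    pub fn provider_info(&self) -> ProviderInfo {
        ProviderInfo {
            name: self.provider.clone(),
            model: self.model.clone(),
        }
    }

    pub fn max_turns(&self) -> usize {
        self.max_turns
    }
}
```

With setters that return &mut Self, a caller can write agent.set_model("model-b").set_max_turns(5); on an agent it already owns, without moving it through a consuming builder.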
v0.1.0: 2026-05-07
Initial functional release: agent loop, three LLM providers, five internal plugins, memory and AlgoChat adapters, working CLI.
For the full per-line breakdown, see CHANGELOG.md in the
repository.