Changelog

The full, ongoing changelog lives in CHANGELOG.md at the repository root (CorvidLabs internal). What follows is a hand-curated summary suitable for public reading.

The high-level story:

Sprint 1: Make the agent competent

Typed tool schemas replace the brittle args: string round-trip with proper JSON Schemas declared per command in plugin.toml.
Real spec-aware planning: the agent reads relevant specs at task start and injects their constraints into the system prompt.
Integration test harness with a scripted LLM provider gives the agent loop deterministic regression coverage.
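
To illustrate the scripted-provider idea, here is a minimal sketch of a provider that replays canned responses so an agent loop can be tested deterministically. The trait `LlmProvider`, the struct `ScriptedProvider`, and the toy `run_loop` function are hypothetical names chosen for this example; the actual harness may be shaped quite differently.

```rust
use std::collections::VecDeque;

/// Hypothetical provider abstraction (an assumption, not the project's real trait).
pub trait LlmProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String>;
}

/// Replays canned responses in order and records every prompt it receives.
pub struct ScriptedProvider {
    responses: VecDeque<String>,
    pub prompts_seen: Vec<String>,
}

impl ScriptedProvider {
    pub fn new<I, S>(responses: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self {
            responses: responses.into_iter().map(Into::into).collect(),
            prompts_seen: Vec::new(),
        }
    }
}

impl LlmProvider for ScriptedProvider {
    fn complete(&mut self, prompt: &str) -> Result<String, String> {
        self.prompts_seen.push(prompt.to_string());
        // Running out of script is an error, so a test fails loudly
        // if the loop makes more calls than expected.
        self.responses
            .pop_front()
            .ok_or_else(|| "script exhausted".to_string())
    }
}

/// Toy agent loop: call the provider until it replies "DONE" or the
/// step budget runs out, collecting the intermediate replies.
pub fn run_loop(
    provider: &mut dyn LlmProvider,
    task: &str,
    max_steps: usize,
) -> Result<Vec<String>, String> {
    let mut transcript = Vec::new();
    for step in 0..max_steps {
        let reply = provider.complete(&format!("{task} (step {step})"))?;
        if reply == "DONE" {
            break;
        }
        transcript.push(reply);
    }
    Ok(transcript)
}
```

A test would construct `ScriptedProvider::new(vec!["read file", "edit file", "DONE"])`, pass it to `run_loop`, and assert on both the returned transcript and `prompts_seen`. Because every response is fixed in advance, the same assertions hold on every run.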

Sprint 2: Make the agent feel alive

Streaming output: text deltas render incrementally; structured events flush cleanly.
Cancellation: Ctrl+C returns control immediately, dropping the in-flight LLM call; partial results are returned with a cancelled = true flag.
/model slash command swaps providers mid-session.
TaskResult.files_changed lists every file the task mutated.
README rewrite with the real Rust setup.
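
As a rough sketch of the cancellation behaviour described above, the example below stops work when a shared flag is set and still returns what was completed so far. `TaskResult.files_changed` and the `cancelled` flag come from the notes above; the `output` field, the `run_steps` function, and the use of an `AtomicBool` are assumptions for illustration. In a real CLI the flag would be set from a Ctrl+C handler, and the actual implementation may instead drop an async future.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Result of a task. `files_changed` and `cancelled` follow the changelog;
/// `output` is an assumed field for this sketch.
#[derive(Debug, Default, PartialEq)]
pub struct TaskResult {
    pub output: String,
    pub files_changed: Vec<String>,
    pub cancelled: bool,
}

/// Processes steps of (text, optional path of a mutated file) until the
/// steps run out or `cancel` is set. Work done before cancellation is kept.
pub fn run_steps<'a, I>(steps: I, cancel: &AtomicBool) -> TaskResult
where
    I: IntoIterator<Item = (&'a str, Option<&'a str>)>,
{
    let mut result = TaskResult::default();
    for (text, file) in steps {
        // Check the flag before committing each step so a cancelled
        // run never applies work that arrived after the signal.
        if cancel.load(Ordering::SeqCst) {
            result.cancelled = true;
            break;
        }
        result.output.push_str(text);
        if let Some(path) = file {
            // Record each mutated file once, in first-touched order.
            if !result.files_changed.iter().any(|f| f == path) {
                result.files_changed.push(path.to_string());
            }
        }
    }
    result
}
```

Checking the flag between steps keeps the partial result consistent: everything in `output` and `files_changed` reflects work that actually finished before the cancel signal was observed.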

Sprint 4: Desktop panels + hard benchmarks

6 hard benchmark suites (48 tests): advanced_reasoning, code_analysis, agent_tasks, stress_test, expert, architecture. The suites are designed to separate Claude-level models from smaller ones, and include formal proofs, constraint satisfaction, security audits, concurrency bugs, and distributed systems design.
Total: 14 suites, 82 tests covering basic through expert-level evaluation across 10+ providers.
7 Ollama Cloud models benchmarked: Devstral 2, Kimi K2.5, Qwen 3.5, Qwen3 Coder, Qwen3 Coder Next, DeepSeek V4 Flash, Gemma 4. Qwen3 Coder Next is the top scorer at 93% on hard suites.
6 new desktop panels (19 total):
- Test Runner: verify lane with per-step pass/fail tracking
- Log Viewer: buffered, filterable log viewer with severity
- Spec Viewer: browse module specs with drill-in detail
- Git: branch status, changed files, branch list
- Cost Tracker: per-session token usage + USD spend estimate
- Plugin Manager: view installed plugins, commands, dependencies

Sprint 3: Production-ready providers

Benchmark system: 7 test suites (32 tests) evaluate provider quality and latency. Results accumulate as JSON history. merlin bench --history shows the scorecard.
Secure credential storage: merlin keys manages API keys in the OS keychain (macOS Keychain, Linux secret-service, Windows Credential Manager). Resolution chain: env var → .env → keychain.
17 pre-configured providers: Anthropic, OpenAI, 5 OpenRouter variants (Sonnet, Haiku, Gemini, DeepSeek, Llama), Groq, Together, 7 Ollama Cloud models. One OPENROUTER_API_KEY covers 5 of them.
Live streaming validation: 8 integration tests hit real provider APIs to verify streaming behavior end-to-end.
Provider health checks: merlin health tests every configured provider with a real API call and reports latency/status.
Session management: --resume, --sessions, --no-session flags. Sessions are cleaned up automatically based on a configurable TTL.
Ollama temperature passthrough: the temperature parameter is now correctly forwarded to the Ollama API.
Configurable chat path: chat_path field in provider config for non-standard API endpoints.
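
The credential resolution chain above (env var → .env → keychain) can be sketched as a first-match lookup. Everything named here is an assumption for illustration: `KeySource`, `parse_dotenv`, and `resolve_key` are hypothetical, the .env parser is deliberately simplified, and the keychain is passed in as a closure because real OS keychain access needs a platform-specific library.

```rust
use std::collections::HashMap;

/// Which layer of the chain supplied the key (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KeySource {
    EnvVar,
    DotEnv,
    Keychain,
}

/// Simplified .env parser: KEY=VALUE lines, blank lines and `#` comments
/// skipped, surrounding double quotes stripped. Real .env rules are richer.
pub fn parse_dotenv(contents: &str) -> HashMap<String, String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().trim_matches('"').to_string()))
        .collect()
}

/// Tries each source in priority order and returns the first hit,
/// along with where it came from.
pub fn resolve_key(
    name: &str,
    env: impl Fn(&str) -> Option<String>,
    dotenv: &HashMap<String, String>,
    keychain: impl Fn(&str) -> Option<String>,
) -> Option<(String, KeySource)> {
    env(name)
        .map(|v| (v, KeySource::EnvVar))
        .or_else(|| dotenv.get(name).cloned().map(|v| (v, KeySource::DotEnv)))
        .or_else(|| keychain(name).map(|v| (v, KeySource::Keychain)))
}
```

In a real program the `env` argument could be `|k| std::env::var(k).ok()`. Passing the sources in as arguments keeps the precedence logic testable without touching the process environment or the OS keychain.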

Polish pass: CorvidLabs-style

agent.rs, plugin.rs, spec_loader.rs, and output.rs reorganized with // MARK: - sections and doc comments. Agent::provider_info returns a named ProviderInfo struct. Fluent setters return &mut Self. Public enums are #[non_exhaustive]. Magic numbers extracted to constants. Helpers extracted from long methods.
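
The conventions listed above might look roughly like the following. Only `Agent::provider_info` and `ProviderInfo` are named in this changelog; the fields, the `ProviderKind` enum, the setter names, and the constant values are assumptions made up for this sketch.

```rust
// MARK: - Constants

/// Hypothetical defaults, extracted so no magic numbers appear inline.
pub const DEFAULT_MAX_STEPS: usize = 20;
pub const DEFAULT_TEMPERATURE: f32 = 0.2;

// MARK: - Types

/// Marked non-exhaustive so downstream crates must include a wildcard
/// arm, which lets new variants be added without a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ProviderKind {
    Anthropic,
    OpenAi,
    Ollama,
}

/// Named struct in place of an anonymous tuple (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderInfo {
    pub kind: ProviderKind,
    pub model: String,
}

pub struct Agent {
    provider: ProviderInfo,
    max_steps: usize,
    temperature: f32,
}

// MARK: - Agent

impl Agent {
    pub fn new(kind: ProviderKind, model: &str) -> Self {
        Self {
            provider: ProviderInfo { kind, model: model.to_string() },
            max_steps: DEFAULT_MAX_STEPS,
            temperature: DEFAULT_TEMPERATURE,
        }
    }

    /// Fluent setter: returns `&mut Self` so calls can be chained.
    pub fn set_max_steps(&mut self, max_steps: usize) -> &mut Self {
        self.max_steps = max_steps;
        self
    }

    /// Fluent setter: returns `&mut Self` so calls can be chained.
    pub fn set_temperature(&mut self, temperature: f32) -> &mut Self {
        self.temperature = temperature;
        self
    }

    pub fn max_steps(&self) -> usize {
        self.max_steps
    }

    pub fn temperature(&self) -> f32 {
        self.temperature
    }

    pub fn provider_info(&self) -> ProviderInfo {
        self.provider.clone()
    }
}
```

With setters that return `&mut Self`, configuration reads as a chain on an existing value, for example `agent.set_max_steps(5).set_temperature(0.7);`, without consuming and rebuilding the agent.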

v0.1.0: 2026-05-07

Initial functional release: agent loop, three LLM providers, five internal plugins, memory and AlgoChat adapters, working CLI.

For the full per-line breakdown, see CHANGELOG.md in the repository.