Changelog

The full, ongoing changelog lives at CHANGELOG.md in the repository root (CorvidLabs internal). What follows is a hand-curated summary suitable for public reading.

The high-level story:

Sprint 1: Make the agent competent

  • Typed tool schemas replace the brittle args: string round-trip with proper JSON Schemas declared per command in plugin.toml.
  • Real spec-aware planning: the agent reads relevant specs at task start and injects their constraints into the system prompt.
  • Integration test harness with a scripted LLM provider gives the agent loop deterministic regression coverage.
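To illustrate why typed schemas are less brittle than a single args string, here is a minimal sketch of schema-checked arguments. It is a simplified stand-in, not the project's JSON Schema handling: ArgType, ArgValue, ParamSpec, and validate are hypothetical names, and nothing here reads plugin.toml.

```rust
use std::collections::BTreeMap;

/// Hypothetical argument types a command schema can declare.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArgType {
    String,
    Integer,
    Boolean,
}

/// Hypothetical runtime value supplied for one argument.
#[derive(Debug, Clone, PartialEq)]
pub enum ArgValue {
    String(String),
    Integer(i64),
    Boolean(bool),
}

impl ArgValue {
    fn arg_type(&self) -> ArgType {
        match self {
            ArgValue::String(_) => ArgType::String,
            ArgValue::Integer(_) => ArgType::Integer,
            ArgValue::Boolean(_) => ArgType::Boolean,
        }
    }
}

/// One declared parameter: its type and whether it must be present.
#[derive(Debug, Clone, Copy)]
pub struct ParamSpec {
    pub ty: ArgType,
    pub required: bool,
}

/// Checks supplied arguments against a schema and returns the first problem found.
pub fn validate<'a>(
    schema: &BTreeMap<&'a str, ParamSpec>,
    args: &BTreeMap<&'a str, ArgValue>,
) -> Result<(), String> {
    for (name, spec) in schema {
        match args.get(name) {
            Some(value) if value.arg_type() != spec.ty => {
                return Err(format!("argument `{name}` has the wrong type"));
            }
            None if spec.required => {
                return Err(format!("missing required argument `{name}`"));
            }
            _ => {}
        }
    }
    // Reject arguments the schema does not declare.
    for name in args.keys() {
        if !schema.contains_key(name) {
            return Err(format!("unknown argument `{name}`"));
        }
    }
    Ok(())
}
```

With a schema in place, a malformed tool call fails at the boundary with a specific message instead of surfacing later as a parse error inside the command.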

Sprint 2: Make the agent feel alive

  • Streaming output: text deltas render incrementally; structured events flush cleanly.
  • Cancellation: Ctrl+C returns control immediately and drops the in-flight LLM call; partial results are returned with cancelled = true.
  • /model slash command swaps providers mid-session.
  • TaskResult.files_changed lists every file the task mutated.
  • README rewrite with the real Rust setup.
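The cancellation and files_changed items above can be pictured with a small sketch. It uses a cooperative flag checked between steps, which is simpler than dropping an in-flight call and is not necessarily how the agent does it. Only files_changed and the cancelled flag come from the changelog; the output field, run_task, and the step format are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical shape of a task result. Only `files_changed` and the
/// `cancelled` flag are named in the changelog; `output` is assumed.
#[derive(Debug, Default, PartialEq)]
pub struct TaskResult {
    pub output: String,
    pub files_changed: Vec<String>,
    pub cancelled: bool,
}

/// Runs steps in order, checking the cancellation flag before each one.
/// Each step carries some output text and, optionally, a file it mutated.
pub fn run_task(steps: &[(&str, Option<&str>)], cancel: &AtomicBool) -> TaskResult {
    let mut result = TaskResult::default();
    for (text, file) in steps {
        if cancel.load(Ordering::SeqCst) {
            // Stop early but keep whatever was produced so far.
            result.cancelled = true;
            break;
        }
        result.output.push_str(text);
        if let Some(path) = file {
            result.files_changed.push(path.to_string());
        }
    }
    result
}
```

In a real CLI the flag would be set from a signal handler; the caller then inspects cancelled to decide whether the partial output is usable.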

Sprint 4: Desktop panels + hard benchmarks

  • 6 hard benchmark suites (48 tests): advanced_reasoning, code_analysis, agent_tasks, stress_test, expert, architecture. The suites are designed to separate Claude-level models from smaller ones, with tests covering formal proofs, constraint satisfaction, security audits, concurrency bugs, and distributed systems design.
  • Total: 14 suites, 82 tests covering basic through expert-level evaluation across 10+ providers.
  • 7 Ollama Cloud models benchmarked: Devstral 2, Kimi K2.5, Qwen 3.5, Qwen3 Coder, Qwen3 Coder Next, DeepSeek V4 Flash, Gemma 4. Qwen3 Coder Next is the top scorer, at 93% on the hard suites.
  • 6 new desktop panels (19 total):
    • Test Runner: verify lane with per-step pass/fail tracking
    • Log Viewer: buffered, filterable log viewer with severity
    • Spec Viewer: browse module specs with drill-in detail
    • Git: branch status, changed files, branch list
    • Cost Tracker: per-session token usage + USD spend estimate
    • Plugin Manager: view installed plugins, commands, dependencies
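As a rough illustration of the Cost Tracker's arithmetic, a per-session spend estimate can be derived from token counts and per-million-token prices. This is a sketch under assumptions: TokenUsage, Pricing, and their fields are hypothetical names, and real prices vary by provider and model.

```rust
/// Token counts accumulated over one session (hypothetical type).
#[derive(Debug, Default, Clone, Copy)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Hypothetical per-million-token prices in USD.
#[derive(Debug, Clone, Copy)]
pub struct Pricing {
    pub input_per_million: f64,
    pub output_per_million: f64,
}

impl TokenUsage {
    /// Adds the token counts from one request to the session totals.
    pub fn add(&mut self, input: u64, output: u64) {
        self.input_tokens += input;
        self.output_tokens += output;
    }

    /// Estimates spend as tokens / 1,000,000 * price, summed over both directions.
    pub fn estimated_usd(&self, pricing: &Pricing) -> f64 {
        let input = self.input_tokens as f64 / 1_000_000.0 * pricing.input_per_million;
        let output = self.output_tokens as f64 / 1_000_000.0 * pricing.output_per_million;
        input + output
    }
}
```

The result is an estimate only, since it ignores anything a provider bills outside plain input and output tokens.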

Sprint 3: Production-ready providers

  • Benchmark system: 7 test suites (32 tests) evaluate provider quality and latency. Results accumulate as JSON history. merlin bench --history shows the scorecard.
  • Secure credential storage: merlin keys manages API keys in the OS keychain (macOS Keychain, Linux secret-service, Windows Credential Manager). Resolution chain: env var → .env → keychain.
  • 17 pre-configured providers: Anthropic, OpenAI, 5 OpenRouter variants (Sonnet, Haiku, Gemini, DeepSeek, Llama), Groq, Together, 7 Ollama Cloud models. One OPENROUTER_API_KEY covers 5 of them.
  • Live streaming validation: 8 integration tests hitting real provider APIs to verify streaming behavior end-to-end.
  • Provider health checks: merlin health tests every configured provider with a real API call and reports latency/status.
  • Session management: --resume, --sessions, --no-session flags. Sessions are cleaned up automatically based on a configurable TTL.
  • Ollama temperature passthrough: temperature parameter now correctly forwarded to the Ollama API.
  • Configurable chat path: chat_path field in provider config for non-standard API endpoints.
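The credential resolution chain (env var → .env → keychain) amounts to trying a list of lookups in order and taking the first hit. The sketch below shows that idea with pluggable lookups, so the keychain is only consulted when the earlier sources miss. KeySource, resolve_key, and dotenv_lookup are hypothetical names, and the .env parser is deliberately minimal.

```rust
/// Where a key was found; mirrors the chain named in the changelog.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KeySource {
    EnvVar,
    DotEnv,
    Keychain,
}

/// Tries each lookup in order and returns the first non-empty value.
/// Later lookups are not called once an earlier one succeeds.
pub fn resolve_key(
    name: &str,
    lookups: &[(KeySource, &dyn Fn(&str) -> Option<String>)],
) -> Option<(KeySource, String)> {
    lookups.iter().find_map(|(source, lookup)| {
        lookup(name)
            .filter(|value| !value.is_empty())
            .map(|value| (*source, value))
    })
}

/// Parses `KEY=value` lines from .env-style text.
/// No quoting, comments, or `export` handling; this is a simplification.
pub fn dotenv_lookup(contents: &str, name: &str) -> Option<String> {
    contents.lines().find_map(|line| {
        let (key, value) = line.trim().split_once('=')?;
        (key.trim() == name).then(|| value.trim().to_string())
    })
}
```

A caller would pass a lookup backed by std::env::var first, then one backed by the .env file, then one backed by the OS keychain. Returning the source alongside the value makes it easy to report where a key came from.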

Polish pass: CorvidLabs-style

agent.rs, plugin.rs, spec_loader.rs, and output.rs reorganized with // MARK: - sections and doc comments. Agent::provider_info returns a named ProviderInfo struct. Fluent setters return &mut Self. Public enums are #[non_exhaustive]. Magic numbers extracted to constants. Helpers extracted from long methods.
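The conventions listed above can be shown together in a compact sketch. Only the names Agent, provider_info, and ProviderInfo come from the changelog; the fields, the ProviderKind enum, the setters, and the constant's value are assumptions made for illustration.

```rust
/// A former magic number pulled out as a named constant (value is illustrative).
pub const DEFAULT_TEMPERATURE: f32 = 0.7;

/// `#[non_exhaustive]` forces downstream crates to keep a wildcard arm when
/// matching, so adding a variant later is not a breaking change.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ProviderKind {
    Anthropic,
    OpenAi,
    Ollama,
}

/// Named struct returned in place of an anonymous tuple (fields assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderInfo {
    pub kind: ProviderKind,
    pub model: String,
}

pub struct Agent {
    kind: ProviderKind,
    model: String,
    temperature: f32,
}

// MARK: - Construction

impl Agent {
    pub fn new(kind: ProviderKind, model: &str) -> Self {
        Self { kind, model: model.to_string(), temperature: DEFAULT_TEMPERATURE }
    }
}

// MARK: - Fluent setters

impl Agent {
    /// Returns `&mut Self` so calls can be chained on an existing agent.
    pub fn set_model(&mut self, model: &str) -> &mut Self {
        self.model = model.to_string();
        self
    }

    pub fn set_temperature(&mut self, temperature: f32) -> &mut Self {
        self.temperature = temperature;
        self
    }
}

// MARK: - Accessors

impl Agent {
    pub fn provider_info(&self) -> ProviderInfo {
        ProviderInfo { kind: self.kind, model: self.model.clone() }
    }

    pub fn temperature(&self) -> f32 {
        self.temperature
    }
}
```

Returning &mut Self rather than Self lets a caller reconfigure an agent it already holds by mutable reference, for example agent.set_model("model-b").set_temperature(0.2), without moving it.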

v0.1.0: 2026-05-07

Initial functional release: agent loop, three LLM providers, five internal plugins, memory and AlgoChat adapters, working CLI.


For the full per-line breakdown, see CHANGELOG.md in the repository.