Merlin enters beta

For the last few months Merlin lived behind an invite. We used that internal alpha to prove the hard parts on real work. The agent loop holds up. The safety gates hold up. The desktop app launches, installs its own CLI, and keeps your keys where they belong. So we’re moving to the next stage.

Merlin is now in beta.

If you’ve never seen it before, this post is the whole pitch from the start. No prior context assumed.

What Merlin is

Merlin is a local-first AI coding agent. It runs on your machine, against your repo, using your own provider keys. There’s a desktop app for the day to day, and a real CLI underneath it for scripts, automation, and CI.

That’s the one-line version. The rest of this post is why it exists.

It reads your specs first

Most coding agents start from a prompt and guess the rest. Merlin starts from your module specs. Before it writes a line, it pulls the relevant spec into the system prompt: invariants, public API, and error cases, treated as hard constraints rather than suggestions.

The effect is fewer hallucinated APIs and fewer “that’s not how this module works” corrections. The spec is the source of truth, and Merlin treats it that way.
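To make the idea concrete, here is a minimal sketch in Python of turning a spec into the top of a system prompt. It is an illustration only: the `ModuleSpec` shape, the `build_system_prompt` function, and the prompt wording are all assumptions made for this example, not Merlin’s actual schema or code.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleSpec:
    """Hypothetical spec shape, invented for this example."""
    name: str
    invariants: list[str] = field(default_factory=list)
    public_api: list[str] = field(default_factory=list)
    error_cases: list[str] = field(default_factory=list)


def build_system_prompt(spec: ModuleSpec) -> str:
    """Render a spec as hard constraints at the top of a system prompt."""
    def section(title: str, items: list[str]) -> str:
        body = "\n".join(f"- {item}" for item in items) or "- (none listed)"
        return f"{title}:\n{body}"

    header = (
        f"You are editing module `{spec.name}`. "
        "Treat the following as hard constraints, not suggestions."
    )
    return "\n\n".join([
        header,
        section("Invariants", spec.invariants),
        section("Public API", spec.public_api),
        section("Error cases", spec.error_cases),
    ])


spec = ModuleSpec(
    name="billing",
    invariants=["Totals are never negative"],
    public_api=["charge(amount_cents: int) -> Receipt"],
    error_cases=["charge raises ValueError when amount_cents <= 0"],
)
prompt = build_system_prompt(spec)
print(prompt)
```

The point of the shape is that the model sees the invariants, the API surface, and the failure modes before it sees the task, so a call to a function that is not in the spec stands out as a violation rather than a plausible guess.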

Your machine, your keys

Merlin is local-first on purpose. The agent runs on your hardware. Your code never has to leave it to reach some opaque backend. Provider keys are bring-your-own, and they’re stored through your operating system’s keychain (macOS Keychain, the Secret Service API on Linux, Windows Credential Manager) instead of being scattered through plaintext project files. Anything Merlin saves locally, such as its working memory, is encrypted at rest under a device key in that same keychain.

You stay in control of which model sees your code, and what it costs, because you’re talking to the provider directly.

Any model, one interface

Merlin speaks to more than 30 providers behind a single flag: Anthropic, OpenAI (including the gpt-5 line, o3, o4-mini, 4o), five OpenRouter vendors on one key, Groq, Together, and the full Ollama Cloud lineup (Qwen3-Coder, Kimi, GLM, MiniMax, DeepSeek, Devstral, Gemma, and more). Swap providers mid-session with /model. Run a cheap local model for the boring parts and a frontier model for the hard ones, with the same loop, the same tools, and the same project context.

We also publish our benchmarks: 26 suites, 168 tests, spanning basic, long-session, adversarial, tool-augmented, refusal, and expert-level work, updated every release. We’d rather show the numbers than make claims about them.

Every tool is a plugin you can swap

Merlin’s tools cover filesystem, code search, shell, git, spec-sync, runtime checks for several languages, vision, voice, and the Discord and Telegram bridges. Each one is a small binary that speaks a simple JSON-lines protocol (fledge-v1). That means every tool call and every result is inspectable, and you can add your own tool in any language without touching Merlin’s core. If it can read stdin and write stdout, it can be a Merlin tool.
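A tool built on a JSON-lines protocol can be very small. The sketch below shows the general pattern in Python: one JSON object per line in, one JSON object per line out. The message fields (`id`, `tool`, `args`, `ok`, `result`, `error`) and the `word_count` tool are hypothetical, chosen for illustration; they are not the fledge-v1 message format, which this post does not spell out.

```python
import io
import json


def handle_request(request) -> dict:
    """Answer one tool call. The 'word_count' tool and all field names are made up."""
    if not isinstance(request, dict):
        return {"id": None, "ok": False, "error": "request must be a JSON object"}
    if request.get("tool") != "word_count":
        return {"id": request.get("id"), "ok": False, "error": "unknown tool"}
    text = request.get("args", {}).get("text", "")
    return {"id": request.get("id"), "ok": True, "result": {"words": len(text.split())}}


def serve(inp, out) -> None:
    """Read one JSON request per line from inp, write one JSON response per line to out."""
    for line in inp:
        line = line.strip()
        if not line:
            continue
        try:
            response = handle_request(json.loads(line))
        except json.JSONDecodeError as exc:
            response = {"id": None, "ok": False, "error": f"bad json: {exc.msg}"}
        out.write(json.dumps(response) + "\n")
        out.flush()  # flush per line so the caller is never left waiting on a buffer


# Demo with in-memory streams; a real tool would call serve(sys.stdin, sys.stdout).
demo_in = io.StringIO(
    '{"id": 1, "tool": "word_count", "args": {"text": "read stdin write stdout"}}\n'
)
demo_out = io.StringIO()
serve(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because every request and response is a single line of JSON, you can log the traffic, replay it, or pipe it through standard command-line tools to see exactly what a tool was asked and what it answered.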

Sub-agents that keep the main context clean

When a task fans out, like summarizing forty files or auditing each module, Merlin can hand each piece to a sub-agent: a child process that runs its own full loop and returns a compact summary. The parent sees the conclusion (a few hundred tokens), not the entire transcript. The honest pitch isn’t that it’s cheaper. It’s that each child gives one subtask its full attention, and the parent’s working memory stays small no matter how wide the work gets.
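The fan-out pattern can be sketched in a few lines of Python. This is an assumption-laden illustration, not Merlin’s implementation: `run_subagent` is a stub standing in for a child that would run a full agent loop, and threads stand in for the child processes described above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Stand-in for a child agent; a real child would run its own loop in its own process."""
    transcript = [f"step {i} of {task}" for i in range(200)]  # large, never leaves the child
    return {"task": task, "summary": f"{task}: done in {len(transcript)} steps"}


def fan_out(tasks: list[str], max_children: int = 4) -> list[str]:
    """Run one child per task and keep only each child's compact summary."""
    with ThreadPoolExecutor(max_workers=max_children) as pool:
        results = list(pool.map(run_subagent, tasks))  # map preserves task order
    return [result["summary"] for result in results]


parent_context = fan_out([f"audit module_{n}" for n in range(3)])
print(parent_context)
```

Each child builds a 200-line transcript, but the parent ends up holding three short strings. That is the property that matters: the parent’s context grows with the number of subtasks, not with the amount of work each one took.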

Built to survive long work

Short tasks were always easy. Long ones used to be brittle: context fills, focus drifts, one bad step poisons the rest. We spent a real engineering arc on this. Merlin condenses context as it grows, checkpoints progress, resumes cleanly, and rolls back when a step goes wrong. Long-running work is now something Merlin is designed for, not something it survives by luck.
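The checkpoint-and-rollback part of that is easy to picture with a toy example. The Python sketch below is a simplification under stated assumptions: the state is a plain dictionary, a checkpoint is a deep copy, and `StepFailed` and `run_with_checkpoints` are names invented here. It does not cover context condensing or resuming, and it is not how Merlin stores its checkpoints.

```python
import copy


class StepFailed(Exception):
    """Raised by a step that detects it has gone wrong."""


def run_with_checkpoints(state: dict, steps: list) -> tuple[dict, list[str]]:
    """Apply steps in order; snapshot before each one and restore the snapshot on failure."""
    state = copy.deepcopy(state)  # never mutate the caller's copy
    log = []
    for step in steps:
        checkpoint = copy.deepcopy(state)
        try:
            step(state)
            log.append(f"ok: {step.__name__}")
        except StepFailed as exc:
            state = checkpoint  # discard whatever the failed step changed
            log.append(f"rolled back: {step.__name__} ({exc})")
    return state, log


def add_parser(state):
    state["files"].append("parser.py")


def break_build(state):
    state["files"].clear()  # partial damage before the failure is noticed
    raise StepFailed("tests failed")


def add_tests(state):
    state["files"].append("test_parser.py")


final, log = run_with_checkpoints({"files": []}, [add_parser, break_build, add_tests])
print(final)
print(log)
```

The bad step wipes the file list before it fails, but the next step starts from the last good snapshot, so one bad step does not poison the rest of the run.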

Safety we actually red-teamed

Agents that can run shell commands and delete files need guardrails that hold. Merlin gates destructive operations, and we didn’t take the first version on faith. We red-teamed it until we’d found and closed every deletion path we could surface, then ran an adversarial sweep to confirm the chain holds. Every blocked, previewed, or confirmed destructive action is written to an audit log you can read. There’s a full write-up of that hunt if you want the details.
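For a feel of what a gate plus an audit log looks like, here is a deliberately tiny Python sketch. Everything in it is an assumption for illustration: the three-program deny list, the `gate` and `record` functions, and the log entry shape. A program-name check like this is nowhere near sufficient on its own, since deletion can hide behind other commands, which is exactly why a gate needs red-teaming. The previewed case is left out to keep the example short.

```python
import json
import shlex

# Illustrative only: a real gate has to cover far more than three program names.
DESTRUCTIVE_PROGRAMS = {"rm", "rmdir", "shred"}


def gate(command: str, confirmed: bool = False) -> dict:
    """Classify a shell command as allowed, blocked, or confirmed."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unparseable input (such as an unclosed quote) is refused rather than guessed at.
        return {"command": command, "decision": "blocked"}
    program = argv[0] if argv else ""
    if program not in DESTRUCTIVE_PROGRAMS:
        decision = "allowed"
    elif confirmed:
        decision = "confirmed"
    else:
        decision = "blocked"
    return {"command": command, "decision": decision}


def record(entry: dict, audit_log: list[str]) -> None:
    """Append every non-allowed decision to the audit log as one JSON line."""
    if entry["decision"] != "allowed":
        audit_log.append(json.dumps(entry, sort_keys=True))


audit_log: list[str] = []
for cmd, ok in [("ls -la", False), ("rm -rf build", False), ("rm -rf build", True)]:
    record(gate(cmd, confirmed=ok), audit_log)
print("\n".join(audit_log))
```

The harmless `ls` leaves no trace, while both the blocked and the confirmed `rm` produce a log line. Logging the confirmed case matters as much as the blocked one: it is the record of what was actually allowed to happen.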

The desktop app is the product now

Early Merlin assumed a cloned repo and a hand-edited config. The beta doesn’t. The macOS app opens into onboarding, shows you provider and key readiness, installs the merlin command for you, surfaces project context, and gives you a chat workspace with the tool activity in its own panel. The CLI is still right there underneath, for agents, scripts, fledge lanes, and CI, but you no longer need to know where the runtime lives to use it.

What beta means here

Plainly: Merlin is good enough to do real work, and we’re widening the door. It isn’t 1.0.

Apple Silicon macOS comes first. Linux follows, and Windows is on the roadmap.

Things will still change. We pin the machine-readable contracts (JSON and NDJSON output, exit codes, the protocol version) with tests and a documented breaking-change policy, so integrations stay stable even while the surface around them evolves.

It’s honest software. When verification is skipped, Merlin says so. When a task is cancelled, you get partial results flagged as cancelled, not a fake success. We’d rather under-claim.
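If you build on the machine-readable output, you can pin the same contract from your side. The Python sketch below shows one way to check an NDJSON stream against the fields an integration depends on. The field names (`protocol`, `type`, `status`, `partial`), the `"v1"` value, and the sample events are hypothetical, invented for this example; they are not Merlin’s documented output format.

```python
import json

REQUIRED_KEYS = {"protocol", "type"}  # hypothetical field names


def contract_violations(ndjson_text: str, expected_protocol: str) -> list[str]:
    """Return a list of contract problems; an empty list means the stream conforms."""
    problems = []
    for number, line in enumerate(ndjson_text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(event, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - event.keys()
        if missing:
            problems.append(f"line {number}: missing {sorted(missing)}")
        elif event["protocol"] != expected_protocol:
            problems.append(f"line {number}: protocol {event['protocol']!r}")
    return problems


sample = "\n".join([
    '{"protocol": "v1", "type": "tool_call", "name": "search"}',
    '{"protocol": "v1", "type": "result", "status": "cancelled", "partial": true}',
])
print(contract_violations(sample, "v1"))
```

Note that the cancelled result in the sample passes the check. A cancelled run that reports itself as cancelled is conforming output; the contract is about the shape staying stable, not about every run succeeding.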

What’s next

We’re heading toward 1.0: broader platform coverage, more polish on the desktop experience, and the distribution work that makes installing Merlin a non-event. We’ll keep publishing benchmarks and keep writing these posts as things land.

If you want to follow along or get into the beta, find us at CorvidLabs. Thanks for reading, and for trying it.