Making Merlin good at long work

For a while Merlin was great at short tasks. Ask it a focused question, hand it a bounded edit, watch it stream a clean answer and exit. Single-turn streaming, session persistence, clean exit code. Done.

Long tasks were the problem.

A “long task” here means anything where the conversation history matters across many turns: a multi-file refactor, a wide-ranging codebase exploration, a multi-hour delegation where you walk away and let the agent work. The same agent that handled a one-shot edit beautifully would silently degrade as the conversation grew. By the time you noticed, you’d be staring at a half-finished refactor with a worktree full of broken state and no obvious way back.

This is the engineering arc that fixed it.

The three problems

Three things break when an agent’s session grows:

The context window fills up. Every model has a hard ceiling — call it 100k or 200k tokens — and accumulating turn-by-turn, you’ll hit it. The naive response is to drop the oldest messages, but that’s how an agent forgets the file paths it was working with, the decisions it just made, the errors it just diagnosed. It keeps “the original task” because that gets pinned, but everything in between vanishes.

State is lost on a crash. If the agent dies mid-task — provider timeout, OOM, network blip — you’ve lost not just the conversation but any narrative thread. Replaying the raw message log is technically possible but expensive, and the model has to re-derive everything it already figured out.

There’s no safety net when the work goes wrong. An agent that confidently edits five files and then can’t make tests pass leaves you with five broken files. Maybe one of them is right and the other four are wrong; maybe all five are wrong but you can’t tell from the diff alone. Without a known-good anchor, recovery means a manual git checkout and rerunning the whole task.

What we built

The shape that emerged was four layers, each cheap on its own, all of which compound:

Context condensation

When the active session crosses an 80% threshold of the provider’s context window, the agent doesn’t drop the old messages — it asks a cheap secondary model (Ollama by default; configurable) to write a tight prose narrative of the slice it would otherwise discard. The summary lands as the second message in the rebuilt context (the original task stays pinned as the first), with a “[N earlier messages condensed]” tag so the primary model knows it’s a recap, not a verbatim transcript.

When the summarizer can’t run — no API key, network down, empty response — the agent falls back to the original “[context truncated]” marker. The agent loop never blocks on condensation.

Checkpoints

Every time the verify lane passes cleanly inside the agent loop, we write a checkpoint row to the session database: kind, timestamp, files-touched summary. Same with each summarization event. Checkpoints are cheap (a single row insert) and they accumulate.

A checkpoint is, structurally, both a resume marker and a recovery anchor. That dual role is the key insight: the same persistence does two jobs.

Resume from checkpoint

--resume used to replay every message from the session log into the new context. Now it looks for the most recent checkpoint first. If one exists, the agent rebuilds context as: pinned original task, plus the checkpoint’s narrative summary, plus the messages from after the checkpoint timestamp. The result is a resumed run that starts well under the budget instead of immediately tripping the truncation threshold.

--full-history is the escape hatch for when you genuinely want the raw replay (debugging, forking off an earlier turn). The default is the smarter path.

Auto-rollback on verify exhaustion

Opt-in: when the verify lane fails its retries, the agent walks the files it touched and runs git checkout HEAD -- <file> against the most recent verify_pass checkpoint’s tree state. Files that couldn’t be restored (newly created, untracked) are reported separately. The agent’s task summary explicitly says what was restored and what couldn’t be.

This is the payoff of the dual-role checkpoint design. Resume points and recovery anchors are the same persistence; the agent uses whichever it needs.

Two adjacent fixes that came out of the same arc

Building the long-task stack surfaced two unrelated reliability problems that had been hiding in the short-task path:

Exit codes were lying. The agent had been reporting success any time the loop completed without an error — including runs where the verify lane was explicitly skipped (--no-verify or no file changes). Bridges and cron wrappers branching on $? couldn’t tell “verify passed” from “verify wasn’t asked to run.” We added a verify_skipped field that distinguishes them, and a four-way JSON status (success / skipped / failed / cancelled) so consumers can branch on intent rather than guessing.

Shell-exec could escape its sandbox. When the agent ran with a --project flag pointing at an isolated worktree, the shell plugin would still happily execute cd /elsewhere && rm -rf . and modify files outside the worktree. The fix lives in the plugin itself: when MERLIN_PROJECT_ROOT is set, it refuses any cd or pushd that lexically resolves outside the root. Plus a defense-in-depth current-dir pin on the spawned shell.

Neither was specifically a long-task bug. But both surfaced when we started leaning on the agent for the kind of multi-hour work that the long-task improvements were designed to support — they would have been invisible under one-shot usage.

The dogfooding loop

The story behind these changes is mostly the dogfooding loop. We needed Merlin to be good at long work, so we started running long tasks against the codebase itself. The first session validated the verify-pass-checkpoint half end-to-end: the agent did a real refactor, the verify lane passed, the checkpoint landed automatically, --resume rebuilt cleanly from it.

The second session failed messily. The agent attempted a second refactor, broke the file with duplicate function definitions, tried to recover via shell-exec, and escaped the worktree to the main checkout. Three real bugs in one run. Each got filed, each got fixed, and the safety chain we wrote was the response.

You learn what an agent’s reliability actually needs when you let it work on something that takes longer than a coffee break.

What’s next

The two paths that still need real production stress: the context condensation summary path (well-tested by unit tests, but never observed live in a session that organically filled past the 80% threshold), and auto-rollback (verified at the git-invocation level, but never live-triggered from a real failed verify chain). The unit tests cover the behavior; what’s missing is the load.

If you’re running multi-hour autonomous agent sessions and want to help us actually trip those paths, that’s the thing we’re watching for next.

Short work and long work shouldn’t need different tools. The same agent should handle a five-second question and a five-hour refactor, and the only difference should be how long you wait. We’re closer to that now than we were a week ago.