Merlin shipped a new plugin: fledge-plugin-merlin-subagent. It exposes one command, subagent-spawn. A parent agent can call it during its own loop to hand off a self-contained subtask to a child Merlin process. The parent gets back a compact JSON envelope — summary, files changed, tool calls, tokens used — instead of the child’s full transcript.

The headline up front, because the first draft of this post buried it: sub-agents are not a cost-saving feature. Total LLM tokens across parent plus children go up, not down, in almost every scenario we measured. What sub-agents actually buy you is quality per item and bounded, predictable behavior when input sizes are unknown. The rest of this post covers those two wins, the honest tradeoff, and the design notes.

The shape

Parent merlin agent (depth 0)
  └─ subagent-spawn {
       prompt: "Read plugins/fledge-plugin-files/plugin.toml,
                describe its purpose in one sentence.",
       label: "files-purpose",
       tier: "tool"          ← optional override
     }

       │  Plugin reads MERLIN_SUBAGENT_DEPTH from env.
       │  Spawns child merlin process with depth+1.

       └─ Child merlin (depth 1, fresh context)
            ├─ files-read plugin.toml  (3ms ✓)
            └─ returns "Offers file system operations…"

  ← envelope:
       { ok: true, depth: 1, label: "files-purpose",
         provider: "ollama-qwen-coder", tier: "tool",
         duration_ms: 7000,
         summary: "Offers file system operations…",
         files_changed: [], tool_calls: 1,
         input_tokens: 8312, output_tokens: 42,
         error: null }

The parent’s growing conversation gets an envelope (~250 tokens) instead of whatever the child read, which might be 250 tokens or 5,000. Either way, the parent never sees the raw content.
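If you want to consume the envelope from code, a typed record is a reasonable starting point. The sketch below is in Python and only mirrors the fields shown in the example above; the `Envelope` class and `parse_envelope` helper are hypothetical names for illustration, not something the plugin ships.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Envelope:
    """Fields copied from the example envelope in this post."""
    ok: bool
    depth: int
    label: str
    provider: str
    tier: str
    duration_ms: int
    summary: str
    files_changed: list = field(default_factory=list)
    tool_calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    error: Optional[str] = None


def parse_envelope(raw: str) -> Envelope:
    """Hypothetical helper: turn the JSON envelope text into a typed record."""
    return Envelope(**json.loads(raw))


# The same values as the example envelope, serialized as JSON.
raw = json.dumps({
    "ok": True, "depth": 1, "label": "files-purpose",
    "provider": "ollama-qwen-coder", "tier": "tool",
    "duration_ms": 7000,
    "summary": "Offers file system operations…",
    "files_changed": [], "tool_calls": 1,
    "input_tokens": 8312, "output_tokens": 42,
    "error": None,
})
env = parse_envelope(raw)
print(env.label, env.tool_calls, env.input_tokens)
```

Note that `input_tokens` here is what the child spent, not what the parent absorbed; the parent only pays for the envelope text itself.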

Defaults

These live under [merlin.subagent] in fledge.toml and can be overridden per spawn:

[merlin.subagent]
provider = "ollama-qwen-coder"   # cheap, fast, frontier-coder-class
default_tier = "tool"            # files-read / search-grep / memory-recall
default_timeout_secs = 300
max_depth = 2                    # parent → sub → sub-sub, then refuse

The provider default points at ollama-qwen-coder deliberately: if a parent on Claude Sonnet spawns unlimited children at Sonnet rates, the bill compounds fast. Cheap default; override per spawn when the subtask genuinely needs a stronger model.
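To make the "cheap default, override per spawn" rule concrete, here is a minimal sketch of how defaults and per-spawn arguments could be merged. The default values come from the config block above; `SubagentConfig`, `resolve_spawn`, and the `claude-sonnet` provider name are assumptions made for illustration, not the plugin's actual code or identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubagentConfig:
    """Defaults as listed under [merlin.subagent] in this post."""
    provider: str = "ollama-qwen-coder"
    default_tier: str = "tool"
    default_timeout_secs: int = 300
    max_depth: int = 2


def resolve_spawn(cfg: SubagentConfig,
                  provider: Optional[str] = None,
                  tier: Optional[str] = None,
                  timeout_secs: Optional[int] = None) -> dict:
    """Per-spawn values win; anything left unset falls back to the config."""
    return {
        "provider": provider if provider is not None else cfg.provider,
        "tier": tier if tier is not None else cfg.default_tier,
        "timeout_secs": (timeout_secs if timeout_secs is not None
                         else cfg.default_timeout_secs),
    }


cfg = SubagentConfig()
# Rote per-item work: take the cheap defaults.
print(resolve_spawn(cfg))
# A subtask that needs a stronger model ("claude-sonnet" is a made-up name).
print(resolve_spawn(cfg, provider="claude-sonnet"))
```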

The tier default is tool, not read. We tried read first because it sounded safer — but most research plugins (files-read, search-grep, memory-recall, files-glob) declare min_tier = "tool" and get filtered out at the read tier. A “safe” sub-agent with no tools just hallucinates from training. tool gives the child a useful research surface while still locking out shell-exec and destructive writes.
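The filtering effect is easy to see in a toy model. The sketch below assumes tiers are ordered read < tool < code and that a plugin is visible when the spawn tier is at least its min_tier; both the ordering and the `code` min_tier for shell-exec are assumptions for illustration. Only the four `tool` entries come from this post.

```python
# Assumed ordering: a higher rank unlocks everything below it.
TIER_RANK = {"read": 0, "tool": 1, "code": 2}

PLUGIN_MIN_TIER = {
    "files-read": "tool",
    "search-grep": "tool",
    "memory-recall": "tool",
    "files-glob": "tool",
    "shell-exec": "code",  # assumption: the post only says tool locks it out
}


def visible_tools(tier: str) -> list:
    """Return the plugins a child running at `tier` would be offered."""
    rank = TIER_RANK[tier]
    return sorted(
        name for name, min_tier in PLUGIN_MIN_TIER.items()
        if TIER_RANK[min_tier] <= rank
    )


print(visible_tools("read"))  # nothing to research with
print(visible_tools("tool"))  # the research surface, without shell-exec
```

Under these assumptions the read tier leaves the child with an empty tool list, which is exactly the "hallucinates from training" failure described above.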

When subagent-spawn pays off

Use it when:

  • You need to do the same shape of inspection across many similar items — every plugin, every spec, every test file.
  • The items might be large or variable in size — files where you can’t predict the context impact.
  • You want per-task provider routing — a frontier model for the synthesis, a cheap one for the rote per-item work.
  • You’re driving a long session and want each delegation to add a known, bounded amount of context.

Don’t bother when:

  • The task is one or two items — direct tool calls are faster and cheaper.
  • The children need to see each other’s findings — they can’t communicate; the parent is the only coordinator.
  • You’re optimising for total dollar cost — sub-agents cost more, not less.

The quality story (the real win)

We ran the same prompt — “for each of these 30 plugins, read its plugin.toml and describe the plugin’s purpose in one sentence” — two ways, against the same provider:

  • Inline: the parent reads each plugin.toml itself, then writes 30 summaries from a single context full of raw TOML content.
  • Sub-agent: the parent dispatches 30 subagent-spawn calls, each child reads one file in isolation, the parent stitches the envelopes into the final list.

Both completed all 30 entries. The difference was in what the summaries actually said. Three examples, picked from the same positions in both lists:

fledge-plugin-files
  • Inline: “Performs file operations like reading, writing, editing, and listing.”
  • Sub-agent: “Provides file system operations such as read, write, edit, glob, list, mkdir, and stat.”

fledge-plugin-cargo
  • Inline: “Provides Cargo package management commands for Rust projects.”
  • Sub-agent: “Provides typed cargo subcommands (init/add/build/test/run/check/fmt/clippy) as a secure alternative to shell-exec cargo ..., offering clear errors, observable metrics, and preventing shell injection.”

fledge-plugin-pattern
  • Inline: “Shows example sibling files.”
  • Sub-agent: “Provides a command to show sibling files of the same shape, helping users match project conventions when writing new files like plugin.toml, Cargo.toml, or *.spec.md.”

The pattern held across the entire list. Inline summaries were short and generic — surface-level descriptions that could have come from any file with a similar name. Sub-agent summaries were specific, accurate, and useful — they named the actual commands, the actual configuration knobs, the actual safety properties.

Why? Each child had a fresh context with one job. No accumulated history of reading 29 other plugins crowding its attention. The model spent its budget on understanding this file instead of triaging which detail to mention across 30.

Measured output difference at 30-fanout: 2,948 sub-agent output tokens versus 1,582 inline output tokens — 86% more text, but the extra tokens went to richer per-item content, not padding.

Predictability — the secondary technical win

The other thing sub-agents buy you is bounded per-item cost in the parent’s context.

When the parent does an inline files-read on a 200-line plugin.toml, ~500 tokens of raw content land in the parent’s growing conversation history. If the parent then reads a 2,000-line source file, that’s ~5,000 tokens. Same tool call, very different cost.

When the parent does a subagent-spawn, the envelope is roughly constant — ~250 tokens regardless of whether the child read a 500-byte config or a 50,000-byte source file. The variable cost is hidden inside the child (and discarded when the child exits).

For workflows where you don’t know what you’re reading, the bounded envelope is the safer shape. You can do the same operation 50 times and know roughly how much the parent’s context will grow, instead of “well, depends on how big the files turn out to be.”
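A back-of-the-envelope calculation shows the difference. This sketch uses the rough figures quoted above (~250 tokens per envelope, ~500 tokens for a small file, ~5,000 for a large one); the mix of 40 small and 10 large files is hypothetical, and the model ignores tool-call framing overhead.

```python
ENVELOPE_TOKENS = 250  # approximate envelope size quoted in this post


def inline_growth(file_tokens: list) -> int:
    """Parent context growth when every file is read inline."""
    return sum(file_tokens)


def subagent_growth(n_items: int, envelope_tokens: int = ENVELOPE_TOKENS) -> int:
    """Parent context growth when each item comes back as one envelope."""
    return n_items * envelope_tokens


# Hypothetical workload: 40 small files and 10 large ones.
file_tokens = [500] * 40 + [5000] * 10

print(inline_growth(file_tokens))         # 70000
print(subagent_growth(len(file_tokens)))  # 12500
```

The sub-agent figure depends only on the item count, so it can be computed before any file is opened. This is parent context only; the children's own token spend is a separate line on the bill.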

The honest tradeoff: total cost goes up

Stated plainly:

Each sub-agent is a full Merlin process with its own LLM call overhead — system prompt, tool catalog, conversation framing. Even with a cheap default provider, summing the parent’s input tokens plus every child’s input tokens gives you a bigger number than running the same task inline.

In the 30-plugin fanout, the inline parent used ~875K input tokens and the sub-agent parent used ~866K (basically the same, well within run-to-run noise). But on top of that 866K, each of the 30 children also burned its own input tokens. The total bill at the provider was substantially higher with sub-agents.

You’re not paying for context savings. You’re paying for richer output and bounded growth. If neither matters to your task, don’t reach for sub-agents.
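For a rough sense of how the total adds up, the sketch below sums the parent's input tokens with each child's `input_tokens` field from the envelope. The parent figures are the ones quoted above. The per-child figure is an assumption: we borrow 8,312 from the single example envelope earlier in this post as a stand-in, since per-child numbers for the benchmark are not listed here. Treat the result as illustrative, not as a measurement.

```python
def total_input_tokens(parent_input_tokens: int, envelopes: list) -> int:
    """Parent input tokens plus every child's input tokens."""
    return parent_input_tokens + sum(e["input_tokens"] for e in envelopes)


INLINE_PARENT = 875_000    # approximate, from the 30-plugin run
SUBAGENT_PARENT = 866_000  # approximate, from the 30-plugin run

# Assumption: every child costs what the example envelope reported.
children = [{"input_tokens": 8312} for _ in range(30)]

total = total_input_tokens(SUBAGENT_PARENT, children)
print(total)                  # 1115360
print(total - INLINE_PARENT)  # 240360 more than the inline run
```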

Recursion safety

The recursion guard is enforced by one env variable: MERLIN_SUBAGENT_DEPTH. The plugin reads it, refuses to spawn if it’s already at the configured max_depth, and otherwise propagates depth+1 to the child it spawns. Default cap is 2:

  • Depth 0 (the top-level agent you started) calls subagent-spawn → child runs at depth 1.
  • That child can also call subagent-spawn → its child runs at depth 2.
  • At depth 2, subagent-spawn refuses with a structured error. No further recursion.

Single point of enforcement. The agent loop doesn’t also filter subagent-spawn out of deep children’s tool surfaces — refusal at the plugin is the only barrier. One place to audit, one place to break. The cap is configurable in fledge.toml if you genuinely want deeper nesting, but the default exists because a parent prompt with a runaway recursion idea could otherwise spawn dozens of nested processes before any timeout fires.

The refusal is structured: it returns ok=false with a clear error message rather than crashing or stalling. The parent sees “depth N already at configured max” and handles it like any other tool refusal — usually by doing the work itself at the current level.
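The guard logic described above fits in a few lines. This is a Python sketch of the behavior, not the plugin's source: `check_spawn` is a hypothetical name, and treating an unset variable as depth 0 is our assumption about how the top-level agent is identified.

```python
ENV_VAR = "MERLIN_SUBAGENT_DEPTH"


def check_spawn(environ: dict, max_depth: int = 2):
    """Return (result, child_env); child_env is None when the spawn is refused."""
    # Assumption: an unset variable means we are the top-level agent.
    depth = int(environ.get(ENV_VAR, "0"))
    if depth >= max_depth:
        # Structured refusal rather than a crash or a stall.
        return ({"ok": False,
                 "error": f"depth {depth} already at configured max"}, None)
    child_env = dict(environ)
    child_env[ENV_VAR] = str(depth + 1)  # propagate depth+1 to the child
    return ({"ok": True, "error": None}, child_env)


for start in ({}, {ENV_VAR: "1"}, {ENV_VAR: "2"}):
    result, child_env = check_spawn(start)
    print(result["ok"], child_env[ENV_VAR] if child_env else result["error"])
```

In real use the first argument would be `os.environ`; copying it into a new dict keeps the parent's own depth untouched.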

How it works under the hood

The plugin shells out to merlin directly:

MERLIN_SUBAGENT_DEPTH=1 merlin \
  --non-interactive --output json --yes --no-session \
  --tier tool --timeout 300 \
  --provider ollama-qwen-coder \
  "<prompt>"

The child runs a fresh, normal Merlin loop — its own tool calls, its own refusals, its own verification — and emits its result to stdout as JSON. The plugin parses that with a streaming deserializer (terminal cursor-show sequences and other ANSI noise after the JSON object don’t trip the parse), extracts the summary plus telemetry, and wraps it in the envelope the parent sees.
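The "parse the first JSON object and ignore what follows" trick has a close analog in Python's standard library: `json.JSONDecoder.raw_decode` returns the decoded value plus the index where it stopped, so trailing bytes are never examined. The sketch below illustrates the idea only; the plugin itself is not written this way, and the keys in the sample payload are made up.

```python
import json


def parse_child_stdout(stdout: str) -> dict:
    """Decode the first JSON object in `stdout`, ignoring anything after it."""
    start = stdout.index("{")  # raises ValueError if no object is present
    obj, _end = json.JSONDecoder().raw_decode(stdout, start)
    return obj


# JSON followed by a cursor-show escape sequence and a newline.
noisy = '{"summary": "Offers file system operations", "tool_calls": 1}\x1b[?25h\n'

try:
    json.loads(noisy)  # a strict parse rejects the trailing bytes
except json.JSONDecodeError as exc:
    print("strict parse failed:", exc.msg)

print(parse_child_stdout(noisy))
```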

--no-session keeps sub-agent runs out of the project’s session history. Their work shows up in the parent’s reply, not as a separate resumable session.
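For anyone scripting something similar, here is a sketch of assembling that invocation as an argument list. The flags are the ones shown in the command above; `build_child_command` is a hypothetical helper, and this is not the plugin's actual implementation.

```python
def build_child_command(prompt: str,
                        tier: str = "tool",
                        timeout_secs: int = 300,
                        provider: str = "ollama-qwen-coder") -> list:
    """Build the argv for a child merlin run, mirroring the command above."""
    return [
        "merlin",
        "--non-interactive", "--output", "json", "--yes", "--no-session",
        "--tier", tier,
        "--timeout", str(timeout_secs),
        "--provider", provider,
        prompt,  # passed as a single argv entry, so no shell quoting is needed
    ]


argv = build_child_command(
    "Read plugins/fledge-plugin-files/plugin.toml, "
    "describe its purpose in one sentence."
)
print(argv)
```

Passing the list to something like `subprocess.run(argv, env=child_env, capture_output=True, text=True)` avoids going through a shell, so a prompt containing quotes or semicolons stays a plain argument.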

The system prompt teaches the pattern

Merlin’s system prompt includes a tool selection guidance block that matches kinds of question to families of tool. When subagent-spawn is installed, the guidance now includes:

Delegating fanout work to sub-agents — for tasks that involve looking at many similar things (“for each plugin, summarize”, “for each test file, count entries”, “for each spec, extract X”) call subagent-spawn once per item rather than reading them all into your own context. The sub-agent runs its own loop with a fresh context and returns a compact JSON envelope; you only see the summary, not the raw file content. Use it whenever the alternative is reading 5+ files just to answer one synthesis question.

The bullet is gated on the plugin actually being installed, so the surface never claims a tool the agent doesn’t have. AGENT.md (Merlin’s project persona) got a matching section so the pattern is part of the agent’s identity, not just a runtime hint.
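The gating itself is a small conditional. The sketch below shows one way it could look; the function name, the abbreviated bullet text, and the base bullet are all placeholders for illustration, not Merlin's actual prompt-building code.

```python
# Abbreviated stand-in for the guidance paragraph quoted above.
SUBAGENT_GUIDANCE = (
    "Delegating fanout work to sub-agents: call subagent-spawn once per item "
    "rather than reading them all into your own context."
)


def build_tool_guidance(installed_commands: set, base_bullets: list) -> str:
    """Include the sub-agent bullet only when the command is really available."""
    bullets = list(base_bullets)
    if "subagent-spawn" in installed_commands:
        bullets.append(SUBAGENT_GUIDANCE)
    return "\n".join(f"- {b}" for b in bullets)


base = ["Use search-grep to locate symbols."]  # placeholder bullet
print(build_tool_guidance({"files-read", "subagent-spawn"}, base))
print(build_tool_guidance({"files-read"}, base))
```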

The long-session benchmark

This cycle also added a new long_session bench suite — four text-only tests for multi-turn drift, three-fact recall, binding-decision pressure, and precise-edit-under-navigation-pressure. Single-call benches miss these failure modes; long_session catches them.

Headline numbers across the Ollama Cloud lineup:

Provider                              Score             Time
ollama-qwen-coder (qwen3-coder:480b)  4/4 = 100%        3.6 min
ollama-kimi (k2.5)                    4/4 = 100%        9.1 min
ollama-devstral (123b)                4/4 = 100%        11.2 min
ollama-glm (4.7)                      3/4 = 93.8%       4.5 min
ollama-deepseek-v4-flash              3/4 = 75%         14.7 min
openai-4o                             3/4 = 75%         2.3 min (rate-limited on T3)
openai-5-mini (reasoning)             killed at 19 min  reasoning models are impractical for multi-turn benches

Three Ollama Cloud frontier coder models hit a clean 100%. Reasoning-model latency is the one problem we don’t yet have a good answer for; every other provider scores 75% or higher.

What’s next

  • Parallel sub-agent dispatch. Today multiple subagent-spawn calls in one parent turn run sequentially. They’re independent processes — they could run in parallel — but the existing read-only fast path’s invariants assume no side effects, and a tier=code spawn can mutate state. Conditional parallelism (parallel only when spawn-tier ≤ tool) is the next opt-in.
  • Cost summing across parent + children. Each spawn surfaces its own tokens in the envelope. A future helper could roll up “total session cost including all sub-agents” so a daemon-mode owner can budget properly across a fanout.
  • Larger-file benchmarks. The data here measured small plugin.toml files. For source-file-sized inputs the predictability argument (bounded envelope vs. unbounded inline content) matters more — that’s the regime where sub-agents shift from “quality multiplier” to “the only way the task fits at all.”
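As a thought experiment for the conditional-parallelism item above, the rule "parallel only when spawn-tier ≤ tool" could look something like this. Nothing here exists in the plugin today: the tier ordering (read < tool < code), the function names, and the use of a thread pool are all assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed ordering; only "tool" and below are treated as side-effect free.
TIER_RANK = {"read": 0, "tool": 1, "code": 2}


def can_run_in_parallel(spawns: list) -> bool:
    """True only if no spawn in the batch is allowed to mutate state."""
    return all(TIER_RANK[s["tier"]] <= TIER_RANK["tool"] for s in spawns)


def dispatch(spawns: list, run_one) -> list:
    """Run a batch of spawns, in parallel when safe, else sequentially."""
    if can_run_in_parallel(spawns):
        with ThreadPoolExecutor(max_workers=len(spawns) or 1) as pool:
            return list(pool.map(run_one, spawns))  # map preserves input order
    return [run_one(s) for s in spawns]


def fake_run(spawn: dict) -> dict:
    """Stand-in for launching a child; just echoes the label back."""
    return {"ok": True, "label": spawn["label"]}


batch = [{"label": f"item-{i}", "tier": "tool"} for i in range(3)]
print([e["label"] for e in dispatch(batch, fake_run)])
print(can_run_in_parallel(batch + [{"label": "writer", "tier": "code"}]))
```

A single tier=code spawn drops the whole batch back to sequential execution, which is the conservative reading of the rule.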