This cycle we made it easy to see exactly what Merlin is doing, then used what we saw to make the agent better. Here’s the tour.
Tools used
Every Merlin run now ends with a compact summary:
✓ Done.
Total tokens: 16,832 in / 1,247 out
Tools used:
files-read × 8
files-write × 4
files-edit × 3
shell-exec × 0 (denied 1)
It’s a per-tool tally of what fired, sorted by call count. Denials are tracked separately from failures, so a --non-interactive session that gave up because it couldn’t get shell approval can’t pass as a clean run just because its failure column is empty. You see exactly what the agent reached for, including what it tried and was refused.
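For a feel of how a tally like this can be kept, here is a minimal Python sketch. It is not Merlin's implementation: the `ToolTally` class, the `record` method, and the outcome labels `"ok"`, `"failed"`, and `"denied"` are all hypothetical names chosen for illustration. It assumes a denied call never ran, so it is counted only in the denial column.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToolTally:
    """Per-tool counters; denials are kept apart from failures."""

    calls: Counter = field(default_factory=Counter)
    failures: Counter = field(default_factory=Counter)
    denials: Counter = field(default_factory=Counter)

    def record(self, tool: str, outcome: str) -> None:
        # Outcome labels are hypothetical: "ok", "failed", or "denied".
        if outcome == "denied":
            # Assumption: a denied call never ran, so it is not a call.
            self.denials[tool] += 1
        elif outcome in ("ok", "failed"):
            self.calls[tool] += 1
            if outcome == "failed":
                self.failures[tool] += 1
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")

    def summary_lines(self) -> list[str]:
        # Include tools that were only ever denied, so refusals stay visible.
        tools = set(self.calls) | set(self.denials)
        # Sort by call count, highest first; break ties by name.
        ordered = sorted(tools, key=lambda t: (-self.calls[t], t))
        lines = []
        for tool in ordered:
            notes = []
            if self.failures[tool]:
                notes.append(f"failed {self.failures[tool]}")
            if self.denials[tool]:
                notes.append(f"denied {self.denials[tool]}")
            suffix = f" ({', '.join(notes)})" if notes else ""
            lines.append(f"{tool} × {self.calls[tool]}{suffix}")
        return lines
```

Keeping three separate counters is the point of the design: a tool that was refused shows up with zero calls and a denial count, rather than vanishing from the summary.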
The bench harness picked up the same instrumentation. Every test now records its per-tool breakdown, and the public benchmarks page renders that as a sub-row of chips under each provider’s score. A run that hits 100% with one cargo build is a different signal than 100% with eleven shell commands — and now the page shows you which is which. Runs that lean more than half on raw shell get a quiet warning border, so the page tells you on sight which providers are reaching for the escape hatch.
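The warning-border rule can be sketched in a few lines of Python. This assumes "more than half" means more than half of all tool calls in a run went to `shell-exec`; the function names and the 0.5 threshold parameter are hypothetical, not taken from the bench harness.

```python
def shell_share(tool_counts: dict[str, int], shell_tool: str = "shell-exec") -> float:
    """Fraction of all tool calls that went to the raw shell tool."""
    total = sum(tool_counts.values())
    if total == 0:
        return 0.0  # a run with no tool calls is not shell-heavy
    return tool_counts.get(shell_tool, 0) / total


def is_shell_heavy(tool_counts: dict[str, int], threshold: float = 0.5) -> bool:
    """True when strictly more than `threshold` of calls were raw shell."""
    return shell_share(tool_counts) > threshold
```

Under this reading, a run split evenly between shell and typed plugins would not get the border; only a run that is mostly shell would.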
Typed plugins
The first time we looked at the data, the picture was loud: shell-exec, shell-exec, shell-exec. Most agent runs that touched Rust were spending the majority of their tool budget wrapping cargo in shell. Same for ls, mkdir, cat | wc -l, npm. So we shipped typed plugins for the common cases: cargo (eight subcommands), files (list / mkdir / stat), git (add / checkout / stash / log), and a new node plugin that auto-detects npm / bun / pnpm / yarn from the lockfile.
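Lockfile detection of the kind the node plugin does can be sketched as below. This is a hedged Python illustration, not the plugin's code: the filenames are the lockfiles each package manager writes by default, while the precedence order, the `LOCKFILES` table, and the `detect_package_manager` function are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Default lockfile names per package manager. The order below is an
# assumed precedence for projects that contain more than one lockfile.
LOCKFILES = [
    ("bun.lockb", "bun"),
    ("bun.lock", "bun"),
    ("pnpm-lock.yaml", "pnpm"),
    ("yarn.lock", "yarn"),
    ("package-lock.json", "npm"),
]


def detect_package_manager(project_dir: str) -> Optional[str]:
    """Return the package manager implied by the first lockfile found."""
    root = Path(project_dir)
    for filename, manager in LOCKFILES:
        if (root / filename).is_file():
            return manager
    return None  # no lockfile found: the caller picks a fallback
```

Returning `None` instead of guessing leaves the fallback decision with the caller, which keeps the detection step honest about what it actually found.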
One new line in the system prompt steered the agent toward the typed lineup. The next run reached for cargo-build directly instead of wrapping cargo build in shell.
Three fresh runs of one of the harder benchmark suites are now on the bench page — same suite across three Ollama providers, each row carrying the new tool-usage chips so you can see at a glance whether a high score came from one well-aimed plugin call or a stack of shell escapes.
Approve only what you trust
Granular approval landed alongside the new plugins: --allow and --deny flags that take comma-separated glob patterns.
merlin --non-interactive --allow 'cargo-*,files-*' --deny shell-exec \
"rebuild and run the test suite"
--deny always wins. --yes is still the “approve everything” override, but you rarely need it now — you can be precise about which tools the agent gets and which ones stay gated. When a call lands, the log line tells you exactly which rule fired:
⚠ Auto-approved: cargo-build (by --allow 'cargo-*')
No more guessing whether something got through because --yes was set or because a specific allow matched. The provenance is right there.
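The precedence described above can be sketched with Python's standard `fnmatch` module. Treat this as an illustration under assumptions rather than Merlin's actual matcher: the `decide` and `parse_patterns` functions are hypothetical, the sketch assumes --deny also beats --yes, and it returns `"ask"` when no rule matches, leaving the interactive or non-interactive fallback to the caller.

```python
from fnmatch import fnmatchcase


def parse_patterns(value: str) -> list[str]:
    """Split a comma-separated flag value into glob patterns."""
    return [p.strip() for p in value.split(",") if p.strip()]


def decide(tool: str, allow: list[str], deny: list[str], yes: bool = False) -> tuple[str, str]:
    """Return (decision, provenance) for one tool call."""
    # Deny rules are checked first, so a deny match always wins.
    for pattern in deny:
        if fnmatchcase(tool, pattern):
            return ("denied", f"by --deny '{pattern}'")
    # A specific allow match is reported before the blanket override,
    # so the provenance names the narrowest rule that applied.
    for pattern in allow:
        if fnmatchcase(tool, pattern):
            return ("approved", f"by --allow '{pattern}'")
    if yes:
        return ("approved", "by --yes")
    return ("ask", "no matching rule")
```

For the command shown earlier, `decide("cargo-build", parse_patterns("cargo-*,files-*"), ["shell-exec"])` yields an approval attributed to `--allow 'cargo-*'`, while `shell-exec` is denied regardless of any allow pattern.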
Merlin building its own plugins
The most fun part of the cycle: the last three plugins (files, git, node) weren’t written by hand. Merlin wrote them, given a description and a reference plugin to imitate. The agent read an existing plugin’s layout, wrote the new commands following the same shape, wired the dispatch, added unit tests, ran cargo-build and cargo-test, and produced a PR-shaped change.
Steering toward a real, working sibling file turned out to be much cheaper than describing the convention. We also rotated different models through the writing — same scaffold, different tool-call patterns — which sharpened our sense of where the next plugin needs to come from.
What’s next
- More typed plugins for the remaining common workflows: Python (pip/uv) and Go (build/test/mod-tidy).
- A small pwd-info plugin so agents don’t have to shell-exec just to figure out where they are.
- A cross-provider benchmark across the full new plugin lineup, so the chips on the bench page tell a complete story.
If you’ve got a Merlin install handy, pull the latest and try it. Run something that exercises files and shell. Watch the end-of-run tally. That little block is the most informative thing in the CLI right now.
— The Merlin team