Blog
Updates, insights, and deep dives from the Merlin team.
Making Merlin good at long work
Short tasks were always easy. Long ones were always brittle. Here's the engineering arc that closed the gap — condense, checkpoint, resume, roll back.
Five layers deep
We shipped a destructive-op gate and thought safety was done. Over the next six hours of red-teaming we found five more deletion paths, each surfaced while testing the previous fix. Then we ran a six-probe adversarial sweep to confirm the chain holds. Here's the full audit, the sharpening pass, the validation, and what we learned about how safety thinking generalizes.
Sub-Agents and the Parent's Clean Context
Merlin now spawns sub-agents inside the agent loop. The honest pitch: not a cost-saver, but a quality multiplier, because each child gives its full attention to one thing, plus predictability when input sizes are unknown.
The one-shot arcade: making models perform in public
Nine classic games, each written by an LLM from a single prompt, playable live on the public site. Pyodide runs the Python ones; sandboxed iframes run the HTML ones. A new tier of bench checks runs the programs and asserts on their output. Models that scored 100% on the static checks turned out to ship upside-down chess boards: the kind of failure you can only catch by playing.
Watching Merlin work: per-tool telemetry and the plugin-first push
A per-tool summary on every run. Tool-usage chips on the benchmarks page. Typed cargo / files / git / node plugins replacing common shell-exec calls. Granular approval flags. And the last three plugins of the cycle were written by Merlin itself.
Discord, Run-Anywhere CLI, and a Better Place to Start
Discord bridge, run-anywhere CLI, NDJSON streaming, a new specsync-create plugin. A tour of what landed in Merlin this cycle.
Introducing Merlin
Merlin is a spec-driven AI agent runner built on fledge. Here's why we're building it.