The one-shot arcade: making models perform in public
Nine classic games, written in a single prompt by an LLM, playable live on the public site. Pyodide runs the Python ones; sandboxed iframes run the HTML ones. A new tier of bench checks runs the programs and asserts on their output. Models that scored 100% on the static checks turned out to ship upside-down chess boards, the kind of failure you can only catch by playing.
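To make "runs the programs and asserts the output" concrete, here is a minimal sketch of what such a check could look like. It is not the site's actual harness: the names `run_program` and `board_is_upright`, the sample programs, and the convention that uppercase letters are White's pieces are all assumptions for illustration, and the plain `exec` used here is not a sandbox.

```python
import contextlib
import io


def run_program(source: str) -> str:
    """Execute Python source in a fresh namespace and return what it printed.

    Hypothetical helper: a real harness would isolate the program
    (Pyodide, a subprocess with a timeout, etc.); bare exec() is not a sandbox.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {"__name__": "__main__"})
    return buffer.getvalue()


def board_is_upright(output: str) -> bool:
    """Check that the printed board shows Black's back rank on the top row.

    Assumes an 8x8 text board with space-separated squares, lowercase for
    Black and uppercase for White, viewed from White's side.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    return (
        len(rows) == 8
        and rows[0] == list("rnbqkbnr")
        and rows[-1] == list("RNBQKBNR")
    )


# Hypothetical stand-ins for model-written programs.
GOOD = '''
rows = ["rnbqkbnr", "pppppppp"] + ["........"] * 4 + ["PPPPPPPP", "RNBQKBNR"]
for row in rows:
    print(" ".join(row))
'''

# Same program, but the board is printed bottom rank first.
FLIPPED = GOOD.replace("for row in rows:", "for row in reversed(rows):")

if __name__ == "__main__":
    print("good board upright:", board_is_upright(run_program(GOOD)))
    print("flipped board upright:", board_is_upright(run_program(FLIPPED)))
```

A static check that only looks for piece letters or an 8x8 loop in the source would pass both programs; only executing them and inspecting the printed rows distinguishes the two.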
update, merlin, benchmarks