Introduction

Bunsen is a general-purpose experiment runner for agentic systems. You give an agent an environment, run it reproducibly, capture everything it did, and evaluate the result.

Environment-first, agent-agnostic. You describe the environment and task — not how a particular agent works — and Bunsen figures out how to invoke whatever agent you point at it. The same experiment can be run against Claude Code, a custom Python agent, an npm-published CLI, or a downloaded binary, and scored the same way.

What a run produces

Every bn run executes in its own Docker container and captures, automatically:

Artifacts — the final workspace, a diff of everything that changed, and an optional exportable tarball.
AI traces — the full request/response history of the agent's model calls, organized into conversation threads.
Cost — token usage and dollar cost, derived from the captured traces.
Scores — the result of your evaluation criteria, normalized to [0, 1], plus an optional narrative report.
Logs and a terminal recording (when enabled) for replay.

You inspect all of it with bn runs … or in the local web viewer (bn runs open). See Run Manifest & Events for the full output layout.

The two-file model

A Bunsen experiment is two declarative files:

experiment.yaml defines the environment: the task prompt, the container substrate (image, runtimes, packages), the workspace inputs, and the evaluation criteria.
agent.yaml defines a pluggable agent: where to get it, how to install it, how to invoke it, and which model env var it reads.

Keeping them separate is the whole point: one experiment runs against many agents, and one agent runs against many experiments. The agent ships its own toolkit as a sealed closure; the experiment owns the substrate. They compose in one container without a merge contract — see How Bunsen Works.

Who it's for

Bunsen is for researchers and engineers who want to:

Benchmark agents against published suites like Terminal Bench — or against each other.
Wrap their own task as a reproducible, scored experiment and iterate on it.
Inspect agent behavior deeply — traces, diffs, cost — to turn a run into evidence rather than a vibe.

Where to go next

Getting Started — install the CLI and run your first experiment end to end.
Then pick a path:
- Run a Terminal Bench Task — point Bunsen at an existing benchmark and score a real agent, with zero authoring.
- Bring Your Own Task — wrap your own task as an experiment with your own pass/fail check.

New to the vocabulary? The Glossary defines every overloaded term in one place.