Getting Started

This page takes you from nothing to a scored run in a few minutes, then points you at the path that fits what you're doing.

Prerequisites

  • Docker — every run executes in a container. Docker Desktop or Engine, running.
  • Node.js ≥ 22 — the CLI ships as an npm package.
  • An Anthropic API key — Bunsen's orchestrator and LLM evaluation run on Claude. Set ANTHROPIC_API_KEY in your environment (or a .env file in your project — Bunsen loads it automatically).
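
The three prerequisites above can also be checked by hand before installing anything. The following is a minimal sketch, not part of Bunsen: the function names node_version_ok and preflight are hypothetical, and it assumes the docker and node binaries are on your PATH. Once the CLI is installed, bn doctor remains the supported check.

```shell
# Hypothetical preflight helper; not part of Bunsen.

# Succeeds when a Node.js version string such as "v22.3.0" has a major version >= 22.
node_version_ok() {
  major=${1#v}        # drop the leading "v" if present
  major=${major%%.*}  # keep only the major component
  case $major in
    ''|*[!0-9]*) return 1 ;;  # empty or not a plain number
  esac
  [ "$major" -ge 22 ]
}

# Reports each missing prerequisite on stderr; returns non-zero if any check fails.
preflight() {
  status=0
  if ! docker info >/dev/null 2>&1; then
    echo "Docker is not installed or the daemon is not running" >&2
    status=1
  fi
  if ! node_version_ok "$(node --version 2>/dev/null)"; then
    echo "Node.js 22 or newer is required" >&2
    status=1
  fi
  # Accept the key from the environment or from a .env file in the current directory.
  if [ -z "${ANTHROPIC_API_KEY:-}" ] && ! grep -q '^ANTHROPIC_API_KEY=' .env 2>/dev/null; then
    echo "ANTHROPIC_API_KEY is not set in the environment or in .env" >&2
    status=1
  fi
  return "$status"
}
```

Source the snippet and call preflight from your project directory. It only defines functions, so nothing runs until you invoke it.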

Install the CLI

npm i -g @bunsen-dev/cli
bn doctor   # verify Docker, git, and your environment

This puts bn (and the bunsen alias) on your PATH. bn doctor tells you if anything is missing before you run.

Run your first experiment

Scaffold a project with a tiny bundled example, then run it:

mkdir my-lab && cd my-lab
bn init --example          # writes bunsen.config.yaml + a hello-world experiment + echo-agent
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env

bn run hello-world echo-agent

What just happened: Bunsen built a container from the experiment's base image, the orchestrator worked out how to invoke echo-agent, ran it against the task, captured everything, and scored the result with a deterministic script criterion. echo-agent makes no model calls itself — but the orchestrator and evaluation do, which is why the API key is needed even here.
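
One caveat about the .env step above: the > redirect replaces the whole file, so it would discard any keys already stored there. If the project already has a .env, a small helper along these lines may be safer. This is a sketch under stated assumptions: add_env_key is a hypothetical name, not a Bunsen command, and it assumes the file, if present, ends with a newline.

```shell
# Hypothetical helper, not part of Bunsen: add KEY=VALUE to an env file
# only if KEY is not already assigned there.
add_env_key() {
  key=$1
  value=$2
  file=${3:-.env}   # default to .env in the current directory
  if grep -q "^${key}=" "$file" 2>/dev/null; then
    echo "$key already present in $file; leaving it unchanged" >&2
    return 0
  fi
  printf '%s=%s\n' "$key" "$value" >> "$file"
  chmod 600 "$file"   # keep the secret readable only by the current user
}
```

For example, add_env_key ANTHROPIC_API_KEY "$MY_KEY" appends the key to .env without touching existing entries, where MY_KEY stands in for wherever you keep the value.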

View the result

bn runs show              # summary of the most recent run: score, cost, status
bn runs open              # open the run in the local web viewer

bn runs show prints the score and a per-criterion breakdown. bn runs open serves an interactive viewer (traces, diff, artifacts) at http://localhost:3456. For everything a run captures and where it lives on disk, see Run Manifest & Events.

Run a real coding agent

echo-agent proves the loop but makes no model calls. To run an actual coding agent, copy a bundled starter into your project — Bunsen ships the three frontier coding CLIs inside the CLI itself:

bn agents add              # copies claude-code, codex-cli, gemini-cli into agents/
                           # (or run `bn init --starter-agents` at scaffold time)
bn agents list             # confirm they resolve

# Set the matching key in .env, then run one:
bn run hello-world claude-code

Each starter needs its provider's key in .env: ANTHROPIC_API_KEY for claude-code, OPENAI_API_KEY for codex-cli, GEMINI_API_KEY for gemini-cli. The copied agents are plain files in agents/<name>/ that you own: pin a different CLI version, add a variant, or swap the model with bn run … claude-code --model <id>. See agent.yaml for the full schema.
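
The agent-to-key mapping above is easy to get wrong when switching between starters. A minimal sketch of a check you could run before bn run follows. The helper names required_key_for and has_key_for are hypothetical and not part of Bunsen, and the check only looks at the .env file, not at variables exported in your shell.

```shell
# Hypothetical helpers, not Bunsen commands.

# Prints the key name each starter agent needs, per the mapping described above.
required_key_for() {
  case $1 in
    claude-code) echo ANTHROPIC_API_KEY ;;
    codex-cli)   echo OPENAI_API_KEY ;;
    gemini-cli)  echo GEMINI_API_KEY ;;
    *)           return 1 ;;   # unknown agent
  esac
}

# Succeeds when the env file (default: .env) has a non-empty value for the agent's key.
has_key_for() {
  key=$(required_key_for "$1") || return 1
  grep -q "^${key}=." "${2:-.env}" 2>/dev/null
}
```

For example, has_key_for codex-cli || echo "add OPENAI_API_KEY to .env" gives a quick reminder before a run.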

Where to next

You've seen the full loop. Now pick the path that matches your goal — each is self-contained:

  • Run a Terminal Bench Task → Point Bunsen at an existing benchmark suite and score a real coding agent, with zero authoring. Best if you want to evaluate agents.

  • Bring Your Own Task → Wrap your own task or codebase as a reproducible experiment with your own pass/fail check. Best if you want to measure agents on your work.

Along the way you'll want the two reference specs — experiment.yaml and agent.yaml — and the Glossary for any unfamiliar term.