Glossary

Bunsen reuses a few words a lot — agent, scorer, platform — in ways that are easy to conflate. This page is the canonical definition for each term. Other docs link here rather than redefining them.

Core model

Experiment — the unit defined by an experiment.yaml: a task, the container environment it runs in, and how the result is evaluated. An experiment is the thing you bn run. Do not call the whole experiment a "task" — the task is one field inside it.

Task — the work the agent under test must accomplish, expressed in natural language as task.prompt. One experiment has exactly one task.

Agent under test — the agent being evaluated: the thing you are running the experiment on. Defined by an agent.yaml. Bunsen is agent-agnostic — an agent under test can be a CLI, a script, an npm package, or a downloaded binary. In prose we always write "agent under test" (never "AUT") to keep it distinct from the two senses below.

Platform agents — the three Bunsen-supplied agents that run inside the container to operate the experiment: the orchestrator (works out how to invoke the agent under test), the supervisor (keeps an interactive agent moving in supervised mode), and the scorer (executes LLM criteria). These are infrastructure; you do not author them. See How Bunsen Works.

Run — one execution of an experiment with an agent. Each run gets its own container, its own directory of outputs, and an entry in the run manifest.

Environment

Substrate — the container baseline an experiment declares: the base image plus any runtimes, packages, and services the task needs. The experiment owns the substrate. See The Environment Model.

Sealed closure — how an agent ships its own toolkit (and any runtime it pins) via install, independent of the substrate. The agent and the substrate coexist in one container without negotiating a merge — this is the asymmetric composition model. See The Environment Model.

Workspace (/workspace) — the mutable directory the agent under test works in. Materialized at run start from the immutable sources.

workspace-source (/workspace-source) — an immutable snapshot of the initial seeded inputs. /workspace is materialized from it, so the original inputs are always available for comparison and re-runs.

Evaluation

Criterion — one user-authored entry under evaluation.criteria in an experiment, with a type and a weight. Criteria are what you write to define "did the agent succeed?" See Scorers & Evaluation.

Criterion type — one of five values a criterion can take:

TypeWhat it does
scriptRuns a shell command; deterministic, near-zero cost
judgeA single LLM call with attached evidence, no tools
agentA full agent loop with tools
browser-agentAn agent loop with screenshots / Playwright
aggregateA mathematical combination of other criteria

Scorer — the engine that executes a criterion. "Scorer" is the runner; "criterion" is the rubric entry it runs. (The run manifest records a scorerType field for each result — that manifest vocabulary is separate from the authoring type above; see Run Manifest & Events.)

Verifier — a helper file or asset placed in an experiment's verifiers/ directory, mounted read-only at /bunsen/verifiers during scoring. A script criterion typically calls a verifier. A verifier is a file; a criterion is the rubric entry that runs it.

Report — the optional narrative summary configured at evaluation.report. It runs once after all criteria and produces a markdown artifact with no numeric score. It is not a criterion type.

Gate — a criterion marked gate: true that short-circuits the rest of scoring when it fails — e.g. skip expensive LLM judging when the cheap script tests already failed. See Scorers & Evaluation.

Configuration

Variant — a named overlay on an experiment.yaml or agent.yaml that tweaks it without duplicating the whole file (a harder prompt, an extra flag). Variants are behavioral only.

Model axis — the model is chosen separately from variants, with bn run … --model <id>, against the agent's model block. Do not author per-model variants; pick the model on the command line. See agent.yaml Reference.

Suite — a versioned group of related experiments distributed as one git repository (e.g. Terminal Bench). See Suites.

Experiment ref — how you name an experiment to bn run. Unqualified (fizzbuzz) resolves locally first, then across suites; qualified forms (terminal-bench/<task>, github.com/org/repo/<task>) are unambiguous. See Suites.

Platforms

Platform — a linux/<arch> value: linux/amd64 or linux/arm64. Use "platform" for the full value.

Architecture (arch) — the bare amd64 / arm64 part. Reserve "arch" for when you specifically mean that suffix. See Platforms & Architecture.

Cost

Headline model — the model Bunsen labels a run with: its highest-cost model. Used so runs can be grouped and compared by the model that dominated their spend. See Cost Accounting.