The Environment Model
How Bunsen composes the run container for an agent against an experiment. Covers the environment, workspace, and run blocks in experiment.yaml; the install block in agent.yaml; and the asymmetric way they coexist.
For the authoritative schema reference, see the hosted JSON schemas: experiment.v1.json, agent.v1.json, project.v1.json, and suite.v1.json. These ship in the @bunsen-dev/types package (see Packages & Schemas).
Overview
The experiment provides task substrate. The agent provides a sealed toolkit. They coexist in the same container without a merge contract. Bunsen does not negotiate a combined environment from agent + experiment requirements; the agent ships everything it pins to specific versions (via install.deps and install.build), and the experiment provides the substrate the task needs (compilers, language runtimes the codebase under test depends on, services, apt packages). The two run side by side; the agent's PATH precedence wins for tools it ships.
This split keeps each unit addressable on its own (any-agent × any-experiment composes), and it removes the hidden environment variance that a merge surface inevitably introduces. The agent is a sealed closure that walks into whatever experiment image is supplied; experiments declare task substrate without negotiating with the agent.
- Experiment declares task substrate (environment.image, environment.requires.*, workspace.*).
- Agent declares its sealed toolkit (install.deps, install.build, install.configure). It does not declare runtime version requirements: if it needs Node, it ships Node.
- Bunsen builds/mounts the agent's deps + build artifacts, prepares the substrate image, runs setup phases, and records any cross-boundary binary shadows in the run manifest.
Asymmetric composition
The "any agent × any experiment" promise is honest because the agent isn't asking the experiment for anything. The agent walks in self-contained:
- If the substrate is bunsen/headless (Ubuntu 22.04 + Python 3.11 + Node 20), the agent's shipped runtimes shadow substrate ones for tools the agent invokes.
- If the substrate is a custom Dockerfile pointing at debian:bookworm-slim or a CUDA-heavy ML image, the agent's tools still run: same closure, different substrate.
- If the substrate is minimal Alpine, the agent works only if its closures are musl-targeted. Bunsen base images are glibc; agents that need Alpine portability must declare abi.libc: musl on the relevant deps.
The only cross-boundary signal is the structured cross-boundary-binary-shadow diagnostic recorded in the run manifest when an agent dep ships a binary that the substrate's apt layer also installs under the same name. That diagnostic is a record-and-proceed warning, not a build blocker — the agent's PATH precedence is the deterministic resolver.
The anti-contract
Bunsen base images happen to ship Node 20 and Python 3.11 because those are useful for install.configure shell scripts, the orchestrator, and the supervisor. Agents do not depend on this. An agent that needs a runtime ships its own via install.deps. That's what makes the same agent run against any experiment image — including custom Dockerfiles, Alpine, distroless images — without modification.
If you find yourself wanting to declare "this agent requires Node 20", the migration is "this agent ships Node 20 as a closure dep". See Shipping a language runtime in the cookbook.
The platform follows the same rule. Bunsen's own tools — the orchestrator, supervisor, and scorers — also need a Node interpreter inside the run container. On a Bunsen base image they use the image's Node 20 (a substrate Bunsen controls and pins). On a custom Dockerfile or non-bunsen base image, the platform does exactly what the anti-contract asks of agents: it ships its own Node as a closure dependency, mounted read-only at /bunsen/runtime/node. That binary is the official Node Linux tarball — the canonical closure example from the linkage taxonomy — resolved on demand and verified against a pinned sha256 (no per-image baking). Like any glibc closure it runs on every Bunsen base image and the common custom bases (debian/ubuntu/CUDA/distroless-glibc); a musl/Alpine base is not yet supported for the platform tools — the same abi.libc asymmetry agents face. See Platform Tools for the layered resolution + host-cache details.
Setup phase ordering
Bunsen's setup ordering is what makes large-seed experiments fast and predictable. Steps run after platform resolution; non-applicable steps are skipped.
- install.deps: cached, platform-keyed dep builds (each declared tool produces a tree at /bunsen/deps/<name>/).
- install.build: cached, platform-keyed agent artifact build. Sees install.deps mounted at /bunsen/deps/<name>/ and on PATH.
- Mount build artifacts, dep artifacts, and image-backed inputs into the run container.
- workspace.sources assembled into /workspace-source (read-only after assembly, world-readable, root-owned).
- Execution-user creation and ownership handoff: bunsen is created (skipped when environment.user: root); /workspace, /bunsen, /home/bunsen are chown'd while /workspace is still empty, so the chown is trivial.
- /workspace materialized from /workspace-source as the execution user, so files land owned by that user without any recursive chown -R over a populated tree.
- install.configure: fast per-run runtime config from the agent.
- workspace.setup: fast per-run workspace prep from the experiment.
- Agent execution.
- Evaluation against final /workspace plus immutable /workspace-source.
The ordering matters for large-seed experiments: large immutable seeds (gigabyte models, prebuilt build trees) never force a recursive chown -R over the materialized workspace, because materialization runs as the execution user and produces correctly-owned files in /workspace directly.
Conceptual precedent: devcontainer features
Bunsen's environment model — agents and experiments contributing pieces that compose into a shared run container — has its closest mainstream analog in devcontainer features. A devcontainer feature is a small YAML/JSON unit that adds a tool or capability: declared inline in devcontainer.json or pulled from a file or an OCI registry, composable with other features, multi-platform-aware.
What devcontainer features get right and Bunsen adopts:
- Declarative
- Composable
- Multi-source (inline, file, registry)
- Schema centered on what does this install rather than what's the full environment shape
What's different in Bunsen, and why it doesn't reuse them directly:
- Devcontainer features run at image-bake time; Bunsen mounts built artifacts at agent invocation time, so the same dep can compose against whichever experiment image the agent runs in.
- Devcontainer features are coupled to the VS Code / devcontainer ecosystem.
- Different consuming surface (devcontainer.json vs agent.yaml).
- Bunsen's runtime model is intentionally lighter: no image baking per agent.
Adjacent tools (asdf, mise, nix, pkgx, homebrew) solve pieces of this but are too opinionated about runtime or platform to slot in cleanly. Devcontainer features remain the cleanest conceptual precedent.
A worked pair: experiment.yaml + agent.yaml
A single bn run <experiment> <agent> pairs one of each. Here they are side by side for a small "fix the failing test" experiment run with Claude Code.
experiment.yaml (the task substrate):
$schema: https://schemas.bunsen.dev/experiment.v1.json
version: v1
name: fix-the-bug
task:
  prompt: |
    The test suite in /workspace is failing. Find and fix the bug.
workspace:
  sources:
    - path: ./workspace          # seeded repo, copied from the experiment dir
  setup:
    - run: cd /workspace && npm install
      timeout: 5m
environment:
  image:
    base: bunsen/headless
  requires:
    runtimes:
      node: ">=18"               # substrate the codebase-under-test needs
    packages:
      apt: [git]
  user: user                     # 'user' (default) or 'root'
run:
  timeout: 15m

agent.yaml (the sealed toolkit):
$schema: https://schemas.bunsen.dev/agent.v1.json
version: v1
name: claude-code
install:
  source:
    type: local
  build:
    image: ubuntu:22.04
    run:
      - |
        if ! command -v curl >/dev/null 2>&1; then
          apt-get update && apt-get install -y curl
        fi
        curl -fsSL https://claude.ai/install.sh | bash
        mkdir -p /output/bin
        cp "$HOME/.local/bin/claude" /output/bin/claude
        chmod +x /output/bin/claude
    timeout: 10m
  configure:
    - run: |
        if [ -n "$ANTHROPIC_API_KEY" ]; then
          MODEL="${ANTHROPIC_MODEL:-claude-sonnet-4-6}"
          printf '{"primaryApiKey":"%s","model":"%s"}\n' "$ANTHROPIC_API_KEY" "$MODEL" > ~/.claude.json
        fi
      as: root
      timeout: 2m
entrypoint:
  command: claude
  args:
    - -p
    - --dangerously-skip-permissions
    - --no-session-persistence
    - --output-format
    - text
    - --verbose
  help: claude --help
interaction:
  mode: direct
model:
  env: ANTHROPIC_MODEL
  default: claude-sonnet-4-6

The experiment knows nothing about Claude Code; the agent knows nothing about the test fixture. Bunsen prepares the substrate from the experiment, mounts the agent's closure on top, and runs them together.
Experiment Environment (experiment.yaml)
$schema: https://schemas.bunsen.dev/experiment.v1.json
version: v1
name: my-experiment
task:
  prompt: ...
workspace:
  sources:
    - path: ./workspace
    - imagePath: /app/reference.png
      target: reference.png
  setup:
    - run: cd /workspace && npm install
      timeout: 5m
environment:
  image:
    base: bunsen/headless
    # or:
    # dockerfile: ./Dockerfile
  requires:
    runtimes:
      python: "3.11"
      node: ">=18"
    packages:
      apt: [git, gcc, make]
      npm: [typescript]
      pip: [pytest, coverage]
  platforms: [linux/amd64]
  user: user                     # 'user' (default) or 'root'
run:
  timeout: 15m
  platform: auto                 # 'auto', 'linux/amd64', or 'linux/arm64'
  artifactCaptureTimeout: 5m

The environment.requires block declares substrate: the runtimes and packages the task depends on (e.g. the Python the codebase-under-test imports, the apt build deps for the project's native extensions). It is not a contract negotiated with the agent. The agent operates on top of this without merging into it.
See the full field-level reference in experiment.yaml Reference.
Workspace sources
Initial immutable workspace inputs are declared under workspace.sources and assembled into /workspace-source before the agent runs. /workspace is then materialized from this snapshot.
workspace:
  sources:
    - path: ./workspace
    - imagePath: /app/reference.png
      target: reference.png

Rules:
- Each entry declares exactly one of path or imagePath.
- path refers to a file or directory in the experiment repo (resolved relative to experiment.yaml).
- imagePath refers to a file or directory already present in the built image.
- target is an optional relative destination inside the workspace. Defaults: basename for files; directory contents merge into the workspace root.
- Sources are applied in declared order; path collisions fail validation.
- /workspace-source is always created (empty when no sources are declared), so scorers can rely on its presence.
- /workspace-source is part of the public scorer contract (see Scorers & Evaluation).
- A workspace: block with no sources is valid. There is no implicit auto-include; any directory used as a workspace source must be declared explicitly.
Workspace setup
workspace.setup is an ordered list of per-run shell commands run after /workspace has been materialized. Each step uses the shared step shape:
workspace:
  setup:
    - run: npm install
      as: user         # 'user' (default) or 'root'
      timeout: 5m      # Duration string; default 5m per step

If a step needs root, set as: root on that step or set environment.user: root for the whole experiment (see Running as Root (environment.user)).
Step variants: run and writeFile
workspace.setup (and install.configure) steps share the same shape: each step is one of two variants — a run step or a writeFile step.
- run, with shape {run, as?, timeout?}: executes a shell command.
- writeFile, with shape {writeFile, from?|content?, as?, timeout?}: drops a file at a path inside the container (parent directories are auto-created; existing files are overwritten; mode 644). Set exactly one of from (a path relative to the directory holding experiment.yaml / agent.yaml, copied from the host) or content (inline UTF-8, no env interpolation). The writeFile target path supports shell variable expansion (e.g. $BUNSEN_WORKSPACE_DIR/config.json). writeFile steps default to a 30 s timeout.
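A rough sketch of how the two step variants could be told apart, assuming steps arrive as plain mappings parsed from YAML. The function name classify_step is hypothetical and the checks mirror only the rules stated above; the real schema validation lives in the hosted JSON schemas.

```python
def classify_step(step):
    """Return 'run' or 'writeFile' for a setup/configure step, or raise."""
    is_run = "run" in step
    is_write = "writeFile" in step
    # A step is one variant or the other, never both and never neither.
    if is_run == is_write:
        raise ValueError("a step must be exactly one of run or writeFile")
    # writeFile needs exactly one content origin: from or content.
    if is_write and ("from" in step) == ("content" in step):
        raise ValueError("writeFile needs exactly one of from or content")
    return "run" if is_run else "writeFile"


print(classify_step({"run": "npm install", "as": "user"}))
print(classify_step({"writeFile": "$BUNSEN_WORKSPACE_DIR/config.json", "content": "{}"}))
```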
Environment
| Field | Type | Default | Description |
|---|---|---|---|
| image.base | string | bunsen/headless | Bunsen base image to start from. Mutually exclusive with image.dockerfile. |
| image.dockerfile | string | - | Path to a custom Dockerfile (relative to experiment.yaml). Mutually exclusive with image.base. |
| requires.runtimes | Record<RuntimeName, VersionSpec> | - | Substrate runtime names the task needs. Parsed, validated, and logged to the run log; the base image supplies the actual runtime (see "Substrate runtime syntax"). |
| requires.packages | PackageSpecs (apt, npm, pip, cargo) | - | Substrate packages installed during image preparation (skipped for Dockerfile experiments). apt/npm/pip are installed; declare cargo dependencies via install.build instead (see "Packages and Dockerfiles"). |
| platforms | RunPlatform[] | - | Restricts the supported execution platforms. If exactly one entry, Bunsen auto-selects it. |
| user | 'user' \| 'root' | 'user' | Execution user inside the agent container. The default 'user' runs as a non-root bunsen user; 'root' skips non-root user creation entirely (see Running as Root). |
Run
| Field | Type | Default | Description |
|---|---|---|---|
| timeout | duration string | 15m | Overall agent timeout. |
| onTimeout | score \| fail | fail | What to do when the agent hits timeout. score reaps the agent's process tree (so the captured workspace is stable), then runs evaluation against whatever it left; the run completes, flagged extensions.timed_out: true. Right for open-ended, fixed-budget tasks. fail (default) fails the run. |
| platform | auto \| linux/amd64 \| linux/arm64 | auto | Per-experiment platform preference (see Platforms & Architecture). |
| artifactCaptureTimeout | duration string | 2m | Post-run artifact capture (diff, tar export, log retrieval). |
Substrate runtime syntax
requires.runtimes values are parsed, validated, and logged to the run log; the base image supplies whatever runtime it ships (Node 20, Python 3.11). Version constraints do not change the container image, and Bunsen does not switch runtimes (nvm, pyenv, rustup, etc.). If your task needs a specific runtime version that the base image does not ship, supply it via a custom Dockerfile, or have the agent ship it as a dep.
Packages and Dockerfiles
requires.packages installs apt, npm, and pip during image preparation. Declare cargo dependencies via install.build steps instead.
For experiments with a custom Dockerfile, requires.packages is ignored during image preparation. Install dependencies in the Dockerfile itself; install.configure is for fast runtime-only config, not for installing tooling.
Agent (agent.yaml)
Agents are sealed closures. agent.yaml declares the agent's source, its dep tree (install.deps), an optional cached build phase (install.build), and fast per-run wiring (install.configure). There is no runtime requirements block — the agent ships any runtime it pins to a specific version as a dep.
See the hosted agent.v1.json schema and agent.yaml Reference for the canonical schema.
$schema: https://schemas.bunsen.dev/agent.v1.json
version: v1
name: claude-code
install:
  source:
    type: local
  build:
    image: ubuntu:22.04
    run:
      - |
        if ! command -v curl >/dev/null 2>&1; then
          apt-get update && apt-get install -y curl
        fi
        curl -fsSL https://claude.ai/install.sh | bash
        mkdir -p /output/bin
        cp "$HOME/.local/bin/claude" /output/bin/claude
        chmod +x /output/bin/claude
    timeout: 10m
    network: default
    cacheSalt: claude-code-build
  configure:
    - run: |
        if [ -n "$ANTHROPIC_API_KEY" ]; then
          MODEL="${ANTHROPIC_MODEL:-claude-sonnet-4-6}"
          printf '{"primaryApiKey":"%s","model":"%s"}\n' "$ANTHROPIC_API_KEY" "$MODEL" > ~/.claude.json
        fi
      as: root
      timeout: 2m
entrypoint:
  command: claude
  args:
    - -p
    - --dangerously-skip-permissions
    - --no-session-persistence
    - --output-format
    - text
    - --verbose
  help: claude --help
interaction:
  mode: direct
# Declares the env var the harness reads its model from, so `bn run --model
# <id>` (and the `default` below) can target it without a per-model variant.
model:
  env: ANTHROPIC_MODEL
  default: claude-sonnet-4-6

Fields
| Field | Type | Default | Description |
|---|---|---|---|
| install.source | InstallSource (local/git/npm/binary) | required (no default) | Where the agent code comes from. |
| install.deps | AgentDepSpec[] | - | Declarative tool dependencies. Each entry produces a read-only mount at /bunsen/deps/<name>/. See "Install Deps". |
| install.build | BuildConfig | - | Cached artifact build phase (produces read-only /bunsen/artifacts mount). Runs after install.deps. |
| install.build.image | string | - | Docker image used to run the build. |
| install.build.run | string[] | - | Ordered build commands. |
| install.build.timeout | duration string | 10m | Build timeout. |
| install.build.network | "default" \| "none" | default | Build network mode. |
| install.build.cacheSalt | string | - | Manual cache-bust knob. |
| install.configure | StepConfig[] | - | Fast per-run runtime configuration steps. Each step is either a run step ({run, as?, timeout?}) or a writeFile step ({writeFile, from?\|content?, as?, timeout?}, 30 s default timeout); see "Step variants: run and writeFile". |
| entrypoint.command | string | - | Executable invoked at run start. |
| entrypoint.args | string[] | - | Guaranteed argv tokens appended to every invocation. |
| entrypoint.help | string | - | Help command consulted by the orchestrator. |
| interaction.mode | "direct" \| "supervised" | required (no default) | Run-loop mode (see Supervised Mode). |
| model.env | string | - | Env var the harness reads its model id from (e.g. ANTHROPIC_MODEL). Declaring it enables bn run --model <id>. See "Model selection". |
| model.default | string | - | Model id used when --model is not passed. Seeds model.env at the agent-defaults tier. |
| defaults.env | Record<string, string> | - | Default env merged into the container before variant defaults and CLI overrides. |
| defaults.passEnv | string[] | - | Host env var names this agent allows through (host passthrough allowlist). |
Model selection
The model is an orthogonal axis from the variant. An agent declares the env var
its harness reads the model from in the top-level model block; the model id
itself is chosen at the command line:
bn run fix-the-bug claude-code --model claude-opus-4-8
bn run fix-the-bug gemini-cli --model gemini-2.5-flash
bn run fix-the-bug claude-code:headed --model claude-opus-4-8   # model ⟂ variant

--model <id> sets the agent's declared model.env variable. It rides the CLI
--env tier (precedence 7 below), so it overrides a model baked into a selected
variant; with no flag, model.default seeds the same variable at the
agent-defaults tier (precedence 2). The value the agent was configured with is
recorded on the run manifest (agent.model), distinct from agent.models, which
is what actually ran (observed from captured traces).
The model env var name is harness-specific — ANTHROPIC_MODEL, CODEX_MODEL,
GEMINI_MODEL, and so on — which is exactly why each agent declares it. The
harness consumes that variable directly, or via the config file the agent's
install.configure step generates from it. An agent that exposes no model knob
(a no-AI test agent, or a harness that routes models server-side) simply omits
the model block; --model is then rejected with a clear error.
Because model is its own axis, variants are behavioral overlays (run mode,
output format, turn caps, system prompts) rather than per-model duplicates — see
agent.yaml Reference for variant authoring. A variant should
pin a model only when its behavior genuinely requires one — e.g. claude-code's
auto variant, whose permission-mode auto classifier is only supported on a
specific model. When --model is passed alongside such a variant, the CLI wins
and prints a notice that the variant's model was overridden.
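The layering described in this section (model.default at the agent-defaults tier, a variant's pinned model above it, --model on top) can be sketched as follows. This is a simplified model under stated assumptions: resolve_model_env is a hypothetical name, only three of the precedence tiers are represented, and the variant is reduced to a plain env mapping.

```python
def resolve_model_env(model_block, variant_env, cli_model):
    """Return {env_var: model_id} after applying default < variant < --model."""
    if model_block is None:
        # No model knob declared: --model has nothing to target.
        if cli_model is not None:
            raise ValueError("--model passed but the agent declares no model block")
        return {}
    env_var = model_block["env"]
    resolved = {}
    if "default" in model_block:
        resolved[env_var] = model_block["default"]  # agent-defaults tier
    if env_var in variant_env:
        resolved[env_var] = variant_env[env_var]    # variant overlay
    if cli_model is not None:
        resolved[env_var] = cli_model               # CLI tier wins
    return resolved


block = {"env": "ANTHROPIC_MODEL", "default": "claude-sonnet-4-6"}
print(resolve_model_env(block, {}, None))
print(resolve_model_env(block, {}, "claude-opus-4-8"))
```

The first call falls back to model.default; the second shows the command-line value replacing it.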
Build artifacts and PATH
- Build outputs are written to /output in the build container.
- Preferred convention: executables in /output/bin.
- install.build outputs are mounted read-only at /bunsen/artifacts in run containers.
- Each install.deps entry is mounted read-only at /bunsen/deps/<name>/.
- Bunsen builds PATH as /bunsen/artifacts/bin : /bunsen/artifacts : /bunsen/deps/<dep1>/bin : /bunsen/deps/<dep2>/bin : … : $PATH (the agent's own build artifacts win, then deps in declared order, then substrate). The same PATH applies for:
  - install.configure
  - workspace.setup
  - agent execution
  - install.build itself (so the agent's build script can use any binary a dep provides)
Scorers do not run with this deps-prefixed PATH: the scorer engine is a platform tool, and the commands it runs (a script criterion, an agentic scorer's run_command) get the container's baseline PATH. evaluation.container: agent preserves the agent's filesystem/process state (installed packages, running services, final /workspace), not its closure-dep PATH — see Scoring in the Agent Container and Environment Internals.
This precedence is what makes asymmetric composition deterministic: tools the agent ships always shadow substrate-installed binaries with the same name. The cross-boundary shadow detector records each shadowing in the run manifest (see Run Manifest & Events).
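The PATH layout described above can be sketched in a few lines. The function name compose_path is hypothetical and the substrate PATH in the example is an assumed value; only the ordering of entries comes from this page.

```python
def compose_path(dep_names, substrate_path):
    """Build the run-container PATH in the documented precedence order."""
    # The agent's own build artifacts come first.
    entries = ["/bunsen/artifacts/bin", "/bunsen/artifacts"]
    # Deps follow in declared order, each contributing its bin directory.
    entries += [f"/bunsen/deps/{name}/bin" for name in dep_names]
    # The substrate's own PATH comes last, so agent-shipped tools shadow it.
    entries.append(substrate_path)
    return ":".join(entries)


print(compose_path(["ripgrep", "node"], "/usr/local/bin:/usr/bin:/bin"))
```

Because lookup stops at the first match, a node binary under /bunsen/deps/node/bin is found before any node the substrate installs.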
Install Deps (install.deps)
install.deps lets agent authors declare the CLIs, language runtimes, and tools their agent needs without burying the agent's identity under packaging boilerplate. Each entry produces an artifact tree mounted at /bunsen/deps/<name>/ and is built once per (name, version, target, image, network, timeout, run, provides, linkage, abi, requires).
For copy-pasteable recipes (GitHub release binaries, archives, bundled Node/Python, shipping a runtime, Alpine/musl), see the Agent Dependencies Cookbook.
Linkage taxonomy
Every dep falls into one of three categories. Marking them explicitly with linkage makes cross-image expectations honest and informs the build cache key.
- static: the binary contains everything including its libc. Drop it anywhere with the right CPU arch. Examples: ripgrep musl build, jq, pure Go binaries. No abi block.
- closure: self-contained except for libc. The dominant case for language-runtime agents (Node, Python, Ruby). Examples: Bun-compiled native binaries, Astral's python-build-standalone, the official Node Linux tarballs. Requires abi.libc (glibc or musl) and optionally a version range.
- dynamic: depends on substrate libraries beyond libc. The author must declare expected libraries via requires.libraries. Reach for closure when possible; dynamic should be rare and explicit.
When linkage is omitted, it is recorded as null in the cache key (portability unknown). New deps should declare linkage explicitly.
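The three linkage categories imply a handful of consistency rules. The sketch below collects them into one checker, assuming a dep is a plain mapping parsed from YAML. The name linkage_errors and the error strings are hypothetical; the actual validation is defined by the agent.v1.json schema.

```python
def linkage_errors(dep):
    """Collect violations of the linkage rules for one dep mapping."""
    linkage = dep.get("linkage")  # None means portability unknown
    abi = dep.get("abi") or {}
    libraries = (dep.get("requires") or {}).get("libraries")
    errors = []
    if linkage not in (None, "static", "closure", "dynamic"):
        errors.append(f"unknown linkage: {linkage}")
    # static binaries carry their own libc, so no abi or library claims.
    if linkage == "static" and (abi or libraries):
        errors.append("static deps must not declare abi or requires.libraries")
    # closure and dynamic deps target a specific substrate libc.
    if linkage in ("closure", "dynamic") and abi.get("libc") not in ("glibc", "musl"):
        errors.append("abi.libc must be glibc or musl")
    # dynamic deps must name the substrate libraries they expect.
    if linkage == "dynamic" and not libraries:
        errors.append("dynamic deps must list requires.libraries")
    return errors


print(linkage_errors({"name": "node", "linkage": "closure"}))
```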
Authoring shape
install:
  source:
    type: local
  deps:
    - name: ripgrep
      version: "14.1.1"
      image: debian:bookworm-slim
      linkage: static
      provides:
        binaries: [rg]
      install:
        - target: linux/amd64
          run:
            - apt-get update -qq && apt-get install -y -qq --no-install-recommends curl ca-certificates
            - curl -fsSL https://github.com/BurntSushi/ripgrep/releases/download/14.1.1/ripgrep-14.1.1-x86_64-unknown-linux-musl.tar.gz | tar xz -C /tmp
            - cp /tmp/ripgrep-14.1.1-x86_64-unknown-linux-musl/rg /output/bin/rg
            - chmod +x /output/bin/rg
        - target: linux/arm64
          run: [...]
    - name: node
      version: "20.18.1"
      image: debian:bookworm-slim
      linkage: closure
      abi:
        libc: glibc
        libc_version: ">=2.28"
      provides:
        binaries: [node, npm, npx]
      install: [...]

| Field | Required | Description |
|---|---|---|
| name | yes | Kebab-case identifier. Used as the mount path under /bunsen/deps/. |
| version | no | Recorded in the run manifest and included in the cache key. Recommended for reproducibility. |
| description | no | Human-readable docs. |
| image | dep- or per-target | Docker image used to run the install commands. Not the image the binary runs in at experiment time; see "Build image vs. experiment image" below. Either declared on the dep (default for every target) or on each install[] entry. |
| linkage | recommended | static, closure, or dynamic. Drives portability expectations and is included in the cache key. |
| abi.libc | for closure/dynamic | glibc or musl. The substrate libc the artifact targets. Forbidden on static. |
| abi.libc_version | no | Optional version range. Recorded; not enforced. |
| requires.libraries | for dynamic | List of {name, version?} substrate libraries the dep depends on. Forbidden on static. |
| provides.binaries | no | Bare names of executables expected under /output/bin/. Verified at build time; used for conflict detection across deps and for cross-boundary shadow diagnostics. |
| install[].target | yes | One of linux/amd64, linux/arm64. Each target appears at most once. |
| install[].run | yes | Ordered shell commands. The build container starts with /output/bin precreated; write artifacts to /output/.... |
| install[].image | no | Overrides the dep-level image for this specific target. |
| install[].network | no | default (online) or none. Defaults to default. |
| install[].timeout | no | Duration string (e.g. 10m). Defaults to 10 minutes. |
File reference (lightweight reuse)
When the same dep is used by several agents in the same project, pull its spec into its own file and reference it:
install:
  deps:
    - file: ./shared-deps/ripgrep-14.yaml
    - name: my-other-tool
      install: [...]

Resolution rule: the file path is resolved relative to the referring agent.yaml. No project-root search, no magic. Inline and file-referenced deps may be mixed freely.
The referenced file contains exactly the same name/version/linkage/abi/install shape as the inline form. Nested file references are rejected.
Runtime contract
- Each dep's artifacts mount read-only at /bunsen/deps/<name>/.
- /bunsen/deps/<name>/bin is placed on PATH after /bunsen/artifacts/bin and ahead of the substrate's own PATH (in declared dep order). The agent's own install.build artifacts win on collisions.
- install.deps resolve and build/mount before install.build runs. The agent's build script can shell out to any binary a dep provides.
- The provides.binaries list is verified at build time: missing binaries fail the build loudly, preventing silent install regressions.
- The run manifest records each resolved dep's (name, version, cache_key, binaries) for reproducibility.
Cross-boundary binary shadow diagnostic
When an agent dep ships a binary whose name matches a substrate apt package the experiment installs, Bunsen records a structured diagnostic in the run manifest:
{
  "diagnostic": "cross-boundary-binary-shadow",
  "binary": "rg",
  "winner": { "source": "agent-dep", "name": "ripgrep", "version": "14.1.1" },
  "shadowed": { "source": "substrate-apt", "name": "rg" },
  "resolution": "agent dep wins on PATH (deterministic precedence: /bunsen/artifacts/bin → /bunsen/deps/<name>/bin → substrate)."
}

This is record-and-proceed, not a build blocker. The agent's PATH precedence is the resolver; the diagnostic is recorded in the run manifest so the shadowing is captured for inspection instead of silently corrupting cross-run comparisons. Detection is by name: an apt package whose installed binary has a different name than the package itself won't be caught.
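A name-only match like the one described above might look like the sketch below. It is not Bunsen's detector: detect_shadows is a hypothetical name, the git-static dep and its version are invented for illustration, and the resolution field of the real diagnostic is omitted for brevity.

```python
def detect_shadows(deps, apt_packages):
    """Return one diagnostic per dep binary whose name equals an apt package name."""
    apt_names = set(apt_packages)
    diagnostics = []
    for dep in deps:
        for binary in dep.get("provides", {}).get("binaries", []):
            if binary in apt_names:  # name-only comparison
                diagnostics.append({
                    "diagnostic": "cross-boundary-binary-shadow",
                    "binary": binary,
                    "winner": {"source": "agent-dep", "name": dep["name"],
                               "version": dep.get("version")},
                    "shadowed": {"source": "substrate-apt", "name": binary},
                })
    return diagnostics


deps = [
    # Hypothetical dep, used only to show a name collision.
    {"name": "git-static", "version": "2.45.0", "provides": {"binaries": ["git"]}},
    {"name": "ripgrep", "version": "14.1.1", "provides": {"binaries": ["rg"]}},
]
found = detect_shadows(deps, ["git", "ripgrep"])
print([d["binary"] for d in found])  # prints ['git']
```

The apt package ripgrep installs a binary named rg, so a comparison against package names does not flag it. That is the limitation the paragraph above describes.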
Cache invalidation
Each dep is keyed by (name, version, target, image, network, timeout, run, provides, linkage, abi, requires). Editing any of those — including the install[].run command list — automatically changes the cache key and forces a rebuild on the next bn run / bn agents build. There is no manual cache-bust knob to flip and no cacheSalt field on deps: change the inputs, get a fresh build.
Note that an inline dep's version is only metadata for the cache key and run manifest — it does not pin the version that gets downloaded. If you bump version but leave the URL in install[].run unchanged, the cache rebuilds with the same binary. Either change both (when bumping the upstream version) or leave both alone (when iterating on shell-only details that don't affect the artifact).
A changed dep also invalidates the dependent install.build cache, because the dep's cache key is part of install.build's key.
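One way to picture a content-derived key is to hash the listed fields in a canonical encoding. The sketch below is an assumption about the general technique, not Bunsen's actual key derivation: the names KEY_FIELDS and dep_cache_key, the JSON encoding, and the choice of SHA-256 are all illustrative.

```python
import hashlib
import json

# The fields this page lists as inputs to a dep's cache key.
KEY_FIELDS = ("name", "version", "target", "image", "network", "timeout",
              "run", "provides", "linkage", "abi", "requires")


def dep_cache_key(dep):
    """Hash the documented key fields; absent fields are recorded as None."""
    material = {field: dep.get(field) for field in KEY_FIELDS}
    encoded = json.dumps(material, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


base = {"name": "ripgrep", "version": "14.1.1", "target": "linux/amd64",
        "image": "debian:bookworm-slim", "linkage": "static",
        "run": ["cp /tmp/rg /output/bin/rg"]}
bumped = {**base, "version": "14.1.2"}
print(dep_cache_key(base) != dep_cache_key(bumped))  # prints True
```

In this sketch a version bump alone changes the key even though the run commands, and therefore the downloaded binary, stay the same, which matches the caveat above about keeping version and URL in step.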
Build image vs. experiment image
The dep's image is the container in which the install commands execute. It is not the container the dep's binary runs in at experiment time — that's the experiment's image (environment.image.base or the experiment's Dockerfile). The artifacts that the install commands write to /output/ get mounted into the experiment container at run time; only the binaries cross the boundary, not the build image.
What this means for the author: compatibility runs through the binary's ABI, not the build image. The linkage field above is the contract.
A good model is a ripgrep dep that publishes per-target builds:
- linux/amd64 target → downloads the x86_64-unknown-linux-musl build (linkage: static, runs anywhere).
- linux/arm64 target → downloads the aarch64-unknown-linux-gnu build (glibc-static on every glibc base Bunsen runs on).
So the question to ask when picking image is: "does this image have the tools I need to produce the right shape of binary?" Not "does this image match the experiment image."
Conflict detection
When two declared deps claim the same provides.binaries entry, Bunsen fails fast before any build runs. The error names every contributor and its version:
install.deps conflict detected:
  - binary "rg" is provided by multiple deps: ripgrep@14.1.1, ripgrep-mirror@13.0.0

Each binary may be provided by at most one dep; drop or rename the duplicate. (Substrate apt packages are not errors; they generate the diagnostic above.)
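The fail-fast check above amounts to grouping deps by the binaries they claim. A minimal sketch, assuming deps are plain mappings; find_binary_conflicts is a hypothetical name and the "unversioned" label is an invented fallback for deps without a version.

```python
from collections import defaultdict


def find_binary_conflicts(deps):
    """Map each binary claimed by more than one dep to its contributors."""
    providers = defaultdict(list)
    for dep in deps:
        label = f"{dep['name']}@{dep.get('version', 'unversioned')}"
        for binary in dep.get("provides", {}).get("binaries", []):
            providers[binary].append(label)
    # Only binaries with two or more providers are conflicts.
    return {b: who for b, who in providers.items() if len(who) > 1}


deps = [
    {"name": "ripgrep", "version": "14.1.1", "provides": {"binaries": ["rg"]}},
    {"name": "ripgrep-mirror", "version": "13.0.0", "provides": {"binaries": ["rg"]}},
    {"name": "node", "version": "20.18.1", "provides": {"binaries": ["node", "npm", "npx"]}},
]
conflicts = find_binary_conflicts(deps)
for binary, who in conflicts.items():
    print(f'binary "{binary}" is provided by multiple deps: {", ".join(who)}')
```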
Configure vs Workspace Setup
| Phase | Field | Runs as | Default timeout | Purpose |
|---|---|---|---|---|
| Agent artifact build | install.build | root (build container) | 10m (install.build.timeout) | Build/download agent artifacts once and cache. |
| Agent runtime configure | install.configure | root by default; per-step as: allowed | 2m per step | Fast runtime config (env-based files, links). |
| Workspace setup | workspace.setup | execution user by default; per-step as: allowed | 5m per step | Per-run workspace prep (npm install, pip install -e .). |
Rules for install.configure:
- Keep it fast and deterministic.
- Use it for env-dependent config files, symlinks, permissions.
- Do not install / download dependencies (use install.deps or install.build, or the experiment's Dockerfile).
Build Cache Operations
Use these commands to manage install.build artifacts:
# Build artifacts ahead of time
bn agents build claude-code
bn agents build claude-code --platform linux/amd64
# Force rebuild and bypass the cache
bn agents build claude-code --rebuild
bn run fix-the-bug claude-code --rebuild-agent
# Inspect and clean the cache
bn cache list
bn cache rm <cache-key>
bn cache prune --force

See Platforms & Architecture for how Bunsen chooses a single platform for image prep, platform runtimes, helper containers, and artifact cache keys.
Resolution Logic
When bn run <experiment> <agent> executes:
- Resolve substrate from the experiment alone: default runtimes (node: "20", python: "3.11") overlaid with environment.requires.runtimes, and substrate environment.requires.packages.
- Prepare the substrate image (base image + apt/npm/pip installs).
- Build (or fetch from cache) every install.deps entry in declared order.
- Build (or fetch from cache) the agent's install.build artifacts. The dep tree is mounted and on PATH before this step runs.
- Detect any cross-boundary binary shadows (agent dep binary names that match substrate apt package names) and record them as diagnostics in the run manifest. Non-blocking.
- Mount the agent dep trees and build artifacts read-only into the run container.
- Run install.configure (agent-side per-run wiring).
- Run workspace.setup (experiment-side per-run wiring).
- Execute the agent.
There is no agent/experiment runtime negotiation, no version intersection, no package merge. The agent walks in self-contained; the substrate provides whatever it provides.
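The first resolution step, defaults overlaid with the experiment's declared runtimes, reduces to a dictionary merge. The sketch below is illustrative only; DEFAULT_RUNTIMES and resolve_substrate_runtimes are hypothetical names. Note that the agent is not an input, which is the "no negotiation" point in code form.

```python
# Default substrate runtimes as stated in the resolution steps above.
DEFAULT_RUNTIMES = {"node": "20", "python": "3.11"}


def resolve_substrate_runtimes(requires_runtimes=None):
    """Overlay the experiment's declared runtimes on the documented defaults."""
    return {**DEFAULT_RUNTIMES, **(requires_runtimes or {})}


print(resolve_substrate_runtimes({"node": ">=18"}))
```

As the "Substrate runtime syntax" section notes, the resolved values are validated and logged; they do not switch the runtime the image actually ships.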
Docker Images
Bunsen base images
| Image | Contents |
|---|---|
| bunsen/headless | Ubuntu 22.04, Python 3.11, Node.js 20, tmux, asciinema |
| bunsen/visual | Headless + Playwright/Chromium |
| bunsen/desktop | Full desktop environment |
Bunsen base images happen to ship Node 20 and Python 3.11; those runtimes exist for the orchestrator, the supervisor, and install.configure shell scripts. Agents do not depend on them: an agent that needs a runtime ships its own via install.deps.
Custom Dockerfiles
If an experiment provides a Dockerfile (environment.image.dockerfile), it takes precedence over image.base.
- Dockerfile experiments skip package-layer installs from requires.packages.
- Dockerfile experiments can provide immutable starter files via explicit workspace.sources[] imagePath entries (for example imagePath: /workspace/reference.png with target: reference.png).
- install.configure and workspace.setup still run at container start.
- install.build still works (artifacts are mounted at runtime, not baked into image layers).
For benchmark design, use that split intentionally:
- Prebuild expensive immutable artifacts in the Docker image when the verifier does not require the agent to produce them.
- Seed those artifacts into /workspace with workspace.sources[].
- Leave expensive work in-run only when that expensive work is the thing being benchmarked.
Scoring Contract
Scorer containers (dedicated or agent-shared) receive both:
- /workspace: a mutable copy (or live tree, in agent-container mode) of the agent's final workspace.
- /workspace-source: an immutable snapshot of the initial seeded inputs.
Use /workspace-source in verifiers when checking original fixtures or seeded inputs, and /workspace when checking agent-authored outputs or final workspace state. See Scorers & Evaluation and Scoring in the Agent Container.
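As an illustration of that split, a verifier might compare a seeded fixture against its immutable original and separately check for an agent-authored output. The sketch below is a minimal example under assumptions: the helper names are hypothetical, and how a scorer reports its result to Bunsen is not shown because it is not described here.

```python
import hashlib
from pathlib import Path

# Documented mount points inside the scorer container.
WORKSPACE = Path("/workspace")
SOURCE = Path("/workspace-source")


def sha256_of(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def fixture_unchanged(rel_path, workspace=WORKSPACE, source=SOURCE):
    """True if a seeded file still matches its immutable original."""
    original = Path(source) / rel_path
    final = Path(workspace) / rel_path
    if not (original.is_file() and final.is_file()):
        return False
    return sha256_of(original) == sha256_of(final)


def output_present(rel_path, workspace=WORKSPACE):
    """True if the agent left a file at this path in its final workspace."""
    return (Path(workspace) / rel_path).is_file()
```

A check such as `fixture_unchanged("tests/fixture.json")` reads the original from /workspace-source, so an agent that edits the fixture in /workspace cannot make the comparison pass by accident. The file name here is only an example.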
Setting evaluation.container: agent runs scorers inside the agent's own container; the default, dedicated, uses a separate scorer container. The narrative report is configured at evaluation.report and is not a criterion. Both are documented in Scorers & Evaluation.
Environment Variables
Environment variables are merged from several sources; later sources win:
1. bunsen.config.yaml → defaults.env
2. agent.yaml → defaults.env
3. experiment.yaml → env
4. Selected agent variant's defaults.env
5. Selected experiment variant's env
6. CLI --env-file files (in order)
7. CLI -e/--env flags
8. Platform-reserved BUNSEN_* vars (immutable; collisions are rejected)
bn run --model <id> is sugar over this list: it sets the agent's declared model.env variable at the CLI --env tier (7), while the declared model.default contributes at the agent-defaults tier (2). An explicit --env <model.env>=... still wins over --model (it lands later in the flag list). See "Model selection".
Host passthrough only happens through explicit passEnv (project, agent, experiment, or --pass-env on the CLI). The major LLM provider API keys (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GEMINI_API_KEY) are allowlisted by default.
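The merge order listed earlier in this section amounts to a later-wins fold over ordered layers, with the reserved tier applied last. The following is a minimal sketch under assumptions: `merge_env`, the layer labels, and the variable names `LOG_LEVEL` and `EXAMPLE_MODEL` are hypothetical, and the real parsers may reject reserved keys at a different stage.

```python
RESERVED_PREFIX = "BUNSEN_"


def merge_env(layers, reserved):
    """Merge (source_name, env_dict) layers in order; later layers win."""
    merged = {}
    for source_name, env in layers:
        for key in env:
            # User-supplied layers may not define reserved names.
            if key.startswith(RESERVED_PREFIX):
                raise ValueError(f"{source_name}: {key} is reserved")
        merged.update(env)
    # Final tier: platform-reserved values are applied last.
    merged.update(reserved)
    return merged


layers = [
    ("bunsen.config.yaml defaults.env", {"LOG_LEVEL": "info"}),
    ("agent.yaml defaults.env", {"EXAMPLE_MODEL": "model-a"}),  # model.default tier
    ("experiment.yaml env", {"LOG_LEVEL": "debug"}),
    ("agent variant defaults.env", {}),
    ("experiment variant env", {}),
    ("--env-file", {}),
    ("--env", {"EXAMPLE_MODEL": "model-b"}),  # where --model would land
]
env = merge_env(layers, {"BUNSEN_RUN_ID": "run-123"})
print(env)
```

In this example the CLI value for the hypothetical `EXAMPLE_MODEL` replaces the agent default, which is the same precedence that lets --model override model.default.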
Reserved BUNSEN_* names
The runtime injects these; user config cannot override them (parsers reject BUNSEN_* keys in env / passEnv blocks):
- BUNSEN_RUN_ID, BUNSEN_EXPERIMENT, BUNSEN_AGENT
- BUNSEN_EXPERIMENT_VARIANT, BUNSEN_AGENT_VARIANT (set only when selected)
- BUNSEN_WORKSPACE_DIR (/workspace)
- BUNSEN_WORKSPACE_SOURCE_DIR (/workspace-source)
- BUNSEN_OUTPUT_DIR (/bunsen/output)
- BUNSEN_TASK_FILE (/bunsen/task/prompt.md), BUNSEN_TASK_DIR (/bunsen/task)
- BUNSEN_RUN_DIR (/bunsen/run)
- BUNSEN_AGENT_HOME (/home/bunsen for non-root runs, /root when environment.user: root). Use this in install.configure scripts to write user-level config files ($BUNSEN_AGENT_HOME/.codex/config.toml, $BUNSEN_AGENT_HOME/.claude.json, etc.) without needing to know the execution user. The runtime chowns this directory to the execution user after install.configure finishes.
- BUNSEN_PLATFORM (resolved run platform)
- BUNSEN_RUN_TIMEOUT (the run's total wall-clock budget as a human-readable duration string, e.g. 30m, resolved after any variant override; set only when the run has a timeout). Lets the agent budget its work; note that it is the total budget, not elapsed or remaining time.
- BUNSEN_SUITE_ID, BUNSEN_SUITE_VERSION (set only when running via a suite)
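An agent wrapper could read these variables along the following lines. This is a sketch only: the helper names are hypothetical, and the duration grammar is an assumption (a single integer followed by s, m, or h), since the only documented example is 30m. Values outside that assumed form are treated as unknown rather than guessed.

```python
import os
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_run_timeout(value):
    """Return the budget in seconds, or None if unset or not in the assumed form."""
    if not value:
        return None
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        return None
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def read_run_context(environ):
    """Collect a few reserved variables; optional ones may be missing."""
    return {
        "run_id": environ.get("BUNSEN_RUN_ID"),
        "workspace": environ.get("BUNSEN_WORKSPACE_DIR", "/workspace"),
        "task_file": environ.get("BUNSEN_TASK_FILE", "/bunsen/task/prompt.md"),
        "timeout_seconds": parse_run_timeout(environ.get("BUNSEN_RUN_TIMEOUT")),
    }


context = read_run_context(os.environ)
```

Because BUNSEN_RUN_TIMEOUT is the total budget, an agent that wants remaining time has to record its own start time and subtract.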
See also
- How Bunsen Works: the end-to-end run lifecycle.
- experiment.yaml Reference and agent.yaml Reference: full field-level schemas.
- Agent Dependencies Cookbook: copy-pasteable install.deps recipes.
- Running as Root (environment.user): when and how to run as root.
- Platforms & Architecture and Packages & Schemas.
- Glossary: definitions of agent under test, platform agents, substrate, criterion, scorer, and verifier.