Run Manifest & Events

Every Bunsen run records two public artifacts in its run directory:

  • manifest.json — the canonical, on-disk end-state summary of the run. Its shape is RunManifestV1 (exported from @bunsen-dev/types).
  • events.jsonl — the ordered, append-only execution timeline. One JSON object per line, each a RunEvent (also from @bunsen-dev/types).

Both are stable public contract. The manifest is the "what is the final state" record; the event log is the "what happened, in order" record. They live side by side in the run directory, which defaults to .bunsen/runs/<run-id>/:

.bunsen/runs/<run-id>/
  manifest.json
  events.jsonl
  ...

Open a run's full directory (including a local web viewer at http://localhost:3456) with bn runs open [run-id]; see CLI.md for the rest of the bn runs commands.

Naming convention: manifest JSON keys are snake_case (the on-disk contract), while event data payload keys are camelCase. This is intentional — match each file's convention when you parse it.

RunManifestV1

A populated manifest looks like this (trimmed for length):

{
  "schema_version": 1,
  "run_id": "2026-06-21T17-04-12_a1b2c3",
  "manifest_revision": 3,
  "run_source": "local",
  "created_at": "2026-06-21T17:04:12.001Z",
  "updated_at": "2026-06-21T17:09:48.512Z",
  "started_at": "2026-06-21T17:04:12.118Z",
  "completed_at": "2026-06-21T17:09:48.500Z",
  "duration_ms": 336382,
  "status": "succeeded",
  "exit_code": 0,
  "platform": "linux/arm64",
  "experiment": { "id": "fibonacci-server", "path": "experiments/fibonacci-server" },
  "agent": {
    "id": "claude-code",
    "args": [],
    "model": "claude-opus-4-8",
    "models": [
      { "model": "claude-opus-4-8", "calls": 14, "input_tokens": 38211,
        "output_tokens": 9044, "cost_usd": 1.2243 }
    ]
  },
  "usage": {
    "total_ai_calls": 17,
    "total_input_tokens": 41902,
    "total_output_tokens": 9711,
    "estimated_cost_usd": 1.31,
    "accounting_status": "captured"
  },
  "evaluation": {
    "weighted_score": 0.83,
    "criteria": [
      { "id": "server-responds", "weight": 1, "score": 1, "summary": "200 OK on /fib/10",
        "scorer_type": "script", "status": "completed" }
    ]
  },
  "provenance": { "verification_tier": "self_reported", "replayable": true },
  "artifacts": [
    { "key": "diff", "kind": "diff", "rel_path": "artifacts/workspace.diff",
      "content_type": "text/x-diff", "bytes": 2841, "created_at": "2026-06-21T17:09:48.300Z" }
  ]
}

Top-level fields

FieldTypeNotes
schema_version1Always 1. The manifest schema discriminator.
run_idstringUnique id for the run.
manifest_revisionnumberStarts at 1; incremented on each manifest-changing update.
run_sourceRunSourceWhere the run originated.
created_atstringISO8601. When the run record was created.
updated_atstringISO8601. Last manifest write.
started_atstringISO8601. When execution began.
completed_atstring?ISO8601. When execution finished; absent until terminal.
duration_msnumberRun duration in milliseconds.
statusRunStatusSee RunStatus values.
exit_codenumber?Agent process exit code, when known.
platformRunPlatform?Resolved platform for the run.
experimentRunManifestExperimentThe experiment that ran. See Experiment.
agentRunManifestAgentThe agent under test. See Agent.
orchestrationRunManifestOrchestration?Setup commands and the structured invocation.
usageRunManifestUsageAPI call / token / cost accounting. See Usage.
evaluationRunManifestEvaluation?Scores and report; absent on unevaluated runs.
human_scoringRunManifestHumanScoring?Human calibration scores, when recorded.
provenanceRunManifestProvenanceVerification tier, replayability, attestation.
artifactsRunManifestArtifact[]Captured artifacts (diffs, logs, screenshots, exports).
diagnosticsRunManifestDiagnostic[]?Non-blocking signals; see Diagnostics.

RunStatus values

status is a RunStatus, one of:

'pending' | 'running' | 'succeeded' | 'failed' | 'canceled'

RunSource and RunPlatform values

  • run_source (RunSource) records where the run executed: 'local' or 'remote'.
  • platform (RunPlatform) is a linux/<arch> value: 'linux/amd64' or 'linux/arm64'. See PLATFORMS.md.

Experiment

experiment: RunManifestExperiment:

FieldTypeNotes
idstringExperiment id.
pathstring?On-disk path, when run from disk.
variantstring?Experiment variant, when selected.
suite_idstring?Canonical suite id (<host>/<org>/<repo> for git-cloned suites, local/<dirname> for on-disk suites). Only present for suite runs; not the local alias from bunsen.config.yaml#suites[].as.
suite_versionstring?Commit sha of the cloned suite ref. Suite runs only.
suite_source_urlstring?Git URL the suite was cloned from. Git-cloned suite runs only.
config_hashstring?SHA-256 over the normalized experiment config.

Agent

agent: RunManifestAgent:

FieldTypeNotes
idstringAgent id.
pathstring?On-disk path, when run from disk.
variantstring?Agent variant, when selected.
argsstring[]Invocation args.
modelstring?The model the agent was configured with — the merged value of its declared model.env var (--model, a variant pin, or the declared default). Launch-time intent; absent when the agent declares no model. Contrast with models (observed).
modelsAgentModelUsage[]?Models the agent under test drove. See below.
config_hashstring?SHA-256 over the normalized agent config.
depsRunManifestAgentDep[]?Resolved install.deps (version + cache key, in declared order).

agent.models is observed from captured traces and sorted highest-cost first, so models[0] is the run's headline/primary model — the one that carried the run's compute, not necessarily the one with the most calls. It is absent when no agent traces were captured: there is no declared/placeholder fallback, so it is always a record of what actually ran. It counts only successful (2xx) inference — errored calls (e.g. a 404 for an unavailable model) are excluded — and excludes platform-model calls (orchestrator, supervisor, scorer), which live in usage.by_source. Because errored and platform calls are filtered out, the per-model calls here do not sum to usage.total_ai_calls.

AgentModelUsage (one model's slice of the run's API usage):

FieldType
modelstring
callsnumber
input_tokensnumber
output_tokensnumber
cost_usdnumber

RunManifestAgentDep:

FieldTypeNotes
namestringDep name.
versionstring?Resolved version, when known.
cache_keystringinstall.deps build-cache key.
binariesstring[]Binary names declared in provides.binaries.

Orchestration

orchestration: RunManifestOrchestration:

FieldTypeNotes
setup_commandsstring[]Setup commands run before the agent.
invocation{ command: string; args: string[] }Structured invocation.

Usage

usage: RunManifestUsage — the run's API accounting. See docs/COST.md for how these numbers are priced and attributed.

FieldTypeNotes
total_ai_callsnumberAll captured inference calls (agent + platform).
total_input_tokensnumberFresh (non-cached) input tokens.
total_output_tokensnumberOutput tokens.
total_cache_read_input_tokensnumber?Run-wide cache-read input tokens, summed across sources. Disjoint from total_input_tokens. Absent on runs with no captured traces.
total_cache_creation_input_tokensnumber?Run-wide cache-creation input tokens. Same disjointness; absent on runs with no captured traces.
estimated_cost_usdnumberTotal estimated cost.
agent_cost_usdnumber?Cost attributable to the agent under test.
platform_cost_usdnumber?Cost attributable to platform agents.
pricing_fallback_callsnumber?Calls whose model was absent from the vendored pricing snapshot and priced with a coarse per-provider default — that much of the cost is a rough estimate. Present only when > 0.
unpriced_modelsstring[]?Distinct unrecognized model ids behind pricing_fallback_calls, sorted.
by_sourceRecord<string, RunManifestUsageSource>?Per-source breakdown (agent vs orchestrator/supervisor/scorer).
accounting_status'captured' | 'missing' | 'skipped' (optional)Whether the numbers reflect actual captured traffic. See below.

RunManifestUsageSource:

FieldTypeNotes
callsnumberCalls from this source.
input_tokensnumberFresh (non-cached) input tokens.
output_tokensnumberOutput tokens.
cache_read_input_tokensnumber?Cache-read input tokens for this source, disjoint from input_tokens.
cache_creation_input_tokensnumber?Cache-creation input tokens for this source.
cost_usdnumberCost for this source.

usage.accounting_status semantics

This field distinguishes a "trustworthy zero" from "no proxy data — we don't know":

  • 'captured' — the proxy intercepted at least one inference call. The numbers are accurate within the limits of the in-proxy parser.
  • 'missing' — the proxy was active but recorded no traces. Could mean a deterministic agent that made no LLM calls, or an agent whose HTTP client bypassed the proxy (e.g. Node native fetch / undici, which does not honor HTTPS_PROXY). Treat the totals as a lower bound.
  • 'skipped' — tracing was deliberately disabled (--skip-traces).
  • undefined — a run that errored before the trace-finalization step.

Evaluation

evaluation: RunManifestEvaluation:

FieldTypeNotes
weighted_scorenumberOverall weighted score.
criteriaRunManifestCriterion[]Per-criterion results.
reportstring?Narrative evaluation report, when generated.

RunManifestCriterion:

FieldTypeNotes
idstringCriterion id (matches Criterion.id from the experiment).
titlestring?Human-readable title.
weightnumberCriterion weight in the aggregate.
scorenumber | nullResolved score; null when not scored.
summarystringScorer summary.
status'completed' | 'skipped' | 'not_run' (optional)Lifecycle status.
scorer_typeRunManifestScorerType?The criterion type that produced the score: 'script' | 'judge' | 'agent' | 'browser-agent' | 'aggregate'. See SCORERS.md.
allowed_scoresAllowedScores?Allowed score set or range for the criterion (e.g. a discrete set or a min/max range).
screenshotsstring[]?Artifact keys for screenshots the scorer produced.
log_pathstring?Artifact key for the scorer's log output.

Human scoring

human_scoring: RunManifestHumanScoring:

FieldTypeNotes
scored_bystringWho recorded the human scores.
scored_atstringISO8601.
criteriaRunManifestHumanCriterion[]Per-criterion human scores.

RunManifestHumanCriterion:

FieldTypeNotes
idstringCriterion id.
human_scorenumberThe human-assigned score.
llm_scorenumber | nullThe LLM score being calibrated against.
notesstring?Reviewer notes.
allowed_scoresAllowedScores?Allowed score set/range.

Provenance

provenance: RunManifestProvenance:

FieldTypeNotes
verification_tierVerificationTierHow strongly the run's reproducibility is guaranteed: 'self_reported' | 'reproducible' | 'attested'.
replayablebooleanWhether the run is reproducible.
image_digeststring?Pinned image digest.
suite_version_lockedboolean?Whether the suite version is pinned.
attestation_idstring?Attestation id, when present.

Diagnostics

diagnostics: RunManifestDiagnostic[] are non-blocking signals Bunsen surfaces about the run. The type is a discriminated union; the cross-boundary-binary-shadow variant is emitted by the cross-boundary binary-shadow detector (see ENVIRONMENT.md#asymmetric-composition). Absent on runs that recorded no diagnostics.

RunManifestCrossBoundaryShadow:

FieldTypeNotes
diagnostic'cross-boundary-binary-shadow'Discriminator.
binarystringBinary name (e.g. rg); matches provides.binaries[] on the agent dep.
winner{ source: 'agent-dep'; name: string; version?: string }Resolved winner on the run's PATH.
shadowed{ source: 'substrate-apt' | 'substrate-npm' | 'substrate-pip'; name: string; version?: string }The substrate-side package that also declared this binary and lost the precedence contest.
resolutionstringHuman-readable reason explaining how the conflict resolved.

Artifacts

artifacts: RunManifestArtifact[]:

FieldTypeNotes
keystringStable artifact key.
kindArtifactKindWhat the artifact is: 'diff' | 'log' | 'screenshot' | 'export' | 'trace' | 'other'.
rel_pathstring?Path relative to the run dir, for local artifacts.
object_urlstring?Remote object URL, for uploaded artifacts.
content_typestring?MIME type.
bytesnumber?Size in bytes.
sha256string?Content hash.
redaction_state'unknown' | 'clean' | 'redacted' | 'blocked' (optional)Outcome of secret scanning before the artifact is surfaced. See below.
created_atstringISO8601.
titlestring?Human-readable title surfaced in UIs.

redaction_state values:

  • 'clean' — scanned, no secrets found; safe to surface as-is.
  • 'redacted' — secrets were found and masked; the artifact is safe to share.
  • 'blocked' — secrets were found that could not be safely masked, so the artifact is withheld.
  • 'unknown' — not scanned.

See TRUST_MODEL.md for what triggers redaction and how safe-sharing works.

Run events (events.jsonl)

events.jsonl is the canonical, ordered, append-only execution record for a run — the timeline complement to manifest.json (the end-state summary). Each line is one JSON object: a RunEvent from the discriminated union below. Every event carries an event name (the discriminator), a ts: string ISO8601 timestamp, and a data payload (whose keys are camelCase).

A few lines from a real events.jsonl:

{"event":"run.started","ts":"2026-06-21T17:04:12.118Z","data":{"id":"2026-06-21T17-04-12_a1b2c3"}}
{"event":"agent.started","ts":"2026-06-21T17:04:18.940Z","data":{"id":"claude-code"}}
{"event":"agent.completed","ts":"2026-06-21T17:08:51.002Z","data":{"exitCode":0,"durationMs":272062}}
{"event":"criterion.completed","ts":"2026-06-21T17:09:40.221Z","data":{"id":"server-responds","score":1,"durationMs":1840,"status":"completed"}}
{"event":"run.completed","ts":"2026-06-21T17:09:48.500Z","data":{"id":"2026-06-21T17-04-12_a1b2c3","durationMs":336382}}

The full set of 19 event variants:

Event namedata payload
install.build.started{ agent: string; variant?: string }
install.build.completed{ cacheHit: boolean; durationMs: number }
workspace.sources.started{}
workspace.sources.completed{ sourceCount: number; durationMs: number }
install.configure.started{}
install.configure.completed{ stepCount: number; durationMs: number }
workspace.setup.started{}
workspace.setup.completed{ stepCount: number; durationMs: number }
run.started{ id: string }
agent.started{ id: string }
agent.completed{ exitCode: number; durationMs: number }
evaluation.started{ criterionCount: number }
criterion.started{ id: string }
criterion.completed{ id: string; score: number | null; durationMs: number; status?: 'completed' | 'skipped' }
evaluation.report.started{}
evaluation.report.completed{ durationMs: number }
run.completed{ id: string; durationMs: number }
run.failed{ phase: string; reason: string }
run.canceled{ reason?: string }

Durability

events.jsonl is written append-only. Each emit opens the file with O_APPEND, writes one JSON object plus a trailing newline, fsyncs the descriptor, and closes. Lines arrive whole to disk on normal termination, even across process.exit from a signal handler. A kill -9 is allowed to lose the trailing partial line; no preceding event is lost.

See also

  • SCORERS.md — how criteria are evaluated and what scorer_type means.
  • COST.md — how the usage numbers are priced and attributed.
  • CLI.md — the bn runs commands for inspecting runs.