The Trial/Scene API is the primary way to run agent benchmarks programmatically.
```shell
uv tool install benchflow
```

```python
import asyncio
import benchflow as bf

result = asyncio.run(bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview"))
print(f"Reward: {result.rewards}")
print(f"Tool calls: {result.n_tool_calls}")
```

Declarative configuration for a trial — a sequence of Scenes in a shared sandbox.
```python
from pathlib import Path

from benchflow.trial import TrialConfig, Scene, Role, Turn

# Single-agent (simplest)
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    environment="daytona",
    sandbox_setup_timeout=120,
)

# Multi-scene BYOS (skill-gen → solve)
config = TrialConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="prep", roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Generate a skill for this task...")]),
        Scene(name="solve", roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
    sandbox_setup_timeout=120,
)
```

Set `sandbox_setup_timeout` when sandbox user setup needs more than the default 120 seconds. The same field is also available on `JobConfig` and `RuntimeConfig`.
One interaction region — roles take turns executing prompts.
```python
# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")

# Multi-role with turn order (coder-reviewer pattern).
# Agents communicate via outbox: write /app/.outbox/{recipient}.json.
# The scheduler reads the outbox after each turn and injects messages
# into the next role's prompt.
scene = Scene(
    name="coder-reviewer",
    roles=[
        Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
        Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
    ],
    turns=[
        Turn("coder"),  # None prompt = instruction.md
        Turn("reviewer", "Review the code. Write feedback to "
             '/app/.outbox/coder.json as {"to":"coder","content":"..."}'),
        Turn("coder", "Fix the issues."),  # reviewer's feedback auto-injected
    ],
)
```

The execution engine — decomposed into independently-callable phases.
```python
from benchflow.trial import Trial

trial = await Trial.create(config)

# Full lifecycle (most common)
result = await trial.run()

# Manual composition (for custom flows)
await trial.setup()
await trial.start()
await trial.install_agent()
await trial.connect()
await trial.execute(prompts=["custom prompt"])
await trial.disconnect()
await trial.verify()
await trial.cleanup()
```

Runtime-level configuration for the Agent + Environment execution path.
```python
from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig

config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", backend="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()
```

Convenience function — multiple calling conventions:
```python
import benchflow as bf

# 1. TrialConfig (full control)
result = await bf.run(config)

# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", backend="daytona")
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)

# 3. String shortcut (simplest)
result = await bf.run(
    "gemini",
    task_path="tasks/X",
    model="gemini-3.1-flash-lite-preview",
    config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)
```

```
Trial.run()
│
├─ setup()          — resolve config, create env object
├─ start()          — spin up sandbox, upload task files, start services
├─ install_agent()  — install agent binary, credentials, sandbox user
│                     (sandbox user setup: create non-root user, prepare
│                     small config/auth dirs, chown the workspace — no
│                     recursive copy of /root tool trees; agent binaries
│                     must live on shared prefixes like /usr/local/bin)
├─ for scene in scenes:
│   └─ _run_scene(scene)
│       ├─ setup /app/.outbox/   — (multi-role scenes only)
│       └─ for turn in scene.turns:
│           ├─ read outbox       — inject messages into prompt
│           ├─ connect_as(role)  — open ACP session for this role
│           ├─ execute(prompts)  — send prompts, collect trajectory
│           └─ disconnect()      — kill agent process, clean up
├─ verify()   — run verifier, collect rewards
└─ cleanup()  — stop sandbox
```
Key: disconnect() kills the agent process between scenes to prevent context bleed. Each scene gets a fresh agent session.
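The "read outbox" step in the loop above can be sketched as follows. This is a simplified model of the scheduler's hand-off, not benchflow's actual implementation, and `inject_outbox` is a hypothetical name:

```python
import json
from pathlib import Path


def inject_outbox(outbox_dir: Path, role: str, prompt: str) -> str:
    """Deliver any pending outbox message addressed to `role` by
    prepending it to the role's next prompt, then consume the file."""
    msg_file = outbox_dir / f"{role}.json"
    if not msg_file.exists():
        return prompt  # no pending message; prompt unchanged
    msg = json.loads(msg_file.read_text())  # e.g. {"to": "coder", "content": "..."}
    msg_file.unlink()  # each message is delivered at most once
    return f"[message for {msg['to']}] {msg['content']}\n\n{prompt}"
```

A reviewer writing `/app/.outbox/coder.json` would thus see its feedback appear at the top of the coder's next turn prompt.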
| Pattern | Roles | Turns | Communication | Example |
|---|---|---|---|---|
| Single-turn | 1 | 1 | — | Baseline benchmark |
| Multi-turn | 1 | 2+ | Same session, sequential prompts | Self-review |
| Multi-round | 2+ | 2+ | Outbox files between roles | Coder + Reviewer |
Multi-turn = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.
Multi-round = different agents exchange turns. Use when tasks need multiple perspectives (code review, client-advisor). The scheduler reads outbox files and injects messages.
Both use the same API — TrialConfig with different Scene configurations.
```python
config = TrialConfig(
    task_path=task_path,
    scenes=[Scene(
        roles=[Role("coder", "gemini", "flash"), Role("reviewer", "gemini", "flash")],
        turns=[
            Turn("coder"),
            Turn("reviewer", "Review /app/. Write feedback to /app/.outbox/coder.json"),
            Turn("coder", "Read feedback and fix."),
        ],
    )],
    environment="daytona",
)
```

```python
config = TrialConfig(
    task_path=task_path,
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "flash")],
              turns=[Turn("gen", "Generate a skill document to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "flash")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
)
```

The Scene API in 0.3 covers coder-reviewer and multi-turn patterns. It does not yet support:
- Dynamic termination — turn count is fixed at config time. A "user" role cannot decide to stop early based on agent output. Workaround: use `max_rounds` in the `standalone_scene.py` scheduler.
- Oracle access — no mechanism for a "user" role to read `/solution` during setup.
- Per-round verification — `verify()` runs once after all scenes complete, not between rounds.
- Inter-round trajectory inspection — a "user" role cannot read the agent's trajectory between turns.
These are tracked for 0.4. See the Harbor PR #1462 mapping for details.
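As a rough illustration of the `max_rounds` workaround for dynamic termination, the standalone scheduler's round loop might look like this. This is a hypothetical sketch: `run_turn` and `is_done` stand in for whatever turn-execution and stop-check hooks the scheduler actually uses:

```python
from typing import Callable


def run_rounds(run_turn: Callable[[str], str],
               is_done: Callable[[str], bool],
               max_rounds: int) -> int:
    """Alternate coder/reviewer turns until the reviewer's output
    signals completion or the round budget is exhausted.
    Returns the number of rounds actually run."""
    for round_no in range(1, max_rounds + 1):
        run_turn("coder")
        feedback = run_turn("reviewer")
        if is_done(feedback):
            return round_no  # stopped early on reviewer signal
    return max_rounds  # budget exhausted without a stop signal
```

The fixed `max_rounds` ceiling is what makes this a workaround rather than true dynamic termination: the loop can stop early, but the budget itself is still set at config time.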
```python
from benchflow.trial_yaml import trial_config_from_yaml

config = trial_config_from_yaml("trial.yaml")
result = await bf.run(config)
```

| Agent | Protocol | Auth | Aliases |
|---|---|---|---|
| `gemini` | ACP | `GOOGLE_API_KEY` | — |
| `claude-agent-acp` | ACP | `ANTHROPIC_API_KEY` | `claude` |
| `codex-acp` | ACP | `OPENAI_API_KEY` | `codex` |
| `pi-acp` | ACP | `ANTHROPIC_API_KEY` | `pi` |
| `openclaw` | ACP | inferred from model | — |
`Trial.run()` catches common errors:

- `TimeoutError` — agent exceeded timeout
- `ConnectionError` — SSH/ACP pipe closed (retried 3x with exponential backoff)
- `ACPError` — agent protocol error
Job-level retry with `RetryConfig`:

```python
from benchflow.job import Job, JobConfig, RetryConfig

config = JobConfig(
    retry=RetryConfig(
        max_retries=2,
        wait_multiplier=2.0,
        min_wait_sec=1.0,
        max_wait_sec=30.0,
    ),
)
```
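For intuition, the `RetryConfig` fields plausibly combine into retry waits like this. This is a hypothetical model assuming tenacity-style exponential backoff (`wait_multiplier * 2**attempt`, clamped to `[min_wait_sec, max_wait_sec]`); benchflow's exact formula may differ:

```python
def retry_waits(max_retries: int, wait_multiplier: float,
                min_wait_sec: float, max_wait_sec: float) -> list[float]:
    """Seconds to wait before each retry attempt (hypothetical model)."""
    waits = []
    for attempt in range(max_retries):
        raw = wait_multiplier * (2 ** attempt)  # 2.0, 4.0, 8.0, ...
        waits.append(min(max(raw, min_wait_sec), max_wait_sec))
    return waits


# With the config above: two retries, waiting 2s then 4s
print(retry_waits(2, 2.0, 1.0, 30.0))  # → [2.0, 4.0]
```

Under this model, `max_wait_sec` caps the growth: with enough retries the wait plateaus at 30 seconds rather than doubling indefinitely.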