BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle.
- Any ACP agent — Gemini CLI, Claude Code, Codex, OpenCode, OpenClaw, Pi, or your own
- Single + multi + progressive — single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python `BaseUser` callback (a hedged sketch follows this list)
- Cloud sandboxes — Daytona backend for parallel execution at scale
- Hardened verifier — defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature
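The multi-round pattern revolves around that `BaseUser` callback. As a rough, non-authoritative sketch (only the `BaseUser` name appears in this README; the import path, method names, and signatures below are assumptions about what such a callback could look like):

```python
# Hypothetical multi-round user callback. Only the name BaseUser comes from
# this README; the import path, respond()/is_done(), and their signatures are
# illustrative assumptions, not BenchFlow's documented interface.
from benchflow import BaseUser  # assumed import path


class HintingUser(BaseUser):
    """Reveals a little more of the task on each round."""

    def __init__(self, prompts: list[str]):
        self.prompts = prompts  # one prompt per round, terse -> detailed
        self.round = 0

    def respond(self, agent_message: str) -> str:
        # Hand the agent the next prompt in the schedule; repeat the last
        # one if the agent keeps going after all hints are exhausted.
        prompt = self.prompts[min(self.round, len(self.prompts) - 1)]
        self.round += 1
        return prompt

    def is_done(self) -> bool:
        # End the trial once every scheduled round has been played.
        return self.round >= len(self.prompts)
```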
Install with `uv tool install benchflow`. Requires Python 3.12+ and `uv`. Set `DAYTONA_API_KEY` for cloud sandboxes; export the relevant agent API key (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) or run `claude login` / `codex --login` for subscription auth.
Start with Getting started, then Concepts for the mental model. Then by goal:
| If you want to… | Read |
|---|---|
| Run an eval on an existing task | Getting started |
| Understand Trial / Scene / Role / Verifier | Concepts |
| Author a new task | Task authoring |
| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | Use cases |
| Multi-round single-agent (progressive disclosure, oracle access) | Progressive disclosure |
| Skill evaluation (when the artifact is a skill, not a workspace) | Skill eval |
| Understand the security model | Sandbox hardening |
| CLI flags + commands | CLI reference |
| Python API surface | Python API reference |
Notebooks and runnable example scripts: `examples/`.
- Progressive disclosure on SWE-bench Pro — the `BaseUser` abstraction drives a multi-round trial: terse round-0 prompt → failing-test hints → full spec. 5/5 oracle on Daytona; runnable demo at `examples/swebench_pro_progressive_disclosure.ipynb`. Also BenchFlow's Harbor #1316 parity answer for the no-second-LLM case. See Progressive disclosure. A hedged sketch of the round schedule follows.
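As an illustration of that schedule (not BenchFlow's API: the prompt text, test id, and file path below are placeholders, and how a trial actually consumes the schedule is an assumption), the three rounds could be expressed as plain data fed to a callback like the `HintingUser` sketched earlier:

```python
# Illustrative three-round disclosure schedule matching the pattern above
# (terse prompt -> failing-test hints -> full spec). The prompt text, the
# test id, and the spec path are placeholders, not SWE-bench Pro contents.
from pathlib import Path

ROUNDS = [
    # round 0: terse, title-only prompt
    "Fix the bug described in the issue title.",
    # round 1: point the agent at the failing tests
    "Hint: tests/test_parser.py::test_empty_input fails on the current code.",
    # round 2: reveal the full specification
    Path("specs/full_issue_spec.md").read_text(),
]

user = HintingUser(prompts=ROUNDS)  # hypothetical callback from the sketch above
```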
Two runnable labs validate the security story:
- `labs/benchjack-sandbox-hardening/` — end-to-end demo that 0.2.1+ blocks three BenchJack exploits that flip 0.2.0's reward from 0.0 to 1.0.
- `labs/reward-hack-matrix/` — full reward-hack sweep across real benchmarks comparing 0.2.0 vs 0.2.2.
- Eval researchers / paper writers → Getting started → Concepts → Use cases
- Task authors → Task authoring → Sandbox hardening
- Agent builders integrating with BenchFlow → Concepts → Python API reference → `benchflow.agents.registry` (hedged sketch after this list)
- Existing Harbor users migrating → Use cases — migration section → Progressive disclosure (Harbor #1316 parity)
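For agent builders, the module path `benchflow.agents.registry` comes from the reading path above, but the registration call and agent interface shown here are assumptions about how such a registry might look, not the documented API:

```python
# Hypothetical sketch of plugging a custom agent into BenchFlow. Only the
# module path benchflow.agents.registry appears in this README; register()
# and the agent shape below are illustrative assumptions.
from benchflow.agents import registry  # assumed import


class EchoAgent:
    """Toy agent that just returns the task prompt unchanged."""

    name = "echo"

    def run(self, prompt: str) -> str:
        return prompt


registry.register(EchoAgent)  # assumed registration call
```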
PRs welcome. Open against `main`. CI runs ruff + tests on every PR; please run `ruff check .` and `pytest tests/` locally first.
For a release: bump `pyproject.toml` to the next stable version, tag `v<version>` on `main`, push the tag — CI publishes to PyPI. Then bump `main` to the next `.dev0`.
Apache-2.0.