benchflow-ai/benchflow

BenchFlow

Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent

What

BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle.

  • Any ACP agent — Gemini CLI, Claude Code, Codex, OpenCode, OpenClaw, Pi, or your own
  • Single + multi + progressive — single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python BaseUser callback
  • Cloud sandboxes — Daytona backend for parallel execution at scale
  • Hardened verifier — defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature

Install

uv tool install benchflow

Requires Python 3.12+ and uv. Set DAYTONA_API_KEY for cloud sandboxes; export the relevant agent API key (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.) or run claude login / codex --login for subscription auth.
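
The prerequisites above can be checked before a first run. The following is a minimal sketch, not part of BenchFlow: the `preflight` helper and the `AGENT_KEYS` tuple are hypothetical names, and only the Python version floor, the `uv` requirement, and the environment variable names come from this README.

```python
import os
import shutil
import sys

# Variable names are quoted from the README; the README lists them with "etc.",
# so this tuple is illustrative rather than complete.
AGENT_KEYS = ("GEMINI_API_KEY", "ANTHROPIC_API_KEY")


def preflight(env, python_version, uv_found):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if tuple(python_version) < (3, 12):
        problems.append("Python 3.12+ is required")
    if not uv_found:
        problems.append("uv was not found on PATH")
    if not env.get("DAYTONA_API_KEY"):
        problems.append("DAYTONA_API_KEY is unset (needed only for cloud sandboxes)")
    if not any(env.get(key) for key in AGENT_KEYS):
        problems.append("no agent API key is set (subscription login is the alternative)")
    return problems


if __name__ == "__main__":
    # Inspect the real environment and print one warning per finding.
    found = shutil.which("uv") is not None
    for problem in preflight(os.environ, sys.version_info[:2], found):
        print("warning:", problem)
```

The last two findings are advisory: a local run does not need DAYTONA_API_KEY, and subscription auth replaces an exported agent key.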

Documentation

Start with Getting started, then Concepts for the mental model. Then by goal:

| If you want to… | Read |
| --- | --- |
| Run an eval on an existing task | Getting started |
| Understand Trial / Scene / Role / Verifier | Concepts |
| Author a new task | Task authoring |
| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | Use cases |
| Multi-round single-agent (progressive disclosure, oracle access) | Progressive disclosure |
| Skill evaluation (when the artifact is a skill, not a workspace) | Skill eval |
| Understand the security model | Sandbox hardening |
| CLI flags + commands | CLI reference |
| Python API surface | Python API reference |

Notebooks and runnable example scripts: examples/.

Featured

Research artifacts

Two runnable labs validate the security story:

Audience

Contributing

PRs welcome. Open against main. CI runs ruff + tests on every PR; please run ruff check . and pytest tests/ locally first.
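
The two local checks can be chained so that a failure in the first stops the run. This is a small sketch under stated assumptions: `run_checks` is a hypothetical convenience wrapper, not a script shipped with the repository, and only the two commands themselves come from the note above.

```python
import subprocess

# Commands quoted from the contributing note; the wrapper is illustrative only.
CHECKS = (["ruff", "check", "."], ["pytest", "tests/"])


def run_checks(checks=CHECKS, runner=subprocess.call):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in checks:
        print("$", " ".join(cmd))
        code = runner(cmd)
        if code != 0:
            # Stop early so a lint failure is reported before the tests run.
            return code
    return 0
```

From the repository root, `raise SystemExit(run_checks())` would run both checks and propagate the exit code. The `runner` parameter exists so the function can be exercised without invoking the real tools.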

For a release: bump pyproject.toml to the next stable version, tag v<version> on main, and push the tag; CI then publishes to PyPI. Afterwards, bump main to the next .dev0.
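
The version arithmetic in the release steps can be sketched as follows. The README does not specify the version scheme, so this assumes main carries a MAJOR.MINOR.PATCH.dev0 version, that the stable release drops the .dev0 suffix, and that the next development version increments the patch number. `release_plan` is a hypothetical helper and the version numbers are illustrative.

```python
def release_plan(dev_version):
    """Map a '.dev0' version to (stable version, git tag, next dev version).

    Assumes a MAJOR.MINOR.PATCH.dev0 scheme with a patch-level bump after
    release; the project's actual policy may differ.
    """
    suffix = ".dev0"
    if not dev_version.endswith(suffix):
        raise ValueError(f"expected a version ending in {suffix!r}: {dev_version!r}")
    stable = dev_version[: -len(suffix)]
    major, minor, patch = (int(part) for part in stable.split("."))
    next_dev = f"{major}.{minor}.{patch + 1}{suffix}"
    return stable, f"v{stable}", next_dev


stable, tag, next_dev = release_plan("1.2.3.dev0")
print(stable, tag, next_dev)
```

The middle value is the tag to push; the last value is what main would be bumped to once the tag is out, under the patch-bump assumption.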

License

Apache-2.0.

About

Framework for creating high-fidelity, complex RL environments and evaluation tasks
