Conversation

abrichr commented Jan 28, 2026

Summary

Migrates the benchmark infrastructure to a two-package architecture in which openadapt-evals is the foundation package and openadapt-ml focuses on ML-specific agents.

Changes:

  • Remove 6 infrastructure files that now live in openadapt-evals:
    • base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py
  • Keep only ML-specific agents in agent.py: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent
  • Add openadapt-evals as an optional dependency
  • Update tests to import from correct locations

Note: The removed files were not converted to deprecation stubs; they were fully removed to avoid code duplication. Users should import from openadapt_evals directly:

Removed from openadapt-ml → import from openadapt-evals instead:

  • openadapt_ml.benchmarks.base → openadapt_evals.adapters.base
  • openadapt_ml.benchmarks.waa → openadapt_evals.adapters.waa.mock
  • openadapt_ml.benchmarks.waa_live → openadapt_evals.adapters.waa.live
  • openadapt_ml.benchmarks.runner → openadapt_evals.benchmarks.runner
  • openadapt_ml.benchmarks.data_collection → openadapt_evals.benchmarks.data_collection
  • openadapt_ml.benchmarks.live_tracker → openadapt_evals.benchmarks.live_tracker
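
For example, code that previously imported the base benchmark types from openadapt-ml now imports them from the canonical location. A minimal sketch (the exact openadapt_evals submodule that exports each name is an assumption based on the mapping above):

  # Old (removed in this PR):
  #   from openadapt_ml.benchmarks.base import BenchmarkTask, BenchmarkResult
  # New canonical location (assumed export point, per the mapping above):
  from openadapt_evals.adapters.base import BenchmarkTask, BenchmarkResult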

Validation:

  • All 311 tests pass in openadapt-ml
  • All 188 tests pass in openadapt-evals

Test plan

  • Run openadapt-ml test suite (311 passed)
  • Run openadapt-evals test suite (188 passed)
  • Verify imports work from canonical openadapt-evals locations

Post-merge note: viewer.py still contains the full implementation alongside a deprecation warning. Consider making it a thin re-export stub in a follow-up PR for consistency.
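
A thin re-export stub could look roughly like the following sketch; the location openadapt_evals.benchmarks.viewer is an assumption, not confirmed by this PR:

  # openadapt_ml/benchmarks/viewer.py -- hypothetical thin re-export stub
  import warnings

  warnings.warn(
      "openadapt_ml.benchmarks.viewer is deprecated; "
      "import from openadapt_evals instead.",
      DeprecationWarning,
      stacklevel=2,
  )

  # Assumed canonical location; adjust to wherever the viewer actually lives.
  from openadapt_evals.benchmarks.viewer import *  # noqa: E402,F401,F403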

Generated with Claude Code

abrichr and others added 7 commits January 28, 2026 11:51
- Two-package architecture: openadapt-evals (foundation) + openadapt-ml (ML)
- Verified audit findings: 10 dead files confirmed; 3 files previously marked dead are actually in use
- CLI namespacing: oa evals <cmd>, oa ml <cmd>
- Dependency direction: openadapt-ml depends on openadapt-evals (not circular)
- Agents with ML deps (PolicyAgent, BaselineAgent) move to openadapt-ml
- adapters/waa/ subdirectory pattern for benchmark organization

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add [benchmarks] optional dependency for benchmark evaluation:
- pip install openadapt-ml[benchmarks]

This is part of the repo consolidation to establish:
- openadapt-evals: Foundation for benchmarks + infrastructure
- openadapt-ml: ML training (depends on evals for benchmarks)
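
Since openadapt-evals is an optional dependency, benchmark entry points in openadapt-ml can guard the import and point users at the extra. A minimal sketch of that pattern (not necessarily what the package does):

  # Fail with a clear message when the optional extra is missing.
  try:
      import openadapt_evals  # noqa: F401
  except ImportError as exc:
      raise ImportError(
          "Benchmark support requires the optional dependency; "
          "install it with: pip install 'openadapt-ml[benchmarks]'"
      ) from exc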

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- oa ml serve: serve trained models for inference
- oa ml dashboard: training dashboard for monitoring

This distinguishes the two use cases clearly:
- serve = model inference endpoint
- dashboard = training progress UI

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Migrate benchmark infrastructure to two-package architecture:
- openadapt-evals: Foundation package with all adapters, agents, runner
- openadapt-ml: ML-specific agents that wrap openadapt-ml internals

Changes:
- Convert base.py, waa.py, waa_live.py, runner.py, data_collection.py,
  live_tracker.py to deprecation stubs that re-export from openadapt-evals
- Keep only ML-specific agents in agent.py: PolicyAgent, APIBenchmarkAgent,
  UnifiedBaselineAgent
- Update __init__.py to import from openadapt-evals with deprecation warning
- Update tests to import from correct locations
- Remove test_waa_live.py (tests belong in openadapt-evals)

Net: -3540 lines of duplicate code removed

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…-evals

Remove deprecation stubs since there are no external users. Tests now
import directly from openadapt-evals (canonical location).

Deleted:
- base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py

Kept:
- agent.py (ML-specific agents: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent)
- __init__.py (simplified to only export ML-specific agents)
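
A sketch of what the simplified __init__.py implies; the agent names are taken from this PR, but the actual file may differ:

  # openadapt_ml/benchmarks/__init__.py -- sketch of the simplified exports
  from openadapt_ml.benchmarks.agent import (
      APIBenchmarkAgent,
      PolicyAgent,
      UnifiedBaselineAgent,
  )

  __all__ = ["APIBenchmarkAgent", "PolicyAgent", "UnifiedBaselineAgent"]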

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add section 15 for Windows Agent Arena benchmark results with clearly
marked placeholders. Results will be filled in when full evaluation
completes. Warning banner indicates PR should not merge until
placeholders are replaced.

Sections added:
- 15.1 Benchmark Overview
- 15.2 Baseline Reproduction (paper vs our run)
- 15.3 Model Comparison (GPT-4o, Claude, Qwen variants)
- 15.4 Domain Breakdown

Co-Authored-By: Claude Opus 4.5 <[email protected]>
WAA benchmark results belong in openadapt-evals (the benchmark
infrastructure package) rather than openadapt-ml (the training package).

See: OpenAdaptAI/openadapt-evals#22

Co-Authored-By: Claude Opus 4.5 <[email protected]>

abrichr commented Jan 28, 2026

Related PR (benchmark results section): OpenAdaptAI/openadapt-evals#22

abrichr and others added 10 commits January 28, 2026 17:57
- Add setup_vnc_tunnel_and_browser() helper for automatic VNC access
- Add VM_SIZE_FAST constants with D8 series sizes
- Add VM_SIZE_FAST_FALLBACKS for automatic region/size retry
- Add --fast flag to create command for faster installations
- Add --fast flag to start command for more QEMU resources (6 cores, 16GB)
- Opens browser automatically after container starts

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Document --fast VM flag usage
- Explain parallelization options
- Detail golden image approach for future optimization

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add section 13.5 with log viewing commands
- Add benchmark run commands with examples
- Renumber screenshot capture tool section to 13.6

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add logs --run command for viewing task progress
- Add logs --run -f for live streaming
- Add logs --run --tail N for last N lines

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add example output for `logs` (container status)
- Add example output for `logs --run -f` (benchmark execution)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add _show_benchmark_progress() function
- Parse run logs for completed task count
- Calculate elapsed time and estimated remaining
- Show progress percentage

Example usage:
  uv run python -m openadapt_ml.benchmarks.cli logs --progress
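
A sketch of the parsing this implies (hypothetical: the marker string, log format, and signature are assumptions, not the real _show_benchmark_progress()):

  import re
  import time

  def show_benchmark_progress(log_text: str, total_tasks: int, started_at: float) -> str:
      """Summarize completed tasks, elapsed time, and a rough ETA from run logs."""
      # Assumes each finished task logs a line containing "Task completed".
      completed = len(re.findall(r"Task completed", log_text))
      elapsed = time.time() - started_at
      pct = 100.0 * completed / total_tasks if total_tasks else 0.0
      rate = elapsed / completed if completed else None
      remaining = rate * (total_tasks - completed) if rate else None
      eta = f"~{remaining:.0f}s remaining" if remaining is not None else "ETA unknown"
      return f"{completed}/{total_tasks} tasks ({pct:.1f}%), {elapsed:.0f}s elapsed, {eta}"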

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Comprehensive analysis of Cua (YC X25) computer-use agent platform:
- Architecture comparison (composite agents, sandbox-first)
- Benchmark framework differences (cua-bench vs openadapt-evals)
- Training data generation (trajectory replotting)
- Recommendations: adopt patterns, not full migration

Key findings:
- Cua's parallelization uses multiple sandboxes (like our multi-VM plan)
- Composite agent pattern could reduce API costs
- HTML capture enables training data diversity

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…kers

WAA natively supports parallel execution by distributing tasks across workers.

Usage:
  # Run on single VM (default)
  run --num-tasks 154

  # Run in parallel on multiple VMs
  VM1: run --num-tasks 154 --worker-id 0 --num-workers 3
  VM2: run --num-tasks 154 --worker-id 1 --num-workers 3
  VM3: run --num-tasks 154 --worker-id 2 --num-workers 3

Tasks auto-distribute: worker 0 gets tasks 0-51, worker 1 gets 52-103, etc.
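
The split described above is consistent with simple contiguous chunking; a sketch of the idea (not the actual WAA implementation):

  # Contiguous chunking of task indices across workers (illustrative only).
  import math

  def tasks_for_worker(num_tasks: int, worker_id: int, num_workers: int) -> range:
      """Return the contiguous slice of task indices assigned to one worker."""
      chunk = math.ceil(num_tasks / num_workers)
      start = worker_id * chunk
      return range(start, min(start + chunk, num_tasks))

  # tasks_for_worker(154, 0, 3) -> range(0, 52)    (tasks 0-51)
  # tasks_for_worker(154, 1, 3) -> range(52, 104)  (tasks 52-103)
  # tasks_for_worker(154, 2, 3) -> range(104, 154) (tasks 104-153)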

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Expand cua_waa_comparison.md with:
- Success rate gap analysis (38.1% vs 19.5%)
- Market positioning comparison (TAM, buyers, value props)
- Where the sandbox approach fails (Citrix, licensed software, compliance)
- Shell applications convergence opportunities
- Bottom line: Windows enterprise automation is hard, which validates the OpenAdapt approach

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add WAA_PARALLELIZATION_DESIGN.md documenting:
  - Official WAA approach (Azure ML Compute)
  - Our dedicated VM approach (dev/debug)
  - When to use each approach

- Add WAA_UNATTENDED_SCALABLE.md documenting:
  - Goal: unattended, scalable, programmatic WAA
  - Synthesized approach using official run_azure.py
  - Implementation plan and cost estimates

- Update Dockerfile comments to clarify:
  - API agents (api-claude, api-openai) run externally
  - openadapt-evals CLI connects via SSH tunnel
  - No internal run.py patching needed

Co-Authored-By: Claude Opus 4.5 <[email protected]>

abrichr commented Jan 29, 2026

Latest additions (commit 6022772)

Added WAA parallelization design documentation:

  • docs/WAA_PARALLELIZATION_DESIGN.md: Comparison of dedicated VM approach vs official Azure ML Compute approach
  • docs/WAA_UNATTENDED_SCALABLE.md: Synthesized approach for unattended, scalable, programmatic WAA execution

Also clarified in the Dockerfile that API agents (api-claude, api-openai) run externally via the openadapt-evals CLI connecting over an SSH tunnel, rather than by patching run.py internally.

abrichr and others added 2 commits January 28, 2026 23:51
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace imports from deleted benchmark files with direct imports
from openadapt-evals:

- azure.py: BenchmarkResult, BenchmarkTask, WAAAdapter
- waa_demo/runner.py: BenchmarkAction, WAAMockAdapter, etc.

This completes the migration to the two-package architecture where
openadapt-evals is the canonical source for benchmark infrastructure.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

abrichr commented Jan 29, 2026

PR Review Update

Issue Found: The PR description was inaccurate; it claimed files were "converted to deprecation stubs" when they were actually deleted.

Fix Applied (commit 4336e81):

  • Updated azure.py to import from openadapt_evals instead of deleted files
  • Updated waa_demo/runner.py to import from openadapt_evals

Verification:

  • All 18 benchmark tests pass locally
  • azure.py and waa_demo/runner.py import correctly

Remaining Blocker:
CI is still failing because it uses the PyPI release openadapt-evals==0.1.0, which has the old task ID format (browser_1 instead of mock_browser_001). openadapt-evals==0.1.1 needs to be published first.

abrichr and others added 2 commits January 29, 2026 00:49
- Update azure.py to import BenchmarkAgent from openadapt_evals
- Add EvaluationConfig to runner.py imports

Fixes CI failure: F821 Undefined name `EvaluationConfig`

Co-Authored-By: Claude Opus 4.5 <[email protected]>
v0.1.0 uses the task ID format "browser_1", but tests expect "mock_browser_001", which was added in v0.1.1.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
abrichr merged commit 7f171e4 into main on Jan 29, 2026
4 checks passed