Fix scoring, CU agent error handling, and PolicyAgent prompt alignment by abrichr · Pull Request #42 · OpenAdaptAI/openadapt-evals

abrichr · 2026-02-25T17:17:45Z

Summary

Comprehensive reliability fixes for evaluation infrastructure, addressing false-positive scoring, agent error handling, PolicyAgent prompt alignment with training data, and Docker networking workarounds.

Scoring & Infrastructure Reliability

Guard empty metric_results in evaluate_endpoint.py — min([]) crashed, all([]) returned True (false positive)
Remove reasonable_completion false-positive path in mock adapter — done after 2+ actions no longer counts as success
Add error_type field to BenchmarkResult ("infrastructure", "agent", "evaluation", or None)
Use config.timeout for screenshot/a11y requests in live adapter (was hardcoded 30.0)
Health check probes evaluate server (port 5050) in addition to WAA server (port 5001)
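
The empty-list pitfall behind the first item can be sketched as follows. This is a minimal illustration, not the code in evaluate_endpoint.py: the helper names `aggregate_metrics` and `is_success` and the `threshold` parameter are hypothetical.

```python
from typing import Optional, Sequence


def aggregate_metrics(metric_results: Sequence[float]) -> Optional[float]:
    """Return the lowest metric score, or None when no metrics ran.

    Hypothetical helper: min([]) raises ValueError and all([]) is
    vacuously True, so an empty list has to be handled explicitly.
    """
    if not metric_results:
        return None
    return min(metric_results)


def is_success(metric_results: Sequence[float], threshold: float = 1.0) -> bool:
    """Treat "no metrics" as a failure rather than a vacuous success."""
    score = aggregate_metrics(metric_results)
    return score is not None and score >= threshold
```

With this guard an empty result list is reported as "no score" and never counts as a pass.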

CU Agent Error Handling

Return type="error" instead of type="done" on all failure paths (no screenshot, API failure, retry exhaustion) — runner can now distinguish infrastructure failures from genuine task completion
Early exit on None screenshot — prevents sending "Screenshot unavailable" text to Claude and wasting retries
Runner handles type="error" as terminal action, propagates error_type to BenchmarkResult
Initialize action = None before runner loop to prevent NameError on 0-iteration edge case
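
A rough sketch of the runner behavior described above, assuming simplified stand-ins for the real types: `Action`, `Result`, `run_episode`, and the `finished` field are hypothetical names, and the actual runner in openadapt-evals may be structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    type: str                         # e.g. "click", "wait", "done", "error"
    error_type: Optional[str] = None  # e.g. "infrastructure"


@dataclass
class Result:
    finished: bool                    # agent declared the task done
    steps: int
    error_type: Optional[str] = None


TERMINAL_TYPES = {"done", "error"}


def run_episode(next_action: Callable[[int], Action], max_steps: int) -> Result:
    # Bind action before the loop so it exists even if the loop body
    # never runs (max_steps == 0); otherwise the checks below would
    # raise NameError.
    action: Optional[Action] = None
    steps = 0
    while steps < max_steps:
        action = next_action(steps)
        steps += 1
        if action.type in TERMINAL_TYPES:
            break
    if action is not None and action.type == "error":
        # Propagate the agent's error category instead of scoring the run.
        return Result(finished=False, steps=steps, error_type=action.error_type)
    finished = action is not None and action.type == "done"
    return Result(finished=finished, steps=steps)
```

Keeping "error" separate from "done" lets the caller decide whether a failed run should be retried or excluded from scoring.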

PolicyAgent Prompt Alignment

Use parse_qwen_action() instead of parse_action_response() — the base parser only recognizes UPPERCASE format (CLICK(0.5, 0.3)) but the trained model outputs lowercase keyword format (click(x=500, y=300)). Without this fix, every model output silently fell through to type="done".
Rewrite _build_prompt() to match training format from convert_demos.py: <image>, Instruction: label, indented step history, <think> instruction
Add _format_action_qwen() — formats actions in [0,1000] coordinate range matching training data
Load model via QwenVLAdapter.from_pretrained() with proper LoRA config
Remove dead SYSTEM_PROMPT — QwenVLAdapter.generate() drops system role messages; training and inference are now consistently aligned without it
Fix temp file leak — track screenshot temp files, clean up in reset()
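
To illustrate the format mismatch, here is a small sketch of a parser for the lowercase keyword format and a formatter for [0,1000] coordinates. It is not the real parse_qwen_action() or _format_action_qwen(): the names `parse_keyword_action` and `format_click` are hypothetical, and the assumption that inputs are normalized [0, 1] fractions is mine.

```python
import re
from typing import Any, Dict

# Matches a lowercase call such as click(x=500, y=300) or wait().
_CALL_RE = re.compile(r"^\s*([a-z_]+)\((.*)\)\s*$", re.DOTALL)
# Matches numeric keyword arguments such as x=500 or y=300.5.
_KWARG_RE = re.compile(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_keyword_action(text: str) -> Dict[str, Any]:
    """Parse 'name(key=value, ...)'; return an explicit error otherwise."""
    match = _CALL_RE.match(text)
    if match is None:
        # Surface unparseable output instead of silently mapping it to "done".
        return {"type": "error", "raw": text}
    name, arg_text = match.groups()
    args = {key: float(value) for key, value in _KWARG_RE.findall(arg_text)}
    return {"type": name, **args}


def format_click(x_frac: float, y_frac: float) -> str:
    """Format normalized [0, 1] coordinates in the [0, 1000] range (assumed)."""
    return f"click(x={round(x_frac * 1000)}, y={round(y_frac * 1000)})"
```

In this sketch an uppercase positional string such as CLICK(0.5, 0.3) produces an explicit error action, which makes a parser mismatch visible rather than ending the episode quietly.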

Qwen3VL Agent Fixes

Map wait() to type="wait" instead of type="done" — prevents premature episode termination when model outputs wait()
Fix Modal adapter path from /training/results/final to /training/adapter

Adapter Defensive Handling

Handle type="error" in adapters — live adapter _translate_action() returns None, mock adapter treats as terminal
Update BenchmarkAction type docstring to include "error"

Infrastructure: Docker Networking Workaround

Re-establish socat proxy after container restart. Docker port forwarding for 5050 is broken due to QEMU's tap networking (--cap-add NET_ADMIN). The socat proxy (VM:5051 → docker exec → container:5050) was only set up on initial creation, so it is now re-established in:
- _restart_container() in run_dc_eval.py
- ensure_waa_ready() tunnel reconnect step
- WAA_START_SCRIPT ALREADY_RUNNING path in pool.py
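
The "check whether the proxy is alive, restart it if dead" pattern can be sketched in Python as below. This is an assumption-laden illustration: `port_is_open` and `ensure_proxy` are hypothetical names, and the `restart` callback stands in for whatever command actually relaunches socat on the VM.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def ensure_proxy(host: str, port: int, restart: Callable[[], None]) -> bool:
    """Restart the proxy only when its port is unreachable.

    Returns True if a restart was triggered, False if it was already alive.
    """
    if port_is_open(host, port):
        return False
    restart()
    return True
```

Calling a helper like this from every recovery path (container restart, tunnel reconnect, already-running check) keeps the proxy setup idempotent.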

Test plan

316 tests pass
Mock adapter no longer gives false positives on done after 2+ actions
CU agent returns type=error with error_type=infrastructure on failures
PolicyAgent uses parse_qwen_action() for lowercase keyword format
wait() action continues episode instead of terminating it
Socat proxy re-established after container restart in all code paths

- Guard empty metric_results in evaluate_endpoint.py to prevent min([])/all([]) crashes and false positives
- Add error_type field to BenchmarkResult for distinguishing infrastructure vs agent vs evaluation failures
- Set error_type="infrastructure" on evaluation timeout and request errors
- Use config.timeout instead of hardcoded 30s for screenshot/a11y requests
- Remove reasonable_completion false-positive path in mock adapter (calling done after 2+ actions no longer counts as success)
- Health check now probes both WAA server and evaluate server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Early exit with type="error" when no screenshot available (infrastructure)
- Return type="error" on API call failure instead of type="done"
- Return type="error" on retry exhaustion (screenshot/wait loop)
- Runner handles type="error" as terminal action alongside type="done"
- Runner propagates error_type from agent error to BenchmarkResult
- Update tests to verify new error action behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add SYSTEM_PROMPT matching openadapt_ml.training.convert_demos
- Change "TASK:" label to "Instruction:" (training format)
- Remove a11y tree injection (not present in training data)
- Add <think> instruction matching training tail prompt
- Format history as " Step {i}: {action}" (0-indexed, indented)
- Build SFT-style samples with system/user message structure
- Track previous actions across steps (reset on reset())
- Replace broken openadapt_ml.vlm import with models.get_adapter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The previous code imported get_adapter from openadapt_ml.models, which doesn't exist. Use QwenVLAdapter.from_pretrained() with lora_config for checkpoint loading, matching the actual openadapt-ml API. Also update the default model_name to a valid HuggingFace model ID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add _format_action_qwen() matching training format from convert_demos: lowercase function names, [0,1000] coords, named params
- Initialize action=None before runner while loop to prevent NameError
- Handle screenshot bytes when screenshot_path is None (temp file fallback)
- Add NOTE about QwenVLAdapter dropping system prompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
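
The temp file fallback, together with the cleanup in reset() mentioned in the PR summary, might look roughly like this. The class name `ScreenshotTempFiles` and its methods are hypothetical; the actual PolicyAgent keeps this state on the agent itself.

```python
import os
import tempfile
from typing import List, Optional


class ScreenshotTempFiles:
    """Tracks temp files created for in-memory screenshots (hypothetical)."""

    def __init__(self) -> None:
        self._paths: List[str] = []

    def path_for(
        self,
        screenshot_path: Optional[str],
        screenshot_bytes: Optional[bytes],
    ) -> Optional[str]:
        """Prefer an existing path; otherwise spill bytes to a tracked temp file."""
        if screenshot_path is not None:
            return screenshot_path
        if screenshot_bytes is None:
            return None
        fd, path = tempfile.mkstemp(suffix=".png")
        with os.fdopen(fd, "wb") as handle:
            handle.write(screenshot_bytes)
        self._paths.append(path)  # remember it so reset() can delete it
        return path

    def reset(self) -> None:
        """Delete every temp file created since the last reset."""
        for path in self._paths:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already removed elsewhere
        self._paths.clear()
```

Only files the tracker created are deleted, so caller-supplied screenshot paths are never touched.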

The PEFT adapter is uploaded to volume at /adapter by upload_adapter_to_volume(), and the volume mounts at /training, so the full path is /training/adapter (not /training/results/final).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…fense

- PolicyAgent._parse_response() now uses parse_qwen_action() instead of parse_action_response(), which could not parse the lowercase keyword format the trained model outputs (click(x=500, y=300))
- wait() action now maps to type="wait" instead of type="done" in parse_qwen_action(), preventing premature episode termination
- Adapters defensively handle type="error" alongside type="done"
- Remove dead SYSTEM_PROMPT from PolicyAgent (QwenVLAdapter.generate() drops system role; training and inference are now consistently aligned)
- Fix temp file leak: track and clean up screenshot temp files in reset()
- Update BenchmarkAction type docstring to include "error"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Docker port forwarding for port 5050 (evaluate server) is broken due to QEMU's custom bridge networking. The socat proxy on the VM host (VM:5051 → docker exec → container:5050) was only set up on initial container creation, so any `docker restart` left the evaluate server unreachable.

- Add _setup_eval_proxy() to run_dc_eval.py; called after container restart and during tunnel reconnect recovery
- Fix WAA_START_SCRIPT ALREADY_RUNNING path in pool.py to check if socat proxy is alive and restart it if dead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

abrichr and others added 3 commits February 25, 2026 11:35

abrichr mentioned this pull request Feb 25, 2026

Align PolicyAgent prompt with training format OpenAdaptAI/openadapt-ml#31

Merged

4 tasks

abrichr changed the base branch from fix/cu-agent-reliability to main February 25, 2026 17:21

abrichr changed the title ~~Align PolicyAgent prompt with training format~~ Fix scoring, CU agent error handling, and PolicyAgent prompt alignment Feb 25, 2026

This was referenced Feb 25, 2026

Fix scoring false positives and infrastructure reliability #40

Closed

Return type=error on CU agent failures instead of type=done #41

Closed

abrichr and others added 5 commits February 25, 2026 12:30

abrichr merged commit 6b7d08c into main Feb 25, 2026
1 check passed

abrichr mentioned this pull request Feb 25, 2026

fix(docs): require conventional commit format for PR titles #43

Merged

1 task

abrichr deleted the fix/policy-agent-prompt-alignment branch February 28, 2026 14:53

abrichr merged 8 commits into main from fix/policy-agent-prompt-alignment
