Fix scoring, CU agent error handling, and PolicyAgent prompt alignment #42

Merged
abrichr merged 8 commits into main from
fix/policy-agent-prompt-alignment
Feb 25, 2026

@abrichr abrichr commented Feb 25, 2026

Summary

Comprehensive reliability fixes for evaluation infrastructure, addressing false-positive scoring, agent error handling, PolicyAgent prompt alignment with training data, and Docker networking workarounds.

Scoring & Infrastructure Reliability

  • Guard empty metric_results in evaluate_endpoint.py — min([]) crashed, all([]) returned True (false positive)
  • Remove reasonable_completion false-positive path in mock adapter — done after 2+ actions no longer counts as success
  • Add error_type field to BenchmarkResult ("infrastructure", "agent", "evaluation", or None)
  • Use config.timeout for screenshot/a11y requests in live adapter (was hardcoded 30.0)
  • Health check probes evaluate server (port 5050) in addition to WAA server (port 5001)
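
The empty-list guard can be illustrated with a minimal sketch. The function name `score_metrics` and the `(score, error_type)` return shape are hypothetical and not taken from `evaluate_endpoint.py`; only the behavior of `min([])` and `all([])` and the `"evaluation"` error type come from the description above.

```python
from typing import Optional, Sequence, Tuple


def score_metrics(metric_results: Sequence[float]) -> Tuple[float, Optional[str]]:
    """Return (score, error_type) for a list of per-metric scores.

    Hypothetical helper: the real endpoint may aggregate differently.
    """
    if not metric_results:
        # min([]) raises ValueError and all([]) is vacuously True, so an
        # empty list is reported as an evaluation failure, not a pass.
        return 0.0, "evaluation"
    # Conjunctive scoring: the task scores only as well as its worst metric.
    return float(min(metric_results)), None
```

With this shape, a task that produced no metric results scores 0.0 and is tagged as an evaluation error instead of passing silently.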

CU Agent Error Handling

  • Return type="error" instead of type="done" on all failure paths (no screenshot, API failure, retry exhaustion) — runner can now distinguish infrastructure failures from genuine task completion
  • Early exit on None screenshot — prevents sending "Screenshot unavailable" text to Claude and wasting retries
  • Runner handles type="error" as terminal action, propagates error_type to BenchmarkResult
  • Initialize action = None before runner loop to prevent NameError on 0-iteration edge case
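
The runner-side handling can be sketched roughly as follows. The `Action` and `Result` classes and `run_episode` are simplified stand-ins invented for illustration, not the project's `BenchmarkAction`, `BenchmarkResult`, or runner; the sketch only shows `type="error"` treated as terminal, `error_type` propagated, and `action` initialized before the loop.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:  # hypothetical stand-in for BenchmarkAction
    type: str                         # e.g. "click", "wait", "done", "error"
    error_type: Optional[str] = None  # set by the agent on failure paths


@dataclass
class Result:  # hypothetical stand-in for BenchmarkResult
    finished: bool                    # agent declared the task done
    steps: int
    error_type: Optional[str] = None


TERMINAL_TYPES = {"done", "error"}


def run_episode(next_action: Callable[[int], Action], max_steps: int) -> Result:
    action: Optional[Action] = None   # defined even if the loop never runs
    steps = 0
    while steps < max_steps:
        action = next_action(steps)
        steps += 1
        if action.type in TERMINAL_TYPES:
            break
    if action is not None and action.type == "error":
        # Propagate the agent's error classification to the result.
        return Result(finished=False, steps=steps, error_type=action.error_type)
    return Result(finished=action is not None and action.type == "done", steps=steps)
```

With `max_steps=0` the loop body never executes, and the final checks still work because `action` is already bound to `None`.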

PolicyAgent Prompt Alignment

  • Use parse_qwen_action() instead of parse_action_response() — the base parser only recognizes UPPERCASE format (CLICK(0.5, 0.3)) but the trained model outputs lowercase keyword format (click(x=500, y=300)). Without this fix, every model output silently fell through to type="done".
  • Rewrite _build_prompt() to match training format from convert_demos.py: <image>, Instruction: label, indented step history, <think> instruction
  • Add _format_action_qwen() — formats actions in [0,1000] coordinate range matching training data
  • Load model via QwenVLAdapter.from_pretrained() with proper LoRA config
  • Remove dead SYSTEM_PROMPT: QwenVLAdapter.generate() drops system role messages, so training and inference are now consistently aligned without it
  • Fix temp file leak — track screenshot temp files, clean up in reset()
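
To make the format mismatch concrete, here is a small sketch of parsing and formatting the lowercase keyword style. It is not the implementation of `parse_qwen_action()` or `_format_action_qwen()`; the function names, the dict return shape, and the assumption that internal coordinates are normalized to [0, 1] are all hypothetical. It also raises on unrecognized output, a design choice made here so that parse failures are visible and do not silently become `type="done"`.

```python
import re
from typing import Dict, Union

# Lowercase call syntax such as click(x=500, y=300) or wait().
_CALL_RE = re.compile(r"^\s*([a-z_]+)\((.*)\)\s*$", re.DOTALL)
_KWARG_RE = re.compile(r"([a-z_]+)\s*=\s*(-?\d+)")


def parse_lowercase_action(text: str) -> Dict[str, Union[str, float]]:
    """Parse one lowercase keyword action (hypothetical helper)."""
    match = _CALL_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    name, args = match.group(1), match.group(2)
    if name == "wait":
        return {"type": "wait"}  # a wait must not end the episode
    if name == "click":
        kwargs = {key: int(value) for key, value in _KWARG_RE.findall(args)}
        if "x" not in kwargs or "y" not in kwargs:
            raise ValueError(f"click needs x and y: {text!r}")
        # Assumed conversion: [0, 1000] model coordinates to [0, 1].
        return {"type": "click", "x": kwargs["x"] / 1000, "y": kwargs["y"] / 1000}
    raise ValueError(f"unsupported action: {name}")


def format_click(x: float, y: float) -> str:
    """Format a normalized click in the [0, 1000] keyword style."""
    return f"click(x={round(x * 1000)}, y={round(y * 1000)})"
```

An uppercase positional string like `CLICK(0.5, 0.3)` does not match the lowercase pattern, which mirrors why a parser written for one format cannot read the other.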

Qwen3VL Agent Fixes

  • Map wait() to type="wait" instead of type="done" — prevents premature episode termination when model outputs wait()
  • Fix Modal adapter path from /training/results/final to /training/adapter

Adapter Defensive Handling

  • Handle type="error" in adapters — live adapter _translate_action() returns None, mock adapter treats as terminal
  • Update BenchmarkAction type docstring to include "error"

Infrastructure: Docker Networking Workaround

  • Re-establish socat proxy after container restart — Docker port forwarding for 5050 is broken due to QEMU's --cap-add NET_ADMIN tap networking. The socat proxy (VM:5051 -> docker exec -> container:5050) was only set up on initial creation. Now re-established in:
    • _restart_container() in run_dc_eval.py
    • ensure_waa_ready() tunnel reconnect step
    • WAA_START_SCRIPT ALREADY_RUNNING path in pool.py
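
The "check whether the proxy is alive, restart it if not" pattern can be sketched as below. This is only an illustration: `port_is_open` and `ensure_proxy` are hypothetical names, the TCP connect probe is an assumed way to detect a dead proxy, and the actual socat command is left to a caller-supplied `restart` callback because it is not shown in this PR description.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_proxy(host: str, port: int, restart: Callable[[], None]) -> bool:
    """Call restart() if the port is not accepting connections.

    Returns whether the port is reachable afterwards.
    """
    if port_is_open(host, port):
        return True
    restart()  # e.g. re-launch the proxy process (caller-supplied)
    return port_is_open(host, port)
```

Calling the same check from every recovery path (container restart, tunnel reconnect, already-running start script) keeps the proxy setup idempotent.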

Test plan

  • 316 tests pass
  • Mock adapter no longer gives false positives on done after 2+ actions
  • CU agent returns type=error with error_type=infrastructure on failures
  • PolicyAgent uses parse_qwen_action() for lowercase keyword format
  • wait() action continues episode instead of terminating it
  • Socat proxy re-established after container restart in all code paths

abrichr and others added 3 commits February 25, 2026 11:35
- Guard empty metric_results in evaluate_endpoint.py to prevent
  min([])/all([]) crashes and false positives
- Add error_type field to BenchmarkResult for distinguishing
  infrastructure vs agent vs evaluation failures
- Set error_type="infrastructure" on evaluation timeout and request errors
- Use config.timeout instead of hardcoded 30s for screenshot/a11y requests
- Remove reasonable_completion false-positive path in mock adapter
  (calling done after 2+ actions no longer counts as success)
- Health check now probes both WAA server and evaluate server

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Early exit with type="error" when no screenshot available (infrastructure)
- Return type="error" on API call failure instead of type="done"
- Return type="error" on retry exhaustion (screenshot/wait loop)
- Runner handles type="error" as terminal action alongside type="done"
- Runner propagates error_type from agent error to BenchmarkResult
- Update tests to verify new error action behavior

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add SYSTEM_PROMPT matching openadapt_ml.training.convert_demos
- Change "TASK:" label to "Instruction:" (training format)
- Remove a11y tree injection (not present in training data)
- Add <think> instruction matching training tail prompt
- Format history as "  Step {i}: {action}" (0-indexed, indented)
- Build SFT-style samples with system/user message structure
- Track previous actions across steps (reset on reset())
- Replace broken openadapt_ml.vlm import with models.get_adapter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr changed the base branch from fix/cu-agent-reliability to main February 25, 2026 17:21
@abrichr abrichr changed the title from "Align PolicyAgent prompt with training format" to "Fix scoring, CU agent error handling, and PolicyAgent prompt alignment" Feb 25, 2026
abrichr and others added 5 commits February 25, 2026 12:30
The previous code imported get_adapter from openadapt_ml.models which
doesn't exist. Use QwenVLAdapter.from_pretrained() with lora_config
for checkpoint loading, matching the actual openadapt-ml API. Also
update default model_name to valid HuggingFace model ID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _format_action_qwen() matching training format from convert_demos:
  lowercase function names, [0,1000] coords, named params
- Initialize action=None before runner while loop to prevent NameError
- Handle screenshot bytes when screenshot_path is None (temp file fallback)
- Add NOTE about QwenVLAdapter dropping system prompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The PEFT adapter is uploaded to volume at /adapter by
upload_adapter_to_volume(), and the volume mounts at /training,
so the full path is /training/adapter (not /training/results/final).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fense

- PolicyAgent._parse_response() now uses parse_qwen_action() instead of
  parse_action_response(), which could not parse the lowercase keyword
  format the trained model outputs (click(x=500, y=300))
- wait() action now maps to type="wait" instead of type="done" in
  parse_qwen_action(), preventing premature episode termination
- Adapters defensively handle type="error" alongside type="done"
- Remove dead SYSTEM_PROMPT from PolicyAgent (QwenVLAdapter.generate()
  drops system role; training and inference are now consistently aligned)
- Fix temp file leak: track and clean up screenshot temp files in reset()
- Update BenchmarkAction type docstring to include "error"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker port forwarding for port 5050 (evaluate server) is broken due
to QEMU's custom bridge networking.  The socat proxy on the VM host
(VM:5051 → docker exec → container:5050) was only set up on initial
container creation, so any `docker restart` left the evaluate server
unreachable.

- Add _setup_eval_proxy() to run_dc_eval.py; called after container
  restart and during tunnel reconnect recovery
- Fix WAA_START_SCRIPT ALREADY_RUNNING path in pool.py to check if
  socat proxy is alive and restart it if dead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr merged commit 6b7d08c into main Feb 25, 2026
1 check passed
@abrichr abrichr deleted the fix/policy-agent-prompt-alignment branch February 28, 2026 14:53