Fix scoring, CU agent error handling, and PolicyAgent prompt alignment #42
Merged
- Guard empty metric_results in evaluate_endpoint.py to prevent min([])/all([]) crashes and false positives
- Add error_type field to BenchmarkResult for distinguishing infrastructure vs agent vs evaluation failures
- Set error_type="infrastructure" on evaluation timeout and request errors
- Use config.timeout instead of hardcoded 30s for screenshot/a11y requests
- Remove reasonable_completion false-positive path in mock adapter (calling done after 2+ actions no longer counts as success)
- Health check now probes both WAA server and evaluate server
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
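The empty-list hazard in the first bullet comes from Python's builtins: min([]) raises ValueError, while all([]) returns True. The sketch below shows one way to guard against both. The function names score_metrics and all_passed are hypothetical and are not taken from evaluate_endpoint.py.

```python
from typing import List, Optional


def score_metrics(metric_results: List[float]) -> Optional[float]:
    """Return the lowest metric score, or None when nothing was evaluated."""
    if not metric_results:
        # min([]) would raise ValueError, so handle the empty case first.
        return None
    return min(metric_results)


def all_passed(metric_results: List[float], threshold: float = 1.0) -> bool:
    """True only if at least one metric ran and every metric met the threshold."""
    # all([]) is True, so an explicit non-empty check avoids a false positive.
    return bool(metric_results) and all(r >= threshold for r in metric_results)
```

Returning None for the empty case lets the caller report an evaluation failure rather than a score of zero or a spurious pass.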
- Early exit with type="error" when no screenshot available (infrastructure)
- Return type="error" on API call failure instead of type="done"
- Return type="error" on retry exhaustion (screenshot/wait loop)
- Runner handles type="error" as terminal action alongside type="done"
- Runner propagates error_type from agent error to BenchmarkResult
- Update tests to verify new error action behavior
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
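A minimal sketch of how a runner loop might treat type="error" as terminal and carry error_type into the result. The Action and Result dataclasses and run_episode are hypothetical stand-ins for BenchmarkAction, BenchmarkResult, and the real runner, whose fields are not shown in this PR description.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TERMINAL_TYPES = {"done", "error"}


@dataclass
class Action:  # hypothetical stand-in for BenchmarkAction
    type: str
    error_type: Optional[str] = None


@dataclass
class Result:  # hypothetical stand-in for BenchmarkResult
    steps: int
    error_type: Optional[str] = None


def run_episode(next_action: Callable[[int], Action], max_steps: int) -> Result:
    action: Optional[Action] = None  # stays None if the loop runs zero times
    steps = 0
    while steps < max_steps:
        action = next_action(steps)
        steps += 1
        if action.type in TERMINAL_TYPES:
            break  # both "done" and "error" end the episode
    # Only an error action contributes an error_type to the result.
    failed = action is not None and action.type == "error"
    return Result(steps=steps, error_type=action.error_type if failed else None)
```

With this shape, a caller can tell an infrastructure failure (error_type set) apart from a genuine done action (error_type left as None).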
- Add SYSTEM_PROMPT matching openadapt_ml.training.convert_demos
- Change "TASK:" label to "Instruction:" (training format)
- Remove a11y tree injection (not present in training data)
- Add <think> instruction matching training tail prompt
- Format history as " Step {i}: {action}" (0-indexed, indented)
- Build SFT-style samples with system/user message structure
- Track previous actions across steps (reset on reset())
- Replace broken openadapt_ml.vlm import with models.get_adapter
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
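The prompt layout listed above can be sketched roughly as follows. This is an illustration only: build_prompt is a hypothetical name, the THINK_INSTRUCTION text is a placeholder (the real tail prompt lives in the training code and is not quoted in this PR), and the exact indentation of the history lines is assumed from the " Step {i}: {action}" pattern in the commit message.

```python
from typing import List

# Placeholder wording; the actual training tail prompt is not shown in this PR.
THINK_INSTRUCTION = "Reason inside <think></think> before giving the next action."


def build_prompt(instruction: str, previous_actions: List[str]) -> str:
    """Assemble a user prompt using the 'Instruction:' label and step history."""
    lines = ["<image>", f"Instruction: {instruction}"]
    # History is 0-indexed and indented, one line per earlier action.
    lines.extend(f" Step {i}: {a}" for i, a in enumerate(previous_actions))
    lines.append(THINK_INSTRUCTION)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Open Notepad", ["click(x=500, y=300)", "wait()"]))
```

Keeping the inference prompt byte-for-byte close to the training format matters because a fine-tuned model is sensitive to label and layout changes such as "TASK:" versus "Instruction:".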
This was referenced Feb 25, 2026
The previous code imported get_adapter from openadapt_ml.models which doesn't exist. Use QwenVLAdapter.from_pretrained() with lora_config for checkpoint loading, matching the actual openadapt-ml API. Also update default model_name to valid HuggingFace model ID. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _format_action_qwen() matching training format from convert_demos: lowercase function names, [0,1000] coords, named params
- Initialize action=None before runner while loop to prevent NameError
- Handle screenshot bytes when screenshot_path is None (temp file fallback)
- Add NOTE about QwenVLAdapter dropping system prompt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
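To illustrate the formatting rule in the first bullet (lowercase function names, [0,1000] coordinates, named parameters), here is a small sketch. It assumes the agent holds coordinates normalized to [0, 1]; format_action_qwen is a hypothetical simplification covering only coordinate actions, not the PR's _format_action_qwen().

```python
from typing import Optional


def format_action_qwen(
    action_type: str, x: Optional[float] = None, y: Optional[float] = None
) -> str:
    """Render an action as a lowercase call with named integer coordinates."""
    name = action_type.lower()
    params = []
    if x is not None and y is not None:
        # Scale normalized [0, 1] coordinates to integers in [0, 1000].
        params.append(f"x={round(x * 1000)}")
        params.append(f"y={round(y * 1000)}")
    return f"{name}({', '.join(params)})"


if __name__ == "__main__":
    print(format_action_qwen("CLICK", 0.5, 0.3))
```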
The PEFT adapter is uploaded to volume at /adapter by upload_adapter_to_volume(), and the volume mounts at /training, so the full path is /training/adapter (not /training/results/final). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…fense
- PolicyAgent._parse_response() now uses parse_qwen_action() instead of parse_action_response(), which could not parse the lowercase keyword format the trained model outputs (click(x=500, y=300))
- wait() action now maps to type="wait" instead of type="done" in parse_qwen_action(), preventing premature episode termination
- Adapters defensively handle type="error" alongside type="done"
- Remove dead SYSTEM_PROMPT from PolicyAgent (QwenVLAdapter.generate() drops system role; training and inference are now consistently aligned)
- Fix temp file leak: track and clean up screenshot temp files in reset()
- Update BenchmarkAction type docstring to include "error"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
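The parsing behavior described above can be sketched with a small regex-based parser. This is not the PR's parse_qwen_action(); the function name, the dict return shape, and the choice to return type="error" for unparseable output are all assumptions made for illustration.

```python
import re
from typing import Any, Dict

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_CALL_RE = re.compile(r"([a-z_]+)\((.*?)\)", re.DOTALL)
_KWARG_RE = re.compile(r"(\w+)\s*=\s*(-?\d+)")


def parse_qwen_action_sketch(response: str) -> Dict[str, Any]:
    """Parse lowercase keyword calls such as click(x=500, y=300)."""
    # Drop any reasoning block so calls mentioned inside it are ignored.
    text = _THINK_RE.sub("", response)
    match = _CALL_RE.search(text)
    if match is None:
        # Assumption: surface unparseable output as an error, not as "done".
        return {"type": "error", "error_type": "agent"}
    name, arg_text = match.group(1), match.group(2)
    params = {key: int(value) for key, value in _KWARG_RE.findall(arg_text)}
    # wait() keeps its own type, so it does not end the episode early.
    return {"type": name, **params}
```

Because the name pattern is lowercase-only, an UPPERCASE call like CLICK(0.5, 0.3) does not match in this sketch, which mirrors the format mismatch the commit describes.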
Docker port forwarding for port 5050 (evaluate server) is broken due to QEMU's custom bridge networking. The socat proxy on the VM host (VM:5051 → docker exec → container:5050) was only set up on initial container creation, so any `docker restart` left the evaluate server unreachable.
- Add _setup_eval_proxy() to run_dc_eval.py; called after container restart and during tunnel reconnect recovery
- Fix WAA_START_SCRIPT ALREADY_RUNNING path in pool.py to check if socat proxy is alive and restart it if dead
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
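One generic way to tell whether a forwarded port such as VM:5051 is still being served is a plain TCP connect probe. The sketch below uses only Python's standard socket module; port_is_open is a hypothetical helper and is not the check that pool.py or run_dc_eval.py actually performs.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # 5051 is the proxy port named in the commit message above.
    print(port_is_open("127.0.0.1", 5051))
```

A successful connect only shows that something is listening on the port; confirming that the evaluate server behind the proxy responds would need an application-level request.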
Summary
Comprehensive reliability fixes for evaluation infrastructure, addressing false-positive scoring, agent error handling, PolicyAgent prompt alignment with training data, and Docker networking workarounds.
Scoring & Infrastructure Reliability
- Guard empty metric_results in evaluate_endpoint.py: min([]) crashed and all([]) returned True (a false positive)
- Remove reasonable_completion false-positive path in mock adapter: done after 2+ actions no longer counts as success
- Add error_type field to BenchmarkResult ("infrastructure", "agent", "evaluation", or None)
- Use config.timeout for screenshot/a11y requests in live adapter (was hardcoded 30.0)
CU Agent Error Handling
- Return type="error" instead of type="done" on all failure paths (no screenshot, API failure, retry exhaustion), so the runner can now distinguish infrastructure failures from genuine task completion
- Exit early on a None screenshot, which prevents sending "Screenshot unavailable" text to Claude and wasting retries
- Runner handles type="error" as a terminal action and propagates error_type to BenchmarkResult
- Initialize action = None before the runner loop to prevent NameError on the 0-iteration edge case
PolicyAgent Prompt Alignment
- Use parse_qwen_action() instead of parse_action_response(): the base parser only recognizes the UPPERCASE format (CLICK(0.5, 0.3)), but the trained model outputs the lowercase keyword format (click(x=500, y=300)). Without this fix, every model output silently fell through to type="done".
- Align _build_prompt() with the training format from convert_demos.py: <image>, Instruction: label, indented step history, <think> instruction
- Add _format_action_qwen(), which formats actions in the [0,1000] coordinate range matching the training data
- Load checkpoints with QwenVLAdapter.from_pretrained() and a proper LoRA config
- Remove dead SYSTEM_PROMPT: QwenVLAdapter.generate() drops system role messages, so training and inference are now consistently aligned without it
- Clean up screenshot temp files in reset()
Qwen3VL Agent Fixes
- Map wait() to type="wait" instead of type="done", which prevents premature episode termination when the model outputs wait()
- Change the adapter path from /training/results/final to /training/adapter
Adapter Defensive Handling
- Handle type="error" in adapters: live adapter _translate_action() returns None, mock adapter treats it as terminal
- Update BenchmarkAction type docstring to include "error"
Infrastructure: Docker Networking Workaround
Docker port forwarding for port 5050 (evaluate server) is broken under QEMU's --cap-add NET_ADMIN tap networking. The socat proxy (VM:5051 -> docker exec -> container:5050) was only set up on initial creation. It is now re-established in:
- _restart_container() in run_dc_eval.py
- ensure_waa_ready() tunnel reconnect step
- WAA_START_SCRIPT ALREADY_RUNNING path in pool.py
Test plan