Align PolicyAgent prompt with training format#31

Merged
abrichr merged 6 commits into main from fix/policy-agent-prompt-alignment
Feb 25, 2026
Conversation

@abrichr commented Feb 25, 2026

Summary

Align PolicyAgent prompt format with training data from convert_demos.py, fix broken API calls, and increase Modal inference timeout.

PolicyAgent Prompt Alignment

  • Rewrite _build_sample() to match training format: Instruction: label, indented step history, <think> instruction (was Goal: with accessibility tree and What action should be taken next?)
  • Replace _action_to_string() with training-aligned format — lowercase function-call style with [0,1000] coordinates (click(x=500, y=300)) instead of UPPERCASE normalized format (CLICK(0.500, 0.300))
  • Fix self.policy.predict(sample) to self.policy.predict_action_from_sample(sample) with 4-tuple unpacking — predict() does not exist on AgentPolicy
  • Remove dead SYSTEM_PROMPT from _build_sample() messages — QwenVLAdapter.generate() only extracts user role messages. Training also ignores system prompt, so omitting it keeps inference consistent.
  • Track _previous_actions across steps, clear in reset() (was a no-op)
  • Remove unused methods _format_accessibility_tree() and _format_history()
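The rewritten prompt assembly can be sketched as follows. This is a minimal illustration of the training format described above (an `Instruction:` label, an indented 0-indexed step history, and a `<think>` tail prompt), not the repo's actual `_build_sample()`; the helper name and the exact `<think>` tail text are assumptions.

```python
def build_prompt(instruction, previous_actions):
    """Assemble the inference prompt in the training format produced by
    convert_demos.py. Illustrative sketch; names are not the repo's."""
    lines = [f"Instruction: {instruction}"]
    # History steps are 0-indexed and indented, matching the training data.
    for i, action in enumerate(previous_actions):
        lines.append(f"  Step {i}: {action}")
    # Tail prompt asking the model to reason before acting; the exact
    # wording in the training data may differ.
    lines.append("<think>")
    return "\n".join(lines)
```

Because the prompt must match the training distribution character-for-character, even details like the two-space indent and 0-based step numbering matter.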

Modal Inference

  • Increase timeout from 300s to 600s — 8B model with vision inputs on A10G can take >5 min on cold start
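As a config fragment, the timeout change would look roughly like this in a Modal function definition (app and function names here are hypothetical; `gpu` and `timeout` are real parameters of Modal's `@app.function` decorator):

```python
import modal

app = modal.App("policy-inference")

# Raised from 300s to 600s: an 8B vision model on an A10G can exceed
# 5 minutes on a cold start (weight loading + CUDA initialization).
@app.function(gpu="A10G", timeout=600)
def run_inference(sample: dict) -> str:
    ...
```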

Test plan

  • 393 tests pass, 2 skipped
  • ruff format passes
  • Action format matches convert_demos._format_action_qwen() output character-for-character
  • Prompt format matches convert_demos.convert_step() output

- Import SYSTEM_PROMPT from convert_demos (canonical source)
- Add system message to SFT sample
- Change "Goal:" label to "Instruction:" (training format)
- Remove a11y tree, URL, window title injection (not in training data)
- Add <think> instruction matching training tail prompt
- Format history as "  Step {i}: {action}" (0-indexed, indented)
- Track previous actions across steps (reset on reset())

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
abrichr and others added 5 commits February 25, 2026 12:30
AgentPolicy has predict_action_from_sample() which returns a 4-tuple
(Action, thought, state, raw_text). The previous code called predict()
which doesn't exist on AgentPolicy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
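The calling contract described above can be illustrated with a stub; the real `AgentPolicy` performs model inference, and the return values here are placeholders:

```python
class AgentPolicy:
    """Stub mirroring the 4-tuple contract of predict_action_from_sample();
    the real class lives in the repo and runs the model."""

    def predict_action_from_sample(self, sample):
        # Returns (Action, thought, state, raw_text). Fixed placeholder
        # values so the unpacking pattern can be shown.
        return ("click(x=500, y=300)", "target the button", {}, "<raw model text>")


policy = AgentPolicy()
# The fix: unpack all four values instead of calling a nonexistent predict().
action, thought, state, raw_text = policy.predict_action_from_sample({})
```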
Replace UPPERCASE/normalized format (CLICK(0.500, 0.300)) with
training-aligned format (click(x=500, y=300)): lowercase function
names, [0,1000] coordinates, named parameters, press() for keys,
finished() for done.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
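A minimal sketch of the training-aligned action strings described above: lowercase function-call style with normalized [0, 1] coordinates rescaled to integer [0, 1000]. The canonical formatter is `convert_demos._format_action_qwen()`; the function below and its `press()` signature are illustrative assumptions.

```python
def format_action(kind, x=None, y=None, key=None):
    """Sketch of the training-aligned action format; not the repo's
    actual _format_action_qwen()."""
    if kind == "click":
        # Normalized [0, 1] coordinates rescale to integers in [0, 1000].
        return f"click(x={round(x * 1000)}, y={round(y * 1000)})"
    if kind == "press":
        # Exact press() argument shape in the training data is assumed here.
        return f"press(keys=['{key}'])"
    if kind == "done":
        return "finished()"
    raise ValueError(f"unknown action kind: {kind}")
```

Under this scheme the old `CLICK(0.500, 0.300)` becomes `click(x=500, y=300)`, matching what the model saw during training.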
Vision model inference with large screenshots can take 3+ minutes on
A10G, especially on cold start. 300s was causing premature timeouts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
QwenVLAdapter.generate() only extracts user role messages, dropping
the system prompt. Since training also ignores it, removing it at
inference keeps behaviour consistent and eliminates misleading code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
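The filtering behaviour that motivates this change can be sketched as follows; this is an illustration of the described behaviour, not `QwenVLAdapter`'s actual code:

```python
def extract_user_messages(messages):
    """Only user-role messages reach the model, so a system message in
    the sample would be silently dropped."""
    return [m for m in messages if m["role"] == "user"]


messages = [
    {"role": "system", "content": "You are a GUI agent."},
    {"role": "user", "content": "Instruction: open settings"},
]
# The system message never reaches generation, which is why the PR
# removes it from _build_sample() rather than leaving dead code.
user_only = extract_user_messages(messages)
```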
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@abrichr abrichr merged commit aeac459 into main Feb 25, 2026
4 checks passed