Align PolicyAgent prompt with training format #31
Merged
- Import SYSTEM_PROMPT from convert_demos (canonical source)
- Add system message to SFT sample
- Change "Goal:" label to "Instruction:" (training format)
- Remove a11y tree, URL, window title injection (not in training data)
- Add <think> instruction matching training tail prompt
- Format history as " Step {i}: {action}" (0-indexed, indented)
- Track previous actions across steps (reset on reset())
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
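The history and prompt formatting described in the bullets above can be sketched as follows. This is a minimal sketch: the function names and the exact wording of the `<think>` tail are assumptions; only the ` Step {i}: {action}` shape and the `Instruction:` label come from the PR.

```python
def format_history(previous_actions):
    # 0-indexed, indented step lines, matching the training-data layout.
    return "\n".join(f" Step {i}: {action}" for i, action in enumerate(previous_actions))

def build_prompt(instruction, previous_actions):
    # "Instruction:" label plus a trailing <think> cue; the exact tail
    # wording used in training is an assumption here.
    parts = [f"Instruction: {instruction}"]
    if previous_actions:
        parts.append(format_history(previous_actions))
    parts.append("<think>")
    return "\n".join(parts)
```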
AgentPolicy has predict_action_from_sample(), which returns a 4-tuple (Action, thought, state, raw_text). The previous code called predict(), which doesn't exist on AgentPolicy.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
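A sketch of the call-site fix, with a hypothetical stub standing in for the real `AgentPolicy` (only the method name and the 4-tuple shape come from the PR):

```python
class AgentPolicy:
    def predict_action_from_sample(self, sample):
        # Stub mirroring the documented return shape:
        # (Action, thought, state, raw_text).
        return ({"type": "click"}, "thought text", "state text", "raw model output")

policy = AgentPolicy()
sample = {"messages": []}
# Before the fix: policy.predict(sample) raised AttributeError -- no such method.
action, thought, state, raw_text = policy.predict_action_from_sample(sample)
```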
Replace the UPPERCASE/normalized format (CLICK(0.500, 0.300)) with the training-aligned format (click(x=500, y=300)): lowercase function names, [0,1000] coordinates, named parameters, press() for keys, and finished() for done.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
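A minimal sketch of the training-aligned formatter, assuming input coordinates are normalized to [0,1] (as the `CLICK(0.500, 0.300)` example suggests); the `press()` parameter spelling is an assumption:

```python
def action_to_string(action_type, x=None, y=None, key=None):
    # Lowercase names, named parameters, coordinates scaled to [0, 1000].
    if action_type == "click":
        return f"click(x={round(x * 1000)}, y={round(y * 1000)})"
    if action_type == "press":
        return f"press(key='{key}')"  # parameter name is an assumption
    if action_type == "done":
        return "finished()"
    raise ValueError(f"unsupported action type: {action_type}")
```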
Vision-model inference with large screenshots can take 3+ minutes on an A10G, especially on cold start, so the 300s limit was causing premature timeouts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
QwenVLAdapter.generate() only extracts user-role messages, dropping the system prompt. Since training also ignores it, removing it at inference keeps behaviour consistent and eliminates misleading code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
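The role-filtering behaviour described here can be illustrated with a small sketch; the function name and message shape are assumptions about the adapter's internals:

```python
def extract_user_contents(messages):
    # Only user-role messages survive, mirroring the described behaviour
    # of QwenVLAdapter.generate(); a system message is silently dropped,
    # which is why the caller no longer bothers to include one.
    return [m["content"] for m in messages if m.get("role") == "user"]

msgs = [
    {"role": "system", "content": "SYSTEM_PROMPT"},
    {"role": "user", "content": "Instruction: open the settings page"},
]
```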
Summary
Align the PolicyAgent prompt format with the training data from `convert_demos.py`, fix broken API calls, and increase the Modal inference timeout.

PolicyAgent Prompt Alignment

- Rewrote `_build_sample()` to match the training format: `Instruction:` label, indented step history, and a `<think>` instruction (was `Goal:` with an accessibility tree and "What action should be taken next?")
- Replaced `_action_to_string()` with the training-aligned format: lowercase function-call style with [0,1000] coordinates (`click(x=500, y=300)`) instead of the UPPERCASE normalized format (`CLICK(0.500, 0.300)`)
- Changed `self.policy.predict(sample)` to `self.policy.predict_action_from_sample(sample)` with 4-tuple unpacking; `predict()` does not exist on `AgentPolicy`
- Removed `SYSTEM_PROMPT` from the `_build_sample()` messages: `QwenVLAdapter.generate()` only extracts user-role messages, and training also ignores the system prompt, so omitting it keeps inference consistent
- Track `_previous_actions` across steps and clear them in `reset()` (previously a no-op)
- Removed the now-unused `_format_accessibility_tree()` and `_format_history()`

Modal Inference

- Increased the inference timeout: vision-model inference with large screenshots can take 3+ minutes on an A10G, especially on cold start, and the previous 300s limit caused premature timeouts
Test plan
- Verified the new action formatting matches `convert_demos._format_action_qwen()` output character-for-character
- Verified the built SFT sample matches `convert_demos.convert_step()` output