Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection #13

Open
kevinngo1304 wants to merge 18 commits into main from improve-personas-and-judge

Conversation

@kevinngo1304
Collaborator

PR: Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection

Branch: improve-personas-and-judge → main
11 commits | 12 files changed | +401 / -28 lines


Summary

This PR adds three major capabilities to Murphy's persona-driven testing pipeline:

  1. New UI-focused test personas — Three new built-in personas (boomer_ui, genz_ui, whitespace_police_ui) that evaluate visual design quality rather than functional correctness, each with dedicated trait dimensions and judge criteria.

  2. Per-persona feature suggestions — Every persona now produces 1–3 concrete, actionable feature/UX improvement suggestions grounded in what it observed during testing. Suggestions flow through the entire pipeline: execution, judging, HTML/Markdown reports, and executive summary.

  3. Smarter screenshot selection for the judge — Screenshots sent to the judge are now selected by action signal strength (navigation, text input, errors, final state) instead of simple recency, so the judge sees the most informative visual progression.


What changed

New UI personas and design test type

  • murphy/models.py — Added boomer_ui, genz_ui, whitespace_police_ui to TestPersona. Extended TraitVector with three new UI dimensions (visual_density_preference, aesthetic_era, layout_strictness). Added design to TestType. Registered all three personas in PERSONA_REGISTRY with full trait vectors.
  • murphy/core/judge.py — Added TRAIT_JUDGE_QUESTIONS entries for visual_density_preference and layout_strictness. Added a design rule to TEST_TYPE_RULES. build_judge_trait_context now includes the UI traits when test type is design.
  • murphy/prompts.py — Added persona descriptions, distribution percentages, execution behavior instructions, and success criteria examples for all three UI personas. Rebalanced the persona distribution (total still 100%).
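The model changes above can be sketched roughly like this (plain dataclasses stand in for the actual Pydantic models, and the trait values assigned to each persona are illustrative guesses, not the registry's real numbers):

```python
from dataclasses import dataclass
from enum import Enum

class TestType(str, Enum):
    FUNCTIONAL = "functional"
    DESIGN = "design"  # new test type for visual-design evaluation

@dataclass
class TraitVector:
    # existing functional dimensions elided; the three new UI dimensions:
    visual_density_preference: float = 0.5  # 0 = prefers whitespace, 1 = prefers dense UIs
    aesthetic_era: float = 0.5              # 0 = classic-era tastes, 1 = current-trend tastes
    layout_strictness: float = 0.5          # 0 = forgiving of misalignment, 1 = pixel-perfect

# hypothetical registry entries for the three new personas
PERSONA_REGISTRY = {
    "boomer_ui": TraitVector(visual_density_preference=0.7, aesthetic_era=0.1, layout_strictness=0.4),
    "genz_ui": TraitVector(visual_density_preference=0.3, aesthetic_era=0.9, layout_strictness=0.3),
    "whitespace_police_ui": TraitVector(visual_density_preference=0.0, aesthetic_era=0.5, layout_strictness=1.0),
}
```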

Per-persona feature suggestions

  • murphy/models.py — Added feature_suggestions: list[str] to both ScenarioExecutionVerdict and TestResult.
  • murphy/prompts.py — Added _PERSONA_SUGGESTION_INSTRUCTIONS dict with tailored suggestion prompts for every built-in persona, plus _build_suggestion_instruction() which falls back to discovered persona instructions. Injected into build_execution_prompt.
  • murphy/personas/pipeline_models.py — Added suggestion_instruction field to PersonaDescription and Persona.
  • murphy/personas/persona_labeling.py — Updated the LLM labeling prompt to request a suggestion_instruction per cluster; wired it through build_persona_result.
  • murphy/personas/bridge.py — Added get_discovered_suggestion_instruction() to look up discovered persona suggestions.
  • murphy/core/execution.py — Propagated feature_suggestions from the agent's verdict into TestResult.
  • murphy/core/summary.py — Aggregated all feature suggestions into the executive summary prompt so recommended_actions are informed by persona-grounded suggestions.
  • murphy/io/report_markdown.py — Renders per-test suggestions in detail sections and an aggregated collapsible "Feature Suggestions" table in the report.
  • murphy/api/templates.py — Renders feature suggestions in the HTML results view.

Smarter screenshot selection

  • murphy/core/judge.py — Added _select_key_screenshots() which scores each agent step by action type signal strength (high: navigate, input_text, done; low: scroll, refresh_dom_state) and error presence, then picks the top N most informative screenshots in chronological order. Replaced the old history.screenshots(n_last=3) call.

Housekeeping

  • CHANGELOG.md — Documented all additions and changes under [1.1.0].
  • tests/murphy/personas/test_persona_labeling.py — Updated mock data and assertions to cover suggestion_instruction.

fjfok and others added 11 commits April 21, 2026 15:59
…recency

Replace history.screenshots(n_last=3) with _select_key_screenshots() which scores
each agent step by the actions it performed:
- +10 for the final step (always most important)
- +3 for high-signal actions: navigate, input_text, done, select_dropdown_option,
  upload_file, evaluate (JS mutation)
- +1 for mid-signal actions: clicks and other interactions
- +4 for steps that produced errors
- 0 for low-signal actions: scroll, refresh_dom_state, search_page, wait

Top-N steps are returned in chronological order so the judge sees the visual
progression rather than just the end state. Falls back to the last screenshot
when no steps score above zero.
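The scoring above could look roughly like this (a self-contained sketch; the real implementation operates on browser-use agent-history objects rather than plain dicts):

```python
HIGH_SIGNAL = {"navigate", "input_text", "done", "select_dropdown_option",
               "upload_file", "evaluate"}
LOW_SIGNAL = {"scroll", "refresh_dom_state", "search_page", "wait"}

def select_key_screenshots(steps: list, n: int = 3) -> list:
    """steps: [{"actions": [...], "error": bool, "screenshot": path}] in order."""
    scored = []
    for i, step in enumerate(steps):
        score = 10 if i == len(steps) - 1 else 0  # final step always most important
        for action in step["actions"]:
            if action in HIGH_SIGNAL:
                score += 3
            elif action in LOW_SIGNAL:
                pass  # low-signal actions add nothing
            else:
                score += 1  # mid-signal: clicks and other interactions
        if step.get("error"):
            score += 4
        if score > 0:
            scored.append((score, i))
    if not scored:  # defensive fallback to the last screenshot
        return [steps[-1]["screenshot"]] if steps else []
    top = sorted(scored, reverse=True)[:n]
    # return in chronological order so the judge sees the visual progression
    indices = sorted(i for _, i in top)
    return [steps[i]["screenshot"] for i in indices]
```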

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Each persona now produces 1-3 concrete, actionable feature/UX improvement
suggestions grounded in what they observed during testing. Suggestions flow
through the agent verdict, judge verdict, and into HTML reports and the
executive summary.

…d personas

The discovered persona rendering (description, traits, execution hints)
was computed but discarded — the prompt always called the predefined
renderer, which returns near-empty output for discovered persona slugs.
Each discovered persona now carries a tailored suggestion_instruction
generated during labeling, which is used in the execution prompt to
produce persona-grounded feature suggestions instead of generic ones.

The judge was duplicating the agent's feature-suggestion work, resulting
in ~6 suggestions per test instead of the intended 1-3. Remove the
feature_suggestions prompt and field from the judge, use only the
agent's suggestions, and make the report section collapsible.
@kevinngo1304 changed the title from "Improve personas and judge" to "Improve personas and judge — UI personas, feature suggestions, and smarter screenshot selection" on Apr 22, 2026

github-actions Bot commented Apr 22, 2026

Agent Task Evaluation Results: 2/2 (100%)

View detailed results
Task              Result    Reason
amazon_laptop     ✅ Pass   Skipped - API key not available (fork PR or missing secret)
browser_use_pip   ✅ Pass   Skipped - API key not available (fork PR or missing secret)

Check the evaluate-tasks job for detailed task execution logs.

Pydantic model guarantees these fields are never None, making the
trailing `or ''` and `or []` unnecessary.

boomer_ui -> classic_ui, genz_ui -> modern_ui,
whitespace_police_ui -> layout_auditor_ui

…e-file change

Trait names, judge questions, summary names, and the drift assertion all
live in models.py now. judge.py and prompts.py derive their trait lists
from TraitVector class vars instead of maintaining their own copies.
Persona badges and grouping now gracefully handle personas not in the
predefined list, using a default badge color and stable ordering
(predefined first, then discovered alphabetically).

The judge was returning empty trait_evaluations because OpenAI strict
mode sets additionalProperties:false on all objects, blocking dynamic
keys in dict[str, str] fields. Switch JudgeVerdict.trait_evaluations
to list[TraitEvaluation] (structured objects with trait_name + assessment)
and enrich the discovered-persona judge context with per-trait evaluation
questions derived from each dimension's low/high anchors.
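A sketch of the before/after shape (dataclasses stand in for the actual Pydantic models; the `passed` field and the example assessment text are hypothetical):

```python
from dataclasses import dataclass, field

# Before: trait_evaluations: dict[str, str] — OpenAI strict structured output
# sets additionalProperties: false on every object, so the model cannot emit
# dynamic trait-name keys and the dict comes back empty.

# After: a list of fixed-shape objects, which strict mode can validate.
@dataclass
class TraitEvaluation:
    trait_name: str
    assessment: str

@dataclass
class JudgeVerdict:
    passed: bool  # hypothetical field for illustration
    trait_evaluations: list = field(default_factory=list)

verdict = JudgeVerdict(
    passed=True,
    trait_evaluations=[
        TraitEvaluation("layout_strictness", "Flagged two misaligned cards on the grid."),
    ],
)
```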