feat: mode-posture energy fix for /plan-ceo-review and /office-hours (v1.1.2.0) #1065
Merged
…tput
Rewrites Writing Style rule 2-4 examples in scripts/resolvers/preamble.ts
to cover three framing families (pain reduction, upside/delight, forcing
pressure) instead of diagnostic-pain only. Adds inline exemplars to
plan-ceo-review (0D-prelude shared between SCOPE + SELECTIVE EXPANSION)
and office-hours (Q3 forcing exemplar with career/day/weekend domain
gating, builder operating principles wild exemplar).
V1 shipped rule 2-4 examples that all pointed to diagnostic-pain framing
("3-second spinner", "double-click button"). Models follow concrete
examples over abstract taxonomies, so any skill with a non-diagnostic
mode posture (expansion, forcing, delight) got flattened at runtime
even when the template itself said "dream big" or "direct to the point
of discomfort." This change targets the actual lever: swap the single
diagnostic example for three paired framings, one per posture family.
Preserves V1 clarity gains — rules 2, 3, 4 principles unchanged, only
examples expanded. Terse mode (EXPLAIN_LEVEL: terse) still skips the
block entirely.
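
As a rough illustration of the shape described above (one example per framing family, nothing at all in terse mode), here is a minimal sketch. It is not the code in `scripts/resolvers/preamble.ts`: the `renderStyleExamples` helper, the `"default"` level name, and the placeholder strings are all assumptions made for illustration.

```typescript
// Sketch only. "terse" comes from the commit message; "default" is an assumed name.
type ExplainLevel = "default" | "terse";

// The three framing families named in the commit message.
type FramingFamily = "pain_reduction" | "upside_delight" | "forcing_pressure";

// Placeholder strings: the actual exemplars live in scripts/resolvers/preamble.ts.
const FRAMING_EXAMPLES: Record<FramingFamily, string> = {
  pain_reduction: "<pain-reduction example>",
  upside_delight: "<upside/delight example>",
  forcing_pressure: "<forcing-pressure example>",
};

// Hypothetical helper: renders one example per family, or nothing in terse mode.
function renderStyleExamples(level: ExplainLevel): string {
  if (level === "terse") return ""; // terse mode skips the block entirely
  return Object.entries(FRAMING_EXAMPLES)
    .map(([family, example]) => `- ${family}: ${example}`)
    .join("\n");
}
```

Keying the examples by family makes it hard to add a rule example for one posture without noticing that the other two are missing.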
Mechanical cascade from `bun run gen:skill-docs --host all` after the Writing Style rule 2-4 example rewrite and the plan-ceo-review / office-hours template exemplar additions. No hand edits — every change flows from the prior commit's templates.
Three gate-tier E2E tests detect when preamble / template changes flatten the distinctive posture of /plan-ceo-review SCOPE EXPANSION or /office-hours (startup Q3, builder mode). The V1 regression that this PR fixes shipped without anyone catching it at ship time. These tests are the ongoing signal so the same thing doesn't happen again.

Pieces:
- `judgePosture(mode, text)` in `test/helpers/llm-judge.ts`. Sonnet judge with mode-specific dual-axis rubric (expansion: surface_framing + decision_preservation; forcing: stacking_preserved + domain_matched_consequence; builder: unexpected_combinations + excitement_over_optimization). Pass threshold 4/5 on both axes.
- Three fixtures in `test/fixtures/mode-posture/`: deterministic input for expansion proposal generation, Q3 forcing question, and builder adjacent-unlock riffing.
- `plan-ceo-review-expansion-energy` case appended to `test/skill-e2e-plan.test.ts`. Generator: Opus (skill default). Judge: Sonnet.
- New `test/skill-e2e-office-hours.test.ts` with `office-hours-forcing-energy` + `office-hours-builder-wildness` cases. Generator: Sonnet. Judge: Sonnet.
- Touchfile registration in `test/helpers/touchfiles.ts`: all three as `gate` tier in `E2E_TIERS`, triggered by changes to `scripts/resolvers/preamble.ts`, the relevant skill template, the judge helper, or any mode-posture fixture.

Cost: ~$0.50-$1.50 per triggered PR. Sonnet judge is cheap; Opus generator for the plan-ceo-review case dominates.

Known V1.1 tradeoff: judges test prose markers more than deep behavior. V1.2 candidate is a cross-provider (Codex) adversarial judge on the same output to decouple house-style bias.
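
The dual-axis pass rule can be sketched as below. The mode names, axis names, and the 4/5 threshold come from the commit message; the `passesPosture` function and the score-object shape are hypothetical, since the actual return type of `judgePosture` is not shown here.

```typescript
// Sketch only: the real judgePosture helper calls an LLM judge; this covers
// just the pass/fail decision once per-axis scores are available.
type PostureMode = "expansion" | "forcing" | "builder";

// Axis names as listed in the commit message, two per mode.
const AXES: Record<PostureMode, [string, string]> = {
  expansion: ["surface_framing", "decision_preservation"],
  forcing: ["stacking_preserved", "domain_matched_consequence"],
  builder: ["unexpected_combinations", "excitement_over_optimization"],
};

const PASS_THRESHOLD = 4; // "4/5 on both axes"

// Hypothetical helper: passes only if BOTH axes for the mode meet the threshold.
// A missing axis score counts as 0, so an incomplete judge response fails.
function passesPosture(mode: PostureMode, scores: Record<string, number>): boolean {
  return AXES[mode].every((axis) => (scores[axis] ?? 0) >= PASS_THRESHOLD);
}
```

Requiring both axes, rather than averaging them, means a strong score on one axis cannot mask a flattened posture on the other.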
… entries
Mechanical test updates after the mode-posture work:
- Golden ship SKILL.md baselines (claude + codex + factory hosts) regenerate with the rewritten Writing Style rule 2-4 examples from preamble.ts.
- Touchfile selection test expects 6 matches for a plan-ceo-review/ change (was 5) because E2E_TOUCHFILES now includes plan-ceo-review-expansion-energy.
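
To see why adding an entry changes the match count, here is a minimal sketch of touchfile-based selection. It assumes plain path-prefix matching and uses a two-entry registry; the `selectTests` helper, the `office-hours/` template path, and the matching rule are assumptions, and the actual logic in `test/helpers/touchfiles.ts` may differ (for example, by using globs).

```typescript
// Hypothetical registry: test name -> path prefixes whose changes should trigger it.
const TOUCHFILES: Record<string, string[]> = {
  "plan-ceo-review-expansion-energy": [
    "plan-ceo-review/",
    "scripts/resolvers/preamble.ts",
    "test/helpers/llm-judge.ts",
    "test/fixtures/mode-posture/",
  ],
  "office-hours-forcing-energy": [
    "office-hours/", // assumed template location
    "scripts/resolvers/preamble.ts",
    "test/helpers/llm-judge.ts",
    "test/fixtures/mode-posture/",
  ],
};

// Returns the sorted names of tests with at least one touchfile matching a changed path.
function selectTests(changed: string[], registry: Record<string, string[]>): string[] {
  return Object.entries(registry)
    .filter(([, prefixes]) =>
      prefixes.some((prefix) => changed.some((file) => file.startsWith(prefix))),
    )
    .map(([name]) => name)
    .sort();
}
```

Under this scheme, every new entry that lists `plan-ceo-review/` adds one more match for a change in that directory, which is the kind of count bump the selection test has to track.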
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E Evals: ✅ PASS
64/64 tests passed | $7.58 total cost | 12 parallel runners
12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite
Summary
Restores the distinctive energy of two skills after V1's Writing Style rules shipped.
The problem: V1 (v1.0.0.0) added Writing Style rules 2–4 to scripts/resolvers/preamble.ts. Each rule had one example, and all three examples pointed to diagnostic-pain framing ("3-second spinner", "double-click button"). Models follow concrete examples more reliably than abstract rules, so any skill with a non-diagnostic mode posture got flattened at runtime: /plan-ceo-review SCOPE EXPANSION collapsed into feature bullets, /office-hours Q3 softened into "Who's your target user?", and builder mode turned into PRD-voice brainstorms.

The fix: rewrite rules 2 + 4 to show three paired framings (pain / upside / forcing). Rule 3 gets an exception for stacked multi-part questions. Add inline exemplars anchored on stable headings in both skill templates. Add three gate-tier E2E tests + Sonnet judge so the regression can't silently ship again.
Commits on this branch
- feat: rewrite preamble rule 2-4 examples, add plan-ceo-review 0D-prelude + office-hours inline exemplars
- chore: regenerate SKILL.md cascade (29 files across claude + external hosts)
- test: gate-tier mode-posture E2E tests + judgePosture helper + 3 fixtures + touchfile registrations
- test: update golden ship baselines + touchfile count
- chore: bump to v1.1.2.0

Test Coverage
E2E_TOUCHFILES + E2E_TIERS:
- plan-ceo-review-expansion-energy (Opus generator, Sonnet judge)
- office-hours-forcing-energy (Sonnet generator, Sonnet judge)
- office-hours-builder-wildness (Sonnet generator, Sonnet judge)

`bun test` suite: 766 pass, 0 fail after golden-baseline regen + touchfile count update.

Pre-Landing Review
- /plan-ceo-review HOLD SCOPE (clean, 0 unresolved).
- /plan-eng-review (clean, 1 finding: the original test-infrastructure target was wrong, since test/skill-llm-eval.test.ts does static analysis and the tests needed to be E2E). Fixed before merge.

Known limitations (accepted V1.1 tradeoffs)
EXPLAIN_LEVEL: terse still get V0 prose. No regression for them.

CHANGELOG
User-facing entry at v1.1.2.0. Two "Fixed" items (expansion mode stays expansive, forcing questions stay forcing). One "Added" item (gate-tier eval tests). Contributor section at the bottom covers implementation details.
Test plan
- `bun test` passes (766/766 on targeted suites)
- `bun run gen:skill-docs --host all` regenerates all hosts cleanly
- E2E_TOUCHFILES + E2E_TIERS have matching keys (98 each)
- `EVALS_TIER=gate EVALS=1 bun run test:e2e` requires `ANTHROPIC_API_KEY`; run manually post-merge to verify the three new cases fire
- /plan-ceo-review SCOPE EXPANSION on a real plan, verify proposals lead with felt-experience framing

🤖 Generated with Claude Code
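
The "matching keys" item in the test plan can be checked mechanically. The sketch below is one way to do it, assuming both registries are plain objects keyed by test name; the `keyMismatches` helper is hypothetical and is not taken from `test/helpers/touchfiles.ts`.

```typescript
// Hypothetical parity check: every test with touchfiles should have a tier, and vice versa.
function keyMismatches(
  touchfiles: Record<string, unknown>,
  tiers: Record<string, unknown>,
): { missingTier: string[]; missingTouchfiles: string[] } {
  const touchKeys = new Set(Object.keys(touchfiles));
  const tierKeys = new Set(Object.keys(tiers));
  return {
    // registered for selection but never assigned a tier
    missingTier: [...touchKeys].filter((key) => !tierKeys.has(key)).sort(),
    // assigned a tier but never selectable by a file change
    missingTouchfiles: [...tierKeys].filter((key) => !touchKeys.has(key)).sort(),
  };
}
```

Comparing key sets is stricter than comparing counts alone: two registries can both hold 98 entries while still disagreeing on one name.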