Unsupervisedcom · nhorton · Feb 27, 2026 · Mar 1, 2026
diff --git a/.deepreview b/.deepreview
@@ -181,8 +181,8 @@ requirements_traceability:
          # THIS TEST VALIDATES A HARD REQUIREMENT (JOBS-REQ-004.5).
          # YOU MUST NOT MODIFY THIS TEST UNLESS THE REQUIREMENT CHANGES
          ```
-         Both lines are required. They must appear inside the method body
-         (after `def`, before the docstring). If tests have REQ references
+         Both lines are required. They must appear immediately before the
+         `def` line (not inside the method body). If tests have REQ references
          only in docstrings but are missing the formal comment block, flag
          them. If the second line is missing, flag that too.
       5. For rule-validated requirements, verify the `.deepreview` rule's
@@ -238,6 +238,11 @@ update_documents_relating_to_src_deepwork:
       Flag any sections that are now outdated or inaccurate due to the changes.
       If the documentation file itself was changed, verify the updates are correct
       and consistent with the source files.
+
+      Output Format:
+      - PASS: All documentation is current and accurate.
+      - FAIL: Issues found. List each with the doc file path, section name, and
+        what is outdated or inaccurate.
     additional_context:
       unchanged_matching_files: true
 
@@ -264,6 +269,11 @@ update_learning_agents_architecture:
 
       Flag any sections that are now outdated or inaccurate due to the changes.
       If the doc itself was changed, verify the updates are correct.
+
+      Output Format:
+      - PASS: All documentation is current and accurate.
+      - FAIL: Issues found. List each with the doc section name and what is
+        outdated or inaccurate.
     additional_context:
       unchanged_matching_files: true
 
@@ -300,8 +310,13 @@ shell_code_review:
         section markers) still accurate after the changes? Flag any comments
         that describe behavior that no longer matches the code.
 
+      Output Format:
+      - PASS: No issues found.
+      - FAIL: Issues found. List each with file, line, severity (high/medium/low),
+        and a concise description.
+
 deepreview_config_quality:
-  description: "Review .deepreview configs for rule consolidation, overly broad rules, description accuracy, and correct directory placement."
+  description: "Review .deepreview configs for rule consolidation, description accuracy, overly broad rules, directory placement, valid configuration, and effective instructions."
   match:
     include:
       - "**/.deepreview"
@@ -377,9 +392,156 @@ deepreview_config_quality:
       - Rules that reference `additional_context.unchanged_matching_files`
         with patterns spanning multiple directories.
 
+      ## 5. Valid Configuration
+
+      Each rule must be syntactically valid YAML and structurally correct:
+      - Required fields present: `description`, `match.include`, `review.strategy`,
+        `review.instructions` (either an inline string or a `file:` reference).
+      - Field types correct: `match.include` and `match.exclude` are lists of
+        strings, `review.strategy` is one of `individual`, `matches_together`,
+        or `all_changed_files`, `description` is a string.
+      - No unknown top-level fields inside a rule (valid fields: `description`,
+        `match`, `review`).
+      - If `review.instructions.file` is used, the referenced file path exists.
+
+      ## 6. Effective Instructions
+
+      Review instructions must give the reviewer enough context to do useful work:
+      - Instructions include concrete checks — not just "review this file" but
+        specific things to look for (e.g., "check that X matches Y", "verify Z
+        is present").
+      - If the rule protects a documentation file, the doc path is explicitly
+        referenced in the instructions so the reviewer knows what to read.
+      - If the rule uses `additional_context.unchanged_matching_files: true`,
+        the instructions explain why unchanged files are needed (e.g., "compare
+        the doc against source changes").
+      - Instructions specify the output format (PASS/FAIL with details).
+
       ## Output Format
 
       - PASS: No issues found.
       - FAIL: Issues found. List each with the .deepreview file path,
-        rule name, which check failed (consolidation / description / overly-broad / placement),
+        rule name, which check failed (consolidation / description / overly-broad /
+        placement / valid-configuration / effective-instructions),
         and a specific recommendation.
+
+job_yml_quality:
+  description: "Review job.yml files for structural quality: logical step decomposition, review coverage, no orphaned steps, and no promise tags."
+  match:
+    include:
+      - ".deepwork/jobs/**/job.yml"
+      - "src/deepwork/standard_jobs/**/job.yml"
+      - "library/jobs/**/job.yml"
+    exclude:
+      - "tests/fixtures/**"
+  review:
+    strategy: individual
+    instructions: |
+      Review this job.yml file for intrinsic structural quality.
+
+      Check all of the following:
+
+      ## 1. Intermediate Deliverables
+
+      The job breaks work into logical steps with reviewable intermediate
+      deliverables. Each step should produce a tangible output that can be
+      reviewed before proceeding. Flag jobs where steps are too coarse
+      (doing too much in one step) or too fine (splitting work into trivial steps).
+
+      ## 2. Review Coverage
+
+      Reviews are defined for each step that produces outputs. Particularly
+      critical documents should have their own dedicated reviews. Note that
+      reviewers do NOT have transcript access — if criteria need to evaluate
+      conversation behavior, the step must include an output file (e.g.,
+      `.deepwork/tmp/[summary].md`) as a communication channel to the reviewer.
+
+      ## 3. No Orphaned Steps
+
+      Every step defined in the `steps` list must appear in at least one
+      workflow. Flag any step that is defined but not referenced by any
+      workflow's `steps` list.
+
+      ## 4. No Promise Tags
+
+      The job.yml file must not contain `<promise>` tags anywhere in its
+      content (these are a deprecated pattern).
+
+      ## Output Format
+      You should return one of the following:
+
+      - PASS: No issues found.
+      - FAIL: Issues found. List each with the check name
+        (deliverables / review-coverage / orphaned-steps / promise-tags)
+        and a specific description.
+
+step_instruction_quality:
+  description: "Review step instruction files for completeness, specificity, output examples, quality criteria, structured questions, conciseness, and no promise tags."
+  match:
+    include:
+      - ".deepwork/jobs/**/steps/*.md"
+      - "src/deepwork/standard_jobs/**/steps/*.md"
+      - "library/jobs/**/steps/*.md"
+    exclude:
+      - "tests/fixtures/**"
+  review:
+    strategy: individual
+    instructions: |
+      Review this step instruction file for intrinsic quality.
+
+      IMPORTANT: Read the job.yml file in the same job directory for context on
+      how this instruction file fits into the larger workflow — particularly
+      which outputs the step declares and whether it has user inputs.
+
+      Check all of the following:
+
+      ## 1. Complete
+
+      The instruction file is complete — no stubs, placeholders, TODO markers,
+      or sections saying "to be written". Every section has substantive content.
+
+      ## 2. Specific & Actionable
+
+      Instructions are tailored to this step's specific purpose, not generic
+      boilerplate that could apply to any step. The agent should know exactly
+      what to do after reading the instructions.
+
+      ## 3. Output Examples
+
+      If the step declares outputs (check the job.yml), the instruction file
+      shows what good output looks like. This can be template examples, worked
+      examples, or negative examples of what NOT to do. If the step has no
+      outputs, this check passes automatically.
+
+      ## 4. Quality Criteria Defined
+
+      The instruction file defines quality criteria or acceptance criteria for
+      its outputs, either explicitly in a dedicated section or woven into the
+      instructions. The agent should understand what "done well" looks like.
+
+      ## 5. Ask Structured Questions
+
+      If this step gathers user input (check the job.yml for `inputs` with
+      `name` fields rather than `file` references), the instructions explicitly
+      tell the agent to ask structured questions. If the step has no user
+      inputs, this check passes automatically.
+
+      ## 6. Concise, No Redundancy
+
+      Instructions are free of unnecessary verbosity and do not duplicate
+      content that belongs in the job.yml's `common_job_info_provided_to_all_steps_at_runtime`
+      section. Shared context (project background, terminology, conventions)
+      belongs in that section and should not be repeated in each step file.
+
+      ## 7. No Promise Tags
+
+      The instruction file must not contain `<promise>Quality Criteria Met</promise>`
+      or any similar promise tag (these are deprecated).
+
+      ## Output Format
+
+      - PASS: No issues found.
+      - FAIL: Issues found. List each with the check name
+        (complete / specific / output-examples / quality-criteria /
+        structured-questions / conciseness / promise-tags)
+        and a specific description.
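Two of the new checks, "Valid Configuration" for `.deepreview` rules and "No Orphaned Steps" for job.yml, are structural enough to automate before a reviewer sees the file. A minimal Python sketch follows, under stated assumptions. `validate_rule` takes one rule already parsed from YAML and resolves `review.instructions.file` relative to a caller-supplied `base_dir`, assumed here to be the `.deepreview` file's directory. `find_orphaned_steps` assumes job.yml has a `workflows` list whose items carry a `steps` list, with concurrent entries written as nested lists. Both helpers are hypothetical, not DeepWork APIs.

```python
from pathlib import Path
from typing import Any

# Allowed values, copied from the "Valid Configuration" check.
VALID_STRATEGIES = {"individual", "matches_together", "all_changed_files"}
VALID_RULE_FIELDS = {"description", "match", "review"}


def _is_str_list(value: Any) -> bool:
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def _as_dict(value: Any) -> dict:
    return value if isinstance(value, dict) else {}


def validate_rule(name: str, rule: Any, base_dir: Path) -> list[str]:
    """Return structural problems in one parsed .deepreview rule ([] means valid)."""
    if not isinstance(rule, dict):
        return [f"{name}: rule must be a mapping"]
    problems = [f"{name}: unknown field {k!r}" for k in rule if k not in VALID_RULE_FIELDS]
    if not isinstance(rule.get("description"), str):
        problems.append(f"{name}: 'description' must be a string")
    match = _as_dict(rule.get("match"))
    if not _is_str_list(match.get("include")):
        problems.append(f"{name}: 'match.include' must be a list of strings")
    if "exclude" in match and not _is_str_list(match["exclude"]):
        problems.append(f"{name}: 'match.exclude' must be a list of strings")
    review = _as_dict(rule.get("review"))
    strategy = review.get("strategy")
    if not (isinstance(strategy, str) and strategy in VALID_STRATEGIES):
        problems.append(f"{name}: invalid 'review.strategy' {strategy!r}")
    instructions = review.get("instructions")
    if isinstance(instructions, dict) and "file" in instructions:
        # Assumption: file paths resolve relative to the .deepreview file's directory.
        if not (base_dir / str(instructions["file"])).is_file():
            problems.append(f"{name}: instructions file {instructions['file']!r} not found")
    elif not isinstance(instructions, str):
        problems.append(f"{name}: 'review.instructions' must be a string or a file reference")
    return problems


def find_orphaned_steps(job: dict) -> list[str]:
    """Return IDs from job.yml `steps` that no workflow's `steps` list references."""
    referenced: set[str] = set()
    for workflow in job.get("workflows", []):
        for entry in workflow.get("steps", []):
            # Assumption: a concurrent entry is written as a nested list of step IDs.
            referenced.update(entry if isinstance(entry, list) else [entry])
    return [step["id"] for step in job.get("steps", []) if step["id"] not in referenced]
```

A linter like this covers only the mechanical parts. Checks such as "Effective Instructions" or "Review Coverage" still need a reviewer's judgment.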
diff --git a/.github/workflows/.deepreview b/.github/workflows/.deepreview
@@ -18,5 +18,10 @@ update_github_workflows_readme:
 
       Flag any sections that are now outdated or inaccurate due to the changes.
       If the README itself was changed, verify the updates are correct.
+
+      Output Format:
+      - PASS: README accurately reflects all workflows.
+      - FAIL: Issues found. List each with the README section name and what is
+        outdated or inaccurate.
     additional_context:
       unchanged_matching_files: true
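The rules in this PR now share a PASS/FAIL output format, which makes reviewer replies easy to consume programmatically. A minimal sketch follows. It assumes the reply begins with `PASS` or `FAIL` and reports each issue as a `- ` bullet. `Verdict` and `parse_verdict` are hypothetical names, not part of DeepWork.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    issues: list[str] = field(default_factory=list)


def parse_verdict(text: str) -> Verdict:
    """Parse a reviewer reply that follows the PASS/FAIL output format."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines:
        raise ValueError("empty review output")
    head = lines[0].lstrip("- ").upper()
    if head.startswith("PASS"):
        return Verdict(passed=True)
    if not head.startswith("FAIL"):
        raise ValueError(f"expected PASS or FAIL, got {lines[0]!r}")
    # Each following "- " bullet is treated as one reported issue;
    # wrapped continuation lines are ignored in this sketch.
    issues = [line[2:].strip() for line in lines[1:] if line.startswith("- ")]
    return Verdict(passed=False, issues=issues)
```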
diff --git a/doc/architecture.md b/doc/architecture.md
@@ -244,7 +244,10 @@ steps:
       - name: product_category
         description: "Product category"
     outputs:
-      - competitors.md
+      competitors.md:
+        type: file
+        description: "List of competitors with descriptions"
+        required: true
     dependencies: []
 
   - id: primary_research
@@ -255,8 +258,10 @@ steps:
       - file: competitors.md
         from_step: identify_competitors
     outputs:
-      - primary_research.md
-      - competitor_profiles/
+      primary_research.md:
+        type: file
+        description: "Primary research findings"
+        required: true
     dependencies:
       - identify_competitors
 
@@ -619,6 +624,7 @@ Users invoke workflows through the `/deepwork` skill, which uses MCP tools:
 2. `start_workflow` — begins a workflow session, creates a git branch, returns first step instructions
 3. `finished_step` — submits step outputs for quality review, returns next step or completion
 4. `abort_workflow` — cancels the current workflow if it cannot be completed
+5. `go_to_step` — navigates back to a prior step, clearing progress from that step onward
 
 **Example: Creating a New Job**
 ```
@@ -932,7 +938,8 @@ DeepWork includes an MCP (Model Context Protocol) server that provides an altern
                               ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                   DeepWork MCP Server                        │
-│  Tools: get_workflows | start_workflow | finished_step      │
+│  Tools: get_workflows | start_workflow | finished_step |    │
+│         abort_workflow | go_to_step | review tools          │
 │  State: session tracking, step progress, outputs            │
 │  Quality Gate: invokes review agent for validation          │
 └─────────────────────────────────────────────────────────────┘
@@ -980,8 +987,10 @@ Begins a new workflow session.
 Reports step completion and gets next instructions.
 
 **Parameters**:
-- `outputs: list[str]` - List of output file paths created
+- `outputs: dict[str, str | list[str]]` - Map of output names to file path(s)
 - `notes: str | None` - Optional notes about work done
+- `quality_review_override_reason: str | None` - If provided, skips quality review
+- `session_id: str | None` - Target a specific workflow session
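`outputs` is now a map from output name to one path or a list of paths. A caller can therefore compare a submission against the names declared under a step's `outputs` in job.yml. The minimal sketch below assumes the declared specs look like the `primary_research.md` example (`type`, `description`, `required`) and treats a missing `required` key as optional. `normalize_outputs` and `missing_required_outputs` are hypothetical helpers, not the server's implementation.

```python
from __future__ import annotations

from typing import Any


def normalize_outputs(outputs: dict[str, str | list[str]]) -> dict[str, list[str]]:
    """Turn the finished_step `outputs` map into name -> list of file paths."""
    return {
        name: [paths] if isinstance(paths, str) else list(paths)
        for name, paths in outputs.items()
    }


def missing_required_outputs(
    outputs: dict[str, str | list[str]],
    declared: dict[str, dict[str, Any]],
) -> list[str]:
    """Return declared outputs marked `required: true` that were not submitted."""
    submitted = normalize_outputs(outputs)
    return [
        name
        for name, spec in declared.items()
        # Assumption: a missing `required` key means the output is optional.
        if spec.get("required") is True and not submitted.get(name)
    ]
```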
 
 **Returns**:
 - `status: "needs_work" | "next_step" | "workflow_complete"`
@@ -998,7 +1007,23 @@ Aborts the current workflow and returns to the parent (if nested).
 
 **Returns**: Aborted workflow info, resumed parent info (if any), current stack
 
-#### 5. `get_review_instructions`
+#### 5. `go_to_step`
+Navigates back to a prior step, clearing progress from that step onward.
+
+**Parameters**:
+- `step_id: str` - ID of the step to navigate back to
+- `session_id: str | None` - Target a specific workflow session
+
+**Returns**: `begin_step` (step info for the target step), `invalidated_steps` (step IDs whose progress was cleared), `stack` (current workflow stack)
+
+**Behavior**:
+- Validates the target step exists in the workflow
+- Rejects forward navigation (target entry index > current entry index)
+- Clears session tracking state for all steps from target onward (files on disk are not deleted)
+- For concurrent entries, navigates to the first step in the entry
+- Marks the target step as started
+
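The `go_to_step` behavior rules above can be sketched as follows. This is a hypothetical model, not the StateManager code. A workflow is a list of entries, and each entry is a list of step IDs (more than one ID marks a concurrent entry). Progress is a plain dict of step status. The sketch validates the target and rejects forward moves. It then clears tracking state from the target entry onward without touching any files and lands on the entry's first step.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Hypothetical model: each workflow entry is a list of step IDs; an entry
    # with more than one ID stands for steps that run concurrently.
    entries: list[list[str]]
    current_entry: int
    progress: dict[str, str] = field(default_factory=dict)  # step_id -> status


def go_to_step(session: Session, step_id: str) -> tuple[str, list[str]]:
    """Rewind `session` to the entry containing `step_id`.

    Returns the step that is now active and the IDs whose progress was cleared.
    """
    target = next(
        (i for i, entry in enumerate(session.entries) if step_id in entry), None
    )
    if target is None:
        raise ValueError(f"step {step_id!r} is not part of this workflow")
    if target > session.current_entry:
        raise ValueError("go_to_step only navigates backwards")
    # Clear tracking state from the target entry onward; no files are deleted.
    invalidated = [
        s for entry in session.entries[target:] for s in entry if s in session.progress
    ]
    for s in invalidated:
        del session.progress[s]
    # A concurrent entry is always re-entered at its first step.
    begin = session.entries[target][0]
    session.current_entry = target
    session.progress[begin] = "started"
    return begin, invalidated
```

Consider a session at entry 2 of `[["a"], ["b", "c"], ["d"]]`. Rewinding to step `c` in this model lands on `b`, the first step of that concurrent entry, and clears whichever of `b`, `c`, and `d` had recorded progress.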
+#### 6. `get_review_instructions`
 Runs the `.deepreview`-based code review pipeline. Registered directly in `jobs/mcp/server.py` (not in `tools.py`) since it operates outside the workflow lifecycle.
 
 **Parameters**:
@@ -1008,15 +1033,15 @@ Runs the `.deepreview`-based code review pipeline. Registered directly in `jobs/
 
 The `--platform` CLI option on `serve` controls which formatter is used (defaults to `"claude"`).
 
-#### 6. `get_configured_reviews`
+#### 7. `get_configured_reviews`
 Lists configured review rules from `.deepreview` files without running the pipeline.
 
 **Parameters**:
 - `only_rules_matching_files: list[str] | None` - Filter to rules matching these files.
 
 **Returns**: List of rule summaries (name, description, defining_file).
 
-#### 7. `mark_review_as_passed`
+#### 8. `mark_review_as_passed`
 Marks a review as passed so it is skipped on subsequent runs while the reviewed files remain unchanged. Part of the **review pass caching** mechanism.
 
 **Parameters**:
@@ -1038,12 +1063,17 @@ Manages workflow session state persisted to `.deepwork/tmp/session_[id].json`:
 
 ```python
 class StateManager:
-    def create_session(...) -> WorkflowSession
-    def load_session(session_id) -> WorkflowSession
-    def start_step(step_id) -> None
-    def complete_step(step_id, outputs, notes) -> None
-    def advance_to_step(step_id, entry_index) -> None
-    def complete_workflow() -> None
+    async def create_session(...) -> WorkflowSession
+    def resolve_session(session_id=None) -> WorkflowSession
+    async def start_step(step_id, session_id=None) -> None
+    async def complete_step(step_id, outputs, notes, session_id=None) -> None
+    async def advance_to_step(step_id, entry_index, session_id=None) -> None
+    async def go_to_step(step_id, entry_index, invalidate_step_ids, session_id=None) -> None
+    async def complete_workflow(session_id=None) -> None
+    async def abort_workflow(explanation, session_id=None) -> tuple
+    async def record_quality_attempt(step_id, session_id=None) -> int
+    def get_all_outputs(session_id=None) -> dict
+    def get_stack() -> list[StackEntry]
 ```
 
 Session state includes:
@@ -1058,25 +1088,34 @@ Evaluates step outputs against quality criteria:
 
 ```python
 class QualityGate:
-    def evaluate(
-        quality_criteria: list[str],
-        outputs: list[str],
+    async def evaluate_reviews(
+        reviews: list[dict],
+        outputs: dict[str, str | list[str]],
+        output_specs: dict[str, str],
+        project_root: Path,
+        notes: str | None = None,
+    ) -> list[ReviewResult]
+
+    async def build_review_instructions_file(
+        reviews: list[dict],
+        outputs: dict[str, str | list[str]],
+        output_specs: dict[str, str],
         project_root: Path,
-    ) -> QualityGateResult
+        notes: str | None = None,
+    ) -> str
 ```
 
-The quality gate:
-1. Builds a review prompt with criteria and output file contents
-2. Invokes Claude Code via subprocess with proper flag ordering (see `doc/reference/calling_claude_in_print_mode.md`)
-3. Uses `--json-schema` for structured output conformance
-4. Parses the `structured_output` field from the JSON response
-5. Returns pass/fail with per-criterion feedback
+The quality gate supports two modes:
+- **External runner** (`evaluate_reviews`): Invokes Claude Code via subprocess to evaluate each review and returns the failed reviews as a list of `ReviewResult` objects
+- **Self-review** (`build_review_instructions_file`): Generates a review instructions file; the agent then spawns a subagent that performs the review using that file
 
 ### Schemas (`jobs/mcp/schemas.py`)
 
 Pydantic models for all tool inputs and outputs:
-- `StartWorkflowInput`, `FinishedStepInput`
-- `GetWorkflowsResponse`, `StartWorkflowResponse`, `FinishedStepResponse`
+- `StartWorkflowInput`, `FinishedStepInput`, `AbortWorkflowInput`, `GoToStepInput`
+- `GetWorkflowsResponse`, `StartWorkflowResponse`, `FinishedStepResponse`, `AbortWorkflowResponse`, `GoToStepResponse`
+- `ActiveStepInfo`, `ExpectedOutput`, `ReviewInfo`, `ReviewResult`, `StackEntry`
+- `JobInfo`, `WorkflowInfo`, `JobLoadErrorInfo`
 - `WorkflowSession`, `StepProgress`
 - `QualityGateResult`, `QualityCriteriaResult`