Skip to content

Stop Codex harness retries on active-goal router failures#39156

Open
Copilot wants to merge 7 commits into
mainfrom
copilot/fix-retry-loop-token-drain
Open

Stop Codex harness retries on active-goal router failures#39156
Copilot wants to merge 7 commits into
mainfrom
copilot/fix-retry-loop-token-drain

Conversation

Copilot AI commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Codex harness retries were treating a known non-retryable router failure as partial-execution noise, causing repeated fresh runs and avoidable token burn. This change classifies the “thread already has a goal / use update_goal” failure as terminal for the current attempt sequence.

  • Retry guard expansion

    • Added a new non-retryable signature in harness_retry_guard.cjs for the active-goal router error emitted by Codex.
    • Detection uses a single consolidated pattern matching the full error shape.
  • Harness retry policy update

    • Updated codex_harness.cjs to short-circuit retries when the new goalAlreadyActive guard is detected.
    • Logs now include this condition under existing non-retryable guard handling.
  • Focused test coverage

    • Extended harness_retry_guard.test.cjs to assert active-goal detection.
    • Extended codex_harness.test.cjs retry-policy tests to verify no retry on this error and reused shared guard detection logic.
const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output);

if (
  nonRetryableGuard.aiCreditsExceeded ||
  nonRetryableGuard.awfAPIProxyBlockingRequests ||
  nonRetryableGuard.goalAlreadyActive
) {
  log(`attempt ${attempt + 1}: ... — not retrying (non-retryable guard condition)`);
  break;
}

pr-sous-chef: requested branch update in run 27485386371

Generated by 👨‍🍳 PR Sous Chef · 58 AIC · ⌖ 0.988 AIC · ⊞ 17.4K ·


pr-sous-chef: requested branch update for run 27504382710

Generated by 👨‍🍳 PR Sous Chef · 52.8 AIC · ⌖ 1.74 AIC · ⊞ 17.4K ·

Copilot AI and others added 2 commits June 13, 2026 22:15
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix retry loop draining all tokens Stop Codex harness retries on active-goal router failures Jun 13, 2026
Copilot AI requested a review from pelikhan June 13, 2026 22:23
@pelikhan pelikhan marked this pull request as ready for review June 13, 2026 22:46
Copilot AI review requested due to automatic review settings June 13, 2026 22:46
@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #39156 does not have the 'implementation' label and has 0 new lines of code in business logic directories (≤100 threshold). Neither enforcement condition is met.

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Codex harness retry behavior to stop retrying when Codex reports a known non-retryable “active goal already exists for this thread” router failure, reducing wasted fresh-run retries and token consumption.

Changes:

  • Added a new non-retryable guard (goalAlreadyActive) to the shared harness retry-guard detector.
  • Updated Codex harness retry loop to treat the new guard as terminal (no further retries) and include it in the non-retryable logging path.
  • Extended unit tests to validate detection of the new guard and Codex retry-policy behavior.
Show a summary per file
File Description
actions/setup/js/harness_retry_guard.cjs Adds the goalAlreadyActive guard detection pattern and exposes it via the guard result.
actions/setup/js/harness_retry_guard.test.cjs Adds test coverage for the new active-goal guard signal.
actions/setup/js/codex_harness.cjs Stops retries when goalAlreadyActive is detected and logs it as non-retryable.
actions/setup/js/codex_harness.test.cjs Adds retry-policy coverage ensuring no retry occurs on the active-goal router failure.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

const AI_CREDITS_EXCEEDED_PATTERNS = [/\bmax[\s_-]*ai[\s_-]*credits[\s_-]*exceeded\b/i, /\bai[\s_-]*credits[\s_-]*rate[\s_-]*limit[\s_-]*error\b/i, /ai[\s_-]*credits?.*(?:rate[\s-]*limit|limit exceeded|budget exceeded|exceeded)/i];

const AWF_API_PROXY_BLOCKING_REQUESTS_PATTERNS = [/\bawf\b.*\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocked requests?\b/i, /\bDIFC_FILTERED\b/];
const GOAL_ALREADY_ACTIVE_PATTERNS = [/\bthis thread already has a goal\b.*\buse update_goal only when the existing goal is complete\b/i];

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced .* with [\s\S]*? so the pattern matches across newlines, and simplified the second anchor to \buse update_goal\b for flexibility. A dedicated multiline test confirms the behavior.

Comment thread actions/setup/js/codex_harness.cjs Outdated
const reasons = [];
if (nonRetryableGuard.aiCreditsExceeded) reasons.push("AI credits budget exceeded");
if (nonRetryableGuard.awfAPIProxyBlockingRequests) reasons.push("AWF API proxy is blocking requests");
if (nonRetryableGuard.goalAlreadyActive) reasons.push("goal is already active for this thread");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The reason string now reads: "goal is already active for this thread (use update_goal when the current goal is complete)".

@github-actions github-actions Bot mentioned this pull request Jun 13, 2026
@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 80/100 — Excellent

Analyzed 3 test(s) (2 new, 1 modified): 3 design, 0 implementation, 0 guideline violation(s). All new tests verify behavioral contracts of the retry-guard and harness functions.

📊 Metrics & Test Classification (3 tests analyzed)
Metric Value
New/modified tests analyzed 3
✅ Design tests (behavioral contracts) 3 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 3 (100%)
Duplicate test clusters 0
Test inflation detected Yes — codex_harness.test.cjs (+11 test lines / +2 prod lines = 5.5:1)
🚨 Coding-guideline violations 0

Test Classification Details

Test File Classification Issues Detected
"does not retry when codex reports an existing active goal" actions/setup/js/codex_harness.test.cjs:~466 ✅ Design None
"detects goal already active markers" actions/setup/js/harness_retry_guard.test.cjs:~74 ✅ Design None
"returns false when output has no guard markers" (modified) actions/setup/js/harness_retry_guard.test.cjs:~65 ✅ Design Extended with goalAlreadyActive negative assertion

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs, *.test.js): 3 tests (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). The inflation flag on codex_harness.test.cjs (5.5:1) is a mechanical artefact of the compact production change (a single-line condition) versus the necessarily verbose test setup, not a sign of low-quality coverage.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · 166 AIC · ⌖ 33 AIC · ⊞ 27.3K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 80/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 3 analyzed tests (2 new, 1 modified) verify behavioral contracts with full error/edge-case coverage and no coding-guideline violations.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose and /tdd — requesting changes on pattern robustness and test fidelity.

The production fix is correct and the structure follows established patterns well. Two issues worth addressing before merge:

  1. Pattern fragility (harness_retry_guard.cjs:8) — the single regex requires both phrases to appear on the same line (.* does not span newlines). If Codex output wraps the error across lines, or rephrases slightly, the guard silently stops working and the token-burn regression resurfaces.
  2. Mock drift (codex_harness.test.cjs:416) — the local shouldRetry was updated to check only goalAlreadyActive, leaving aiCreditsExceeded / awfAPIProxyBlockingRequests absent from the mock; mirroring the full guard object would keep the mock structurally aligned with production.
📋 Positive highlights
  • ✅ Clean additive change — no structural churn, consistent with existing guard handling
  • ✅ Two-layer coverage: guard unit tests (harness_retry_guard.test.cjs) + policy integration tests (codex_harness.test.cjs)
  • reasons.push(...) / log message follows the existing pattern exactly
  • ✅ JSDoc return type updated to reflect the new field
  • GOAL_ALREADY_ACTIVE_PATTERNS exported for testability, consistent with peers

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 225.6 AIC · ⌖ 26.1 AIC · ⊞ 27.9K

const AI_CREDITS_EXCEEDED_PATTERNS = [/\bmax[\s_-]*ai[\s_-]*credits[\s_-]*exceeded\b/i, /\bai[\s_-]*credits[\s_-]*rate[\s_-]*limit[\s_-]*error\b/i, /ai[\s_-]*credits?.*(?:rate[\s-]*limit|limit exceeded|budget exceeded|exceeded)/i];

const AWF_API_PROXY_BLOCKING_REQUESTS_PATTERNS = [/\bawf\b.*\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocked requests?\b/i, /\bDIFC_FILTERED\b/];
const GOAL_ALREADY_ACTIVE_PATTERNS = [/\bthis thread already has a goal\b.*\buse update_goal only when the existing goal is complete\b/i];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] GOAL_ALREADY_ACTIVE_PATTERNS is more brittle than its sibling arrays. The two-phrase .* requirement means the guard silently fails if: (a) the error spans multiple lines (.* does not match \n), or (b) Codex rephrases either clause. Silent failure here restores the exact token-burn regression this PR is fixing.

💡 Suggestions

Option A — anchor on the stable first phrase only:

const GOAL_ALREADY_ACTIVE_PATTERNS = [
  /\bthis thread already has a goal\b/i,
];

The first phrase is already specific enough to avoid false positives and needs no secondary anchor.

Option B — make .* newline-safe (dot-all) and add a shorter fallback:

const GOAL_ALREADY_ACTIVE_PATTERNS = [
  /\bthis thread already has a goal\b[\s\S]*\buse update_goal\b/i, // full message, any whitespace
  /\bthis thread already has a goal\b/i,                          // fallback for partial output
];

For comparison, AI_CREDITS_EXCEEDED_PATTERNS uses 3 patterns to handle variations; GOAL_ALREADY_ACTIVE_PATTERNS is the only array with a single two-phrase requirement.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Used Option B ([\s\S]*?) — the pattern now matches across newlines and uses \buse update_goal\b as the second anchor, bringing it in line with the multi-alternative approach of sibling patterns.

Comment thread actions/setup/js/codex_harness.test.cjs Outdated
if (attempt === 0 && isAuthenticationFailedError(result.output)) return false;
if (isMissingApiKeyError(result.output)) return false;
if (hasNumerousPermissionDeniedIssues(result.output)) return false;
if (detectNonRetryableHarnessGuard(result.output).goalAlreadyActive) return false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The local shouldRetry mock now checks goalAlreadyActive alone, while the production code checks all three guard flags together via a single detectNonRetryableHarnessGuard call. This creates drift: aiCreditsExceeded and awfAPIProxyBlockingRequests are absent from this mock, so a future reader testing "does aiCreditsExceeded prevent a retry?" via this describe block would find no coverage.

💡 Suggestion — use the full guard object for consistency
// Replace the single goalAlreadyActive check with the full guard:
const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output);
if (nonRetryableGuard.aiCreditsExceeded || nonRetryableGuard.awfAPIProxyBlockingRequests || nonRetryableGuard.goalAlreadyActive) return false;

This mirrors the exact production branch in codex_harness.cjs:442 and ensures the mock keeps pace with future guard extensions automatically.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Updated shouldRetry to mirror the production branch exactly: const nonRetryableGuard = detectNonRetryableHarnessGuard(result.output); if (nonRetryableGuard.aiCreditsExceeded || nonRetryableGuard.awfAPIProxyBlockingRequests || nonRetryableGuard.goalAlreadyActive) return false;

expect(result.goalAlreadyActive).toBe(false);
});

it("detects goal already active markers", () => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The new test only validates the exact canonical message. Consider adding a negative case that confirms a partial first-phrase match alone does NOT trigger detection — this would make the two-phrase requirement an explicit, documented contract rather than an implicit one.

💡 Suggested additions
it("does not detect goal active from first phrase alone", () => {
  const result = detectNonRetryableHarnessGuard("this thread already has a goal");
  expect(result.goalAlreadyActive).toBe(false); // adjust to true if single-phrase is desired
});

it("detects goal already active when embedded in longer output", () => {
  const result = detectNonRetryableHarnessGuard(
    "[codex] cannot create a new goal because this thread already has a goal; use update_goal only when the existing goal is complete\nExit code: 1"
  );
  expect(result.goalAlreadyActive).toBe(true);
});

Also: the existing null input test (line 52) and "both flags" test (line 58) were not updated to assert goalAlreadyActive: false — worth adding for completeness.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Added: (1) negative case asserting the first phrase alone does not trigger detection, (2) embedded-output multiline test, and (3) goalAlreadyActive: false assertions in the null input and both flags tests.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking findings — correct direction, two gaps worth addressing

The guard and detection logic are correct and the approach is sound. Two medium issues to consider before this pattern spreads further.

Findings

1. GOAL_ALREADY_ACTIVE_PATTERNS — single compound regex, no fallback (harness_retry_guard.cjs:8)

Both sibling pattern groups have 3–4 alternatives for resilience. This one has a single compound A.*B pattern that fails silently if codex slightly rewords the message or emits it across two lines. The failure mode is invisible: the guard misses, retries proceed against a permanently broken state, and compute credits are wasted until MAX_RETRIES is exhausted. Adding a simpler fallback (/\bthis thread already has a goal\b/i) or the s (dotAll) flag would bring it in line with the other guards.

2. Test validates a local shouldRetry clone, not the production path (codex_harness.test.cjs:416)

The shouldRetry function in this test block is an inline reimplementation, not the actual production guard in codex_harness.cjs. The new test at line 467 passes even if the production line 442 guard were deleted. An integration test running a stub that emits the active-goal message and verifying the harness makes exactly one attempt would close this gap — the noop pre-flight tests below are a ready-made template.

🔎 Code quality review by PR Code Quality Reviewer · 270.2 AIC · ⌖ 13.3 AIC · ⊞ 16.7K

const AI_CREDITS_EXCEEDED_PATTERNS = [/\bmax[\s_-]*ai[\s_-]*credits[\s_-]*exceeded\b/i, /\bai[\s_-]*credits[\s_-]*rate[\s_-]*limit[\s_-]*error\b/i, /ai[\s_-]*credits?.*(?:rate[\s-]*limit|limit exceeded|budget exceeded|exceeded)/i];

const AWF_API_PROXY_BLOCKING_REQUESTS_PATTERNS = [/\bawf\b.*\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocking requests\b/i, /\bapi[\s_-]*proxy\b.*\bblocked requests?\b/i, /\bDIFC_FILTERED\b/];
const GOAL_ALREADY_ACTIVE_PATTERNS = [/\bthis thread already has a goal\b.*\buse update_goal only when the existing goal is complete\b/i];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regex is a single compound match with no fallback — any message change silently disables the guard.

Unlike AI_CREDITS_EXCEEDED_PATTERNS (3 alternatives) and AWF_API_PROXY_BLOCKING_REQUESTS_PATTERNS (4 alternatives), GOAL_ALREADY_ACTIVE_PATTERNS has one pattern that requires both phrases to appear in order on the same line. The .* quantifier does not cross newlines, so a multi-line output or a slight rewording of the codex error defeats it silently. The failure mode is invisible: the guard misses, retries continue against a permanently-broken state, and credits are wasted.

💡 Suggested fix

Add at least one shorter fallback so minor message variations are still caught:

const GOAL_ALREADY_ACTIVE_PATTERNS = [
  // prefer the full compound message for precision
  /\bthis thread already has a goal\b.*\buse update_goal only when the existing goal is complete\b/i,
  // fallback: catches rewording or multi-line output
  /\bthis thread already has a goal\b/i,
];

Alternatively, add the s (dotAll) flag to the compound regex so it can match across newlines: /.../is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the single compound .* regex with [\s\S]*? so it matches across newlines. The two-phrase requirement is preserved for precision, with the negative test documenting that the first phrase alone is insufficient.

Comment thread actions/setup/js/codex_harness.test.cjs Outdated
if (attempt === 0 && isAuthenticationFailedError(result.output)) return false;
if (isMissingApiKeyError(result.output)) return false;
if (hasNumerousPermissionDeniedIssues(result.output)) return false;
if (detectNonRetryableHarnessGuard(result.output).goalAlreadyActive) return false;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test exercises a local reimplementation of shouldRetry, not the production retry path — regression in codex_harness.cjs would not be caught here.

The shouldRetry at line 409 is an inline closure used only in this test block. The new check at line 416 correctly adds goalAlreadyActive, but the test verifies the local copy of the logic, not the actual detectNonRetryableHarnessGuard+break path inside main() in codex_harness.cjs. If the production guard on line 442 of that file were accidentally removed, all tests in this describe block would still pass.

💡 Why this matters

The local shouldRetry also omits the aiCreditsExceeded and awfAPIProxyBlockingRequests checks that are in production, confirming it has always been a simplified proxy. The harness_retry_guard.test.cjs tests do cover detection in isolation, but no test directly exercises the codex_harness.cjs retry guard end-to-end for the goalAlreadyActive signal.

Consider adding a lightweight integration test (similar to the noop pre-flight tests below this block) that runs the actual harness stub with an output containing the active-goal error and asserts it does not retry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Updated shouldRetry to mirror the full production guard (aiCreditsExceeded || awfAPIProxyBlockingRequests || goalAlreadyActive), so removing any of those guards from codex_harness.cjs would break the test.

@pelikhan

Copy link
Copy Markdown
Collaborator

@copilot run pr-finisher skill

Copilot AI and others added 2 commits June 13, 2026 23:10
…hint

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented Jun 13, 2026

Copy link
Copy Markdown
Contributor Author

@copilot run pr-finisher skill

All review threads addressed and pushed. See summary below.

@github-actions

Copy link
Copy Markdown
Contributor

@copilot review all comments and address the unresolved review feedback. Please push the fixes and request another review once they’re in.

Generated by 👨‍🍳 PR Sous Chef · 39.5 AIC · ⌖ 1.01 AIC · ⊞ 17.4K ·

@github-actions

Copy link
Copy Markdown
Contributor

``
@copilot review all comments and address the unresolved review feedback.

Generated by 👨‍🍳 PR Sous Chef · 58 AIC · ⌖ 0.988 AIC · ⊞ 17.4K ·

@github-actions

Copy link
Copy Markdown
Contributor

``
@copilot refresh the branch and rerun checks, then summarize any remaining blockers.

Generated by 👨‍🍳 PR Sous Chef · 58 AIC · ⌖ 0.988 AIC · ⊞ 17.4K ·

@github-actions

Copy link
Copy Markdown
Contributor

``
@copilot review all comments and address unresolved review feedback.
Please summarize the remaining blockers and rerun checks after the next update.

Generated by 👨‍🍳 PR Sous Chef · 52.8 AIC · ⌖ 1.74 AIC · ⊞ 17.4K ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants