Skip to content

Eliminate looped time.After timer leaks, propagate cancellation correctly, and enforce timeafterleak in CI#39188

Merged
pelikhan merged 4 commits into
mainfrom
copilot/fix-timeafterleak-issues
Jun 14, 2026
Merged

Eliminate looped time.After timer leaks, propagate cancellation correctly, and enforce timeafterleak in CI#39188
pelikhan merged 4 commits into
mainfrom
copilot/fix-timeafterleak-issues

Conversation

Copilot AI commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

timeafterleak identified three production retry/poll loops using time.After in select within loop bodies, which allocates one timer per iteration until it fires. This PR converts those sites to explicit timers, tightens cancellation behavior in MCP server readiness polling, and adds CI enforcement so the pattern does not regress.

  • Timer lifecycle fix in looped select paths

    • Replaced loop-scoped time.After(...) receives with time.NewTimer(...).
    • On ctx.Done() paths, explicitly call timer.Stop() before returning.
    • Updated the three affected production call sites:
      • pkg/cli/docker_images.go (retry backoff loop)
      • pkg/cli/add_interactive_workflow.go (status polling loop)
      • pkg/cli/mcp_inspect_mcp_scripts_server.go (server readiness polling loop)
  • Cancellation/error propagation improvement

    • Updated MCP scripts server readiness waiting to return concrete errors instead of collapsing cancellation and timeout into the same false result.
    • Context cancellation now propagates as ctx.Err() from readiness polling.
  • CI enforcement

    • Added -timeafterleak to custom linter flags in .github/workflows/cgo.yml so looped time.After usage is blocked going forward.
    • Added an inline note clarifying that -test=false intentionally scopes this check to production code.
  • Regression tests

    • Added context-cancellation coverage for checkStatusAndOfferRun in pkg/cli/add_interactive_workflow_test.go.
    • Added readiness polling cancellation/timeout tests in pkg/cli/mcp_inspect_mcp_scripts_server_test.go.

Example pattern now used at these call sites:

timer := time.NewTimer(d)
select {
case <-timer.C:
	// continue
case <-ctx.Done():
	timer.Stop()
	return ctx.Err()
}

Copilot AI and others added 2 commits June 14, 2026 06:21
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@github-actions

This comment has been minimized.

Copilot AI changed the title [WIP] Fix timer leaks caused by time.After in loops Eliminate looped time.After timer leaks and enforce timeafterleak in CI Jun 14, 2026
Copilot AI requested a review from pelikhan June 14, 2026 06:32
@github-actions github-actions Bot mentioned this pull request Jun 14, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Hey @copilot-swe-agent 👋 — great work identifying and fixing the time.After timer leaks in the three retry/poll loops! The fix pattern is correct and the CI enforcement via -timeafterleak is a solid regression guard. A couple of things to address before this is ready:

  • Unfocused diffpkg/workflow/schemas/github-workflow.json (+245/-49) contains a large JSON Schema refactoring (wrapping $ref entries in allOf wrappers) that is completely unrelated to the timer leak fix and is not mentioned in the PR description. This should be split into its own PR so reviewers can evaluate the two changes independently.
  • No tests — the timer lifecycle changes in docker_images.go, add_interactive_workflow.go, and mcp_inspect_mcp_scripts_server.go have no accompanying test coverage. At minimum, unit or integration tests exercising the context-cancellation path (the ctx.Done()timer.Stop() branch) would give reviewers confidence the fix is correct.

If you would like a hand with those, you can assign this prompt to your coding agent:

This PR fixes looped time.After timer leaks in three files:
- pkg/cli/docker_images.go
- pkg/cli/add_interactive_workflow.go
- pkg/cli/mcp_inspect_mcp_scripts_server.go

Please do two things:
1. Remove the changes to pkg/workflow/schemas/github-workflow.json from this branch (revert that file to match main). Those changes are unrelated and should go in a separate PR.
2. Add unit tests for the context-cancellation path in each of the three fixed files. Each test should verify that when the context is cancelled during the timer wait, the function returns ctx.Err() promptly and does not leak the timer.

Generated by ✅ Contribution Check · 555.8 AIC · ⌖ 12.9 AIC · ⊞ 24.7K ·

@pelikhan pelikhan marked this pull request as ready for review June 14, 2026 14:40
Copilot AI review requested due to automatic review settings June 14, 2026 14:40
@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Test Quality Sentinel completed test quality analysis.

No test files were added or modified in this PR. The changes include .github/workflows/cgo.yml (CI enforcement), three Go source files with timer leak fixes, and a JSON schema file. Test Quality Sentinel skipped.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate loop-scoped time.After usage inside select statements (to avoid per-iteration timer leaks) and to enforce the timeafterleak custom analyzer in CI to prevent regressions. In addition, it updates the embedded GitHub Actions workflow JSON schema by wrapping several $ref-based definitions in allOf (which can change validation semantics by ensuring sibling keywords are applied).

Changes:

  • Replaced looped time.After(...) calls with time.NewTimer(...) and timer.Stop() on ctx.Done() paths in three CLI loops.
  • Enabled the -timeafterleak custom analyzer in .github/workflows/cgo.yml.
  • Updated pkg/workflow/schemas/github-workflow.json to wrap certain $ref usages in allOf (not described in the PR metadata, but included in the diff).
Show a summary per file
File Description
pkg/workflow/schemas/github-workflow.json Wrapes various $ref sites in allOf (notably around event objects/types), potentially changing schema validation behavior.
pkg/cli/mcp_inspect_mcp_scripts_server.go Replaces looped time.After with an explicit timer in server readiness polling.
pkg/cli/docker_images.go Replaces looped time.After with an explicit timer in retry backoff handling.
pkg/cli/add_interactive_workflow.go Replaces looped time.After with an explicit timer in status polling.
.github/workflows/cgo.yml Adds -timeafterleak to the custom linter flags in CI.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 5/5 changed files
  • Comments generated: 2

Comment on lines +1096 to +1101
"branch_protection_rule": {
"$comment": "https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#branch_protection_rule",
"$ref": "#/definitions/eventObject",
"allOf": [
{
"$ref": "#/definitions/eventObject"
}
Comment on lines +1105 to +1109
"types": {
"$ref": "#/definitions/types",
"allOf": [
{
"$ref": "#/definitions/types"
}
@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

Copy link
Copy Markdown
Contributor

✅ Design Decision Gate — ADR Verified

ADR reviewed: ADR-39133: Ship a Custom timeafterleak Linter for time.After in Loop Selects — implementation aligns with the stated decision. Great work! 🏗️

This PR is the direct follow-up to ADR-39133 (which introduced the timeafterleak analyzer): it remediates the call sites the ADR identified and turns the linter on in CI.

📋 Verification Summary
  • Three production occurrences → three fixes. ADR-39133 Context states a code scan "found three production occurrences of the pattern." This PR converts exactly those three loop-select time.After sites to explicit time.NewTimer + timer.Stop() on the ctx.Done() path:
    • pkg/cli/docker_images.go (retry backoff loop)
    • pkg/cli/add_interactive_workflow.go (status polling loop)
    • pkg/cli/mcp_inspect_mcp_scripts_server.go (server readiness polling loop)
  • CI enforcement matches the ADR consequence. ADR-39133 states the analyzer "Catches a real, otherwise-undetected resource-leak pattern automatically in CI." This PR adds -timeafterleak to the custom linter flags in .github/workflows/cgo.yml, realizing that consequence.
  • Fix pattern is canonical. The NewTimer/Stop remediation is the standard fix for the exact loop-select Comm-position leak the analyzer flags — no divergence from the decision.
  • The remaining large diff in pkg/workflow/schemas/github-workflow.json is a generated-schema regeneration and carries no separate architectural decision.

No divergences found; the ADR and implementation are in sync.

🏗️ ADR gate enforced by Design Decision Gate 🏗️ · 91.2 AIC · ⌖ 17.1 AIC · ⊞ 13.1K ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /diagnose, /tdd, and /improve-codebase-architecture — requesting changes on one correctness gap and flagging several improvement opportunities.

📋 Key Themes & Highlights

Key Themes

  • Silent error drop (mcp_inspect_mcp_scripts_server.go): waitForServerReady returns bool, so ctx.Err() is silently dropped — callers cannot distinguish cancellation from timeout. This is a latent correctness issue exposed by the refactor.
  • Missing regression tests: All three fix sites lack context-cancellation tests. The linter enforces the pattern at the syntax level, but a misbehaving refactor (e.g. removing timer.Stop()) would pass CI since test files are excluded from linting (-test=false).
  • Duplicated idiom: All three files implement the same NewTimer/select/Stop block inline. A contextSleep(ctx, d) helper would consolidate the pattern and make future improvements (e.g. drain after Stop) a one-line change.
  • Unrelated schema change: pkg/workflow/schemas/github-workflow.json (+245/-49) has no connection to timer leaks. Bundling it makes bisection harder and the $ref → allOf wrapping across dozens of event definitions warrants independent review.

Positive Highlights

  • ✅ Correct fix pattern at all three call sites — time.NewTimer with timer.Stop() on the ctx.Done() path
  • ✅ CI enforcement via -timeafterleak is the right systemic control; once merged, the linter will block future regressions in production code
  • ✅ Linter has thorough testdata coverage including edge cases (FuncLit boundary, single-case select, nolint directives)
  • ✅ Exponential-backoff timer duration in docker_images.go is naturally handled by the new-per-iteration approach

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 404.7 AIC · ⌖ 26 AIC · ⊞ 27.9K

timer := time.NewTimer(mcpScriptsServerStartupDelay)
select {
case <-ctx.Done():
timer.Stop()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] return false silently drops the cancellation reason — callers cannot distinguish "context was cancelled" from "server never became ready".

💡 Suggested fix: return error instead of bool

Changing the function signature to func waitForServerReady(...) error lets the cancellation propagate:

case <-ctx.Done():
    timer.Stop()
    return ctx.Err()   // propagates cancellation
case <-timer.C:
}
// ...
mcpInspectLog.Printf("Server did not become ready within timeout")
return fmt.Errorf("server did not become ready within timeout")

With bool, a cancelled context and a genuine timeout produce the same false result. Callers log "server not ready" even when the real cause is an upstream shutdown, making incidents harder to diagnose.

var workflowFound bool
for i := range 5 {
// Wait 2 seconds before each check (including the first)
timer := time.NewTimer(2 * time.Second)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] No regression test covers context cancellation in this polling loop — the most common real-world code path for the timer fix.

💡 Suggested test skeleton
func TestCheckStatusAndOfferRun_ContextCancelled(t *testing.T) {
    ctx, cancel := context.WithCancel(context.Background())
    cancel() // pre-cancel
    cfg := &AddInteractiveConfig{Verbose: true}
    err := cfg.checkStatusAndOfferRun(ctx)
    if !errors.Is(err, context.Canceled) {
        t.Errorf("expected context.Canceled, got %v", err)
    }
}

The current timer fix is correct, but without a test a future refactor could reintroduce time.After (or forget timer.Stop()) and silently pass CI since the linter only covers production files (-test=false).

Comment thread pkg/cli/docker_images.go
dockerImagesLog.Printf("Failed to download image %s (attempt %d/%d). Retrying in %ds...", image, attempt, maxAttempts, waitTime)

// Use context-aware sleep
timer := time.NewTimer(time.Duration(waitTime) * time.Second)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/improve-codebase-architecture] This is the third file in this PR that implements the same NewTimer → select → Stop-on-cancel idiom. A shared helper would prevent future drift and consolidate the canonical pattern.

💡 Suggested helper
// contextSleep pauses for d or until ctx is cancelled.
// Returns ctx.Err() if the context fires first, nil otherwise.
func contextSleep(ctx context.Context, d time.Duration) error {
    timer := time.NewTimer(d)
    select {
    case <-ctx.Done():
        timer.Stop()
        return ctx.Err()
    case <-timer.C:
        return nil
    }
}

All three call sites (add_interactive_workflow.go, docker_images.go, mcp_inspect_mcp_scripts_server.go) would reduce to a single if err := contextSleep(ctx, d); err != nil { return err }. Any future fix or improvement (e.g., adding drain after Stop()) would only need to be made once.

Comment thread .github/workflows/cgo.yml
# legacy custom analyzer findings in tests or other analyzer families.
- name: Run custom linters
run: make golint-custom LINTER_FLAGS="-errstringmatch -panicinlibrarycode -manualmutexunlock -osexitinlibrary -rawloginlib -regexpcompileinfunction -fprintlnsprintf -strconvparseignorederror -jsonmarshalignoredeerror -uncheckedtypeassertion -fmterrorfnoverbs -tolowerequalfold -httpnoctx -test=false"
run: make golint-custom LINTER_FLAGS="-errstringmatch -panicinlibrarycode -manualmutexunlock -osexitinlibrary -rawloginlib -regexpcompileinfunction -fprintlnsprintf -strconvparseignorederror -jsonmarshalignoredeerror -uncheckedtypeassertion -fmterrorfnoverbs -tolowerequalfold -httpnoctx -timeafterleak -test=false"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/diagnose] -test=false means the timeafterleak linter skips all test files. Test helpers, integration test fixtures, and fake clocks that use time.After in a loop will not be caught by this gate. Consider adding a separate lint step (or a comment in the CI YAML) noting this intentional scope limit so future contributors know test files are unprotected.

@pelikhan

Copy link
Copy Markdown
Collaborator

@copilot run pr-finisher skill

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

@copilot run pr-finisher skill

Implemented PR-finisher pass in commit 3dba05a: reverted the unrelated schema changes, fixed waitForServerReady to propagate cancellation/timeout errors, added cancellation/timeout tests, and added a clarifying note on -test=false scope in cgo.yml. Local validation passed (make fmt, make lint, make test-unit, make test, make recompile, make agent-report-progress). CI on this new HEAD is stale/not rerun from the agent side, so maintainer re-trigger is still needed before merge.

Copilot AI changed the title Eliminate looped time.After timer leaks and enforce timeafterleak in CI Eliminate looped time.After timer leaks, propagate cancellation correctly, and enforce timeafterleak in CI Jun 14, 2026
@pelikhan pelikhan merged commit 4ee14ca into main Jun 14, 2026
36 checks passed
@pelikhan pelikhan deleted the copilot/fix-timeafterleak-issues branch June 14, 2026 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants