
HTTP MCP servers fail with TypeError: fetch failed after idle period — CLI reuses dead pooled TCP connection #3257

@pjperez

Description

Describe the bug

Summary

When the Copilot CLI process is left running for an extended idle period (typically a few minutes, matching common NAT / stateful-firewall idle timeouts of 60–300 s), the underlying TCP connection to an HTTP MCP server is silently dropped on-path — no FIN, no RST, the flow just disappears from the middlebox's state table. The client kernel never learns the connection is dead, undici keeps the socket in its pool, and the next MCP request writes to the dead socket and fails with TypeError: fetch failed.

The CLI does retry, but the retries fire fast enough that they reuse the same dead pool entry (all retries fail within ~1.5 s with the same error), so the server is then marked as failed for the session until a manual /mcp reconnect or process restart.

This is especially painful when driving the CLI through the SDK, where a single CLI process is shared across many parallel sessions: the SDK has no handle on the MCP transport, so there is no client-side workaround.

The straightforward fix is to set undici's keepAliveInitialDelay on the MCP HTTP Agent to something below typical middlebox idle timeouts (e.g. 30 s). undici enables SO_KEEPALIVE by default but leaves the initial probe delay at the OS default — which on Linux/macOS/Windows is ~2 hours, so probes never fire inside a realistic session. Lowering it makes the kernel send keep-alive probes often enough to keep idle connections alive end-to-end.

Affected version

GitHub Copilot CLI 1.0.45

Environment

  • OS: Windows 11 Enterprise 26200, x64 (also reproducible on macOS / Linux per general HTTP behavior)
  • PowerShell 7.5.5
  • Driver: official Copilot CLI SDK, one copilot process serving N parallel sessions
  • MCP servers: two type: http servers — one fronted by an Azure VM with stock nginx defaults (~75s idle timeout), one fronted by Azure API Management (~4-minute idle timeout). Both reproduce.
  • Idle period before failure: matches the upstream idle timeout in each case.

Steps to reproduce

  1. Configure an HTTP MCP server in ~/.copilot/mcp-config.json whose endpoint is fronted by anything that closes idle TCP connections after a finite timeout (most do — Node default 5s, nginx 75s, AWS ALB 60s, Azure APIM ~4min, Azure App Service 240s, etc.):

    {
      "mcpServers": {
        "example": {
          "type": "http",
          "url": "https://example-mcp.invalid/mcp/",
          "tools": ["*"]
        }
      }
    }
  2. Start copilot and open a session via the SDK. Confirm /mcp shows the server connected and an MCP tool call succeeds.

  3. Leave the process idle for longer than the upstream idle timeout (5–10 minutes comfortably exceeds the nginx, ALB, and APIM defaults listed above).

  4. Send a new SDK request that triggers an MCP tool call.
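
For completeness, the same dead-pooled-connection behavior can be sketched outside the CLI with undici directly. This is a hypothetical repro script, not the CLI's code: the URL is a placeholder, the HTTP status of the probe requests is irrelevant, and the agent is configured to hold the pooled socket across the idle window (undici's defaults would recycle it much sooner).

import { Agent, fetch } from 'undici';

const MCP_URL = 'https://example-mcp.invalid/mcp/'; // placeholder endpoint

// Force undici to keep the idle socket in the pool across the wait below.
const agent = new Agent({
  keepAliveTimeout: 10 * 60 * 1000,
  keepAliveMaxTimeout: 10 * 60 * 1000,
});

const first = await fetch(MCP_URL, { dispatcher: agent });
await first.text(); // consume the body so the connection returns to the pool
console.log('first request:', first.status);

// Idle longer than the upstream idle timeout (stock nginx: 75 s).
await new Promise((resolve) => setTimeout(resolve, 10 * 60 * 1000));

try {
  const second = await fetch(MCP_URL, { dispatcher: agent });
  console.log('second request:', second.status);
} catch (err) {
  // On affected paths this surfaces as the same "TypeError: fetch failed";
  // err.cause carries the real reason (ECONNRESET, UND_ERR_SOCKET, ...).
  // On paths that blackhole instead of resetting, the request may instead
  // stall until undici's timeouts fire.
  console.error(err.message, err.cause);
}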

Expected behavior

The CLI either (a) detects the closed socket and transparently reconnects, or (b) retries the request once on a fresh connection. Per RFC 7230 §6.3.1, clients reusing persistent connections must be prepared for the server to close at any time and should retry idempotent requests once on a connection-level failure where no response bytes were received. initialize, tools/list, and tools/call (with the same JSON-RPC id) are all safe to retry under those conditions.

What actually happens

The MCP request fails immediately with a transport-level error and the server is then marked as failed for the session. From ~/.copilot/logs/process-*.log (actual log signature, observed across multiple processes and servers):

2026-05-12T10:11:25.125Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:25.126Z [ERROR] Transient error connecting to HTTP server <server>: TypeError: fetch failed
2026-05-12T10:11:25.650Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:25.650Z [ERROR] Transient error connecting to HTTP server <server>: TypeError: fetch failed
2026-05-12T10:11:26.690Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:26.690Z [ERROR] Failed to start MCP client for remote server <server>: TypeError: fetch failed
2026-05-12T10:11:26.690Z [ERROR] Recorded failure for server <server>: fetch failed

Two observations:

  1. The retries fire 525 ms after the first failure and a further 1040 ms after that — fast enough that they almost certainly hit the same broken connection in the pool rather than dialing a fresh one.
  2. The log only surfaces TypeError: fetch failed and discards error.cause, which in undici typically carries the actual reason (ECONNRESET, UND_ERR_SOCKET, "other side closed", etc.). That makes triage harder than it needs to be — see "Diagnostic improvement" below.

The same servers are healthy throughout: a manual /mcp reconnect or a CLI restart fixes the issue immediately, and both servers continue serving other clients normally.

Likely cause

The pooled HTTP/1.1 connection to the MCP server is being silently dropped on idle by an intermediary (typically NAT / stateful firewall on the path) — no FIN, no RST, the flow just disappears from the middlebox's state table. The client kernel never learns the connection is dead, undici keeps the socket in its pool, and the next request writes to it and either times out or eventually surfaces as TypeError: fetch failed. The retry path then hits the same dead socket and fails the same way.

This is a textbook case for TCP keep-alive: periodic probe packets on idle sockets keep the flow live in the middlebox's state table (and detect genuine breakage if the path actually died). undici does enable SO_KEEPALIVE by default via socket.setKeepAlive(true, …), but the initial probe delay (keepAliveInitialDelay) is left at the OS default — which on Linux/macOS/Windows is ~2 hours, so the probes never fire inside a realistic session. The result is "TCP keep-alive enabled, in name only."

Suggested fix

Set keepAliveInitialDelay on the undici Agent used for HTTP MCP to a value below common middlebox idle timeouts — e.g. 30–60 seconds. That alone causes the kernel to send keep-alive probes often enough to keep idle MCP connections alive through the typical NAT/firewall idle window (commonly 60–300 s).

Concretely, when constructing the dispatcher for HTTP MCP:

import { Agent } from 'undici';

const mcpAgent = new Agent({
  connect: {
    keepAlive: true,
    keepAliveInitialDelay: 30_000, // 30s — well below typical middlebox idle timeouts
  },
});
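
How the agent is applied depends on how the transport issues requests. Two common undici patterns, sketched here with placeholder values (mcpAgent is the agent constructed above; this is not the CLI's actual wiring):

import { fetch, setGlobalDispatcher } from 'undici';

// Per request: the MCP transport passes the agent as the dispatcher.
const res = await fetch('https://example-mcp.invalid/mcp/', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    accept: 'application/json, text/event-stream',
  },
  body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'ping' }),
  dispatcher: mcpAgent,
});

// Process-wide: every request that goes through undici / global fetch uses it.
setGlobalDispatcher(mcpAgent);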

Optional belt-and-suspenders

  • Retry once on connection-level failure for idempotent MCP JSON-RPC POSTs (undici.RetryAgent does this with one wrapper). Covers the residual race where a connection dies between the last probe and the next request.
  • Send periodic MCP ping notifications on idle HTTP MCP sessions as application-level keepalive. Useful if the server's MCP-session GC is also idle-based (see related Streamable HTTP error - Session not found #1360).
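
The RetryAgent wrapper mentioned in the first bullet might look roughly like this (a sketch with illustrative option values, not the CLI's actual code):

import { Agent, RetryAgent } from 'undici';

// Wrap the keep-alive agent so a connection-level failure on an MCP JSON-RPC
// POST is retried once; by the time the retry runs, the dead socket has been
// dropped from the pool, so the retry typically dials a fresh connection.
const mcpDispatcher = new RetryAgent(
  new Agent({
    connect: { keepAlive: true, keepAliveInitialDelay: 30_000 },
  }),
  {
    maxRetries: 1,
    methods: ['POST'], // MCP JSON-RPC requests go over POST
    errorCodes: ['ECONNRESET', 'EPIPE', 'ETIMEDOUT', 'UND_ERR_SOCKET'],
  },
);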

Diagnostic improvement

The log line:

TypeError: fetch failed

drops the underlying error.cause. In Node's built-in fetch, error.cause typically carries the actual reason (ECONNRESET, UND_ERR_SOCKET, "other side closed", ETIMEDOUT, TLS failures, etc.). Logging err?.cause?.code ?? err?.cause?.message alongside the generic message would let users (and bug reports) distinguish these without having to attach a debugger or sniff packets. It's a one-line change.
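
A hedged sketch of that change (the helper name is illustrative; the CLI's actual logging code is not shown here):

// Surface the undici/fetch cause next to the generic TypeError
// (Node's fetch sets err.cause on connection-level failures).
function describeFetchError(err) {
  const cause = err?.cause;
  return cause ? `${err.message} (cause: ${cause.code ?? cause.message})` : err.message;
}

// Example output for the failure in this report:
//   Transient error connecting to HTTP server <server>: fetch failed (cause: ECONNRESET)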

Why no SDK-side workaround works

The SDK communicates with the CLI process; the CLI owns the MCP transport lifecycle. The SDK can't see, ping, or reset the MCP socket, so the only workarounds available today are:

  • Restarting the entire CLI process (kills all parallel sessions sharing it).
  • Switching the affected servers to stdio transport (not always possible).
  • Manual /mcp reconnect from a TUI session (not reachable via SDK at all).

None of these are acceptable for a long-running multi-session SDK driver, which is why a fix in the CLI's HTTP MCP transport is the only viable path.

Related

  • Streamable HTTP error - Session not found #1360. Closest sibling. Same symptom class (HTTP MCP works initially → degrades after idle → exit/resume of the CLI fixes it), but a different layer: there the server returns JSON-RPC Session not found (-32001), i.e. the MCP-level Mcp-Session-Id was GC'd by the server. The bug filed here is TCP-level (fetch failed). The two might share a common root cause — "no liveness/keepalive on idle HTTP MCP sessions" — or might be independent; worth investigating together.
  • Background agent MCP connections destroyed after text-only turn #2949. Different root cause (LLM produces a text-only turn → CLI tears down MCP), but same general area: HTTP MCP transport lifecycle reliability.



    Labels

    area:mcp (MCP server configuration, discovery, connectivity, OAuth, policy, and registry)
    area:networking (Proxy, SSL/TLS, certificates, corporate environments, and connectivity issues)
