
HTTP MCP servers fail with TypeError: fetch failed after idle period — CLI reuses dead pooled TCP connection #3257

@pjperez

Description

Describe the bug

Summary

When the Copilot CLI process is left running for an extended idle period (typically a few minutes, matching common NAT / stateful-firewall idle timeouts of 60–300 s), the underlying TCP connection to an HTTP MCP server is silently dropped on-path — no FIN, no RST, the flow just disappears from the middlebox's state table. The client kernel never learns the connection is dead, undici keeps the socket in its pool, and the next MCP request writes to the dead socket and fails with TypeError: fetch failed.

The CLI does retry, but the retries fire fast enough that they reuse the same dead pool entry (all retries fail within ~1.5 s with the same error), so the server is then marked as failed for the session until a manual /mcp reconnect or process restart.

This is especially painful when driving the CLI through the SDK, where a single CLI process is shared across many parallel sessions: the SDK has no handle on the MCP transport, so there is no client-side workaround.

The straightforward fix is to set undici's keepAliveInitialDelay on the MCP HTTP Agent to something below typical middlebox idle timeouts (e.g. 30 s). undici enables SO_KEEPALIVE by default but leaves the initial probe delay at the OS default — which on Linux/macOS/Windows is ~2 hours, so probes never fire inside a realistic session. Lowering it makes the kernel send keep-alive probes often enough to keep idle connections alive end-to-end.

Affected version

GitHub Copilot CLI 1.0.45

Environment

  • OS: Windows 11 Enterprise 26200, x64 (also reproducible on macOS / Linux per general HTTP behavior)
  • PowerShell 7.5.5
  • Driver: official Copilot CLI SDK, one copilot process serving N parallel sessions
  • MCP servers: two type: http servers — one fronted by an Azure VM with stock nginx defaults (~75s idle timeout), one fronted by Azure API Management (~4-minute idle timeout). Both reproduce.
  • Idle period before failure: matches the upstream idle timeout in each case.

Steps to reproduce

  1. Configure an HTTP MCP server in ~/.copilot/mcp-config.json whose endpoint is fronted by anything that closes idle TCP connections after a finite timeout (most do — Node default 5s, nginx 75s, AWS ALB 60s, Azure APIM ~4min, Azure App Service 240s, etc.):

    {
      "mcpServers": {
        "example": {
          "type": "http",
          "url": "https://example-mcp.invalid/mcp/",
          "tools": ["*"]
        }
      }
    }
  2. Start copilot and open a session via the SDK. Confirm /mcp shows the server connected and an MCP tool call succeeds.

  3. Leave the process idle for longer than the upstream idle timeout (5–10 minutes comfortably exceeds the nginx, ALB, and APIM defaults listed above).

  4. Send a new SDK request that triggers an MCP tool call.
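
For completeness, the same dead-pooled-connection behavior can be sketched outside the CLI with undici directly. This is a hypothetical repro script, not the CLI's code: the URL is a placeholder, the HTTP status of the probe requests is irrelevant, and the agent is configured to hold the pooled socket across the idle window (undici's defaults would recycle it much sooner).

import { Agent, fetch } from 'undici';

const MCP_URL = 'https://example-mcp.invalid/mcp/'; // placeholder endpoint

// Force undici to keep the idle socket in the pool across the wait below.
const agent = new Agent({
  keepAliveTimeout: 10 * 60 * 1000,
  keepAliveMaxTimeout: 10 * 60 * 1000,
});

const first = await fetch(MCP_URL, { dispatcher: agent });
await first.text(); // consume the body so the connection returns to the pool
console.log('first request:', first.status);

// Idle longer than the upstream idle timeout (stock nginx: 75 s).
await new Promise((resolve) => setTimeout(resolve, 10 * 60 * 1000));

try {
  const second = await fetch(MCP_URL, { dispatcher: agent });
  console.log('second request:', second.status);
} catch (err) {
  // On affected paths this surfaces as the same "TypeError: fetch failed";
  // err.cause carries the real reason (ECONNRESET, UND_ERR_SOCKET, ...).
  // On paths that blackhole instead of resetting, the request may instead
  // stall until undici's timeouts fire.
  console.error(err.message, err.cause);
}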

Expected behavior

The CLI either (a) detects the closed socket and transparently reconnects, or (b) retries the request once on a fresh connection. Per RFC 7230 §6.3.1, clients reusing persistent connections must be prepared for the server to close at any time and should retry idempotent requests once on a connection-level failure where no response bytes were received. initialize, tools/list, and tools/call (with the same JSON-RPC id) are all safe to retry under those conditions.

What actually happens

The MCP request fails immediately with a transport-level error and the server is then marked as failed for the session. From ~/.copilot/logs/process-*.log (actual log signature, observed across multiple processes and servers):

2026-05-12T10:11:25.125Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:25.126Z [ERROR] Transient error connecting to HTTP server <server>: TypeError: fetch failed
2026-05-12T10:11:25.650Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:25.650Z [ERROR] Transient error connecting to HTTP server <server>: TypeError: fetch failed
2026-05-12T10:11:26.690Z [ERROR] MCP transport for <server> closed
2026-05-12T10:11:26.690Z [ERROR] Failed to start MCP client for remote server <server>: TypeError: fetch failed
2026-05-12T10:11:26.690Z [ERROR] Recorded failure for server <server>: fetch failed

Two observations:

  1. The retries fire 525 ms after the first failure and a further 1040 ms after that — fast enough that they almost certainly hit the same broken connection in the pool rather than dialing a fresh one.
  2. The log only surfaces TypeError: fetch failed and discards error.cause, which in undici typically carries the actual reason (ECONNRESET, UND_ERR_SOCKET, "other side closed", etc.). That makes triage harder than it needs to be — see "Diagnostic improvement" below.

The same servers are healthy throughout: a manual /mcp reconnect or a CLI restart fixes the issue immediately, and both servers continue serving other clients normally.

Likely cause

The pooled HTTP/1.1 connection to the MCP server is being silently dropped on idle by an intermediary (typically NAT / stateful firewall on the path) — no FIN, no RST, the flow just disappears from the middlebox's state table. The client kernel never learns the connection is dead, undici keeps the socket in its pool, and the next request writes to it and either times out or eventually surfaces as TypeError: fetch failed. The retry path then hits the same dead socket and fails the same way.

This is a textbook case for TCP keep-alive: periodic probe packets on idle sockets keep the flow live in the middlebox's state table (and detect genuine breakage if the path actually died). undici does enable SO_KEEPALIVE by default via socket.setKeepAlive(true, …), but the initial probe delay (keepAliveInitialDelay) is left at the OS default — which on Linux/macOS/Windows is ~2 hours, so the probes never fire inside a realistic session. The result is "TCP keep-alive enabled, in name only."

Suggested fix

Set keepAliveInitialDelay on the undici Agent used for HTTP MCP to a value below common middlebox idle timeouts — e.g. 30–60 seconds. That alone causes the kernel to send keep-alive probes often enough to keep idle MCP connections alive through the typical NAT/firewall idle window (commonly 60–300 s).

Concretely, when constructing the dispatcher for HTTP MCP:

import { Agent } from 'undici';

const mcpAgent = new Agent({
  connect: {
    keepAlive: true,
    keepAliveInitialDelay: 30_000, // 30s — well below typical middlebox idle timeouts
  },
});
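
How the agent is applied depends on how the transport issues requests. Two common undici patterns, sketched here with placeholder values (mcpAgent is the agent constructed above; this is not the CLI's actual wiring):

import { fetch, setGlobalDispatcher } from 'undici';

// Per request: the MCP transport passes the agent as the dispatcher.
const res = await fetch('https://example-mcp.invalid/mcp/', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    accept: 'application/json, text/event-stream',
  },
  body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'ping' }),
  dispatcher: mcpAgent,
});

// Process-wide: every request that goes through undici / global fetch uses it.
setGlobalDispatcher(mcpAgent);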

Optional belt-and-suspenders

  • Retry once on connection-level failure for idempotent MCP JSON-RPC POSTs (undici.RetryAgent does this with one wrapper). Covers the residual race where a connection dies between the last probe and the next request.
  • Send periodic MCP ping notifications on idle HTTP MCP sessions as application-level keepalive. Useful if the server's MCP-session GC is also idle-based (see related Streamable HTTP error - Session not found #1360).
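
The RetryAgent wrapper mentioned in the first bullet might look roughly like this (a sketch with illustrative option values, not the CLI's actual code):

import { Agent, RetryAgent } from 'undici';

// Wrap the keep-alive agent so a connection-level failure on an MCP JSON-RPC
// POST is retried once; by the time the retry runs, the dead socket has been
// dropped from the pool, so the retry typically dials a fresh connection.
const mcpDispatcher = new RetryAgent(
  new Agent({
    connect: { keepAlive: true, keepAliveInitialDelay: 30_000 },
  }),
  {
    maxRetries: 1,
    methods: ['POST'], // MCP JSON-RPC requests go over POST
    errorCodes: ['ECONNRESET', 'EPIPE', 'ETIMEDOUT', 'UND_ERR_SOCKET'],
  },
);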

Diagnostic improvement

The log line:

TypeError: fetch failed

drops the underlying error.cause. In Node's built-in fetch, error.cause typically carries the actual reason (ECONNRESET, UND_ERR_SOCKET, "other side closed", ETIMEDOUT, TLS failures, etc.). Logging err?.cause?.code ?? err?.cause?.message alongside the generic message would let users (and bug reports) distinguish these without having to attach a debugger or sniff packets. It's a one-line change.
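
A hedged sketch of that change (the helper name is illustrative; the CLI's actual logging code is not shown here):

// Surface the undici/fetch cause next to the generic TypeError
// (Node's fetch sets err.cause on connection-level failures).
function describeFetchError(err) {
  const cause = err?.cause;
  return cause ? `${err.message} (cause: ${cause.code ?? cause.message})` : err.message;
}

// Example output for the failure in this report:
//   Transient error connecting to HTTP server <server>: fetch failed (cause: ECONNRESET)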

Why no SDK-side workaround works

The SDK communicates with the CLI process; the CLI owns the MCP transport lifecycle. The SDK can't see, ping, or reset the MCP socket, so the only workarounds available today are:

  • Restarting the entire CLI process (kills all parallel sessions sharing it).
  • Switching the affected servers to stdio transport (not always possible).
  • Manual /mcp reconnect from a TUI session (not reachable via SDK at all).

None of these are acceptable for a long-running multi-session SDK driver, which is why a fix in the CLI's HTTP MCP transport is the only viable path.

Related

  • Streamable HTTP error - Session not found #1360. Closest sibling. Same symptom class (HTTP MCP works initially → degrades after idle → exit/resume of the CLI fixes it), but a different layer: there the server returns JSON-RPC Session not found (-32001), i.e. the MCP-level Mcp-Session-Id was GC'd by the server. The bug filed here is TCP-level (fetch failed). The two might share a common root cause — "no liveness/keepalive on idle HTTP MCP sessions" — or might be independent; worth investigating together.
  • Background agent MCP connections destroyed after text-only turn #2949. Different root cause (LLM produces a text-only turn → CLI tears down MCP), but same general area: HTTP MCP transport lifecycle reliability.



    Labels

    area:mcp (MCP server configuration, discovery, connectivity, OAuth, policy, and registry)
    area:networking (Proxy, SSL/TLS, certificates, corporate environments, and connectivity issues)
