Skip to content

Fix process crash when testing Stderr#1449

Open
ericstj wants to merge 33 commits intomodelcontextprotocol:mainfrom
ericstj:fix1448
Open

Fix process crash when testing Stderr#1449
ericstj wants to merge 33 commits intomodelcontextprotocol:mainfrom
ericstj:fix1448

Conversation

@ericstj
Copy link
Copy Markdown
Contributor

@ericstj ericstj commented Mar 19, 2026

Fix #1448

Summary of fixes

Product changes

  • StreamClientSessionTransport.DisposeAsync — Safety-net SetDisconnected() after CleanupAsync prevents indefinite hang when ReadMessagesAsync and DisposeAsync race
  • StdioClientSessionTransport.GetUnexpectedExitExceptionAsync — Standalone timeout CTS instead of linked token prevents CancelShutdown() from prematurely aborting stderr pipe drain
  • StdioClientTransport — try/catch around user's StandardErrorLines callback prevents crashing the host; named handler + detach enables proper cleanup

Test fixes

  • 6 test files: Replace WithStdioServerTransport() with WithStreamServerTransport(Stream.Null, Stream.Null)StdioServerTransport blocks a thread pool thread on stdin
  • Conformance tests: try/catch in output handlers + explicit handler detach
  • ConformanceClient: Top-level exception handler prevents crash dumps
  • New test: CreateAsync_StdErrCallbackThrows_DoesNotCrashProcess

@stephentoub

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 20, 2026

This may not be the root cause of the problem. We've now got dumps on both windows and linux. I'll have a look and see if the dumps reveal a better root-cause.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 23, 2026

Crashes fixed, but we have tests hanging. Could be due to the WaitForExit is now hanging. Downloaded the dumps and having a look.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 23, 2026

Yeah, WaitForExit is now hanging... I'm not convinced this is the right change. I'm wondering more "why aren't processes being cleaned up" and do we have a hang or test bug somewhere. Also - the fragility here where missing something means a crash seems busted. I think we should have a design that's resilient to not cleaning up external state.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 24, 2026

Found leaked StdioServerTransport objects in dumps, these kept handles open to StdIn and were causing threadpool starvation. -- at least that's the theory.

What this fixes is the test host hang, which CI reports as a failure. The causal chain:

  1. 7 tests create StdioServerTransport via DI without disposing ServiceProvider
  2. Each transport starts Task.Run(ReadMessagesAsync) → blocks a thread pool thread on Console.OpenStandardInput()
    (native read that never returns)
  3. On 2-vCPU CI runners, 7 permanently blocked TP threads exhaust the pool
  4. Other tests needing TP threads for async I/O (like Process.Start completions) starve and hang
  5. CI blame collector detects the hang → kills the test host → reported as failure

The await using ensures DisposeAsync() is called on the ServiceProvider, which calls
StdioServerTransport.DisposeAsync(), which cancels the CancellationTokenSource and unblocks the read loop — freeing
those TP threads.

ericstj added 3 commits March 25, 2026 10:19
StdioServerTransport does not reliably close on linux, which causes a
threadpool thread to be blocked on read.  This can lead to threadpool
starvation and application hangs.  Since the tests in question do not
require stdio transport, switch to using StreamServerTransport with null
streams to avoid this issue.
@ericstj ericstj requested a review from stephentoub March 26, 2026 03:50
ericstj and others added 4 commits March 26, 2026 11:08
When CleanupAsync is entered from ReadMessagesAsync (stdout EOF), the
cancellation token is _shutdownCts.Token which hasn't been canceled yet
(base.CleanupAsync runs later). If stderr EOF hasn't been delivered by
the threadpool yet, WaitForExitAsync hangs indefinitely.

Use a linked CTS with ShutdownTimeout to bound the wait. The process is
already dead; we're just draining stderr pipe buffers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
xUnit v3 test host hangs intermittently on .NET 10 when RuntimeAsync is
enabled (the default). The hang occurs after all tests have passed, with
the test host process stalling indefinitely. Setting DOTNET_RuntimeAsync=0
reliably prevents the hang.

This is a temporary workaround pending a fix in the .NET runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TEMPORARY diagnostic instrumentation to capture what happens during the
intermittent xUnit v3 RunTest hang where finished TCS is never signaled.

Changes:
- Instrumented xunit.v3.core.dll with trace points in TestRunner.RunTest:
  before/after EC.Run, runTest entry, pre-try completion, test invocation,
  and finished signaling. Also moves pre-try code inside try block as a
  structural fix.
- DiagnosticExceptionTracing.cs module initializer hooking
  UnhandledException and UnobservedTaskException for additional context.
- Both test csproj files copy instrumented DLLs post-build via
  ReplaceXunitWithInstrumented target.
- CI workflow collects xunit-runtest-diag.log as test artifact.

All instrumentation writes to xunit-runtest-diag.log and stderr.
Remove once the root cause is identified.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The async chain from ConnectAsync through initializeTask/processingTask
can become GC-unreachable when the server process exits before responding.

Task.Delay creates a System.Threading.Timer that is directly rooted in
the runtime's timer queue, providing a simple GC root chain:
Timer Queue -> Timer -> Task.Delay -> WhenAny -> ConnectAsync.

This ensures ConnectAsync always resumes within the initialization
timeout, even if intermediate state machines are collected.

Use a dedicated CTS for the delay (not the caller's cancellationToken)
so external cancellation flows through initializeTask via
initializationCts, preserving the expected OperationCanceledException
propagation path.

Only call FailPendingRequests when processingTask wins (server exited).
When delayTask wins, let the CancelAfter timer cancel initializationCts
naturally, producing TimeoutException through the existing OCE catch
block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ericstj and others added 4 commits April 1, 2026 17:19
xUnit v3's TestRunner.RunTest wraps test execution in an async void
callback (required by ExecutionContext.Run's ContextCallback signature).
When the test's async continuation chain forms a cycle with no external
GC root (no I/O, no timers), the entire object graph is collectible.
In Release mode on Ubuntu, the JIT's aggressive optimizations make this
reproducible — the async void state machine gets collected, the test
never completes, and the test host hangs.

The fix extracts the async void body into an async Task method and
holds a reference to the returned Task, keeping the entire async chain
GC-rooted for the lifetime of RunTest.

Also reverts prior ConnectAsync/McpSessionHandler workarounds that
were attempting to address this from the product side.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace async void with plain void for the ExecutionContext.Run callback.
The test body runs in an async Task method (runTestCore) whose returned
Task is pinned via GCHandle.Alloc — a root in the GC handle table that
is independent of the managed object graph. This is stronger than
GC.KeepAlive and avoids async void entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace async void with async Task for the test body, await the Task
directly. Eliminates GCHandle, finished TCS, and async void entirely.

This is cleaner xUnit code but does not fix the intermittent test hang,
which has a separate root cause in the product code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
StreamClientSessionTransport.DisposeAsync could hang indefinitely when
awaiting Completion because the channel writer was never completed.

Root cause: ReadMessagesAsync and DisposeAsync race to call CleanupAsync.
The _cleanedUp Interlocked guard ensures only one runs the full body
(which calls SetDisconnected to complete the channel). The loser calls
CancelShutdown() and returns immediately — without waiting for the winner
to finish. If DisposeAsync then does 'await Completion', it blocks on a
channel that the still-running cleanup hasn't completed yet.

Fix: call SetDisconnected() after CleanupAsync in DisposeAsync as a
safety net. SetDisconnected is idempotent (uses Interlocked.Exchange),
so a duplicate call is a no-op.

Also removes the instrumented xUnit DLLs and MSBuild target that were
added to test a disproven GC-collection hypothesis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69732069649?pr=1449
No hangs. Failure in Session_TracksActivities but it appears to be an unrelated flaky test -- will peel off a fix for that in another PR.

Rerunning

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69734695843?pr=1449

Another build, another flaky test. This time it's ReadEventsAsync_RespectsModeSwitchFromStreamingToPolling -- seems to be another synchronization issue -- not a hang. Opened a PR for this.

Rerunning.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

And another green build -- https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69763953157?pr=1449
I think this is good. Polishing things up and will do some more builds on final bits.

GetUnexpectedExitExceptionAsync used a linked CancellationTokenSource
that included the caller's token (_shutdownCts.Token). When
ReadMessagesAsync and DisposeAsync race to call CleanupAsync, the
loser calls CancelShutdown() which cancels _shutdownCts. This
prematurely aborted WaitForExitAsync in the winner's cleanup path,
preventing stderr pipe buffers from being fully drained. The
ErrorDataReceived callback would never fire for lines still in the
pipe, causing CreateAsync_ValidProcessInvalidServer_StdErrCallbackInvoked
to fail intermittently.

Fix: use a standalone timeout CTS instead of linking to the caller's
token. The process is already dead at this point—we only need a
timeout to bound the pipe drain, not external cancellation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

Ok, this should be good to go. Updated the summary at the top of the PR. @stephentoub @halter73 -- do your worst.

@ericstj ericstj removed the NO MERGE PR should not be merged until the label is removed label Apr 2, 2026
@ericstj ericstj requested review from halter73 and stephentoub April 2, 2026 21:26
@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 2, 2026

Extract WaitForProcessExitAsync to ensure ErrorDataReceived events are
fully drained before the handler is detached. Fixes a race where
HasExited returns false while the process is exiting (stdout closed but
not yet reaped), causing GetUnexpectedExitExceptionAsync to skip the
drain and the handler to be detached before callbacks arrive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 3, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 3, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 3, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 3, 2026

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Apr 3, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CreateAsync_ValidProcessInvalidServer_StdErrCallbackInvoked Test crash in CI

2 participants