
fix(sdk): fix deadlock in TrustedHttpContextProvider and improve block_on for current-thread runtimes#3490

Closed
lklimek wants to merge 9 commits into v3.1-dev from fix/trusted-context-provider-deadlock

Conversation

@lklimek
Contributor

@lklimek lklimek commented Apr 14, 2026

Issue being fixed or feature implemented

Fixes #3432 — deadlock in TrustedHttpContextProvider::get_quorum_public_key() when called from within a tokio runtime.

What was done?

Root cause

get_quorum_public_key is a sync ContextProvider method called during proof verification, which happens inside an async context. The original implementation called futures::executor::block_on to drive the async quorum fetch. This deadlocks because:

  • futures::executor::block_on parks the calling thread waiting for the future
  • The future uses reqwest HTTP, which needs tokio's I/O reactor
  • The reactor lives on the same thread that is now parked → infinite hang

A subsequent fix attempted tokio::task::block_in_place, which panics on current-thread runtimes (no other worker threads to steal the task).

Fix

TrustedHttpContextProvider (rs-sdk-trusted-context-provider): replaced the broken async bridge with std::thread::scope. A scoped OS thread is spawned with its own independent current_thread tokio runtime. The future runs on that thread's reactor, completely isolated from the outer runtime. The scope guarantees the thread joins before returning, so non-'static borrows of self are safe.

dash_sdk::sync::block_on (rs-sdk): upgraded to handle all runtime flavors:

  • No active runtime: creates a temporary current_thread runtime and drives the future directly
  • CurrentThread runtime: spawns a dedicated OS thread with its own runtime (block_in_place panics here)
  • Multi-thread and future flavors (_): existing block_in_place + spawn (unchanged, still optimal)

RuntimeFlavor is #[non_exhaustive] so the wildcard arm is required; the CurrentThread arm explicitly handles the problematic case.

Regression tests

  • test_get_quorum_public_key_no_deadlock_inside_tokio_runtime: spins up a mock HTTP server and calls get_quorum_public_key from inside a current_thread runtime. Uses a 5-second channel timeout to detect hangs. Verified: old code times out after exactly 5 s; fix completes in ~30 ms.
  • test_block_on_fails_on_current_thread_runtime (renamed to reflect fixed behaviour): runs the same async→sync→async nesting on a current_thread runtime and asserts Ok("Success").

How Has This Been Tested?

  • cargo test -p rs-sdk-trusted-context-provider — 9 tests pass including the deadlock regression test
  • cargo test -p dash-sdk --lib sync — 45 tests pass including the current-thread regression test
  • Regression test manually verified against pre-fix code (times out after 5 s)

Breaking Changes

None.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

🤖 Co-authored by Claudius the Magnificent AI Agent

Summary by CodeRabbit

  • Bug Fixes

    • Fixed deadlock and panic issues when calling synchronous APIs from async contexts.
    • Simplified prefetch initialization by removing preliminary network connectivity checks.
  • Chores

    • Updated dependencies and runtime handling to improve stability and threading behavior.
  • Tests

    • Added regression tests to prevent deadlocks and verify async/sync interactions.

lklimek and others added 7 commits April 14, 2026 11:52
…_in_place

`get_quorum_public_key()` called `futures::executor::block_on()` for the
cache-miss refetch path, which deadlocks when invoked from inside a tokio
runtime (e.g. the FFI wrapper's `runtime.block_on()`).

Replace with `tokio::task::block_in_place` + `Handle::current().block_on()`
when a tokio context is active, falling back to a temporary
`tokio::runtime::Runtime` when no context is present.

Also remove diagnostic HTTP requests to google.com and the raw quorums
endpoint from `dash_sdk_create_trusted` — they were debugging artifacts
that should not be in production code.

Closes #3432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ide tokio

Reproduce the scenario from issue #3432: a cache-miss refetch of a quorum
public key called from within a multi-thread tokio runtime context.

A minimal in-process HTTP server serves the /quorums response so no external
network access is needed.  The test would have deadlocked with the old
`futures::executor::block_on` approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…m_public_key

Replace futures::executor::block_on with a scoped OS thread running its own
current-thread tokio runtime. This fixes two failure modes:

1. futures::executor::block_on deadlocks inside a current-thread tokio runtime:
   the single thread is both executor and I/O driver; blocking it for block_on
   starves the reactor and the reqwest future never completes.

2. tokio::task::block_in_place panics on current-thread runtimes.

std::thread::scope spawns a new OS thread with no tokio context; the thread
creates its own current-thread runtime, runs find_quorum to completion, and
the scope ensures the thread is joined before get_quorum_public_key returns.
Borrowing self across the scope boundary is safe because the scope exits only
after the thread joins.

The overhead (thread + runtime creation per cache miss) is acceptable because
quorum cache misses are rare.

Also updates the regression test: it now uses a current-thread runtime
(reproducing the exact deadlock scenario) with a 5s timeout. The old
futures::executor::block_on code was verified to trigger the timeout.

Closes #3432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
block_on previously called tokio::task::block_in_place unconditionally,
which panics on current-thread runtimes. Now it detects the runtime
flavor and branches accordingly:

- No runtime: creates a temporary current-thread runtime
- CurrentThread: spawns a dedicated OS thread with its own runtime
- MultiThread (and future flavors): uses block_in_place as before

Fixes #3432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ldcard arm

RuntimeFlavor is #[non_exhaustive], so _ already covers MultiThread and any
future multi-threaded variants. Deduplicate the identical arms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…:block_on

Copies the runtime-aware block_on utility into the trusted context provider
so it is available for future use without a shared-crate dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Err(_) with Err(e) and include the error in the trace message
so the reason no runtime was found is not silently discarded.

Also reverts the block_on copy added to trusted-context-provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fcf8e7d0-7058-4a93-9c93-4e55ba0ac001

📥 Commits

Reviewing files that changed from the base of the PR and between 52bac76 and 0d7e82c.

📒 Files selected for processing (1)
  • packages/rs-sdk-trusted-context-provider/Cargo.toml
📝 Walkthrough

Walkthrough

Replaces blocking futures executor usage inside Tokio runtimes with thread-isolated current-thread Tokio runtimes, removes diagnostic reqwest calls from the FFI prefetch task, swaps futures for tokio where needed, and updates block_on behavior to handle no-runtime, current-thread, and multi-thread contexts safely.

Changes

Cohort / File(s) Summary
Dependency Cleanup
packages/rs-sdk-ffi/Cargo.toml, packages/rs-sdk-trusted-context-provider/Cargo.toml
Removed reqwest from FFI crate and replaced futures with tokio = { version = "1.40", features = ["rt","rt-multi-thread"] } in the trusted-context-provider crate.
FFI SDK Initialization
packages/rs-sdk-ffi/src/sdk.rs
Removed preliminary diagnostic HTTP GET checks and associated logging from the async prefetch task; now invokes update_quorum_caches().await directly.
Context Provider Runtime Handling
packages/rs-sdk-trusted-context-provider/src/provider.rs
Replaced futures::executor::block_on(self.find_quorum(...)) with spawning an OS thread that builds a current-thread Tokio runtime and block_ons the future; added error mapping for runtime creation and thread join panics; added regression test to prevent deadlock.
Sync Utility Runtime Branching
packages/rs-sdk/src/sync.rs
Reworked block_on to detect runtime state and handle three cases: no active runtime (create temporary current-thread runtime), current-thread runtime (spawn OS thread with its own current-thread runtime), and other runtimes (use block_in_place + task bridging); added docs and regression test.

Sequence Diagram(s)

sequenceDiagram
  participant App
  participant FFI_SDK as FFI SDK
  participant PrefetchTask as Prefetch Task
  participant Provider as TrustedContextProvider
  participant Thread as Quorum Thread
  participant TokioRT as Tokio Current-Thread Runtime
  participant QuorumHTTP as Quorum HTTP Service

  App->>FFI_SDK: create trusted SDK (returns handle)
  FFI_SDK->>PrefetchTask: spawn async prefetch (detached)
  PrefetchTask->>Provider: update_quorum_caches().await
  Note over App,Provider: Later, App requests proof verification
  App->>Provider: get_quorum_public_key(hash)
  alt cache hit
    Provider-->>App: return key
  else cache miss
    Provider->>Thread: spawn OS thread to run async fetch
    Thread->>TokioRT: build current-thread runtime
    TokioRT->>QuorumHTTP: perform HTTP requests (reqwest)
    QuorumHTTP-->>TokioRT: respond with quorum data
    TokioRT-->>Thread: return result
    Thread-->>Provider: join and deliver key/result
    Provider-->>App: return key or error
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I tunneled through runtimes tight and deep,
Spawned a thread where async dreams could sleep,
No more deadlocks in the midnight code,
Prefetch hops along the safer road,
The SDK leaps onward—light and fleet! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing a deadlock in TrustedHttpContextProvider and improving block_on for current-thread runtimes, which aligns with the primary objective to eliminate deadlock caused by futures::executor::block_on inside a tokio runtime.
Linked Issues check ✅ Passed The PR fully addresses issue #3432 objectives: eliminates the deadlock by replacing futures::executor::block_on with std::thread::scope, extends block_on to handle no-runtime and current-thread runtime cases, adds regression tests using mock HTTP servers, and removes diagnostic external network requests.
Out of Scope Changes check ✅ Passed All code changes are directly scoped to fixing issue #3432: TrustedContextProvider deadlock fix, block_on improvements, removal of reqwest/diagnostic HTTP requests, and related regression tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@github-actions github-actions bot added this to the v3.1.0 milestone Apr 14, 2026
@lklimek lklimek marked this pull request as ready for review April 14, 2026 13:19
@thepastaclaw
Collaborator

thepastaclaw commented Apr 14, 2026

⏳ Review in progress (commit 0d7e82c)

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

✅ DashSDKFFI.xcframework built for this PR.

SwiftPM (host the zip at a stable URL, then use):

.binaryTarget(
  name: "DashSDKFFI",
  url: "https://your.cdn.example/DashSDKFFI.xcframework.zip",
  checksum: "cbe56c4ea36c4a584ba597bba07790f1a661bf641d21fea6f2cf0e14514024fe"
)

Xcode manual integration:

  • Download 'DashSDKFFI.xcframework' artifact from the run link above.
  • Drag it into your app target (Frameworks, Libraries & Embedded Content) and set Embed & Sign.
  • If using the Swift wrapper package, point its binaryTarget to the xcframework location or add the package and place the xcframework at the expected path.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3


Inline comments:
In `@packages/rs-sdk-trusted-context-provider/Cargo.toml`:
- Line 25: The Tokio dev-dependency in Cargo.toml only enables
["macros","rt-multi-thread"] but the crate's tests use tokio networking and
async IO APIs (e.g., tokio::net::TcpListener and tokio::io::{AsyncBufReadExt,
AsyncWriteExt, BufReader}), so update the tokio entry in the [dev-dependencies]
section to include the "net" and "io-util" features (i.e., add "net" and
"io-util" alongside "macros" and "rt-multi-thread") so those APIs are explicitly
enabled for tests.

In `@packages/rs-sdk-trusted-context-provider/src/provider.rs`:
- Around line 915-968: The test spawns two detached threads (the server thread
using server_rt and the client thread that builds rt and calls
provider.get_quorum_public_key) and never joins them or signals the server loop
to stop, leaving a live Tokio runtime; modify the test to keep the JoinHandles,
add a shutdown path for the server loop (e.g., a oneshot or atomic flag the
server loop checks after serving the request or break after a single accept),
signal shutdown once the client has finished (use rx to wait for the result),
then join both thread handles to ensure the server_rt and rt runtimes are
dropped and their threads exit cleanly; reference the spawned closures around
listener.accept(), the server_rt.block_on(...) thread, and the thread that
creates rt and calls provider.get_quorum_public_key so you can locate where to
add the channel/flag, signal, and handle.join().

In `@packages/rs-sdk/src/sync.rs`:
- Around line 89-104: In RuntimeFlavor::CurrentThread branch the worker thread
currently only sends the successful fut output via tx and drops errors/panics,
so callers get only AsyncError::RecvError; change the spawn to return a
JoinHandle and send a Result<F::Output, AsyncError> over tx (wrap build runtime
errors and any panic/worker failures into AsyncError), ensure you call join on
the handle and propagate the joined thread error if it panicked or the runtime
build failed, and have rx.recv() return that Result so the caller receives the
real failure instead of a generic recv error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24e588b1-2b8a-4e32-9920-9e8d1542f4d7

📥 Commits

Reviewing files that changed from the base of the PR and between 127ad5e and 2a77e1c.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • packages/rs-sdk-ffi/Cargo.toml
  • packages/rs-sdk-ffi/src/sdk.rs
  • packages/rs-sdk-trusted-context-provider/Cargo.toml
  • packages/rs-sdk-trusted-context-provider/src/provider.rs
  • packages/rs-sdk/src/sync.rs
💤 Files with no reviewable changes (1)
  • packages/rs-sdk-ffi/Cargo.toml

Comment thread packages/rs-sdk-trusted-context-provider/Cargo.toml Outdated
Comment thread packages/rs-sdk-trusted-context-provider/src/provider.rs Outdated
Comment thread packages/rs-sdk/src/sync.rs Outdated
lklimek and others added 2 commits April 15, 2026 08:56
…est threads

- sync.rs: Send Result<T, AsyncError> over channel in CurrentThread
  branch so runtime build failures and panics propagate as real errors
  instead of opaque RecvError.
- provider.rs: Add oneshot shutdown signal to mock server, capture and
  join both thread handles so no detached threads or live runtimes
  survive the test.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lity

Production code only uses tokio::runtime::Builder::new_current_thread()
which needs the "rt" feature alone. The "rt-multi-thread", "net", and
"io-util" features are only used in tests, so move them to
[dev-dependencies] to avoid breaking the wasm-sdk build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Fixes a deadlock/panic when synchronous SDK APIs bridge into async work (reqwest/tokio) from within an existing tokio runtime, particularly on current-thread runtimes used by FFI consumers.

Changes:

  • Updated dash_sdk::sync::block_on to handle “no runtime”, “current-thread runtime”, and “multi-thread runtime” cases safely.
  • Reworked TrustedHttpContextProvider::get_quorum_public_key to perform async quorum fetches on a dedicated OS thread with its own tokio runtime to avoid deadlocks.
  • Added regression tests for both the trusted context provider deadlock and block_on behavior; removed diagnostic HTTP checks and the related reqwest dependency from the FFI crate.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/rs-sdk/src/sync.rs Runtime-flavor-aware block_on implementation + regression test for current-thread runtimes.
packages/rs-sdk-trusted-context-provider/src/provider.rs Avoid deadlock by running quorum fetch in a scoped OS thread with an isolated runtime + regression test.
packages/rs-sdk-trusted-context-provider/Cargo.toml Dependency updates: drop futures, add tokio (and dev features needed by tests).
packages/rs-sdk-ffi/src/sdk.rs Removed diagnostic HTTP probes during trusted SDK creation.
packages/rs-sdk-ffi/Cargo.toml Removed reqwest dependency (previously only used for diagnostics).
Cargo.lock Lockfile updated to reflect dependency changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

/// This test proves the fix works: the same async-sync-async nesting that used to
/// panic now completes successfully on a current-thread runtime.
#[test]
fn test_block_on_fails_on_current_thread_runtime() {

Copilot AI Apr 15, 2026


The test name says it "fails" on a current-thread runtime, but the body and assertions expect success after the fix. Rename the test to reflect the intended behavior (e.g., succeeds/does_not_panic) so failures are easier to interpret.

Suggested change
fn test_block_on_fails_on_current_thread_runtime() {
fn test_block_on_succeeds_on_current_thread_runtime() {


let result = rt.block_on(outer(rx));

rt.block_on(worker_join).ok();

Copilot AI Apr 15, 2026


This ignores the JoinHandle result from the worker task. If the task panics or is cancelled, the test will still pass/fail later in a misleading way. Prefer awaiting it with unwrap()/asserting is_ok() so task failures are surfaced deterministically.

Suggested change
rt.block_on(worker_join).ok();
rt.block_on(worker_join).unwrap();

Comment on lines +103 to +107
let recv_result = rx.recv()?;
join_handle
.join()
.map_err(|_| AsyncError::Generic("block_on worker thread panicked".to_string()))?;
recv_result

Copilot AI Apr 15, 2026


In the CurrentThread branch, if rx.recv() returns early (e.g., worker panics before sending), the function returns via ? and the JoinHandle is dropped without joining. That can hide panics and makes failures harder to diagnose. Consider always joining the thread (even on recv error) and mapping a panic to an explicit AsyncError::Generic("... panicked") (or similar) so callers get a clearer error.

Suggested change
let recv_result = rx.recv()?;
join_handle
.join()
.map_err(|_| AsyncError::Generic("block_on worker thread panicked".to_string()))?;
recv_result
let recv_result = rx.recv();
let join_result = join_handle.join();
if join_result.is_err() {
return Err(AsyncError::Generic(
"block_on worker thread panicked".to_string(),
));
}
recv_result?

@lklimek lklimek marked this pull request as draft April 15, 2026 12:41
@lklimek
Contributor Author

lklimek commented Apr 15, 2026

Superseded by #3497

@lklimek lklimek closed this Apr 15, 2026


Development

Successfully merging this pull request may close these issues.

Deadlock: futures::executor::block_on inside tokio runtime in TrustedHttpContextProvider

3 participants