Add two-phase sampling API to DistServer #578

Draft
kmontemayor2-sc wants to merge 16 commits into main from kmonte/shared-backend-decomp-3
@kmontemayor2-sc
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Apr 6, 2026

Replace the single-step create_sampling_producer with a two-phase API:

  • init_sampling_backend: creates/reuses a SharedDistSamplingBackend
  • register_sampling_input: registers a lightweight per-channel input

We do this so that sampler backends can be reused across the storage cluster. Reuse greatly improves cluster stability and cuts the time spent waiting on process_start_gap_seconds.

A follow-up PR (#579) will switch BaseDistLoader to the new two-phase API; for now, DistServer just delegates to the two phases.
Note: that follow-up is needed, because with this PR alone we still create one dist sampling process tree per input, which is what we are trying to avoid. It should be fine as a temporary stand-in that makes the reviews easier.

The existing create_sampling_producer/destroy_sampling_producer methods
are preserved as bridge methods that delegate to the new API, keeping
existing loaders working without changes.
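
As a rough illustration of the shape described above, here is a minimal sketch of a server with both phases and the two bridge methods. `ToyServer`, `ToyBackend`, `backend_key`, and every signature below are assumptions made for this example and are not taken from the PR diff; in particular, how the real bridge chooses its key (and so whether backends end up shared) is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyBackend:
    """Stand-in for SharedDistSamplingBackend (hypothetical, not the real class)."""

    backend_id: int
    inputs: Dict[int, str] = field(default_factory=dict)  # channel_id -> input


class ToyServer:
    """Illustrative server exposing the two phases plus the bridge methods."""

    def __init__(self) -> None:
        self._backend_ids: Dict[str, int] = {}      # backend_key -> backend_id
        self._backends: Dict[int, ToyBackend] = {}  # backend_id -> backend
        self._channel_owner: Dict[int, int] = {}    # channel_id -> backend_id
        self._next_channel_id = 0

    def init_sampling_backend(self, backend_key: str) -> int:
        # Phase 1: create the backend once; later calls with the same key reuse it.
        if backend_key not in self._backend_ids:
            backend_id = len(self._backends)
            self._backend_ids[backend_key] = backend_id
            self._backends[backend_id] = ToyBackend(backend_id)
        return self._backend_ids[backend_key]

    def register_sampling_input(self, backend_id: int, sampler_input: str) -> int:
        # Phase 2: attach a lightweight per-channel input to an existing backend.
        channel_id = self._next_channel_id
        self._next_channel_id += 1
        self._backends[backend_id].inputs[channel_id] = sampler_input
        self._channel_owner[channel_id] = backend_id
        return channel_id

    def create_sampling_producer(self, backend_key: str, sampler_input: str) -> int:
        # Bridge: the old single-step entry point delegates to the two phases.
        backend_id = self.init_sampling_backend(backend_key)
        return self.register_sampling_input(backend_id, sampler_input)

    def destroy_sampling_producer(self, channel_id: int) -> None:
        # Bridge: drop the per-channel input; the backend itself stays alive.
        backend_id = self._channel_owner.pop(channel_id)
        del self._backends[backend_id].inputs[channel_id]
```

In this sketch, existing callers keep using `create_sampling_producer`, while new callers can call the two phases directly and register several inputs against one backend.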

Also adds InitSamplingBackendRequest and RegisterBackendRequest message
dataclasses, and per-channel fetch stats logging.
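
For readers who have not opened the diff, the sketch below shows one plausible shape for the two request messages and a per-channel fetch counter. Only the class names `InitSamplingBackendRequest` and `RegisterBackendRequest` come from this PR; every field, plus the `FetchStats` helper, is a hypothetical placeholder.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InitSamplingBackendRequest:
    # Hypothetical fields: the real message may carry different data.
    backend_key: str
    num_workers: int


@dataclass(frozen=True)
class RegisterBackendRequest:
    # Hypothetical fields: the real message may carry different data.
    backend_id: int
    seed_ids: Tuple[int, ...]
    batch_size: int


class FetchStats:
    """Hypothetical helper that counts fetches per channel for logging."""

    def __init__(self) -> None:
        self._fetches: Counter = Counter()

    def record(self, channel_id: int) -> None:
        self._fetches[channel_id] += 1

    def summary(self) -> str:
        # One entry per channel, ordered by channel id.
        return ", ".join(
            f"channel {channel_id}: {count} fetches"
            for channel_id, count in sorted(self._fetches.items())
        )
```

Frozen dataclasses are used here so that a request cannot be mutated after it is sent; that is a choice made for the sketch, not something the PR states.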

kmonte and others added 3 commits April 6, 2026 20:59
…r.py

Move create_dist_sampler(), SamplerInput, and SamplerRuntime out of
dist_sampling_producer.py into a shared utils module so they can be
reused by the upcoming SharedDistSamplingBackend.

Also rename `w` -> `worker` in DistSamplingProducer.init() for clarity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Introduce SharedDistSamplingBackend which manages a pool of worker
processes servicing multiple compute-rank channels through a fair-queued
round-robin scheduler. This replaces the per-channel producer model in
graph-store mode with a shared backend + lightweight per-channel state.
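
To make "fair-queued round-robin" concrete, here is a minimal sketch of one way to interleave pending work from several channels so that no single channel starves the others. The `round_robin` function and its task representation are assumptions for illustration; the scheduler in this commit may work differently.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Tuple


def round_robin(pending: Dict[int, List[str]]) -> Iterator[Tuple[int, str]]:
    """Yield (channel_id, task) pairs, taking one task per channel per turn."""
    queues: Deque[Tuple[int, Deque[str]]] = deque(
        (channel_id, deque(tasks)) for channel_id, tasks in pending.items() if tasks
    )
    while queues:
        channel_id, tasks = queues.popleft()
        yield channel_id, tasks.popleft()
        if tasks:
            # The channel still has work, so it goes to the back of the line.
            queues.append((channel_id, tasks))
```

With channel 0 holding three tasks and channel 1 holding one, channel 1 is served second rather than after all of channel 0's work.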

Includes tests for pure business logic helpers (_compute_num_batches,
_epoch_batch_indices, _compute_worker_seeds_ranges), shuffle behavior,
and completion reporting.
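
The helpers named above are private to the PR, so their signatures are not visible here. The sketch below shows the kind of logic such helpers typically contain, using hypothetical public names and parameters (`compute_num_batches`, `epoch_batch_indices`, `worker_seed_ranges`, `drop_last`, `seed`) that should not be read as the actual implementation.

```python
import math
import random
from typing import List, Tuple


def compute_num_batches(num_seeds: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches per epoch, optionally dropping a final partial batch."""
    if drop_last:
        return num_seeds // batch_size
    return math.ceil(num_seeds / batch_size)


def epoch_batch_indices(
    num_seeds: int, batch_size: int, shuffle: bool, seed: int
) -> List[List[int]]:
    """Split seed positions into batches, shuffling deterministically if asked."""
    order = list(range(num_seeds))
    if shuffle:
        random.Random(seed).shuffle(order)
    return [order[i : i + batch_size] for i in range(0, num_seeds, batch_size)]


def worker_seed_ranges(num_seeds: int, num_workers: int) -> List[Tuple[int, int]]:
    """Contiguous [start, end) ranges per worker; early workers absorb the remainder."""
    base, extra = divmod(num_seeds, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        end = start + base + (1 if worker < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Keeping these as pure functions is what makes them easy to unit test without spinning up worker processes.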

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace the single-step create_sampling_producer with a two-phase API:
- init_sampling_backend: creates/reuses a SharedDistSamplingBackend
- register_sampling_input: registers a lightweight per-channel input

The existing create_sampling_producer/destroy_sampling_producer methods
are preserved as bridge methods that delegate to the new API, keeping
existing loaders working without changes.

Also adds InitSamplingBackendRequest and RegisterBackendRequest message
dataclasses, and per-channel fetch stats logging.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

GiGL Automation

@ 21:11:56UTC : 🔄 Scala Unit Test started.

@ 21:19:11UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

GiGL Automation

@ 21:11:56UTC : 🔄 Lint Test started.

@ 21:19:07UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

GiGL Automation

@ 21:11:58UTC : 🔄 Integration Test started.

@ 22:35:09UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

GiGL Automation

@ 21:12:01UTC : 🔄 Python Unit Test started.

@ 22:24:16UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 6, 2026

GiGL Automation

@ 21:12:02UTC : 🔄 E2E Test started.

@ 22:30:05UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc kmontemayor2-sc changed the title from "Kmonte/shared backend decomp 3" to "Add two-phase sampling API to DistServer" on Apr 14, 2026
@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

GiGL Automation

@ 23:13:27UTC : 🔄 Integration Test started.

@ 00:24:24UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

GiGL Automation

@ 23:13:28UTC : 🔄 Lint Test started.

@ 23:17:18UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

GiGL Automation

@ 23:13:29UTC : 🔄 Scala Unit Test started.

@ 23:22:27UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

GiGL Automation

@ 23:13:31UTC : 🔄 Python Unit Test started.

@ 00:20:52UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 14, 2026

GiGL Automation

@ 23:13:31UTC : 🔄 E2E Test started.

@ 00:43:22UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 00:19:20UTC : 🔄 Python Unit Test started.

@ 01:26:03UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 00:19:20UTC : 🔄 E2E Test started.

@ 01:50:22UTC : ❌ Workflow failed.
Please check the logs for more details.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 00:19:20UTC : 🔄 Integration Test started.

@ 01:34:08UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 00:19:21UTC : 🔄 Lint Test started.

@ 00:26:20UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 00:19:21UTC : 🔄 Scala Unit Test started.

@ 00:26:53UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 02:50:48UTC : 🔄 Integration Test started.

@ 04:15:20UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 02:50:49UTC : 🔄 Scala Unit Test started.

@ 02:58:50UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 02:50:51UTC : 🔄 Lint Test started.

@ 02:58:14UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 02:50:51UTC : 🔄 E2E Test started.

@ 04:05:30UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

GiGL Automation

@ 02:50:52UTC : 🔄 Python Unit Test started.

@ 03:47:46UTC : ✅ Workflow completed successfully.

2 participants