Add two-phase sampling API to DistServer#578
Conversation
…r.py Move create_dist_sampler(), SamplerInput, and SamplerRuntime out of dist_sampling_producer.py into a shared utils module so they can be reused by the upcoming SharedDistSamplingBackend. Also rename `w` -> `worker` in DistSamplingProducer.init() for clarity. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Introduce SharedDistSamplingBackend which manages a pool of worker processes servicing multiple compute-rank channels through a fair-queued round-robin scheduler. This replaces the per-channel producer model in graph-store mode with a shared backend + lightweight per-channel state. Includes tests for pure business logic helpers (_compute_num_batches, _epoch_batch_indices, _compute_worker_seeds_ranges), shuffle behavior, and completion reporting. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace the single-step create_sampling_producer with a two-phase API: - init_sampling_backend: creates/reuses a SharedDistSamplingBackend - register_sampling_input: registers a lightweight per-channel input The existing create_sampling_producer/destroy_sampling_producer methods are preserved as bridge methods that delegate to the new API, keeping existing loaders working without changes. Also adds InitSamplingBackendRequest and RegisterBackendRequest message dataclasses, and per-channel fetch stats logging. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
/all_test |
GiGL Automation@ 21:11:56UTC : 🔄 @ 21:19:11UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 21:11:56UTC : 🔄 @ 21:19:07UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 21:11:58UTC : 🔄 @ 22:35:09UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 21:12:01UTC : 🔄 @ 22:24:16UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 21:12:02UTC : 🔄 @ 22:30:05UTC : ✅ Workflow completed successfully. |
…r module docstring Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…sses Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…into kmonte/shared-backend-decomp-3
|
/all_test |
GiGL Automation@ 23:13:27UTC : 🔄 @ 24:24:24UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:13:28UTC : 🔄 @ 23:17:18UTC : ❌ Workflow failed. |
GiGL Automation@ 23:13:29UTC : 🔄 @ 23:22:27UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:13:31UTC : 🔄 @ 24:20:52UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:13:31UTC : 🔄 @ 24:43:22UTC : ✅ Workflow completed successfully. |
|
/all_test |
GiGL Automation@ 24:19:20UTC : 🔄 @ 01:26:03UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:20UTC : 🔄 @ 01:50:22UTC : ❌ Workflow failed. |
GiGL Automation@ 24:19:20UTC : 🔄 @ 01:34:08UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:21UTC : 🔄 @ 24:26:20UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:21UTC : 🔄 @ 24:26:53UTC : ✅ Workflow completed successfully. |
|
/all_test |
GiGL Automation@ 02:50:48UTC : 🔄 @ 04:15:20UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 02:50:49UTC : 🔄 @ 02:58:50UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 02:50:51UTC : 🔄 @ 02:58:14UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 02:50:51UTC : 🔄 @ 04:05:30UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 02:50:52UTC : 🔄 @ 03:47:46UTC : ✅ Workflow completed successfully. |
Replace the single-step create_sampling_producer with a two-phase API:
We do this so we can re-use the sampler backends across the storage cluster, this greatly improves on cluster stability and lets us save on process_start_gap_seconds time.
I will have a followup PR (#579) where we have BaseDistLoader use the new two-phase API, but for now we just delegate to the two phases in DistServer.
Note, we really should have that follow up, as this approach means we'd be creating one dist sampling process tree per input still, which we are trying to avoid (but should be fine as a temporary standin to help make the reviews easier).
The existing create_sampling_producer/destroy_sampling_producer methods
are preserved as bridge methods that delegate to the new API, keeping
existing loaders working without changes.
Also adds InitSamplingBackendRequest and RegisterBackendRequest message
dataclasses, and per-channel fetch stats logging.