/all_test |
GiGL Automation@ 24:19:30UTC : 🔄 @ 01:26:31UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:30UTC : 🔄 @ 24:25:57UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:31UTC : 🔄 @ 01:52:42UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:32UTC : 🔄 @ 24:28:30UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 24:19:32UTC : 🔄 @ 01:38:44UTC : ✅ Workflow completed successfully. |
|
/all_test |
GiGL Automation@ 23:59:14UTC : 🔄 @ 24:06:39UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:59:15UTC : 🔄 @ 01:22:05UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:59:17UTC : 🔄 @ 01:15:30UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:59:17UTC : 🔄 @ 24:07:40UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:59:17UTC : 🔄 @ 01:19:53UTC : ✅ Workflow completed successfully. |
Expand the one-line docstring to include concrete examples showing how ROUND_ROBIN and CONTIGUOUS strategies distribute node IDs across compute nodes, including split filtering and fractional server assignment. Co-Authored-By: Claude Opus 4.6 <[email protected]>
… check The world_size != num_compute_nodes validation was unnecessarily restrictive — callers may legitimately pass a different world_size. Also extract the validator to a module-level function since it no longer needs self. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The sliced tensor holds a reference to the original, but in the contiguous flow the original is a local variable that goes out of scope, so the slice effectively owns the data. Removing clone() avoids an unnecessary copy. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace the all_reduce count comparison with all_gather + sorted tensor comparison to catch cases where counts match but actual node IDs differ between CONTIGUOUS and ROUND_ROBIN strategies. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Merge _make_rank_aware_async_mock and _make_rank_aware_ablp_async_mock into a single generic helper
- Remove _assert_contiguous_node_ids and _assert_contiguous_ablp_inputs helpers, inline assertions directly in tests
- Replace @parameterized.expand with separate named test methods for better readability
- Fix stale variable reference in integration test log line
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Annotate _mock_request_server, _mock_async_request_server, _patch_remote_requests, and _create_server_with_splits kwargs with proper type hints. Add Callable, Iterator, and Any imports. Co-Authored-By: Claude Opus 4.6 <[email protected]>
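The commit above that swaps the count comparison for a sorted comparison can be illustrated with a small sketch. This is not the test code from this PR: the real test is described as using all_gather on tensors, whereas the sketch below simulates the gathered per-rank shards with plain Python lists, and the helper names `gathered_counts_match` and `gathered_ids_match` are hypothetical.

```python
from typing import List


def gathered_counts_match(shards_a: List[List[int]], shards_b: List[List[int]]) -> bool:
    """Count-only check: compares the total number of IDs across all ranks."""
    return sum(len(s) for s in shards_a) == sum(len(s) for s in shards_b)


def gathered_ids_match(shards_a: List[List[int]], shards_b: List[List[int]]) -> bool:
    """Stricter check: flattens every rank's shard, sorts, and compares the IDs."""
    ids_a = sorted(i for shard in shards_a for i in shard)
    ids_b = sorted(i for shard in shards_b for i in shard)
    return ids_a == ids_b


# Simulated per-rank shards for two ranks (stand-ins for all_gather output).
round_robin = [[0, 2], [1, 3]]
contiguous_ok = [[0, 1], [2, 3]]
contiguous_buggy = [[0, 1], [2, 2]]  # same count, but ID 3 is missing

# The count check cannot tell the buggy sharding apart from the correct one.
counts_ok = gathered_counts_match(round_robin, contiguous_buggy)
# The sorted-ID check does.
ids_ok = gathered_ids_match(round_robin, contiguous_buggy)
```

Here `counts_ok` is `True` while `ids_ok` is `False`, which is the kind of case a count-only comparison would miss.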
|
/all_test |
GiGL Automation@ 21:10:20UTC : 🔄 @ 22:31:23UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 16:37:52UTC : 🔄 @ 16:45:13UTC : ✅ Workflow completed successfully. |
mkolodner-sc left a comment
Thanks Kyle! Did a pass here, left some comments/questions.
…Snapchat/GiGL into kmonte/update-node-shard-strategy
CONTIGUOUS shard strategy
|
/all_test |
GiGL Automation@ 19:00:26UTC : 🔄 @ 19:09:57UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:00:27UTC : 🔄 @ 20:27:49UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:00:27UTC : 🔄 @ 20:31:19UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:00:28UTC : 🔄 @ 19:10:18UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 19:00:32UTC : 🔄 @ 20:17:06UTC : ✅ Workflow completed successfully. |
Making some changes to the way we distribute nodes for graph store mode.
This is one step toward reducing the produce load across the cluster, decreasing cluster spin-up time, and increasing overall stability.
Let's do this instead of #567, as this approach lets all compute ranks share an even number of nodes, which is really what we want.
This isn't really a performance improvement; it makes the cluster much more stable by reducing the amount of cross-talk between the compute and storage sub-clusters.
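
For readers unfamiliar with the two strategy names, here is a minimal sketch of how round-robin and contiguous sharding differ when splitting a list of node IDs across ranks. It is an illustration only and not the GiGL implementation: the function names `shard_round_robin` and `shard_contiguous` are hypothetical, the inputs are plain Python lists rather than tensors, and split filtering and server assignment are left out.

```python
from typing import List


def shard_round_robin(node_ids: List[int], rank: int, world_size: int) -> List[int]:
    """Rank r takes every world_size-th ID, starting at position r."""
    return node_ids[rank::world_size]


def shard_contiguous(node_ids: List[int], rank: int, world_size: int) -> List[int]:
    """Rank r takes one unbroken slice; slice sizes differ by at most one."""
    n = len(node_ids)
    start = rank * n // world_size
    end = (rank + 1) * n // world_size
    return node_ids[start:end]


node_ids = list(range(10))
world_size = 4

round_robin_shards = [shard_round_robin(node_ids, r, world_size) for r in range(world_size)]
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

contiguous_shards = [shard_contiguous(node_ids, r, world_size) for r in range(world_size)]
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```

Both strategies cover every ID exactly once and give each rank a near-even share. The difference is that a contiguous shard is a single range of the input, while a round-robin shard is spread across the whole input.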