Skip to content

Add CONTIGUOUS shard strategy#545

Merged
kmontemayor2-sc merged 30 commits intomainfrom
kmonte/update-node-shard-strategy
Apr 9, 2026
Merged

Add CONTIGUOUS shard strategy#545
kmontemayor2-sc merged 30 commits intomainfrom
kmonte/update-node-shard-strategy

Conversation

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Mar 13, 2026

Making some changes to the way we distribute nodes for graph store mode.

This is one step in allowing us to reduce the produce load across the cluster, and decreasing cluster spin up time and increasing overall stability.

Let's do this instead of #567 as this way let's us have all compute ranks share an even number of nodes, which is really what we want.

This isn't really a performance improvement - it helps us have our cluster be much more stable as it reduces the amount of cross-talk between compute and storage sub-clusters.

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:30UTC : 🔄 Python Unit Test started.

@ 01:26:31UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:30UTC : 🔄 Lint Test started.

@ 24:25:57UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:31UTC : 🔄 Integration Test started.

@ 01:52:42UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:32UTC : 🔄 Scala Unit Test started.

@ 24:28:30UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:32UTC : 🔄 E2E Test started.

@ 01:38:44UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:14UTC : 🔄 Lint Test started.

@ 24:06:39UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:15UTC : 🔄 E2E Test started.

@ 01:22:05UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Python Unit Test started.

@ 01:15:30UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Scala Unit Test started.

@ 24:07:40UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Integration Test started.

@ 01:19:53UTC : ✅ Workflow completed successfully.

kmonte and others added 10 commits March 24, 2026 16:41
Expand the one-line docstring to include concrete examples showing how
ROUND_ROBIN and CONTIGUOUS strategies distribute node IDs across compute
nodes, including split filtering and fractional server assignment.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… check

The world_size != num_compute_nodes validation was unnecessarily
restrictive — callers may legitimately pass a different world_size.
Also extract the validator to a module-level function since it no longer
needs self.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The sliced tensor holds a reference to the original, but in the
contiguous flow the original is a local variable that goes out of scope,
so the slice effectively owns the data. Removing clone() avoids an
unnecessary copy.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace the all_reduce count comparison with all_gather + sorted tensor
comparison to catch cases where counts match but actual node IDs differ
between CONTIGUOUS and ROUND_ROBIN strategies.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Merge _make_rank_aware_async_mock and _make_rank_aware_ablp_async_mock
  into a single generic helper
- Remove _assert_contiguous_node_ids and _assert_contiguous_ablp_inputs
  helpers, inline assertions directly in tests
- Replace @parameterized.expand with separate named test methods for
  better readability
- Fix stale variable reference in integration test log line

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Annotate _mock_request_server, _mock_async_request_server,
_patch_remote_requests, and _create_server_with_splits kwargs with
proper type hints. Add Callable, Iterator, and Any imports.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 25, 2026

GiGL Automation

@ 21:10:20UTC : 🔄 E2E Test started.

@ 22:31:23UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

GiGL Automation

@ 16:37:52UTC : 🔄 Lint Test started.

@ 16:45:13UTC : ✅ Workflow completed successfully.

Copy link
Copy Markdown
Collaborator

@mkolodner-sc mkolodner-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle! Did a pass here, left some comments/questions.

Comment thread gigl/distributed/graph_store/messages.py
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py
Comment thread gigl/distributed/graph_store/sharding.py
Comment thread gigl/distributed/graph_store/sharding.py Outdated
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py Outdated
Comment thread gigl/distributed/graph_store/sharding.py Outdated
Comment thread gigl/distributed/graph_store/sharding.py Outdated
Comment thread gigl/distributed/graph_store/sharding.py Outdated
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py Outdated
Comment thread gigl/distributed/graph_store/remote_dist_dataset.py Outdated
Comment thread gigl/distributed/graph_store/sharding.py Outdated
@kmontemayor2-sc kmontemayor2-sc changed the title Add alternative shard strategy Add CONTIGUOUS shard strategy Apr 8, 2026
Copy link
Copy Markdown
Collaborator

@mkolodner-sc mkolodner-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kyle for the work! Left a few comments but this LGTM

@kmontemayor2-sc
Copy link
Copy Markdown
Collaborator Author

/all_test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

GiGL Automation

@ 19:00:26UTC : 🔄 Scala Unit Test started.

@ 19:09:57UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

GiGL Automation

@ 19:00:27UTC : 🔄 E2E Test started.

@ 20:27:49UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

GiGL Automation

@ 19:00:27UTC : 🔄 Integration Test started.

@ 20:31:19UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

GiGL Automation

@ 19:00:28UTC : 🔄 Lint Test started.

@ 19:10:18UTC : ✅ Workflow completed successfully.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

GiGL Automation

@ 19:00:32UTC : 🔄 Python Unit Test started.

@ 20:17:06UTC : ✅ Workflow completed successfully.

Copy link
Copy Markdown
Collaborator

@svij-sc svij-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamp

@kmontemayor2-sc kmontemayor2-sc marked this pull request as ready for review April 9, 2026 18:37
@kmontemayor2-sc kmontemayor2-sc added this pull request to the merge queue Apr 9, 2026
Merged via the queue into main with commit ab359ad Apr 9, 2026
6 checks passed
@kmontemayor2-sc kmontemayor2-sc deleted the kmonte/update-node-shard-strategy branch April 9, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants