
Pod Ring preset #251

Open
AtlantaPepsi wants to merge 3 commits into ROCm:candidate from AtlantaPepsi:podring

Conversation

Contributor

@AtlantaPepsi AtlantaPepsi commented Mar 28, 2026

Motivation

We need an intra-pod ring preset, similar to the nicrings preset, to simulate potential communication patterns used by RCCL.

Technical Details

Similar to the poda2a preset, there is an option to reorder all detectable devices according to a user-specified stride; the reordered devices are then divided into subgroups of a user-specified size, and each subgroup forms a ring.
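
The actual logic lives in src/client/Presets/PodRing.hpp and src/client/Utilities.hpp; the following is only a rough, hypothetical sketch of the reorder-and-group idea (the function name, signature, and traversal details are illustrative, not the PR's code):

// Hypothetical sketch of the stride-reorder-then-group step (not the actual
// PodRing.hpp / Utils::StrideGenerate implementation). Device indices are
// visited in steps of `stride`, wrapping around and bumping the starting
// offset whenever a cycle closes, then the reordered list is cut into rings
// of `groupSize` devices each.
#include <vector>

std::vector<std::vector<int>> BuildRings(int numDevices, int stride, int groupSize)
{
  std::vector<int>  order;
  std::vector<bool> visited(numDevices, false);
  int idx = 0;
  for (int i = 0; i < numDevices; ++i) {
    while (visited[idx]) idx = (idx + 1) % numDevices;   // next unvisited offset
    order.push_back(idx);
    visited[idx] = true;
    idx = (idx + stride) % numDevices;                   // jump by the stride
  }

  // Consecutive chunks of the reordered list become rings; the last device of
  // each chunk wraps back to the first when the transfers are generated.
  std::vector<std::vector<int>> rings;
  for (int start = 0; start + groupSize <= (int)order.size(); start += groupSize)
    rings.emplace_back(order.begin() + start, order.begin() + start + groupSize);
  return rings;
}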

Test Plan

Test Result

Example: on 2 nodes, each with 4 GPUs (a small driver reproducing the grouping step is sketched after the two listings below)

  • Stride = 1 and Group Size = 4 -> all 2 x 4 = 8 devices kept in natural order and cut into 2 subgroups
[PodRing Related]
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            1 : Reordering devices by taking 1 steps
GROUP_SIZE           =            4 : Dividing all devices into ring groups of 4

GPU-GFX IntraPod Ring benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
2 ring(s) of 4 devices:
  Ring 0: R0:G0 -> R0:G1 -> R0:G2 -> R0:G3 -> R0:G0
  Ring 1: R1:G0 -> R1:G1 -> R1:G2 -> R1:G3 -> R1:G0


--- Pod Ring Group 0 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     0 │    0     1 │ 106.53   │
│    0     1 │    0     2 │ 105.44   │
│    0     2 │    0     3 │ 108.15   │
│    0     3 │    0     0 │ 110.00   │
├------------┼------------┼----------┤
│        MAX │            │ 110.00   │
│        AVG │            │ 107.53   │
│        MIN │            │ 105.44   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  197.714 GB/s

--- Pod Ring Group 1 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    1     0 │    1     1 │ 105.30   │
│    1     1 │    1     2 │ 104.57   │
│    1     2 │    1     3 │ 106.82   │
│    1     3 │    1     0 │ 106.71   │
├------------┼------------┼----------┤
│        MAX │            │ 106.82   │
│        AVG │            │ 105.85   │
│        MIN │            │ 104.57   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  197.387 GB/s
  • Stride = 4 and Group Size = 4 -> all 2 x 4 = 8 devices reordered and cut into 2 subgroups
[PodRing Related]
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            4 : Reordering devices by taking 4 steps
GROUP_SIZE           =            4 : Dividing all devices into ring groups of 4

GPU-GFX IntraPod Ring benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
2 ring(s) of 4 devices:
  Ring 0: R0:G0 -> R0:G2 -> R1:G0 -> R1:G2 -> R0:G0
  Ring 1: R0:G1 -> R0:G3 -> R1:G1 -> R1:G3 -> R0:G1


--- Pod Ring Group 0 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     0 │    0     2 │ 110.72   │
│    0     2 │    1     0 │ 111.23   │
│    1     0 │    1     2 │ 109.95   │
│    1     2 │    0     0 │ 110.07   │
├------------┼------------┼----------┤
│        MAX │            │ 111.23   │
│        AVG │            │ 110.49   │
│        MIN │            │ 109.95   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  410.284 GB/s

--- Pod Ring Group 1 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     1 │    0     3 │ 104.38   │
│    0     3 │    1     1 │ 104.27   │
│    1     1 │    1     3 │ 103.70   │
│    1     3 │    0     1 │ 103.47   │
├------------┼------------┼----------┤
│        MAX │            │ 104.38   │
│        AVG │            │ 103.96   │
│        MIN │            │ 103.47   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  385.956 GB/s
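
For reference, a tiny driver for the BuildRings() sketch above (again hypothetical, not part of this PR) reproduces the reorder-and-group step in isolation. With 8 devices, group size 4 and stride 1 the devices stay in natural order and split into {0,1,2,3} and {4,5,6,7}; with stride 4 the reorder interleaves them. How those indices map to (rank, GPU) pairs is up to the preset's device enumeration, so the printed indices are only illustrative:

// Hypothetical driver for the BuildRings() sketch above; prints the ring
// membership produced by the reorder-and-group step for the two strides used
// in the runs above. The device-index-to-(rank, GPU) mapping is not modeled.
#include <cstdio>
#include <vector>

std::vector<std::vector<int>> BuildRings(int numDevices, int stride, int groupSize);

int main()
{
  for (int stride : {1, 4}) {
    std::printf("stride %d:\n", stride);
    auto rings = BuildRings(8, stride, 4);
    for (size_t r = 0; r < rings.size(); ++r) {
      std::printf("  Ring %zu:", r);
      for (int d : rings[r]) std::printf(" %d", d);
      std::printf("\n");
    }
  }
  return 0;
}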

Submission Checklist

Contributor

Copilot AI left a comment


Pull request overview

Adds a new “podring” preset and centralizes several scheduling helper utilities so they can be reused across presets.

Changes:

  • Added a new PodRingPreset to run intra-pod ring transfers (optionally with NIC queue-pair transfers) and print per-group summaries.
  • Moved common helper routines (StrideGenerate, RoundRobinSchedule, CombinationSchedule) into TransferBench::Utils.
  • Updated existing presets to call the new Utils:: helper implementations.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Summary per file:

  • src/client/Utilities.hpp: Adds shared scheduling / indexing helpers used by multiple presets.
  • src/client/Presets/Presets.hpp: Registers the new podring preset and includes its header.
  • src/client/Presets/PodRing.hpp: New preset implementing ring transfers within pod subgroups.
  • src/client/Presets/PodPeerToPeer.hpp: Switches the round-robin scheduling call to Utils::RoundRobinSchedule.
  • src/client/Presets/PodAllToAll.hpp: Removes the local stride helper and uses Utils::StrideGenerate.
  • src/client/Presets/NicPeerToPeer.hpp: Removes local scheduling helpers and uses Utils::RoundRobinSchedule / Utils::CombinationSchedule.


Comment on lines +194 to +197
double gfxMax = std::numeric_limits<double>::min();
double nicMin = std::numeric_limits<double>::max();
double nicAvg = 0.0;
double nicMax = std::numeric_limits<double>::min();

Copilot AI Mar 28, 2026


For initializing max values, std::numeric_limits<double>::min() is the smallest positive normalized value, not the most-negative value. If a bandwidth can be 0.0, gfxMax/nicMax would incorrectly remain ~2e-308. Use std::numeric_limits<double>::lowest() (or initialize to -std::numeric_limits<double>::infinity() / 0.0) for max initializers.

Suggested change:

-    double gfxMax = std::numeric_limits<double>::min();
-    double nicMin = std::numeric_limits<double>::max();
-    double nicAvg = 0.0;
-    double nicMax = std::numeric_limits<double>::min();
+    double gfxMax = std::numeric_limits<double>::lowest();
+    double nicMin = std::numeric_limits<double>::max();
+    double nicAvg = 0.0;
+    double nicMax = std::numeric_limits<double>::lowest();
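
As a standalone illustration of the pitfall (not code from this PR): seeding a running maximum with min() means a measured bandwidth of 0.0 never replaces the seed, while lowest() behaves as expected.

// Minimal demonstration of min() vs lowest() as the seed for a running maximum.
#include <algorithm>
#include <cstdio>
#include <limits>

int main()
{
  const double samples[] = {0.0, 0.0};   // e.g. transfers that reported 0 GB/s

  double badMax  = std::numeric_limits<double>::min();     // smallest positive normal, ~2.2e-308
  double goodMax = std::numeric_limits<double>::lowest();  // most negative finite double

  for (double s : samples) {
    badMax  = std::max(badMax, s);
    goodMax = std::max(goodMax, s);
  }

  std::printf("min()-seeded max:    %g\n", badMax);   // ~2.22507e-308, not 0
  std::printf("lowest()-seeded max: %g\n", goodMax);  // 0
  return 0;
}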

