
Modification to Pod A2A #259

Open

AtlantaPepsi wants to merge 1 commit into ROCm:candidate from AtlantaPepsi:poda2a_mod

Conversation

@AtlantaPepsi
Contributor

Motivation

  • Make all groups in a pod run all-to-all in parallel; previously, the groups executed serially
  • Modify the output to reflect A2A group members

Technical Details

Test Plan

Tested on 8 devices (2 nodes with 4 GPUs each, launched via MPI) in a single pod; the multi-pod scenario has not yet been investigated

Test Result

Example output from a 2-node x 4-GPU cluster (in the output below, R<i>:G<j> denotes GPU j on MPI rank i)

Default (1 Group of size 8 in natural order)

[AllToAll Related]
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            1 : Reordering devices by taking 1 steps
GROUP_SIZE           =            8 : Dividing all devices into groups of 8 for a2a

GPU-GFX IntraPod All-To-All benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [1 Read(s) 1 Write(s)] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
A2A group 0: R0:G0, R0:G1, R0:G2, R0:G3, R1:G0, R1:G1, R1:G2, R1:G3

--- Pod AllToAll Group 0 ---
┌-------------┬------------┬------------------------------------┬------------------------------------┐
│ SRC+EXE\DST │            │ Rank 00                            │ Rank 01                            │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│             │ Mem Device │  GPU 00   GPU 01   GPU 02   GPU 03 │  GPU 00   GPU 01   GPU 02   GPU 03 │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│     Rank 00 │     GPU 00 │     N/A    74.21    74.43    87.73 │   87.64   102.58   102.74   103.02 │
│             │     GPU 01 │   83.21      N/A    82.99    83.27 │   83.29   103.38   103.55   103.92 │
│             │     GPU 02 │   96.29    95.45      N/A    95.50 │   95.73    95.96    96.31    96.59 │
│             │     GPU 03 │   87.71    87.75    87.83      N/A │   88.18   111.02   111.10   111.25 │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│     Rank 01 │     GPU 00 │   90.73    89.91    91.06    89.90 │     N/A    89.77    89.74    91.05 │
│             │     GPU 01 │   83.53    83.73    83.59    83.69 │  102.92      N/A   103.05   103.21 │
│             │     GPU 02 │   83.82    83.67    83.92    84.07 │  103.17   103.28      N/A   103.53 │
│             │     GPU 03 │   77.40    77.71    90.59    90.64 │   90.56    93.93    94.02      N/A │
└-------------┴------------┴------------------------------------┴------------------------------------┘

2 Groups with Stride 2

[AllToAll Related]
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            2 : Reordering devices by taking 2 steps
GROUP_SIZE           =            4 : Dividing all devices into groups of 4 for a2a

GPU-GFX IntraPod All-To-All benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [1 Read(s) 1 Write(s)] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
A2A group 0: R0:G0, R0:G1, R1:G0, R1:G1
A2A group 1: R0:G2, R0:G3, R1:G2, R1:G3

--- Pod AllToAll Group 0 ---
┌-------------┬------------┬------------------┬------------------┐
│ SRC+EXE\DST │            │ Rank 00          │ Rank 01          │
├-------------┼------------┼------------------┼------------------┤
│             │ Mem Device │  GPU 00   GPU 01 │  GPU 00   GPU 01 │
├-------------┼------------┼------------------┼------------------┤
│     Rank 00 │     GPU 00 │     N/A  104.90  │  104.78  105.07  │
│             │     GPU 01 │  104.01     N/A  │  103.89  104.01  │
├-------------┼------------┼------------------┼------------------┤
│     Rank 01 │     GPU 00 │  104.33  104.55  │     N/A  104.64  │
│             │     GPU 01 │  104.42  104.54  │  103.98     N/A  │
└-------------┴------------┴------------------┴------------------┘

--- Pod AllToAll Group 1 ---
┌-------------┬------------┬------------------┬------------------┐
│ SRC+EXE\DST │            │ Rank 00          │ Rank 01          │
├-------------┼------------┼------------------┼------------------┤
│             │ Mem Device │  GPU 02   GPU 03 │  GPU 02   GPU 03 │
├-------------┼------------┼------------------┼------------------┤
│     Rank 00 │     GPU 02 │     N/A  103.63  │  103.44  103.71  │
│             │     GPU 03 │  107.86     N/A  │  108.52  108.92  │
├-------------┼------------┼------------------┼------------------┤
│     Rank 01 │     GPU 02 │  106.24  106.01  │     N/A  106.28  │
│             │     GPU 03 │  105.08  106.18  │  106.10     N/A  │
└-------------┴------------┴------------------┴------------------┘

Submission Checklist


Copilot AI left a comment


Pull request overview

This PR updates the intra-pod all-to-all preset to execute all A2A groups within a pod concurrently (single RunTransfers per pod) and enhances the output to show explicit A2A group membership and a rank-grouped display layout.

Changes:

  • Build transfers for all groups in a pod and execute them together to enable concurrent cross-group traffic (see the sketch after this list).
  • Print A2A group membership (rank/GPU) in the output.
  • Reformat the per-group bandwidth table to group columns by MPI rank (with multiple GPUs per rank column).
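The structural change, as a minimal self-contained C++ sketch: only podTransfers, groupTransferBase, and the one-RunTransfers-per-pod idea come from this PR; the Transfer struct, BuildGroupTransfers, and the RunTransfers body are hypothetical stand-ins for TransferBench's real types and helpers.

#include <cstdio>
#include <vector>

// Hypothetical stand-in for the real transfer descriptor.
struct Transfer { int srcGpu, dstGpu, group; };

// Append one transfer per ordered (src, dst) pair in the group (all-to-all).
void BuildGroupTransfers(int group, int groupSize, std::vector<Transfer>& out) {
  for (int s = 0; s < groupSize; s++)
    for (int d = 0; d < groupSize; d++)
      if (s != d) out.push_back({s, d, group});
}

// The real executor launches every transfer in the batch concurrently;
// this stub only reports the batch size.
void RunTransfers(std::vector<Transfer> const& transfers) {
  std::printf("Executing %zu transfers in one batch\n", transfers.size());
}

int main() {
  int const numGroups = 2, groupSize = 4;  // matches the 2-groups-of-4 run above
  std::vector<Transfer> podTransfers;
  std::vector<int> groupTransferBase(numGroups);

  for (int group = 0; group < numGroups; group++) {
    // Record where this group's transfers begin so per-group results can be
    // sliced back out of the combined run.
    groupTransferBase[group] = (int)podTransfers.size();
    BuildGroupTransfers(group, groupSize, podTransfers);
  }

  // Previously: one RunTransfers call per group, so groups executed serially.
  // Now: a single call per pod, so cross-group traffic overlaps.
  RunTransfers(podTransfers);

  for (int group = 0; group < numGroups; group++)
    std::printf("Group %d transfers start at index %d\n",
                group, groupTransferBase[group]);
  return 0;
}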


Comment on lines +182 to +184

    // Record where this group's transfers begin within the pod-wide list,
    // then reset the group's (src, dst) -> transfer-index map.
    groupTransferBase[group] = (int)podTransfers.size();
    groupReIndexes[group].assign(groupSize, std::vector<int>(groupSize, -1));
    std::vector<std::vector<int>>& groupReIndex = groupReIndexes[group];

Comment on lines +274 to +281

    // Sort group members by (rank, GPU index) so membership prints in a
    // stable, human-readable order.
    std::vector<int> order(groupSize);
    for (int i = 0; i < groupSize; i++) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
      MemDevice const& da = devices[groupBase + a];
      MemDevice const& db = devices[groupBase + b];
      if (da.memRank != db.memRank) return da.memRank < db.memRank;
      return da.memIndex < db.memIndex;
    });

    // Print each member as R<rank>:G<gpu>, comma-separated.
    for (size_t si = 0; si < ord.size(); si++) {
      MemDevice const& d = devices[gb + ord[si]];
      Utils::Print("%s R%d:G%d", si ? "," : "", d.memRank, d.memIndex);
    }
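For context, here is a self-contained version of the quoted sort-and-print logic; it is a sketch, not the PR's actual code: MemDevice is reduced to the two fields the sort uses, and Utils::Print is swapped for printf. Run, it reproduces the membership-line format from the test output ("A2A group 0: R0:G0, R0:G1, R1:G0, R1:G1").

#include <algorithm>
#include <cstdio>
#include <vector>

// Reduced stand-in for the real MemDevice.
struct MemDevice { int memRank; int memIndex; };

int main() {
  // Group 0 from the stride-2 run (GPUs 0 and 1 on ranks 0 and 1),
  // deliberately shuffled to show the sort at work.
  std::vector<MemDevice> devices = {{1, 1}, {0, 0}, {1, 0}, {0, 1}};
  int const groupSize = (int)devices.size();

  // Sort indices by (rank, GPU index), as in the quoted snippet.
  std::vector<int> order(groupSize);
  for (int i = 0; i < groupSize; i++) order[i] = i;
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    MemDevice const& da = devices[a];
    MemDevice const& db = devices[b];
    if (da.memRank != db.memRank) return da.memRank < db.memRank;
    return da.memIndex < db.memIndex;
  });

  // Print each member as R<rank>:G<gpu>, comma-separated.
  std::printf("A2A group 0:");
  for (size_t si = 0; si < order.size(); si++) {
    MemDevice const& d = devices[order[si]];
    std::printf("%s R%d:G%d", si ? "," : "", d.memRank, d.memIndex);
  }
  std::printf("\n");
  return 0;
}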
