
Modification to Pod A2A #259

Open

AtlantaPepsi wants to merge 1 commit into ROCm:candidate from AtlantaPepsi:poda2a_mod

Conversation

@AtlantaPepsi
Contributor

Motivation

  • Make all groups in a pod run all-to-all in parallel; previously, the groups executed serially
  • Modify the output to reflect A2A group members

Technical Details

Test Plan

Tested on 8 devices (2 nodes with 4 GPUs each, launched via MPI) in a single pod; the multi-pod scenario has not yet been investigated

Test Result

Example output from a 2-node x 4-GPU cluster (in the output below, R<i>:G<j> denotes GPU j on MPI rank i)

Default (1 Group of size 8 in natural order)

[AllToAll Related]
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            1 : Reordering devices by taking 1 steps
GROUP_SIZE           =            8 : Dividing all devices into groups of 8 for a2a

GPU-GFX IntraPod All-To-All benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [1 Read(s) 1 Write(s)] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
A2A group 0: R0:G0, R0:G1, R0:G2, R0:G3, R1:G0, R1:G1, R1:G2, R1:G3

--- Pod AllToAll Group 0 ---
┌-------------┬------------┬------------------------------------┬------------------------------------┐
│ SRC+EXE\DST │            │ Rank 00                            │ Rank 01                            │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│             │ Mem Device │  GPU 00   GPU 01   GPU 02   GPU 03 │  GPU 00   GPU 01   GPU 02   GPU 03 │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│     Rank 00 │     GPU 00 │     N/A    74.21    74.43    87.73 │   87.64   102.58   102.74   103.02 │
│             │     GPU 01 │   83.21      N/A    82.99    83.27 │   83.29   103.38   103.55   103.92 │
│             │     GPU 02 │   96.29    95.45      N/A    95.50 │   95.73    95.96    96.31    96.59 │
│             │     GPU 03 │   87.71    87.75    87.83      N/A │   88.18   111.02   111.10   111.25 │
├-------------┼------------┼------------------------------------┼------------------------------------┤
│     Rank 01 │     GPU 00 │   90.73    89.91    91.06    89.90 │     N/A    89.77    89.74    91.05 │
│             │     GPU 01 │   83.53    83.73    83.59    83.69 │  102.92      N/A   103.05   103.21 │
│             │     GPU 02 │   83.82    83.67    83.92    84.07 │  103.17   103.28      N/A   103.53 │
│             │     GPU 03 │   77.40    77.71    90.59    90.64 │   90.56    93.93    94.02      N/A │
└-------------┴------------┴------------------------------------┴------------------------------------┘

2 Groups with Stride 2

[AllToAll Related]
A2A_LOCAL            =            0 : Exclude local transfers
A2A_MODE             =            0 : Copy
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            2 : Reordering devices by taking 2 steps
GROUP_SIZE           =            4 : Dividing all devices into groups of 4 for a2a

GPU-GFX IntraPod All-To-All benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [1 Read(s) 1 Write(s)] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
A2A group 0: R0:G0, R0:G1, R1:G0, R1:G1
A2A group 1: R0:G2, R0:G3, R1:G2, R1:G3

--- Pod AllToAll Group 0 ---
┌-------------┬------------┬------------------┬------------------┐
│ SRC+EXE\DST │            │ Rank 00          │ Rank 01          │
├-------------┼------------┼------------------┼------------------┤
│             │ Mem Device │  GPU 00   GPU 01 │  GPU 00   GPU 01 │
├-------------┼------------┼------------------┼------------------┤
│     Rank 00 │     GPU 00 │     N/A  104.90  │  104.78  105.07  │
│             │     GPU 01 │  104.01     N/A  │  103.89  104.01  │
├-------------┼------------┼------------------┼------------------┤
│     Rank 01 │     GPU 00 │  104.33  104.55  │     N/A  104.64  │
│             │     GPU 01 │  104.42  104.54  │  103.98     N/A  │
└-------------┴------------┴------------------┴------------------┘

--- Pod AllToAll Group 1 ---
┌-------------┬------------┬------------------┬------------------┐
│ SRC+EXE\DST │            │ Rank 00          │ Rank 01          │
├-------------┼------------┼------------------┼------------------┤
│             │ Mem Device │  GPU 02   GPU 03 │  GPU 02   GPU 03 │
├-------------┼------------┼------------------┼------------------┤
│     Rank 00 │     GPU 02 │     N/A  103.63  │  103.44  103.71  │
│             │     GPU 03 │  107.86     N/A  │  108.52  108.92  │
├-------------┼------------┼------------------┼------------------┤
│     Rank 01 │     GPU 02 │  106.24  106.01  │     N/A  106.28  │
│             │     GPU 03 │  105.08  106.18  │  106.10     N/A  │
└-------------┴------------┴------------------┴------------------┘

Submission Checklist


Copilot AI left a comment


Pull request overview

This PR updates the intra-pod all-to-all preset to execute all A2A groups within a pod concurrently (single RunTransfers per pod) and enhances the output to show explicit A2A group membership and a rank-grouped display layout.

Changes:

  • Build transfers for all groups in a pod and execute them together to enable concurrent cross-group traffic (see the sketch after this list).
  • Print A2A group membership (rank/GPU) in the output.
  • Reformat the per-group bandwidth table to group columns by MPI rank (with multiple GPUs per rank column).
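The structural change, as a minimal self-contained C++ sketch: only podTransfers, groupTransferBase, and the one-RunTransfers-per-pod idea come from this PR; the Transfer struct, BuildGroupTransfers, and the RunTransfers body are hypothetical stand-ins for TransferBench's real types and helpers.

#include <cstdio>
#include <vector>

// Hypothetical stand-in for the real transfer descriptor.
struct Transfer { int srcGpu, dstGpu, group; };

// Append one transfer per ordered (src, dst) pair in the group (all-to-all).
void BuildGroupTransfers(int group, int groupSize, std::vector<Transfer>& out) {
  for (int s = 0; s < groupSize; s++)
    for (int d = 0; d < groupSize; d++)
      if (s != d) out.push_back({s, d, group});
}

// The real executor launches every transfer in the batch concurrently;
// this stub only reports the batch size.
void RunTransfers(std::vector<Transfer> const& transfers) {
  std::printf("Executing %zu transfers in one batch\n", transfers.size());
}

int main() {
  int const numGroups = 2, groupSize = 4;  // matches the 2-groups-of-4 run above
  std::vector<Transfer> podTransfers;
  std::vector<int> groupTransferBase(numGroups);

  for (int group = 0; group < numGroups; group++) {
    // Record where this group's transfers begin so per-group results can be
    // sliced back out of the combined run.
    groupTransferBase[group] = (int)podTransfers.size();
    BuildGroupTransfers(group, groupSize, podTransfers);
  }

  // Previously: one RunTransfers call per group, so groups executed serially.
  // Now: a single call per pod, so cross-group traffic overlaps.
  RunTransfers(podTransfers);

  for (int group = 0; group < numGroups; group++)
    std::printf("Group %d transfers start at index %d\n",
                group, groupTransferBase[group]);
  return 0;
}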


Comment on lines +182 to +184

    // Record where this group's transfers begin within the pod-wide list,
    // then reset the group's (src, dst) -> transfer-index map.
    groupTransferBase[group] = (int)podTransfers.size();
    groupReIndexes[group].assign(groupSize, std::vector<int>(groupSize, -1));
    std::vector<std::vector<int>>& groupReIndex = groupReIndexes[group];

Comment on lines +274 to +281

    // Sort group members by (rank, GPU index) so membership prints in a
    // stable, human-readable order.
    std::vector<int> order(groupSize);
    for (int i = 0; i < groupSize; i++) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
      MemDevice const& da = devices[groupBase + a];
      MemDevice const& db = devices[groupBase + b];
      if (da.memRank != db.memRank) return da.memRank < db.memRank;
      return da.memIndex < db.memIndex;
    });

    // Print each member as R<rank>:G<gpu>, comma-separated.
    for (size_t si = 0; si < ord.size(); si++) {
      MemDevice const& d = devices[gb + ord[si]];
      Utils::Print("%s R%d:G%d", si ? "," : "", d.memRank, d.memIndex);
    }
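For context, here is a self-contained version of the quoted sort-and-print logic; it is a sketch, not the PR's actual code: MemDevice is reduced to the two fields the sort uses, and Utils::Print is swapped for printf. Run, it reproduces the membership-line format from the test output ("A2A group 0: R0:G0, R0:G1, R1:G0, R1:G1").

#include <algorithm>
#include <cstdio>
#include <vector>

// Reduced stand-in for the real MemDevice.
struct MemDevice { int memRank; int memIndex; };

int main() {
  // Group 0 from the stride-2 run (GPUs 0 and 1 on ranks 0 and 1),
  // deliberately shuffled to show the sort at work.
  std::vector<MemDevice> devices = {{1, 1}, {0, 0}, {1, 0}, {0, 1}};
  int const groupSize = (int)devices.size();

  // Sort indices by (rank, GPU index), as in the quoted snippet.
  std::vector<int> order(groupSize);
  for (int i = 0; i < groupSize; i++) order[i] = i;
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    MemDevice const& da = devices[a];
    MemDevice const& db = devices[b];
    if (da.memRank != db.memRank) return da.memRank < db.memRank;
    return da.memIndex < db.memIndex;
  });

  // Print each member as R<rank>:G<gpu>, comma-separated.
  std::printf("A2A group 0:");
  for (size_t si = 0; si < order.size(); si++) {
    MemDevice const& d = devices[order[si]];
    std::printf("%s R%d:G%d", si ? "," : "", d.memRank, d.memIndex);
  }
  std::printf("\n");
  return 0;
}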
