Skip to content

CI: build relocatable packages via TheRock SDK#260

Closed
thananon wants to merge 8 commits intodevelopfrom
users/thananon/therock-packages-workflow
Closed

CI: build relocatable packages via TheRock SDK#260
thananon wants to merge 8 commits intodevelopfrom
users/thananon/therock-packages-workflow

Conversation

@thananon
Copy link
Copy Markdown
Contributor

Summary

  • Adds .github/workflows/build-relocatable-packages.yml and build_packages_local.sh to build relocatable DEB/RPM/TGZ packages of TransferBench against the TheRock nightly ROCm SDK.
  • Modeled on the ROCmValidationSuite packaging workflow. Packages install to /opt/rocm/extras-<MAJOR> with \$ORIGIN-relative RPATH.
  • Adds a new BUILD_RELOCATABLE_PACKAGE CMake option that switches packaging away from rocm_create_package and uses amdrocm<MAJOR>-transferbench naming. Default cmake .. behavior is unchanged.
  • Includes .github/workflows/README_BUILD_PACKAGES.md documenting triggers, local usage, S3 path layout, IAM trust policy, and apt/yum install snippets.

This PR opens primarily to exercise the new workflow in CI. The build-ubuntu and build-manylinux jobs should both succeed; S3 upload is gated on AWS_S3_BUCKET repo variable being set, so it will skip cleanly until that is configured.

Test plan

  • build-ubuntu job builds DEB + TGZ and uploads them as artifacts
  • build-manylinux job builds RPM + TGZ and uploads them as artifacts
  • release-summary job produces a build-report artifact
  • Locally: sudo -E ./build_packages_local.sh produces relocatable packages

🤖 Generated with Claude Code

Adds a CI workflow that builds DEB/RPM/TGZ packages of TransferBench
against the TheRock nightly ROCm SDK, modeled on the
ROCmValidationSuite packaging workflow. Packages install to
/opt/rocm/extras-<MAJOR> with $ORIGIN-relative RPATH so they are
relocatable.

- build_packages_local.sh: single source of truth for both local and
  CI builds. Detects Ubuntu vs AlmaLinux/manylinux, installs deps,
  fetches TheRock SDK tarball, configures CMake with relocatable
  RPATH and the new BUILD_RELOCATABLE_PACKAGE option, builds, and
  invokes CPack for DEB/RPM/TGZ.
- .github/workflows/build-relocatable-packages.yml: parallel
  Ubuntu 22.04 + manylinux_2_28 jobs triggered on push, PR, daily
  cron, and workflow_dispatch. OIDC-based S3 upload gated on
  AWS_S3_BUCKET being set; apt/yum repo metadata generated for
  non-PR builds. Build report artifact summarizes S3 paths.
- .github/workflows/README_BUILD_PACKAGES.md: workflow docs covering
  triggers, local usage, S3 layout, IAM trust policy, and apt/yum
  install snippets.
- CMakeLists.txt: new BUILD_RELOCATABLE_PACKAGE option that bypasses
  rocm_install/rocm_create_package, names the package
  amdrocm<MAJOR>-transferbench, and honors caller-set install prefix
  and CPACK_*_PACKAGE_RELEASE env vars. Default cmake .. behavior is
  unchanged.

Co-Authored-By: Claude Opus 4 <[email protected]>
@thananon thananon requested review from a team as code owners April 23, 2026 15:51
The tarballs are published as:
  therock-dist-linux-<family>-<version>.tar.gz

not <family>-<version>.tar.gz as I had it. Also TheRock does not
publish a per-family LATEST.txt, so the auto-fetch path now lists
the bucket via S3 ListObjectsV2 and picks the highest version key
with `sort -V`. Updates the pinned fallback to a version that
actually exists on the bucket today (7.13.0a20260423).

Co-Authored-By: Claude Opus 4 <[email protected]>
@nileshnegi nileshnegi requested a review from Copilot April 23, 2026 15:57
The bucket includes non-release ad-hoc builds (e.g.
therock-dist-linux-gfx94X-dcgpu-ADHOCBUILD-7.0.0rc20250625.tar.gz)
which `sort -V | tail -1` was selecting because 'A' lexically
sorts after digits. The downstream `printf '%02d'` then crashed
trying to parse `ADHOCBUILD-7` as the ROCm major.

Restrict the auto-fetch grep to keys matching
  <prefix><MAJOR>.<MINOR>.<patch+suffix>.tar.gz

so only properly versioned releases are considered.

Co-Authored-By: Claude Opus 4 <[email protected]>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CI + local tooling to build relocatable TransferBench packages (DEB/RPM/TGZ) against the TheRock nightly ROCm SDK, including a new CMake switch to use direct CPack-based packaging and new GitHub Actions workflow/docs.

Changes:

  • Add build_packages_local.sh to fetch a TheRock SDK tarball, configure/build, and emit DEB/RPM/TGZ via CPack.
  • Add BUILD_RELOCATABLE_PACKAGE CMake option to bypass rocm_create_package and produce amdrocm<MAJOR>-transferbench packages.
  • Add GitHub Actions workflow + README to build/upload artifacts (and optionally publish to S3 with repo metadata).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File Description
build_packages_local.sh New local/CI build script that installs deps, fetches TheRock SDK, builds, and runs CPack.
CMakeLists.txt Adds relocatable packaging mode with direct CPack configuration and new option flag.
.github/workflows/build-relocatable-packages.yml New workflow to build packages on Ubuntu + manylinux and optionally upload to S3.
.github/workflows/README_BUILD_PACKAGES.md Documentation for triggers, local usage, and S3 repo layout/consumption.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/build-relocatable-packages.yml
Comment thread .github/workflows/build-relocatable-packages.yml Outdated
Comment thread build_packages_local.sh
Comment thread build_packages_local.sh Outdated
Comment thread CMakeLists.txt
Comment thread CMakeLists.txt Outdated
Comment thread build_packages_local.sh Outdated
Comment thread build_packages_local.sh Outdated
Comment thread build_packages_local.sh
thananon and others added 4 commits April 23, 2026 12:10
The <barrier> header is C++20-only and missing from manylinux_2_28's
libstdc++ (gcc 8.5 baseline), breaking the relocatable RPM build.
No std::barrier / std::latch usage exists in the codebase, so the
include is dead.

Co-Authored-By: Claude Opus 4 <[email protected]>
- Gate S3 upload + repo metadata steps on github.repository ==
  'ROCm/TransferBench' so forks that set AWS_S3_BUCKET don't try
  unauthenticated uploads.
- Add EUID root check to build_packages_local.sh up front (clearer
  error than failing later in apt/dnf).
- Rename TAROBALL_BASE -> TARBALL_BASE (typo).
- Sanitize PKG_RELEASE: collapse non-alphanumerics into dots so
  feature-branch names stay valid in DEB/RPM release fields.
- Drop ${ROCM_PATH}/lib from RPATH_LIST so the ephemeral SDK
  download path is not embedded into the packaged binary.
- Make CPACK_RPM_EXCLUDE_FROM_AUTO_FILELIST_ADDITION track the
  actual CPACK_PACKAGING_INSTALL_PREFIX instead of hard-coded
  /opt/rocm/extras-N paths.

Co-Authored-By: Claude Opus 4 <[email protected]>
manylinux_2_28's libstdc++ (gcc 8.5) doesn't transitively pull
<iomanip>, <numeric>, or <limits> the way newer libstdc++ does.
Add direct includes where the symbols are used:

- src/client/Utilities.hpp: <iomanip> for std::setprecision/setfill/setw
- src/client/Presets/NicPeerToPeer.hpp: <numeric> for std::iota
- src/client/Presets/AllToAll{,N}.hpp: <limits> for std::numeric_limits

Modern toolchains compile fine without these because <iostream>,
<vector>, etc. happen to drag them in; older standard libraries
don't. Fixes the relocatable RPM build under manylinux_2_28.

Co-Authored-By: Claude Opus 4 <[email protected]>
manylinux_2_28's gcc 8.x baseline ships std::filesystem in a
separate library libstdc++fs that must be linked explicitly.
On gcc 9+ libstdc++fs exists as a backward-compat stub, so
linking it unconditionally when found is harmless on newer
toolchains.

Resolves the link-time failures:
  ld.lld: error: undefined symbol: std::filesystem::status(...)
  ld.lld: error: undefined symbol: std::filesystem::canonical(...)

Co-Authored-By: Claude Opus 4 <[email protected]>
@nileshnegi
Copy link
Copy Markdown
Collaborator

do you mind moving the header file changes to a different PR?

find_library(stdc++fs) doesn't search gcc-private libdirs like
/usr/lib/gcc/x86_64-redhat-linux/8/, where AlmaLinux 8 / manylinux_2_28
ships libstdc++fs.a. The compiler-driven `-lstdc++fs` does. On gcc 9+
toolchains this resolves to a no-op stub archive — harmless.

Verified locally by reproducing the manylinux_2_28 build inside
quay.io/pypa/manylinux_2_28_x86_64 (see repro_manylinux.sh).

Co-Authored-By: Claude Opus 4 <[email protected]>
@thananon
Copy link
Copy Markdown
Contributor Author

manylinux_2_28 build fixes — summary

Getting the Build (manylinux_2_28) job to pass against TheRock SDK required four code changes, all stacked on this branch. The container is quay.io/pypa/manylinux_2_28_x86_64 (AlmaLinux 8 base, gcc 8.5, glibc 2.28) — the tooling is much older than what Ubuntu 22.04 ships, so a few things that "just work" elsewhere don't here.

1. Remove unused #include <barrier> (cf95f88)

  • Symptom: fatal error: 'barrier' file not found
  • Why: <barrier> is C++20 and requires libstdc++ ≥ 10. gcc 8 doesn't ship it.
  • Fix: verified std::barrier was never actually used in the code, then dropped the include from src/header/TransferBench.hpp.

2. Add missing transitive #includes (b3d5f92)

  • Symptom: 'iomanip' / 'numeric' / 'limits' undeclared (e.g. std::setprecision, std::iota, std::numeric_limits).
  • Why: older libstdc++ doesn't pull these in via other standard headers; newer libstdc++ does. Code was relying on transitive includes that don't exist on gcc 8.
  • Fix: explicit includes in Utilities.hpp, NicPeerToPeer.hpp, AllToAll.hpp, AllToAllN.hpp.

3. Link libstdc++fs for std::filesystem (53cfb559fabaaf)

  • Symptom: ld.lld: error: undefined symbol: std::filesystem::status / canonical / __cxx11::path::_M_split_cmpts.
  • Why: gcc < 9 ships std::filesystem in a separate libstdc++fs.a archive. On AlmaLinux 8 this lives in a gcc-private libdir (/usr/lib/gcc/x86_64-redhat-linux/8/libstdc++fs.a) — not on the standard linker search path.
  • First attempt (53cfb55): find_library(STDCXXFS_LIBRARY stdc++fs) + conditional link. Did not work — CMake's find_library doesn't search gcc-internal directories, so the variable came back empty and -lstdc++fs was never emitted.
  • Final fix (9fabaaf): drop find_library entirely and pass bare target_link_libraries(... stdc++fs). The compiler driver knows where its own private libdir is. On gcc 9+ toolchains libstdc++fs.a is shipped as a no-op stub for compatibility — harmless to link.

4. Reproduction loop

  • Added repro_manylinux.sh that spawns quay.io/pypa/manylinux_2_28_x86_64 as a long-lived detached container, mounts the repo, and drives build_packages_local.sh via docker exec. State (apt deps, ROCm SDK tarball, build dir) survives across iterations so the link error can be poked at without re-paying setup cost each time. This is what verified fix Adding support for NUMA nodes without CPUs #3 worked before pushing.

Verification

  • Local: build inside the manylinux container produces amdrocm7-transferbench-1.66.02-Linux.rpm (1 MB) + .tar.gz (3.6 MB) + binary.
  • CI: run 24852754410 — all 3 jobs (Ubuntu 22.04, manylinux_2_28, Build Report) green.

Note on infra flakiness

While iterating, the manylinux job intermittently failed with System.IO.IOException: No space left on device on the GitHub-hosted runner — pure runner-infrastructure ENOSPC, not a code issue. Auto-retried via gh run rerun --failed; eventually landed on a clean runner.

@thananon
Copy link
Copy Markdown
Contributor Author

Closing this PR. Will create 2 separate PRs for this.

@thananon
Copy link
Copy Markdown
Contributor Author

Closing in favor of two narrower PRs:

  • PR A — Fix portability for AlmaLinux 8 / manylinux_2_28 (gcc 8) #261 — portability fixes for AlmaLinux 8 / manylinux_2_28 (gcc 8). Independently useful: unblocks anyone building TransferBench on RHEL/AlmaLinux/manylinux toolchains regardless of CI changes.
  • PR B (to be opened after A merges) — the GitHub Actions workflow + build_packages_local.sh for building relocatable DEB/RPM/TGZ packages against TheRock SDK and publishing to S3.

No work is being thrown away — the same commits are being re-landed on the two new branches in cleaner form. The source branch users/thananon/therock-packages-workflow is being kept around as the reference for the cherry-picks into PR B and as a recovery point.

@thananon thananon closed this Apr 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants