Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm 6.3 container recipe to the repo’s dockerfile/ collection, targeting the rocm/pytorch-training:v25.6 base image and layering SuperBench build/install steps plus common tooling needed for benchmarks.
Changes:
- Introduces
dockerfile/rocm6.3.x.dockerfilefor a ROCm 6.3.4 + PyTorch training base image. - Installs additional system tools (Docker client, OFED if missing, Intel MLC) and configures SSH/limits.
- Builds SuperBench third-party dependencies and installs the package with AMD worker extras.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ARG NUM_MAKE_JOBS=64 | ||
|
|
| ENV USE_HIPBLAS_COMPUTETYPE=1 | ||
| RUN python3 -m pip install --no-build-isolation .[amdworker] && \ | ||
| CXX=/opt/rocm/bin/hipcc make cppbuild && \ | ||
| make postinstall |
| # Get Ubuntu version and set as an environment variable | ||
| RUN echo "Ubuntu version: $(lsb_release -r -s)" | ||
| ARG UBUNTU_VERSION=22.04 | ||
|
|
||
| # Install OFED | ||
| ENV OFED_VERSION=24.10-1.1.4.0 | ||
| # Check if ofed_info is present and has a version | ||
| RUN if ! command -v ofed_info >/dev/null 2>&1; then \ | ||
| echo "OFED not found. Installing OFED..."; \ |
| RUN python3 -m pip install --upgrade pip wheel setuptools==65.7 && \ | ||
| python3 -c "import pkg_resources" || python3 -m pip install setuptools | ||
|
|
| RUN if ! command -v ofed_info >/dev/null 2>&1; then \ | ||
| echo "OFED not found. Installing OFED..."; \ | ||
| cd /tmp && \ | ||
| wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \ |
| # Upgrading botocore/boto3 to versions compatible with urllib3 2.x. | ||
| RUN python3 -m pip install --upgrade botocore boto3 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #790 +/- ##
==========================================
- Coverage 85.70% 85.16% -0.55%
==========================================
Files 102 103 +1
Lines 7703 8298 +595
==========================================
+ Hits 6602 7067 +465
- Misses 1101 1231 +130
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm 6.3 build target to the repository’s Docker image build pipeline, enabling CI to produce a superbench/main:rocm6.3 image variant.
Changes:
- Introduces
dockerfile/rocm6.3.x.dockerfilebased onrocm/pytorch-training:v25.6, with additional system deps, Docker CLI, OFED (conditional), and SuperBench build/install steps. - Updates
.github/workflows/build-image.ymlto build and tag the new ROCm 6.3 image on the self-hosted ROCm runner.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| dockerfile/rocm6.3.x.dockerfile | New ROCm 6.3 Dockerfile building SuperBench on top of rocm/pytorch-training:v25.6. |
| .github/workflows/build-image.yml | Adds a rocm6.3 entry to the build matrix to produce/push superbench/main:rocm6.3. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| ADD third_party third_party | ||
| # perftest_rocm6.patch skipped — upstream perftest already includes the equivalent changes | ||
| RUN make RCCL_HOME=/opt/rocm ROCBLAS_BRANCH=release-staging/rocm-rel-6.3 HIPBLASLT_BRANCH=release-staging/rocm-rel-6.3 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm -o rocm_megatron_lm |
| sed -i "s/[# ]*PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config && \ | ||
| sed -i "s/[# ]*PermitUserEnvironment no/PermitUserEnvironment yes/" /etc/ssh/sshd_config && \ | ||
| sed -i "s/[# ]*Port.*/Port 22/" /etc/ssh/sshd_config && \ | ||
| echo "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \ | ||
| echo "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf |
|
|
||
| # Install OFED | ||
| ENV OFED_VERSION=24.10-1.1.4.0 | ||
| # Check if ofed_info is present and has a version |
| ADD . . | ||
| ENV USE_HIP_DATATYPE=1 | ||
| ENV USE_HIPBLAS_COMPUTETYPE=1 | ||
| RUN python3 -m pip install --no-build-isolation .[amdworker] && \ | ||
| CXX=/opt/rocm/bin/hipcc make cppbuild && \ | ||
| make postinstall |
| RUN python3 -m pip install --upgrade pip wheel setuptools==65.7 && \ | ||
| python3 -c "import pkg_resources" || python3 -m pip install setuptools |
| ENV OFED_VERSION=24.10-1.1.4.0 | ||
| # Check if ofed_info is present and has a version | ||
| RUN if ! command -v ofed_info >/dev/null 2>&1; then \ | ||
| echo "OFED not found. Installing OFED..."; \ | ||
| cd /tmp && \ | ||
| wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \ |
There was a problem hiding this comment.
Pull request overview
Adds a new ROCm 6.3 container build path to the repo, and updates third-party build logic to support explicit ROCm GPU arch selection during rccl-tests compilation.
Changes:
- Add
dockerfile/rocm6.3.x.dockerfilebased onrocm/pytorch-training:v25.6and install additional tooling (Docker client, OFED, MLC). - Update
third_party/Makefileto optionally buildrccl-testswith explicit--offload-archflags whenAMDGPU_TARGETSis provided. - Update the GitHub Actions image build matrix to include a
rocm6.3build.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| third_party/Makefile | Adds conditional arch-aware rccl-tests build flags driven by AMDGPU_TARGETS. |
| dockerfile/rocm6.3.x.dockerfile | Introduces a new ROCm 6.3 image definition and wires in third-party builds and package installs. |
| .github/workflows/build-image.yml | Adds the new rocm6.3 image to CI build/push matrix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \ | ||
| tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \ | ||
| PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \ | ||
| rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \ | ||
| fi |
There was a problem hiding this comment.
The OFED tarball is downloaded over plain HTTP, which is vulnerable to MITM tampering during the image build. Prefer HTTPS if available and consider verifying the downloaded artifact (checksum/signature) before executing the installer.
| ENV USE_HIPBLAS_COMPUTETYPE=1 | ||
| RUN python3 -m pip install --no-build-isolation .[amdworker] && \ | ||
| CXX=/opt/rocm/bin/hipcc make cppbuild && \ | ||
| make postinstall |
There was a problem hiding this comment.
After ADD . ., this Dockerfile doesn’t remove the .git directory. In this repo the build context includes .git (see existing CUDA/ROCm Dockerfiles which rm -rf .git at the end), so leaving it inflates the final image size and can slow pushes/pulls. Consider deleting .git in the final layer after postinstall.
| make postinstall | |
| make postinstall && \ | |
| rm -rf .git |
| ifdef AMDGPU_TARGETS | ||
| cd ./rccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) HIPCUFLAGS="-std=c++14 -O3 $(ROCM_OFFLOAD_ARCH_FLAGS) -I$(ROCM_PATH)/include -I$(ROCM_PATH)/include/rccl -I$(ROCM_PATH)/include/hip -DMPI_SUPPORT -I$(MPI_HOME)/include -I$(MPI_HOME)/include/mpi" -j | ||
| else | ||
| cd ./rccl-tests && make MPI=1 MPI_HOME=$(MPI_HOME) -j | ||
| endif |
There was a problem hiding this comment.
ifdef AMDGPU_TARGETS triggers when the variable is defined even if it’s empty, which can unexpectedly switch to the HIPCUFLAGS path and potentially break the build. Consider switching the condition to a non-empty check (e.g., using ifneq ($(strip $(AMDGPU_TARGETS)),)), so the fallback path is used when it’s unset or blank.
| # Upgrading botocore/boto3 to versions compatible with urllib3 2.x. | ||
| RUN python3 -m pip install --upgrade botocore boto3 |
There was a problem hiding this comment.
This upgrades botocore/boto3 to whatever the latest versions are at build time, which makes the image non-reproducible and can break future builds unexpectedly. Consider pinning to known-good versions that are compatible with urllib3 2.x (and optionally using --no-cache-dir).
| # Upgrading botocore/boto3 to versions compatible with urllib3 2.x. | |
| RUN python3 -m pip install --upgrade botocore boto3 | |
| # Upgrade botocore/boto3 to specific versions compatible with urllib3 2.x. | |
| RUN python3 -m pip install --no-cache-dir "botocore==1.35.98" "boto3==1.35.98" |
| ARG NUM_MAKE_JOBS=64 | ||
|
|
||
| # Install Docker | ||
| ENV DOCKER_VERSION=27.5.1 | ||
| RUN cd /tmp && \ |
There was a problem hiding this comment.
NUM_MAKE_JOBS is declared as a build arg here but isn’t referenced anywhere in this Dockerfile, so the workflow’s build_args: NUM_MAKE_JOBS=... has no effect. Either remove the ARG or wire it into the build steps (e.g., pass it to make -j / cmake builds) to make parallelism configurable.
| ARG NUM_MAKE_JOBS=64 | |
| # Install Docker | |
| ENV DOCKER_VERSION=27.5.1 | |
| RUN cd /tmp && \ | |
| # Install Docker | |
| ENV DOCKER_VERSION=27.5.1 | |
| RUN cd /tmp && \ | |
| RUN cd /tmp && \ |
Description
Add ROCm6.3 dockerfile.