
feat(ops): add RmsNorm with Iluvatar, NVIDIA, CPU backends and fp16/bf16 support #6

Open
zhangyue207 wants to merge 5 commits into feat/dev-infra from feat/dev-rmsnorm-cuda

Conversation

@zhangyue207

  • Add 'RmsNorm' operator with 'CPU', 'NVIDIA', and 'Iluvatar' implementations
  • Support fp32/fp16/bf16 on NVIDIA and Iluvatar; fp32 only on CPU
  • Add shared CUDA kernel (kernel.cuh) and backend-specific wrappers
  • Extend generate_wrappers.py and CMake for RmsNorm
  • Add tests covering backends and dtypes
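For reference, the operator semantics described above can be sketched in pure Python. This is only our reading of the RmsNorm math (normalize by the root mean square over the last dimension, then scale by the weight), not the PR's actual CPU/CUDA implementation:

```python
import math

def rms_norm(x, w, epsilon=1e-6):
    """RMSNorm over a flat vector: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    x and w are lists of floats of equal length; a sketch of the math only,
    not the kernel code added in this PR.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + epsilon)
    return [v / rms * g for v, g in zip(x, w)]

# With unit weights, RMSNorm only rescales x; relative magnitudes are kept.
y = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The fp16/bf16 paths in the PR would additionally accumulate the mean of squares in a higher-precision compute type, which this sketch glosses over.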

@zhangyue207 zhangyue207 changed the base branch from master to feat/dev-infra March 2, 2026 02:52
@zhangyue207 zhangyue207 force-pushed the feat/dev-rmsnorm-cuda branch from ea03f0f to 10187f4 Compare March 2, 2026 03:22
@zhangyue207 zhangyue207 changed the title feat(ops): add RmsNorm with Iluvatar, NVIDIA, CPU backends and fp16/bf16 support feat(ops): add 'RmsNorm' with Iluvatar, NVIDIA, CPU backends and fp16/bf16 support Mar 2, 2026
@zhangyue207 zhangyue207 changed the title feat(ops): add 'RmsNorm' with Iluvatar, NVIDIA, CPU backends and fp16/bf16 support feat(ops): add RmsNorm with Iluvatar, NVIDIA, CPU backends and fp16/bf16 support Mar 2, 2026
@zhangyue207
Author

zhangyue207 commented Mar 2, 2026

Iluvatar

root@iluvatar:/workspace/InfiniOps# pytest
==================================== test session starts ====================================
platform linux -- Python 3.10.18, pytest-9.0.2, pluggy-1.6.0
rootdir: /workspace/InfiniOps
configfile: pyproject.toml
plugins: anyio-4.9.0, cov-7.0.0, xdist-3.8.0, typeguard-4.4.4
collected 572 items                                                                         

tests/test_add.py ....................................                                [  6%]
tests/test_gemm.py .................................................................. [ 17%]
..................................................................................... [ 32%]
..................................................................................... [ 47%]
..................................................................................... [ 62%]
..................................................................................... [ 77%]
..................................................................................... [ 92%]
.........                                                                             [ 93%]
tests/test_rms_norm.py ....................................                           [100%]

==================================== 572 passed in 1.61s ====================================

@zhangyue207
Author

Nvidia

(python3.10) zhangyue@server:~/InfiniOps$ pytest
========================================== test session starts ==========================================
platform linux -- Python 3.10.19, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/zhangyue/InfiniOps
configfile: pyproject.toml
plugins: xdist-3.8.0, cov-7.0.0
collected 572 items                                                                                     

tests/test_add.py ....................................                                            [  6%]
tests/test_gemm.py .............................................................................. [ 19%]
................................................................................................. [ 36%]
................................................................................................. [ 53%]
................................................................................................. [ 70%]
................................................................................................. [ 87%]
..................................                                                                [ 93%]
tests/test_rms_norm.py ....................................                                       [100%]

========================================== 572 passed in 2.20s ==========================================


# NVIDIA and Iluvatar are parallel backends; only one GPU backend at a time.
if(WITH_NVIDIA AND WITH_ILUVATAR)
message(FATAL_ERROR "WITH_NVIDIA and WITH_ILUVATAR cannot both be ON. Build one GPU backend at a time.")
Collaborator

Use Markdown syntax: "`WITH_NVIDIA` and `WITH_ILUVATAR` cannot both be `ON`. Build one GPU backend at a time."

find_package(CUDAToolkit REQUIRED)
endif()

# Iluvatar: CUDA-compatible device, uses clang++ with -x ivcore (not nvcc).
Collaborator

Use Markdown syntax:

# Iluvatar: CUDA-compatible device, uses `clang++` with `-x ivcore` (not `nvcc`).
# Reference: `InfiniCore` `xmake/iluvatar.lua`.

if(NOT WITH_NVIDIA)
enable_language(CUDA)
find_package(CUDAToolkit REQUIRED)
set(ILUVATAR_ARCH "ivcore20" CACHE STRING "Iluvatar GPU architecture")
Collaborator

I haven't done much development on Iluvatar, but I recall that Add and Gemm compiled fine without these back then. Could you briefly explain what these two blocks of code do and why they were introduced?

from tests.utils import Payload, empty_strided, randn_strided


def _rms_norm(x, w, out, *, epsilon=1e-6):
Collaborator

`_rms_norm` and `_torch_rms_norm` are private, so they should be placed later in the file, i.e. after `test_rms_norm`.

return out


def _torch_rms_norm(x, w, out, *, epsilon=1e-6):
Collaborator

You can use `torch.nn.functional.rms_norm` directly.


} // namespace

template <unsigned int BLOCK_SIZE, typename Tcompute, typename Tdata,

struct NvidiaBackend {
using stream_t = cudaStream_t;

static constexpr auto setDevice = [](int) {};
Collaborator

Why is this empty here?

Collaborator

The naming and ordering in this file should be changed to align with PyTorch, as discussed earlier.

CUDA_STANDARD_REQUIRED ON)
endif()

# Iluvatar: CUDA-compatible device; -x ivcore and flags from top-level CMakeLists.txt
Collaborator

This comment line can be removed.

target_link_libraries(infiniops PUBLIC CUDA::cudart CUDA::cublas CUDA::cuda_driver)

set_target_properties(infiniops PROPERTIES CUDA_STANDARD 17
CUDA_STANDARD_REQUIRED ON)
Collaborator

Align `CUDA_STANDARD_REQUIRED` with `CUDA_STANDARD`, though admittedly we don't have a fixed standard for this formatting yet.
