Triton norms dispatch refactor #305
base: dev
Conversation
Pull Request Overview
This PR refactors the Triton normalization (RMSNorm and LayerNorm) implementations by creating a unified dispatch mechanism. It introduces a new te_norm_fwd_triton function that serves as a generalized entry point for both norm types, while preserving backward compatibility by maintaining the existing API functions as thin wrappers.
Key changes include:
- Created a unified `te_norm_fwd_triton` dispatch function in a new `norms.py` file
- Modified kernel signatures to support both RMSNorm and LayerNorm use cases
- Updated imports across multiple modules to reference the new consolidated location
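To make the dispatch idea above concrete, here is a minimal sketch of the pattern, assuming a `norm_type` selector; the real `te_norm_fwd_triton` in `norms.py` launches Triton kernels and almost certainly takes more arguments, so the parameter names and the plain-PyTorch stand-ins below are illustrative assumptions only:

```python
import torch

# Hypothetical sketch of the unified dispatch; the Triton kernel launches are
# stubbed out with plain PyTorch math so the dispatch structure runs standalone.
def te_norm_fwd_triton(inp, weight, eps, norm_type="rmsnorm", bias=None):
    if norm_type == "rmsnorm":
        # RMSNorm: scale by the reciprocal root-mean-square; no bias, no mean.
        rstd = torch.rsqrt(inp.pow(2).mean(dim=-1, keepdim=True) + eps)
        return inp * rstd * weight
    if norm_type == "layernorm":
        # LayerNorm: subtract the mean, normalize, then apply weight and bias.
        mean = inp.mean(dim=-1, keepdim=True)
        rstd = torch.rsqrt(inp.var(dim=-1, unbiased=False, keepdim=True) + eps)
        return (inp - mean) * rstd * weight + bias
    raise ValueError(f"Unsupported norm_type: {norm_type!r}")
```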
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| transformer_engine/pytorch/triton_kernels/norms.py | New file containing unified norm dispatch logic and relocated function implementations |
| transformer_engine/pytorch/triton_kernels/rmsnorm.py | Removed te_rmsnorm_fwd_triton function and updated kernel signature for unification |
| transformer_engine/pytorch/triton_kernels/layernorm.py | Removed forward/backward functions and simplified reduction kernel signature |
| transformer_engine/pytorch/ops/basic/rmsnorm.py | Updated import to reference new norms module |
| transformer_engine/pytorch/ops/basic/layer_norm.py | Updated import to reference new norms module |
| transformer_engine/pytorch/module/layernorm_mlp.py | Consolidated imports from new norms module |
| transformer_engine/pytorch/module/layernorm_linear.py | Consolidated imports from new norms module |
| transformer_engine/pytorch/module/_common.py | Consolidated imports from new norms module |
| tests/pytorch/triton_kernels/test_norms.py | Updated imports to reference new norms module |
Note that currently some of the layernorm tests are failing, but they're citing
I haven't seen anything like that.
Turns out something in this PR gives the layernorm kernel bad memory behavior: it mutates either the weight tensor or the bias tensor in the test. I believe this happens because they are allocated on the GPU contiguously with respect to each other (i.e. first the input array, then gamma, then bias), which leads me to suspect some kind of masking problem in the layernorm kernel, but I have not been able to pinpoint it yet. Everything seems to work on
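A hedged sketch of how that mutation can be caught in isolation follows; the import path and call signature of `te_layernorm_fwd_triton` here are assumptions for illustration and may not match the actual API or test:

```python
# Hypothetical repro sketch (not the actual test): allocate the input, gamma,
# and beta one after another (the caching allocator usually places them
# adjacently in device memory), then check gamma/beta after the forward call.
# The import path and call signature below are assumptions, not the real API.
import torch
from transformer_engine.pytorch.triton_kernels.norms import te_layernorm_fwd_triton

x = torch.randn(128, 1024, device="cuda")
gamma = torch.randn(1024, device="cuda")
beta = torch.randn(1024, device="cuda")

gamma_ref, beta_ref = gamma.clone(), beta.clone()
out = te_layernorm_fwd_triton(x, gamma, beta, 1e-5)  # assumed positional arguments

# A badly masked (out-of-bounds) store in the kernel would corrupt whatever
# tensors happen to live directly after the input in device memory.
assert torch.equal(gamma, gamma_ref), "gamma was mutated by the forward kernel"
assert torch.equal(beta, beta_ref), "beta was mutated by the forward kernel"
```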
@Micky774 Could you remind me of what we had decided on this PR?
@wenchenvincent Last time we talked about this, I think @matthiasdiener was supposed to eventually take it over. There's currently a bug exposed by this PR that I think will require a bit of work to resolve. There's some kind of memory mismanagement occurring, where output tensors' memory is being overwritten after being produced.
This was a manifestation of the aforementioned bug.
cc: @wenchenvincent @wangye805 The PR is ready for review!
LGTM. @ipanfilo You reviewed it a while ago. Do you have further comments?
```diff
@@ -1,29 +1,30 @@
-# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
```
2025-2026
Fixed
```python
import pytest
from functools import partial
from itertools import product
from torch.utils.cpp_extension import IS_HIP_EXTENSION
```
This entire file is ROCm-specific; basically we can assume IS_HIP_EXTENSION is true when running this pytest.
It was unused anyways -- fixed.
```python
# The scale_inv values may differ slightly, but will still dequantize close enough to
# pass the earlier comparisons.
compare_func = partial(te_compare_results, atol=1, rtol=0, use_torch_semantics=True)
```
For mxfp8 data and scale inv comparison, we can reuse the same logic as in the cpp gtest:

TransformerEngine/tests/cpp/test_common.cu, line 730 in 0dfee56:

```cpp
void adjust_ref_for_e8m0_scale_error(const std::string &name,
```

TransformerEngine/tests/cpp/operator/test_cast_mxfp8.cu, lines 331 to 355 in 0dfee56:

```cpp
#ifdef __HIP_PLATFORM_AMD__
  if (::testing::Test::HasFatalFailure()) return;
  adjust_ref_for_e8m0_scale_error("scales", mismatches_scales_indices, gpu_scales_ptr,
                                  ref_output_scales.get(), scales_stride, rows, cols, rowwise,
                                  ref_output_c.get(), otype);
  mismatches_scales = 0;
#endif
  const size_t mismatches_elts = 32 * mismatches_scales;
  auto [atol, rtol] = getTolerances(otype);
  compareResults("output_c", output_c, ref_output_c.get(), rowwise, atol, rtol, true, mismatches_elts);
  if (processing_method == ProcessingMethod::CAST_DBIAS
      || processing_method == ProcessingMethod::CAST_DBIAS_DACT)
  {
    auto [atol_dbias, rtol_dbias] = getTolerances(itype);
    if (itype == DType::kFloat32) {
      atol_dbias = 1e-4;
      rtol_dbias *= sqrt(static_cast<double>(rows));
    } else {
      rtol_dbias *= 4;
    }
    compareResults("output_dbias", output_dbias, ref_output_dbias.get(), true, atol_dbias, rtol_dbias);
  }
}
```
We essentially already do this implicitly by relying on the dequantization of the MXFP8Tensors before comparison. While we could handle this explicitly as in the C++ tests, I don't think that's necessary given that the dequantization behavior has its own testing, which passes. Let me know if you have other thoughts on the matter.
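A minimal sketch of this dequantize-then-compare idea, assuming the MXFP8 tensor exposes a `dequantize()` method (an assumption for illustration; the actual API and tolerances may differ):

```python
import torch

# Sketch only: compare MXFP8 results after dequantization rather than comparing
# raw scale_inv factors bit-for-bit, since slightly different e8m0 scales can
# still dequantize to values within tolerance. The .dequantize() call is assumed.
def compare_mxfp8_outputs(out_triton, out_ref, atol=0.0, rtol=1e-2):
    deq_triton = out_triton.dequantize().to(torch.float32)
    deq_ref = out_ref.dequantize().to(torch.float32)
    torch.testing.assert_close(deq_triton, deq_ref, atol=atol, rtol=rtol)
```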
```python
# The MXFP8 tensors carry their scale_inv values in a padded
# format, hence we must omit the padded values.
input_shape = out_triton.shape
unpadded_scale_inv_shape = (math.prod(input_shape[:-1]), input_shape[-1] // MXFP8_BLOCK_SCALING_SIZE)
```
Should we have different shape for row-wise and col-wise scaling?
We resolve this with re-indexing, but I've updated the variable name for a bit of extra clarity.
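For illustration, a hedged sketch of what trimming the padded scale_inv buffer to its valid region could look like; the row-wise shape mirrors the test's computation, while the column-wise branch and the simple 2-D slicing are assumptions rather than the test's actual re-indexing:

```python
import math

MXFP8_BLOCK_SCALING_SIZE = 32  # one e8m0 scale per 32 elements

# Sketch only: slice the padded scale_inv buffer down to the logically valid
# region. The column-wise convention and the 2-D padding layout are assumed.
def trim_scale_inv(scale_inv_padded, input_shape, rowwise=True):
    rows = math.prod(input_shape[:-1])
    cols = input_shape[-1]
    if rowwise:
        shape = (rows, cols // MXFP8_BLOCK_SCALING_SIZE)   # blocks along the last dim
    else:
        shape = (rows // MXFP8_BLOCK_SCALING_SIZE, cols)   # blocks along the leading dim
    return scale_inv_padded[: shape[0], : shape[1]]
```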
```diff
@@ -1,11 +1,11 @@
-# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
+# Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved.
```
Same here
Fixed
```diff
 if IS_HIP_EXTENSION:
-    from ...triton_kernels.layernorm import te_layernorm_fwd_triton, te_layernorm_bwd_triton
+    from ...triton_kernels.norms_common import te_layernorm_fwd_triton, te_layernorm_bwd_triton
 from ...fp8 import FP8GlobalStateManager
```
Do we need to import FP8GlobalStateManager and QuantizedTensor here? For both NV upstream and us
Removed extra import
```diff
 if IS_HIP_EXTENSION:
-    from ...triton_kernels.layernorm import te_layernorm_fwd_triton, te_layernorm_bwd_triton
+    from ...triton_kernels.norms_common import te_layernorm_fwd_triton, te_layernorm_bwd_triton
 from ...tensor import QuantizedTensor
```
Is it needed for this PR?
No, it was left over by accident. Removed.
Copyright date
Done, along with a few others I missed.
Description
This PR disentangles the backend Triton implementation from the front-end API, creating a unified intermediate `te_norm_fwd_triton`, which is a generalized dispatch function. The PR is fully backwards compatible, as `te_rmsnorm_fwd_triton` and `te_layernorm_fwd_triton` are preserved and implemented as thin wrappers around `te_norm_fwd_triton`. This way, when bugs appear, we fix them once without needing to duplicate the fix across norms.
Consequently, there are some changes to the imports to accommodate this restructuring. This PR also includes a minor cleanup/simplification of previously redundant behavior in the layernorm forward implementation, as well as support for `Float8CurrentScalingQuantizer`. FWIW, I don't think we can apply a similar unification to the backward passes, as it seems that -- at least for layernorm -- the backward implementations are fairly specialized and have asymmetric heuristics.
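A minimal sketch of the thin-wrapper pattern described above, paired with the hypothetical `te_norm_fwd_triton` sketch earlier on this page; the argument lists are assumptions, not the PR's actual signatures:

```python
# Hypothetical wrappers preserving the existing public API; the exact argument
# lists in norms.py may differ.
def te_rmsnorm_fwd_triton(inp, weight, eps, **kwargs):
    # Preserved entry point: forwards straight to the unified dispatcher.
    return te_norm_fwd_triton(inp, weight, eps, norm_type="rmsnorm", **kwargs)

def te_layernorm_fwd_triton(inp, weight, bias, eps, **kwargs):
    # Preserved entry point: LayerNorm additionally passes its bias through.
    return te_norm_fwd_triton(inp, weight, eps, norm_type="layernorm",
                              bias=bias, **kwargs)
```

Because existing call sites only ever see these two names, they keep working unchanged while the shared dispatch logic lives in one place.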
Fixes # (issue)