178 changes: 178 additions & 0 deletions tests/pytorch/triton_kernels/test_cast_mxfp4.py
@@ -0,0 +1,178 @@
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
Collaborator:
You will need to add this pytest to our CI script (somewhere near run_default_fa 1 triton_kernels/test_norms.py), otherwise it won't be tested.
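For example (assuming the new test is invoked the same way as the existing ones; exact placement in the CI script is an assumption), something like:
run_default_fa 1 triton_kernels/test_cast_mxfp4.py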

# License for AMD contributions = MIT. See LICENSE for more information

import math
import pytest
import torch
import numpy as np
import os

os.environ["USE_TRITON_FUSED_CAST_TRANSPOSE"] = "1"
Collaborator:
We have already defined the env var NVTE_USE_CAST_TRANSPOSE_TRITON.
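A minimal sketch of reusing that existing flag here instead of introducing a new one (assuming it gates the same Triton cast path):

os.environ["NVTE_USE_CAST_TRANSPOSE_TRITON"] = "1"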


from transformer_engine.pytorch.tensor.mxfp4_tensor import MXFP4Quantizer, MXFP4_BLOCK_SCALING_SIZE
from transformer_engine.pytorch.triton_kernels.cast import te_quantize_triton
from test_common import te_compare_results, fill_uniform


def mxfp4_quantize_cpu(input_tensor, axis='row'):
    """CPU reference for MXFP4 quantization matching the Triton kernel behavior (FP4 data in linear layout, scales padded)."""
    original_shape = input_tensor.shape
    if input_tensor.dim() > 2:
        input_tensor = input_tensor.view(-1, input_tensor.shape[-1])

    M, N = input_tensor.shape

    if axis == 'col':
        input_tensor = input_tensor.t().contiguous()
        M, N = N, M

    data = input_tensor.cpu().float().numpy()

    BLOCK_SIZE = 32
    assert N % BLOCK_SIZE == 0, f"N={N} must be divisible by {BLOCK_SIZE}"

    num_blocks = N // BLOCK_SIZE

    # E2M1 FP4 lookup table
    fp4_values = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    # Reshape to blocks: [M, num_blocks, BLOCK_SIZE]
    data_blocks = data.reshape(M, num_blocks, BLOCK_SIZE)
    amax_blocks = np.max(np.abs(data_blocks), axis=2)

    # Triton's amax rounding: (amax + 0x200000) & 0xFF800000
    amax_int = amax_blocks.astype(np.float32).view(np.uint32)
    amax_int = ((amax_int + 0x200000) & 0xFF800000).astype(np.uint32)
    amax_rounded = amax_int.view(np.float32)

    # E8M0 scale computation: floor(log2(amax)) - 2 + 127
    scale_unbiased = np.floor(np.log2(np.maximum(amax_rounded, 1e-45))) - 2
    scale_unbiased = np.clip(scale_unbiased, -127, 127)
    scales = (scale_unbiased + 127).astype(np.uint8)
    scales = np.where(amax_blocks == 0, 0, scales)
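    # Worked example of the two steps above (illustrative, not from the source):
    #   amax = 3.0 (0x40400000) -> +0x200000 -> & 0xFF800000 -> 0x40000000 = 2.0
    #   amax = 3.5 (0x40600000) -> +0x200000 -> & 0xFF800000 -> 0x40800000 = 4.0
    # i.e. amax is snapped to a power of two (round-up threshold at 1.75x), and the
    # biased E8M0 scale is then floor(log2(amax_rounded)) - 2 + 127.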

    # Scale values for quantization
    scale_vals = np.where(scales[:, :, None] > 0,
                          2.0 ** (-(scales[:, :, None] - 127)),
                          1.0)

    scaled_blocks = data_blocks * scale_vals

    # Quantize to FP4
    signs = (scaled_blocks < 0).astype(np.uint8)
    abs_vals = np.abs(scaled_blocks)
    diffs = np.abs(abs_vals[:, :, :, None] - fp4_values[None, None, None, :])
    indices = np.argmin(diffs, axis=3).astype(np.uint8)
    fp4_encoded = (signs << 3) | indices

    fp4_flat = fp4_encoded.reshape(M, N)

    # Pack: (odd_col << 4) | even_col
    fp4_even = fp4_flat[:, 0::2]
    fp4_odd = fp4_flat[:, 1::2]
    fp4_packed = ((fp4_odd << 4) | fp4_even).astype(np.uint8)

    def cdiv(a, b): return (a + b - 1) // b

    scale_M_pad = cdiv(M, 256) * 256
    scale_N_pad = cdiv(num_blocks, 8) * 8
    scales_padded = np.full((scale_M_pad, scale_N_pad), 127, dtype=np.uint8)

    # Copy scales directly (no data shuffle support in Triton kernel)
    scales_padded[:M, :num_blocks] = scales

    fp4_packed_torch = torch.from_numpy(fp4_packed).to(input_tensor.device)
    scales_torch = torch.from_numpy(scales_padded).to(input_tensor.device)

    return fp4_packed_torch, scales_torch


@pytest.mark.parametrize("shape", [
    (128, 128),
    (256, 256),
    (256, 1024),
    (2048, 6144),
    (16384, 128),
    (32768, 160),
    (4096, 1632),
    (8, 32, 1024),
    (16, 8, 4, 512),
Collaborator:
Can we add some prime numbers like

{1, 3221}, // Prime 456
{2333, 1}, // Prime 345
{1481, 677}}; // Primes 234, 123
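In this file's parametrize style that would look something like (sizes taken from the quoted snippet):

    (1, 3221),
    (2333, 1),
    (1481, 677),

Note that the CPU reference above asserts N % 32 == 0, so prime-sized reduction dims would also need padding support in the reference and the kernel.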

])
@pytest.mark.parametrize("in_dtype", [torch.float32, torch.bfloat16])
@pytest.mark.parametrize(("rowwise", "columnwise"), [
    (True, True),
    (False, True),
    (True, False)
])
@pytest.mark.parametrize("shuffle_B_matrix", [False, True])
def test_quantize_mxfp4(shape, in_dtype, rowwise, columnwise, shuffle_B_matrix):
"""Test MXFP4 quantization for rowwise/columnwise modes with/without FP4 shuffle.

    Note: FP4 data shuffle (shuffle_B_matrix_for_aiter) is not yet supported in the Triton kernel.
Collaborator:
If the FP4 data shuffle is not yet supported in the Triton kernel, why do we need to add it here?

Collaborator (author):
This is kept to ensure API consistency between the Triton kernel and the upcoming HIP kernel, for which I'll create a separate PR. In the HIP kernel we were able to fuse the shuffle.

Collaborator (author):
HIP vs Triton flow:

HIP path:
Input: BF16 [M, N]
  -> MXFP4Quantizer.update_quantized()
  -> tex.cast_transpose_mxfp4_fused_shuffle()  [single HIP kernel]
       ├─→ Rowwise FP4 [M, K/2] (MFMA shuffled)
       ├─→ Rowwise Scale [M_pad, K/32_pad] (shuffled)
       ├─→ Colwise FP4 [N, M/2] (MFMA shuffled)
       └─→ Colwise Scale [N_pad, M/32_pad] (shuffled)
  -> AITER gemm_a4w4 (zero-copy)

Triton path:
Input: BF16 [M, N]
  -> MXFP4Quantizer.update_quantized()
  -> te_cast_transpose_mxfp4_triton()  [Triton JIT kernel]
       ├─→ Rowwise FP4 [M, K/2] (linear layout)
       ├─→ Rowwise Scale [M_pad, K/32_pad] (shuffled)
       ├─→ Colwise FP4 [N, M/2] (linear layout)
       └─→ Colwise Scale [N_pad, M/32_pad] (shuffled)
  -> aiter.ops.shuffle.shuffle_weight()  [external call]: FP4 data -> MFMA layout
  -> AITER gemm_a4w4
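A hypothetical sketch of the Triton path in code (the TE names are taken from this PR; the AITER call shapes and signatures are assumptions, not the actual API):

import torch
from transformer_engine.pytorch.tensor.mxfp4_tensor import MXFP4Quantizer
from transformer_engine.pytorch.triton_kernels.cast import te_quantize_triton

x = torch.randn(2048, 6144, dtype=torch.bfloat16, device="cuda")
quantizer = MXFP4Quantizer(rowwise=True, columnwise=True,
                           shuffle_B_matrix_for_aiter=False)  # shuffle not fused in the Triton kernel
out = quantizer.make_empty(x.shape, dtype=x.dtype)
q = te_quantize_triton(x, quantizer=quantizer, output=out)     # linear-layout FP4 + padded scales

# External shuffle to the MFMA layout before the AITER GEMM (hypothetical call shapes):
# from aiter.ops.shuffle import shuffle_weight
# b_mfma = shuffle_weight(q._columnwise_data)                  # [N, M/2] linear -> MFMA tiles
# y = gemm_a4w4(a_fp4, b_mfma, a_scales, b_scales)             # hypothetical signature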

"""
if shuffle_B_matrix:
pytest.skip("FP4 data shuffle not yet supported in Triton kernel")

input_tensor = fill_uniform(shape, dtype=in_dtype)

quantizer = MXFP4Quantizer(
rowwise=rowwise,
columnwise=columnwise,
shuffle_B_matrix_for_aiter=shuffle_B_matrix
)
out = quantizer.make_empty(input_tensor.shape, dtype=in_dtype)
quantized_out = te_quantize_triton(input_tensor, quantizer=quantizer, output=out)

# Tolerance: allow 1 nibble diff for rare edge cases near FP4 boundaries
data_atol = 20.0 if in_dtype != torch.float32 else 16.0
scale_atol = 2.0 if in_dtype != torch.float32 else 1.0
Collaborator (on lines +127 to +128):
The data tolerance seems quite large. You can follow our MXFP8 scale and data adjustment scheme:

void adjust_ref_for_e8m0_scale_error(const std::string &name,
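A rough Python analogue of that idea as it might apply to this test (names and logic are an assumption, not the actual TE helper): wherever the kernel's E8M0 scale differs from the CPU reference by exactly one step, re-quantize that reference block with the kernel's scale instead of widening atol.

    def adjust_ref_for_e8m0_scale_error(ref_scales, out_scales, data_blocks, quantize_block):
        # Boundary cases: exponents that disagree by exactly one E8M0 step.
        diff = out_scales.astype(np.int32) - ref_scales.astype(np.int32)
        off_by_one = np.abs(diff) == 1
        fixed_scales = np.where(off_by_one, out_scales, ref_scales)
        # quantize_block(block_values, e8m0_scale) -> packed FP4 for one block
        # (hypothetical helper reusing the E2M1 snapping logic from mxfp4_quantize_cpu).
        fixed_data = np.stack([
            np.stack([quantize_block(data_blocks[m, b], fixed_scales[m, b])
                      for b in range(data_blocks.shape[1])])
            for m in range(data_blocks.shape[0])
        ])
        return fixed_data, fixed_scales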


    if rowwise:
        ref_data, ref_scale = mxfp4_quantize_cpu(input_tensor, axis='row')
        M = math.prod(input_tensor.shape[:-1])
        K = input_tensor.shape[-1]
        num_blocks = K // MXFP4_BLOCK_SCALING_SIZE

        te_compare_results(
            quantized_out._rowwise_data.view(torch.uint8),
            ref_data,
            atol=data_atol,
            rtol=0.0,
            msg="rowwise FP4 data mismatch",
            use_torch_semantics=True
        )

        # Compare only valid (non-padded) region - no shuffle extraction needed
Collaborator:
What is the FP4 shuffle?

Collaborator (author):
The FP4 shuffle rearranges the [M, K/2] linear layout into the MFMA instruction layout (16×16).

The current training flow, when the TE MXFP4 quantization kernel is used, is as follows:
TE Triton kernel → linear FP4 [N, K/2] → aiter.ops.shuffle_weight() → MFMA FP4 → aiter.gemm_a4w4()

You can find the shuffle code in aiter/aiter/ops/shuffle.py

        te_compare_results(
            quantized_out._rowwise_scale.view(torch.uint8)[:M, :num_blocks],
            ref_scale[:M, :num_blocks],
            atol=scale_atol,
            rtol=0.0,
            msg="rowwise E8M0 scales mismatch",
            use_torch_semantics=True
        )

    if columnwise:
        ref_data, ref_scale = mxfp4_quantize_cpu(input_tensor, axis='col')
        M = math.prod(input_tensor.shape[:-1])
        K = input_tensor.shape[-1]
        num_blocks = M // MXFP4_BLOCK_SCALING_SIZE

        te_compare_results(
            quantized_out._columnwise_data.view(torch.uint8),
            ref_data,
            atol=data_atol,
            rtol=0.0,
            msg="columnwise FP4 data mismatch",
            use_torch_semantics=True
        )

        # Compare only valid (non-padded) region - no shuffle extraction needed
        te_compare_results(
            quantized_out._columnwise_scale.view(torch.uint8)[:K, :num_blocks],
            ref_scale[:K, :num_blocks],
            atol=scale_atol,
            rtol=0.0,
            msg="columnwise E8M0 scales mismatch",
            use_torch_semantics=True
        )
3 changes: 2 additions & 1 deletion transformer_engine/common/util/pybind_helper.h
@@ -108,7 +108,8 @@
.value("kFloat16", transformer_engine::DType::kFloat16) \
.value("kBFloat16", transformer_engine::DType::kBFloat16) \
.value("kFloat8E4M3", transformer_engine::DType::kFloat8E4M3) \
.value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2); \
.value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2) \
.value("kFloat4E2M1", transformer_engine::DType::kFloat4E2M1); \
Collaborator:
If we are going to enable kFloat4E2M1, there are other related changes needed; see https://github.com/search?q=repo%3AROCm%2FTransformerEngine%20kFloat4E2M1&type=code for more details.

pybind11::enum_<NVTE_Bias_Type>(m, "NVTE_Bias_Type", pybind11::module_local()) \
.value("NVTE_NO_BIAS", NVTE_Bias_Type::NVTE_NO_BIAS) \
.value("NVTE_PRE_SCALE_BIAS", NVTE_Bias_Type::NVTE_PRE_SCALE_BIAS) \