Proposal
MiniMax-M2.7's config.json declares MTP architecture parameters (use_mtp: true, num_mtp_modules: 3, mtp_transformer_layers: 1), but the published safetensors contain no trained MTP weights: 0 of the ~1369 tensor keys in the upstream FP8 release match the mtp.* pattern.
Are MTP weights planned for release? If yes, what is the rough timeline?
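For reference, the missing-weights check above amounts to scanning the checkpoint's tensor key names for an mtp prefix. A minimal sketch (key names below are illustrative, not taken from the actual checkpoint; in practice the keys come from model.safetensors.index.json or safe_open(...).keys()):

```python
import re

# Match keys like "mtp.0.proj.weight" or "model.mtp.1.norm.weight".
MTP_PATTERN = re.compile(r"(^|\.)mtp\.")

def count_mtp_keys(keys):
    """Return how many tensor keys belong to MTP modules."""
    return sum(1 for k in keys if MTP_PATTERN.search(k))

# Illustrative key names standing in for the real ~1369-key index:
example_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.experts.0.w1.weight",
    "lm_head.weight",
]
print(count_mtp_keys(example_keys))  # 0 -> no MTP weights present
```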
Why it matters
Without MTP, M2.7 is limited to single-token autoregressive decoding. With trained MTP heads plus speculative decoding, peer models such as Qwen3.5-397B-A17B (which ships trained MTP) reach roughly 2–3× the decode throughput of plain autoregressive decoding, making them substantially more competitive than M2.7 on self-hosted inference latency despite their larger active-parameter count.
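As a back-of-the-envelope illustration only (the 2–3× figure above is an external benchmark claim, not derived here), the usual speculative-decoding model predicts expected accepted tokens per target forward pass from the draft length k and per-token acceptance rate a, ignoring draft-head overhead:

```python
def expected_speedup(a: float, k: int) -> float:
    """Expected tokens accepted per target forward pass with k draft
    tokens and per-token acceptance rate a (draft cost ignored):
    (1 - a**(k+1)) / (1 - a), i.e. the geometric series 1 + a + ... + a**k."""
    return (1 - a ** (k + 1)) / (1 - a)

# Example: 3 draft tokens at 80% acceptance gives ~2.95x, consistent
# with the 2-3x range reported for MTP-based speculative decoding.
print(round(expected_speedup(0.8, 3), 2))  # 2.95
```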
Related
Alternative
If trained MTP heads are not planned, publishing a small distilled MiniMax-M2.7 variant (~1–3B parameters) sharing the M2.7 tokenizer would let the community train EAGLE-style draft heads independently. Currently every published MiniMax-family model (M1 / M2 / M2.1 / M2.5 / M2.7 / Text-01) is 229B+ parameters, leaving no realistic draft-model option for tokenizer-aligned speculative decoding.
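The tokenizer-alignment requirement mentioned above is strict: speculative decoding needs the draft and target models to share an identical token-to-id mapping. A minimal sketch of the check (vocab dicts here are toy stand-ins; in practice they would come from each model's tokenizer, e.g. AutoTokenizer.get_vocab()):

```python
def tokenizers_aligned(target_vocab: dict, draft_vocab: dict) -> bool:
    """True iff both models map exactly the same tokens to the same ids,
    which is what tokenizer-aligned speculative decoding requires."""
    return target_vocab == draft_vocab

# Toy vocabularies (hypothetical, for illustration):
target      = {"<s>": 0, "hello": 1, "world": 2}
draft_same  = {"<s>": 0, "hello": 1, "world": 2}
draft_other = {"<s>": 0, "world": 1, "hello": 2}  # same tokens, wrong ids

print(tokenizers_aligned(target, draft_same))   # True
print(tokenizers_aligned(target, draft_other))  # False
```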