TrtLLMGen MoE Kernels in FlashInfer: What NVIDIA Just Open-Sourced and Why It Matters

On April 3, 2026, Alexander Zhurkevich (NVIDIA, @cudagdb) submitted PR #2917 to the FlashInfer repository: +4,872,992 lines added across 2,753 files. This is NVIDIA open-sourcing TrtLLMGen — the internal kernel generation system behind TensorRT-LLM’s Mixture-of-Experts inference — into a community-governed project. A follow-up PR #3034 adds Flash Multi-Head Attention (FMHA) kernels from the same codebase. Both target SM100 (Blackwell) architecture.

This is one of the most significant open-source contributions to the LLM inference ecosystem in 2026.


1. What Is TrtLLMGen?

The Name

TrtLLMGen = TensorRT LLM Generated kernels. It is NVIDIA’s internal code generation system for producing highly optimized CUDA kernels specifically for large language model inference workloads. The “Gen” is literal — the system generates CUDA source code, which is why the PR adds nearly 5 million lines: most of it is auto-generated kernel instantiations.

Where It Sits in NVIDIA’s Stack

User-facing frameworks (vLLM, SGLang, TGI, etc.)
        |
        v
TensorRT-LLM (NVIDIA's LLM inference library)
        |
        v
TrtLLMGen (kernel generation engine)  <-- THIS IS WHAT WAS OPEN-SOURCED
        |
        v
Generated CUDA kernels (batched GEMM, MoE routing, FMHA)
        |
        v
SM100 hardware (Blackwell Tensor Cores, TMA, etc.)

TensorRT-LLM is NVIDIA’s high-level inference library. Underneath it, TrtLLMGen is the engine that produces the actual GPU kernels — the innermost compute loops that determine whether you get 90% or 60% of peak hardware utilization. These kernels are what NVIDIA uses internally to win benchmarks.

The Team

The contributor list reads like an NVIDIA inference all-star roster:

  • Alexander Zhurkevich (PR author, @aleozlx) — NVIDIA kernel engineer
  • Julien Demouth — well-known NVIDIA CUDA optimization expert, frequent GTC presenter
  • Nikita Korobov (@nekorobov)
  • Louis Sugy, Jiqun Tu, David Clark, Maxim Milakov, Tian Zheng, Anthony Chang, and others

This is not a side project. This is the team that writes the kernels NVIDIA uses to set MLPerf records.


2. Why MoE Kernels Are the Bottleneck

The MoE Architecture Recap

Mixture of Experts models (Mixtral 8x7B, Mixtral 8x22B, DeepSeek-V2/V3, Grok-1, DBRX, Snowflake Arctic) replace the standard feed-forward network in each transformer block with a routing mechanism:

Input tokens
     |
  Router (small linear layer)
     |
  Top-K expert selection (typically K=2 of N=8..64 experts)
     |
  Dispatch tokens to selected experts (each expert is an FFN)
     |
  Combine expert outputs (weighted sum)
     |
Output tokens

The key property: each token activates only K of N experts. The 141B-parameter Mixtral 8x22B activates only ~39B parameters per token (2 of 8 experts). This gives you the capacity of a large model at the compute cost of a much smaller one.
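The router stage in the diagram above can be sketched in a few lines of NumPy. This is an illustrative reference, not TrtLLMGen's implementation; it follows the common Mixtral-style convention of softmaxing over the selected top-K logits, and all names and shapes are made up for the example.

```python
import numpy as np

def route_topk(tokens, router_w, k=2):
    """Top-K expert routing: logits -> pick K experts -> softmax gate weights.

    tokens:   [T, d] token activations; router_w: [d, N] router weights.
    Returns expert indices [T, k] and normalized combine weights [T, k].
    """
    logits = tokens @ router_w                            # [T, N]
    # Top-K expert indices per token (order within the K does not matter)
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]    # [T, k]
    chosen = np.take_along_axis(logits, idx, axis=-1)     # [T, k]
    # Softmax over only the chosen K logits -> per-token combine weights
    chosen = chosen - chosen.max(axis=-1, keepdims=True)
    w = np.exp(chosen)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64)).astype(np.float32)   # 16 tokens
router = rng.standard_normal((64, 8)).astype(np.float32)    # 8 experts
idx, w = route_topk(tokens, router, k=2)
print(idx.shape, w.shape)   # (16, 2) (16, 2)
```

The per-token `(idx, w)` pairs are exactly what the dispatch and combine stages consume downstream.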

Why MoE Is Hard to Serve Efficiently

The problem is that MoE creates an extremely irregular workload for GPUs:

  1. Variable-size batched GEMM: Each expert may receive a different number of tokens. Expert 0 might get 47 tokens, expert 3 might get 2, expert 7 might get 0. You need to run N parallel GEMMs of wildly different sizes.

  2. Token routing overhead: Scatter-gather operations to route tokens to their assigned experts and collect results add memory traffic.

  3. Load imbalance: If the router sends 80% of tokens to 2 of 8 experts, those experts become bottlenecks while the other 6 sit idle. Tail latency is set by the most loaded expert.

  4. Small-batch decode: During autoregressive decode, batch sizes per expert can be tiny (1-4 tokens). Standard cuBLAS GEMM is catastrophically inefficient at these sizes.

  5. Memory bandwidth wall: Expert weights do not fit in L2 cache, so each decode step must stream every active expert's weights from HBM. For Mixtral 8x22B in FP8 (one byte per parameter), the ~39B active parameters per token mean roughly 39 GB of weight reads per decode step, summed across all layers.

This is why generic GEMM libraries fail for MoE inference. You need kernels specifically designed for the access patterns, batch size distributions, and memory traffic of MoE routing.
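Problems 1 and 3 above are easy to quantify. The snippet below uses a hypothetical (invented for illustration) per-expert token distribution to show how much compute the naive "pad every expert to the largest batch" strategy wastes:

```python
import numpy as np

# Hypothetical per-expert token counts after top-2 routing of 64 tokens
# across 8 experts -- note the imbalance, including one idle expert.
tokens_per_expert = np.array([47, 31, 18, 2, 12, 9, 9, 0])

# Grouped GEMM: expert e computes a [M_e, d] @ [d, ffn] GEMM with its own M_e.
max_m = tokens_per_expert.max()

# "Pad to max M": every expert is padded up to 47 rows.
padded_work = len(tokens_per_expert) * max_m    # FLOPs proportional to rows
useful_work = tokens_per_expert.sum()
print(f"padding waste: {1 - useful_work / padded_work:.0%}")   # 66%
```

Two-thirds of the padded GEMM work is thrown away under this (deliberately skewed, but realistic-looking) routing outcome, which is why a variable-M grouped GEMM is worth building.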

The Performance Gap

Before TrtLLMGen-class kernels, the state of MoE inference was:

Approach                   Relative performance   Notes
Naive per-expert cuBLAS    1x (baseline)          Kernel launch overhead dominates at small batch
Grouped GEMM (Triton)      1.5-2x                 Better, but still not fused; poor occupancy
Padding to uniform batch   1.2x                   Wastes compute on padding tokens
TrtLLMGen fused MoE        3-5x                   Fused routing + GEMM + activation, batched dispatch
The gap is real and it matters at scale: a 3-5x kernel-level speedup flows directly into the cost per million tokens of serving a model like DeepSeek-V3.


3. Technical Deep Dive: The Kernels

3.1 Kernel Generation Architecture

TrtLLMGen is a metaprogramming system. Rather than writing kernels by hand, it generates CUDA source code from parameterized templates. This is why the PR adds 2,753 files and ~5M lines — each file is a specialized kernel instantiation for a specific combination of:

  • Precision: FP8 (E4M3/E5M2), FP16, BF16, FP4
  • Expert count: 8, 16, 32, 64, 128
  • Top-K: 1, 2, 4, 8
  • Hidden dimension: 4096, 5120, 7168, 8192, 14336, etc.
  • Activation function: SwiGLU, GELU, ReLU
  • Batch size ranges: decode (1-8), small prefill (16-128), large prefill (256-4096)
  • Schedule pattern: how work is distributed across CTAs and warps
  • Output layout: transposed vs. non-transposed

The combinatorial explosion is enormous. A single “MoE GEMM kernel” actually needs hundreds of specialized variants to cover the workload space. Hand-writing all of these is impractical. Code generation makes it manageable.
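The size of that combinatorial space is easy to see by taking the Cartesian product of the axes listed above (values taken from the list; the count is before schedule patterns and output layouts multiply it further):

```python
from itertools import product

# Variant axes from the list above (illustrative enumeration only).
precisions  = ["fp8_e4m3", "fp8_e5m2", "fp16", "bf16", "fp4"]
experts     = [8, 16, 32, 64, 128]
top_k       = [1, 2, 4, 8]
hidden      = [4096, 5120, 7168, 8192, 14336]
activations = ["swiglu", "gelu", "relu"]
batch_bins  = ["decode", "small_prefill", "large_prefill"]

variants = list(product(precisions, experts, top_k,
                        hidden, activations, batch_bins))
print(len(variants))   # 4500 combinations, before schedules and layouts
```

Not every combination is meaningful in practice, but even a pruned space runs into the thousands of kernel instantiations, which is what makes the PR's file count plausible.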

3.2 SM100 (Blackwell) Specific Optimizations

The kernels target SM100 architecture. This matters because Blackwell introduces several features that these kernels exploit:

Tensor Memory Accelerator (TMA) 2.0

Blackwell’s TMA is a dedicated DMA engine on each SM that can perform multi-dimensional tensor copies between global memory and shared memory without consuming CUDA core cycles. TrtLLMGen kernels use TMA for:

  • Asynchronous expert weight loading (prefetch next expert while computing current)
  • Strided token gathering during MoE dispatch (gather non-contiguous tokens into contiguous shared memory tiles)
  • Output scattering (write results back to correct token positions)

On Hopper (SM90), TMA already existed but with limitations. Blackwell’s TMA adds wider transfer widths and better support for irregular access patterns — exactly what MoE routing needs.

5th Generation Tensor Cores

Blackwell Tensor Cores support:

  • FP4 (E2M1) x FP4 accumulate to FP16: 2x throughput vs FP8
  • FP8 (E4M3) x FP8 accumulate to FP32: the workhorse precision for MoE inference
  • Larger matrix tiles: 128x256x64 for FP8, vs 64x256x32 on Hopper

The generated kernels select tile sizes that map exactly to the hardware’s native matrix dimensions, avoiding partial-tile waste.
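Why tile selection matters is clearest at decode batch sizes. A quick utilization calculation (tile shapes here are illustrative; 128x256 is the FP8 output tile mentioned above, 8x256 is a hypothetical skinny alternative):

```python
import math

def tile_utilization(m, n, tile_m, tile_n):
    """Fraction of tensor-core tile FLOPs doing useful work for an m x n output."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return (m * n) / (tiles * tile_m * tile_n)

# A decode-sized expert GEMM: M = 4 tokens, N = 14336 FFN columns.
print(f"{tile_utilization(4, 14336, 128, 256):.1%}")   # 3.1% with 128x256 tiles
print(f"{tile_utilization(4, 14336, 8, 256):.1%}")     # 50.0% with 8x256 tiles
```

A single tile shape cannot serve both decode and large prefill well, which is another reason the generator emits per-batch-range variants rather than one kernel.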

Tensor Memory (TMEM)

SM100 introduces a new memory level: Tensor Memory, which is a register file extension specifically for tensor core operands. TrtLLMGen kernels manage TMEM allocation explicitly to keep operands resident across multiple matrix multiply-accumulate (MMA) iterations, reducing register spilling.

The kernel naming conventions visible in the PR reveal these features:

trtllmgen_moe_gemm_sm100_fp8_swiGlu_e8k2_h14336_dynbatch_tout_sched3

  • sm100    — target architecture (Blackwell)
  • fp8      — precision
  • swiGlu   — fused activation
  • e8k2     — 8 experts, top-K = 2
  • h14336   — hidden dimension
  • dynbatch — dynamic batching scheme
  • tout     — transposed output layout
  • sched3   — schedule variant

3.3 Batched GEMM for MoE: The Core Innovation

The fundamental operation in MoE is a batched GEMM where each batch element (expert) has a different M dimension (number of tokens assigned to it). This is called a “variable-batch” or “grouped” GEMM.

Standard approaches:

  1. Loop over experts: Launch N separate GEMM kernels. Terrible — kernel launch overhead dominates for small M.
  2. Pad to max M: Waste compute proportional to load imbalance.
  3. Concatenate and track offsets: Better, but shared memory management is complex.

TrtLLMGen’s approach (inferred from kernel signatures and NVIDIA’s published MLPerf submissions):

Fused MoE Kernel: A single kernel launch handles the entire MoE layer:

  1. Routing phase: Compute top-K expert assignments per token (small linear layer + softmax + top-K selection)
  2. Dispatch phase: Scatter tokens to expert-specific tiles in shared memory using TMA
  3. Compute phase: Run the expert FFN (GEMM -> activation -> GEMM) using Tensor Cores, with each CTA block assigned to one or more experts
  4. Combine phase: Gather and weighted-sum expert outputs back to token positions

All four phases are fused into a single kernel. No intermediate writes to global memory between routing and compute.

Dynamic load balancing: The “dynbatch” in kernel names refers to a scheme where CTAs dynamically pull work from a shared token queue rather than being statically assigned to experts. This handles load imbalance at runtime — if expert 3 has 200 tokens and expert 7 has 5, more CTAs will service expert 3.

SwiGLU fusion: Modern MoE models (DeepSeek, Mixtral) use SwiGLU activation in experts. TrtLLMGen fuses the gate projection, up projection, SwiGLU activation, and down projection into a single kernel, eliminating three intermediate tensor materializations.
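As a correctness reference for what the fused kernel computes, here is a deliberately unfused NumPy sketch of dispatch, SwiGLU expert FFN, and combine (routing outputs are taken as given; every name, shape, and toy size is invented for the example, and the fused kernel would do all of this without materializing any intermediate tensors):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_ffn_reference(x, idx, gate_w, Wg, Wu, Wd):
    """Unfused reference: dispatch -> SwiGLU expert FFN -> weighted combine.

    x:       [T, d] tokens
    idx:     [T, k] expert assignments; gate_w: [T, k] combine weights
    Wg, Wu:  [N, d, f] gate/up projections; Wd: [N, f, d] down projection
    """
    out = np.zeros_like(x)
    for e in range(Wg.shape[0]):
        # Dispatch: gather every (token, slot) pair routed to expert e
        t, slot = np.nonzero(idx == e)
        if t.size == 0:
            continue                       # expert received no tokens
        xe = x[t]                          # [M_e, d]; M_e varies per expert
        # Expert FFN with SwiGLU: silu(x Wg) * (x Wu), then down-project
        h = silu(xe @ Wg[e]) * (xe @ Wu[e])
        ye = h @ Wd[e]                     # [M_e, d]
        # Combine: weighted scatter-add back to original token positions
        np.add.at(out, t, gate_w[t, slot][:, None] * ye)
    return out

rng = np.random.default_rng(0)
T, d, f, N, k = 8, 16, 32, 4, 2
x = rng.standard_normal((T, d))
idx = rng.integers(0, N, size=(T, k))     # toy routing decisions
gate_w = np.full((T, k), 0.5)             # toy combine weights (sum to 1)
Wg, Wu = rng.standard_normal((N, d, f)), rng.standard_normal((N, d, f))
Wd = rng.standard_normal((N, f, d))
y = moe_ffn_reference(x, idx, gate_w, Wg, Wu, Wd)
print(y.shape)   # (8, 16)
```

The per-expert loop and the intermediate `h` and `ye` tensors are exactly what the fused kernel eliminates.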

3.4 PTX-Level Control

The PR includes cuda_ptx.h, a large generated header that was initially tracked via Git LFS. This file contains inline PTX (Parallel Thread Execution) assembly wrappers for:

  • cp.async.bulk.tensor: TMA copy instructions for multi-dimensional async copies
  • wgmma: Warpgroup-level matrix multiply-accumulate (the Tensor Core instruction on SM90+)
  • fence.proxy.tensormap: Memory fence operations for TMA descriptor updates
  • setmaxnreg: Dynamic register allocation control
  • elect.sync: Single-thread election within a warp (for leader-thread patterns)
  • tmem operations: Tensor Memory load/store/allocation

This level of PTX control is how NVIDIA’s own kernel engineers squeeze the last 10-20% of performance. Compiler-generated code from NVCC/CUDA C++ often cannot make optimal decisions about register allocation, instruction scheduling, and memory fence placement for these complex multi-stage kernels. Hand-written (or carefully generated) PTX is the only way to get deterministic scheduling.

3.5 cuda_ptx.h: The Generated Instruction Layer

The fact that cuda_ptx.h was generated (and massive enough for Git LFS) tells us TrtLLMGen includes an instruction-level code generator. This is likely a Python or C++ program that:

  1. Takes a kernel specification (precision, tile sizes, pipeline depth, expert count)
  2. Generates the PTX inline assembly wrappers needed for that configuration
  3. Generates the CUDA kernel source that uses those wrappers
  4. Compiles and benchmarks the result

This is the same methodology NVIDIA uses for cuBLAS and CUTLASS — generate thousands of kernel variants, benchmark all of them, and select the fastest for each problem size. The difference is TrtLLMGen is specifically optimized for the access patterns of LLM inference rather than general-purpose GEMM.
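The benchmark-and-select step of that loop can be sketched in plain Python. This is a toy stand-in for the methodology, not NVIDIA's harness: the "variants" are just tile sizes, `fake_run` simulates work proportional to the padded M dimension, and all names are hypothetical.

```python
import time

def autotune(candidates, run, problem, iters=20):
    """Benchmark every generated variant on one problem; keep the fastest.

    candidates: list of variant configs; run(cfg, problem) executes one call.
    Returns (best_cfg, best_seconds_per_call).
    """
    best_cfg, best_t = None, float("inf")
    for cfg in candidates:
        run(cfg, problem)                         # warm-up call
        t0 = time.perf_counter()
        for _ in range(iters):
            run(cfg, problem)
        t = (time.perf_counter() - t0) / iters
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

def fake_run(cfg, problem):
    # Pad M up to the tile size; do work proportional to the padded size.
    m = problem["m"]
    padded = -(-m // cfg["tile_m"]) * cfg["tile_m"]
    _ = sum(range(padded * 1000))

cfgs = [{"tile_m": t} for t in (8, 32, 128)]
best, _ = autotune(cfgs, fake_run, {"m": 4})
print(best)   # {'tile_m': 8} -- the smallest tile wins at decode-sized M
```

The real system does this per problem-size bucket, which is how it ends up shipping hundreds of "fastest for their niche" variants rather than one compromise kernel.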


4. Why Open-Source This into FlashInfer?

What Is FlashInfer?

FlashInfer is a community-driven library of GPU kernels for LLM serving, started by researchers at CMU and the University of Washington. It provides:

  • PagedAttention kernels (pioneered by vLLM)
  • FlashAttention variants optimized for serving (variable-length, append, etc.)
  • Prefill and decode attention kernels
  • RoPE (rotary position embedding) kernels
  • Sampling kernels (top-K, top-P, speculative decoding verification)

FlashInfer is used by several major serving frameworks: SGLang, vLLM (partially), and others. It has become the de facto community kernel library for LLM serving.

NVIDIA’s Strategic Calculus

NVIDIA open-sourcing their benchmark-winning kernels into a community project is surprising. Several forces likely drove this:

1. Framework fragmentation is bad for NVIDIA

The LLM serving ecosystem is fragmented: vLLM, SGLang, TGI, TensorRT-LLM, DeepSpeed, LMDeploy, etc. Each reimplements kernels at varying quality levels. This means most deployments run suboptimal kernels, which means they need more GPUs, which sounds good for NVIDIA’s revenue but actually:

  • Makes customers consider AMD/Intel/custom silicon alternatives (if NVIDIA software is hard to use optimally)
  • Slows ecosystem growth (high inference costs limit deployment)
  • Creates support burden (NVIDIA has to help each framework separately)

By putting the best kernels in a neutral community library, NVIDIA ensures all frameworks can access peak performance on NVIDIA hardware. This raises the baseline and makes NVIDIA GPUs harder to beat.

2. Blackwell adoption acceleration

SM100-specific kernels in FlashInfer mean that any framework using FlashInfer automatically gets Blackwell-optimized MoE inference. This accelerates Blackwell adoption without requiring each framework to write their own SM100 kernels (which would take months).

3. MoE is becoming the default architecture

DeepSeek-V3, Mixtral, Grok-1, and likely GPT-5 use MoE. If MoE inference is slow, customers delay adoption of MoE models, which delays GPU purchases. Fast MoE kernels available everywhere accelerates the whole market.

4. Competitive response to open-source momentum

FlashInfer, Triton, ThunderKittens, and other open-source kernel projects have shown that the community can produce competitive kernels. By contributing their best work, NVIDIA positions themselves as a contributor rather than a competitor to the open-source ecosystem. This is the “if you can’t beat them, lead them” strategy.

5. Apache 2.0 licensing

The permissive Apache 2.0 license means any company can use these kernels in commercial products. This is a deliberate choice — NVIDIA wants maximum adoption on NVIDIA hardware.


5. Connection to Benchmarks

MLPerf Inference

NVIDIA has dominated MLPerf Inference benchmarks for years. The TrtLLMGen kernels are what power their submissions. From Zhurkevich’s tweet: these are “the fastest prefill and decode kernels for our target workloads” — written specifically to win MLPerf and InferenceX.

MLPerf submission details (public, from MLPerf Inference v4.1, 2024-2025):

  • NVIDIA’s LLM submissions use TensorRT-LLM with custom kernels
  • MoE submissions (Mixtral) showed 2-3x throughput advantage over community implementations
  • The “secret sauce” was always the kernel layer, which was closed-source — until now

InferenceX

InferenceX is a newer benchmark focused on real-world LLM serving workloads (mixed prefill/decode, variable sequence lengths, production-like traffic patterns). TrtLLMGen kernels were written with these workloads in mind, hence the “dynbatch” (dynamic batching) variants.

What This Means for Benchmark Parity

With TrtLLMGen kernels in FlashInfer:

  • SGLang and other frameworks using FlashInfer can now match TensorRT-LLM’s kernel performance
  • The playing field shifts from “who has the fastest kernels” to “who has the best scheduling, memory management, and system-level optimization”
  • MLPerf submissions from non-NVIDIA entrants (AMD, Intel) now have a clear performance target to match

6. The FMHA Follow-Up: PR #3034

Two weeks after the MoE kernels, PR #3034 (“trtllm fmha oss kernels”) adds Flash Multi-Head Attention kernels from the same TrtLLMGen codebase.

What FMHA Adds

FMHA (Fused Multi-Head Attention) handles the attention mechanism:

Q * K^T / sqrt(d) -> softmax -> * V
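Written out as an unfused NumPy reference (the fused FMHA kernels compute the same function without ever materializing the score matrix; shapes here are illustrative):

```python
import numpy as np

def attention_reference(q, k, v):
    """Unfused attention: softmax(Q K^T / sqrt(d)) V, applied per head."""
    d = q.shape[-1]
    s = q @ k.swapaxes(-1, -2) / np.sqrt(d)    # [heads, Tq, Tk] scores
    s = s - s.max(axis=-1, keepdims=True)      # numerical stability
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)         # row-wise softmax
    return p @ v                               # [heads, Tq, d]

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))    # [heads, Tq, head_dim]
k = rng.standard_normal((2, 16, 8))   # [heads, Tk, head_dim], e.g. a KV cache
v = rng.standard_normal((2, 16, 8))
print(attention_reference(q, k, v).shape)   # (2, 4, 8)
```

Decode FMHA is the Tq = 1 special case of this against a paged KV cache, which is why it gets its own kernel family.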

The TrtLLMGen FMHA kernels likely include:

  • Prefill FMHA: Long-context attention for prompt processing (competing with FlashAttention-3)
  • Decode FMHA: Single-token attention against KV cache (competing with PagedAttention)
  • GQA/MQA support: Grouped-Query and Multi-Query Attention variants used in Llama, Mistral, etc.
  • Speculative decoding verification: Batch verification of speculated tokens
  • SM100 optimizations: TMA-based KV cache access, TMEM for Q/K/V operands

Why FMHA Matters Separately

MoE kernels handle the FFN layers. FMHA handles the attention layers. Together, they cover the two most compute-intensive parts of transformer inference. Having both open-sourced means FlashInfer now has NVIDIA-quality kernels for the entire forward pass of an MoE transformer.


7. Impact on the Inference Ecosystem

Short-Term (0-6 months)

  1. SGLang gets a massive upgrade: SGLang already uses FlashInfer. Once these kernels are integrated, SGLang on Blackwell will match TensorRT-LLM’s MoE performance.

  2. vLLM integration pressure: vLLM has been building its own kernel stack. With TrtLLMGen kernels in FlashInfer, there is strong pressure to adopt FlashInfer more deeply or fall behind on MoE performance.

  3. DeepSeek-V3 serving gets cheaper: The model that broke cost records on paper can now be served at the kernel efficiency that NVIDIA achieves internally, not just what open-source frameworks managed.

  4. Blackwell becomes the obvious MoE platform: With optimized kernels freely available, the cost-performance gap between Blackwell and Hopper for MoE workloads widens further in Blackwell’s favor.

Medium-Term (6-18 months)

  1. Kernel competition shifts to AMD and Intel: With NVIDIA’s best kernels now public, the pressure shifts to AMD (ROCm/hipBLAS) and Intel (oneAPI) to produce competitive MoE kernels for their hardware. The performance target is no longer hidden.

  2. Consolidation around FlashInfer: If FlashInfer has both NVIDIA’s best kernels and community-contributed AMD kernels, it becomes the universal kernel abstraction layer. Serving frameworks can focus on scheduling and system design rather than kernel engineering.

  3. Kernel generation as a technique spreads: TrtLLMGen’s approach — generate thousands of specialized kernel variants from templates — will likely be adopted by other projects. Expect Triton-based code generators, ThunderKittens generators, etc.

The Bigger Picture

NVIDIA is making a calculated bet: the value in the inference stack is moving up from kernels to systems (scheduling, memory management, multi-node orchestration, speculative decoding strategies). By commoditizing the kernel layer, NVIDIA:

  • Ensures their hardware always runs at peak efficiency regardless of which framework is used
  • Shifts competitive differentiation to areas where they have structural advantages (NVLink, NVSwitch, Grace-Blackwell integration)
  • Makes it harder for competitors to differentiate on software alone

This is analogous to Intel open-sourcing their compiler optimizations — you give away the layer that makes your hardware fast, because that makes your hardware the obvious choice.


8. Technical Significance Summary

Aspect               Detail
Scale                ~5M lines, 2,753 files — largest single contribution to FlashInfer
Architecture         SM100 (Blackwell) targeting with TMA, TMEM, 5th gen Tensor Cores
Kernels              Fused MoE GEMM (routing + dispatch + compute + combine in one kernel)
Precisions           FP4, FP8, BF16, FP16 with per-expert dynamic batching
Activations          SwiGLU, GELU, ReLU fused into GEMM kernels
Generation           Auto-generated kernel variants covering the combinatorial space
PTX control          Hand-tuned inline PTX for TMA, wgmma, TMEM, fence operations
License              Apache 2.0 (fully permissive commercial use)
Follow-up            FMHA kernels (PR #3034) cover the attention half of the forward pass
Benchmark pedigree   These kernels power NVIDIA's MLPerf and InferenceX submissions

9. Key Takeaways

  1. NVIDIA just gave away the crown jewels of their inference kernel stack. The TrtLLMGen MoE kernels are not “reference implementations” — they are the actual production kernels that set benchmark records.

  2. MoE serving performance is about to equalize across frameworks. The kernel advantage that TensorRT-LLM had over vLLM/SGLang for MoE models is largely eliminated.

  3. The code generation approach matters as much as the kernels themselves. The fact that these are generated kernels means the methodology can be extended to new hardware (SM120?), new precisions, and new model architectures.

  4. Blackwell is the target. SM100-specific optimizations mean these kernels will not run on Hopper or Ampere. This is both a technical decision (SM100 has the features these kernels need) and a strategic one (drives Blackwell adoption).

  5. FlashInfer is becoming the Linux of LLM kernels. With NVIDIA contributing their best work to a community project, FlashInfer is positioning itself as the neutral ground where the best kernels from all contributors coexist. This is healthy for the ecosystem.

  6. The inference cost curve just got steeper. Better kernels + Blackwell hardware + open availability = faster cost-per-token decline. MoE models that were marginally economical to serve are now clearly viable.


See also: how TrtLLMGen fits in the inference stack, SpectralQuant KV compression