The 2026 Inference Optimization Stack: How SpectralQuant, TrtLLMGen, Thunder Kittens, and Blackwell Fit Together
Four distinct pieces of inference optimization research have landed in close succession. Individually, each is significant. Together, they describe a coherent optimization stack whose layers compound multiplicatively. This note maps the interactions, quantifies the combined effect, and identifies the remaining bottlenecks.
1. The Four-Layer Stack
The 2026 inference optimization landscape organizes into four interacting layers:
┌─────────────────────────────────────────────────────────┐
│ SYSTEM LAYER FlashInfer / vLLM / SGLang │
│ Scheduling, batching, memory management, PagedAttention │
├─────────────────────────────────────────────────────────┤
│ COMPRESSION SpectralQuant + TurboQuant │
│ KV cache: spectral rotation → selective quant → 5.95x │
├─────────────────────────────────────────────────────────┤
│ KERNEL LAYER TrtLLMGen + Thunder Kittens │
│ Fused MoE GEMM, register-tiled FMHA, PTX-level control │
├─────────────────────────────────────────────────────────┤
│ HARDWARE Blackwell SM100 │
│ TMA 2.0, TMEM, 5th-gen Tensor Cores, 8 TB/s HBM3e │
└─────────────────────────────────────────────────────────┘
The stack reads bottom-up: hardware sets the physics, kernels exploit the hardware, compression reduces the data volume kernels must move, and systems compose everything into a serving engine.
2. Hardware Layer: Blackwell SM100
Blackwell’s SM100 architecture introduces several features directly relevant to the inference stack above it:
TMA 2.0 (Tensor Memory Accelerator). The second-generation TMA provides asynchronous, hardware-accelerated bulk data movement between HBM and shared memory. For inference, the critical path is KV cache loads during decode. TMA 2.0 reduces the software overhead of orchestrating these transfers, freeing warp schedulers to focus on compute. TrtLLMGen’s FMHA kernels are explicitly designed around TMA-based KV cache access patterns.
TMEM (Tensor Memory). SM100 exposes a new memory hierarchy level between registers and shared memory, dedicated to feeding the Tensor Cores. This is the hardware manifestation of what Thunder Kittens achieves in software on Hopper: keeping operands close to the compute units. On SM100, TMEM management replaces the manual register-tiling that Thunder Kittens pioneered.
5th-generation Tensor Cores. Higher throughput for FP4/FP8/INT4 matrix operations. When SpectralQuant compresses KV cache entries to low-bit representations, the dequantization and attention computation can run at higher throughput on these Tensor Cores.
8 TB/s HBM3e bandwidth. Blackwell B200’s aggregate memory bandwidth is ~8 TB/s across 192 GB of HBM3e. This is the denominator in the memory-bandwidth-bound decode equation. Every optimization in the layers above effectively multiplies this number: SpectralQuant’s 5.95x compression turns 8 TB/s of effective KV cache throughput into ~47.6 TB/s equivalent.
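The "effective bandwidth" arithmetic is just a multiplication; a minimal sketch (the helper name is ours, numbers from this section):

```python
def effective_bandwidth_tb_s(hbm_tb_s: float, kv_compression: float) -> float:
    """KV compression multiplies the bandwidth a decode step effectively sees,
    because each attention read moves proportionally fewer bytes."""
    return hbm_tb_s * kv_compression

# B200 HBM3e at ~8 TB/s with SpectralQuant's 5.95x compression
print(effective_bandwidth_tb_s(8.0, 5.95))  # → 47.6
```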
3. Kernel Layer: TrtLLMGen and Thunder Kittens
TrtLLMGen: Production-Grade MoE Inference Kernels
NVIDIA open-sourced the TRT-LLM generation kernels into FlashInfer, providing fused MoE GEMM and FMHA (fused multi-head attention) implementations targeting both Hopper and SM100. The key design choices:
- Fused MoE GEMM. For MoE models like DeepSeek-V3 (256 experts, top-8 routing), the critical bottleneck is the expert GEMM dispatch. TrtLLMGen fuses the gate computation, expert selection, and batched GEMM into a single kernel launch, eliminating the kernel launch overhead and intermediate memory traffic that naive implementations incur. On DeepSeek-V3’s scale, this means dispatching to 8 of 256 experts without 8 separate GEMM kernel launches per token.
- SM100 TMA-based KV cache access. The FMHA kernels use TMA 2.0 to prefetch KV cache pages asynchronously. During decode, each token’s attention computation requires reading potentially millions of cached KV pairs. TMA 2.0 allows the kernel to issue these reads ahead of the compute pipeline, hiding memory latency.
- PTX-level control. TrtLLMGen drops to PTX (Parallel Thread Execution) assembly for critical inner loops, bypassing NVCC’s instruction scheduling to achieve near-peak throughput on Tensor Core MMA (Matrix Multiply-Accumulate) instructions.
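The gate-then-grouped-GEMM pipeline that a fused kernel performs in one launch can be sketched in numpy (shapes, seeds, and names are illustrative, not TrtLLMGen's API; a real kernel does the bucketing and batched matmuls on-chip):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 16, 64, 256, 8

x = rng.standard_normal((n_tokens, d_model))
gate_w = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model))  # one weight per expert

# Gate: route each token to its top-k experts with softmax weights.
logits = x @ gate_w
top_idx = np.argsort(logits, axis=1)[:, -top_k:]              # (n_tokens, top_k)
top_logits = np.take_along_axis(logits, top_idx, axis=1)
weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Grouped expert GEMM: tokens are bucketed by expert so each selected
# expert runs one matmul over its token group -- a fused kernel does all
# of this in a single launch instead of one launch per expert.
out = np.zeros_like(x)
for e in np.unique(top_idx):
    tok, slot = np.nonzero(top_idx == e)
    out[tok] += weights[tok, slot, None] * (x[tok] @ experts[e])

print(out.shape)  # (16, 64)
```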
Thunder Kittens: Register-Tiled Kernel Design Philosophy
Thunder Kittens (Stanford/HazyResearch) introduced a register-tiling philosophy for CUDA kernels: organize all computation around 16x16 tiles that fit in the register file, minimizing shared memory traffic. The key insight is that on modern NVIDIA GPUs, the register file is the fastest storage, and shared memory is already a bottleneck for memory-bound kernels.
The direct mapping to SM100:
- Register-tiling maps to TMEM management. Thunder Kittens manually tiles data into registers on Hopper. On SM100, TMEM provides dedicated Tensor Core memory that serves the same purpose with hardware support. The algorithmic decomposition (16x16 tiles, careful data layout) transfers directly, but the implementation shifts from register allocation tricks to TMEM API calls.
- Warp-level programming. Both Thunder Kittens and TrtLLMGen operate at the warp level (32 threads). SM100’s warpgroup feature (4 warps, 128 threads cooperating on a single MMA) extends this, and TrtLLMGen’s kernels are designed for warpgroup-level Tensor Core dispatch.
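The 16x16 tile decomposition itself is framework-agnostic; a numpy stand-in for what Thunder Kittens keeps in the register file (and what TMEM holds on SM100) looks like this — each inner accumulator tile is the unit of work that stays resident near the compute units:

```python
import numpy as np

TILE = 16  # Thunder Kittens organizes all compute around 16x16 tiles

def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matmul decomposed into TILE x TILE blocks. On a GPU, each `acc`
    tile would live in registers/TMEM while a- and b-tiles stream past."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=a.dtype)  # resident accumulator
            for p in range(0, k, TILE):
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc
    return c

rng = np.random.default_rng(1)
a, b = rng.standard_normal((64, 48)), rng.standard_normal((48, 32))
assert np.allclose(tiled_matmul(a, b), a @ b)
```

The tile-scheduling question (the loop order over `i`, `j`, `p`) is exactly the part that remains a software decision on SM100.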
4. Compression Layer: SpectralQuant and the d_eff Finding
SpectralQuant introduces spectral rotation as a principled approach to KV cache compression. The key results:
5.95x KV cache compression. By rotating KV cache vectors into a spectral basis (via a learned rotation matrix from 15 seconds of calibration), SpectralQuant concentrates information into a small number of dimensions. The rotation is applied once at cache insertion time, and the inverse rotation is fused into the attention output projection. This achieves 5.95x compression vs TurboQuant’s provably near-optimal 5.02x.
d_eff ≈ 4. The most striking finding: across 6 models in 4 families, the effective dimensionality of KV cache key vectors is approximately 4 out of 128. After spectral rotation, only ~4 dimensions carry meaningful signal; the rest are noise. This holds stably across model sizes (3B to 14B+) and architectures (Qwen, Llama).
Selective error correction. TurboQuant’s QJL error correction applied to noise dimensions actually hurts quality — correcting near-zero errors with noisy estimates injects noise. SpectralQuant applies QJL only to the ~4 signal dimensions, saving 124 bits per key while improving cosine similarity by +2.59pp.
Keys vs values asymmetry. d_eff ≈ 4 for keys (narrow selectors) but ~50 for values (broad information carriers). This fundamental asymmetry means truncation works for keys but is catastrophic for values — all 128 dimensions must be kept, just at varying precision.
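The asymmetry can be made concrete with a truncation experiment on synthetic data shaped like the finding above (low-rank keys, broad values — the data generation is ours, not SpectralQuant's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 128

def truncation_cosine(x: np.ndarray, keep: int) -> float:
    """Mean cosine similarity after projecting onto the top-`keep`
    spectral dimensions and discarding the rest."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    x_hat = (x @ vt[:keep].T) @ vt[:keep]
    num = (x * x_hat).sum(axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)
    return float((num / den).mean())

# Keys: ~4 signal dims. Values: variance spread over ~50 dims.
keys = rng.standard_normal((n, 4)) @ rng.standard_normal((4, d)) * 10 \
       + 0.1 * rng.standard_normal((n, d))
values = rng.standard_normal((n, 50)) @ rng.standard_normal((50, d))

print(truncation_cosine(keys, 4))    # high: truncation is nearly lossless
print(truncation_cosine(values, 4))  # low: values need all dimensions
```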
5. System Layer: How These Compose in Serving Frameworks
FlashInfer is the direct integration target. NVIDIA contributed TrtLLMGen’s kernels into FlashInfer’s codebase, making them available to any serving framework that uses FlashInfer as its attention backend. FlashInfer’s PagedAttention implementation already manages KV cache in GPU memory pages; SpectralQuant’s compressed format slots into this paging system by reducing the per-page memory footprint.
vLLM uses FlashInfer (or its own CUDA kernels) for attention, and manages KV cache via PagedAttention. With SpectralQuant compression, vLLM’s effective KV cache capacity increases by ~6x, directly translating to either: (a) 6x more concurrent requests at the same memory budget, or (b) 6x longer context windows per request.
SGLang adds RadixAttention for prefix caching on top of similar kernel infrastructure. SpectralQuant’s spectral rotation commutes with prefix sharing because the rotation matrix is model-global (not sequence-dependent), so compressed prefix caches remain shareable across requests.
6. Cross-Layer Interactions and Compound Effects
The layers do not merely add; they multiply.
6.1 SpectralQuant x TrtLLMGen: Compression Meets Fused Attention
SpectralQuant’s 5.95x KV compression directly reduces the memory bandwidth pressure that TrtLLMGen’s FMHA kernels face during decode. The decode phase is almost purely memory-bandwidth-bound: each generated token requires reading the full KV cache for all attention heads. On a B200 with 8 TB/s bandwidth, serving a model with 100 GB of KV cache per batch requires ~12.5ms per decode step just for KV cache reads. With SpectralQuant, this drops to ~2.1ms.
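The latency figures above follow from the bandwidth-bound lower bound; a sketch (helper name is ours):

```python
def decode_step_ms(kv_bytes: float, bandwidth_tb_s: float,
                   compression: float = 1.0) -> float:
    """Lower bound on per-step decode latency from KV cache reads alone:
    bytes actually moved divided by available bandwidth."""
    return kv_bytes / compression / (bandwidth_tb_s * 1e12) * 1e3

print(decode_step_ms(100e9, 8.0))        # ≈ 12.5 ms, uncompressed
print(decode_step_ms(100e9, 8.0, 5.95))  # ≈ 2.1 ms with 5.95x compression
```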
TrtLLMGen’s SM100 TMA usage for KV cache access pairs naturally with compressed KV formats. TMA 2.0 performs address generation and data movement in hardware; it does not care whether it is reading FP16 KV entries or SpectralQuant’s mixed-precision compressed entries. The compressed entries are smaller, so TMA transfers complete faster, and the TMA prefetch queue can hold more entries (more future tokens’ KV data) in the same buffer space.
6.2 Thunder Kittens’ Philosophy x SM100 TMEM
Thunder Kittens demonstrated that register-tiling is the right abstraction for memory-bound GPU kernels. SM100’s TMEM provides hardware support for this abstraction. The lineage is clear:
- Hopper (SM90): Thunder Kittens manually manages register allocation for 16x16 tiles.
- Blackwell (SM100): TMEM provides dedicated Tensor Core memory, replacing manual register tricks with a hardware-managed pool.
- TrtLLMGen: Designed from the start for SM100’s memory hierarchy, including TMEM. Its FMHA kernels use TMEM for the QKV tiles that feed attention computation.
The Thunder Kittens insight that never required hardware support, and so remains fully relevant on SM100, is the tile scheduling strategy: the order in which tiles are processed to maximize data reuse.
6.3 The d_eff Finding and Kernel-Level Implications
SpectralQuant’s d_eff ≈ 4 finding has implications beyond compression ratios:
- Mixed-precision attention. The top-4 signal dimensions could be computed in FP16 while the remaining dimensions use FP4 or INT4, directly leveraging Blackwell’s mixed-precision Tensor Cores. This is not just a storage optimization but a compute optimization: FP4 MMA on SM100 runs at 2x the throughput of FP8.
- Low-rank attention approximation. If d_eff = 4, the attention matrix is effectively rank-4. Kernel implementations could compute exact attention on the 4 signal dimensions and approximate the remaining dimensions, reducing the O(n·d) decode cost to O(n·4) for the high-precision path.
- Speculative decoding synergy. During speculative decoding’s verification phase, a cheap “speculative verification” on just 4 dimensions could reject most incorrect draft tokens early, avoiding the full cache read.
6.4 NVIDIA Open-Sourcing Creates Compound Effects
NVIDIA open-sourcing TrtLLMGen kernels into FlashInfer, combined with SpectralQuant’s academic publication, creates a compound effect:
- Before: Production inference stacks used closed-source TRT-LLM kernels with FP16/FP8 KV caches.
- After: Open-source TrtLLMGen kernels + SpectralQuant compression + Blackwell hardware. Each is independently accessible, and the combination is available to anyone building on FlashInfer.
The competitive implication: the gap between NVIDIA’s internal serving performance and what open-source frameworks achieve has narrowed significantly.
7. Quantified Impact: Inference Cost for DeepSeek-V3
To ground this analysis, consider DeepSeek-V3 (685B parameters, 256 experts, top-8 routing, Multi-Head Latent Attention) served on Blackwell B200 GPUs.
Baseline (FP8, no KV compression, pre-TrtLLMGen open-source)
- Hardware cost: 8x B200 cluster, ~$120/hour
- Decode throughput: ~3,000-4,000 output tokens/second (full cluster, batched)
- KV cache memory: ~80 GB for a 128K context batch, limiting concurrency
- Cost: ~$8-10 per million output tokens
With Full Optimization Stack (SpectralQuant + TrtLLMGen + Blackwell)
- SpectralQuant 5.95x KV compression: KV cache drops from ~80 GB to ~13.4 GB. This allows ~6x more concurrent requests or equivalently 6x higher throughput at the same latency.
- TrtLLMGen fused MoE GEMM: Eliminates per-expert kernel launch overhead. For top-8 routing across 256 experts, this reduces MoE layer latency by ~30-40%.
- TrtLLMGen FMHA with TMA 2.0: Attention decode latency improves ~20-30% from TMA-based KV prefetching.
- Combined throughput improvement: Conservatively 4-5x over baseline.
- Projected cost: $1.50-2.50 per million output tokens
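The projected cost follows from simple throughput arithmetic; a sketch using this section's baseline numbers (the 4.5x speedup is a midpoint of the 4-5x range, and the helper name is ours):

```python
def cost_per_m_tokens(cluster_usd_per_hr: float, tokens_per_s: float) -> float:
    """Serving cost per million output tokens, assuming the cluster is
    fully utilized at the given throughput."""
    return cluster_usd_per_hr / 3600 / tokens_per_s * 1e6

baseline = cost_per_m_tokens(120, 3500)         # within the $8-10/M baseline
optimized = cost_per_m_tokens(120, 3500 * 4.5)  # within the $1.50-2.50/M range
print(round(baseline, 2), round(optimized, 2))
```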
Note: DeepSeek-V3 already uses Multi-Head Latent Attention (MLA), which provides its own KV compression. SpectralQuant’s gains on MLA models may be smaller than 5.95x because MLA already projects KV to a lower-dimensional latent space. The interaction between MLA and spectral rotation is an open question.
8. Remaining Bottlenecks After Full-Stack Optimization
Even with all four layers optimized, bottlenecks remain:
8.1 Expert Routing Imbalance (MoE Models)
TrtLLMGen’s fused MoE GEMM assumes reasonably balanced expert utilization. When routing is skewed (some experts receive 10x more tokens than others), the fused kernel’s workload partitioning becomes suboptimal.
8.2 Prefill vs. Decode Disaggregation
The optimizations above primarily benefit decode (token generation). Prefill (processing the input prompt) is compute-bound, not memory-bound, and benefits less from KV compression. Given the trend toward prefill-decode disaggregation, these gains accrue mainly to the decode pool.
8.3 Network Bandwidth for Distributed Inference
For models requiring multi-node inference (8+ GPUs across machines), the inter-node network becomes the bottleneck. NVLink 5.0 at 1.8 TB/s helps within a node, but cross-node InfiniBand at 400 Gb/s remains a constraint for expert-parallel MoE inference.
8.4 Scheduling and Batching Overhead
As per-token compute cost drops, the overhead of request scheduling, dynamic batching, and memory management in the serving framework becomes a larger fraction of total latency.
8.5 Quality Degradation at Extreme Compression
SpectralQuant’s d_eff = 4 finding may not hold uniformly across all model architectures, sequence positions, or task types. The interaction with long-context retrieval at 128K+ and chain-of-thought reasoning has not been thoroughly characterized.
9. Open Integration Questions
Does SpectralQuant’s spectral rotation compose cleanly with TrtLLMGen’s FMHA? SpectralQuant applies rotation R at KV write time and R^(-1) at read time. TrtLLMGen’s FMHA expects standard KV format. Integration requires either: (a) modifying FMHA to accept pre-rotated KV and fuse R^(-1) into the output projection, or (b) a separate pre-processing step. Option (a) is feasible — the rotation is a 128x128 matrix multiply fusible into existing projections.
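Option (a) reduces to a precomputed matrix fold. A numpy sketch of the equivalence, using a row-vector convention and assuming R is orthogonal (so R^(-1) = R^T):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Random orthogonal R as a stand-in for the learned spectral rotation.
r, _ = np.linalg.qr(rng.standard_normal((d, d)))
w_o = rng.standard_normal((d, d))   # attention output projection

v = rng.standard_normal((32, d))    # per-token attention outputs, unrotated
v_rot = v @ r                       # what the cache-side rotation produces

# Fold R^{-1} into W_o once, offline: the kernel never un-rotates at runtime.
w_fused = r.T @ w_o
assert np.allclose(v @ w_o, v_rot @ w_fused)
```

Because `w_fused` is computed once at load time, the runtime cost of option (a) is zero extra FLOPs per token.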
Can Thunder Kittens’ approach be adapted for SM100? TK’s core value on SM100 is not register-tiling itself (TMEM handles that) but tile-scheduling algorithms and DSL-like abstractions. A “Thunder Kittens for Blackwell” would likely become a kernel generation framework targeting TMEM + TMA 2.0.
How does SpectralQuant interact with MLA? DeepSeek-V3’s MLA already compresses KV to a 512-dim latent space. Applying spectral rotation on top may yield diminishing returns since MLA was designed to remove redundancy. The compound compression ratio may be less than the product of individual ratios.
Can the d_eff finding be exploited at the hardware level? If future models consistently show d_eff in 4-8 range, GPU architectures could include hardware support for rank-k attention. This would be a 2028+ hardware change, but the algorithmic finding is available now.
10. Summary: The Compound Optimization Thesis
Inference cost is dropping not from any single breakthrough but from the simultaneous maturation of all four stack layers.
| Layer | Innovation | Standalone Gain | Compound Contribution |
|---|---|---|---|
| Hardware | Blackwell SM100 | ~2x over Hopper | Enables all three layers above |
| Kernels | TrtLLMGen + Thunder Kittens | ~1.3-1.5x from fused MoE + FMHA | Larger with compressed KV (less data to move) |
| Compression | SpectralQuant | 5.95x KV compression | Multiplies effective bandwidth for decode |
| Systems | FlashInfer/vLLM/SGLang | ~1.2-1.5x from batching/scheduling | Orchestrates the above, captures compound gains |
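A back-of-envelope check of how the table's rows compound (the range midpoints and the 70% KV-bound fraction are our illustrative assumptions; compression is applied Amdahl-style, only to the bandwidth-bound share of decode rather than as a full end-to-end multiplier):

```python
# Standalone multipliers from the table, taken at range midpoints.
kernel_gain = 1.4       # fused MoE + FMHA
system_gain = 1.35      # batching / scheduling
decode_bandwidth_gain = 5.95

# Assume ~70% of baseline decode latency is KV-bandwidth-bound (illustrative),
# so compression only accelerates that fraction.
kv_bound = 0.7
effective_compression_gain = 1 / (kv_bound / decode_bandwidth_gain + (1 - kv_bound))

combined = kernel_gain * system_gain * effective_compression_gain
print(round(combined, 1))  # lands inside the conservative 4-6x estimate
```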
Conservative combined estimate: 4-6x reduction in inference cost over Q4 2025 baselines. That takes large MoE models (DeepSeek-V3 class) from the ~$8-10/M output tokens they cost on Hopper with FP8 toward the $1.50-2.50/M range on the full stack.
The remaining gap to close is not in any single layer but in the integration seams between layers: making SpectralQuant’s rotation play nicely with TrtLLMGen’s FMHA, making Thunder Kittens’ scheduling insights portable to SM100’s TMEM, and making the system layer aware of compression and kernel capabilities for optimal scheduling.
The inference cost curve is bending faster than the training cost curve. This is the year that optimization compounds.
Cross-references: SpectralQuant, TrtLLMGen, Thunder Kittens, Blackwell Architecture