ThunderKittens + CUDA Deep Dive — Key Takeaways
Source: Quinn & Ben YouTube livestream (~5 hrs). Full transcript indexed.
GPU Architecture (H100 Focus)
SM Structure
- 144 SMs manufactured, 132 enabled (yield binning)
- Each SM has 4 quadrants (sub-partitions); per-SM totals:
  - 128 FP32 CUDA cores (32 per quadrant)
  - 4 Tensor Cores, 1 per quadrant (4th gen on Hopper)
  - 256 KB register file (65,536 x 32-bit registers; 64 KB per quadrant)
  - 256 KB combined L1/shared memory (up to 228 KB configurable as shared)
- Warp scheduler: one per quadrant, issues 1 instruction per warp (32 threads) per cycle
- Co-schedule at least 2 warps per quadrant to hide instruction and memory latency
Memory Hierarchy
- Registers: ~1 cycle access
- Shared memory (SRAM): fast, user-controlled addressing
- L1 cache: slightly slower than shared mem (cache tagging overhead)
- L2 cache: 50 MB on H100. Cross-chip access can be nearly as slow as HBM
- HBM3: 3.35 TB/s, ~500-700 cycle latency
Key Insight: Instruction Issue Overhead
- FP32 FMA: instruction fetch/issue costs ~20x the energy of the arithmetic itself
- This is why tensor cores exist — amortize the fixed issue cost over hundreds of thousands of FLOPs per instruction
- A single Hopper HGMMA (wgmma) instruction at its largest shape does ~500K FLOPs for the same issue cost
Tensor Cores
- 4th gen (Hopper): mma shapes m16n8k16 for FP16, m16n8k32 for FP8
- Actually implements A * B^T (transposed multiply), not A * B
- Register layout: 16x16 “core matrices” = one vector register
- Weight-stationary: load weights once, stream activations
Memory Consistency Model (THE HARD PART)
__syncwarp()
- Needed when threads within a warp read each other's shared memory writes
- Example: load global → shared memory, then load shared → registers, requires a __syncwarp() in between
__syncthreads()
- Needed when threads across warps need each other's shared memory writes
- Without it: thread 0 can start computing before thread 128 has finished loading its part of shared memory
- This is the ONLY hard part of CUDA. Everything else is easy.
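A minimal illustration of the cross-warp case (hypothetical kernel, not from the stream):

```cuda
// Staging a tile through shared memory. Without the __syncthreads(),
// warp 0 may read smem slots that warp 4 has not written yet.
__global__ void staged_read(const float* __restrict__ in, float* out) {
    __shared__ float smem[256];
    int t = threadIdx.x;                  // blockDim.x == 256 assumed
    smem[t] = in[blockIdx.x * 256 + t];   // every thread writes one slot
    __syncthreads();                      // all warps' writes now visible
    // Each thread reads a slot written by a thread in a *different* warp:
    out[blockIdx.x * 256 + t] = smem[(t + 128) % 256];
}
```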
Async Loads (cp.async)
- Available on SM80+ (Ampere, Ada, Hopper)
- Tells hardware: “start loading, I’ll tell you when to wait”
- Enables pipelining: load iteration N+1 while computing iteration N
- TMA (Tensor Memory Accelerator) on Hopper: even better, handles descriptors automatically
Pipeline Pattern
for each iteration:
    load_async(next_chunk)         // start loading next
    wait_async(); sync_threads()   // drain the previous async load, then barrier
    compute(current_chunk)         // use current data
    // next iteration: current/next buffers swap roles (double buffering)
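A sketch of this pattern with the SM80 cp.async intrinsics from `<cuda_pipeline.h>` (double buffering; kernel name, tile size, and the trivial "compute" are illustrative, not TK's API):

```cuda
#include <cuda_pipeline.h>

__global__ void pipelined(const float* __restrict__ in, float* out, int chunks) {
    __shared__ float buf[2][256];
    int t = threadIdx.x;  // blockDim.x == 256, single block assumed

    // Prologue: start loading chunk 0 into buffer 0.
    __pipeline_memcpy_async(&buf[0][t], &in[t], sizeof(float));
    __pipeline_commit();

    float acc = 0.f;
    for (int i = 0; i < chunks; ++i) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < chunks) {  // start loading chunk i+1 while computing on i
            __pipeline_memcpy_async(&buf[nxt][t], &in[(i + 1) * 256 + t],
                                    sizeof(float));
            __pipeline_commit();
        }
        __pipeline_wait_prior(i + 1 < chunks ? 1 : 0);  // chunk i has landed
        __syncthreads();     // chunk i's loads visible to all warps
        acc += buf[cur][t];  // stand-in "compute" on the current chunk
        __syncthreads();     // everyone done reading before buffer is reused
    }
    out[t] = acc;
}
```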
Bank Conflicts
- Shared memory: 32 banks, 32-bit words
- Row-major layout with 16x16 tiles: 8-way bank conflict (very bad)
- Fix: XOR swizzle or padding
- TK provides automatic 32/64/128-byte swizzle layouts
ThunderKittens (TK)
Design Philosophy
- Primitives, not abstractions — you must understand the hardware
- Header-only C++ library, compiles in ~5 seconds
- Competitive with CUTLASS (within 5-10%) in ~10x less code
- Flash Attention equivalent: ~75 lines vs thousands in CUTLASS
Types
- Global layout: 4D tensor (batch, depth, rows, cols), TMA descriptors auto-generated
- Shared tiles: parameterized by type/rows/cols, auto-swizzled addressing
- Register tiles: row or column layout (matching tensor core A*B^T requirement)
- Register vectors: aligned, ortho, naive layouts
Operations
- Warp-level: load/store, MMA, elementwise ops
- Group-level: collaborative loads across multiple warps, warp group sync
- LCSF template: Load-Compute-Store-Finish (producer-consumer pattern)
Producer-Consumer Pattern
- Divide warps into producers (memory) and consumers (compute)
- Producers: load from global → shared memory, handle TMA
- Consumers: load shared → registers, compute, store back
- Semaphores (M barriers) synchronize between them
- TMA auto-triggers semaphore when load completes
DeepSeek Whale Kernel (HD=576, VD=512)
- 576 is 64*9 — awkward for tensor cores (16x16 base)
- Partitioned Q into 4 warps x 144 columns each
- Attention accumulation across warps via shared memory atomic adds
- Race condition found: Q shared memory overlapped with K shared memory
- Performance: roughly on par with PyTorch SDPA; needs more optimization
- Key lesson: blocking structure choice is the entire game
Practical Lessons
- Write the copy kernel first — get data flow right before compute
- Memory consistency bugs are the only hard bugs — compute is never the problem
- Use NCU (Nsight Compute) — read the SASS, understand stall reasons
- Async pipelining is mandatory on H100 — Flash Attention 3 vs FA2 = 2x from pipelining
- Bank conflicts matter — 8-way conflict = 1/8 shared memory bandwidth
- Profile, don’t guess — NCU shows exactly where time is spent
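A minimal starter in the spirit of "write the copy kernel first" (hypothetical grid-stride copy, not from the stream): it exercises the full data path before any compute goes in.

```cuda
// Grid-stride copy: global -> registers -> global. Once this hits close to
// HBM bandwidth, layer in shared-memory staging, then compute.
__global__ void copy_kernel(const float* __restrict__ src,
                            float* __restrict__ dst, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        dst[i] = src[i];
}
```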
See also: TrtLLMGen MoE Kernels (register-tiling evolved into TMEM management on SM100), Blackwell Architecture, Inference Stack Synthesis