Thunder Kittens + CUDA Deep Dive — Key Takeaways

Source: Quinn & Ben YouTube livestream (~5 hrs). Full transcript indexed.

GPU Architecture (H100 Focus)

SM Structure

  • 144 SMs manufactured, 132 enabled (yield binning)
  • Each SM has 4 quadrants (processing blocks); per SM:
    • 128 FP32 CUDA cores (32 per quadrant)
    • 4 Tensor Cores (one per quadrant, 4th gen on Hopper)
    • 256 KB register file (65,536 x 32-bit registers)
    • 256 KB combined L1/shared memory, configurable split (up to 228 KB shared)
  • Warp scheduler: one per quadrant, issues 1 instruction per warp (32 threads) per cycle
  • Co-schedule at least 2 warps per quadrant to hide latency

Memory Hierarchy

  • Registers: ~1 cycle access
  • Shared memory (SRAM): fast, user-controlled addressing
  • L1 cache: slightly slower than shared mem (cache tagging overhead)
  • L2 cache: 50 MB on H100, split into two partitions; accessing the far partition (across the die) can be nearly as slow as HBM
  • HBM3: 3.35 TB/s, ~500-700 cycle latency

Key Insight: Instruction Issue Overhead

  • FP32 FMA: 20:1 ratio of instruction issue energy vs actual compute energy
  • This is why tensor cores exist — amortize instruction overhead over 256K+ FLOPs per instruction
  • An HMMA instruction does ~500K FLOPs for the same issue cost

Tensor Cores

  • 4th gen (Hopper): m16n8k16 for FP16, m16n8k32 for FP8
  • Actually implements A * B^T (transposed multiply), not A * B
  • Register layout: 16x16 “core matrices” = one vector register
  • Weight-stationary: load weights once, stream activations

Memory Consistency Model (THE HARD PART)

__syncwarp()

  • Needed when threads within a warp read each other’s shared memory writes
  • Example: a cooperative global→shared tile load followed by shared→register reads of other lanes’ data requires __syncwarp() in between

__syncthreads()

  • Needed when threads across warps read each other’s shared memory writes
  • Without it: thread 0 can start computing before thread 128 has loaded its shared memory
  • This is the ONLY hard part of CUDA. Everything else is easy.

Async Loads (cp.async)

  • Available on SM80+ (Ampere, Ada, Hopper)
  • Tells hardware: “start loading, I’ll tell you when to wait”
  • Enables pipelining: load iteration N+1 while computing iteration N
  • TMA (Tensor Memory Accelerator) on Hopper: even better, handles descriptors automatically

Pipeline Pattern

for each iteration:
    load_async(next_chunk)        // kick off the next chunk’s load
    sync_threads()                // wait until the current chunk has landed
    compute(current_chunk)        // use current data
    // next iteration: buffers swap roles

Bank Conflicts

  • Shared memory: 32 banks, 32-bit words
  • Row-major layout with 16x16 tiles: 8-way bank conflict (very bad)
  • Fix: XOR swizzle or padding
  • TK provides automatic 32/64/128-byte swizzle layouts

Thunder Kittens (TK)

Design Philosophy

  • Primitives, not abstractions — you must understand the hardware
  • Header-only C++ library, compiles in ~5 seconds
  • Competitive with CUTLASS (within 5-10%) but ~10x less code
  • Flash Attention equivalent: ~75 lines vs thousands in CUTLASS

Types

  • Global layout: 4D tensor (batch, depth, rows, cols), TMA descriptors auto-generated
  • Shared tiles: parameterized by type/rows/cols, auto-swizzled addressing
  • Register tiles: row or column layout (matching tensor core A*B^T requirement)
  • Register vectors: aligned, ortho, naive layouts

Operations

  • Warp-level: load/store, MMA, elementwise ops
  • Group-level: collaborative loads across multiple warps, warp group sync
  • LCSF template: Load-Compute-Store-Finish (producer-consumer pattern)

Producer-Consumer Pattern

  • Divide warps into producers (memory) and consumers (compute)
  • Producers: load from global into shared memory, handle TMA
  • Consumers: load from shared into registers, compute, store back
  • Semaphores (mbarriers) synchronize between them
  • TMA auto-triggers semaphore when load completes

DeepSeek Whale Kernel (HD=576, VD=512)

  • 576 is 64*9 — awkward for tensor cores (16x16 base)
  • Partitioned Q into 4 warps x 144 columns each
  • Attention accumulation across warps via shared memory atomic adds
  • Race condition found: Q shared memory overlapped with K shared memory
  • Performance: roughly on par with PyTorch SDPA; needs more optimization
  • Key lesson: blocking structure choice is the entire game

Practical Lessons

  1. Write the copy kernel first — get data flow right before compute
  2. Memory consistency bugs are the only hard bugs — compute is never the problem
  3. Use NCU (Nsight Compute) — read the SASS, understand stall reasons
  4. Async pipelining is mandatory on H100 — Flash Attention 3 vs FA2 = 2x from pipelining
  5. Bank conflicts matter — 8-way conflict = 1/8 shared memory bandwidth
  6. Profile, don’t guess — NCU shows exactly where time is spent

See also: TrtLLMGen MoE Kernels (register-tiling evolved into TMEM management on SM100), Blackwell Architecture, Inference Stack Synthesis