Thunder Kittens + CUDA Deep Dive — Key Takeaways

Source: Quinn & Ben YouTube livestream (~5 hrs). Full transcript indexed.

GPU Architecture (H100 Focus)

SM Structure

  • 144 SMs manufactured, 132 enabled (yield binning)
  • Each SM has 4 quadrants (processing blocks); per SM:
    • 128 FP32 CUDA cores (32 per quadrant)
    • 4 Tensor Cores (one per quadrant, 4th gen on Hopper)
    • 256 KB register file (65,536 x 32-bit registers)
    • 256 KB combined L1/shared memory, configurable split (up to 228 KB shared)
  • Warp scheduler: one per quadrant, issues 1 instruction per warp (32 threads) per cycle
  • Co-schedule at least 2 warps per quadrant to hide latency

Memory Hierarchy

  • Registers: ~1 cycle access
  • Shared memory (SRAM): fast, user-controlled addressing
  • L1 cache: slightly slower than shared mem (cache tagging overhead)
  • L2 cache: 50 MB on H100, split into two partitions; accessing the far partition (across the die) can be nearly as slow as HBM
  • HBM3: 3.35 TB/s, ~500-700 cycle latency

Key Insight: Instruction Issue Overhead

  • FP32 FMA: 20:1 ratio of instruction issue energy vs actual compute energy
  • This is why tensor cores exist — amortize instruction overhead over 256K+ FLOPs per instruction
  • An HMMA instruction does ~500K FLOPs for the same issue cost

Tensor Cores

  • 4th gen (Hopper): m16n8k16 for FP16, m16n8k32 for FP8
  • Actually implements A * B^T (transposed multiply), not A * B
  • Register layout: 16x16 “core matrices” = one vector register
  • Weight-stationary: load weights once, stream activations

Memory Consistency Model (THE HARD PART)

__syncwarp()

  • Needed when threads within a warp read each other’s shared memory writes
  • Example: a cooperative global→shared tile load followed by shared→register reads of other lanes’ data requires __syncwarp() in between

__syncthreads()

  • Needed when threads across warps read each other’s shared memory writes
  • Without it: thread 0 can start computing before thread 128 has loaded its shared memory
  • This is the ONLY hard part of CUDA. Everything else is easy.

Async Loads (cp.async)

  • Available on SM80+ (Ampere, Ada, Hopper)
  • Tells hardware: “start loading, I’ll tell you when to wait”
  • Enables pipelining: load iteration N+1 while computing iteration N
  • TMA (Tensor Memory Accelerator) on Hopper: even better, handles descriptors automatically

Pipeline Pattern

for each iteration:
    load_async(next_chunk)        // kick off the next chunk’s load
    sync_threads()                // wait until the current chunk has landed
    compute(current_chunk)        // use current data
    // next iteration: buffers swap roles

Bank Conflicts

  • Shared memory: 32 banks, 32-bit words
  • Row-major layout with 16x16 tiles: 8-way bank conflict (very bad)
  • Fix: XOR swizzle or padding
  • TK provides automatic 32/64/128-byte swizzle layouts

Thunder Kittens (TK)

Design Philosophy

  • Primitives, not abstractions — you must understand the hardware
  • Header-only C++ library, compiles in ~5 seconds
  • Competitive with CUTLASS (within 5-10%) but ~10x less code
  • Flash Attention equivalent: ~75 lines vs thousands in CUTLASS

Types

  • Global layout: 4D tensor (batch, depth, rows, cols), TMA descriptors auto-generated
  • Shared tiles: parameterized by type/rows/cols, auto-swizzled addressing
  • Register tiles: row or column layout (matching tensor core A*B^T requirement)
  • Register vectors: aligned, ortho, naive layouts

Operations

  • Warp-level: load/store, MMA, elementwise ops
  • Group-level: collaborative loads across multiple warps, warp group sync
  • LCSF template: Load-Compute-Store-Finish (producer-consumer pattern)

Producer-Consumer Pattern

  • Divide warps into producers (memory) and consumers (compute)
  • Producers: load from global into shared memory, handle TMA
  • Consumers: load from shared into registers, compute, store back
  • Semaphores (mbarriers) synchronize between them
  • TMA auto-triggers semaphore when load completes

DeepSeek Whale Kernel (HD=576, VD=512)

  • 576 is 64*9 — awkward for tensor cores (16x16 base)
  • Partitioned Q into 4 warps x 144 columns each
  • Attention accumulation across warps via shared memory atomic adds
  • Race condition found: Q shared memory overlapped with K shared memory
  • Performance: roughly on par with PyTorch SDPA; needs more optimization
  • Key lesson: blocking structure choice is the entire game

Practical Lessons

  1. Write the copy kernel first — get data flow right before compute
  2. Memory consistency bugs are the only hard bugs — compute is never the problem
  3. Use NCU (Nsight Compute) — read the SASS, understand stall reasons
  4. Async pipelining is mandatory on H100 — Flash Attention 3 vs FA2 = 2x from pipelining
  5. Bank conflicts matter — 8-way conflict = 1/8 shared memory bandwidth
  6. Profile, don’t guess — NCU shows exactly where time is spent

See also: TrtLLMGen MoE Kernels (register-tiling evolved into TMEM management on SM100), Blackwell Architecture, Inference Stack Synthesis