NVIDIA Blackwell GPU Architecture: Deep Technical Analysis

B200 Reference Card

| Parameter | Value |
| --- | --- |
| Process | TSMC N4P |
| Dies | 2 x ~400 mm2 |
| Transistors | ~92B total |
| SMs | 180 (of 192) |
| Tensor Cores | 720 (5th gen) |
| FP4 TOPS | 4,500 |
| FP8/INT8 TOPS | 2,250 |
| BF16 TFLOPS | 1,500 |
| L2 Cache | 96 MB (2x Hopper) |
| HBM3e | 192 GB, 8 TB/s |
| NVLink 5 | 1.8 TB/s (18 links) |
| Die-to-die (NV-HBI) | 10 TB/s |
| TDP | 1000 W |

Two-Die Yield Math

  • Monolithic 800 mm2: yield = e^(-0.09 x 8.0) = 49% (Poisson model, D0 ≈ 0.09 defects/cm2, area in cm2)
  • Two dies at 400 mm2 each: yield per die = e^(-0.09 x 4.0) = 70% (see the sketch below)
  • Independent binning for lower SKUs (B200A) gives much better product flexibility
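
A minimal sketch of the Poisson yield model behind these numbers, assuming D0 ≈ 0.09 defects/cm2 and plugging in the die areas above:

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float = 0.09) -> float:
    """Poisson defect model: Y = exp(-D0 * A), with area converted from mm^2 to cm^2."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

# Known-good-die testing screens each die *before* pairing, so cost scales with
# the ~70% per-die yield rather than 0.70^2.
print(f"800 mm2 monolithic yield: {poisson_yield(800):.0%}")   # ~49%
print(f"400 mm2 per-die yield:    {poisson_yield(400):.0%}")   # ~70%
```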

CoWoS-L interposer ~2500 mm2 with Local Silicon Interconnect bridges at ~10 um pitch. Die-to-die energy: ~0.5 pJ/bit (vs ~5-7 pJ/bit off-package NVLink).

Memory System

| Parameter | B200 (HBM3e) | H100 (HBM3) | A100 (HBM2e) |
| --- | --- | --- | --- |
| Stacks | 8 | 5 | 5 |
| Capacity | 192 GB | 80 GB | 80 GB |
| Bandwidth | 8.0 TB/s | 3.35 TB/s | 2.0 TB/s |
| Energy/bit | ~4 pJ/bit | ~5 pJ/bit | ~7 pJ/bit |

At ~4 pJ/bit, a 16-bit read costs ~64 pJ (64,000 fJ). Against a ~63 fJ FP16 FMA, one HBM3e access is worth roughly 1,000 FMAs. The Horowitz observation that memory access dwarfs arithmetic still holds at the bleeding edge.
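
A back-of-envelope check of that ratio, using the ~4 pJ/bit figure from the table and the ~63 fJ FP16 FMA estimate used below:

```python
HBM3E_PJ_PER_BIT = 4.0   # ~4 pJ/bit, from the table above
FMA_FP16_FJ = 63.0       # silicon-limit FP16 FMA estimate

read_fj = HBM3E_PJ_PER_BIT * 16 * 1000           # 16-bit read, pJ -> fJ
print(f"16-bit HBM3e read: {read_fj:,.0f} fJ")   # 64,000 fJ
print(f"Equivalent FMAs:   ~{read_fj / FMA_FP16_FJ:,.0f}")  # ~1,016, i.e. ~1,000
```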

5th Gen Tensor Core: FP4 Deep Dive

FP4 (E2M1): 1 sign bit, 2 exponent bits, 1 mantissa bit, for only 16 encodings. The significand multiplier is essentially an AND gate plus an exponent add.
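
A quick enumeration of the 16 E2M1 codes, assuming the OCP MX-style interpretation (exponent bias 1, subnormal at exponent 0, no inf/NaN); this is a sketch of the format, not a statement of NVIDIA's exact encoding:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 value: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                           # subnormal: 0.m * 2^(1 - bias)
        return sign * man * 0.5
    return sign * (1 + man * 0.5) * 2 ** (exp - 1)

print(sorted({decode_e2m1(c) for c in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```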

| Precision | Energy/FMA (silicon limit) | Energy/FMA (B200 whole-chip) | Gap |
| --- | --- | --- | --- |
| FP16 | ~63 fJ | ~667 fJ | 10.6x |
| FP8 | ~25 fJ | ~444 fJ | 17.8x |
| FP4 | ~10 fJ | ~222 fJ | 22.2x |

The 10-22x gap between the silicon limit and the whole-chip figure is almost entirely data-movement and control overhead: register files, caches, NoC, HBM, and scheduling. This is the design space for a purpose-built inference ASIC.
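
The whole-chip column is just TDP divided by peak dense throughput (treating the TOPS figures as an FMA rate, as the table above does); a sketch of that arithmetic using the reference-card numbers:

```python
TDP_W = 1000.0
PEAK_TOPS = {"FP16": 1500, "FP8": 2250, "FP4": 4500}   # dense peaks from the reference card
SILICON_LIMIT_FJ = {"FP16": 63, "FP8": 25, "FP4": 10}  # per-FMA silicon-limit estimates above

for prec, tops in PEAK_TOPS.items():
    whole_chip_fj = TDP_W / (tops * 1e12) * 1e15        # joules per op -> femtojoules per op
    gap = whole_chip_fj / SILICON_LIMIT_FJ[prec]
    print(f"{prec}: {whole_chip_fj:.0f} fJ whole-chip, gap {gap:.1f}x")
```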

Roofline Analysis

| Precision | Peak TOPS | BW (TB/s) | Ridge Point (ops/byte) |
| --- | --- | --- | --- |
| FP4 | 4,500 | 8.0 | 562 |
| FP8 | 2,250 | 8.0 | 281 |
| BF16 | 1,500 | 8.0 | 187 |

Blackwell’s FP8 ridge point (281 ops/byte) is much lower than Hopper’s (~590), so B200 transitions to compute-bound at a lower arithmetic intensity, making it better balanced for inference.
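
Ridge points are simply peak throughput over memory bandwidth; a small helper, using the figures above, to compute them and classify a kernel by its arithmetic intensity:

```python
def ridge_point(peak_tops: float, bw_tb_s: float) -> float:
    """Ridge point in ops/byte: peak ops/s divided by peak bytes/s."""
    return (peak_tops * 1e12) / (bw_tb_s * 1e12)

def regime(ai_ops_per_byte: float, peak_tops: float, bw_tb_s: float) -> str:
    return "compute-bound" if ai_ops_per_byte >= ridge_point(peak_tops, bw_tb_s) else "memory-bound"

for prec, tops in [("FP4", 4500), ("FP8", 2250), ("BF16", 1500)]:
    print(f"{prec}: ridge = {ridge_point(tops, 8.0):.1f} ops/byte")

# Batch-1 FP16 decode sits at roughly 1 op/byte (2 ops per 2-byte weight):
print(regime(1.0, 1500, 8.0))   # memory-bound
```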

LLM Decode: Bandwidth-Bound Reality

70B model at FP16 (140 GB weights):

| GPU | BW | Min Decode Latency (B=1) | Tokens/sec | Utilization |
| --- | --- | --- | --- | --- |
| B200 | 8.0 TB/s | 17.5 ms | 57 | 18.7% |
| H100 | 3.35 TB/s | 41.8 ms | 24 | 28% |
| A100 | 2.0 TB/s | 70 ms | 14 | |

At FP4 weights (35 GB): B200 decode = 4.4 ms = 228 tokens/sec at B=1.
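
A minimal bandwidth-bound decode estimate that reproduces these numbers, assuming weights-only traffic at 100% of peak bandwidth (no KV cache, no overlap losses):

```python
def min_decode_latency_ms(params_billion: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Lower-bound per-token latency: every weight byte is read once per token at full bandwidth."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bw_tb_s * 1e12) * 1e3

for label, bpp in [("FP16", 2.0), ("FP4", 0.5)]:
    lat = min_decode_latency_ms(70, bpp, 8.0)
    print(f"B200, 70B {label}: {lat:.1f} ms/token -> {1e3 / lat:.0f} tokens/sec at B=1")
# Reproduces the 17.5 ms and 4.4 ms figures above.
```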

Compute-bound crossover batch sizes:

  • FP16: B=430
  • FP8: B=560
  • FP4: B=1,125

Even at batch sizes around 500, B200 decode at FP8 and FP4 is still memory-bandwidth-bound. This is the fundamental physics of autoregressive decode.

Power Efficiency: GPU vs Custom ASIC Opportunity

| Component | B200 Power | % |
| --- | --- | --- |
| Compute (SMs + TCs) | ~450 W | 45% |
| HBM3e (8 stacks) | ~200 W | 20% |
| NV-HBI (die-to-die) | ~50 W | 5% |
| NVLink I/O | ~100 W | 10% |
| L2 + NoC | ~80 W | 8% |
| Memory controllers | ~50 W | 5% |
| Misc | ~70 W | 7% |

At FP8, H100 is actually MORE power-efficient per op than B200 (354 fJ vs 444 fJ); the two-die package and eight HBM3e stacks add power. B200 wins on absolute throughput and bandwidth.
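
That per-op comparison is just TDP over peak dense FP8 throughput; a sketch assuming H100 SXM at 700 W and 1,979 dense FP8 TFLOPS:

```python
def fj_per_op(tdp_w: float, peak_tops: float) -> float:
    """Whole-chip energy per op in femtojoules: watts divided by ops per second."""
    return tdp_w / (peak_tops * 1e12) * 1e15

print(f"H100 FP8: {fj_per_op(700, 1979):.0f} fJ/op")    # ~354 fJ
print(f"B200 FP8: {fj_per_op(1000, 2250):.0f} fJ/op")   # ~444 fJ
```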

Inference ASIC Opportunity

A purpose-built inference ASIC at 4nm could potentially achieve ~50-100 fJ/FMA at FP8 whole-chip (vs B200’s 444 fJ) by:

  1. Eliminating GPU generality (no warp scheduler, CUDA cores, register file) — saves 30-40% area/power
  2. Larger on-chip SRAM (like Groq’s 230 MB) — keep weights on-chip
  3. Dataflow architecture — no register file round-trips
  4. Specialized memory access — exploit transformer structure
  5. Aggressive clock/power gating — predictable compute patterns

Estimated 4-5x efficiency advantage over GPU — consistent with historical ASIC vs GPU ratios.

NVL72: The Mega-System

72 Blackwell GPUs + 36 Grace CPUs in one rack:

  • 13.8 TB HBM3e total
  • 324 PFLOPS FP4 aggregate
  • 576 TB/s aggregate memory bandwidth
  • 130 TB/s bisection bandwidth via 18 NVSwitch chips
  • Holds a 7T parameter model at FP4
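
The rack-level figures are straight multiples of the per-GPU numbers; a quick consistency check under that assumption:

```python
GPUS = 72
PER_GPU = {"hbm_gb": 192, "fp4_pflops": 4.5, "hbm_tb_s": 8.0, "nvlink_tb_s": 1.8}

print(f"HBM capacity:  {GPUS * PER_GPU['hbm_gb'] / 1e3:.1f} TB")    # 13.8 TB
print(f"FP4 compute:   {GPUS * PER_GPU['fp4_pflops']:.0f} PFLOPS")  # 324 PFLOPS
print(f"Memory BW:     {GPUS * PER_GPU['hbm_tb_s']:.0f} TB/s")      # 576 TB/s
print(f"NVLink fabric: {GPUS * PER_GPU['nvlink_tb_s']:.0f} TB/s")   # ~130 TB/s
```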

Competitive Positioning

| vs | B200 Advantage | B200 Disadvantage |
| --- | --- | --- |
| AMD MI300X | 8.0 vs 5.3 TB/s memory BW; 1.8 TB/s NVLink vs ~0.9 TB/s Infinity Fabric scale-up | MI300X has more FP8 TOPS (2,615 vs 2,250) |
| Google TPU v5p | Higher absolute throughput | 2.5x more power, less efficient per op |
| Groq LPU | Larger model support (192 GB HBM vs 230 MB on-chip SRAM) | Groq wins on latency (no HBM in the path) |
| Cerebras WSE-3 | Cost, ecosystem | Cerebras wins for models that fit in 44 GB of on-wafer SRAM |
| Apple M4 Ultra | 10x higher throughput | 6.7x more power, different design point |

Key Numbers for Your Projects

ChipletCostModel: B200 = 2x400mm2 + CoWoS-L + 8x HBM3e. Estimated BoM ~$30K.

RooflineVM: Ridge points at FP4/FP8/BF16 = 562/281/187 ops/byte. L2=96MB, HBM=8 TB/s.

InferBench: Use B200 as primary GPU baseline. Decode at B=1 = 18.7% util (FP16).

AI Hardware Keynote Update: FP4 FMA silicon limit ~10 fJ, B200 whole-chip ~222 fJ. Gap = 22x.


See also: Blackwell Supply Chain, TrtLLMGen MoE Kernels (SM100-targeting), TSMC N2 Economics, Inference Stack Synthesis