NVIDIA Blackwell GPU Architecture: Deep Technical Analysis
B200 Reference Card
| Parameter | Value |
|---|---|
| Process | TSMC 4NP (custom) |
| Dies | 2 x ~800 mm2 (reticle-limit) |
| Transistors | ~208B total (~104B per die) |
| SMs | 180 (of 192) |
| Tensor Cores | 720 (5th gen) |
| FP4 TOPS | 4,500 |
| FP8/INT8 TOPS | 2,250 |
| BF16 TFLOPS | 1,500 |
| L2 Cache | 96 MB (2x Hopper) |
| HBM3e | 192 GB, 8 TB/s |
| NVLink 5 | 1.8 TB/s (18 links) |
| Die-to-die (NV-HBI) | 10 TB/s |
| TDP | 1000 W |
Two-Die Yield Math
- A monolithic ~1,600 mm2 Blackwell is not manufacturable: it would exceed the ~858 mm2 EUV reticle limit.
- Poisson yield at D0 ~0.09/cm2: each ~800 mm2 die yields e^(-0.09 x 8.0) = 49%; a hypothetical 1,600 mm2 monolith would yield e^(-0.09 x 16.0) = 24%.
- Known-good-die testing before CoWoS-L assembly, plus independent binning for lower SKUs (B200A), gives much better flexibility (see the sketch below).
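A minimal sketch of the Poisson yield model used above, assuming the document's D0 of ~0.09 defects/cm2 (the function name is illustrative):

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float = 0.09) -> float:
    """Poisson die-yield model: Y = exp(-D0 * A), with area converted from mm^2 to cm^2."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

print(f"~800 mm2 die:           {poisson_yield(800):.0%}")    # ~49% per die
print(f"~1600 mm2 monolith:     {poisson_yield(1600):.0%}")   # ~24% (and it exceeds the reticle anyway)
print(f"both dies of a pair OK: {poisson_yield(800)**2:.0%}") # ~24% raw, but dies are tested before
# CoWoS-L assembly (known good die), so bad silicon never consumes an expensive interposer,
# and off-spec dies can still be binned into lower SKUs.
```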
CoWoS-L interposer ~2500 mm2 with Local Silicon Interconnect bridges at ~10 um pitch. Die-to-die energy: ~0.5 pJ/bit (vs ~5-7 pJ/bit off-package NVLink).
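A quick cross-check of those interconnect energy figures, assuming full link utilization and the pJ/bit values quoted above: at 10 TB/s the NV-HBI link lands around 40 W, in line with the ~50 W line item in the power breakdown further down, while the same traffic over an off-package NVLink-class PHY would cost roughly ten times more.

```python
# Energy/bit x sustained bandwidth = link power (rough estimate, assumes full utilization)
NVHBI_BW_TBPS     = 10.0   # die-to-die bandwidth
NVHBI_PJ_PER_BIT  = 0.5    # on-package D2D signaling (figure from the text)
NVLINK_PJ_PER_BIT = 6.0    # midpoint of the ~5-7 pJ/bit off-package range

bits_per_s = NVHBI_BW_TBPS * 1e12 * 8
print(f"NV-HBI @ 10 TB/s:              {bits_per_s * NVHBI_PJ_PER_BIT * 1e-12:.0f} W")   # ~40 W
print(f"same BW over NVLink-class PHY: {bits_per_s * NVLINK_PJ_PER_BIT * 1e-12:.0f} W")  # ~480 W
```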
Memory System
| Parameter | B200 (HBM3e) | H100 (HBM3) | A100 (HBM2e) |
|---|---|---|---|
| Stacks | 8 | 5 | 5 |
| Capacity | 192 GB | 80 GB | 80 GB |
| Bandwidth | 8.0 TB/s | 3.35 TB/s | 2.0 TB/s |
| Energy/bit | ~4 pJ/bit | ~5 pJ/bit | ~7 pJ/bit |
At ~4 pJ/bit, a 16-bit read costs ~64 pJ (64,000 fJ). Compared with ~63 fJ per FP16 FMA, one HBM3e access costs roughly 1,000 FMAs. The Horowitz data-movement-versus-compute gap holds even at the bleeding edge.
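The same arithmetic in code, using the ~4 pJ/bit and ~63 fJ/FMA figures above (per-access energy here ignores activation and controller overhead):

```python
HBM3E_PJ_PER_BIT = 4.0
FP16_FMA_FJ      = 63.0

read16_fJ = 16 * HBM3E_PJ_PER_BIT * 1000                       # 16-bit read: ~64,000 fJ
print(f"FMAs per 16-bit HBM3e read: {read16_fJ / FP16_FMA_FJ:.0f}")  # ~1,000

# Cross-check against the power budget: full 8 TB/s of HBM3e traffic
hbm_W = 8e12 * 8 * HBM3E_PJ_PER_BIT * 1e-12
print(f"HBM3e power at full 8 TB/s: {hbm_W:.0f} W")  # ~256 W vs the ~200 W line item (utilization < 100%)
```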
5th Gen Tensor Core: FP4 Deep Dive
FP4 (E2M1): 1 sign bit, 2 exponent bits, 1 mantissa bit, so only 16 code points. The multiplier datapath is tiny: a 2-bit x 2-bit significand product (a handful of gates) plus a small exponent add.
| Precision | Energy/FMA (silicon limit) | Energy/FMA (B200 whole-chip) | Gap |
|---|---|---|---|
| FP16 | ~63 fJ | ~667 fJ | 10.6x |
| FP8 | ~25 fJ | ~444 fJ | 17.8x |
| FP4 | ~10 fJ | ~222 fJ | 22.2x |
The 10-22x gap between silicon limit and whole-chip is almost entirely data movement and control overhead: register-file and operand routing, scheduling, SRAM and HBM traffic. This is the design space for a purpose-built inference ASIC.
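The whole-chip column is simply TDP divided by peak dense throughput from the reference card; a sketch that reproduces the table, following the document's convention of quoting energy per op at 100% peak utilization:

```python
TDP_W = 1000.0
peak_tops        = {"FP16/BF16": 1500, "FP8": 2250, "FP4": 4500}  # dense, from the reference card
silicon_limit_fJ = {"FP16/BF16": 63,   "FP8": 25,   "FP4": 10}    # scaled-datapath estimates from the text

for prec, tops in peak_tops.items():
    whole_chip_fJ = TDP_W / (tops * 1e12) * 1e15   # J/op at peak -> fJ/op
    gap = whole_chip_fJ / silicon_limit_fJ[prec]
    print(f"{prec:10s} {whole_chip_fJ:4.0f} fJ/op  gap = {gap:4.1f}x")
# FP16/BF16 ~667 fJ (10.6x), FP8 ~444 fJ (17.8x), FP4 ~222 fJ (22.2x)
```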
Roofline Analysis
| Precision | Peak TOPS | BW (TB/s) | Ridge Point (Ops/Byte) |
|---|---|---|---|
| FP4 | 4,500 | 8.0 | 562 |
| FP8 | 2,250 | 8.0 | 281 |
| BF16 | 1,500 | 8.0 | 187 |
Blackwell’s FP8 ridge point (281 ops/byte) is much lower than Hopper’s (590), so B200 becomes compute-bound at a lower arithmetic intensity and is better balanced for inference.
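Ridge points fall out of peak throughput over memory bandwidth. A minimal sketch with the B200 numbers from the reference card; the H100 line assumes ~1,979 dense FP8 TOPS and 3.35 TB/s, which reproduces the 590 ops/byte quoted above:

```python
def ridge_point(peak_tops: float, bw_TBps: float) -> float:
    """Arithmetic intensity (ops/byte) at which a kernel shifts from bandwidth- to compute-bound."""
    # (peak_tops * 1e12 ops/s) / (bw_TBps * 1e12 bytes/s): the 1e12 factors cancel
    return peak_tops / bw_TBps

for prec, tops in [("FP4", 4500), ("FP8", 2250), ("BF16", 1500)]:
    print(f"B200 {prec:4s}: {ridge_point(tops, 8.0):.0f} ops/byte")   # 562 / 281 / 187
print(f"H100 FP8 : {ridge_point(1979, 3.35):.0f} ops/byte")          # ~590
```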
LLM Decode: Bandwidth-Bound Reality
70B model at FP16 (140 GB weights):
| GPU | BW | Min Decode Latency (B=1) | Tokens/sec | Utilization |
|---|---|---|---|---|
| B200 | 8.0 TB/s | 17.5 ms | 57 | 18.7% |
| H100 | 3.35 TB/s | 41.8 ms | 24 | 28% |
| A100 | 2.0 TB/s | 70 ms | 14 | — |
At FP4 weights (35 GB): B200 decode = 4.4 ms = 228 tokens/sec at B=1.
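These decode numbers are weight-streaming lower bounds: at batch 1 every generated token must read all the weights once, so latency is at least weight bytes divided by memory bandwidth. A sketch that reproduces the table (KV-cache and activation traffic ignored):

```python
def decode_floor(params_billion: float, bytes_per_param: float, bw_TBps: float):
    """Bandwidth-bound lower bound on per-token decode latency (ms) and tokens/sec at batch 1."""
    weight_GB = params_billion * bytes_per_param
    latency_ms = weight_GB / (bw_TBps * 1000) * 1000   # GB / (GB/s) -> s -> ms
    return latency_ms, 1000 / latency_ms

for gpu, bw in [("B200", 8.0), ("H100", 3.35), ("A100", 2.0)]:
    ms, tps = decode_floor(70, 2.0, bw)                 # 70B weights @ FP16
    print(f"{gpu} FP16: {ms:5.1f} ms  {tps:3.0f} tok/s")
ms, tps = decode_floor(70, 0.5, 8.0)                    # 70B weights @ FP4 on B200
print(f"B200 FP4 : {ms:5.1f} ms  {tps:3.0f} tok/s")     # ~4.4 ms, ~228 tok/s
```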
Compute-bound crossover batch sizes:
- FP16: B=430
- FP8: B=560
- FP4: B=1,125
Even at batch 500, B200 is still memory-bandwidth-bound. This is the fundamental physics.
Power Efficiency: GPU vs Custom ASIC Opportunity
| Component | B200 Power | % |
|---|---|---|
| Compute (SMs + TCs) | ~450 W | 45% |
| HBM3e (8 stacks) | ~200 W | 20% |
| NV-HBI (die-to-die) | ~50 W | 5% |
| NVLink I/O | ~100 W | 10% |
| L2 + NoC | ~80 W | 8% |
| Memory controllers | ~50 W | 5% |
| Misc | ~70 W | 7% |
At FP8: H100 is actually MORE power-efficient per op (354 fJ vs 444 fJ) than B200. Two-die + 8x HBM3e adds power. B200 wins on absolute throughput and bandwidth.
Inference ASIC Opportunity
A purpose-built inference ASIC at 4nm could potentially achieve ~50-100 fJ/FMA at FP8 whole-chip (vs B200’s 444 fJ) by:
- Eliminating GPU generality (no warp scheduler, CUDA cores, register file) — saves 30-40% area/power
- Larger on-chip SRAM (like Groq’s 230 MB) — keep weights on-chip
- Dataflow architecture — no register file round-trips
- Specialized memory access — exploit transformer structure
- Aggressive clock/power gating — predictable compute patterns
Estimated 4-9x efficiency advantage over the GPU (444 fJ / 100 fJ up to 444 fJ / 50 fJ), consistent with historical ASIC-vs-GPU ratios.
NVL72: The Mega-System
72 Blackwell GPUs + 36 Grace CPUs in one rack:
- 13.8 TB HBM3e total
- 324 PFLOPS FP4 aggregate
- 576 TB/s aggregate memory bandwidth
- 130 TB/s aggregate NVLink bandwidth (72 x 1.8 TB/s) via 18 NVSwitch chips
- Holds a 7T-parameter model at FP4 in 3.5 TB of weights, leaving ~10 TB of HBM for KV cache and activations
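The rack-level figures are straight multiples of the per-GPU reference card; a small sketch of the aggregation (per-GPU values as assumed above):

```python
GPUS = 72
per_gpu = {"HBM_GB": 192, "FP4_TOPS": 4500, "BW_TBps": 8.0}

print(f"HBM3e total:      {GPUS * per_gpu['HBM_GB'] / 1000:.1f} TB")        # ~13.8 TB
print(f"FP4 aggregate:    {GPUS * per_gpu['FP4_TOPS'] / 1000:.0f} PFLOPS")  # 324 PFLOPS
print(f"Aggregate mem BW: {GPUS * per_gpu['BW_TBps']:.0f} TB/s")            # 576 TB/s
print(f"7T params @ FP4:  {7e12 * 0.5 / 1e12:.1f} TB of weights")           # 3.5 TB, fits easily
```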
Competitive Positioning
| vs | B200 Advantage | B200 Disadvantage |
|---|---|---|
| AMD MI300X | 8.0 vs 5.3 TB/s BW, 1.8 vs 0.9 TB/s NVLink | MI300X has more FP8 TOPS (2615 vs 2250) |
| Google TPU v5p | Higher absolute throughput | 2.5x more power, less efficient per op |
| Groq LPU | Larger model support (192 GB vs 230 MB SRAM) | Groq wins on latency (no HBM) |
| Cerebras WSE-3 | Cost, ecosystem | Cerebras wins for models that fit 44 GB SRAM |
| Apple M4 Ultra | 10x higher throughput | 6.7x more power, different design point |
Key Numbers for Your Projects
ChipletCostModel: B200 = 2 x ~800 mm2 compute dies + CoWoS-L + 8x HBM3e. Estimated BoM ~$30K.
RooflineVM: Ridge points at FP4/FP8/BF16 = 562/281/187 ops/byte. L2=96MB, HBM=8 TB/s.
InferBench: Use B200 as primary GPU baseline. Decode at B=1 = 18.7% util (FP16).
AI Hardware Keynote Update: FP4 FMA silicon limit ~10 fJ, B200 whole-chip ~222 fJ. Gap = 22x.
See also: Blackwell Supply Chain, TrtLLMGen MoE Kernels (SM100-targeting), TSMC N2 Economics, Inference Stack Synthesis