NVIDIA Blackwell GPU Architecture: Deep Technical Analysis
B200 Reference Card
| Parameter | Value |
|---|---|
| Process | TSMC 4NP (custom) |
| Dies | 2 x ~800 mm2 (reticle-limit) |
| Transistors | ~208B total (~104B per die) |
| SMs | 180 (of 192) |
| Tensor Cores | 720 (5th gen) |
| FP4 TOPS | 4,500 |
| FP8/INT8 TOPS | 2,250 |
| BF16 TFLOPS | 1,500 |
| L2 Cache | 96 MB (2x Hopper) |
| HBM3e | 192 GB, 8 TB/s |
| NVLink 5 | 1.8 TB/s (18 links) |
| Die-to-die (NV-HBI) | 10 TB/s |
| TDP | 1000 W |
Two-Die Yield Math
- A monolithic ~1,600 mm2 Blackwell is not manufacturable: it would exceed the ~858 mm2 EUV reticle limit.
- Poisson yield at D0 ~0.09/cm2: each ~800 mm2 die yields e^(-0.09 x 8.0) = 49%; a hypothetical 1,600 mm2 monolith would yield e^(-0.09 x 16.0) = 24%.
- Known-good-die testing before CoWoS-L assembly, plus independent binning for lower SKUs (B200A), gives much better flexibility (see the sketch below).
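A minimal sketch of the Poisson yield model used above, assuming the document's D0 of ~0.09 defects/cm2 (the function name is illustrative):

```python
import math

def poisson_yield(area_mm2: float, d0_per_cm2: float = 0.09) -> float:
    """Poisson die-yield model: Y = exp(-D0 * A), with area converted from mm^2 to cm^2."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

print(f"~800 mm2 die:           {poisson_yield(800):.0%}")    # ~49% per die
print(f"~1600 mm2 monolith:     {poisson_yield(1600):.0%}")   # ~24% (and it exceeds the reticle anyway)
print(f"both dies of a pair OK: {poisson_yield(800)**2:.0%}") # ~24% raw, but dies are tested before
# CoWoS-L assembly (known good die), so bad silicon never consumes an expensive interposer,
# and off-spec dies can still be binned into lower SKUs.
```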
CoWoS-L interposer ~2500 mm2 with Local Silicon Interconnect bridges at ~10 um pitch. Die-to-die energy: ~0.5 pJ/bit (vs ~5-7 pJ/bit off-package NVLink).
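A quick cross-check of those interconnect energy figures, assuming full link utilization and the pJ/bit values quoted above: at 10 TB/s the NV-HBI link lands around 40 W, in line with the ~50 W line item in the power breakdown further down, while the same traffic over an off-package NVLink-class PHY would cost roughly ten times more.

```python
# Energy/bit x sustained bandwidth = link power (rough estimate, assumes full utilization)
NVHBI_BW_TBPS     = 10.0   # die-to-die bandwidth
NVHBI_PJ_PER_BIT  = 0.5    # on-package D2D signaling (figure from the text)
NVLINK_PJ_PER_BIT = 6.0    # midpoint of the ~5-7 pJ/bit off-package range

bits_per_s = NVHBI_BW_TBPS * 1e12 * 8
print(f"NV-HBI @ 10 TB/s:              {bits_per_s * NVHBI_PJ_PER_BIT * 1e-12:.0f} W")   # ~40 W
print(f"same BW over NVLink-class PHY: {bits_per_s * NVLINK_PJ_PER_BIT * 1e-12:.0f} W")  # ~480 W
```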
Memory System
| Parameter | B200 (HBM3e) | H100 (HBM3) | A100 (HBM2e) |
|---|---|---|---|
| Stacks | 8 | 5 | 5 |
| Capacity | 192 GB | 80 GB | 80 GB |
| Bandwidth | 8.0 TB/s | 3.35 TB/s | 2.0 TB/s |
| Energy/bit | ~4 pJ/bit | ~5 pJ/bit | ~7 pJ/bit |
At ~4 pJ/bit, a 16-bit read costs ~64 pJ (64,000 fJ). Compared with ~63 fJ per FP16 FMA, one HBM3e access costs roughly 1,000 FMAs. The Horowitz data-movement-versus-compute gap holds even at the bleeding edge.
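The same arithmetic in code, using the ~4 pJ/bit and ~63 fJ/FMA figures above (per-access energy here ignores activation and controller overhead):

```python
HBM3E_PJ_PER_BIT = 4.0
FP16_FMA_FJ      = 63.0

read16_fJ = 16 * HBM3E_PJ_PER_BIT * 1000                       # 16-bit read: ~64,000 fJ
print(f"FMAs per 16-bit HBM3e read: {read16_fJ / FP16_FMA_FJ:.0f}")  # ~1,000

# Cross-check against the power budget: full 8 TB/s of HBM3e traffic
hbm_W = 8e12 * 8 * HBM3E_PJ_PER_BIT * 1e-12
print(f"HBM3e power at full 8 TB/s: {hbm_W:.0f} W")  # ~256 W vs the ~200 W line item (utilization < 100%)
```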
5th Gen Tensor Core: FP4 Deep Dive
FP4 (E2M1): 1 sign bit, 2 exponent bits, 1 mantissa bit, so only 16 code points. The multiplier datapath is tiny: a 2-bit x 2-bit significand product (a handful of gates) plus a small exponent add.
| Precision | Energy/FMA (silicon limit) | Energy/FMA (B200 whole-chip) | Gap |
|---|---|---|---|
| FP16 | ~63 fJ | ~667 fJ | 10.6x |
| FP8 | ~25 fJ | ~444 fJ | 17.8x |
| FP4 | ~10 fJ | ~222 fJ | 22.2x |
The 10-22x gap between silicon limit and whole-chip is almost entirely data movement and control overhead: register-file and operand routing, scheduling, SRAM and HBM traffic. This is the design space for a purpose-built inference ASIC.
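The whole-chip column is simply TDP divided by peak dense throughput from the reference card; a sketch that reproduces the table, following the document's convention of quoting energy per op at 100% peak utilization:

```python
TDP_W = 1000.0
peak_tops        = {"FP16/BF16": 1500, "FP8": 2250, "FP4": 4500}  # dense, from the reference card
silicon_limit_fJ = {"FP16/BF16": 63,   "FP8": 25,   "FP4": 10}    # scaled-datapath estimates from the text

for prec, tops in peak_tops.items():
    whole_chip_fJ = TDP_W / (tops * 1e12) * 1e15   # J/op at peak -> fJ/op
    gap = whole_chip_fJ / silicon_limit_fJ[prec]
    print(f"{prec:10s} {whole_chip_fJ:4.0f} fJ/op  gap = {gap:4.1f}x")
# FP16/BF16 ~667 fJ (10.6x), FP8 ~444 fJ (17.8x), FP4 ~222 fJ (22.2x)
```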
Roofline Analysis
| Precision | Peak TOPS | BW (TB/s) | Ridge Point (Ops/Byte) |
|---|---|---|---|
| FP4 | 4,500 | 8.0 | 562 |
| FP8 | 2,250 | 8.0 | 281 |
| BF16 | 1,500 | 8.0 | 187 |
Blackwell’s FP8 ridge point (281 ops/byte) is much lower than Hopper’s (590), so B200 becomes compute-bound at a lower arithmetic intensity and is better balanced for inference.
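Ridge points fall out of peak throughput over memory bandwidth. A minimal sketch with the B200 numbers from the reference card; the H100 line assumes ~1,979 dense FP8 TOPS and 3.35 TB/s, which reproduces the 590 ops/byte quoted above:

```python
def ridge_point(peak_tops: float, bw_TBps: float) -> float:
    """Arithmetic intensity (ops/byte) at which a kernel shifts from bandwidth- to compute-bound."""
    # (peak_tops * 1e12 ops/s) / (bw_TBps * 1e12 bytes/s): the 1e12 factors cancel
    return peak_tops / bw_TBps

for prec, tops in [("FP4", 4500), ("FP8", 2250), ("BF16", 1500)]:
    print(f"B200 {prec:4s}: {ridge_point(tops, 8.0):.0f} ops/byte")   # 562 / 281 / 187
print(f"H100 FP8 : {ridge_point(1979, 3.35):.0f} ops/byte")          # ~590
```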
LLM Decode: Bandwidth-Bound Reality
70B model at FP16 (140 GB weights):
| GPU | BW | Min Decode Latency (B=1) | Tokens/sec | Utilization |
|---|---|---|---|---|
| B200 | 8.0 TB/s | 17.5 ms | 57 | 18.7% |
| H100 | 3.35 TB/s | 41.8 ms | 24 | 28% |
| A100 | 2.0 TB/s | 70 ms | 14 | — |
At FP4 weights (35 GB): B200 decode = 4.4 ms = 228 tokens/sec at B=1.
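These decode numbers are weight-streaming lower bounds: at batch 1 every generated token must read all the weights once, so latency is at least weight bytes divided by memory bandwidth. A sketch that reproduces the table (KV-cache and activation traffic ignored):

```python
def decode_floor(params_billion: float, bytes_per_param: float, bw_TBps: float):
    """Bandwidth-bound lower bound on per-token decode latency (ms) and tokens/sec at batch 1."""
    weight_GB = params_billion * bytes_per_param
    latency_ms = weight_GB / (bw_TBps * 1000) * 1000   # GB / (GB/s) -> s -> ms
    return latency_ms, 1000 / latency_ms

for gpu, bw in [("B200", 8.0), ("H100", 3.35), ("A100", 2.0)]:
    ms, tps = decode_floor(70, 2.0, bw)                 # 70B weights @ FP16
    print(f"{gpu} FP16: {ms:5.1f} ms  {tps:3.0f} tok/s")
ms, tps = decode_floor(70, 0.5, 8.0)                    # 70B weights @ FP4 on B200
print(f"B200 FP4 : {ms:5.1f} ms  {tps:3.0f} tok/s")     # ~4.4 ms, ~228 tok/s
```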
Compute-bound crossover batch sizes:
- FP16: B=430
- FP8: B=560
- FP4: B=1,125
Even at batch 500, B200 is still memory-bandwidth-bound. This is the fundamental physics.
Power Efficiency: GPU vs Custom ASIC Opportunity
| Component | B200 Power | % |
|---|---|---|
| Compute (SMs + TCs) | ~450 W | 45% |
| HBM3e (8 stacks) | ~200 W | 20% |
| NV-HBI (die-to-die) | ~50 W | 5% |
| NVLink I/O | ~100 W | 10% |
| L2 + NoC | ~80 W | 8% |
| Memory controllers | ~50 W | 5% |
| Misc | ~70 W | 7% |
At FP8: H100 is actually MORE power-efficient per op (354 fJ vs 444 fJ) than B200. Two-die + 8x HBM3e adds power. B200 wins on absolute throughput and bandwidth.
Inference ASIC Opportunity
A purpose-built inference ASIC at 4nm could potentially achieve ~50-100 fJ/FMA at FP8 whole-chip (vs B200’s 444 fJ) by:
- Eliminating GPU generality (no warp scheduler, CUDA cores, register file) — saves 30-40% area/power
- Larger on-chip SRAM (like Groq’s 230 MB) — keep weights on-chip
- Dataflow architecture — no register file round-trips
- Specialized memory access — exploit transformer structure
- Aggressive clock/power gating — predictable compute patterns
Estimated 4-9x efficiency advantage over the GPU (444 fJ / 100 fJ up to 444 fJ / 50 fJ), consistent with historical ASIC-vs-GPU ratios.
NVL72: The Mega-System
72 Blackwell GPUs + 36 Grace CPUs in one rack:
- 13.8 TB HBM3e total
- 324 PFLOPS FP4 aggregate
- 576 TB/s aggregate memory bandwidth
- 130 TB/s aggregate NVLink bandwidth (72 x 1.8 TB/s) via 18 NVSwitch chips
- Holds a 7T-parameter model at FP4 in 3.5 TB of weights, leaving ~10 TB of HBM for KV cache and activations
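The rack-level figures are straight multiples of the per-GPU reference card; a small sketch of the aggregation (per-GPU values as assumed above):

```python
GPUS = 72
per_gpu = {"HBM_GB": 192, "FP4_TOPS": 4500, "BW_TBps": 8.0}

print(f"HBM3e total:      {GPUS * per_gpu['HBM_GB'] / 1000:.1f} TB")        # ~13.8 TB
print(f"FP4 aggregate:    {GPUS * per_gpu['FP4_TOPS'] / 1000:.0f} PFLOPS")  # 324 PFLOPS
print(f"Aggregate mem BW: {GPUS * per_gpu['BW_TBps']:.0f} TB/s")            # 576 TB/s
print(f"7T params @ FP4:  {7e12 * 0.5 / 1e12:.1f} TB of weights")           # 3.5 TB, fits easily
```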
Competitive Positioning
| vs | B200 Advantage | B200 Disadvantage |
|---|---|---|
| AMD MI300X | 8.0 vs 5.3 TB/s BW, 1.8 vs 0.9 TB/s NVLink | MI300X has more FP8 TOPS (2615 vs 2250) |
| Google TPU v5p | Higher absolute throughput | 2.5x more power, less efficient per op |
| Groq LPU | Larger model support (192 GB vs 230 MB SRAM) | Groq wins on latency (no HBM) |
| Cerebras WSE-3 | Cost, ecosystem | Cerebras wins for models that fit 44 GB SRAM |
| Apple M4 Ultra | 10x higher throughput | 6.7x more power, different design point |
Key Numbers for Your Projects
ChipletCostModel: B200 = 2 x ~800 mm2 compute dies + CoWoS-L + 8x HBM3e. Estimated BoM ~$30K.
RooflineVM: Ridge points at FP4/FP8/BF16 = 562/281/187 ops/byte. L2=96MB, HBM=8 TB/s.
InferBench: Use B200 as primary GPU baseline. Decode at B=1 = 18.7% util (FP16).
AI Hardware Keynote Update: FP4 FMA silicon limit ~10 fJ, B200 whole-chip ~222 fJ. Gap = 22x.
See also: Blackwell Supply Chain, TrtLLMGen MoE Kernels (SM100-targeting), TSMC N2 Economics, Inference Stack Synthesis