nvidia’s blackwell B200 is a $30-40K, 1000-watt, two-die GPU that nvidia claims is “the world’s most powerful chip.” i went through the specs, ran the math, and read bjarke roune’s book on AI chip design to figure out what’s actually going on inside this thing.
here’s my breakdown.
the B200 reference card
| parameter | value |
|---|---|
| process | TSMC N4P |
| dies | 2 × ~400 mm² (CoWoS-L) |
| transistors | ~92B total |
| SMs | 180 (of 192) |
| tensor cores | 720 (5th gen, 64×64) |
| FP4 TOPS | 4,500 |
| FP8 TOPS | 2,250 |
| BF16 TFLOPS | 1,500 |
| L2 cache | 96 MB (2× Hopper) |
| HBM3e | 192 GB, 8 TB/s (8 stacks) |
| NVLink 5 | 1.8 TB/s (18 links) |
| die-to-die (NV-HBI) | 10 TB/s |
| TDP | 1000 W |
where blackwell’s transistors actually go
there’s a claim that floats around AI hardware twitter every few months, usually accompanied by a smug “this is why ASICs will eat nvidia’s lunch.” it goes like this: only 3.3% of an H100’s transistors are dedicated to matrix multiplication. the other 96.7% is overhead. waste. bloat. general-purpose GPU tax.
and look — the arithmetic checks out. i’ll walk through it myself so you can see. the H100 has 528 tensor cores. each tensor core does 512 FMA operations per cycle. a single FMA costs about 10,000 transistors at the gate level when you include the surrounding mux logic and pipeline registers:
528 tensor cores × 512 FMAs × 10,000 transistors ≈ 2.7 billion transistors
the H100 has ~80 billion transistors total. 2.7B / 80B = 3.4%. there it is. 96.6% of your $30,000 chip is “not doing math.”
now the same exercise for blackwell. the B200 has 720 tensor cores (5th gen, 64×64 output per cycle at FP8 = 4096 FMAs per core). at FP8, each FMA needs fewer transistors — call it ~6,000 for the reduced-precision datapath:
720 tensor cores × 4096 FMAs × 6,000 transistors ≈ 17.7 billion transistors
on ~92 billion total, that’s 19.2%. already much better than the H100 number, partly because blackwell’s tensor cores are genuinely bigger (64×64 vs 16×16 scheduling blocks), and partly because FP8 datapaths are more transistor-dense per useful operation.
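the arithmetic above is easy to re-run yourself. a quick sketch — the per-FMA transistor counts (~10,000 at FP16-era precision, ~6,000 at FP8) are the rough estimates from the text, not die-shot measurements:

```python
# what fraction of the chip's transistors sit in tensor-core FMA datapaths?
# the per-FMA transistor costs are rough estimates, not measured values.
def matmul_transistor_fraction(tensor_cores, fmas_per_core,
                               transistors_per_fma, total_transistors):
    """fraction of total transistors dedicated to matmul FMA datapaths."""
    return tensor_cores * fmas_per_core * transistors_per_fma / total_transistors

h100 = matmul_transistor_fraction(528, 512, 10_000, 80e9)   # ~3.4%
b200 = matmul_transistor_fraction(720, 4096, 6_000, 92e9)   # ~19.2%
print(f"H100: {h100:.1%}, B200: {b200:.1%}")
```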
but here’s the thing: both of these percentages are correct and useless.
bjarke roune — who was the software lead for TPUv3 at google, so he’s actually designed the compiler for one of these chips — has the right framing. the question isn’t “what percentage of transistors do math?” the question is “what could an ASIC actually eliminate?”
let me walk through where blackwell’s ~92B transistors live:
L2 cache: ~16-20B transistors. blackwell has 96 MB of L2, doubled from hopper. with tag arrays, ECC, coherency logic, and the crossbar connecting 180 SMs, you’re looking at 55-60 mm² of silicon across both dies. does an inference ASIC need this? yes. you need to buffer KV cache, store activations between layers, handle attention scores. google’s TPUs have massive on-chip memory. groq’s entire thesis is “put 230 MB of SRAM on-chip.” the SRAM bitcells don’t go away — maybe you simplify the cache logic, but the memory stays.
L1 / shared memory: ~10-12B transistors. 180 SMs × 228 KB = ~40 MB total. an ASIC needs local storage near the compute. you can simplify configurability, but the SRAM stays.
register files: ~8-10B transistors. 180 SMs × 256 KB per register file = ~45 MB of multi-ported SRAM. this is legitimately reclaimable — a systolic array with fixed dataflow doesn’t need 65K general-purpose registers per compute unit.
HBM3e controllers: ~3-4B transistors. 8 channels of HBM3e with PHY, controller logic, ECC. non-negotiable for any chip that talks to HBM.
NVLink 5 SerDes: ~4-5B transistors. 18 links with analog transceivers, PLLs, CDR circuits. reclaimable only if you give up multi-chip scaling.
NV-HBI die-to-die: ~2-3B transistors. specific to the chiplet design. a monolithic ASIC doesn’t need it, but monolithic at 800 mm² also has terrible yield.
what’s actually reclaimable:
| component | transistors | ASIC needs it? |
|---|---|---|
| warp schedulers | ~3-4B | no |
| instruction fetch/decode | ~2-3B | no |
| CUDA cores (FP32/INT32) | ~4-5B | no |
| RT cores | ~2-3B | no |
| MIG logic | ~0.5-1B | no |
| register files (partial) | ~4-5B | reduced |
| total reclaimable | ~16-21B | — |
that’s 17-23% of die area. subtract 3-5% for ASIC-specific control logic (DMA engines, weight routing, configuration registers), and you net ~12-18% of die area savings.
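to make the netting explicit, here's the same accounting in code, using midpoints of my estimates from the table above:

```python
# midpoints of the reclaimable-transistor estimates (my estimates, not
# nvidia's numbers), in billions of transistors, netted against the
# ASIC-specific control logic an inference chip would add back.
reclaimable = {
    "warp schedulers": 3.5, "fetch/decode": 2.5, "CUDA cores": 4.5,
    "RT cores": 2.5, "MIG logic": 0.75, "register files (partial)": 4.5,
}
gross = sum(reclaimable.values()) / 92   # fraction of ~92B total
net = gross - 0.04                        # ~3-5% added back for DMA/routing/config
print(f"gross reclaimable ~{gross:.0%}, net ~{net:.0%}")
```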
here’s my honest version of the ASIC argument:
reclaim ~15% of die area + achieve ~80-90% utilization instead of GPU’s 30-40% average utilization. combined, that’s a real 3-5× efficiency advantage. but it’s not 30×, and it’s definitely not the “96.7% waste” story.
the utilization gap is the bigger factor. nvidia’s tensor cores spend most of their life waiting — waiting for data from HBM, waiting for the warp scheduler to issue instructions, waiting for a thread block. a fixed-function chip with deterministic dataflow keeps its systolic arrays fed consistently. that’s where the real ASIC advantage lives.
the two-die architecture
blackwell is two ~400 mm² dies on a CoWoS-L interposer. the yield math explains why:
- monolithic 800 mm²: yield = e^(-0.09 × 8.0) = 49%
- two dies at 400 mm²: yield per die = e^(-0.09 × 4.0) = 70%
- combined good-pair yield: 70% × 70% = 49%… wait, that’s the same?
not quite. the chiplet advantage isn’t just yield — it’s binning. nvidia can independently test and bin each die. a die with 1 defective SM out of 96 becomes a B200A or a lower SKU, rather than scrapping the entire 800 mm² monolithic chip. the effective yield of usable silicon is significantly higher.
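the yield figures come from the standard Poisson die-yield model. a sketch — the 80% salvage rate for partially defective dies is my illustrative assumption, not an nvidia figure:

```python
import math

def poisson_yield(defects_per_cm2, area_mm2):
    """classic Poisson die-yield model: Y = exp(-D0 * A)."""
    return math.exp(-defects_per_cm2 * area_mm2 / 100)

d0 = 0.09                               # defects/cm^2, as used above
mono = poisson_yield(d0, 800)           # ~49% monolithic
per_die = poisson_yield(d0, 400)        # ~70% per chiplet
pair = per_die ** 2                     # ~49% -- same as monolithic, as noted
# the binning win: a die with a few dead SMs ships as a lower SKU instead
# of being scrapped. assume (illustratively) 80% of defective dies salvage:
salvage_rate = 0.8
effective = per_die + (1 - per_die) * salvage_rate
print(f"mono {mono:.0%}, good pair {pair:.0%}, usable pair w/ binning {effective**2:.0%}")
```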
the dies connect via NV-HBI at ~10 TB/s, using Local Silicon Interconnect bridges at ~10 μm pitch. die-to-die energy: ~0.5 pJ/bit (vs ~5-7 pJ/bit for off-package NVLink), so inter-die communication is 10-14× more energy-efficient than chip-to-chip. this matters because 10 TB/s is 80 Tb/s: at NVLink energy levels, that link would burn ~400-560 W — more than half the chip's entire power budget, a non-starter. at 0.5 pJ/bit it's ~40 W. the cheap on-package link is what makes the two-die design workable at 1000 W.
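to sanity-check the link power, remember that 10 TB/s is 80 Tb/s:

```python
def link_power_watts(bandwidth_tb_per_s, pj_per_bit):
    """power to sustain a link at `bandwidth` terabytes/s at a given energy/bit."""
    bits_per_s = bandwidth_tb_per_s * 1e12 * 8
    return bits_per_s * pj_per_bit * 1e-12

on_package = link_power_watts(10, 0.5)   # NV-HBI energy: ~40 W
off_package = link_power_watts(10, 5.0)  # NVLink-class energy: ~400 W
print(f"{on_package:.0f} W on-package vs {off_package:.0f} W at off-package energy")
```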
5th gen tensor cores — FP4 and the energy hierarchy
FP4 (E2M1): 1 sign bit, 2 exponent bits, 1 mantissa bit. only 16 distinct values. the multiplier is essentially an AND gate + exponent add. it’s barely a computation.
| precision | energy/FMA (silicon limit) | energy/FMA (B200 whole-chip) | gap |
|---|---|---|---|
| FP16 | ~63 fJ | ~1,333 fJ | ~21× |
| FP8 | ~25 fJ | ~889 fJ | ~36× |
| FP4 | ~10 fJ | ~444 fJ | ~44× |
the whole-chip numbers are just 1000 W divided by peak FMA rate (an FMA counts as two ops in the TOPS figures — e.g. FP8: 1000 W ÷ 1.125×10¹⁵ FMA/s ≈ 889 fJ). the ~20-45× gap between silicon limit and whole-chip is ALL data movement overhead — reading operands from registers, routing through the memory hierarchy, scheduling on the warp scheduler. this is roune's core insight: the systolic array itself is efficient; everything around it is where the power goes.
at ~4 pJ/bit for HBM3e: a single 16-bit read = 64,000 fJ. compared to 63 fJ for an FP16 FMA: 1 HBM3e access costs the same energy as 1000 FMAs. the horowitz relationship holds at the bleeding edge.
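the same comparison as a few lines of arithmetic, using the ~4 pJ/bit and ~63 fJ figures above:

```python
# one HBM3e read vs one FP16 FMA at the silicon limit
hbm_pj_per_bit = 4.0
read_16bit_fj = hbm_pj_per_bit * 16 * 1000    # pJ -> fJ: 64,000 fJ per 16-bit read
fma_fp16_fj = 63.0                            # silicon-limit FP16 FMA
fmas_per_read = read_16bit_fj / fma_fp16_fj   # ~1,000 FMAs per memory access
print(f"1 HBM3e 16-bit read ≈ {fmas_per_read:.0f} FP16 FMAs")
```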
the memory system — why 192 GB still isn’t enough
nvidia doubled capacity from H100 (80 GB) to B200 (192 GB) and bandwidth from 3.35 to 8.0 TB/s. sounds great. let me do the math.
using roune’s KV cache formula for llama 70B (GQA with 8 KV heads, 80 layers, head_dim=128):
per-token KV = 2 × 80 × 8 × 128 × 1 byte (FP8) = 160 KB/token
160 kilobytes per token per sequence. now multiply by batch × context:
| scenario | batch | context | KV cache | + weights (FP8) | total | fits in 192 GB? |
|---|---|---|---|---|---|---|
| aggressive serving | 256 | 4,096 | 168 GB | 70 GB | 238 GB | no |
| conservative | 128 | 4,096 | 84 GB | 70 GB | 154 GB | yes (38 GB free) |
| long context | 64 | 8,192 | 84 GB | 70 GB | 154 GB | yes |
| high throughput | 512 | 2,048 | 168 GB | 70 GB | 238 GB | no |
the B200 can serve llama 70B at batch 128 with 4K context, or batch 64 with 8K. with 70 GB of weights loaded, 122 GB remains for KV cache — at ~655 MB per 4K-context sequence, you run out of memory past ~batch 185. and this is the smallest model anyone considers production-scale.
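the table above is easy to regenerate. a sketch of the capacity check (rounding lands slightly off the table's numbers depending on decimal vs binary gigabytes):

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=1):
    """roune-style KV formula: 2 (K and V) * layers * kv_heads * head_dim * bytes."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # 163,840 B = 160 KB

def fits_in_hbm(batch, context, weights_gb=70, hbm_gb=192):
    """does batch*context of KV cache plus FP8 weights fit in one B200?"""
    kv_gb = batch * context * kv_bytes_per_token() / 1e9
    return kv_gb + weights_gb <= hbm_gb, kv_gb

for batch, context in [(256, 4096), (128, 4096), (64, 8192), (512, 2048)]:
    ok, kv = fits_in_hbm(batch, context)
    print(f"batch {batch:>3} @ {context}: KV {kv:.0f} GB -> {'fits' if ok else 'OOM'}")
```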
for llama 405B at FP8? the weights alone are 405 GB. that’s 2.1× the entire HBM capacity. you need the NVL72.
HBM3e costs ~$3,000-3,800 just for memory on a ~$30K BoM chip. and it's still not big enough.
roune’s observation: companies keep buying more expensive HBM because they can’t reduce KV cache requirements without AI research breakthroughs. sparse attention could theoretically reduce effective sequence length from 1,000,000 to 1,000 — a 1000× reduction. but it’s still research-grade, not production-ready. nvidia’s answer is to just keep stacking more HBM. blackwell ultra will have 288 GB. the treadmill continues.
stress-testing the bandwidth
start with batch=1 decode for llama 70B:
FP16 weights (140 GB): 140 / 8,000 GB/s = 17.5 ms = 57 tok/s
FP8 weights (70 GB): 70 / 8,000 = 8.75 ms = 114 tok/s
FP4 weights (35 GB): 35 / 8,000 = 4.4 ms = 228 tok/s
at batch=1 FP8: arithmetic intensity = 140B ops / 70B bytes = 2 ops/byte. the ridge point is 2,250 TOPS / 8 TB/s = 281 ops/byte. we are 140× below the ridge point. the tensor cores are at 0.7% utilization. $30,000 of compute sitting idle, acting as an expensive memory-bandwidth pipe.
but here’s the real story. B200’s ridge point (281 ops/byte) is less than half of H100’s (590 ops/byte). blackwell transitions to compute-bound at a lower batch size. on H100, you needed batch ~295 to become compute-bound at FP8. on B200, you need batch ~141.
nvidia achieved this by scaling bandwidth faster than compute: H100→B200 compute went 1.14×, bandwidth went 2.39×. the bandwidth scaling outpaced compute by 2.1×. this is the right design choice for inference.
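the ridge-point arithmetic, runnable:

```python
def ridge_point(tops, tb_per_s):
    """arithmetic intensity (ops/byte) where compute and bandwidth balance."""
    return tops / tb_per_s

b200_ridge = ridge_point(2250, 8.0)    # FP8: ~281 ops/byte
h100_ridge = ridge_point(1979, 3.35)   # FP8: ~591 ops/byte
# batch-B FP8 decode has intensity ~2B ops/byte (each weight byte is read
# once and reused across the batch), so the compute-bound crossover is ridge/2:
b200_crossover = b200_ridge / 2        # ~141
h100_crossover = h100_ridge / 2        # ~295
print(f"B200 goes compute-bound at batch ~{b200_crossover:.0f}, "
      f"H100 at ~{h100_crossover:.0f}")
```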
but even at batch=128 — realistic production — are we compute-bound?
AI at batch=128 = 128 × 2 = 256 ops/byte
ridge point = 281 ops/byte
256 < 281. still bandwidth-bound. barely, but we are. and the batch=141 crossover at 4K context requires ~92 GB of KV cache + 70 GB weights = ~162 GB out of 192 GB. you hit the compute-bandwidth crossover with only ~30 GB of HBM to spare — close to the memory wall and the crossover at the same operating point.
this is not a coincidence. nvidia’s architects sized HBM capacity, bandwidth, and compute to all hit their walls at roughly the same operating point. good chip design — no resource dramatically over-provisioned. but no free lunch either.
decode — the bandwidth-bound reality
the conventional wisdom: “decode is memory-bandwidth-bound, therefore buy the chip with the most bandwidth.” nvidia’s marketing loves this framing. but the actual story is way more nuanced.
during autoregressive decode, generating one token requires reading ALL weights (for FF) and ALL relevant KV cache entries (for attention). the systolic array utilization problem is severe:
blackwell’s tensor cores are 64×64. with batch=1, the FF matmul is 1×8192 × 8192×8192. tiled into 64×64 blocks, you need N=64 rows to fill the array. you have N=1. utilization: 1/64 ≈ 1.6%.
98.4% of those 720 tensor cores are idle. roune calls decode “the more troublesome kind of inference.” i think that undersells it.
but there are three tricks that transform the picture:
trick 1: grouped query attention (GQA). llama 70B has 64 query heads sharing 8 KV heads → 8 query heads per KV set. N goes from 1 to 8. attention utilization: 8/64 = 12.5%. an 8× improvement from a model architecture choice, not hardware.
trick 2: speculative decoding. guess 3-4 tokens ahead with a draft model, verify all at once. N = 4 × 8 = 32. attention utilization: 32/64 = 50% (100% in theory, if your draft model can go 8 tokens deep).
trick 3: batching for FF. serve 64 users simultaneously → FF sees 64×8192 activation matrix. N=64. utilization: 100%.
note the asymmetry roune is careful to point out: batching helps FF but NOT attention. different users have different KV caches. you need GQA for attention utilization and batching for FF utilization — different tricks targeting different parts.
with the full stack (GQA + spec decode + batch=64), check whether we’re still bandwidth-bound:
FF layers: AI = 64 × 4 × 2 = 512 ops/byte. ridge = 281. compute-bound! decode, which everyone calls “memory-bandwidth-bound,” is compute-bound on FF once you batch and speculate properly.
attention layers: each decode step reads every sequence's full KV cache — 4,096 tokens × 160 KB ≈ 655 MB per sequence. with batch=64: ~42 GB of reads per generated token, ~5.2 ms at 8 TB/s. and because each sequence's KV cache is private, batching doesn't raise attention's arithmetic intensity the way it does for FF — it stays far below the ridge point. solidly bandwidth-bound.
FF is compute-bound, attention is bandwidth-bound. classic imbalance within a single forward pass.
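the three tricks compose on the array's N dimension. a sketch, using llama 70B's 8 KV heads (so 8 query heads share each KV set) and 4 speculative tokens:

```python
ARRAY_N = 64  # blackwell's tensor core scheduling block is 64x64

def n_dim_fill(rows):
    """fraction of the systolic array's N dimension filled by `rows` of activations."""
    return min(rows, ARRAY_N) / ARRAY_N

naive = n_dim_fill(1)          # bare batch=1 decode: ~1.6%
gqa = n_dim_fill(8)            # 8 query heads share one KV set: 12.5%
gqa_spec = n_dim_fill(4 * 8)   # + 4 speculative tokens per group: 50%
ff_batched = n_dim_fill(64)    # FF with 64 users batched: 100%
print(f"attention: {naive:.1%} -> {gqa:.1%} -> {gqa_spec:.0%}; FF: {ff_batched:.0%}")
```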
this is exactly where roune’s managed aggregation becomes critical: if you run both prefill (compute-bound) and decode on the same chip, the prefill work can overlap with decode’s bandwidth-bound attention. the memory bus serves KV reads while tensor cores crunch prefill tokens. neither subsystem is idle.
my honest take: blackwell’s decode story is legitimately strong, but only in a world where (a) models use aggressive GQA, (b) speculative decoding works for your workload, and (c) your serving stack implements managed aggregation. that’s a lot of ifs. the chip is capable of near-100% utilization. whether anyone gets there in production is a different question.
blackwell through roune’s lens
roune’s core insight: all AI hardware is systolic arrays with marketing names. tensor cores, MXUs, matrix cores — the entire industry converged on the same circuit. the interesting question is the size.
larger systolic arrays are inherently more efficient. double the vector width (N→2N): 4× math per cycle, but scalar overhead barely scales. nvidia’s own history:
| generation | year | tensor core size | FMAs/core/cycle |
|---|---|---|---|
| volta | 2017 | 4×4 | 16 |
| turing | 2018 | 16×16 | 256 |
| blackwell | 2025 | 64×64 | 4,096 |
but google started at 256×256 in 2016. even at 128×128, a TPU MXU has 16,384 FMAs/cycle vs blackwell’s 4,096. using roune’s scaling math:
- 64×64 (blackwell): overhead-to-compute ratio improves 16× vs 4×4 baseline
- 128×128 (TPU v4): improves 32×
- 256×256 (TPU v1): improves 64×
a TPU v4’s MXU has 2× better overhead ratio than a blackwell tensor core.
so why doesn’t nvidia build 256×256? the mono-sized problem.
every blackwell tensor core is 64×64. the same array handles:
- FF layers: K = 8192. tiles beautifully into 64×64. utilization ~95%+.
- attention: K = 128. gets 128/64 = 2 passes. decent.
- small attention heads: K = 16. utilization = 16/64 = 25%. painful.
now imagine a 256×256 TPU MXU doing attention at K=128: 128/256 = 0.5. can’t even fill the K dimension once. pad with zeros = waste half your compute.
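the mono-sized tension is easy to quantify: tile the matmul's inner dimension K onto a square array, pad any partial pass with zeros, and look at what fraction of the array does useful work:

```python
def k_utilization(k, array_dim):
    """useful fraction of a square systolic array when a matmul's inner
    dimension K is tiled onto it; partial passes are zero-padded."""
    full, rem = divmod(k, array_dim)
    if full == 0:
        return k / array_dim                      # can't fill the array once
    passes = full + (1 if rem else 0)
    return k / (passes * array_dim)

print(k_utilization(8192, 64))    # FF on blackwell: 1.0
print(k_utilization(128, 64))     # attention head_dim: 1.0 (two clean passes)
print(k_utilization(16, 64))      # small heads: 0.25
print(k_utilization(128, 256))    # attention on a 256x256 MXU: 0.5
```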
roune’s solution is elegant: build two types of cores. a small number of large arrays (256×256) for FF, a larger number of small arrays (64×64) for attention. both always doing what they’re best at.
nobody has shipped this. google sticks with large mono-sized MXUs and pays the attention tax. nvidia sticks with medium mono-sized tensor cores and pays the FF efficiency tax.
nvidia’s choice is defensible: CUDA ecosystem compatibility is the moat. introducing heterogeneous on-chip compute would break existing software and force a multi-year ecosystem migration. so nvidia ships 720 identical 64×64 tensor cores, lets CUDA and Flash Attention handle the tiling, relies on brute-force parallelism to smooth over utilization gaps, and charges $30-40K per chip.
at 1000W per chip, a 72-GPU rack draws ~72 kW of GPU power — roughly $50-100K/year in electricity at typical datacenter rates. the chips cost $2-3M per rack, so power is 2-5% of TCO annually. nvidia can afford to be 2-3× less efficient than a perfect ASIC and still win because their chips exist, their software works, and their supply chain delivers.
but if you’re designing from scratch — cerebras, groq, etched, or a stealth startup with a blank sheet — roune’s dual-core insight is the highest-leverage architectural idea in the space right now. my bet: google’s TPU v7 or v8 might be the first to try, given roune literally worked on their compiler.
NVL72 — the mega-system as inference platform
72 blackwell GPUs + 36 grace CPUs in a single rack:
- 13.8 TB HBM3e total
- 576 TB/s aggregate memory bandwidth
- 162 PFLOPS FP8 / 324 PFLOPS FP4
- 130 TB/s bisection bandwidth via 18 NVSwitch chips
that last number is the most underappreciated spec in the system.
capacity math: llama 405B at FP8 (405 GB weights) across 72 GPUs = 5.6 GB/GPU. remaining: 186 GB/GPU for KV cache = 13.4 TB total. at 160 KB/token: 83 million tokens of KV cache capacity in a single rack. batch 1000 at 8K context = 1.3 TB. less than 10% of available memory.
but does the interconnect keep up? for managed aggregation — assigning some GPUs to prefill, others to decode, dynamically — you need to transfer KV cache between them. one sequence at 4K context = 655 MB. for 1000 sequences/second: 655 GB/s of KV transfer.
at 130 TB/s bisection bandwidth: 655 GB/s = 0.5% of available bandwidth. the NVSwitch fabric makes KV cache transfer effectively free.
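both NVL72 calculations, runnable — 160 KB/token is the llama 70B figure from earlier, and treating 1 KB as 1,024 bytes is why the totals land slightly off the round numbers above:

```python
def nvl72_kv_token_capacity(weights_gb=405, gpus=72, hbm_per_gpu_gb=192,
                            kv_kb_per_token=160):
    """tokens of KV cache that fit after sharding the weights across the rack."""
    free_bytes = (gpus * hbm_per_gpu_gb - weights_gb) * 1e9
    return free_bytes / (kv_kb_per_token * 1024)

def kv_transfer_fraction(seqs_per_s=1000, context=4096,
                         kv_kb_per_token=160, bisection_tb_s=130):
    """share of NVSwitch bisection bandwidth used by prefill->decode KV handoff."""
    bytes_per_s = seqs_per_s * context * kv_kb_per_token * 1024
    return bytes_per_s / (bisection_tb_s * 1e12)

print(f"~{nvl72_kv_token_capacity() / 1e6:.0f}M tokens of KV capacity")
print(f"KV handoff uses {kv_transfer_fraction():.2%} of bisection bandwidth")
```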
this is the key insight: NVL72 is not primarily a bigger GPU cluster. it’s the infrastructure that makes flexible prefill/decode disaggregation practical. you can shift GPUs between prefill and decode on the fly, maintain roune’s optimal managed aggregation ratio, and keep both compute and bandwidth fully utilized.
compare to disaggregation across separate machines on InfiniBand NDR: 400 Gb/s (50 GB/s) per link, so ~400 GB/s even for a node with 8 NICs. that same 655 GB/s of KV transfer oversubscribes an entire node's worth of links. NVL72's internal fabric offers ~325× that node's bandwidth at ~300ns latency. qualitative difference.
the cost question: at today's ~$0.50/million-token market rates the math is tight; a $2-3M rack needs something closer to $1/million-token pricing to pay for itself in 3 years.
what i actually think
the B200 is a beautifully over-engineered compromise between silicon efficiency and software compatibility.
what nvidia got right:
- bandwidth scaling outpacing compute scaling (ridge point halved — the right move for inference)
- chiplet design at exactly the right die size for N4P yield economics
- NVL72 fabric bandwidth enabling managed aggregation at scale
- FP4 support as an escape hatch for bandwidth-bound decode
what nvidia left on the table:
- mono-sized 64×64 tensor cores when roune’s dual-core architecture could gain 2-3× efficiency
- 192 GB HBM3e runs out at modest batch sizes on a 70B model
- 1000W TDP when a purpose-built inference ASIC could potentially achieve the same throughput at 200-300W
competitive positioning:
| vs | B200 advantage | B200 disadvantage |
|---|---|---|
| AMD MI300X | 8.0 vs 5.3 TB/s BW | MI300X has more FP8 TOPS (2615 vs 2250) |
| Google TPU v5p | higher absolute throughput | 2.5× more power, less efficient per op |
| Groq LPU | larger model support (192 GB HBM vs ~230 MB SRAM per chip) | groq wins on latency |
| Cerebras WSE-3 | cost, ecosystem | cerebras wins if model fits 44 GB SRAM |
the history of computing is a history of “good enough” winning over “optimal.” nvidia knows this better than anyone. blackwell is good enough. the CUDA ecosystem is the moat. and until someone ships roune’s dual-core architecture with a working compiler and production-ready software stack, “good enough” will keep winning.
but that 2-3× efficiency gap is real money at datacenter scale. someone should build it.
this analysis draws heavily from bjarke roune’s “Designing AI Chip Software and Hardware” (2026), the best single document on AI chip design i’ve read. if you work on AI hardware, read it.