The Physics of Intelligence: AI Hardware from Atoms to Architectures
Source: Alan Ma’s AI Hardware Deep Dive keynote + companion decks
Core Thesis
Compute is nearly free. Data movement is everything.
Total Energy = sum(data movement) + epsilon(compute)
Every innovation that matters — systolic arrays, HBM, NVLink, block floating point, sparsity, KV-cache compression, SSMs — is about moving less data or moving it shorter distances.
Part I: Energy Limits
The Energy Stack (FP8 FMA at 5nm)
| Component | Energy | % of Total |
|---|---|---|
| Sign XOR | 0.05 fJ | 0.1% |
| Exponent add | 0.50 fJ | 0.8% |
| 4x4 mantissa multiply | 1.70 fJ | 2.7% |
| Normalize + round | 0.75 fJ | 1.2% |
| FP32 accumulator | 10.10 fJ | 16.0% |
| Clock + control | 6.30 fJ | 10.0% |
| Wire energy | 13.70 fJ | 21.7% |
| Register file R/W | 30.00 fJ | 47.5% |
| TOTAL | 63.1 fJ | 100% |
The actual multiply is 1.7 fJ (2.7% of the total). Moving data to and from the multiplier (wire plus register file) is 43.7 fJ (69%).
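A quick sketch that reproduces the table's arithmetic; the values are copied from the table above and the dictionary keys are just labels:

```python
# Reproduce the FP8 FMA energy-stack arithmetic from the table above.
# Values are the keynote's 5nm estimates (fJ per FMA).
ENERGY_FJ = {
    "sign_xor":          0.05,
    "exponent_add":      0.50,
    "mantissa_multiply": 1.70,
    "normalize_round":   0.75,
    "fp32_accumulator":  10.10,
    "clock_control":     6.30,
    "wire":              13.70,
    "register_file":     30.00,
}

total    = sum(ENERGY_FJ.values())                          # ~63.1 fJ
movement = ENERGY_FJ["wire"] + ENERGY_FJ["register_file"]   # 43.7 fJ
multiply = ENERGY_FJ["mantissa_multiply"]                   # 1.7 fJ

print(f"total         {total:5.1f} fJ")
print(f"data movement {movement:5.1f} fJ ({movement/total:5.1%})")
print(f"multiply      {multiply:5.1f} fJ ({multiply/total:5.1%})")
```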
Industry State
| Chip | Node | Chip-level fJ/FMA | Core-only fJ/FMA |
|---|---|---|---|
| H100 | 4N | 710 | 210-250 |
| B200 | 4NP | 220 | 70-100 |
| GB300 (proj.) | 3nm | ~120 | ~40-60 |
| TSMC test chip | N3E | — | 35 (published) |
Gap between theory and practice: only 2-3x. Tensor cores are shockingly close to physics.
The Hierarchy of Limits
- Landauer (thermodynamics): 0.001 fJ (irrelevant)
- Wire-limited floor: 30-50 fJ (the real wall)
- Full-custom CMOS: 50-80 fJ
- Production tensor core: 70-100 fJ
- Full chip (B200): 220 fJ
We are limited by wires, not thermodynamics.
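A sanity check on why Landauer sits at the bottom of the hierarchy; the ~350 erased bits per FMA is an illustrative assumption, not a measured figure:

```python
import math

k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # room temperature, K

landauer_per_bit_J  = k_B * T * math.log(2)       # ~2.87e-21 J per erased bit
landauer_per_bit_fJ = landauer_per_bit_J * 1e15   # ~2.9e-6 fJ

# Assumption for illustration: an FP8 FMA with FP32 accumulation erases on the
# order of a few hundred bits of intermediate state.
bits_erased  = 350
fma_floor_fJ = bits_erased * landauer_per_bit_fJ  # ~0.001 fJ

wire_floor_fJ = 30.0                              # low end of the wire-limited floor
print(f"Landauer floor per FMA : {fma_floor_fJ:.4f} fJ")
print(f"Wire-limited floor     : {wire_floor_fJ:.1f} fJ")
print(f"Headroom above Landauer: {wire_floor_fJ / fma_floor_fJ:,.0f}x")
```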
Part II: Number Formats
MXFP (Microscaling) — The Winner
MXFP4 achieves 0.03-0.05 pJ/MAC. 10x cheaper than FP8 with ~1% accuracy loss. OCP standard adopted by NVIDIA, AMD, Intel, ARM, Qualcomm.
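A minimal numpy sketch of the microscaling idea: blocks of 32 FP4 (E2M1) elements share one power-of-two scale. Production kernels choose scales and handle rounding more carefully; this only illustrates the block-scaling structure.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(x, block=32):
    """Quantize a 1-D array in blocks of 32 values that share one power-of-two
    scale (MXFP4: 4-bit E2M1 elements + one shared exponent per block)."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = np.max(np.abs(blk))
        if amax == 0:
            out[i:i + block] = 0.0
            continue
        # Shared scale: power of two chosen so the largest element lands near 6.0.
        shared_exp = np.floor(np.log2(amax)) - np.floor(np.log2(FP4_GRID[-1]))
        scale = 2.0 ** shared_exp
        # Round each scaled element to the nearest representable FP4 magnitude.
        scaled = blk / scale
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
        out[i:i + block] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

w = np.random.randn(128)
w_q = mxfp4_quantize_block(w)
print("mean |error|:", np.mean(np.abs(w - w_q)))
```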
BitNet b1.58 — Eliminates the Multiplier
Ternary weights {-1, 0, +1}. At 3B scale matches FP16 Llama. “MAC” becomes conditional negate + add: ~0.02 pJ. 15x cheaper than FP8.
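A toy sketch of why the multiplier disappears with ternary weights; the per-tensor activation/weight scales used by b1.58 are omitted here for brevity:

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product with ternary weights {-1, 0, +1}: every 'MAC'
    reduces to a conditional add or subtract, no multiplier needed."""
    out = np.zeros(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # adds and negated adds only
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))      # ternary weight matrix
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)
```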
Analog Compute — Dead for Training
The ADC tax erases the 200x "free multiply" advantage: at the system level, analog is only 1-3x better than digital, and at 3nm digital even that vanishes. Mythic, Luminous, and Rain all failed or pivoted. IBM is the last credible lab still pursuing it, but the digital efficiency curves have already crossed.
Part III: The Memory Wall
Horowitz Table (2025)
| Operation | Energy |
|---|---|
| FP8 multiply | 1.7 fJ |
| SRAM read (on-chip) | 2-7 fJ |
| HBM3E read | 2,500 fJ |
| DDR5 read | 15,000 fJ |
| NAND read | 20,000,000 fJ |
One DRAM access costs as much energy as over 1,000 multiplies.
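Taking the table values at face value (5 fJ is assumed as the SRAM midpoint), here is each level's access cost expressed in FP8 multiplies:

```python
# How many FP8 multiplies does one access at each memory level "cost"?
# Energies are the Horowitz-table numbers above (fJ per access).
ACCESS_FJ = {
    "SRAM (on-chip)": 5,           # assumed midpoint of the 2-7 fJ range
    "HBM3E":          2_500,
    "DDR5":           15_000,
    "NAND":           20_000_000,
}
MULTIPLY_FJ = 1.7                  # FP8 multiply

for level, fj in ACCESS_FJ.items():
    print(f"{level:15s}: 1 access = {fj / MULTIPLY_FJ:>12,.0f} multiplies")
```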
HBM Evolution
| Gen | BW/stack | Capacity | pJ/bit |
|---|---|---|---|
| HBM3E (2024) | 1,180 GB/s | 36 GB | ~2.5 |
| HBM4 (2025) | ~2,000 GB/s | 48 GB | ~2.0 |
8x HBM3E stacks (B200) = 9.4 TB/s aggregate, a bandwidth level that used to be SRAM-only territory.
Part IV: Beyond Attention
Architecture Comparison
| Architecture | Quality at Scale | Training Efficiency | Inference Memory |
|---|---|---|---|
| Transformers | Best | Good | O(n) KV cache |
| Mamba (SSM) | Slightly worse | Good | O(1) |
| Hybrids (Jamba) | Matches transformers | Good | O(1)-ish |
No non-transformer has demonstrated better scaling laws, but hybrids capture ~90% of the efficiency gains while matching transformer quality. Likely the future.
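A rough sketch of the inference-memory column: a transformer's KV cache grows with sequence length while an SSM's recurrent state does not. The 7B-class configuration below is hypothetical, chosen only for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Transformer KV cache: grows linearly with sequence length (O(n))."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, d_state, bytes_per_elem=2):
    """SSM recurrent state: fixed size regardless of sequence length (O(1))."""
    return n_layers * d_model * d_state * bytes_per_elem

# Hypothetical 7B-class config, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
for n in (4_096, 131_072):
    gb = kv_cache_bytes(seq_len=n, **cfg) / 1e9
    print(f"KV cache @ {n:>7,} tokens: {gb:6.1f} GB")
print(f"SSM state (any length)  : {ssm_state_bytes(32, 4096, 16) / 1e6:6.1f} MB")
```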
Part V: Compression Limits
| Method | Bits/weight | Size (7B model) |
|---|---|---|
| FP16 | 16 | 14 GB |
| INT4 (GPTQ) | 4 | 3.5 GB |
| QuIP# | 2 | 1.75 GB |
| BitNet b1.58 | 1.58 | 1.4 GB |
| Theoretical floor | ~1 | ~100 MB for English fluency |
Gap between theory (~100 MB) and best demonstrated practice (~300 MB) is only ~3x; even starting from BitNet's 1.4 GB, that leaves about one order of magnitude of headroom.
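A quick check that reproduces the size column for a 7B-parameter model (bytes = parameters x bits / 8):

```python
# Model size as a function of bits per weight, for a 7B-parameter model.
N_PARAMS = 7e9

for name, bits in [("FP16", 16), ("INT4 (GPTQ)", 4), ("QuIP#", 2), ("BitNet b1.58", 1.58)]:
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{name:14s}: {gb:5.2f} GB")   # 14.0, 3.5, 1.75, ~1.4 GB
```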
Key Takeaways for Project Proposals
- ChipletCostModel: Use the fJ/FMA numbers to model energy per inference, not just cost
- RooflineVM: The memory wall data (Horowitz table) is the analytical backbone (see the roofline sketch after this list)
- SystolicDiff: Weight-stationary systolic arrays reduce register file cost from 30 fJ to 3 fJ (10x)
- InferBench: Use MXFP4/BitNet energy numbers for architecture comparison
- RoboEdge: On-chip SRAM at 2-7 fJ vs DRAM at 2,500 fJ = 400-1000x. This is why weight-stationary with large scratchpad wins for diffusion policies
- DiffusionASIC: MXFP4 at 0.03 pJ/MAC for the denoising network could enable <1W diffusion inference at the edge
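For RooflineVM, a minimal roofline sketch using the 9.4 TB/s aggregate HBM bandwidth from Part III; the FP8 peak below is a placeholder for illustration, not a spec-sheet number:

```python
def roofline(arith_intensity_flops_per_byte, peak_flops, mem_bw_bytes_per_s):
    """Attainable throughput = min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_flops, mem_bw_bytes_per_s * arith_intensity_flops_per_byte)

MEM_BW = 9.4e12          # 8x HBM3E stacks (B200), bytes/s, from Part III
PEAK   = 5e15            # hypothetical FP8 peak FLOP/s, for illustration only

for ai in (1, 10, 100, 1_000):   # FLOPs performed per byte moved from HBM
    t = roofline(ai, PEAK, MEM_BW)
    bound = "memory-bound" if t < PEAK else "compute-bound"
    print(f"AI={ai:>5} FLOP/B -> {t/1e12:8.1f} TFLOP/s ({bound})")
```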