The Physics of Intelligence: AI Hardware from Atoms to Architectures
Source: Alan Ma’s AI Hardware Deep Dive keynote + companion decks
Core Thesis
Compute is nearly free. Data movement is everything.
Total Energy = sum(data movement) + epsilon(compute)
Every innovation that matters — systolic arrays, HBM, NVLink, block floating point, sparsity, KV-cache compression, SSMs — is about moving less data or moving it shorter distances.
Part I: Energy Limits
The Energy Stack (FP8 FMA at 5nm)
| Component | Energy | % of Total |
|---|---|---|
| Sign XOR | 0.05 fJ | 0.1% |
| Exponent add | 0.50 fJ | 0.8% |
| 4x4 mantissa multiply | 1.70 fJ | 2.7% |
| Normalize + round | 0.75 fJ | 1.2% |
| FP32 accumulator | 10.10 fJ | 16.0% |
| Clock + control | 6.30 fJ | 10.0% |
| Wire energy | 13.70 fJ | 21.7% |
| Register file R/W | 30.00 fJ | 47.5% |
| TOTAL | 63.1 fJ | 100% |
The actual multiply is 1.7 fJ (2.7% of the total). Moving data to and from the multiplier (wire plus register file) is 43.7 fJ (69%).
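A quick sketch that reproduces the table's arithmetic; the values are copied from the table above and the dictionary keys are just labels:

```python
# Reproduce the FP8 FMA energy-stack arithmetic from the table above.
# Values are the keynote's 5nm estimates (fJ per FMA).
ENERGY_FJ = {
    "sign_xor":          0.05,
    "exponent_add":      0.50,
    "mantissa_multiply": 1.70,
    "normalize_round":   0.75,
    "fp32_accumulator":  10.10,
    "clock_control":     6.30,
    "wire":              13.70,
    "register_file":     30.00,
}

total    = sum(ENERGY_FJ.values())                          # ~63.1 fJ
movement = ENERGY_FJ["wire"] + ENERGY_FJ["register_file"]   # 43.7 fJ
multiply = ENERGY_FJ["mantissa_multiply"]                   # 1.7 fJ

print(f"total         {total:5.1f} fJ")
print(f"data movement {movement:5.1f} fJ ({movement/total:5.1%})")
print(f"multiply      {multiply:5.1f} fJ ({multiply/total:5.1%})")
```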
Industry State
| Chip | Node | Chip-level fJ/FMA | Core-only fJ/FMA |
|---|---|---|---|
| H100 | 4N | 710 | 210-250 |
| B200 | 4NP | 220 | 70-100 |
| GB300 (proj.) | 3nm | ~120 | ~40-60 |
| TSMC test chip | N3E | — | 35 (published) |
Gap between theory and practice: only 2-3x. Tensor cores are shockingly close to physics.
The Hierarchy of Limits
- Landauer (thermodynamics): 0.001 fJ (irrelevant)
- Wire-limited floor: 30-50 fJ (the real wall)
- Full-custom CMOS: 50-80 fJ
- Production tensor core: 70-100 fJ
- Full chip (B200): 220 fJ
We are limited by wires, not thermodynamics.
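A sanity check on why Landauer sits at the bottom of the hierarchy; the ~350 erased bits per FMA is an illustrative assumption, not a measured figure:

```python
import math

k_B = 1.380649e-23        # Boltzmann constant, J/K
T = 300.0                 # room temperature, K

landauer_per_bit_J  = k_B * T * math.log(2)       # ~2.87e-21 J per erased bit
landauer_per_bit_fJ = landauer_per_bit_J * 1e15   # ~2.9e-6 fJ

# Assumption for illustration: an FP8 FMA with FP32 accumulation erases on the
# order of a few hundred bits of intermediate state.
bits_erased  = 350
fma_floor_fJ = bits_erased * landauer_per_bit_fJ  # ~0.001 fJ

wire_floor_fJ = 30.0                              # low end of the wire-limited floor
print(f"Landauer floor per FMA : {fma_floor_fJ:.4f} fJ")
print(f"Wire-limited floor     : {wire_floor_fJ:.1f} fJ")
print(f"Headroom above Landauer: {wire_floor_fJ / fma_floor_fJ:,.0f}x")
```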
Part II: Number Formats
MXFP (Microscaling) — The Winner
MXFP4 achieves 0.03-0.05 pJ/MAC. 10x cheaper than FP8 with ~1% accuracy loss. OCP standard adopted by NVIDIA, AMD, Intel, ARM, Qualcomm.
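A minimal numpy sketch of the microscaling idea: blocks of 32 FP4 (E2M1) elements share one power-of-two scale. Production kernels choose scales and handle rounding more carefully; this only illustrates the block-scaling structure.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(x, block=32):
    """Quantize a 1-D array in blocks of 32 values that share one power-of-two
    scale (MXFP4: 4-bit E2M1 elements + one shared exponent per block)."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        amax = np.max(np.abs(blk))
        if amax == 0:
            out[i:i + block] = 0.0
            continue
        # Shared scale: power of two chosen so the largest element lands near 6.0.
        shared_exp = np.floor(np.log2(amax)) - np.floor(np.log2(FP4_GRID[-1]))
        scale = 2.0 ** shared_exp
        # Round each scaled element to the nearest representable FP4 magnitude.
        scaled = blk / scale
        idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
        out[i:i + block] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

w = np.random.randn(128)
w_q = mxfp4_quantize_block(w)
print("mean |error|:", np.mean(np.abs(w - w_q)))
```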
BitNet b1.58 — Eliminates the Multiplier
Ternary weights {-1, 0, +1}. At 3B scale matches FP16 Llama. “MAC” becomes conditional negate + add: ~0.02 pJ. 15x cheaper than FP8.
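A toy sketch of why the multiplier disappears with ternary weights; the per-tensor activation/weight scales used by b1.58 are omitted here for brevity:

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product with ternary weights {-1, 0, +1}: every 'MAC'
    reduces to a conditional add or subtract, no multiplier needed."""
    out = np.zeros(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # adds and negated adds only
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))      # ternary weight matrix
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)
```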
Analog Compute — Dead for Training
The ADC tax erases the 200x "free multiply" advantage: at the system level, analog is only 1-3x better than digital, and at 3nm digital even that vanishes. Mythic, Luminous, and Rain all failed or pivoted. IBM is the last credible lab still pursuing it, but the digital efficiency curves have already crossed.
Part III: The Memory Wall
Horowitz Table (2025)
| Operation | Energy |
|---|---|
| FP8 multiply | 1.7 fJ |
| SRAM read (on-chip) | 2-7 fJ |
| HBM3E read | 2,500 fJ |
| DDR5 read | 15,000 fJ |
| NAND read | 20,000,000 fJ |
One DRAM access costs as much energy as over 1,000 multiplies.
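Taking the table values at face value (5 fJ is assumed as the SRAM midpoint), here is each level's access cost expressed in FP8 multiplies:

```python
# How many FP8 multiplies does one access at each memory level "cost"?
# Energies are the Horowitz-table numbers above (fJ per access).
ACCESS_FJ = {
    "SRAM (on-chip)": 5,           # assumed midpoint of the 2-7 fJ range
    "HBM3E":          2_500,
    "DDR5":           15_000,
    "NAND":           20_000_000,
}
MULTIPLY_FJ = 1.7                  # FP8 multiply

for level, fj in ACCESS_FJ.items():
    print(f"{level:15s}: 1 access = {fj / MULTIPLY_FJ:>12,.0f} multiplies")
```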
HBM Evolution
| Gen | BW/stack | Capacity | pJ/bit |
|---|---|---|---|
| HBM3E (2024) | 1,180 GB/s | 36 GB | ~2.5 |
| HBM4 (2025) | ~2,000 GB/s | 48 GB | ~2.0 |
8x HBM3E stacks (B200) = 9.4 TB/s aggregate, a bandwidth level that used to be SRAM-only territory.
Part IV: Beyond Attention
Architecture Comparison
| Architecture | Quality at Scale | Training Efficiency | Inference Memory |
|---|---|---|---|
| Transformers | Best | Good | O(n) KV cache |
| Mamba (SSM) | Slightly worse | Good | O(1) |
| Hybrids (Jamba) | Matches transformers | Good | O(1)-ish |
No non-transformer has demonstrated better scaling laws, but hybrids capture ~90% of the efficiency gains while matching transformer quality. Likely the future.
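A rough sketch of the inference-memory column: a transformer's KV cache grows with sequence length while an SSM's recurrent state does not. The 7B-class configuration below is hypothetical, chosen only for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Transformer KV cache: grows linearly with sequence length (O(n))."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_model, d_state, bytes_per_elem=2):
    """SSM recurrent state: fixed size regardless of sequence length (O(1))."""
    return n_layers * d_model * d_state * bytes_per_elem

# Hypothetical 7B-class config, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
for n in (4_096, 131_072):
    gb = kv_cache_bytes(seq_len=n, **cfg) / 1e9
    print(f"KV cache @ {n:>7,} tokens: {gb:6.1f} GB")
print(f"SSM state (any length)  : {ssm_state_bytes(32, 4096, 16) / 1e6:6.1f} MB")
```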
Part V: Compression Limits
| Method | Bits/weight | Size (7B model) |
|---|---|---|
| FP16 | 16 | 14 GB |
| INT4 (GPTQ) | 4 | 3.5 GB |
| QuIP# | 2 | 1.75 GB |
| BitNet b1.58 | 1.58 | 1.4 GB |
| Theoretical floor | ~1 | ~100 MB for English fluency |
Gap between theory (~100 MB) and best demonstrated practice (~300 MB) is only ~3x; even starting from BitNet's 1.4 GB, that leaves about one order of magnitude of headroom.
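A quick check that reproduces the size column for a 7B-parameter model (bytes = parameters x bits / 8):

```python
# Model size as a function of bits per weight, for a 7B-parameter model.
N_PARAMS = 7e9

for name, bits in [("FP16", 16), ("INT4 (GPTQ)", 4), ("QuIP#", 2), ("BitNet b1.58", 1.58)]:
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{name:14s}: {gb:5.2f} GB")   # 14.0, 3.5, 1.75, ~1.4 GB
```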
Key Takeaways for Project Proposals
- ChipletCostModel: Use the fJ/FMA numbers to model energy per inference, not just cost
- RooflineVM: The memory wall data (Horowitz table) is the analytical backbone (see the roofline sketch after this list)
- SystolicDiff: Weight-stationary systolic arrays reduce register file cost from 30 fJ to 3 fJ (10x)
- InferBench: Use MXFP4/BitNet energy numbers for architecture comparison
- RoboEdge: On-chip SRAM at 2-7 fJ vs DRAM at 2,500 fJ = 400-1000x. This is why weight-stationary with large scratchpad wins for diffusion policies
- DiffusionASIC: MXFP4 at 0.03 pJ/MAC for the denoising network could enable <1W diffusion inference at the edge
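For RooflineVM, a minimal roofline sketch using the 9.4 TB/s aggregate HBM bandwidth from Part III; the FP8 peak below is a placeholder for illustration, not a spec-sheet number:

```python
def roofline(arith_intensity_flops_per_byte, peak_flops, mem_bw_bytes_per_s):
    """Attainable throughput = min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_flops, mem_bw_bytes_per_s * arith_intensity_flops_per_byte)

MEM_BW = 9.4e12          # 8x HBM3E stacks (B200), bytes/s, from Part III
PEAK   = 5e15            # hypothetical FP8 peak FLOP/s, for illustration only

for ai in (1, 10, 100, 1_000):   # FLOPs performed per byte moved from HBM
    t = roofline(ai, PEAK, MEM_BW)
    bound = "memory-bound" if t < PEAK else "compute-bound"
    print(f"AI={ai:>5} FLOP/B -> {t/1e12:8.1f} TFLOP/s ({bound})")
```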