The Physics of Intelligence: AI Hardware from Atoms to Architectures

Source: Alan Ma’s AI Hardware Deep Dive keynote + companion decks

Core Thesis

Compute is nearly free. Data movement is everything.

Total Energy = Σ(data movement) + ε(compute)

Every innovation that matters — systolic arrays, HBM, NVLink, block floating point, sparsity, KV-cache compression, SSMs — is about moving less data or moving it shorter distances.

Part I: Energy Limits

The Energy Stack (FP8 FMA at 5nm)

| Component             | Energy   | % of Total |
|-----------------------|----------|------------|
| Sign XOR              | 0.05 fJ  | 0.1%       |
| Exponent add          | 0.50 fJ  | 0.8%       |
| 4x4 mantissa multiply | 1.70 fJ  | 2.7%       |
| Normalize + round     | 0.75 fJ  | 1.2%       |
| FP32 accumulator      | 10.10 fJ | 16.0%      |
| Clock + control       | 6.30 fJ  | 10.0%      |
| Wire energy           | 13.70 fJ | 21.7%      |
| Register file R/W     | 30.00 fJ | 47.5%      |
| TOTAL                 | 63 fJ    |            |

The actual multiply costs 1.7 fJ (2.7% of the total). Moving data to and from the multiplier, register file plus wires, costs 43.7 fJ (69%).
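A quick sanity check on that split, using the table's numbers (the component keys are mine, the fJ figures are from the keynote):

```python
# Energy breakdown of an FP8 FMA at 5nm, from the table above.
# Splits the 63 fJ total into arithmetic vs data movement.

ENERGY_FJ = {
    "sign_xor": 0.05,
    "exponent_add": 0.50,
    "mantissa_multiply_4x4": 1.70,
    "normalize_round": 0.75,
    "fp32_accumulator": 10.10,
    "clock_control": 6.30,
    "wire": 13.70,
    "register_file_rw": 30.00,
}
DATA_MOVEMENT = {"wire", "register_file_rw"}

total = sum(ENERGY_FJ.values())
movement = sum(v for k, v in ENERGY_FJ.items() if k in DATA_MOVEMENT)
multiply = ENERGY_FJ["mantissa_multiply_4x4"]

print(f"total:         {total:.1f} fJ")
print(f"multiply:      {multiply:.1f} fJ ({multiply / total:.1%})")
print(f"data movement: {movement:.1f} fJ ({movement / total:.1%})")
```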

Industry State

| Chip           | Node | Chip-level fJ/FMA | Core-only fJ/FMA |
|----------------|------|-------------------|------------------|
| H100           | 4N   | 710               | 210-250          |
| B200           | 4NP  | 220               | 70-100           |
| GB300 (proj.)  | 3nm  | ~120              | ~40-60           |
| TSMC test chip | N3E  | 35 (published)    |                  |

The gap between the wire-limited floor and production silicon is only 2-3x: tensor cores are already shockingly close to physics.

The Hierarchy of Limits

Landauer (thermodynamics):    0.001 fJ   (irrelevant)
Wire-limited floor:           30-50 fJ   (the real wall)
Full-custom CMOS:             50-80 fJ
Production tensor core:       70-100 fJ
Full chip (B200):             220 fJ

We are limited by wires, not thermodynamics.
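A minimal sketch of where the 0.001 fJ Landauer figure comes from. The per-bit cost is physics; the bit-erasure count per FMA is my illustrative assumption, not a keynote number:

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # room temperature, K

# Landauer limit: minimum energy to erase one bit, ~2.87e-21 J at 300 K.
landauer_per_bit_j = K_B * T * math.log(2)

# Assumption (mine): an FP8 FMA erases on the order of a few hundred bits.
bit_erasures_per_fma = 350

fma_floor_fj = landauer_per_bit_j * bit_erasures_per_fma * 1e15
wire_floor_fj = 30.0   # lower end of the wire-limited floor above

print(f"Landauer floor per FMA: {fma_floor_fj:.4f} fJ")
print(f"wire floor / Landauer:  {wire_floor_fj / fma_floor_fj:,.0f}x")
```

The wire floor sits four to five orders of magnitude above thermodynamics, which is why the Landauer line is marked "irrelevant".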

Part II: Number Formats

MXFP (Microscaling) — The Winner

MXFP4 achieves 0.03-0.05 pJ/MAC, roughly 10x cheaper than FP8, with ~1% accuracy loss. It is an OCP standard adopted by NVIDIA, AMD, Intel, ARM, and Qualcomm.
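A simplified sketch of the microscaling idea: a block of 32 values shares one power-of-two scale, and each element is rounded to the FP4 E2M1 grid. The scale rule below is a simplification, not the exact OCP encoding:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bits).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MX block size

def mxfp4_quantize(x):
    """Round-trip a 1-D array through a simplified MXFP4 encoding:
    each 32-element block shares one power-of-two scale, and each
    element is rounded to the nearest FP4 E2M1 value."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), BLOCK):
        blk = x[start:start + BLOCK]
        amax = np.abs(blk).max()
        # Shared power-of-two scale mapping the block max into FP4 range.
        scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
        scaled = blk / scale
        # Nearest representable FP4 magnitude (simplified rounding).
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
        out[start:start + BLOCK] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

x = np.random.randn(1024)
err = np.abs(mxfp4_quantize(x) - x).mean()
print(f"mean abs round-trip error: {err:.4f}")
```

The energy win comes from the storage layout: 4-bit elements plus one shared exponent per block means narrow datapaths and far fewer bits moved per MAC.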

BitNet b1.58 — Eliminates the Multiplier

Ternary weights {-1, 0, +1}. At 3B scale it matches FP16 Llama. Each "MAC" becomes a conditional negate-and-add at ~0.02 pJ, about 15x cheaper than FP8.
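A sketch of why the multiplier disappears. `ternarize` follows the absmean rule from the b1.58 paper; the matrix shapes are arbitrary:

```python
import numpy as np

def ternarize(w):
    """Absmean ternarization from the BitNet b1.58 paper: scale by the
    mean absolute weight, round, clip to {-1, 0, +1}."""
    gamma = np.abs(w).mean()
    return np.clip(np.round(w / (gamma + 1e-8)), -1, 1), gamma

def ternary_matvec(w_t, x):
    """Matrix-vector product with ternary weights: no multiplies,
    just conditional add / subtract / skip per weight."""
    out = np.zeros(w_t.shape[0])
    for i in range(w_t.shape[0]):
        out[i] = x[w_t[i] == 1].sum() - x[w_t[i] == -1].sum()
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 256))
x = rng.normal(size=256)
w_t, gamma = ternarize(w)
# The per-tensor scale gamma is re-applied once per output, not per MAC.
approx = gamma * ternary_matvec(w_t, x)
print("relative error:", np.linalg.norm(approx - w @ x) / np.linalg.norm(w @ x))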

Analog Compute — Dead for Training

The ADC tax erases the 200x "free multiply" advantage, leaving a system-level gain of only 1-3x over digital; against 3nm digital, even that vanishes. Mythic, Luminous, and Rain all failed or pivoted. IBM remains the last credible lab effort, but the digital efficiency curves have already crossed.
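A back-of-envelope version of the ADC tax, using the Walden figure of merit (energy per conversion ≈ FoM × 2^ENOB). The FoM value and array height below are my assumptions, not keynote numbers:

```python
# ADC tax for analog in-memory compute. Each column's accumulated analog
# sum must be digitized; that ADC energy is amortized over the column.

FOM_FJ = 10.0   # assumed fJ per conversion step (ballpark for good ADCs)
ENOB = 8        # assumed bits needed to resolve the column sum
ROWS = 256      # assumed analog MACs sharing each column's ADC readout

adc_per_conversion_fj = FOM_FJ * 2**ENOB        # 2,560 fJ per readout
adc_per_mac_fj = adc_per_conversion_fj / ROWS   # ADC cost alone, per MAC

print(f"ADC energy per conversion: {adc_per_conversion_fj:,.0f} fJ")
print(f"ADC tax per analog MAC:    {adc_per_mac_fj:.1f} fJ")
print("digital FP8 multiply:      1.7 fJ")
```

Under these assumptions the ADC alone costs more per MAC than the digital multiply it was supposed to make free, before counting DACs and drivers.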

Part III: The Memory Wall

Horowitz Table (2025)

| Level               | Energy per Access |
|---------------------|-------------------|
| FP8 multiply        | 1.7 fJ            |
| SRAM read (on-chip) | 2-7 fJ            |
| HBM3E read          | 2,500 fJ          |
| DDR5 read           | 15,000 fJ         |
| NAND read           | 20,000,000 fJ     |

One DRAM access costs on the order of 1,000 multiplies (2,500 fJ / 1.7 fJ ≈ 1,500 for HBM3E).
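The break-even reuse factor falls straight out of the table: how many 1.7 fJ MACs each memory access must feed before compute energy catches up to the fetch:

```python
# Break-even data reuse per memory level, from the Horowitz table above.

MAC_FJ = 1.7
MEMORY_FJ = {"SRAM": 7, "HBM3E": 2_500, "DDR5": 15_000, "NAND": 20_000_000}

for level, access_fj in MEMORY_FJ.items():
    print(f"{level:6s}: break-even reuse = {access_fj / MAC_FJ:>12,.0f} MACs/access")
```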

HBM Evolution

| Gen          | BW/stack     | Capacity | pJ/bit |
|--------------|--------------|----------|--------|
| HBM3E (2024) | 1,180 GB/s   | 36 GB    | ~2.5   |
| HBM4 (2025)  | ~2,000 GB/s  | 48 GB    | ~2.0   |

Eight HBM3E stacks on B200 deliver 9.4 TB/s, bandwidth that used to be SRAM-only territory.
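Why this matters for inference: batch-1 decoding streams every weight once per token, so HBM bandwidth caps tokens/s. A roofline-style bound, with model sizes taken from the compression table in Part V:

```python
# Upper bound on batch-1 decode throughput when each token must stream
# all weights from HBM (ignores KV cache and activation traffic).

HBM_BW_GBPS = 9_400.0  # 8 stacks of HBM3E at 1,180 GB/s each

def decode_tokens_per_sec(model_bytes_gb):
    return HBM_BW_GBPS / model_bytes_gb

for label, size_gb in [("7B @ FP16", 14.0), ("7B @ INT4", 3.5)]:
    print(f"{label}: <= {decode_tokens_per_sec(size_gb):,.0f} tokens/s")
```

Quantization quadruples the bound not by making math cheaper but by shrinking the bytes that must move: the thesis again.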

Part IV: Beyond Attention

Architecture Comparison

| Architecture    | Quality at Scale     | Training Efficiency | Inference Memory |
|-----------------|----------------------|---------------------|------------------|
| Transformers    | Best                 | Good                | O(n) KV cache    |
| Mamba (SSM)     | Slightly worse       | Good                | O(1)             |
| Hybrids (Jamba) | Matches transformers | Good                | O(1)-ish         |

No non-transformer architecture has demonstrated better scaling laws. But hybrids capture ~90% of the efficiency gains while matching transformer quality, which makes them the likely future.
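A sketch of the O(n) vs O(1) inference-memory column. The model dimensions are illustrative 7B-class values I am supplying, not numbers from the talk:

```python
# KV cache grows linearly with context for a transformer; an SSM's
# recurrent state is fixed-size regardless of sequence length.

def kv_cache_gb(seq_len, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

def ssm_state_gb(layers=32, d_state=16, d_model=4096, bytes_per=2):
    # Fixed-size state, independent of how many tokens came before.
    return layers * d_state * d_model * bytes_per / 1e9

for n in (4_096, 32_768, 262_144):
    print(f"n={n:>7,}: KV cache {kv_cache_gb(n):6.2f} GB, "
          f"SSM state {ssm_state_gb():.4f} GB")
```

At long contexts the KV cache, not the weights, becomes the dominant data-movement bill, which is exactly what the O(1)-ish hybrids attack.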

Part V: Compression Limits

| Method            | Bits/weight | Size (7B model)              |
|-------------------|-------------|------------------------------|
| FP16              | 16          | 14 GB                        |
| INT4 (GPTQ)       | 4           | 3.5 GB                       |
| QuIP#             | 2           | 1.75 GB                      |
| BitNet b1.58      | 1.58        | 1.4 GB                       |
| Theoretical floor | ~1          | ~100 MB for English fluency  |

The gap between the theoretical floor (~100 MB) and current best practice (~300 MB) is only ~3x; measured from BitNet's 1.4 GB, roughly one order of magnitude of headroom remains.
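The size column is pure arithmetic, size = params × bits / 8:

```python
# Model size as a function of bits per weight, reproducing the table above.

PARAMS = 7e9  # 7B model

for method, bits in [("FP16", 16), ("INT4 (GPTQ)", 4),
                     ("QuIP#", 2), ("BitNet b1.58", 1.58)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{method:12s}: {gb:5.2f} GB")
```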

Key Takeaways for Project Proposals

  1. ChipletCostModel: Use the fJ/FMA numbers to model energy per inference, not just cost (see the sketch after this list)
  2. RooflineVM: The memory wall data (Horowitz table) is the analytical backbone
  3. SystolicDiff: Weight-stationary systolic arrays reduce register file cost from 30 fJ to 3 fJ (10x)
  4. InferBench: Use MXFP4/BitNet energy numbers for architecture comparison
  5. RoboEdge: On-chip SRAM at 2-7 fJ vs DRAM at 2,500 fJ = 400-1000x. This is why weight-stationary with large scratchpad wins for diffusion policies
  6. DiffusionASIC: MXFP4 at 0.03 pJ/MAC for the denoising network could enable <1W diffusion inference at the edge
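
For item 1, a hypothetical first-order model (the function name and workload numbers are mine): compute energy from fJ/FMA plus weight traffic from pJ/bit, using B200's chip-level 220 fJ/FMA and HBM3E's ~2.5 pJ/bit from the tables above:

```python
# First-order energy model for one inference step: compute from fJ/FMA,
# memory from pJ/bit. All workload numbers below are illustrative.

def inference_energy_j(macs, hbm_bytes, fj_per_fma=220.0, hbm_pj_per_bit=2.5):
    compute_j = macs * fj_per_fma * 1e-15
    memory_j = hbm_bytes * 8 * hbm_pj_per_bit * 1e-12
    return compute_j, memory_j

# Example: one decoded token of a 7B model at INT4 (~2 MACs per parameter,
# ~3.5 GB of weight traffic streamed from HBM).
compute_j, memory_j = inference_energy_j(macs=14e9, hbm_bytes=3.5e9)
print(f"compute: {compute_j * 1e3:.2f} mJ, memory: {memory_j * 1e3:.2f} mJ")
print(f"memory share: {memory_j / (compute_j + memory_j):.0%}")
```

Even at chip-level FMA energy, memory traffic dominates the per-token bill under these assumptions, which is the core thesis in one print statement.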