designing AI chip hardware and software — key ideas from bjarke roune
Bjarke Hammersholt Roune was the technical software lead for Google’s TPUv3 and worked at Nvidia on GPUs. in 2026 he published a freely available document laying out how he would design an AI chip from scratch. it’s the best single document on AI chip design i’ve ever read.
this page indexes the key ideas i reference across my research notes.
the core thesis
the end state of AI accelerators is AI CPUs — traditional CPUs with big caches and large systolic arrays. the caches and systolic arrays take up most of the chip area and power. everything else is overhead.
systolic arrays are the foundation
every AI chip is really just a way to access a systolic array. nvidia calls them Tensor Cores. google calls them MXUs. AMD calls them Matrix Cores. intel calls them AMX. amazon calls them NeuronCores. whatever they call them, it’s all systolic arrays. they were invented in 1978.
key insight: larger systolic arrays are more efficient. doubling the vector width from N to 2N:
- 4x more math per cycle (area grows as N²)
- but only ~2x more surrounding logic needed
- going from 4x4 (volta) to 128x128 (TPU): 1024x reduction in scalar compute overhead
this is why google’s small TPU team (~30 people) could compete with nvidia: 128x128 systolic arrays are fundamentally more efficient than nvidia’s 4x4 (volta) through 64x64 (blackwell).
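the 1024x figure falls out of a simple model: one issued instruction drives a full NxN array of MACs each cycle, so scalar overhead per MAC shrinks as N². a quick sketch (the model is my reading of the argument, not roune's exact accounting):

```python
# back-of-envelope: scalar overhead per MAC scales as 1/N^2 if one
# instruction feeds the whole NxN systolic array every cycle
def macs_per_instruction(n):
    return n * n  # an NxN array does n*n multiply-accumulates per issued op

volta = macs_per_instruction(4)     # volta-era 4x4 tensor core
tpu = macs_per_instruction(128)     # 128x128 TPU MXU
print(tpu // volta)                 # -> 1024, the claimed overhead reduction
```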
mono-sized systolic arrays are unbalanced
transformers have two kinds of heavy computation:
- FF layers: K dimension is large (8,192+). large systolic arrays excel here.
- attention: K dimension is small (16-128). large systolic arrays get low utilization.
a 256x256 TPU MXU is very efficient for FF but wastes cycles on attention. a 64x64 GPU tensor core has better potential attention utilization but is inherently less efficient for FF.
roune’s solution: one FF core with a huge systolic array + several smaller attention cores on the same chip. both always doing what they’re best at.
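a toy utilization model makes the imbalance concrete. the min(K, N)/N approximation is my simplification for illustration, not roune's formula:

```python
# rough model: an NxN systolic array contracting over dimension K
# keeps about min(K, N) of its N rows busy
def contraction_utilization(k, n):
    return min(k, n) / n

print(contraction_utilization(8192, 256))  # FF on a 256x256 MXU -> 1.0
print(contraction_utilization(64, 256))    # attention on the same MXU -> 0.25
print(contraction_utilization(64, 64))     # attention on a 64x64 core -> 1.0
print(contraction_utilization(8192, 64))   # FF on a 64x64 core: full, but
                                           # the small array is less efficient
```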
the KV cache formula
the amount of HBM needed for KV cache:
2 * batch * layers * seq_length * attention_head_groups * vector_width * element_size * idle_magnification * pipeline_factor
with realistic numbers (B=64, L=32, N=1M, G=8, W=128, S=1 byte; idle magnification and pipeline factor taken as 1): 4,194 GB. this is why AI chips keep needing more, and more expensive, HBM.
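the formula as code, to make the arithmetic checkable (idle_magnification and pipeline_factor default to 1, which is what the 4,194 GB example implies):

```python
# KV cache HBM requirement, straight from the formula above
def kv_cache_bytes(batch, layers, seq_len, head_groups, width,
                   elem_size, idle_mag=1, pipeline=1):
    return (2 * batch * layers * seq_len * head_groups
            * width * elem_size * idle_mag * pipeline)

gb = kv_cache_bytes(64, 32, 1_000_000, 8, 128, 1) / 1e9
print(round(gb))  # -> 4194
```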
decode vs prefill
- prefill (reading input): compute-bound. systolic arrays are the bottleneck. high utilization. “nothing speculative about it.”
- decode (generating output): memory-bandwidth-bound. poor systolic array utilization. the troublesome one.
naive decode utilization on a 256x256 systolic array: less than 1%. but with GQA + speculative decoding: up to 64x improvement.
managed aggregation
mix prefill and decode on the same chip in a managed ratio. prefill is compute-bound (waiting on systolic arrays), decode is memory-bound (waiting on bandwidth). combine them → both resources fully utilized.
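a tiny linear solve illustrates why a managed ratio can saturate both resources at once. the per-step resource costs here are invented for illustration, not from roune:

```python
# toy model: per step, prefill uses (1.0 compute, 0.2 bandwidth),
# decode uses (0.1 compute, 1.0 bandwidth). find the mix (x prefill
# steps, y decode steps) that saturates both resources:
#   x*1.0 + y*0.1 = 1.0   (compute)
#   x*0.2 + y*1.0 = 1.0   (bandwidth)
y = 0.8 / 0.98            # substitute x = 1 - 0.1*y into the second equation
x = 1 - 0.1 * y
print(round(x, 2), round(y, 2))  # both resources fully utilized at this mix
```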
the “3.3% of transistors” claim
etched and others claim only 3.3% of H100 transistors are for matrix multiplication. the math is correct but misleading. an ASIC still needs L2 cache (~17B transistors), L1/shared memory (~12B), register files, and HBM controllers. the stuff you can actually eliminate (warp schedulers, instruction decode, RT cores) is maybe 10-15% of the die. not 96.7%.
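sanity-checking the argument with rough numbers — the ~80B total transistor count for H100 is my added figure; the cache counts are from the text above:

```python
total = 80e9                    # H100 transistor count (added assumption)
matmul = 0.033 * total          # the "3.3%" -> roughly 2.6B transistors
keep = 17e9 + 12e9              # L2 + L1/shared an ASIC still needs
print(f"matmul: {matmul/1e9:.1f}B, caches you can't remove: {keep/1e9:.0f}B")
```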
structured sparsity and compression
roune proposes an Int7+1 format with 1:2 structured sparsity — the 8th bit indicates which of two adjacent entries is non-zero. this could be the end-state numerics format for inference. he combines it with huffman encoding for lossless compression on top of the lossy quantization.
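a sketch of the encoding as i read it — the byte layout (7 bits of unsigned magnitude plus 1 position bit) is my own guess at a concrete packing, not roune's spec:

```python
# Int7+1 with 1:2 structured sparsity: one byte per pair of adjacent
# entries, of which exactly one is non-zero
def encode_pair(pair):
    idx = 0 if pair[0] != 0 else 1     # which of the two slots is non-zero
    return (idx << 7) | (pair[idx] & 0x7F)  # 1 position bit + 7 value bits

def decode_pair(byte):
    idx, val = byte >> 7, byte & 0x7F
    return (val, 0) if idx == 0 else (0, val)

print(decode_pair(encode_pair((0, 42))))  # -> (0, 42)
print(decode_pair(encode_pair((99, 0))))  # -> (99, 0)
```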
the TPUv3 turbo mode story
during TPUv3 development, systolic array heat production was underestimated. when chips arrived at google HQ, they ran too hot. roune (as SW lead) wrote compiler-based static analysis that prevented heat-generating instruction patterns, enabling a 25% clock rate increase. he wanted to call it “Turbo Mode” but his manager’s manager informed him he sucked at marketing.
source: “Designing AI Chip Software and Hardware” by Bjarke Hammersholt Roune (2026). freely available. if you work on AI chips, read it.