InferBench: Inference ASIC Benchmark Suite — Deep Research
Why Every Existing Benchmark Is Wrong
| Benchmark | Problem |
|---|---|
| MLPerf | Favors GPUs (software maturity dominates), no cost/power normalization, burdensome submission |
| NVIDIA benchmarks | Vendor-controlled, TDP-based power (inflates efficiency), non-reproducible TensorRT configs |
| Cerebras | No power reporting (WSE-3 draws ~15kW), cost comparison nonsensical (30K H100) |
| Groq | tokens/sec/dollar uses API pricing (includes margin), LPU is fixed-function |
| SemiAnalysis InferenceMAX | Proposal only, no tool, no simulator, relies on vendor numbers |
The gap: No benchmark separates algorithmic efficiency from hardware capability from software maturity.
Workload Characterization (Real Numbers)
LLM Prefill — Llama 3 70B, S=2048
Per token per layer (FLOPs): Q proj 134M + K/V proj 33.6M + O proj 134M + FFN (gate+up+down) 1.41B + attention 32,768 x S
Full prefill S=2048: 2048 x 142B = 291 TFLOP
- Weights: 140 GB (loaded once if weight-stationary)
- Arithmetic intensity: ~1,300 FLOP/byte — compute-bound
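The prefill arithmetic above can be reproduced from the public Llama 3 70B config (d_model 8192, 80 layers, GQA with 8 KV heads of dim 128, FFN dim 28672); a minimal sketch:

```python
# Per-token FLOP count for Llama 3 70B prefill, from public config values.
# A (1 x K) @ (K x N) GEMM costs 2*K*N FLOPs (multiply + accumulate).
D, L, KV, FFN = 8192, 80, 1024, 28672   # KV = 8 heads * 128 head_dim

def flops_per_token(seq_len: int) -> int:
    qo = 2 * (2 * D * D)          # Q and O projections
    kv = 2 * (2 * D * KV)         # K and V projections
    ffn = 3 * (2 * D * FFN)       # gate, up, down
    attn = 4 * D * seq_len        # QK^T plus attn @ V, summed over heads
    return L * (qo + kv + ffn + attn)

S = 2048
total = sum(flops_per_token(s) for s in range(1, S + 1))
# ~286 TFLOP with causal (triangular) attention; the ~291 TFLOP figure
# above assumes every token attends over the full S.
print(f"{total / 1e12:.0f} TFLOP")
```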
LLM Decode — Llama 3 70B
| Batch Size | FLOPs/step | AI (FLOP/byte, weight traffic only) | Bound on A100 |
|---|---|---|---|
| 1 | 137 GFLOP | 0.97 | Memory BW |
| 8 | 1.1 TFLOP | 7.8 | Memory BW |
| 32 | 4.4 TFLOP | 31 | Memory BW |
| 128 | 17.5 TFLOP | 124 | Memory BW |
| 256 | 35 TFLOP | 246 | Balanced |
At B=1, A100 utilization is ~0.6%: each step loads 140 GB of weights to perform only 137 GFLOP of work. Batch size is everything.
KV cache per sequence at S=8192: 2.62 GB. Batch 256 at S=8192: 672 GB — exceeds H100 80GB.
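These figures can be sanity-checked in a few lines. The KV geometry (80 layers, 8 KV heads of dim 128, FP16) is the public Llama 3 config, and the AI calculation deliberately counts weight traffic only, as in the table; results land within a few percent of the numbers quoted above:

```python
WEIGHT_BYTES = 140e9      # 70B params * 2 bytes (FP16)
FLOP_PER_TOKEN = 137e9    # projections + FFN, attention excluded

def decode_ai(batch: int) -> float:
    """Arithmetic intensity of one decode step, weight traffic only."""
    return batch * FLOP_PER_TOKEN / WEIGHT_BYTES

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # 80 layers * 2 (K and V) * 8 KV heads * 128 head_dim * 2 bytes
    return batch * seq_len * 80 * 2 * 8 * 128 * 2

print(round(decode_ai(1), 2))                 # ~0.98
print(kv_cache_bytes(8192) / 1e9)             # ~2.68 GB per sequence
print(kv_cache_bytes(8192, batch=256) / 1e9)  # ~687 GB for the batch
```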
Diffusion — SD3 Medium (2B params)
- 24 MMDiT blocks, 4429 tokens (4096 image + 333 text)
- Per denoising step: 8.9 TFLOPs
- 50 steps: 445 TFLOPs
- Flux.1 (12B): 28 steps = 1,456 TFLOPs
- Weights reused 28-50x — more compute-bound than LLM decode
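A one-line check of the compute-bound claim, using only the SD3 Medium figures above and ignoring activation traffic:

```python
# Each denoising step re-reads the full weight set once but performs far
# more FLOPs per byte than LLM decode at small batch.
STEP_FLOPS = 8.9e12        # per denoising step, figure above
WEIGHT_BYTES = 2e9 * 2     # 2B params at FP16

ai = STEP_FLOPS / WEIGHT_BYTES
print(round(ai))           # ~2225 FLOP/byte, far past any GPU ridge point
```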
MoE — Mixtral 8x7B
- 32.2B FLOPs per token (only 12.9B parameters active due to top-2 routing)
- At B=1 per expert: GEMV territory, AI ~0.34 FLOP/byte
- Even more batch-sensitive than dense models
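The batch sensitivity follows directly from routing dilution; a sketch assuming uniform top-2 routing over 8 experts:

```python
# Each token activates top_k of n_experts experts, so an expert sees
# batch * top_k / n_experts tokens on average. Small batches leave every
# expert doing near-GEMV work even when the global batch looks healthy.
def tokens_per_expert(batch: int, top_k: int = 2, n_experts: int = 8) -> float:
    return batch * top_k / n_experts

for b in (1, 8, 32, 256):
    print(b, tokens_per_expert(b))   # 0.25, 2.0, 8.0, 64.0
```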
Vision — ViT-L (307M)
- 122.7 GFLOPs per image (197 tokens)
- AI ~143 FLOP/byte — balanced at A100 ridge point
- Real-time 30 fps is easily achievable on A100 (~0.39 ms/image at peak FP16)
Architecture Models
Systolic Array (TPU-like)
peak_ops = n_arrays x N^2 x 2 x f_clk
utilization(M) = min(1, M/N) -- M=1 gives 0.78% util on 128x128 array
Weight-stationary: load weights once, stream activations. 90%+ util for M >= 256.
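The two formulas above in runnable form; array size and clock are illustrative TPU-like values, not any specific part:

```python
# Systolic-array peak throughput and M-dimension utilization.
def systolic_peak_ops(n_arrays: int, n: int, f_clk: float) -> float:
    return n_arrays * n * n * 2 * f_clk   # 2 ops per MAC

def systolic_util(m: int, n: int = 128) -> float:
    # Rows of the output tile below N leave PE rows idle.
    return min(1.0, m / n)

print(systolic_peak_ops(1, 128, 1e9) / 1e12)  # 32.768 TOPS for one array
print(systolic_util(1))                       # 0.0078125 -> 0.78%
print(systolic_util(256))                     # 1.0
```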
SIMT (GPU-like)
Roofline model:
if AI > peak_ops/hbm_bw: t = FLOPs / (peak x util) -- compute bound
else: t = bytes / hbm_bw -- memory bound
Utilization lookup (empirical from cuBLAS): M<32: 15%, M=128-512: 70%, M>4096: 85%
Power: P = P_idle + alpha x GFLOPS + beta x GB/s (A100: 60 W + 0.2 mW per GFLOPS + 50 mW per GB/s)
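Putting the roofline and the affine power model together gives a minimal SIMT latency/power estimator; the constants are the A100-ish figures quoted above, not measurements:

```python
PEAK, BW = 312e12, 2.0e12                    # FP16 peak, HBM bandwidth
P_IDLE, ALPHA, BETA = 60.0, 0.2e-3, 50e-3    # W, W per GFLOPS, W per GB/s

def step_time(flops: float, bytes_moved: float, util: float = 0.7) -> float:
    if flops / bytes_moved > PEAK / BW:      # compute bound
        return flops / (PEAK * util)
    return bytes_moved / BW                  # memory bound

def power_watts(gflops_rate: float, gbps_rate: float) -> float:
    return P_IDLE + ALPHA * gflops_rate + BETA * gbps_rate

# B=1 decode step: 137 GFLOP over 140 GB of weight traffic -> memory bound
t = step_time(137e9, 140e9)
print(round(t * 1e3), "ms per token")              # ~70 ms
print(round(power_watts(137 / t, 140 / t)), "W")   # ~160 W
```

At 70 ms per token the achieved rate is ~2 TFLOPS, which is the ~0.6% utilization quoted in the decode section.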
In-Memory Compute (d-Matrix-like)
- ADC is the bottleneck: 256 ADCs at 1 GHz = 131 TOPS per tile
- Precision-throughput tradeoff: n_passes = ceil(input_bits/dac_bits) x ceil(effective_bits/adc_bits)
- 10-50x energy efficiency over digital for INT4 weights; FP16 multi-pass negates the benefit
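The n_passes formula above, with illustrative 4-bit DAC/ADC widths (an assumption, not a d-Matrix spec):

```python
import math

# Multi-pass cost of running higher-precision operands through a
# fixed-width analog datapath.
def n_passes(input_bits: int, effective_bits: int,
             dac_bits: int = 4, adc_bits: int = 4) -> int:
    return math.ceil(input_bits / dac_bits) * math.ceil(effective_bits / adc_bits)

print(n_passes(4, 4))     # 1  -- INT4 native, full efficiency win
print(n_passes(16, 16))   # 16 -- FP16 multi-pass erodes the advantage
```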
Dataflow (Cerebras-like)
When weights fit on-chip (44 GB SRAM on WSE-3): zero weight loading, decode becomes compute-bound. Llama 3 70B at FP16 (140 GB): does NOT fit. Llama 3 8B (16 GB): FITS. Power: ~15 kW for inference is extreme.
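A trivial fit check makes the SRAM cliff concrete (the 4-bit line is an illustration of the cliff, not a claim about any shipping configuration):

```python
SRAM_BYTES = 44e9   # WSE-3 on-chip SRAM, figure above

def fits_on_chip(params_billions: float, bytes_per_param: float) -> bool:
    return params_billions * 1e9 * bytes_per_param <= SRAM_BYTES

print(fits_on_chip(70, 2))    # False -- Llama 3 70B FP16 (140 GB)
print(fits_on_chip(8, 2))     # True  -- Llama 3 8B FP16 (16 GB)
print(fits_on_chip(70, 0.5))  # True  -- 70B at 4-bit (35 GB) would fit
```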
Reconfigurable (CGRA/Fractile-like)
Operator fusion advantage: GEMM+bias+ReLU+LayerNorm fused, intermediates stay on-chip. Reconfiguration overhead: ~10us per config, 80 layers x 3 groups = 2.4ms — significant for decode.
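The overhead arithmetic, compared against a rough memory-bound decode step (140 GB of weights over ~2 TB/s, per the figures above):

```python
T_RECONFIG = 10e-6        # seconds per configuration, figure above
n_configs = 80 * 3        # 80 layers x 3 operator groups
overhead = n_configs * T_RECONFIG
print(round(overhead * 1e3, 1), "ms")   # 2.4 ms

# Versus a ~70 ms memory-bound B=1 decode step this is small; versus
# millisecond-scale high-batch steps, reconfiguration starts to dominate.
print(round(overhead / 70e-3 * 100, 1), "% of a B=1 step")
```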
Validation Strategy
Compare InferBench simulator (a100_sxm.yaml + workload graphs) against Alan’s energy-study A100 measurements. Target: <15% MAPE for latency, <20% MAPE for energy.
Calibration knobs: utilization table entries, power model coefficients.
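MAPE here is the standard mean absolute percentage error; a sketch with made-up sim-vs-measured numbers just to pin the metric down:

```python
# Mean absolute percentage error of simulated vs measured values.
def mape(predicted: list[float], measured: list[float]) -> float:
    return 100 * sum(abs(p - m) / abs(m)
                     for p, m in zip(predicted, measured)) / len(measured)

# Hypothetical latencies in ms, purely illustrative:
print(round(mape([10.2, 48.0, 95.0], [10.0, 50.0, 100.0]), 1))  # 3.7
```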
Publication Strategy
- Blog: “Why Every Inference ASIC Benchmark Is Wrong” — SemiAnalysis style
- Email Dylan Patel: “You proposed the right metrics at OCP. We built the tool.”
- Twitter thread with comparison table and roofline plots
- Workshop paper: MLSys 2027 or ISCA ML+Arch workshop
- Invite competing chip companies to contribute their own architecture specs; each submission creates a self-sustaining news cycle