InferBench: Inference ASIC Benchmark Suite — Deep Research

Why Every Existing Benchmark Is Wrong

  • MLPerf: favors GPUs (software maturity dominates), no cost/power normalization, burdensome submission process
  • NVIDIA benchmarks: vendor-controlled, TDP-based power (inflates efficiency), non-reproducible TensorRT configs
  • Cerebras: no power reporting (WSE-3 draws ~15 kW), cost comparison nonsensical (30K H100)
  • Groq: tokens/sec/dollar uses API pricing (includes margin), LPU is fixed-function
  • SemiAnalysis InferenceMAX: proposal only, no tool, no simulator, relies on vendor numbers

The gap: No benchmark separates algorithmic efficiency from hardware capability from software maturity.

Workload Characterization (Real Numbers)

LLM Prefill — Llama 3 70B, S=2048

Per token per layer (FLOPs): Q proj 134M + K/V proj 33.6M + O proj 134M + FFN (gate+up+down) 1.41B + attention 32,768 x S

Full prefill S=2048: 2048 x 142B = 291 TFLOP

  • Weights: 140 GB (loaded once if weight-stationary)
  • Arithmetic intensity: ~1,300 FLOP/byte — compute-bound
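The per-layer terms above can be sanity-checked in a few lines. Model dimensions (d_model 8192, 80 layers, 8 KV heads of head_dim 128, FFN width 28672) are assumptions taken from the public Llama 3 70B config:

```python
# Prefill FLOP count for Llama 3 70B at S=2048
# (2 FLOPs per multiply-accumulate throughout)
D, LAYERS, S = 8192, 80, 2048
KV_DIM = 1024            # 8 KV heads x 128 head_dim (GQA)
FFN = 28672

q_proj  = 2 * D * D              # 134M
kv_proj = 2 * 2 * D * KV_DIM     # 33.6M (K and V together)
o_proj  = 2 * D * D              # 134M
ffn     = 3 * 2 * D * FFN        # gate + up + down: 1.41B
attn    = 4 * D * S              # QK^T + AV: 32,768 x S

per_token_per_layer = q_proj + kv_proj + o_proj + ffn + attn
total = S * LAYERS * per_token_per_layer
print(f"{total / 1e12:.0f} TFLOP")   # ~291 TFLOP for the full prefill
```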

LLM Decode — Llama 3 70B

Batch size | FLOPs/step | AI (FLOP/byte) | Bound on A100
1          | 137 GFLOP  | 0.97           | Memory BW
8          | 1.1 TFLOP  | 7.8            | Memory BW
32         | 4.4 TFLOP  | 31             | Memory BW
128        | 17.5 TFLOP | 124            | Memory BW
256        | 35 TFLOP   | 246            | Balanced

At B=1, A100 utilization = 0.6%: each step loads 140 GB of weights to perform only 137 GFLOP of work. Batch size is everything.

KV cache per sequence at S=8192: 2.62 GB. Batch 256 at S=8192: 672 GB — exceeds H100 80GB.
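The table's AI column follows from weight traffic dominating decode. A sketch, assuming 70.6B params at FP16 (consistent with the 140 GB figure; small deviations from the table are rounding):

```python
# Decode arithmetic intensity vs. batch size: weight bytes are re-read
# every step, so AI grows linearly with B.
WEIGHT_BYTES = 70.6e9 * 2        # FP16
FLOPS_PER_TOKEN = 137e9          # linear layers only, from the table above

for b in (1, 8, 32, 128, 256):
    ai = b * FLOPS_PER_TOKEN / WEIGHT_BYTES
    print(f"B={b:3d}: {ai:7.2f} FLOP/byte")

# KV cache per token: 80 layers x 2 (K,V) x 1024 kv_dim x 2 bytes (FP16)
kv_per_token = 80 * 2 * 1024 * 2
kv_gb = kv_per_token * 8192 / 1e9
print(f"KV cache at S=8192: {kv_gb:.2f} GB/sequence")   # ~2.6-2.7 GB
```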

Diffusion — SD3 Medium (2B params)

  • 24 MMDiT blocks, 4429 tokens (4096 image + 333 text)
  • Per denoising step: 8.9 TFLOPs
  • 50 steps: 445 TFLOPs
  • Flux.1 (12B): 28 steps = 1,456 TFLOPs
  • Weights reused 28-50x — more compute-bound than LLM decode
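The step totals above are simple products, and the compute-bound claim follows from per-step weight reuse. A sketch, assuming FP16 weight bytes for the 2B-param SD3 model:

```python
# Diffusion total-FLOP and weight-reuse sketch
sd3_per_step, sd3_steps = 8.9e12, 50
print(f"SD3 total: {sd3_per_step * sd3_steps / 1e12:.0f} TFLOP")   # 445

# Each denoising step re-reads the full weights once, so per-step AI
# vs. weight traffic is already deep in compute-bound territory:
sd3_weight_bytes = 2e9 * 2       # 2B params at FP16 (assumption)
print(f"SD3 AI vs weights: {sd3_per_step / sd3_weight_bytes:.0f} FLOP/byte")
```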

MoE — Mixtral 8x7B

  • 32.2B FLOPs per token (only 12.9B active due to top-2 routing)
  • At B=1 per expert: GEMV territory, AI ~0.34 FLOP/byte
  • Even more batch-sensitive than dense models
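The extra batch sensitivity follows directly from routing: with top-2-of-8 and roughly uniform routing (an assumption for this toy sketch), each expert's effective batch is only B/4:

```python
# Effective per-expert batch under top-2-of-8 routing (uniform routing assumed)
N_EXPERTS, TOP_K = 8, 2

def expected_tokens_per_expert(batch: int) -> float:
    return batch * TOP_K / N_EXPERTS

for b in (1, 8, 32, 128, 256):
    # Per-expert GEMMs stay in GEMV territory until B is large
    print(f"B={b:3d}: {expected_tokens_per_expert(b):5.1f} tokens/expert")
```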

Vision — ViT-L (307M)

  • 122.7 GFLOPs per image (197 tokens)
  • AI ~143 FLOP/byte — balanced at A100 ridge point
  • Real-time 30 fps easily achievable on A100 (~0.39 ms/image at peak FP16 throughput)
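The 0.39 ms figure is just ideal compute time at the A100's FP16 tensor-core peak (312 TFLOP/s), assuming 100% utilization:

```python
# Ideal ViT-L latency on A100: pure compute time at peak, no overheads
FLOPS_PER_IMAGE = 122.7e9
A100_PEAK = 312e12               # FP16 tensor-core peak, FLOP/s

latency_ms = FLOPS_PER_IMAGE / A100_PEAK * 1e3
print(f"{latency_ms:.2f} ms")    # 0.39 ms: 30 fps needs ~1.2% duty cycle
```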

Architecture Models

Systolic Array (TPU-like)

peak_ops = n_arrays x N^2 x 2 x f_clk
utilization(M) = min(1, M/N)  -- M=1 gives 0.78% util on 128x128 array

Weight-stationary: load weights once, stream activations. 90%+ util for M >= 256.
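A minimal version of the utilization model above (N=128 array; the min(1, M/N) form is the approximation stated in the pseudocode):

```python
# Systolic-array row utilization: an M-row input tile occupies min(M, N)
# of the array's N rows.
def utilization(m: int, n: int = 128) -> float:
    return min(1.0, m / n)

print(f"M=1:   {utilization(1):.2%}")     # 0.78%  (decode at B=1)
print(f"M=256: {utilization(256):.0%}")   # 100%   (prefill / large batch)
```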

SIMT (GPU-like)

Roofline model:

if AI > peak_ops/hbm_bw: t = FLOPs / (peak x util)  -- compute bound
else: t = bytes / hbm_bw                              -- memory bound

Utilization lookup (empirical from cuBLAS): M<32: 15%, M=128-512: 70%, M>4096: 85%

Power: P = P_idle + alpha x GFLOPS + beta x GB/s (A100: 60 W idle + 0.2 mW per GFLOP/s + 50 mW per GB/s)
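The roofline and power pieces above combine into a few lines; A100 constants come from the section, and the utilization value is illustrative:

```python
# Roofline latency + affine power model with the A100 coefficients above
PEAK = 312e12          # FP16 FLOP/s
HBM_BW = 2.0e12        # bytes/s
P_IDLE, ALPHA, BETA = 60.0, 0.2e-3, 50e-3   # W, W per GFLOP/s, W per GB/s

def step_time(flops: float, bytes_moved: float, util: float = 0.7) -> float:
    if flops / bytes_moved > PEAK / HBM_BW:
        return flops / (PEAK * util)        # compute-bound
    return bytes_moved / HBM_BW             # memory-bound

def power_w(gflops_rate: float, gb_per_s: float) -> float:
    return P_IDLE + ALPHA * gflops_rate + BETA * gb_per_s

# B=1 decode: 137 GFLOP against 140 GB of weights -> memory-bound
t = step_time(137e9, 140e9)
print(f"{t * 1e3:.0f} ms/token")            # 70 ms: pure bandwidth limit
```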

In-Memory Compute (d-Matrix-like)

ADC is the bottleneck: 256 ADCs at 1 GHz = 131 TOPS per tile (each conversion resolves one 256-element dot product, at 2 ops per MAC).

Precision-throughput tradeoff: n_passes = ceil(input_bits/dac_bits) x ceil(effective_bits/adc_bits)

10-50x energy efficiency over digital for INT4 weights; FP16 requires multiple passes and negates the benefit.
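The pass count in code (bit widths in the examples are illustrative):

```python
# Multi-pass count for bit-serial in-memory compute: the DAC limits input
# bits per pass, the ADC limits output bits resolved per pass.
import math

def n_passes(input_bits: int, dac_bits: int,
             effective_bits: int, adc_bits: int) -> int:
    return math.ceil(input_bits / dac_bits) * math.ceil(effective_bits / adc_bits)

print(n_passes(4, 4, 8, 8))     # INT4: 1 pass -> full tile throughput
print(n_passes(16, 4, 16, 8))   # FP16-style: 8 passes -> throughput / 8
```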

Dataflow (Cerebras-like)

When weights fit on-chip (44 GB SRAM on WSE-3): zero weight loading, decode becomes compute-bound. Llama 3 70B at FP16 (140 GB): does NOT fit. Llama 3 8B (16 GB): FITS. Power: ~15 kW for inference is extreme.
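The fit check behind the decode-becomes-compute-bound claim, as a sketch (the INT4 line is an illustrative extrapolation, not a claim from the section):

```python
# On-chip weight-fit check against WSE-3's 44 GB of SRAM
WSE3_SRAM_GB = 44

def fits_on_chip(params_billion: float, bytes_per_param: float = 2) -> bool:
    return params_billion * bytes_per_param <= WSE3_SRAM_GB

print(fits_on_chip(70))                       # False: 140 GB at FP16
print(fits_on_chip(8))                        # True:  16 GB at FP16
print(fits_on_chip(70, bytes_per_param=0.5))  # True:  35 GB at INT4 (illustrative)
```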

Reconfigurable (CGRA/Fractile-like)

Operator fusion advantage: GEMM+bias+ReLU+LayerNorm fused, intermediates stay on-chip. Reconfiguration overhead: ~10us per config, 80 layers x 3 groups = 2.4ms — significant for decode.
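The 2.4 ms figure is the product of the overheads above, one reconfiguration per layer-group:

```python
# Reconfiguration overhead: ~10 us per config, 3 configs per layer
RECONFIG_US = 10
LAYERS, GROUPS_PER_LAYER = 80, 3

overhead_ms = LAYERS * GROUPS_PER_LAYER * RECONFIG_US / 1e3
print(f"{overhead_ms:.1f} ms per full-model pass")   # 2.4 ms
```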

Validation Strategy

Compare InferBench simulator (a100_sxm.yaml + workload graphs) against Alan’s energy-study A100 measurements. Target: <15% MAPE for latency, <20% MAPE for energy.

Calibration knobs: utilization table entries, power model coefficients.
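MAPE as used for the <15%/<20% targets, with hypothetical toy values standing in for simulator output and A100 measurements:

```python
# Mean absolute percentage error between predictions and measurements
def mape(predicted, measured):
    return 100 * sum(abs(p - m) / m
                     for p, m in zip(predicted, measured)) / len(measured)

sim_ms  = [12.1, 48.0, 3.3]    # hypothetical simulator latencies
real_ms = [11.0, 52.0, 3.5]    # hypothetical measured latencies
print(f"{mape(sim_ms, real_ms):.1f}% MAPE")   # ~7.8%, within the 15% target
```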

Publication Strategy

  1. Blog: “Why Every Inference ASIC Benchmark Is Wrong” — SemiAnalysis style
  2. Email Dylan Patel: “You proposed the right metrics at OCP. We built the tool.”
  3. Twitter thread with comparison table and roofline plots
  4. Workshop paper: MLSys 2027 or ISCA ML+Arch workshop
  5. Make competing chip companies contribute their own architecture specs — creates self-sustaining news cycle