Edge Inference for Robotics: Hardware Accelerator Analysis
The Gap Nobody Is Filling
Current edge accelerators waste 60-80% of peak performance on diffusion-based robotic control policies due to repeated DRAM weight fetches in the denoising loop. A weight-stationary optimization with 32-64 MB on-chip scratchpad can deliver 10-15x latency improvement.
Robotics ML Workload Profiles
| Model | Params | FLOPs | Memory | Latency Target |
|---|---|---|---|---|
| ViT-B/14 (DINOv2) | 86M | 17.6 GF | 150 MB | <10ms |
| ViT-L/14 (DINOv2) | 307M | 61.6 GF | 500 MB | <15ms |
| Diffusion Policy (U-Net) | 25M | 5 GF x16 steps = 80 GF | 50 MB weights | <100ms (10 Hz) |
| Diffusion Policy (Transformer) | 40M | 12 GF x16 steps = 192 GF | 80 MB weights | <100ms |
| Octo-Base | 93M | 20 GF | 200 MB | 5-10 Hz |
| pi0 (Physical Intelligence) | 3B | 7 TF total | 6 GB FP16 | 3 Hz practical |
| OpenVLA | 7.5B | ~15 TF | 15 GB FP16 | Cloud only |
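These latency targets can be sanity-checked with a roofline-style lower bound: per-inference time is at least max(FLOPs / peak throughput, weight bytes / memory bandwidth). A minimal sketch, assuming Jetson AGX Orin-class peaks (~27.5 FP16 TFLOPS, ~205 GB/s); the function name is illustrative:

```python
# Roofline-style lower bound on single-inference latency.
# Peak figures are illustrative, roughly Jetson AGX Orin class.
PEAK_FP16_TFLOPS = 27.5   # dense FP16 throughput
MEM_BW_GBPS = 205.0       # LPDDR5 bandwidth

def latency_floor_ms(gflops: float, weight_mb: float) -> float:
    """max(compute-bound, bandwidth-bound) time.
    Unit algebra: GF / (TF/s) = ms, and MB / (GB/s) = ms."""
    return max(gflops / PEAK_FP16_TFLOPS, weight_mb / MEM_BW_GBPS)

# ViT-B/14: ~0.73 ms floor vs the 8-15 ms observed in practice --
# an order of magnitude off peak, consistent with the waste claim above.
print(round(latency_floor_ms(17.6, 150), 2))  # -> 0.73
```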
Why Diffusion Policies Are Architecturally Unique
The denoising loop runs the SAME weights 10-100 times with different inputs:
- Weights stay constant across steps (weight-stationary opportunity)
- Activations change but have identical shapes (predictable allocation)
- Loop count known at compile time (prefetch scheduling)
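A toy sketch of that access pattern (the `denoise_loop` function and its tanh "denoiser" are illustrative stand-ins, not a real policy network):

```python
import math

def denoise_loop(weights, obs, num_steps=16):
    """Dataflow sketch of a diffusion-policy denoising loop.

    Key property: `weights` is read on every iteration but never written,
    so a weight-stationary accelerator can fetch it from DRAM once and
    keep it pinned in SRAM for all `num_steps` iterations.
    """
    action = [0.0] * len(obs)          # a real policy starts from Gaussian noise
    for _ in range(num_steps):         # trip count known at compile time
        # Same weights, identical activation shapes every iteration:
        action = [math.tanh(sum(w * a for w, a in zip(row, action)) + o)
                  for row, o in zip(weights, obs)]
    return action
```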
On GPU (Jetson Orin): weights evicted and reloaded from DRAM each step because L2 cache (4-6 MB) cannot hold 50-200 MB of weights. A weight-stationary architecture with 50+ MB SRAM eliminates this entirely.
Estimated speedup: hold a 40M-param network (80 MB FP16) entirely in on-chip SRAM. Per step: 40 us activation I/O + 120 us compute = 160 us; 16 steps = 2.56 ms, vs 30-40 ms on the Orin GPU: a 12-15x improvement.
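The arithmetic behind this estimate, spelled out as a checkable sketch (all inputs taken from the text above):

```python
# Weight-stationary latency estimate: 40M-param FP16 policy, 80 MB held in SRAM.
STEPS = 16
ACT_IO_US = 40      # per-step activation read/write (on-chip)
COMPUTE_US = 120    # per-step MAC time
total_ms = STEPS * (ACT_IO_US + COMPUTE_US) / 1000
orin_ms = (30, 40)  # DRAM-bound baseline range quoted above
print(f"{total_ms:.2f} ms vs {orin_ms[0]}-{orin_ms[1]} ms "
      f"-> {orin_ms[0] / total_ms:.1f}-{orin_ms[1] / total_ms:.1f}x")
# -> 2.56 ms vs 30-40 ms -> 11.7-15.6x
```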
VLA Pipeline Latency on Jetson Orin AGX
- Camera capture + ISP: 5-10 ms
- Vision encoder (ViT-B, FP16): 8-15 ms
- LM forward (3B, INT4): 50-100 ms <-- BOTTLENECK
- Action decoder (10 steps): 20-40 ms
- Post-processing + safety: 1-2 ms

Total: 84-167 ms = 6-12 Hz
Manipulation needs >20 Hz; the LM forward pass is the dominant cost.
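The same budget as a checkable sum (stage ranges copied from the breakdown above):

```python
# End-to-end VLA pipeline budget on Jetson AGX Orin (ms ranges from above).
stages = {
    "camera + ISP":       (5, 10),
    "vision encoder":     (8, 15),
    "LM forward (INT4)":  (50, 100),  # dominant term
    "action decoder":     (20, 40),
    "post-proc + safety": (1, 2),
}
lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(f"{lo}-{hi} ms -> {1000/hi:.1f}-{1000/lo:.1f} Hz")  # 84-167 ms -> 6.0-11.9 Hz
```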
Hardware Landscape
| Platform | TOPS INT8 | FP16 TF | Power | Memory BW | Who Uses It |
|---|---|---|---|---|---|
| Jetson Orin NX | 100 | ~10 | 10-25W | 102 GB/s | Most research labs |
| Jetson AGX Orin | 275 | ~27.5 | 15-60W | 205 GB/s | Figure, humanoid prototypes |
| Jetson Thor (2025) | 800 | ~200 | 30-75W | ~400 GB/s | Next-gen (1X partnership) |
| Hailo-8 | 26 | N/A | 2.5W | ~32 GB/s | Industrial, emerging robotics |
| Tesla FSD HW3 | 144 | N/A | ~36W | ~64 GB/s | Tesla only |
| Qualcomm RB5 | 15 | N/A | 5-10W | 44 GB/s | Drones |
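Autoregressive LM decoding is bandwidth-bound: each generated token streams the full weight set from DRAM, so a first-order tokens/s ceiling is memory bandwidth divided by weight bytes. A sketch using the bandwidth column above (the dictionary and function names are illustrative):

```python
# First-order decode ceiling: tokens/s <= mem_bandwidth / weight_bytes.
PLATFORM_BW_GBPS = {       # from the table above
    "Jetson Orin NX": 102,
    "Jetson AGX Orin": 205,
    "Jetson Thor": 400,
}

def decode_tps_ceiling(params_billions: float, bits_per_weight: int,
                       bw_gbps: float) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # B params -> GB of weights
    return bw_gbps / weight_gb

# 3B-param model at INT4 (1.5 GB of weights):
for name, bw in PLATFORM_BW_GBPS.items():
    print(f"{name}: ~{decode_tps_ceiling(3.0, 4, bw):.0f} tok/s")
```

At roughly 137 tok/s on AGX Orin, a short action sequence of ~7-14 tokens already lands in the 50-100 ms range quoted in the pipeline breakdown, consistent with the LM forward pass being the bottleneck.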
What Is Missing
- No diffusion-optimized edge accelerator — weight reuse across denoising steps is ignored
- No unified VLA accelerator — vision (compute-bound), LM (bandwidth-bound), action head (iterative) all need different datapaths
- Sensor fusion on CPU — debayering, undistortion, temporal alignment could be hardened
- No dynamic compute allocation — robot workload varies 10x between manipulation and transit