Edge Inference for Robotics: Hardware Accelerator Analysis

The Gap Nobody Is Filling

Current edge accelerators waste 60-80% of peak performance on diffusion-based robotic control policies due to repeated DRAM weight fetches in the denoising loop. A weight-stationary design with a 32-64 MB on-chip scratchpad can deliver a 10-15x latency improvement.

Robotics ML Workload Profiles

Model                           Params  FLOPs/step               Memory         Latency target
ViT-B/14 (DINOv2)               86M     17.6 GF                  150 MB         <10 ms
ViT-L/14 (DINOv2)               307M    61.6 GF                  500 MB         <15 ms
Diffusion Policy (U-Net)        25M     5 GF x 16 steps = 80 GF  50 MB weights  <100 ms (10 Hz)
Diffusion Policy (Transformer)  40M     12 GF x 16 = 192 GF      80 MB weights  <100 ms
Octo-Base                       93M     20 GF                    200 MB         5-10 Hz
pi0 (Physical Intelligence)     3B      7 TF total               6 GB FP16      3 Hz practical
OpenVLA                         7.5B    ~15 TF                   15 GB FP16     Cloud only

Why Diffusion Policies Are Architecturally Unique

The denoising loop runs the SAME weights 10-100 times with different inputs:

  • Weights stay constant across steps (weight-stationary opportunity)
  • Activations change but have identical shapes (predictable allocation)
  • Loop count known at compile time (prefetch scheduling)

On a GPU (Jetson Orin), weights are evicted and reloaded from DRAM on every step because the 4-6 MB L2 cache cannot hold 50-200 MB of weights. A weight-stationary architecture with 50+ MB of SRAM eliminates this traffic entirely.
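A quick sketch of what that refetch costs on AGX Orin. The figures (80 MB of FP16 weights, 16 steps, 205 GB/s) are assumptions taken from the tables in this document, not measurements; real traffic also includes activations.

```python
# Lower bound on DRAM weight traffic per action when nothing stays resident.
weights_mb = 80       # 40M-param diffusion policy in FP16 (from the table above)
steps = 16            # denoising iterations per action chunk
dram_bw_gbps = 205    # AGX Orin peak memory bandwidth, GB/s

# MB moved / (GB/s) -> seconds -> ms; ignores activations and compute entirely.
refetch_ms = weights_mb * steps / (dram_bw_gbps * 1000) * 1000
print(f"{refetch_ms:.1f} ms of pure weight refetch per action")  # ~6.2 ms
```

Even at peak bandwidth, weight refetch alone burns several milliseconds per action before any compute happens.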

Estimated speedup: a 40M-param network (80 MB in FP16) held entirely in on-chip SRAM. Each step costs 40 us of activation I/O plus 120 us of compute = 160 us; 16 steps take 2.56 ms, versus 30-40 ms on the Orin GPU: a 12-15x improvement.
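The same estimate as arithmetic. The per-step costs (40 us I/O, 120 us compute) and the 30-40 ms GPU baseline are the assumptions stated above:

```python
# Weight-stationary speedup estimate; per-step costs are assumed, not measured.
act_io_us, compute_us, steps = 40, 120, 16

sram_ms = (act_io_us + compute_us) * steps / 1000   # 2.56 ms per action
orin_gpu_ms = (30, 40)                              # assumed GPU baseline range
speedup = (orin_gpu_ms[0] / sram_ms, orin_gpu_ms[1] / sram_ms)
# speedup spans roughly 11.7x-15.6x, i.e. the 12-15x quoted above.
```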

VLA Pipeline Latency on Jetson Orin AGX

Camera capture + ISP:           5-10 ms
Vision encoder (ViT-B, FP16):   8-15 ms
LM forward (3B, INT4):          50-100 ms  <-- BOTTLENECK
Action decoder (10 steps):      20-40 ms
Post-processing + safety:       1-2 ms
Total:                          84-167 ms = 6-12 Hz

Manipulation needs >20 Hz; the LM forward pass is the bottleneck.
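Summing the per-stage ranges from the breakdown above confirms the budget:

```python
# Pipeline budget check; per-stage (best, worst) ranges are from the breakdown.
stages_ms = {
    "camera + ISP":          (5, 10),
    "vision encoder (ViT-B)": (8, 15),
    "LM forward (3B, INT4)": (50, 100),
    "action decoder":        (20, 40),
    "post + safety":         (1, 2),
}
best = sum(lo for lo, _ in stages_ms.values())   # 84 ms
worst = sum(hi for _, hi in stages_ms.values())  # 167 ms
print(f"{best}-{worst} ms -> {1000/worst:.0f}-{1000/best:.0f} Hz")
```

The LM stage alone consumes 60% of the worst-case budget, so that is where any acceleration effort pays off first.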

Hardware Landscape

Platform            TOPS (INT8)  FP16 TF  Power    Memory BW  Who uses it
Jetson Orin NX      100          ~10      10-25 W  102 GB/s   Most research labs
Jetson AGX Orin     275          ~27.5    15-60 W  205 GB/s   Figure, humanoid prototypes
Jetson Thor (2025)  800          ~200     30-75 W  ~400 GB/s  Next-gen (1X partnership)
Hailo-8             26           N/A      2.5 W    ~32 GB/s   Industrial, emerging robotics
Tesla FSD HW3       144          N/A      ~36 W    ~64 GB/s   Tesla only
Qualcomm RB5        15           N/A      5-10 W   44 GB/s    Drones
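The memory-bandwidth column explains why LM inference dominates the pipeline: autoregressive decoding reads every weight once per token. A floor estimate for a 3B-param model at INT4, assuming no batching or weight reuse across tokens (both assumptions, not measurements):

```python
# Bandwidth floor for decoding one LM token: all weights stream from DRAM once.
weight_gb = 3e9 * 0.5 / 1e9   # 3B params x 0.5 bytes/param (INT4) = 1.5 GB

platforms_gbps = {"Orin NX": 102, "AGX Orin": 205, "Thor (est.)": 400}
floor_ms = {name: weight_gb / bw * 1000 for name, bw in platforms_gbps.items()}
# AGX Orin: ~7.3 ms/token even at peak bandwidth, so the 50-100 ms LM budget
# above corresponds to only a handful of decoded action tokens.
```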

What Is Missing

  1. No diffusion-optimized edge accelerator — weight reuse across denoising steps is ignored
  2. No unified VLA accelerator — vision (compute-bound), LM (bandwidth-bound), action head (iterative) all need different datapaths
  3. Sensor fusion on CPU — debayering, undistortion, temporal alignment could be hardened
  4. No dynamic compute allocation — robot workload varies 10x between manipulation and transit