Edge Inference for Robotics: Hardware Accelerator Analysis
The Gap Nobody Is Filling
Current edge accelerators waste 60-80% of peak performance on diffusion-based robotic control policies due to repeated DRAM weight fetches in the denoising loop. A weight-stationary optimization with 32-64 MB on-chip scratchpad can deliver 10-15x latency improvement.
Robotics ML Workload Profiles
| Model | Params | FLOPs | Memory | Latency Target |
|---|---|---|---|---|
| ViT-B/14 (DINOv2) | 86M | 17.6 GF | 150 MB | <10ms |
| ViT-L/14 (DINOv2) | 307M | 61.6 GF | 500 MB | <15ms |
| Diffusion Policy (U-Net) | 25M | 5 GF x16 steps = 80 GF | 50 MB weights | <100ms (10 Hz) |
| Diffusion Policy (Transformer) | 40M | 12 GF x16 steps = 192 GF | 80 MB weights | <100ms |
| Octo-Base | 93M | 20 GF | 200 MB | 5-10 Hz |
| pi0 (Physical Intelligence) | 3B | 7 TF total | 6 GB FP16 | 3 Hz practical |
| OpenVLA | 7.5B | ~15 TF | 15 GB FP16 | Cloud only |
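These latency targets can be sanity-checked with a roofline-style lower bound: per-inference time is at least max(FLOPs / peak throughput, weight bytes / memory bandwidth). A minimal sketch, assuming Jetson AGX Orin-class peaks (~27.5 FP16 TFLOPS, ~205 GB/s); the function name is illustrative:

```python
# Roofline-style lower bound on single-inference latency.
# Peak figures are illustrative, roughly Jetson AGX Orin class.
PEAK_FP16_TFLOPS = 27.5   # dense FP16 throughput
MEM_BW_GBPS = 205.0       # LPDDR5 bandwidth

def latency_floor_ms(gflops: float, weight_mb: float) -> float:
    """max(compute-bound, bandwidth-bound) time.
    Unit algebra: GF / (TF/s) = ms, and MB / (GB/s) = ms."""
    return max(gflops / PEAK_FP16_TFLOPS, weight_mb / MEM_BW_GBPS)

# ViT-B/14: ~0.73 ms floor vs the 8-15 ms observed in practice --
# an order of magnitude off peak, consistent with the waste claim above.
print(round(latency_floor_ms(17.6, 150), 2))  # -> 0.73
```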
Why Diffusion Policies Are Architecturally Unique
The denoising loop runs the SAME weights 10-100 times with different inputs:
- Weights stay constant across steps (weight-stationary opportunity)
- Activations change but have identical shapes (predictable allocation)
- Loop count known at compile time (prefetch scheduling)
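A toy sketch of that access pattern (the `denoise_loop` function and its tanh "denoiser" are illustrative stand-ins, not a real policy network):

```python
import math

def denoise_loop(weights, obs, num_steps=16):
    """Dataflow sketch of a diffusion-policy denoising loop.

    Key property: `weights` is read on every iteration but never written,
    so a weight-stationary accelerator can fetch it from DRAM once and
    keep it pinned in SRAM for all `num_steps` iterations.
    """
    action = [0.0] * len(obs)          # a real policy starts from Gaussian noise
    for _ in range(num_steps):         # trip count known at compile time
        # Same weights, identical activation shapes every iteration:
        action = [math.tanh(sum(w * a for w, a in zip(row, action)) + o)
                  for row, o in zip(weights, obs)]
    return action
```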
On GPU (Jetson Orin): weights evicted and reloaded from DRAM each step because L2 cache (4-6 MB) cannot hold 50-200 MB of weights. A weight-stationary architecture with 50+ MB SRAM eliminates this entirely.
Estimated speedup: hold a 40M-param network (80 MB FP16) entirely in on-chip SRAM. Per step: 40 us activation I/O + 120 us compute = 160 us; 16 steps = 2.56 ms, vs 30-40 ms on the Orin GPU: a 12-15x improvement.
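The arithmetic behind this estimate, spelled out as a checkable sketch (all inputs taken from the text above):

```python
# Weight-stationary latency estimate: 40M-param FP16 policy, 80 MB held in SRAM.
STEPS = 16
ACT_IO_US = 40      # per-step activation read/write (on-chip)
COMPUTE_US = 120    # per-step MAC time
total_ms = STEPS * (ACT_IO_US + COMPUTE_US) / 1000
orin_ms = (30, 40)  # DRAM-bound baseline range quoted above
print(f"{total_ms:.2f} ms vs {orin_ms[0]}-{orin_ms[1]} ms "
      f"-> {orin_ms[0] / total_ms:.1f}-{orin_ms[1] / total_ms:.1f}x")
# -> 2.56 ms vs 30-40 ms -> 11.7-15.6x
```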
VLA Pipeline Latency on Jetson Orin AGX
- Camera capture + ISP: 5-10 ms
- Vision encoder (ViT-B, FP16): 8-15 ms
- LM forward (3B, INT4): 50-100 ms <-- BOTTLENECK
- Action decoder (10 steps): 20-40 ms
- Post-processing + safety: 1-2 ms

Total: 84-167 ms = 6-12 Hz
Manipulation needs >20 Hz; the LM forward pass is the dominant cost.
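The same budget as a checkable sum (stage ranges copied from the breakdown above):

```python
# End-to-end VLA pipeline budget on Jetson AGX Orin (ms ranges from above).
stages = {
    "camera + ISP":       (5, 10),
    "vision encoder":     (8, 15),
    "LM forward (INT4)":  (50, 100),  # dominant term
    "action decoder":     (20, 40),
    "post-proc + safety": (1, 2),
}
lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(f"{lo}-{hi} ms -> {1000/hi:.1f}-{1000/lo:.1f} Hz")  # 84-167 ms -> 6.0-11.9 Hz
```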
Hardware Landscape
| Platform | TOPS INT8 | FP16 TF | Power | Memory BW | Who Uses It |
|---|---|---|---|---|---|
| Jetson Orin NX | 100 | ~10 | 10-25W | 102 GB/s | Most research labs |
| Jetson AGX Orin | 275 | ~27.5 | 15-60W | 205 GB/s | Figure, humanoid prototypes |
| Jetson Thor (2025) | 800 | ~200 | 30-75W | ~400 GB/s | Next-gen (1X partnership) |
| Hailo-8 | 26 | N/A | 2.5W | ~32 GB/s | Industrial, emerging robotics |
| Tesla FSD HW3 | 144 | N/A | ~36W | ~64 GB/s | Tesla only |
| Qualcomm RB5 | 15 | N/A | 5-10W | 44 GB/s | Drones |
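Autoregressive LM decoding is bandwidth-bound: each generated token streams the full weight set from DRAM, so a first-order tokens/s ceiling is memory bandwidth divided by weight bytes. A sketch using the bandwidth column above (the dictionary and function names are illustrative):

```python
# First-order decode ceiling: tokens/s <= mem_bandwidth / weight_bytes.
PLATFORM_BW_GBPS = {       # from the table above
    "Jetson Orin NX": 102,
    "Jetson AGX Orin": 205,
    "Jetson Thor": 400,
}

def decode_tps_ceiling(params_billions: float, bits_per_weight: int,
                       bw_gbps: float) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # B params -> GB of weights
    return bw_gbps / weight_gb

# 3B-param model at INT4 (1.5 GB of weights):
for name, bw in PLATFORM_BW_GBPS.items():
    print(f"{name}: ~{decode_tps_ceiling(3.0, 4, bw):.0f} tok/s")
```

At roughly 137 tok/s on AGX Orin, a short action sequence of ~7-14 tokens already lands in the 50-100 ms range quoted in the pipeline breakdown, consistent with the LM forward pass being the bottleneck.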
What Is Missing
- No diffusion-optimized edge accelerator — weight reuse across denoising steps is ignored
- No unified VLA accelerator — vision (compute-bound), LM (bandwidth-bound), action head (iterative) all need different datapaths
- Sensor fusion on CPU — debayering, undistortion, temporal alignment could be hardened
- No dynamic compute allocation — robot workload varies 10x between manipulation and transit