Motivation. Visual token redundancy varies significantly across robot manipulation stages. VLA-ADP exploits end-effector motion as a dynamic gating signal to identify and prune redundant tokens at each timestep, reducing computation without sacrificing task success.
Real-world ALOHA demonstrations — VLA-ADP applied to OpenVLA-OFT (1.5× speedup)
We propose Action-aware Dynamic Pruning (ADP), a training-free, plug-and-play method that adaptively prunes redundant visual tokens across manipulation stages by combining text-driven token relevance with an action-aware gating signal derived from end-effector motion.
ADP Architecture. ADP maintains an observation window of past states and uses end-effector velocity/acceleration to produce a dynamic gating decision. The gate selects between sparse and dense token retention ratios, and text-driven cross-attention scores rank tokens by relevance before pruning.
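The gating step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window length, the motion threshold, and the two retention ratios are placeholder values, and the motion score (last-step speed plus acceleration magnitude) is one plausible choice of signal.

```python
import numpy as np

def dynamic_gate(ee_positions, dt=0.1, motion_threshold=0.05,
                 sparse_ratio=0.25, dense_ratio=0.75):
    """Choose a token retention ratio from recent end-effector motion.

    ee_positions: (T, 3) observation window of past end-effector positions.
    All parameter values here are illustrative, not the paper's settings.
    """
    vel = np.diff(ee_positions, axis=0) / dt   # (T-1, 3) finite-difference velocity
    acc = np.diff(vel, axis=0) / dt            # (T-2, 3) finite-difference acceleration
    # Combine current speed and acceleration magnitude into one motion score.
    motion = np.linalg.norm(vel[-1]) + np.linalg.norm(acc[-1])
    # Fast, transit-like motion -> aggressive (sparse) pruning;
    # slow, fine manipulation -> conservative (dense) retention.
    return sparse_ratio if motion > motion_threshold else dense_ratio
```

The intuition: during large free-space motions the policy needs little visual detail, so most tokens can be dropped; during contact-rich fine manipulation the gate falls back to the dense ratio.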
Token Pruning. Spatially redundant background tokens (low attention score) are removed while task-relevant tokens are preserved, maintaining action prediction fidelity.
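A minimal sketch of this selection step, assuming per-token relevance scores are already available (e.g. from text-to-image cross-attention); the paper's exact scoring and selection details may differ:

```python
import numpy as np

def prune_tokens(visual_tokens, attn_scores, keep_ratio):
    """Keep the top keep_ratio fraction of visual tokens by relevance score.

    visual_tokens: (N, D) token embeddings.
    attn_scores:   (N,) text-driven relevance per token (illustrative input).
    Returns the retained tokens and their original indices.
    """
    n_keep = max(1, int(round(len(attn_scores) * keep_ratio)))
    # Take the highest-relevance tokens, then restore original order
    # so the tokens' positional structure is preserved.
    keep_idx = np.sort(np.argsort(attn_scores)[-n_keep:])
    return visual_tokens[keep_idx], keep_idx
```

With `keep_ratio` supplied by the dynamic gate, low-scoring background tokens are dropped while task-relevant tokens pass through to action prediction unchanged.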
Comparison against OpenVLA, SparseVLM, FastVLM, and other VLA methods across four LIBERO task suites (Spatial, Object, Goal, Long). VLA-ADP achieves 94.4–99.0% success rate (SR) with a 1.13–1.35× LLM speedup.
LIBERO benchmark task suites used for simulation evaluation: Spatial, Object, Goal, and Long.
VLA-ADP improves SR from 85.8% to 88.3% while reducing latency by 33% (76.9 → 51.8 ms), achieving a 1.49× speedup on real hardware.
Real-world experimental setup: bimanual ALOHA robot performing tabletop manipulation tasks.
@article{pei2025action,
  title={Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation},
  author={Pei, Xiaohuan and Chen, Yuxing and Xu, Siyu and Wang, Yunke and Shi, Yuheng and Xu, Chang},
  journal={arXiv preprint arXiv:2509.22093},
  year={2025}
}
We thank the authors of OpenVLA-OFT, OpenVLA, and Hugging Face Transformers for making their code publicly available. This project page was inspired by the Nerfies template.