SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Abstract
We introduce SmolVLA, a compact Vision-Language-Action (VLA) model trained on the web-scale Open-X-Embodiment (OXE) dataset. Despite its modest size (2.1B parameters), SmolVLA performs strongly on real-world robotics benchmarks, outperforming much larger models such as RT-2-X (55B). SmolVLA also runs efficiently on consumer-grade hardware, reaching roughly 20-30 Hz on an NVIDIA RTX 3090 GPU and 45-60 Hz on a laptop-grade RTX 4070 GPU. This efficiency comes from pairing pre-trained vision (SigLIP) and language (Gemma) backbones with a connector module that bridges the two for action prediction. By releasing SmolVLA, we aim to make VLA-based robotics practical on cost-effective hardware and accessible to the broader community for both research and applications.
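To make the layout described above concrete, the following is a minimal, self-contained PyTorch sketch of the pattern the abstract refers to: a vision encoder and a language backbone coupled through a connector, followed by an action head. Everything here is an illustrative assumption rather than the released SmolVLA implementation: the class names (Connector, ToyVLA), the linear projection, the action dimension and horizon, and the small Transformer stand-ins used in place of the actual pre-trained SigLIP and Gemma weights (kept local so the snippet runs without downloads).

import torch
import torch.nn as nn

class Connector(nn.Module):
    # Projects vision-encoder tokens into the language backbone's embedding space.
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)

class ToyVLA(nn.Module):
    # Stand-in for the VLA stack: vision backbone + connector + language backbone
    # + action head. The real model would load pre-trained SigLIP / Gemma weights;
    # small Transformer encoders are used here so the sketch runs anywhere.
    def __init__(self, vision_dim: int = 768, lm_dim: int = 2048,
                 action_dim: int = 7, horizon: int = 10):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.connector = Connector(vision_dim, lm_dim)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(lm_dim, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, image_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis = self.vision_encoder(image_tokens)        # (B, N_img, vision_dim)
        vis = self.connector(vis)                      # (B, N_img, lm_dim)
        fused = torch.cat([vis, text_embeds], dim=1)   # image tokens prepended to text
        hidden = self.language_model(fused)            # (B, N_img + N_txt, lm_dim)
        pooled = hidden.mean(dim=1)                    # simple pooling over the sequence
        actions = self.action_head(pooled)             # (B, action_dim * horizon)
        return actions.view(-1, self.horizon, self.action_dim)

model = ToyVLA()
image_tokens = torch.randn(1, 196, 768)   # e.g. 14x14 patch tokens from a vision encoder
text_embeds = torch.randn(1, 16, 2048)    # e.g. embedded instruction tokens
actions = model(image_tokens, text_embeds)
print(actions.shape)                      # torch.Size([1, 10, 7])

In this sketch the connector is the only piece built from scratch; it is the bridge that lets pre-trained vision and language backbones be reused for action prediction, which is the design choice the abstract attributes SmolVLA's efficiency to.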
Citation
If you use SmolVLA in your research, please consider citing our paper:
@misc{tu2024smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Cheng-Hao Tu and Zicong Fan and Siyuan Geng and Chuer Pan and Oier Mees and Ridhi K. Jobanputra and Shuangfei Zhai and Ken Ooi and Yevgen Chebotar and Ted Xiao and Andy Zeng and Ting-Wei Lin and Brian Ichter and Sergey Levine and Fei Xia},
  year={2024},
  eprint={2405.19726},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}