SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Abstract
We introduce SmolVLA, a compact Vision-Language-Action (VLA) model trained on the web-scale Open-X-Embodiment (OXE) dataset. Despite its modest size (2.1B parameters), SmolVLA performs strongly on real-world robotics benchmarks, outperforming much larger models such as RT-2-X (55B). SmolVLA also runs efficiently on consumer-grade hardware, reaching roughly 20-30 Hz on an NVIDIA RTX 3090 GPU and 45-60 Hz on a laptop-grade RTX 4070 GPU. This efficiency comes from pairing pre-trained vision (SigLIP) and language (Gemma) backbones with a connector module that bridges the two for action prediction. By releasing SmolVLA, we aim to make VLA-based robotics practical on cost-effective hardware and accessible to the broader community for both research and applications.
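To make the layout described above concrete, the following is a minimal, self-contained PyTorch sketch of the pattern the abstract refers to: a vision encoder and a language backbone coupled through a connector, followed by an action head. Everything here is an illustrative assumption rather than the released SmolVLA implementation: the class names (Connector, ToyVLA), the linear projection, the action dimension and horizon, and the small Transformer stand-ins used in place of the actual pre-trained SigLIP and Gemma weights (kept local so the snippet runs without downloads).

import torch
import torch.nn as nn

class Connector(nn.Module):
    # Projects vision-encoder tokens into the language backbone's embedding space.
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)

class ToyVLA(nn.Module):
    # Stand-in for the VLA stack: vision backbone + connector + language backbone
    # + action head. The real model would load pre-trained SigLIP / Gemma weights;
    # small Transformer encoders are used here so the sketch runs anywhere.
    def __init__(self, vision_dim: int = 768, lm_dim: int = 2048,
                 action_dim: int = 7, horizon: int = 10):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(vision_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.connector = Connector(vision_dim, lm_dim)
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(lm_dim, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, image_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis = self.vision_encoder(image_tokens)        # (B, N_img, vision_dim)
        vis = self.connector(vis)                      # (B, N_img, lm_dim)
        fused = torch.cat([vis, text_embeds], dim=1)   # image tokens prepended to text
        hidden = self.language_model(fused)            # (B, N_img + N_txt, lm_dim)
        pooled = hidden.mean(dim=1)                    # simple pooling over the sequence
        actions = self.action_head(pooled)             # (B, action_dim * horizon)
        return actions.view(-1, self.horizon, self.action_dim)

model = ToyVLA()
image_tokens = torch.randn(1, 196, 768)   # e.g. 14x14 patch tokens from a vision encoder
text_embeds = torch.randn(1, 16, 2048)    # e.g. embedded instruction tokens
actions = model(image_tokens, text_embeds)
print(actions.shape)                      # torch.Size([1, 10, 7])

In this sketch the connector is the only piece built from scratch; it is the bridge that lets pre-trained vision and language backbones be reused for action prediction, which is the design choice the abstract attributes SmolVLA's efficiency to.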
Citation
If you use SmolVLA in your research, please consider citing our paper:
@misc{tu2024smolvla,
  title={SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics},
  author={Cheng-Hao Tu and Zicong Fan and Siyuan Geng and Chuer Pan and Oier Mees and Ridhi K. Jobanputra and Shuangfei Zhai and Ken Ooi and Yevgen Chebotar and Ted Xiao and Andy Zeng and Ting-Wei Lin and Brian Ichter and Sergey Levine and Fei Xia},
  year={2024},
  eprint={2405.19726},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}