SmolVLA
A Compact, Efficient, and Powerful Vision-Language-Action Model for Modern Robotics.
What is SmolVLA?
SmolVLA is an open-source Vision-Language-Action (VLA) model developed by the LeRobot team at Hugging Face. It builds on SmolVLM-2, a compact vision-language model that pairs a SigLIP vision encoder with the SmolLM2 language backbone, and adds an action expert trained on community-contributed robotics datasets. It's designed to be exceptionally compact and efficient, making advanced robotics automation accessible on consumer-grade hardware. With roughly 450 million parameters, SmolVLA sharply reduces computational demands while remaining competitive with far larger VLAs, democratizing robotics development for everyone.
Key Features
Compact and Efficient
With roughly 450M parameters, SmolVLA runs smoothly on standard consumer hardware, lowering the barrier to entry for robotics research and development (see the usage sketch below).
High Performance
Despite its small size, SmolVLA performs competitively with much larger VLA models on a range of robotic manipulation tasks, from simple pick-and-place to multi-step sequential operations.
Open Source
The model, along with its training and evaluation code, is fully open-source, promoting collaboration and innovation within the robotics community.
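Getting started takes only a few lines of Python. The sketch below shows one way to load a pretrained checkpoint and query it for an action with the LeRobot library. It is a minimal sketch, not a definitive recipe: it assumes the `lerobot` package layout from around the model's release, and the observation keys, image size, state dimension, and task string are illustrative placeholders that must match your own robot and dataset configuration.

```python
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download and load the pretrained base policy (~450M parameters)
# from the Hugging Face Hub.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()

# One dummy observation. The camera key, image resolution, and state
# dimension below are illustrative assumptions: they must match the
# camera names and robot state layout of your own setup.
observation = {
    "observation.images.top": torch.rand(1, 3, 512, 512),  # RGB frame in [0, 1]
    "observation.state": torch.rand(1, 6),                 # joint positions
    "task": ["Pick up the red cube and place it in the bin."],
}

with torch.no_grad():
    action = policy.select_action(observation)

print(action.shape)  # (batch, action_dim)
```

On repeated calls, `select_action` steps through a predicted chunk of future actions rather than re-running the full model on every control tick, which is part of what keeps inference cheap enough for consumer hardware.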
Watch it in Action
See SmolVLA performing a variety of manipulation tasks that showcase its capabilities in real-world environments.
SmolVLA Overview
Community DIY Robot Video
Official Research Paper
For a deep dive into the technical details, read the official research paper, "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". It provides comprehensive information on the model architecture, training methodology, and performance benchmarks.