SmolVLA

A Compact, Efficient, and Powerful Vision-Language-Action Model for Modern Robotics.

What is SmolVLA?

SmolVLA is a state-of-the-art, open-source Vision-Language-Action (VLA) model developed by the Hugging Face LeRobot team. It builds on the SmolVLM-2 vision-language model, pairing a SigLIP vision encoder and the SmolLM2 language model with a lightweight flow-matching action expert. It's designed to be exceptionally compact and efficient, making advanced robotics automation accessible on consumer-grade hardware. With roughly 450 million parameters, SmolVLA significantly reduces computational demands without compromising performance, democratizing robotics development for everyone.

Key Features

Compact and Efficient

With only ~450M parameters, SmolVLA runs smoothly on standard consumer hardware, lowering the barrier to entry for robotics research and development.

High Performance

Despite its small size, SmolVLA matches or outperforms much larger VLA models on a range of robotic manipulation tasks, from simple pick-and-place to complex sequential operations.

Open Source

The model, along with its training and evaluation code, is fully open-source, promoting collaboration and innovation within the robotics community.

Watch it in Action

Watch SmolVLA in action, performing a variety of tasks that showcase its capabilities in a real-world environment.

Video: SmolVLA Overview

Video: Community DIY Robot Video

Official Research Paper

For a deep dive into the technical details, read the official research paper, "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics". It provides comprehensive information on the model architecture, training methodology, and performance benchmarks.

Get Started
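
To make the first step concrete, here is a minimal sketch of loading the pretrained checkpoint with the LeRobot library and running one inference step on dummy inputs. The import path, observation keys, state dimension, and image shape are assumptions that vary with your lerobot version and robot setup; treat the LeRobot documentation as the source of truth.

```python
# A minimal sketch of loading and querying SmolVLA via the LeRobot
# library. Assumptions: lerobot is installed with the smolvla extras,
# and this import path matches your lerobot version (it has moved
# between releases).
import torch

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the pretrained ~450M-parameter base checkpoint from the Hub.
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
policy.reset()  # clear any cached action chunk before a new episode

# One dummy observation. The state dimension and camera key below are
# hypothetical placeholders; real keys and shapes depend on your robot
# configuration and dataset.
batch = {
    "observation.state": torch.zeros(1, 6),                 # proprioceptive state
    "observation.images.top": torch.zeros(1, 3, 512, 512),  # RGB camera frame
    "task": ["pick up the red cube"],                       # language instruction
}

with torch.no_grad():
    action = policy.select_action(batch)

print(action.shape)  # one action step drawn from the predicted chunk
```

In a real control loop you would rebuild the batch from live robot observations at each step; SmolVLA predicts short action chunks, and select_action serves them one step at a time.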