SmolVLA
A compact, efficient, and powerful vision-language-action model for modern robotics.
🚀 Meet SmolVLA
SmolVLA is a Vision-Language-Action (VLA) model from Hugging Face with only 450 million parameters, designed for cost-effective deployment on consumer-grade hardware and making advanced robotics accessible to more developers and enthusiasts.
Trained entirely on openly shared LeRobot community datasets, SmolVLA embodies the power of open-source collaboration, with performance competitive with, and sometimes exceeding, much larger models.
📌 Core Features
Compact and Efficient Architecture
Pairs a streamlined SmolVLM-2 vision-language backbone with a flow-matching transformer action expert for high efficiency (a toy sketch of the flow-matching objective follows this feature list).
Asynchronous Inference
Achieves real-time responsiveness by decoupling action prediction from execution, reducing task completion time by approximately 30% on average (a conceptual sketch also follows the feature list).
Open and Community-Driven
Fully trained on publicly available LeRobot community datasets on Hugging Face and released as open source, encouraging widespread use and research.
Exceptional Performance
Performs strongly in simulation benchmarks such as LIBERO and Meta-World, and achieves roughly 78.3% average success on real-world tasks.
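To make the action expert concrete: SmolVLA's action expert is trained with a flow-matching objective. The toy sketch below shows the general shape of such an objective, not SmolVLA's actual implementation; expert is a hypothetical network that predicts a velocity field conditioned on features from the VLM backbone.

import torch

def flow_matching_loss(expert, actions, cond):
    # actions: (B, chunk_len, action_dim) ground-truth action chunk
    # cond: conditioning features from the VLM backbone (hypothetical)
    noise = torch.randn_like(actions)       # x0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1, 1)  # random interpolation time per sample
    x_t = (1 - t) * noise + t * actions     # straight-line interpolant between noise and actions
    target = actions - noise                # constant velocity field pointing from noise to data
    pred = expert(x_t, t, cond)             # network predicts the velocity
    return torch.nn.functional.mse_loss(pred, target)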
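And to illustrate asynchronous inference: prediction and execution run concurrently, so the robot never idles while waiting for the next action chunk. This is a minimal conceptual sketch, not LeRobot's async runtime; get_observation, predict_chunk, and execute are hypothetical stand-ins.

import threading
import queue

chunks: queue.Queue = queue.Queue(maxsize=1)  # small buffer keeps predictions close to fresh observations

def predictor(get_observation, predict_chunk):
    # Producer: predict the next action chunk while the current one executes.
    while True:
        chunks.put(predict_chunk(get_observation()))  # blocks while the buffer is full

def executor(execute):
    # Consumer: step through the current chunk action by action.
    while True:
        for action in chunks.get():
            execute(action)

# Wiring (daemon thread so the producer dies with the main process):
# threading.Thread(target=predictor, args=(read_sensors, policy_predict), daemon=True).start()
# executor(send_to_robot)

The real system handles buffering and timing more carefully, but this overlap of prediction with execution is where the reported ~30% reduction in task completion time comes from.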
🎬 Demo Videos
Watch SmolVLA in action as it performs various tasks that showcase its capabilities in real-world environments.
- SmolVLA Overview
- Community DIY Robot Video
Official Research Paper
For an in-depth look at the technical details, read the official research paper. It provides comprehensive information on model architecture, training methodology, and performance benchmarks.
🛠️ Quick Start Guide
📋 Model Overview
SmolVLA is a 450M parameter vision-language-action model designed for affordable and efficient robotics. It's optimized to run on consumer hardware while maintaining competitive performance.
1. Environment Setup
Before proceeding, set up your environment by following the Installation Guide in the docs:
git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"
2. Load Pre-trained Model
The fastest way to experience SmolVLA is to directly load the pre-trained model from Hugging Face.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
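From there, actions can be queried step by step with the policy's select_action method. The sketch below is illustrative only: the camera keys (up, side), image size, and 6-dimensional state follow the dataset description later on this page and will differ for other robots.

import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
policy.eval()
policy.reset()  # clear internal state at the start of an episode

# Dummy observation; keys and shapes are assumptions based on the svla_so101_pickplace setup.
batch = {
    "observation.state": torch.zeros(1, 6),                  # 6-DOF joint state
    "observation.images.up": torch.zeros(1, 3, 480, 640),    # top camera, float in [0, 1]
    "observation.images.side": torch.zeros(1, 3, 480, 640),  # side camera
    "task": ["Pick up the cube and place it in the box."],   # language instruction
}
with torch.no_grad():
    action = policy.select_action(batch)  # next action to send to the robot

Under the hood, the policy predicts a chunk of future actions and serves them one at a time, re-running the model only when the chunk is exhausted.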
3. Fine-tuning the Pre-trained Model
Fine-tune SmolVLA on your specific dataset for better task performance. This example uses the SO101 pick-place dataset.
python lerobot/scripts/train.py \
--policy.path=lerobot/smolvla_base \
--dataset.repo_id=lerobot/svla_so101_pickplace \
--batch_size=64 \
--steps=20000 \
--output_dir=outputs/train/my_smolvla \
--job_name=my_smolvla_training \
--policy.device=cuda \
--wandb.enable=true
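After training finishes, the fine-tuned weights can be reloaded just like the base model. The path below assumes LeRobot's default checkpoint layout under the output_dir used above; adjust it to your actual run.

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Path assumes LeRobot's usual checkpoints/last/pretrained_model layout.
policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/my_smolvla/checkpoints/last/pretrained_model"
)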
4. Training from Scratch
Train SmolVLA with the VLM backbone initialized from its pretrained weights and the action expert initialized from scratch.
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=lerobot/svla_so101_pickplace \
--batch_size=64 \
--steps=200000 \
--output_dir=outputs/train/my_smolvla \
--job_name=my_smolvla_training \
--policy.device=cuda \
--wandb.enable=true
🗂️ Dataset Information
SmolVLA is trained on community-contributed datasets. The svla_so101_pickplace dataset used in the examples above contains (see the inspection sketch after this list):
- Episodes: 50 total episodes
- Frames: 11,939 total frames
- Robot Type: SO100 follower arm
- FPS: 30 frames per second
- Video Resolution: 480×640 (up and side cameras)
- Action Space: 6-DOF (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper)
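As a quick way to verify these numbers yourself, the dataset can be loaded with LeRobot's dataset API. A minimal sketch, assuming the lerobot.common.* import layout used earlier on this page:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("lerobot/svla_so101_pickplace")
print(ds.num_episodes, ds.num_frames)  # expect 50 episodes, 11,939 frames
print(ds.fps)                          # expect 30
sample = ds[0]                         # one frame: camera images, state, action, task
print(sample.keys())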
📊 Additional Resources
🤝 Community Fine-tuned Models
Discover applications built by the community! These fine-tuned models (29 at the time of writing) showcase the versatility of SmolVLA across a variety of robotic tasks. A few highlights:
- 🧄 Garlic Handling: hannanasko/smolvla_so_100_garlic
- 🚗 Monster Truck Pick-up: GhostScientist/smolvla_please_pick_up_the_monster_truck
- 📝 Whiteboard & Bike Light: danielkorth/smolvla-whiteboard-and-bike-light
- 🍭 Candy Picking: kaku/smolvla-candy-pick
- 🍓 Strawberry Detection: Corematic-Europe/Strawberries_model
- 🐼 Panda Robot Reach: Abderlrahman/smolvla-panda-mujoco-reach-cube
📊 Training Dataset Explorer
Explore the svla_so101_pickplace dataset used in the fine-tuning example above. This interactive viewer shows real robotic demonstrations.
💡 Tip: Use the dataset viewer to understand the action sequences, camera angles, and robot states that SmolVLA learns from.
🌐 Resources & Community
Join us in advancing open, affordable, and efficient robotics. We welcome contributions of all kinds, whether that's data, code improvements, or sharing your projects.