SmolVLA

A compact, efficient, and powerful vision-language-action model for modern robotics.

🚀 Meet SmolVLA

SmolVLA is a groundbreaking Vision-Language-Action (VLA) model with only 450 million parameters, developed by Hugging Face. It's carefully designed for cost-effective deployment on consumer-grade hardware, making advanced robotics technology accessible to more developers and enthusiasts.

Trained on openly shared LeRobot community datasets, SmolVLA embodies the power of open-source collaboration, with performance that matches or exceeds that of larger proprietary models.

📌 Core Features

Compact and Efficient Architecture

Combines a streamlined SmolVLM2 vision-language backbone with a flow-matching transformer action expert, keeping the total model at roughly 450M parameters without sacrificing capability.

Asynchronous Inference

Achieves real-time response by decoupling action prediction from execution, reducing task completion time by approximately 30% on average.
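The mechanism can be pictured as a producer-consumer loop. The sketch below is an illustration only, not LeRobot's actual implementation, and the predict_chunk, send_action, and get_observation names are hypothetical placeholders:

# Conceptual sketch of asynchronous inference: a background thread keeps
# predicting action chunks while the main loop executes actions from a queue,
# so the robot never idles waiting for the next forward pass.
import queue
import threading

def run_async(policy, robot, get_observation):
    action_queue = queue.Queue()

    def predictor():
        # Producer: repeatedly predict the next chunk of actions.
        while True:
            obs = get_observation()                   # hypothetical callback
            for action in policy.predict_chunk(obs):  # hypothetical helper
                action_queue.put(action)

    def executor():
        # Consumer: execute actions as soon as they are available.
        while True:
            robot.send_action(action_queue.get())     # hypothetical robot API

    threading.Thread(target=predictor, daemon=True).start()
    executor()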

Open and Community-Driven

Fully trained on publicly available LeRobot community datasets on Hugging Face and released as open source, encouraging widespread use and research.

Exceptional Performance

Performs strongly in simulation benchmarks such as LIBERO and Meta-World, and reaches an average success rate of approximately 78.3% on real-world tasks.

🎬 Demo Videos

Watch SmolVLA in action as it performs various tasks that showcase its capabilities in real-world environments.

SmolVLA Overview

Community DIY Robot Video

Official Research Paper

For an in-depth look at the technical details, read the official research paper. It provides comprehensive information on model architecture, training methodology, and performance benchmarks.

🛠️ Quick Start Guide

📋 Model Overview

SmolVLA is a 450M parameter vision-language-action model designed for affordable and efficient robotics. It's optimized to run on consumer hardware while maintaining competitive performance.

Parameters: 450M
Downloads: 15,383+ last month
Fine-tuned models: 29 models

1. Environment Setup

Before proceeding, install LeRobot by following the Installation Guide in the documentation, then add the SmolVLA dependencies:

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

2. Load Pre-trained Model

The fastest way to experience SmolVLA is to directly load the pre-trained model from Hugging Face.

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")
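Once loaded, the policy can be queried for actions through its select_action method. The observation keys, tensor shapes, and task string below are placeholders to adapt to your robot and dataset configuration; treat this as a minimal sketch rather than a ready-to-run controller:

import torch

policy.reset()  # clear internal state before starting a new episode

# Placeholder observation; replace keys and shapes with your robot's configuration.
observation = {
    "observation.images.top": torch.zeros(1, 3, 256, 256),  # dummy camera frame
    "observation.state": torch.zeros(1, 6),                  # dummy joint state
    "task": ["pick up the cube and place it in the box"],    # language instruction
}

with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)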

3. Fine-tuning the Pre-trained Model

Fine-tune SmolVLA on your specific dataset for better task performance. This example uses the SO101 pick-place dataset.

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true
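When training finishes, the fine-tuned weights can be loaded just like the base model. The path below assumes LeRobot's default output layout (a checkpoints/last/pretrained_model folder inside the output directory); adjust it to wherever your run actually saved:

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Assumes the default LeRobot checkpoint layout; adjust to your run's output path.
checkpoint = "outputs/train/my_smolvla/checkpoints/last/pretrained_model"
policy = SmolVLAPolicy.from_pretrained(checkpoint)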

4. Training from Scratch

Train the SmolVLA architecture from scratch: the vision-language backbone starts from pretrained SmolVLM2 weights, while the action expert is initialized from scratch. Note the --policy.type=smolvla flag, which selects the architecture when no pretrained policy path is given.

python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --batch_size=64 \
  --steps=200000 \
  --output_dir=outputs/train/my_smolvla \
  --job_name=my_smolvla_training \
  --policy.device=cuda \
  --wandb.enable=true

🗂️ Dataset Information

SmolVLA is trained on community-contributed datasets. The svla_so101_pickplace dataset used in the examples above consists of pick-and-place demonstrations recorded with the SO-101 arm.
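To see exactly what the model learns from, the dataset can be loaded with LeRobot's dataset class and indexed like a regular PyTorch dataset. This is a small sketch assuming the same lerobot.common package layout used above; the printed attributes and keys may differ slightly across LeRobot versions:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Downloads metadata from the Hugging Face Hub and streams frames on demand.
dataset = LeRobotDataset("lerobot/svla_so101_pickplace")
print("episodes:", dataset.num_episodes, "frames:", dataset.num_frames)

sample = dataset[0]
print(list(sample.keys()))  # per-frame observation, state, and action entries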

📊 Additional Resources

🤝 Community Fine-tuned Models

Discover amazing applications built by the community! These 29 fine-tuned models showcase the versatility of SmolVLA across various robotic tasks.

📊 Training Dataset Explorer

Explore the svla_so101_pickplace dataset used to train SmolVLA. This interactive viewer shows real robotic demonstrations.

💡 Tip: Use the dataset viewer to understand the action sequences, camera angles, and robot states that SmolVLA learns from.

🌐 Resources & Community

Join us in advancing open, affordable, and efficient robotics. We welcome contributions of data, code improvements, or sharing your projects.