---
license: apache-2.0
library_name: starvla
tags:
- robotics
- vision-language-action
- vla
- libero
- franka
- manipulation
- cosmos-predict2
base_model: nvidia/Cosmos-Predict2-2B-Video2World
datasets:
- openvla/modified_libero_rlds
pipeline_tag: robotics
---

# StarVLA-WM4A (LIBERO)

**StarVLA-WM4A** is a Vision-Language-Action (VLA) policy built on the [StarVLA](https://github.com/starVLA/starVLA) framework. It couples the [Cosmos-Predict2](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World) video world model, used as a perception backbone with its VAE and T5 text encoder frozen, with a lightweight flow-matching **action DiT** head (`CosmoPredict2GR00T` framework). The model is fine-tuned on the joint LIBERO-Spatial / LIBERO-Object / LIBERO-Goal / LIBERO-10 task mix.

> 🤝 Please refer to the official [StarVLA repository](https://github.com/starVLA/starVLA)
> for installation, training recipes, and evaluation tooling. This repo only
> hosts the model weights and the minimal configuration required to load them.

---

## ✨ Highlights

| Property | Value |
|---|---|
| Framework | `CosmoPredict2GR00T` (StarVLA) |
| Perception backbone | `nvidia/Cosmos-Predict2-2B-Video2World` (frozen VAE + T5) |
| Action head | DiT-B, 16 layers, hidden=1024 |
| Action dim / horizon | 7 / 8 (delta qpos + gripper) |
| State dim | 7 |
| Benchmark | LIBERO (4 task suites) |
| Training precision | bf16 mixed precision |
| LIBERO-Goal success rate | **92.0%** (184 / 200, see below) |

---

## 📦 Files

```
StarVLA_WM4A/
├── README.md                  # this file
├── config.yaml                # minimal loadable config
├── dataset_statistics.json    # action/state normalization stats
└── starvla_wm4a_libero.pt     # model weights (~14 GB)
```

---

## 🚀 Quick Start

### 1. Install StarVLA

Follow the installation instructions in the [official repository](https://github.com/starVLA/starVLA):

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# create the conda env, install deps, etc. (see the upstream README)
```

### 2. Download the checkpoint

```bash
# Option A: huggingface-cli
huggingface-cli download JackAILab/StarVLA_WM4A \
    --local-dir ./pretrained/StarVLA_WM4A

# Option B: Python (huggingface_hub), run here through a heredoc
python - <<'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="JackAILab/StarVLA_WM4A",
    local_dir="./pretrained/StarVLA_WM4A",
)
EOF
```

You also need the Cosmos-Predict2 backbone that this model is built on:

```bash
huggingface-cli download nvidia/Cosmos-Predict2-2B-Video2World \
    --local-dir ./pretrained/Cosmos-Predict2-2B-Video2World
```

### 3. Run LIBERO evaluation

From the `starVLA/` repo root:

```bash
# start the policy server with this checkpoint
CUDA_VISIBLE_DEVICES=0 python deployment/model_server/server_policy.py \
    --ckpt_path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --port 6694 \
    --use_bf16

# in a second shell (with the `libero` env activated):
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.host 127.0.0.1 \
    --args.port 6694 \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20 \
    --args.video-out-path results/eval_libero_goal
```

### 4. Load in Python

```python
import numpy as np
from PIL import Image

from starVLA.model.framework.base_framework import baseframework

policy = baseframework.from_pretrained(
    "./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt",
)
policy = policy.to("cuda").eval()

# One observation dict; the blank image and zero state are placeholders
# for a real camera frame and the real 7-D robot state.
observation = {
    "image": [Image.new("RGB", (224, 224))],
    "lang": "put the bowl on the plate",
    "state": np.zeros(7),
}

# predict a 7-DoF action chunk from the observation
action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]
```

Before loading, make sure the backbone paths in `config.yaml` (`framework.world_model.base_wm`, `framework.qwenvl.base_vlm`) point to your local copy of `Cosmos-Predict2-2B-Video2World`, or leave them as the HF repo id if your StarVLA build resolves HF paths directly.

---

## 🧪 Model Configuration

Key settings (see `config.yaml` for the full spec):

```yaml
framework:
  name: CosmoPredict2GR00T
  world_model:
    base_wm: nvidia/Cosmos-Predict2-2B-Video2World
  action_model:
    action_model_type: DiT-B          # 16-layer DiT
    hidden_size: 1024
    action_dim: 7                     # (dx, dy, dz, droll, dpitch, dyaw, gripper)
    state_dim: 7
    future_action_window_size: 7      # predicts 8 actions per step
    action_horizon: 8
    repeated_diffusion_steps: 8
    num_inference_timesteps: 4
  enable_video_loss: false

trainer:
  max_train_steps: 80000
  num_warmup_steps: 3000
  learning_rate:
    base: 1.0e-05                     # backbone LR (text encoder and VAE are frozen)
  lr_scheduler_type: cosine_with_min_lr
  freeze_modules: backbone.text_encoder, backbone.vae
```

- **Frozen modules**: T5 text encoder and Cosmos VAE; only the DiT transformer and action head receive gradients.
- **Optimizer**: AdamW, `β = (0.9, 0.95)`, weight decay `1e-8`, grad clip `1.0`.
- **Schedule**: cosine-with-min-lr, 3k warmup steps.
- **Precision**: bf16 mixed precision with gradient checkpointing.

`dataset_statistics.json` contains the per-dimension action/state mean/std/min/max computed on the LIBERO Franka mix. These statistics are required at inference time for normalization (`unnorm_key=franka`).

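
To make the role of these statistics concrete, here is a minimal sketch of un-normalizing a predicted action chunk. It assumes z-score (mean/std) normalization and a hypothetical JSON layout of the form `{"franka": {"action": {"mean": [...], "std": [...]}}}`; neither the schema nor the scheme is confirmed by this card, the helper names are illustrative, and the numbers are made up. StarVLA's own loading code should take precedence over this sketch.

```python
import numpy as np

# Hypothetical layout and made-up values; the real dataset_statistics.json may differ.
stats = {
    "franka": {
        "action": {
            "mean": [0.0, 0.1, -0.1, 0.0, 0.0, 0.0, 0.5],
            "std": [0.2, 0.2, 0.2, 0.05, 0.05, 0.05, 0.5],
        }
    }
}


def _mean_std(stats, key):
    """Fetch per-dimension action mean and std for one dataset key."""
    entry = stats[key]["action"]
    return np.asarray(entry["mean"], dtype=np.float64), np.asarray(entry["std"], dtype=np.float64)


def normalize_actions(raw, stats, unnorm_key="franka", eps=1e-8):
    """Map raw actions to normalized space (assumed z-score)."""
    mean, std = _mean_std(stats, unnorm_key)
    return (np.asarray(raw, dtype=np.float64) - mean) / (std + eps)


def unnormalize_actions(normalized, stats, unnorm_key="franka", eps=1e-8):
    """Invert normalize_actions; broadcasts over [batch, horizon, action_dim]."""
    mean, std = _mean_std(stats, unnorm_key)
    return np.asarray(normalized, dtype=np.float64) * (std + eps) + mean


# A stand-in for a model output of shape [1, 8, 7] in normalized space.
normalized_chunk = np.zeros((1, 8, 7))
raw_chunk = unnormalize_actions(normalized_chunk, stats)
```

With an all-zero normalized chunk, every step of `raw_chunk` equals the per-dimension mean, which is a quick way to check that the statistics are wired in along the last (action) axis.
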

---

## 🏆 LIBERO-Goal Results

Evaluated with the standard StarVLA LIBERO pipeline: **20 rollouts per task**, 10 tasks in the `libero_goal` suite (200 rollouts total). The policy server runs at `bf16`, 4 inference timesteps, action chunk of 8.

**Overall success rate: 92.0% (184 / 200)**

| Task | Success | Rate |
|---|---|---|
| `push_the_plate_to_the_front_of_the_stove` | 20 / 20 | **100.0%** |
| `put_the_bowl_on_the_plate` | 20 / 20 | **100.0%** |
| `put_the_wine_bottle_on_top_of_the_cabinet` | 20 / 20 | **100.0%** |
| `turn_on_the_stove` | 20 / 20 | **100.0%** |
| `open_the_middle_drawer_of_the_cabinet` | 19 / 20 | 95.0% |
| `put_the_bowl_on_top_of_the_cabinet` | 19 / 20 | 95.0% |
| `put_the_cream_cheese_in_the_bowl` | 18 / 20 | 90.0% |
| `put_the_bowl_on_the_stove` | 17 / 20 | 85.0% |
| `put_the_wine_bottle_on_the_rack` | 16 / 20 | 80.0% |
| `open_the_top_drawer_and_put_the_bowl_inside` | 15 / 20 | 75.0% |

Reproduce with:

```bash
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20
```

Evaluation on the other LIBERO suites (`libero_spatial`, `libero_object`, `libero_10`) is ongoing and will be appended here once the full sweep finishes.

---

## 📊 Training Data

Trained on the four LIBERO task suites in a balanced mixture, loaded through the StarVLA LeRobot data pipeline:

- `libero_spatial_no_noops_1.0.0_lerobot`
- `libero_object_no_noops_1.0.0_lerobot`
- `libero_goal_no_noops_1.0.0_lerobot`
- `libero_10_no_noops_1.0.0_lerobot`

All four are derived from the original LIBERO benchmark (see [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO)) and wrapped into LeRobot format (see [openvla/modified_libero_rlds](https://huggingface.co/datasets/openvla/modified_libero_rlds) for the upstream RLDS version).

Input: single RGB view at `224 × 224`, language instruction, 7-D robot state.
Output: chunk of 8 future actions (`delta_qpos` + gripper).

---

## 📜 License

Released under the **Apache 2.0** license. This checkpoint is built on top of [nvidia/Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World). Please also comply with the upstream Cosmos model license when using or redistributing these weights.

---

## 📖 Citation

If you use this checkpoint, please cite the StarVLA project and the Cosmos-Predict2 world model:

```bibtex
@misc{starvla2026,
  title  = {StarVLA: A Unified Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year   = {2026},
  url    = {https://github.com/starVLA/starVLA}
}

@misc{cosmospredict2,
  title  = {Cosmos-Predict2: A Video World Model for Robotics and Simulation},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World}
}
```

---

## 🔗 Links

- **Framework**: https://github.com/starVLA/starVLA
- **Backbone**: https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World
- **Benchmark**: https://github.com/Lifelong-Robot-Learning/LIBERO
- **Issues / Questions**: please open an issue in the [StarVLA repo](https://github.com/starVLA/starVLA/issues) and tag it with `model/StarVLA_WM4A`.