---
license: apache-2.0
library_name: starvla
tags:
  - robotics
  - vision-language-action
  - vla
  - libero
  - franka
  - manipulation
  - cosmos-predict2
base_model: nvidia/Cosmos-Predict2-2B-Video2World
datasets:
  - openvla/modified_libero_rlds
pipeline_tag: robotics
---

# StarVLA-WM4A (LIBERO)

**StarVLA-WM4A** is a Vision-Language-Action (VLA) policy built on the
[StarVLA](https://github.com/starVLA/starVLA) framework. It couples the
[Cosmos-Predict2](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
video world model, used as the perception backbone with its VAE and T5 text
encoder frozen, with a lightweight flow-matching **action DiT** head
(`CosmoPredict2GR00T` framework).

It is fine-tuned jointly on four LIBERO task suites: LIBERO-Spatial,
LIBERO-Object, LIBERO-Goal, and LIBERO-10.

> 🤝 Please refer to the official [StarVLA repository](https://github.com/starVLA/starVLA)
> for installation, training recipes, and evaluation tooling. This repo only
> hosts the model weights and the minimal configuration required to load them.

---

## ✨ Highlights

| Property | Value |
|---|---|
| Framework | `CosmoPredict2GR00T` (StarVLA) |
| Perception backbone | `nvidia/Cosmos-Predict2-2B-Video2World` (frozen VAE + T5) |
| Action head | DiT-B, 16 layers, hidden=1024 |
| Action dim / horizon | 7 / 8 (6-D end-effector delta + gripper) |
| State dim | 7 |
| Benchmark | LIBERO (4 task suites) |
| Training precision | bf16 mixed precision |
| LIBERO-Goal success rate | **92.0%** (184 / 200, see below) |

---

## 📦 Files

```
StarVLA_WM4A/
├── README.md                  # this file
├── config.yaml                # minimal loadable config
├── dataset_statistics.json    # action/state normalization stats
└── starvla_wm4a_libero.pt     # model weights (~14 GB)
```

---

## 🚀 Quick Start

### 1. Install StarVLA

Follow the installation instructions in the
[official repository](https://github.com/starVLA/starVLA):

```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# create the conda env, install deps etc. — see the upstream README
```

### 2. Download the checkpoint

```bash
# Option A: huggingface-cli
huggingface-cli download JackAILab/StarVLA_WM4A \
    --local-dir ./pretrained/StarVLA_WM4A

# Option B: Python, run from the shell via a heredoc
python - <<'EOF'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="JackAILab/StarVLA_WM4A",
    local_dir="./pretrained/StarVLA_WM4A",
)
EOF
```

You also need the Cosmos-Predict2 backbone that this model is built on:

```bash
huggingface-cli download nvidia/Cosmos-Predict2-2B-Video2World \
    --local-dir ./pretrained/Cosmos-Predict2-2B-Video2World
```

### 3. Run LIBERO evaluation

From the `starVLA/` repo root:

```bash
# start the policy server with this checkpoint
CUDA_VISIBLE_DEVICES=0 python deployment/model_server/server_policy.py \
    --ckpt_path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --port 6694 \
    --use_bf16

# in a second shell (with the `libero` env activated):
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.host 127.0.0.1 \
    --args.port 6694 \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20 \
    --args.video-out-path results/eval_libero_goal
```

### 4. Load in Python

```python
import numpy as np
from PIL import Image

from starVLA.model.framework.base_framework import baseframework

policy = baseframework.from_pretrained(
    "./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt",
)
policy = policy.to("cuda").eval()

# Placeholder observation; replace with a real camera frame and robot state.
observation = {
    "image": [Image.new("RGB", (224, 224))],  # single RGB view
    "lang": "put the bowl on the plate",      # language instruction
    "state": np.zeros(7, dtype=np.float32),   # 7-D robot state
}

# Predict a chunk of 8 actions, each 7-D.
action_chunk = policy.predict_action([observation])  # -> shape [1, 8, 7]
```

Before loading, make sure the backbone paths in `config.yaml`
(`framework.world_model.base_wm`, `framework.qwenvl.base_vlm`) point to your
local copy of `Cosmos-Predict2-2B-Video2World` (or leave them as the HF repo id
if your StarVLA build resolves HF paths directly).
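
One way to repoint those keys without hand-editing is to parse the YAML into a nested dict and set the dotted paths programmatically. The sketch below shows only the dict-update step. `set_dotted` is a hypothetical helper (not part of StarVLA), the example dict mirrors only keys named in this card, and reading or writing the YAML file is left to whichever YAML library you use.

```python
def set_dotted(cfg: dict, dotted_key: str, value) -> None:
    """Set cfg["a"]["b"]["c"] = value given the dotted key "a.b.c".

    Hypothetical helper for illustration; missing intermediate dicts
    are created on the way down.
    """
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Stand-in for the parsed config.yaml (only keys mentioned in this card).
cfg = {"framework": {"world_model": {"base_wm": "nvidia/Cosmos-Predict2-2B-Video2World"}}}

local_backbone = "./pretrained/Cosmos-Predict2-2B-Video2World"
for key in ("framework.world_model.base_wm", "framework.qwenvl.base_vlm"):
    set_dotted(cfg, key, local_backbone)
```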

---

## 🧪 Model Configuration

Key settings (see `config.yaml` for the full spec):

```yaml
framework:
  name: CosmoPredict2GR00T
  world_model:
    base_wm: nvidia/Cosmos-Predict2-2B-Video2World
  action_model:
    action_model_type: DiT-B      # 16-layer DiT
    hidden_size: 1024
    action_dim: 7                 # (dx, dy, dz, droll, dpitch, dyaw, gripper)
    state_dim: 7
    future_action_window_size: 7  # predicts 8 actions per step
    action_horizon: 8
    repeated_diffusion_steps: 8
    num_inference_timesteps: 4
  enable_video_loss: false

trainer:
  max_train_steps: 80000
  num_warmup_steps: 3000
  learning_rate:
    base: 1.0e-05                 # backbone LR; text encoder and VAE are frozen (see freeze_modules)
  lr_scheduler_type: cosine_with_min_lr
  freeze_modules: backbone.text_encoder, backbone.vae
```
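
To make `num_inference_timesteps: 4` concrete: a flow-matching head typically produces an action by integrating a learned velocity field from noise to data in a few Euler steps. The sketch below illustrates that idea on plain Python lists. It is not StarVLA's sampler: `euler_sample` and `toy_velocity` are hypothetical names, the time convention (noise at t=0, data at t=1) is an assumption, and the toy velocity field stands in for the action DiT.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the action DiT: the exact velocity of a straight-line
# path that ends at a fixed, made-up 7-D target action.
target = [0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]


def toy_velocity(x, t):
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
action = euler_sample(toy_velocity, noise, num_steps=4)
```

With this toy field the final step lands on `target` up to float rounding; a learned field would only approximate the data distribution.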

- **Frozen modules**: T5 text encoder and Cosmos VAE — only the DiT transformer
  and action head receive gradients.
- **Optimizer**: AdamW, `β = (0.9, 0.95)`, weight decay `1e-8`, grad clip `1.0`.
- **Schedule**: cosine-with-min-lr, 3,000 warmup steps, 80,000 training steps in total.
- **Precision**: bf16 mixed precision with gradient checkpointing.
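
As a rough illustration of the schedule, the sketch below computes a linear-warmup, cosine-decay-to-floor learning rate from the values listed in this card (`base` 1e-5, 3,000 warmup steps, 80,000 total steps). The floor `min_lr` is not stated here, so the value used is a placeholder assumption, and `lr_at` is a hypothetical helper rather than the scheduler StarVLA actually calls.

```python
import math


def lr_at(step, base_lr=1.0e-5, warmup=3000, total=80000, min_lr=1.0e-6):
    """Linear warmup to base_lr, then cosine decay to min_lr (assumed floor)."""
    if step < warmup:
        return base_lr * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Print the learning rate at a few points along training.
for step in (0, 1500, 3000, 41500, 80000):
    print(step, f"{lr_at(step):.2e}")
```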

`dataset_statistics.json` contains the per-dimension action and state statistics
(mean, std, min, max) computed on the LIBERO Franka mix. They are required at
inference time to normalize the input state and un-normalize the predicted
actions (`unnorm_key=franka`).
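
A minimal sketch of what un-normalization with these statistics could look like, assuming actions are min-max scaled to [-1, 1] during training. Both that scaling convention and the `{"min": [...], "max": [...]}` layout below are assumptions for illustration; check `dataset_statistics.json` and the StarVLA source for the actual scheme. `unnormalize` is a hypothetical helper and the numbers are made up.

```python
def unnormalize(action, stats):
    """Map a [-1, 1]-scaled action back to raw units with per-dim min/max."""
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo
        for a, lo, hi in zip(action, stats["min"], stats["max"])
    ]


# Made-up statistics for a 7-D action; real values live in dataset_statistics.json.
stats = {"min": [-1.0] * 6 + [0.0], "max": [1.0] * 6 + [1.0]}

normalized = [0.0, 0.5, -0.5, 1.0, -1.0, 0.0, 1.0]
raw = unnormalize(normalized, stats)
```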

---

## 🏆 LIBERO-Goal Results

Evaluated with the standard StarVLA LIBERO pipeline: **20 rollouts per task**
across the 10 tasks of the `libero_goal` suite (200 rollouts total). The policy
server runs in bf16 with 4 inference timesteps and an action chunk of 8.

**Overall success rate: 92.0% (184 / 200)**

| Task | Success | Rate |
|---|---|---|
| `push_the_plate_to_the_front_of_the_stove` | 20 / 20 | **100.0%** |
| `put_the_bowl_on_the_plate` | 20 / 20 | **100.0%** |
| `put_the_wine_bottle_on_top_of_the_cabinet` | 20 / 20 | **100.0%** |
| `turn_on_the_stove` | 20 / 20 | **100.0%** |
| `open_the_middle_drawer_of_the_cabinet` | 19 / 20 | 95.0% |
| `put_the_bowl_on_top_of_the_cabinet` | 19 / 20 | 95.0% |
| `put_the_cream_cheese_in_the_bowl` | 18 / 20 | 90.0% |
| `put_the_bowl_on_the_stove` | 17 / 20 | 85.0% |
| `put_the_wine_bottle_on_the_rack` | 16 / 20 | 80.0% |
| `open_the_top_drawer_and_put_the_bowl_inside` | 15 / 20 | 75.0% |
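
The overall figure follows directly from the per-task counts in the table; the short check below sums them.

```python
# Per-task successes out of 20 rollouts, in table order.
successes = [20, 20, 20, 20, 19, 19, 18, 17, 16, 15]
trials_per_task = 20

total_success = sum(successes)
total_trials = trials_per_task * len(successes)
rate = 100.0 * total_success / total_trials

print(f"{total_success} / {total_trials} = {rate:.1f}%")  # 184 / 200 = 92.0%
```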

To reproduce, start the policy server as in Quick Start step 3, then run:

```bash
python examples/LIBERO/eval_files/eval_libero.py \
    --args.pretrained-path ./pretrained/StarVLA_WM4A/starvla_wm4a_libero.pt \
    --args.host 127.0.0.1 \
    --args.port 6694 \
    --args.task-suite-name libero_goal \
    --args.num-trials-per-task 20
```

Evaluation on the other LIBERO suites (`libero_spatial`, `libero_object`,
`libero_10`) is ongoing and will be appended here once the full sweep finishes.

---

## 📊 Training Data

Trained on the four LIBERO task suites in a balanced mixture, loaded through
the StarVLA LeRobot data pipeline:

- `libero_spatial_no_noops_1.0.0_lerobot`
- `libero_object_no_noops_1.0.0_lerobot`
- `libero_goal_no_noops_1.0.0_lerobot`
- `libero_10_no_noops_1.0.0_lerobot`

All four are derived from the original LIBERO benchmark (see
[LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO)) and wrapped into
LeRobot format (see
[openvla/modified_libero_rlds](https://huggingface.co/datasets/openvla/modified_libero_rlds)
for the upstream RLDS version).

Input: single RGB view at `224 × 224`, language instruction, 7-D robot state.
Output: chunk of 8 future actions, each a 6-D end-effector delta
(`dx, dy, dz, droll, dpitch, dyaw`) plus a gripper command.
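
Because the policy emits 8 actions per query, a control loop typically executes the chunk (or part of it) before asking for a new one. The sketch below shows that pattern with stub components. `StubPolicy`, `StubEnv`, and `run_episode` are hypothetical stand-ins, not StarVLA or LIBERO classes, and executing the full chunk open-loop is an assumption about how the chunk is consumed.

```python
class StubPolicy:
    """Stand-in policy: returns a chunk of 8 zero actions of dimension 7."""

    def predict_chunk(self, observation):
        return [[0.0] * 7 for _ in range(8)]


class StubEnv:
    """Stand-in environment that ends after a fixed number of steps."""

    def __init__(self, max_steps=20):
        self.t = 0
        self.max_steps = max_steps

    def observe(self):
        return {"step": self.t}

    def step(self, action):
        self.t += 1
        return self.t >= self.max_steps  # done flag


def run_episode(policy, env):
    """Query the policy once per chunk and execute its actions in order."""
    queries, done = 0, False
    while not done:
        chunk = policy.predict_chunk(env.observe())
        queries += 1
        for action in chunk:
            done = env.step(action)
            if done:
                break
    return queries, env.t


queries, steps = run_episode(StubPolicy(), StubEnv(max_steps=20))
```

In this stub run, a 20-step episode needs three policy queries (8 + 8 + 4 executed actions) instead of 20.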

---

## 📜 License

Released under the **Apache 2.0** license.

This checkpoint is built on top of
[nvidia/Cosmos-Predict2-2B-Video2World](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
— please also comply with the upstream Cosmos model license when using or
redistributing these weights.

---

## 📖 Citation

If you use this checkpoint, please cite the StarVLA project and the Cosmos-Predict2
world model:

```bibtex
@misc{starvla2026,
  title  = {StarVLA: A Unified Vision-Language-Action Framework},
  author = {StarVLA Contributors},
  year   = {2026},
  url    = {https://github.com/starVLA/starVLA}
}

@misc{cosmospredict2,
  title  = {Cosmos-Predict2: A Video World Model for Robotics and Simulation},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World}
}
```

---

## 🔗 Links

- **Framework**: https://github.com/starVLA/starVLA
- **Backbone**: https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World
- **Benchmark**: https://github.com/Lifelong-Robot-Learning/LIBERO
- **Issues / Questions**: please open an issue in the
  [StarVLA repo](https://github.com/starVLA/starVLA/issues) and tag it with
  `model/StarVLA_WM4A`.