---
license: apache-2.0
base_model:
- microsoft/Florence-2-large
tags:
- robotics
- vla
pipeline_tag: robotics
widget:
- src: https://huggingface.co/2toINF/X-VLA-0.9B-WidowX/resolve/main/demo.mp4
  type: video
  label: "Watch X-VLA in action on WidowX robot"
---

# X-VLA 0.9B (WidowX Edition)


**Repository:** [2toINF/X-VLA-0.9B-WidowX](https://huggingface.co/2toINF/X-VLA-0.9B-WidowX)

**Authors:** [2toINF](https://github.com/2toINF) | **License:** Apache 2.0

**Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))


## 🚀 Overview

Successful generalist **Vision-Language-Action (VLA)** models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets.
To exploit the heterogeneity of these data sources, **X-VLA** introduces a **Soft Prompt approach** that adds only a small number of parameters: we bring prompt-learning concepts into cross-embodiment robot learning by assigning a **separate set of learnable embeddings** to each distinct embodiment.

These embodiment-specific prompts let VLA models exploit cross-embodiment features effectively.
Our architecture, **a clean, flow-matching-based VLA design relying exclusively on soft-prompted standard Transformers**, stays simple and scales well.
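
To make the soft-prompt idea concrete, here is a minimal, dependency-free sketch of a per-embodiment prompt store. It is an illustration only and not the X-VLA implementation: the class name `SoftPromptBank`, the embodiment name `franka`, and the sizes used are all assumptions, and in a real model the vectors would be trainable parameters rather than plain lists.

```python
import random


class SoftPromptBank:
    """Toy per-embodiment soft-prompt store (hypothetical, not the X-VLA API)."""

    def __init__(self, embodiments, num_prompts, dim, seed=0):
        rng = random.Random(seed)
        # One independent set of prompt vectors per embodiment.
        # In a real model these would be learnable parameters.
        self.prompts = {
            name: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                   for _ in range(num_prompts)]
            for name in embodiments
        }

    def prepend(self, embodiment, tokens):
        """Return the embodiment's prompt vectors followed by the input tokens."""
        return self.prompts[embodiment] + tokens


# "franka" is an assumed second embodiment, used only for illustration.
bank = SoftPromptBank(["widowx", "franka"], num_prompts=4, dim=8)
tokens = [[0.0] * 8 for _ in range(10)]  # stand-in for vision-language features
sequence = bank.prepend("widowx", tokens)
print(len(sequence))  # 14 = 4 prompt vectors + 10 input tokens
```

Because only the small prompt table differs between embodiments, the Transformer that consumes `sequence` can be shared across all of them.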

Trained on **Bridge Data** and evaluated across **six simulations** and **three real-world robots**, the 0.9B-parameter X-VLA simultaneously achieves **state-of-the-art performance** across diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.

🌐 **Project Website:** [https://thu-air-dream.github.io/X-VLA/](https://thu-air-dream.github.io/X-VLA/)


<video controls autoplay loop muted playsinline width="720">
  <source src="https://huggingface.co/2toINF/X-VLA-0.9B-WidowX/resolve/main/demo.mp4" type="video/mp4">
</video>

## ⚙️ Usage
### 🔹 Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True
)
```
### 🔹 Start FastAPI server

```python
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```
### 🔹 Client-server evaluation

You can run the provided evaluation client from our GitHub:
👉 [2toINF/X-VLA – Client & Server Code](https://github.com/2toINF/X-VLA)

```bash
python client_widowx.py --server_ip <SERVER_IP> --server_port 8000 --output_dir logs/
```
Each evaluation produces task-level videos and logs under `logs/`.
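
Before launching the client, it can help to confirm that the server port is reachable. The helper below is a small sketch using only the Python standard library; `server_reachable` is a hypothetical name and is not part of the X-VLA codebase. It only checks that a TCP connection can be opened and makes no assumption about the server's HTTP routes.

```python
import socket


def server_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `server_reachable("<SERVER_IP>", 8000)` should return `True` once the server started above is listening on port 8000.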



## 🧩 Architecture

| Component                         | Role                                                                       |
| :-------------------------------- | :------------------------------------------------------------------------- |
| **Florence 2 Encoder**      | Vision-Language representation backbone (encoder-only).                    |
| **SoftPromptedTransformer** | Flow-matching action denoiser using learnable soft prompts per embodiment. |
| **Action Hub**              | Defines action spaces, masking rules, pre/post-processing, and losses.     |
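
As a rough illustration of what a flow-matching action denoiser does at inference time, the sketch below integrates a velocity field from noise (t = 0) to an action (t = 1) with fixed Euler steps, which is one common way to sample from flow-matching models. Everything here is an assumption made for illustration: `sample_action`, `toy_velocity`, `TARGET`, and the step count are hypothetical, and in X-VLA the velocity would come from the learned SoftPromptedTransformer rather than a closed-form function.

```python
import random


def sample_action(velocity_fn, dim, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    action = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t takes values 0, dt, ..., 1 - dt (never reaches 1)
        velocity = velocity_fn(action, t)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


# Toy stand-in for a learned network: the exact velocity of a straight-line
# path that ends at TARGET (an arbitrary 3-D "action" chosen for the demo).
TARGET = [0.1, -0.2, 0.3]


def toy_velocity(action, t):
    return [(g - a) / (1.0 - t) for g, a in zip(TARGET, action)]


action = sample_action(toy_velocity, dim=len(TARGET))
print([round(a, 6) for a in action])
```

With this toy velocity field the integration lands on `TARGET` up to floating-point error; a learned model would instead be conditioned on images, language, and the embodiment's soft prompts.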

## 🧠 Training Summary

| Setting       | Value                  |
| :------------ | :--------------------- |
| Training Data | Bridge Data V2         |
| Parameters    | ≈ 0.9 B                |
| Action Mode   | `ee6d`                 |
| Precision     | BF16                   |
| Framework     | PyTorch + Transformers |

---
## 🪪 License
```
Copyright 2025 2toINF (https://github.com/2toINF)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
http://www.apache.org/licenses/LICENSE-2.0
```
---
## 📚 Citation
```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```
---
## 🌐 Links

- 📄 **Paper:** [arXiv 2510.10274](https://arxiv.org/abs/2510.10274)
- 💻 **Code & Client/Server:** [GitHub – 2toINF/X-VLA](https://github.com/2toINF/X-VLA)
- 🤖 **Model Hub:** [Hugging Face – 2toINF/X-VLA-0.9B-WidowX](https://huggingface.co/2toINF/X-VLA-0.9B-WidowX)