# Belle-VLM: Vietnamese Vision Language Model

## Model Description
Belle-VLM is a vision-language model fine-tuned for Vietnamese multimodal reasoning tasks.
## Architecture
- LLM Backbone: Qwen3-0.6B
- Vision Encoder: FastViTHD (MobileCLIP)
- Projector: 2-layer MLP (3072 -> 1024); a minimal sketch follows this list
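
For illustration, here is a minimal sketch of what such a projector could look like. Only the input and output dimensions (3072 -> 1024) come from this card; the hidden width and the GELU activation are assumptions in the style of common LLaVA-like projectors, not details confirmed by the repository.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical 2-layer MLP projector mapping vision features (3072-dim)
    into the LLM embedding space (1024-dim). Hidden width and activation
    are assumptions; only the in/out dimensions come from the model card."""

    def __init__(self, vision_dim: int = 3072, llm_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim)
        return self.net(vision_features)
```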
## Training
- Dataset: 5CD-AI/Viet-multimodal-open-r1-8k-verified
- Method: LoRA fine-tuning (see the configuration sketch after this list)
- Epochs: 2
- Learning Rate: 2e-05
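
A sketch of this LoRA setup with the `peft` library is below, using the rank and alpha reported in the Training Details table. The target modules and dropout are assumptions; the card does not list them.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA sketch matching the reported hyperparameters (r=8, alpha=16).
# Target modules are an assumption; the card does not name them.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```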
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code is required to load the custom multimodal components
model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
```
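
A minimal text-only generation sketch using the standard `generate` API follows. How images are passed to Belle-VLM depends on its custom remote code, which this card does not document, so the vision input step is omitted; the Vietnamese prompt is an illustrative placeholder.

```python
# Text-only generation sketch; the image-input interface depends on the
# model's custom code and is not documented here, so it is omitted.
prompt = "Xin chào!"  # illustrative Vietnamese prompt ("Hello!")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```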
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen3-0.6B |
| Vision Tower | mobileclip_l_384 |
| LoRA Rank | 8 |
| LoRA Alpha | 16 |
| Batch Size | 1 x 1 |
| Epochs | 2 |
## License
Apache 2.0