Belle-VLM: Vietnamese Vision-Language Model

Model Description

Belle-VLM is a vision-language model fine-tuned from Qwen3-0.6B for Vietnamese multimodal reasoning tasks.

Architecture

  • LLM Backbone: Qwen3-0.6B
  • Vision Encoder: FastViTHD (MobileCLIP)
  • Projector: 2-layer MLP (3072 → 1024); see the sketch after this list
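
The projector maps FastViTHD features into the Qwen3 embedding space. Below is a minimal sketch of a 2-layer MLP projector of this shape; only the 3072 → 1024 dimensions come from this card, while the GELU activation in between is an assumption, not a confirmed detail of the authors' implementation.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """2-layer MLP mapping vision features (3072-d) to the LLM hidden size (1024-d)."""

    def __init__(self, vision_dim: int = 3072, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),  # assumed activation; not specified on the card
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, 3072) from the vision encoder
        return self.proj(vision_features)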

Training

  • Dataset: 5CD-AI/Viet-multimodal-open-r1-8k-verified
  • Method: LoRA fine-tuning (see the configuration sketch after this list)
  • Epochs: 2
  • Learning Rate: 2e-5
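
Only the rank, alpha, epochs, learning rate, and batch size are stated on this card. The following is a hedged sketch of an equivalent PEFT/Trainer configuration; the target modules, dropout, and output path are assumptions, not taken from the authors' recipe.

from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                # LoRA rank (from the card)
    lora_alpha=16,      # LoRA alpha (from the card)
    lora_dropout=0.05,  # assumed; not stated on the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="belle-vlm-lora",    # hypothetical output path
    num_train_epochs=2,             # from the card
    learning_rate=2e-5,             # from the card
    per_device_train_batch_size=1,  # "1 × 1" read as batch 1, grad accumulation 1
    gradient_accumulation_steps=1,
)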

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# trust_remote_code=True is required: the model ships custom
# vision-language code alongside its weights.
model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # matches the F16 checkpoint
    device_map="auto",          # spread layers across available devices
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
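
The card does not document the image-input API, which lives in the model's custom remote code. The sketch below is a text-only generation example that relies only on the standard generate interface; the prompt is illustrative.

prompt = "Xin chào! Bạn có thể làm gì?"  # "Hello! What can you do?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))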

Training Details

Parameter      Value
Base Model     Qwen/Qwen3-0.6B
Vision Tower   mobileclip_l_384
LoRA Rank      8
LoRA Alpha     16
Batch Size     1 × 1
Epochs         2

License

Apache 2.0
