VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Paper • 2505.21527 • Published
We evaluate gipformer-65M-rnnt against 12 established Vietnamese ASR models on 12 benchmarks spanning call center, medical, broadcast, and read speech, among other domains. All numbers are Word Error Rate (WER, %); lower is better.
| Model | Params | tele-medium | tele-diff-north | tele-diff-middle | tele-diff-south | MultiMED | VietMed | vlsp-t1 | vlsp-t2 | LSVSC | Fleurs | ViMD | vivos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vinai/PhoWhisper-tiny | 39M | 49.33 | 115.07 | 88.35 | 91.08 | 32.05 | 31.52 | 21.59 | 52.91 | 13.37 | 25.86 | 23.12 | 10.10 |
| vinai/PhoWhisper-base | 74M | 42.61 | 65.05 | 74.10 | 71.31 | 28.32 | 28.39 | 19.97 | 44.37 | 12.81 | 19.25 | 18.59 | 8.46 |
| vinai/PhoWhisper-small | 244M | 33.96 | 55.88 | 65.41 | 62.35 | 26.02 | 25.50 | 15.99 | 34.20 | 11.23 | 16.11 | 14.09 | 6.23 |
| vinai/PhoWhisper-medium | 769M | 26.46 | 51.20 | 59.04 | 55.39 | 24.76 | 24.90 | 14.06 | 26.38 | 10.25 | 14.44 | 11.34 | 4.93 |
| vinai/PhoWhisper-large | 1.5B | 26.82 | 50.39 | 59.44 | 56.70 | 24.47 | 24.37 | 13.70 | 27.45 | 10.08 | 12.62 | 11.18 | 4.73 |
| khanhld/chunkformer-large-vie | 110M | 27.60 | 46.30 | 51.91 | 49.09 | 22.60 | 19.59 | 14.09 | 25.81 | 8.85 | 14.17 | 11.77 | 4.18 |
| nguyenvulebinh/wav2vec2-base-vi | 95M | 23.71 | 40.49 | 48.90 | 46.33 | 23.03 | 22.96 | 13.14 | 37.33 | 9.89 | 20.09 | 11.42 | 6.60 |
| hynt/Zipformer-30M-RNNT-6000h | 30M | 19.95 | 38.77 | 45.19 | 43.89 | 19.85 | 19.93 | 11.76 | 28.63 | 9.12 | 13.16 | 7.28 | 4.60 |
| VietASR-zipformer | 65M | 20.30 | 42.21 | 49.01 | 47.86 | 22.05 | 21.90 | 14.54 | 31.18 | 10.23 | 14.76 | 10.15 | 6.92 |
| Qwen/Qwen3-ASR-1.7B | 1.7B | 26.34 | 46.80 | 59.85 | 51.84 | 20.11 | 20.21 | 16.29 | 34.26 | 9.64 | 10.13 | 11.16 | 7.17 |
| Qwen/Qwen3-ASR-0.6B | 600M | 32.29 | 48.57 | 61.88 | 55.43 | 22.65 | 22.51 | 18.62 | 43.44 | 10.96 | 13.11 | 14.37 | 10.23 |
| nvidia/parakeet-ctc-0.6b-Vietnamese | 600M | 31.82 | 55.33 | 61.65 | 56.70 | 23.79 | 23.53 | 17.00 | 37.94 | 10.46 | 16.11 | 12.95 | 7.76 |
| g-group-ai-lab/gipformer-65M-rnnt | 65M | 15.53 | 25.10 | 32.27 | 32.62 | 19.35 | 19.41 | 13.39 | 20.40 | 8.96 | 12.92 | 7.17 | 4.12 |
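
For readers unfamiliar with the metric, the following is a minimal sketch of how a WER value like those in the table can be computed. It assumes whitespace tokenization and no text normalization; the exact scoring pipeline used for this table is not described here, and the function name `wer` is only illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(wer("a b c d", "a b d"))  # one deletion out of four words -> 25.0
print(wer("a", "a b c"))        # two insertions on a one-word reference -> 200.0
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100%, which is how a value such as 115.07 can appear in the table.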
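
Rank of gipformer-65M-rnnt on each of the 12 benchmarks (tele-difficult-* and vlsp-2020-task-* correspond to the tele-diff-* and vlsp-t* columns of the results table):

| Rank | Count | Benchmarks |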
|---|---|---|
| #1 | 9 / 12 | tele-medium, tele-difficult-north, tele-difficult-middle, tele-difficult-south, MultiMED, VietMed, vlsp-2020-task-2, ViMD, vivos |
| #2 | 1 / 12 | LSVSC (8.96) |
| #3 | 2 / 12 | vlsp-2020-task-1 (13.39), Fleurs (12.92) |
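
The rank summary can be reproduced from the WER table with a few lines of Python. The sketch below is illustrative only: it copies three benchmarks and five of the thirteen models from the table (enough to include every model that beats gipformer-65M-rnnt on those benchmarks), and the helper `rank_of` is a hypothetical name, not part of any released tooling.

```python
# WER (%) copied from the results table for a subset of models and benchmarks.
WER = {
    "LSVSC": {
        "vinai/PhoWhisper-large": 10.08,
        "khanhld/chunkformer-large-vie": 8.85,
        "hynt/Zipformer-30M-RNNT-6000h": 9.12,
        "Qwen/Qwen3-ASR-1.7B": 9.64,
        "g-group-ai-lab/gipformer-65M-rnnt": 8.96,
    },
    "Fleurs": {
        "vinai/PhoWhisper-large": 12.62,
        "khanhld/chunkformer-large-vie": 14.17,
        "hynt/Zipformer-30M-RNNT-6000h": 13.16,
        "Qwen/Qwen3-ASR-1.7B": 10.13,
        "g-group-ai-lab/gipformer-65M-rnnt": 12.92,
    },
    "vivos": {
        "vinai/PhoWhisper-large": 4.73,
        "khanhld/chunkformer-large-vie": 4.18,
        "hynt/Zipformer-30M-RNNT-6000h": 4.60,
        "Qwen/Qwen3-ASR-1.7B": 7.17,
        "g-group-ai-lab/gipformer-65M-rnnt": 4.12,
    },
}

TARGET = "g-group-ai-lab/gipformer-65M-rnnt"


def rank_of(model: str, scores: dict[str, float]) -> int:
    """1-based rank of `model`; lower WER is better, ties share the better rank."""
    return 1 + sum(1 for score in scores.values() if score < scores[model])


ranks = {bench: rank_of(TARGET, scores) for bench, scores in WER.items()}
print(ranks)  # {'LSVSC': 2, 'Fleurs': 3, 'vivos': 1}
```

Extending `WER` with the remaining rows and columns of the table gives the full 9 / 1 / 2 split shown in the rank summary.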
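
Private test sets (call center domain): tele-medium, tele-diff-north, tele-diff-middle, tele-diff-south.
Public test sets: MultiMED, VietMed, vlsp-t1, vlsp-t2, LSVSC, Fleurs, ViMD, vivos.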
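Call center ASR is one of the most challenging real-world domains: noisy phone lines, overlapping speech, diverse regional accents, and spontaneous conversation. gipformer-65M-rnnt has the lowest WER on all four call center test sets (15.53 on tele-medium; 25.10, 32.27, and 32.62 on the north, middle, and south tele-diff sets), 4.42 to 13.67 absolute points ahead of the next-best model on each set, hynt/Zipformer-30M-RNNT-6000h.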
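One way to quantify the call center result is to compare gipformer-65M-rnnt with the next-best row of the table on each call center set, which is hynt/Zipformer-30M-RNNT-6000h in all four cases. The sketch below computes the absolute gap in WER points and the relative WER reduction. The relative figures are derived here from the table and are not reported separately in this card, and `wer_gap` is a hypothetical helper name.

```python
# WER (%) copied from the results table.
OURS = {  # g-group-ai-lab/gipformer-65M-rnnt
    "tele-medium": 15.53,
    "tele-diff-north": 25.10,
    "tele-diff-middle": 32.27,
    "tele-diff-south": 32.62,
}
NEXT_BEST = {  # hynt/Zipformer-30M-RNNT-6000h
    "tele-medium": 19.95,
    "tele-diff-north": 38.77,
    "tele-diff-middle": 45.19,
    "tele-diff-south": 43.89,
}


def wer_gap(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gap in WER points, relative WER reduction in percent)."""
    absolute = baseline - ours
    relative = 100.0 * absolute / baseline
    return absolute, relative


gaps = {bench: wer_gap(OURS[bench], NEXT_BEST[bench]) for bench in OURS}
for bench, (abs_gap, rel_gap) in gaps.items():
    print(f"{bench}: {abs_gap:.2f} points absolute, {rel_gap:.1f}% relative")
```

Reporting both numbers is useful because an absolute gap of a few points means more on an easy set with low WER than on a hard set where every model is above 30%.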
@misc{gipformer,
title={gipformer-65M-rnnt: Efficient Vietnamese Speech Recognition},
author={G-Group AI Lab},
year={2026},
url={https://huggingface.co/g-group-ai-lab/gipformer-65M-rnnt}
}
This model is released under the MIT License.
Developed by G-Group AI Lab. For questions, issues, or collaboration inquiries, please visit our HuggingFace organization page.