gipformer-65M-rnnt: Efficient Vietnamese Speech Recognition

Highlights

State-of-the-art accuracy: Ranks first on 9 of the 12 Vietnamese ASR benchmarks reported below and within the top three on the remaining 3.
Robust handling of telephonic domains: Achieves the lowest WER on all four call center test sets, including low-quality recordings with Northern, Central, and Southern Vietnamese accents.
Outstanding parameter efficiency: At 65M parameters, it is smaller than 9 of the 12 models it is compared against.
Seamless edge deployment: Its low resource requirements make it well suited to fast, offline inference on mobile and embedded devices.
Built-in data privacy: Because the model can run entirely on-device, sensitive audio does not need to be sent to third-party cloud services.

gipformer-65M-rnnt is fine-tuned from the VietASR-zipformer model.

Benchmark Results

We evaluate gipformer-65M-rnnt against 12 established Vietnamese ASR models on 12 benchmarks spanning call center, medical, broadcast, read speech, and other domains. All numbers are Word Error Rate (WER, %); lower is better.

Model	Params	tele-medium	tele-difficult-north	tele-difficult-middle	tele-difficult-south	MultiMED	VietMed	vlsp-2020-task-1	vlsp-2020-task-2	LSVSC	Fleurs	ViMD	vivos
vinai/PhoWhisper-tiny	39M	49.33	115.07	88.35	91.08	32.05	31.52	21.59	52.91	13.37	25.86	23.12	10.10
vinai/PhoWhisper-base	74M	42.61	65.05	74.10	71.31	28.32	28.39	19.97	44.37	12.81	19.25	18.59	8.46
vinai/PhoWhisper-small	244M	33.96	55.88	65.41	62.35	26.02	25.50	15.99	34.20	11.23	16.11	14.09	6.23
vinai/PhoWhisper-medium	769M	26.46	51.20	59.04	55.39	24.76	24.90	14.06	26.38	10.25	14.44	11.34	4.93
vinai/PhoWhisper-large	1.5B	26.82	50.39	59.44	56.70	24.47	24.37	13.70	27.45	10.08	12.62	11.18	4.73
khanhld/chunkformer-large-vie	110M	27.60	46.30	51.91	49.09	22.60	19.59	14.09	25.81	8.85	14.17	11.77	4.18
nguyenvulebinh/wav2vec2-base-vi	95M	23.71	40.49	48.90	46.33	23.03	22.96	13.14	37.33	9.89	20.09	11.42	6.60
hynt/Zipformer-30M-RNNT-6000h	30M	19.95	38.77	45.19	43.89	19.85	19.93	11.76	28.63	9.12	13.16	7.28	4.60
VietASR-zipformer	65M	20.30	42.21	49.01	47.86	22.05	21.90	14.54	31.18	10.23	14.76	10.15	6.92
Qwen/Qwen3-ASR-1.7B	1.7B	26.34	46.80	59.85	51.84	20.11	20.21	16.29	34.26	9.64	10.13	11.16	7.17
Qwen/Qwen3-ASR-0.6B	600M	32.29	48.57	61.88	55.43	22.65	22.51	18.62	43.44	10.96	13.11	14.37	10.23
nvidia/parakeet-ctc-0.6b-Vietnamese	600M	31.82	55.33	61.65	56.70	23.79	23.53	17.00	37.94	10.46	16.11	12.95	7.76
g-group-ai-lab/gipformer-65M-rnnt	65M	15.53	25.10	32.27	32.62	19.35	19.41	13.39	20.40	8.96	12.92	7.17	4.12
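
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used for this table; the function name is illustrative, and any text normalization (casing, punctuation, number formatting) applied before scoring is not specified here and is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution plus one insertion against a 4-word reference.
print(word_error_rate("xin chào các bạn", "xin chao các bạn nhé"))  # 50.0
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100%, which is how a value such as 115.07 can appear in the table.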

Rankings Summary

Rank	Count	Benchmarks
#1	9 / 12	tele-medium, tele-difficult-north, tele-difficult-middle, tele-difficult-south, MultiMED, VietMed, vlsp-2020-task-2, ViMD, vivos
#2	1 / 12	LSVSC (8.96)
#3	2 / 12	vlsp-2020-task-1 (13.39), Fleurs (12.92)

Dataset Descriptions

Private test sets (call center domain):

tele-medium — Call center recordings with medium difficulty
tele-difficult-north — Low-quality call center audio, hard-to-hear speakers — Northern Vietnamese accent
tele-difficult-middle — Low-quality call center audio, hard-to-hear speakers — Central Vietnamese accent
tele-difficult-south — Low-quality call center audio, hard-to-hear speakers — Southern Vietnamese accent

Public test sets:

MultiMED — Multi-domain medical conversations
VietMed — Vietnamese medical domain
vlsp-2020-task-1 — VLSP 2020 ASR Shared Task 1
vlsp-2020-task-2 — VLSP 2020 ASR Shared Task 2
LSVSC — Large-Scale Vietnamese Speech Corpus
Fleurs — Google's Few-shot Learning Evaluation of Universal Representations of Speech (Vietnamese subset)
ViMD — Vietnamese Multi-Domain
vivos — Vietnamese read speech corpus

Call Center Domain: Where It Matters Most

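Call center ASR is one of the most challenging real-world domains: it combines noisy phone lines, overlapping speech, diverse regional accents, and spontaneous conversation. gipformer-65M-rnnt has the lowest WER on all four call center test sets, ahead of the next-best model (hynt/Zipformer-30M-RNNT-6000h) by 4.42 points on tele-medium (15.53 vs. 19.95) and by 11.27 to 13.67 points on the three tele-difficult sets.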
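
To put the call center gap in relative terms, the sketch below computes the relative WER reduction of gipformer-65M-rnnt over the best other model on each private test set. The WER values are copied from the benchmark table (the best other model is hynt/Zipformer-30M-RNNT-6000h in all four cases); the dictionary and function names are illustrative only.

```python
# WER (%) from the benchmark table: (gipformer-65M-rnnt, best other model).
CALL_CENTER_WER = {
    "tele-medium": (15.53, 19.95),
    "tele-difficult-north": (25.10, 38.77),
    "tele-difficult-middle": (32.27, 45.19),
    "tele-difficult-south": (32.62, 43.89),
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Relative WER reduction in percent: (baseline - ours) / baseline."""
    return 100.0 * (baseline - ours) / baseline


for name, (ours, baseline) in CALL_CENTER_WER.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1f}% relative WER reduction")
# tele-medium: 22.2% relative WER reduction
# tele-difficult-north: 35.3% relative WER reduction
# tele-difficult-middle: 28.6% relative WER reduction
# tele-difficult-south: 25.7% relative WER reduction
```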

Usage

PyTorch model: Use icefall for inference or training.
ONNX model (recommended): Use sherpa-onnx for inference on CPU/GPU, mobile, and embedded devices.
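
As a rough illustration of the ONNX route, the sketch below decodes a single WAV file with the sherpa-onnx Python package. It rests on several assumptions that are not stated in this card: the model file names (`encoder.onnx`, `decoder.onnx`, `joiner.onnx`, `tokens.txt`) and the paths in the usage note are hypothetical placeholders, the input is assumed to be 16-bit PCM audio, and greedy search is chosen only for simplicity. Check the files actually shipped in the repository before using it.

```python
import array
import sys
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM WAV file; return (sample_rate, samples in [-1, 1))."""
    with wave.open(path, "rb") as f:
        if f.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        channels = f.getnchannels()
        sample_rate = f.getframerate()
        pcm = array.array("h", f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    mono = pcm[::channels]  # keep only the first channel
    return sample_rate, [s / 32768.0 for s in mono]


def transcribe(wav_path, model_dir):
    """Decode one file with a transducer model (file names are assumptions)."""
    # Imported lazily so the WAV helper works without these packages.
    import numpy as np
    import sherpa_onnx

    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder=f"{model_dir}/encoder.onnx",  # hypothetical file name
        decoder=f"{model_dir}/decoder.onnx",  # hypothetical file name
        joiner=f"{model_dir}/joiner.onnx",    # hypothetical file name
        tokens=f"{model_dir}/tokens.txt",     # hypothetical file name
        num_threads=2,
        decoding_method="greedy_search",
    )
    sample_rate, samples = read_wav_mono(wav_path)
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, np.asarray(samples, dtype=np.float32))
    recognizer.decode_stream(stream)
    return stream.result.text
```

A call would then look like `print(transcribe("call.wav", "./gipformer-onnx"))`, where both paths are placeholders for a local recording and the directory holding the downloaded ONNX files.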

Citation

@misc{gipformer,
  title={gipformer-65M-rnnt: Efficient Vietnamese Speech Recognition},
  author={G-Group AI Lab},
  year={2026},
  url={https://huggingface.co/g-group-ai-lab/gipformer-65M-rnnt}
}

License

This model is released under the MIT License.

Acknowledgments

Developed by G-Group AI Lab. For questions, issues, or collaboration inquiries, please visit our Hugging Face organization page.


Paper for g-group-ai-lab/gipformer-65M-rnnt

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining (paper 2505.21527, published May 23, 2025)

Evaluation results

Benchmark	WER (%, self-reported)
vivos	4.12
LSVSC	8.96
Fleurs	12.92
ViMD	7.17
MultiMED	19.35