gipformer-65M-rnnt: Efficient Vietnamese Speech Recognition

Highlights

  • State-of-the-art accuracy: Achieves the lowest WER on 9 of the 12 Vietnamese ASR benchmarks reported below.
  • Robust handling of telephonic domains: Achieves the lowest WER on all four call center test sets, which cover noisy real-world recordings in Northern, Central, and Southern Vietnamese accents.
  • Outstanding parameter efficiency: At 65M parameters, it is one of the smallest models in the comparison below; only two of the twelve compared models (30M and 39M) are smaller.
  • Seamless edge deployment: Its low resource requirements make it suitable for fast inference on mobile and embedded systems, including offline, on-device applications.
  • Built-in data privacy: Because the model can run fully locally, sensitive audio can be processed on-device without sending it to third-party cloud services.
  • Base model: gipformer-65M-rnnt is fine-tuned from the VietASR-zipformer model.

Benchmark Results

We evaluate gipformer-65M-rnnt against 12 established Vietnamese ASR models on 12 benchmarks spanning call center, medical, broadcast, read speech, and other domains. All numbers are Word Error Rate (WER, %); lower is better.

| Model | Params | tele-medium | tele-diff-north | tele-diff-middle | tele-diff-south | MultiMED | VietMed | vlsp-t1 | vlsp-t2 | LSVSC | Fleurs | ViMD | vivos |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vinai/PhoWhisper-tiny | 39M | 49.33 | 115.07 | 88.35 | 91.08 | 32.05 | 31.52 | 21.59 | 52.91 | 13.37 | 25.86 | 23.12 | 10.10 |
| vinai/PhoWhisper-base | 74M | 42.61 | 65.05 | 74.10 | 71.31 | 28.32 | 28.39 | 19.97 | 44.37 | 12.81 | 19.25 | 18.59 | 8.46 |
| vinai/PhoWhisper-small | 244M | 33.96 | 55.88 | 65.41 | 62.35 | 26.02 | 25.50 | 15.99 | 34.20 | 11.23 | 16.11 | 14.09 | 6.23 |
| vinai/PhoWhisper-medium | 769M | 26.46 | 51.20 | 59.04 | 55.39 | 24.76 | 24.90 | 14.06 | 26.38 | 10.25 | 14.44 | 11.34 | 4.93 |
| vinai/PhoWhisper-large | 1.5B | 26.82 | 50.39 | 59.44 | 56.70 | 24.47 | 24.37 | 13.70 | 27.45 | 10.08 | 12.62 | 11.18 | 4.73 |
| khanhld/chunkformer-large-vie | 110M | 27.60 | 46.30 | 51.91 | 49.09 | 22.60 | 19.59 | 14.09 | 25.81 | 8.85 | 14.17 | 11.77 | 4.18 |
| nguyenvulebinh/wav2vec2-base-vi | 95M | 23.71 | 40.49 | 48.90 | 46.33 | 23.03 | 22.96 | 13.14 | 37.33 | 9.89 | 20.09 | 11.42 | 6.60 |
| hynt/Zipformer-30M-RNNT-6000h | 30M | 19.95 | 38.77 | 45.19 | 43.89 | 19.85 | 19.93 | 11.76 | 28.63 | 9.12 | 13.16 | 7.28 | 4.60 |
| VietASR-zipformer | 65M | 20.30 | 42.21 | 49.01 | 47.86 | 22.05 | 21.90 | 14.54 | 31.18 | 10.23 | 14.76 | 10.15 | 6.92 |
| Qwen/Qwen3-ASR-1.7B | 1.7B | 26.34 | 46.80 | 59.85 | 51.84 | 20.11 | 20.21 | 16.29 | 34.26 | 9.64 | 10.13 | 11.16 | 7.17 |
| Qwen/Qwen3-ASR-0.6B | 600M | 32.29 | 48.57 | 61.88 | 55.43 | 22.65 | 22.51 | 18.62 | 43.44 | 10.96 | 13.11 | 14.37 | 10.23 |
| nvidia/parakeet-ctc-0.6b-Vietnamese | 600M | 31.82 | 55.33 | 61.65 | 56.70 | 23.79 | 23.53 | 17.00 | 37.94 | 10.46 | 16.11 | 12.95 | 7.76 |
| g-group-ai-lab/gipformer-65M-rnnt | 65M | 15.53 | 25.10 | 32.27 | 32.62 | 19.35 | 19.41 | 13.39 | 20.40 | 8.96 | 12.92 | 7.17 | 4.12 |

Column abbreviations: tele-diff-north/middle/south = tele-difficult-north/middle/south; vlsp-t1 and vlsp-t2 = vlsp-2020-task-1 and vlsp-2020-task-2.
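For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used for the table; the function name `word_error_rate` is illustrative, and it compares whitespace-split tokens as-is, since the text normalization applied in the benchmark is not stated here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("a b c d", "a c d"))  # 25.0
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100%, which is how a value such as 115.07 can appear in the table.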

Rankings Summary

| Rank | Count | Benchmarks |
| --- | --- | --- |
| #1 | 9 / 12 | tele-medium, tele-difficult-north, tele-difficult-middle, tele-difficult-south, MultiMED, VietMed, vlsp-2020-task-2, ViMD, vivos |
| #2 | 1 / 12 | LSVSC (8.96) |
| #3 | 2 / 12 | vlsp-2020-task-1 (13.39), Fleurs (12.92) |

Dataset Descriptions

Private test sets (call center domain):

  • tele-medium: Call center recordings of medium difficulty
  • tele-difficult-north: Low-quality call center audio with hard-to-hear speakers, Northern Vietnamese accent
  • tele-difficult-middle: Low-quality call center audio with hard-to-hear speakers, Central Vietnamese accent
  • tele-difficult-south: Low-quality call center audio with hard-to-hear speakers, Southern Vietnamese accent

Public test sets:

  • MultiMED: Multi-domain medical conversations
  • VietMed: Vietnamese medical domain
  • vlsp-2020-task-1: VLSP 2020 ASR Shared Task 1
  • vlsp-2020-task-2: VLSP 2020 ASR Shared Task 2
  • LSVSC: Large-Scale Vietnamese Speech Corpus
  • Fleurs: Google's Few-shot Learning Evaluation of Universal Representations of Speech (Vietnamese subset)
  • ViMD: Vietnamese Multi-Domain
  • vivos: Vietnamese read speech corpus

Call Center Domain: Where It Matters Most

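Call center ASR is one of the most challenging real-world domains: noisy phone lines, overlapping speech, diverse regional accents, and spontaneous conversation. gipformer-65M-rnnt achieves the lowest WER on all four call center test sets: 15.53 on tele-medium (next best: 19.95), and 25.10, 32.27, and 32.62 on the difficult Northern, Central, and Southern sets (next best: 38.77, 45.19, and 43.89).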
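To express these margins in relative terms, the sketch below computes the relative WER reduction of gipformer-65M-rnnt against two rows of the benchmark table: the next-best model on the call center sets (hynt/Zipformer-30M-RNNT-6000h) and the model it was fine-tuned from (VietASR-zipformer). The names `relative_reduction` and `reductions_against` are illustrative helpers, not part of any released tooling; the numbers are copied from the table.

```python
# WER (%) on the four call center test sets, copied from the benchmark table.
GIPFORMER = {
    "tele-medium": 15.53,
    "tele-diff-north": 25.10,
    "tele-diff-middle": 32.27,
    "tele-diff-south": 32.62,
}
RUNNER_UP = {  # hynt/Zipformer-30M-RNNT-6000h
    "tele-medium": 19.95,
    "tele-diff-north": 38.77,
    "tele-diff-middle": 45.19,
    "tele-diff-south": 43.89,
}
BASE_MODEL = {  # VietASR-zipformer
    "tele-medium": 20.30,
    "tele-diff-north": 42.21,
    "tele-diff-middle": 49.01,
    "tele-diff-south": 47.86,
}


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def reductions_against(baseline):
    """Per-dataset relative reduction of gipformer-65M-rnnt versus a baseline row."""
    return {
        name: round(relative_reduction(baseline[name], GIPFORMER[name]), 1)
        for name in GIPFORMER
    }


if __name__ == "__main__":
    for label, row in (("runner-up", RUNNER_UP), ("base model", BASE_MODEL)):
        print(label, reductions_against(row))
```

With these table values, the relative reduction ranges from about 22.2% to 35.3% against the runner-up and from about 23.5% to 40.5% against the base model.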

Usage

  • PyTorch model: Use icefall for inference or training.
  • ONNX model (recommended): Use sherpa-onnx for inference on CPU/GPU, mobile, and embedded devices.
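As a rough illustration of the ONNX route, the sketch below decodes a single WAV file with the sherpa-onnx Python package (assuming `pip install sherpa-onnx numpy`). The file names `encoder.onnx`, `decoder.onnx`, `joiner.onnx`, and `tokens.txt` are hypothetical placeholders; check the model repository for the actual names. The 16 kHz sample rate, 80-dimensional features, and greedy search are likewise assumptions, not settings confirmed by this model card.

```python
import array
import sys
import wave
from pathlib import Path


def read_wave_mono16(path):
    """Read a mono 16-bit PCM WAV file.

    Returns (samples, sample_rate), where samples are floats in [-1, 1).
    """
    with wave.open(str(path), "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        sample_rate = f.getframerate()
        pcm = array.array("h", f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return array.array("f", (s / 32768.0 for s in pcm)), sample_rate


def transcribe(wav_path, model_dir):
    """Transcribe one WAV file with a transducer model exported to ONNX."""
    # Third-party imports stay local so the WAV helper has no dependencies.
    import numpy as np
    import sherpa_onnx

    model_dir = Path(model_dir)
    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        encoder=str(model_dir / "encoder.onnx"),  # hypothetical file name
        decoder=str(model_dir / "decoder.onnx"),  # hypothetical file name
        joiner=str(model_dir / "joiner.onnx"),    # hypothetical file name
        tokens=str(model_dir / "tokens.txt"),     # hypothetical file name
        num_threads=2,
        sample_rate=16000,  # assumed model sample rate
        feature_dim=80,     # assumed filterbank dimension
        decoding_method="greedy_search",
    )

    samples, sample_rate = read_wave_mono16(wav_path)
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, np.asarray(samples, dtype=np.float32))
    recognizer.decode_stream(stream)
    return stream.result.text
```

A call would look like `transcribe("call.wav", "path/to/model")`, where both paths are placeholders. Greedy search is used here only because it is the simplest decoding method to configure.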

Citation

@misc{gipformer,
  title={gipformer-65M-rnnt: Efficient Vietnamese Speech Recognition},
  author={G-Group AI Lab},
  year={2026},
  url={https://huggingface.co/g-group-ai-lab/gipformer-65M-rnnt}
}

License

This model is released under the MIT License.

Acknowledgments

Developed by G-Group AI Lab. For questions, issues, or collaboration inquiries, please visit our HuggingFace organization page.
