TL;DR: GPA unifies three speech tasks in a single model. This repo includes code for training, fine-tuning, and efficient deployment of GPA.
Abstract
GPA stands for General Purpose Audio.
In academia, a student's GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects, ranging from Calculus and Philosophy to Gym class.
Similarly, our GPA model unifies three major pillars of audio processing in a single autoregressive transformer: Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC).
- Our open-source content includes support for multiple frameworks and provides production-ready code suitable for cloud deployment.
- We include concise inference examples and training pipelines for research purposes.
- The released 0.3B model is also well suited to edge devices; edge deployment support is still to be released.
Model Overview
Model Performance
The following results are obtained by benchmarking services instantiated via the official deployment scripts, reflecting end-to-end performance in realistic serving scenarios rather than offline inference.
Among currently available open-source systems, our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.
Note
- TTFC: Time To First Chunk (TTS)
- TTFT: Time To First Token (ASR)
- RTF: Real-Time Factor (synthesis time / audio duration; lower is better, and values below 1 mean faster than real time)
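The latency metrics in this note can be derived from a handful of timestamps per request. The sketch below is illustrative only: the `StreamTiming` record and its field names are hypothetical, not part of the GPA codebase, and RTF is assumed to be wall-clock synthesis time divided by audio duration, so that values below 1 mean faster than real time.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Timestamps in seconds, as recorded by a hypothetical benchmark client."""
    request_sent: float    # when the request was issued
    first_chunk: float     # when the first audio chunk (or first token) arrived
    last_chunk: float      # when the stream finished
    audio_duration: float  # length of the synthesized audio


def ttfc_ms(t: StreamTiming) -> float:
    """Time to first chunk (or token), in milliseconds."""
    return (t.first_chunk - t.request_sent) * 1000.0


def rtf(t: StreamTiming) -> float:
    """Real-time factor: synthesis time / audio duration (assumed convention)."""
    if t.audio_duration <= 0:
        raise ValueError("audio_duration must be positive")
    return (t.last_chunk - t.request_sent) / t.audio_duration


# Made-up timings for illustration, not benchmark data.
timing = StreamTiming(request_sent=0.0, first_chunk=0.25, last_chunk=1.5, audio_duration=6.0)
print(ttfc_ms(timing))  # 250.0
print(rtf(timing))      # 0.25
```

The same first-arrival calculation gives TTFT for ASR if `first_chunk` is taken to be the arrival time of the first transcribed token.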
TTS Streaming Benchmark (Latency & Throughput)
| Concurrency | Avg TTFC (ms) | P50 TTFC (ms) | P99 TTFC (ms) | Avg RTF | P50 RTF | P99 RTF | Audio Dur (s) |
|---|---|---|---|---|---|---|---|
| 1 | 258.8 | 258.8 | 258.8 | 0.197 | 0.197 | 0.197 | 6.44 |
| 5 | 385.0 | 394.7 | 396.2 | 0.218 | 0.217 | 0.248 | 6.76 |
| 10 | 544.6 | 564.2 | 566.7 | 0.282 | 0.301 | 0.313 | 6.49 |
| 20 | 977.8 | 977.9 | 982.9 | 0.470 | 0.490 | 0.538 | 7.19 |
| 40 | 1797.0 | 1736.4 | 2564.5 | 0.421 | 0.400 | 0.587 | 6.33 |
| 80 | 3786.4 | 4054.4 | 5415.8 | 0.763 | 0.763 | 1.096 | 6.32 |
| 160 | 9847.9 | 10239.9 | 14350.3 | 1.718 | 1.740 | 2.577 | 6.44 |
Table 2. TTS Streaming RTF and Audio Duration
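The Avg, P50, and P99 columns summarize per-request measurements at each concurrency level. As a minimal sketch of how such a summary could be computed, the code below uses the nearest-rank percentile definition; which percentile method the official deployment scripts use is not stated here, so treat this choice as an assumption.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(values):
    """Return the three statistics reported per concurrency level."""
    return {
        "avg": mean(values),
        "p50": percentile(values, 50),
        "p99": percentile(values, 99),
    }


# Made-up latencies in milliseconds, not benchmark data.
samples = [380.0, 390.0, 395.0, 396.0, 400.0]
print(summarize(samples))
```

With a single sample all three statistics coincide, which is consistent with the identical Avg, P50, and P99 values in the concurrency-1 row.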
ASR Streaming Benchmark
| Concurrency | Avg TTFT (ms) | P50 TTFT (ms) | P99 TTFT (ms) | Avg Total (ms) |
|---|---|---|---|---|
| 1 | 157.5 | 157.5 | 157.5 | 190.9 |
| 5 | 394.1 | 393.7 | 395.9 | 400.0 |
| 10 | 589.6 | 721.3 | 723.3 | 598.1 |
| 20 | 1316.3 | 1495.6 | 1500.4 | 1317.8 |
| 40 | 2690.9 | 2678.3 | 2861.4 | 2693.7 |
| 80 | 3833.4 | 3961.3 | 4027.0 | 3845.1 |
| 160 | 5037.0 | 5689.3 | 6676.0 | 5044.0 |
Table 3. ASR Streaming Latency vs Concurrency
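Latency-versus-concurrency measurements of this kind could be collected by a client that launches N streaming requests at once and timestamps the first and last item of each stream. The sketch below shows that pattern with `asyncio`. `fake_stream` is a stand-in that only sleeps and yields placeholder tokens; it is not the GPA serving API, and the official benchmark scripts may be structured differently.

```python
import asyncio
import time


async def fake_stream(n_tokens: int = 3, delay: float = 0.01):
    """Stand-in for a streaming ASR/TTS response (hypothetical, not the GPA API)."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i}"


async def measure_one() -> dict:
    """Consume one stream, recording time to first token and total time in ms."""
    start = time.perf_counter()
    first = None
    async for _ in fake_stream():
        if first is None:
            first = time.perf_counter()
    end = time.perf_counter()
    return {"ttft_ms": (first - start) * 1000.0, "total_ms": (end - start) * 1000.0}


async def run_benchmark(concurrency: int) -> list:
    """Launch `concurrency` requests at once and collect their timings."""
    return await asyncio.gather(*(measure_one() for _ in range(concurrency)))


results = asyncio.run(run_benchmark(concurrency=5))
print(len(results), "requests measured")
```

Against a real service, `fake_stream` would be replaced by the actual streaming client call, and the sweep repeated for each concurrency level in the table.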
Evaluation Metric Results
TTS Evaluation Table
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
|---|---|---|---|---|---|---|
| Multi-Stage or NAR Methods | ||||||
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | β | - | 1.12 | 79.6 | 2.25 | 76.2 |
| MiniMax-Speech | β | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | β | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | β | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | β | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | β | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | β | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | β | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | β | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | β | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | β | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | β | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | β | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | β | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| One-Stage AR Methods | ||||||
| Spark TTS | β | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B-preview | β | 0.3B | 0.95 | 65.9 | 1.51 | 56.5 |
ASR Evaluation Table
Note: ASR results on LibriSpeech and AISHELL-1. WER (%) is reported for LibriSpeech and CER (%) for AISHELL-1; lower is better.
| Model | Model Size | LibriSpeech test-clean WER (%) | AISHELL-1 CER (%) |
|---|---|---|---|
| Models with < 0.5B parameters | |||
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B-preview | 0.3B | 8.88 | 4.50 |
| Models with > 0.5B parameters | |||
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
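WER and CER in the evaluation tables are both edit-distance error rates, computed over words and characters respectively. The following is a minimal sketch of the standard definition, not the evaluation script used for these results. Real evaluations usually also normalize text (casing, punctuation, number formats) before scoring, which is omitted here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (%), splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (%), ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")), list(hypothesis.replace(" ", "")))


# One substitution and one deletion against a six-word reference.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# One substituted character against a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.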
Acknowledgements
We borrowed a lot of code from the following excellent projects:
Citation
If you find GPA useful for your research or projects, please cite us:
@misc{gpa2026,
title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformer},
author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
year={2026},
howpublished={\url{https://github.com/AutoArk/GPA}},
}