# AST-Soloni 114M (End-to-End Speech Translation)
`st-soloni-114m-tdt-ctc` is an end-to-end Speech Translation (ST) model that translates Bambara audio directly into French text. It is based on the FastConformer architecture and was pretrained for ASR on jeli-asr and Kunkado (soloni-v1) before being fine-tuned for translation.
## 🚨 Important Note
This model is a baseline for research on low-resource speech translation. It was trained on "amateur" translations, which exhibit high variance.
## NVIDIA NeMo: Training
To use this model, ensure you have the NVIDIA NeMo toolkit installed:

```bash
pip install "nemo_toolkit[asr]"
```
## How to Use This Model
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr

# This model uses the Hybrid RNNT-CTC encoder-decoder structure adapted for ST
st_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="anonymousnowhere/st-soloni-114m-tdt-ctc"
)
```
### Translate Audio
```python
# Translate Bambara audio directly into French text
translations = st_model.transcribe(['bambara_sample.wav'])
print(translations[0])
```
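The evaluation below reports both CTC and TDT decoding. As a hedged sketch, assuming the standard NeMo API for hybrid RNNT/TDT-CTC models (verify against your installed NeMo version), you can switch decoders before calling `transcribe`:

```python
# Sketch: switching between the model's two decoders.
# change_decoding_strategy(decoder_type=...) follows NeMo's hybrid-model API;
# confirm availability in your NeMo version.

# Decode with the auxiliary CTC head
st_model.change_decoding_strategy(decoder_type='ctc')
ctc_translations = st_model.transcribe(['bambara_sample.wav'])

# Switch back to the default TDT (RNNT-family) head
st_model.change_decoding_strategy(decoder_type='rnnt')
tdt_translations = st_model.transcribe(['bambara_sample.wav'])
```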
## Model Architecture
This model utilizes the FastConformer encoder, which features 8x depthwise-separable convolutional downsampling for efficiency. While originally an ASR architecture, this model is trained as an E2E-ST system where the decoder predicts French text tokens directly from Bambara speech features.
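For intuition, here is a minimal sketch of what the 8x downsampling means for sequence length, assuming NeMo's usual 10 ms feature hop (both numbers are illustrative, not read from this model's config):

```python
# Illustrative only: how 8x downsampling shortens the encoder sequence.
hop_ms = 10        # mel-spectrogram hop, the common NeMo default (assumed)
subsampling = 8    # FastConformer's 8x convolutional downsampling

audio_seconds = 12.0
feature_frames = int(audio_seconds * 1000 / hop_ms)  # 1200 feature frames
encoder_frames = feature_frames // subsampling       # 150 encoder frames
print(f"{feature_frames} frames -> {encoder_frames} frames "
      f"({hop_ms * subsampling} ms per encoder frame)")
```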
## Training
The model was trained following a two-stage process:

- **Pre-training:** Initialized from RobotsMali/soloni-114m-tdt-ctc-v1.
- **Finetuning:** Trained on the Jeli-ASR dataset (30 hours) using its audio-French translation pairs.
- **Hyperparameters:** Optimized using AdamW with a Noam scheduler, a peak learning rate of 0.001, and a 1,000-step warmup (a config sketch follows below).
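A minimal sketch of the matching NeMo-style `optim` section; only the optimizer, scheduler, peak learning rate, and warmup come from this card, while every other value is a placeholder assumption:

```python
# Sketch of a NeMo-style optimizer config for the stated hyperparameters.
# Only lr, warmup_steps, and the optimizer/scheduler names come from the card;
# betas, weight decay, and d_model are assumptions.
from omegaconf import OmegaConf

optim_cfg = OmegaConf.create({
    "name": "adamw",
    "lr": 1.0e-3,                # peak learning rate (from this card)
    "betas": [0.9, 0.98],        # assumed
    "weight_decay": 1.0e-3,      # assumed
    "sched": {
        "name": "NoamAnnealing", # Noam scheduler, as implemented in NeMo
        "warmup_steps": 1000,    # from this card
        "d_model": 512,          # Noam scale factor; assumed for this model size
    },
})
```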
The fine-tuning code and configurations can be found in our Anonymous GitHub repository.
## Dataset
This model was trained and evaluated on Jeli-ASR, a corpus of ~30 hours of Bambara speech with French translations provided by native speakers. The translations are semi-professional: only about 10 hours were completed by trained linguists.
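For reference, a hedged sketch of how such audio-translation pairs are typically laid out in a NeMo training manifest (one JSON object per line; the field layout follows NeMo's usual convention, while the path, duration, and sentence below are made up):

```python
# Illustrative NeMo-style manifest entry for ST fine-tuning.
# The "text" field carries the French target instead of a Bambara transcript.
import json

entry = {
    "audio_filepath": "audio/jeli_0001.wav",  # hypothetical Bambara recording
    "duration": 6.2,                          # seconds (made up)
    "text": "Le griot raconte l'histoire du village.",  # French translation
}

with open("train_manifest.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```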
## Evaluation
This model was evaluated on the test set of Jeli-ASR. We report the Word Error Rate (WER), the Character Error Rate (CER), and the Bilingual Evaluation Understudy (BLEU) score.
| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ | BLEU ↑ |
|---|---|---|---|---|
| Jeli-ASR Test | CTC | 73.90 | 55.98 | 17.28 |
| Jeli-ASR Test | TDT | 70.43 | 58.17 | 24.18 |
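A hedged sketch of how these metrics can be computed from hypothesis/reference pairs, using the common `jiwer` and `sacrebleu` packages (not necessarily the exact tooling used for this card):

```python
# Sketch: computing WER, CER, and BLEU for hypothesis/reference pairs.
import jiwer
import sacrebleu

refs = ["le griot raconte l'histoire du village"]    # reference translations
hyps = ["le griot raconte une histoire du village"]  # model outputs (made up)

wer = jiwer.wer(refs, hyps) * 100
cer = jiwer.cer(refs, hyps) * 100
bleu = sacrebleu.corpus_bleu(hyps, [refs]).score

print(f"WER: {wer:.2f}%  CER: {cer:.2f}%  BLEU: {bleu:.2f}")
```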
## License
This model is released under the CC-BY-4.0 license.