AST-Soloni 114M (End-to-End Speech Translation)


st-soloni-114m-tdt-ctc is an end-to-end Speech Translation (ST) model that translates Bambara audio directly into French text. It is based on the FastConformer architecture and was pretrained for ASR on jeli-asr and Kunkado (soloni-v1) before being fine-tuned for translation.

🚨 Important Note

This model is a baseline for research on low-resource speech translation. It was trained on non-professional ("amateur") translations, which exhibit high variance in quality.

NVIDIA NeMo: Training

To use this model, ensure you have the NVIDIA NeMo toolkit installed:

pip install nemo_toolkit['asr']

How to Use This Model

Load Model with NeMo

import nemo.collections.asr as nemo_asr
# This model uses the Hybrid RNNT-CTC encoder-decoder structure adapted for ST
st_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="anonymousnowhere/st-soloni-114m-tdt-ctc")

Translate audio

# Translates Bambara audio directly into French text
translations = st_model.transcribe(['bambara_sample.wav'])
print(translations[0])

Model Architecture

This model utilizes the FastConformer encoder, which features 8x depthwise-separable convolutional downsampling for efficiency. While originally an ASR architecture, this model is trained as an E2E-ST system where the decoder predicts French text tokens directly from Bambara speech features.
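For intuition, assuming the 10 ms feature hop commonly used by FastConformer models (the hop size is an assumption, not stated in this card), the 8x downsampling means each encoder output frame covers roughly 80 ms of audio:

```python
def encoder_frames(audio_seconds, hop_ms=10, downsample=8):
    # Number of encoder output frames after 8x convolutional downsampling,
    # assuming input features spaced hop_ms apart (hypothetical hop size).
    input_frames = int(audio_seconds * 1000 / hop_ms)
    return input_frames // downsample

print(encoder_frames(10))  # 125 encoder frames for a 10-second utterance
```

This shortening of the encoder sequence is what makes FastConformer efficient relative to the original Conformer's 4x downsampling.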

Training

The model was trained in two stages:

  1. Pre-training: initialized from the ASR checkpoint RobotsMali/soloni-114m-tdt-ctc-v1

  2. Fine-tuning: trained on the audio-French pairs of the Jeli-ASR dataset (~30 hours)

Optimization used AdamW with a Noam scheduler, a peak learning rate of 0.001, and a 1,000-step warmup.
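As a rough illustration, the Noam schedule with these settings (peak learning rate 0.001, 1,000-step warmup) can be sketched in plain Python; NeMo's actual scheduler parameterization may differ:

```python
import math

def noam_lr(step, peak_lr=1e-3, warmup_steps=1000):
    """Linear warmup to peak_lr over warmup_steps, then inverse-sqrt decay.
    Illustrative sketch only; the exact NeMo config may use a different form."""
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, math.sqrt(warmup_steps / step))

print(noam_lr(1000))  # 0.001 -- the rate peaks exactly at the end of warmup
```

After warmup, the rate decays proportionally to 1/sqrt(step), e.g. it halves by step 4,000.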

The fine-tuning code and configurations can be found in our Anonymous Github repository.

Dataset

This model was trained and evaluated on Jeli-ASR, a corpus of ~30 hours of Bambara speech with French translations provided by native speakers. The translations are semi-professional: only about 10 hours were produced by trained linguists.

Evaluation

This model was evaluated on the test set of Jeli-ASR. We report the Word Error Rate (WER), the Character Error Rate (CER), and the Bilingual Evaluation Understudy (BLEU) score.

| Benchmark     | Decoding | WER (%) ↓ | CER (%) ↓ | BLEU ↑ |
|---------------|----------|-----------|-----------|--------|
| Jeli-ASR Test | CTC      | 73.90     | 55.98     | 17.28  |
| Jeli-ASR Test | TDT      | 70.43     | 58.17     | 24.18  |
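For reference, WER is the word-level edit distance between hypothesis and reference divided by the total reference length (CER is the same computation over characters). A minimal self-contained sketch, not the scoring script used for these results:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(refs, hyps):
    """Word Error Rate in percent over reference/hypothesis pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return 100.0 * errors / sum(len(r.split()) for r in refs)

print(wer(["le chat dort bien"], ["le chien dort"]))  # 50.0
```

Note that because errors are normalized by reference length, WER can exceed 100% for very poor hypotheses, which is common in low-resource ST baselines like this one.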

License

This model is released under the CC-BY-4.0 license.

