# NAMAA-Space/NAMAA-Saudi-TTS-V2
A voice-cloning text-to-speech model for Najdi Arabic (Saudi Arabic), fully fine-tuned from SWivid/Habibi-TTS's Saudi-specialized (SAU) checkpoint.
Given 5–8 seconds of a reference voice speaking Arabic, this model synthesizes arbitrary new Arabic text in that same voice, with Najdi-dialect prosody and pronunciation.
## Model Details
### Architecture
F5-TTS Diffusion Transformer (DiT), v1 Base configuration:
| Parameter | Value |
|---|---|
| Model family | F5-TTS (flow-matching text-to-speech) |
| Backbone | DiT (Diffusion Transformer) |
| Embedding dim | 1024 |
| Depth (layers) | 22 |
| Attention heads | 16 |
| FFN multiplier | 2 |
| Text embedding dim | 512 |
| Conv text encoder layers | 4 |
| Total parameters | ~335M |
| Tokenizer | Character-level (vocab size 2704) |
| Vocoder | charactr/vocos-mel-24khz |
| Mel features | 100 mels, 24kHz, hop 256, win 1024 |
See config.json for the full architecture spec.
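The mel settings in the table fix the time resolution of the model. A quick sketch of that arithmetic, using only the numbers from the table above (not read from `config.json`):

```python
# Mel feature geometry from the architecture table above (values copied
# from this card; illustrative arithmetic only).
SAMPLE_RATE = 24_000   # Hz
HOP_LENGTH = 256       # samples between consecutive mel frames
N_MELS = 100

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # 93.75 mel frames per second

def mel_frames(duration_s: float) -> int:
    """Approximate number of mel frames for a clip of the given duration."""
    return int(duration_s * frames_per_second)

print(frames_per_second)   # 93.75
print(mel_frames(5.0))     # 468 frames for a 5 s reference clip
```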
### Base model
This is a full fine-tune (not LoRA) warm-started from
SWivid/Habibi-TTS/Specialized/SAU/model_200000.safetensors. That base
checkpoint was already trained on Arabic speech for 200,000 updates by SWivid;
this fine-tune runs an additional ~2,180 updates on Najdi-specific data
on top of those weights. All ~335M parameters are trainable: no frozen layers,
no adapters.
### Why full fine-tuning, not LoRA?
LoRA with low rank (r=8–16) on messy conditional generation tasks ends up averaging noise and signal together in its low-rank update, producing artifacts that were in neither the base model nor the training data. Full fine-tuning has the capacity to properly partition the data distribution, at the cost of not being able to toggle the adaptation on and off at inference.
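To make the capacity gap concrete, here is a back-of-envelope comparison, under the illustrative assumption that LoRA would be applied to the four 1024×1024 attention projections in each of the 22 DiT layers:

```python
# Rough capacity comparison: LoRA update vs. full fine-tune.
# Assumption (not from this card): LoRA on the q, k, v, o projections only.
dim, depth, rank = 1024, 22, 16
lora_params = depth * 4 * (2 * dim * rank)   # A (dim x r) + B (r x dim) per matrix
full_params = 335_000_000                    # ~335M, from the table above

print(lora_params)                 # 2,883,584 trainable params
print(lora_params / full_params)   # well under 1% of the full model
```

Under these assumptions, the low-rank update touches under 1% of the model's capacity, which is the intuition behind the "averaging noise and signal" failure mode described above.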
### Why flow matching / DiT for TTS?
F5-TTS uses flow matching (a diffusion-family method) rather than autoregressive generation. Instead of generating audio token by token, it denoises a full mel-spectrogram over 32 ODE steps. The DiT transformer conditions on both the text and the masked reference audio; it is the same architecture used for text-to-image diffusion, adapted for audio. See the original F5-TTS paper.
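The straight-path flow-matching objective can be sketched in a few lines (a numpy toy, not the real model; the actual DiT also conditions on text and the masked reference audio):

```python
import numpy as np

# Toy flow-matching training target: interpolate noise -> data along a
# straight path and regress the constant velocity of that path.
rng = np.random.default_rng(0)

x1 = rng.normal(size=(8, 100))   # "data": a batch of mel-frame vectors
x0 = rng.normal(size=x1.shape)   # noise sample
t = rng.uniform(size=(8, 1))     # random time in [0, 1], one per example

x_t = (1 - t) * x0 + t * x1      # point on the straight path at time t
v_target = x1 - x0               # path velocity: the regression target

def loss(v_pred):
    """Mean-squared error against the flow-matching velocity target."""
    return float(np.mean((v_pred - v_target) ** 2))

print(loss(np.zeros_like(v_target)))   # nonzero for a dummy predictor
print(loss(v_target))                  # 0.0 for a perfect velocity predictor
```

At inference, integrating the predicted velocity from pure noise over 32 Euler steps (the `nfe_step` parameter in the usage section) yields the output mel-spectrogram.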
## Training
### Data
Combined ~18.4 hours of Najdi and Saudi Arabic speech from five HuggingFace datasets, filtered to clips of 3–10 seconds in duration:
- Total training clips: ~13,614
- Total audio duration: ~18.4 hours
- Sample rate: resampled to 24 kHz mono
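The 3–10 s duration filter described above can be sketched as follows (the actual dataset pipeline is not published; the clip durations below are hypothetical):

```python
# Duration filter matching the 3-10 s training window described above.
MIN_S, MAX_S = 3.0, 10.0

def keep_clip(duration_s: float) -> bool:
    """True if a clip falls inside the 3-10 s training window."""
    return MIN_S <= duration_s <= MAX_S

durations = [1.2, 4.7, 9.9, 12.3, 6.0]   # hypothetical clip lengths, seconds
kept = [d for d in durations if keep_clip(d)]
print(kept)   # [4.7, 9.9, 6.0]
```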
### Hyperparameters
| Parameter | Value |
|---|---|
| Mixed precision | bf16 |
| Batch type | frame-packed |
| Frames per GPU | 38,400 |
| Max samples per batch | 64 |
| Gradient accumulation steps | 2 |
| Learning rate | 5e-5 |
| Epochs | 20 |
| Warmup updates | 300 |
| Max grad norm | 1.0 |
| Save interval | every 500 updates |
| Total updates | ~2,180 |
| Hardware | NVIDIA A100 80GB |
See training_config.json for the exact config used.
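The frame-packed batch size translates into audio time as follows, combining numbers from the two tables above (illustrative arithmetic only):

```python
# How much audio one optimizer update sees, from the tables above.
FRAMES_PER_GPU = 38_400   # frame-packed batch size, mel frames
HOP = 256                 # mel hop length, samples
SR = 24_000               # sample rate, Hz
GRAD_ACCUM = 2            # gradient accumulation steps

seconds_per_batch = FRAMES_PER_GPU * HOP / SR
seconds_per_update = seconds_per_batch * GRAD_ACCUM
print(seconds_per_batch)    # 409.6 s of audio per micro-batch
print(seconds_per_update)   # 819.2 s (~13.7 min) per optimizer update
```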
## Usage
### Quick start
```python
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_model, load_vocoder, preprocess_ref_audio_text
from habibi_tts.infer.utils_infer import infer_process
import torch, soundfile as sf

# Download files
ckpt_path = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="model_last.pt")
vocab_path = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="vocab.txt")

# Load model
V1_BASE_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
model = load_model(DiT, V1_BASE_CFG, ckpt_path, vocab_file=vocab_path, device="cuda")
model = model.to(torch.float32).eval()
vocoder = load_vocoder()

# Voice clone
REF_AUDIO = "path/to/najdi_reference_clip.wav"  # 5-8 s of clean Najdi speech
REF_TEXT = "exact transcript of that clip"
ref_audio, ref_text = preprocess_ref_audio_text(REF_AUDIO, REF_TEXT)

# Generate
wave, sr, _ = infer_process(
    ref_audio, ref_text,
    "مرحبا، كيف حالك اليوم؟",  # text to speak ("Hello, how are you today?")
    model, vocoder,
    nfe_step=32, speed=1.0,
)
sf.write("output.wav", wave, sr)
```
Or use the inference.py script included in this repo.
### Reference clip guidelines
Quality of the generated output is dominated by the quality of the reference clip.
- Duration: 5–8 seconds. Less than 3 s is too little prosody; more than 10 s costs VRAM with no quality gain.
- Clean audio: no background music, no overlapping speakers, no heavy room reverb. Clipped or noisy references produce clipped or noisy outputs.
- Single speaker, full phrases: the reference should contain one person speaking complete sentences, not isolated words.
- Accurate transcript: `REF_TEXT` must be what's actually said in `REF_AUDIO`. Even small drift here degrades the output.
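The checks above can be partially automated. A minimal sketch, assuming the clip is already loaded as a float array in [-1, 1] (the thresholds are heuristics, not part of the model):

```python
import numpy as np

# Heuristic sanity checks for a reference clip, following the guidelines above.
def check_reference(samples: np.ndarray, sr: int) -> list[str]:
    """Return a list of detected problems; empty means the clip looks usable."""
    problems = []
    duration = len(samples) / sr
    if not 3.0 <= duration <= 10.0:
        problems.append(f"duration {duration:.1f}s outside the 3-10 s window")
    if samples.ndim != 1:
        problems.append("not mono: mix down to a single channel first")
    elif np.max(np.abs(samples)) >= 0.999:
        problems.append("clipping detected: re-record or attenuate")
    return problems

# Example: 6 s of quiet mono audio at 24 kHz passes every check.
ok_clip = 0.1 * np.sin(np.linspace(0, 1000, 24_000 * 6))
print(check_reference(ok_clip, 24_000))   # []
```

Note this cannot catch background music, overlapping speakers, or a wrong transcript; those still need a human listen.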
### Generation parameters
- `nfe_step=32`: diffusion ODE steps. 32 is the quality/speed sweet spot; raise to 64 for a marginal quality gain (2x slower), lower to 16 for faster iteration.
- `speed=1.0`: speech speed multiplier. 0.8 for slower, 1.2 for faster.
## Intended Use
- Text-to-speech for Najdi Arabic content creation where you have a clean reference clip of the target voice.
- Research on Arabic dialect TTS.
## Out-of-Scope / Limitations
- Not a multi-dialect model. Fine-tuned on Najdi; will produce Najdi-flavored output even for other Arabic dialect inputs. Use vanilla Habibi-TTS for other dialects.
- Voice clone requires reference audio. This is not a "read arbitrary text in a fixed voice" model. Without a reference clip, there is no output.
- Training data includes podcast audio. Some inherited background characteristics (room tone, occasional distant music) may appear in outputs, especially when the reference clip is itself podcast-sourced.
- Non-commercial. License is CC-BY-NC-SA-4.0 (inherited from the base Habibi-TTS model). You may not use this for commercial purposes without working out separate licensing with SWivid.
- No safeguards against voice cloning misuse. Do not clone a person's voice without their permission. Do not generate deceptive or impersonating audio.
## Evaluation
No formal evaluation metrics are reported. Subjective quality improvement over vanilla Habibi-TTS SAU was observed for Najdi-accented prompts. A/B testing against the base model is recommended for your specific use case.
## Files
| File | Description |
|---|---|
| model_last.pt | Model weights for inference (step last) |
| model_last.pt | Full training state (weights + optimizer) for resuming |
| vocab.txt | Character-level tokenizer vocab (2704 tokens) |
| config.json | Architecture hyperparameters |
| training_config.json | Training hyperparameters used |
| inference.py | Standalone inference script |
## Citation
If you use this model, please cite both this fine-tune and the base Habibi-TTS work:
```bibtex
@misc{habibi-tts-najdi-ft,
  author       = {NAMAA community},
  title        = {Habibi-TTS Najdi Fine-Tuned},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/NAMAA-Saudi-TTS-V2}},
}

@misc{habibi-tts,
  author       = {SWivid},
  title        = {Habibi-TTS},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SWivid/Habibi-TTS}},
}

@article{f5tts,
  title   = {F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author  = {Chen, Yushen and others},
  journal = {arXiv:2410.06885},
  year    = {2024},
}
```
## Acknowledgments
- SWivid for the Habibi-TTS SAU pretrained base
- F5-TTS authors for the underlying architecture and training framework
- charactr/vocos-mel-24khz for the neural vocoder