# NAMAA-Space/NAMAA-Saudi-TTS-V2
A voice-cloning text-to-speech model for Najdi Arabic (Saudi Arabic), fully fine-tuned from SWivid/Habibi-TTS's Saudi-specialized (SAU) checkpoint.
Given 5–8 seconds of a reference voice speaking Arabic, this model synthesizes arbitrary new Arabic text in that same voice, with Najdi-dialect prosody and pronunciation.
## Model Details
### Architecture
F5-TTS Diffusion Transformer (DiT), v1 Base configuration:
| Parameter | Value |
|---|---|
| Model family | F5-TTS (flow-matching text-to-speech) |
| Backbone | DiT (Diffusion Transformer) |
| Embedding dim | 1024 |
| Depth (layers) | 22 |
| Attention heads | 16 |
| FFN multiplier | 2 |
| Text embedding dim | 512 |
| Conv text encoder layers | 4 |
| Total parameters | ~335M |
| Tokenizer | Character-level (vocab size 2704) |
| Vocoder | charactr/vocos-mel-24khz |
| Mel features | 100 mels, 24kHz, hop 256, win 1024 |
See config.json for the full architecture spec.
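The mel settings in the table fix the time resolution of the model. A quick sketch of that arithmetic, using only the numbers from the table above (not read from `config.json`):

```python
# Mel feature geometry from the architecture table above (values copied
# from this card; illustrative arithmetic only).
SAMPLE_RATE = 24_000   # Hz
HOP_LENGTH = 256       # samples between consecutive mel frames
N_MELS = 100

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # 93.75 mel frames per second

def mel_frames(duration_s: float) -> int:
    """Approximate number of mel frames for a clip of the given duration."""
    return int(duration_s * frames_per_second)

print(frames_per_second)   # 93.75
print(mel_frames(5.0))     # 468 frames for a 5 s reference clip
```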
### Base model
This is a full fine-tune (not LoRA) warm-started from
SWivid/Habibi-TTS/Specialized/SAU/model_200000.safetensors. That base
checkpoint was already trained on Arabic speech for 200,000 updates by SWivid;
this fine-tune runs an additional ~2,180 updates on Najdi-specific data
on top of those weights. All ~335M parameters are trainable: no frozen layers,
no adapters.
### Why full fine-tuning, not LoRA?
LoRA with low rank (r=8–16) on messy conditional generation tasks ends up averaging noise and signal together in its low-rank update, producing artifacts that were in neither the base model nor the training data. Full fine-tuning has the capacity to properly partition the data distribution, at the cost of not being able to toggle the adaptation on and off at inference.
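To make the capacity gap concrete, here is a back-of-envelope comparison, under the illustrative assumption that LoRA would be applied to the four 1024×1024 attention projections in each of the 22 DiT layers:

```python
# Rough capacity comparison: LoRA update vs. full fine-tune.
# Assumption (not from this card): LoRA on the q, k, v, o projections only.
dim, depth, rank = 1024, 22, 16
lora_params = depth * 4 * (2 * dim * rank)   # A (dim x r) + B (r x dim) per matrix
full_params = 335_000_000                    # ~335M, from the table above

print(lora_params)                 # 2,883,584 trainable params
print(lora_params / full_params)   # well under 1% of the full model
```

Under these assumptions, the low-rank update touches under 1% of the model's capacity, which is the intuition behind the "averaging noise and signal" failure mode described above.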
### Why flow matching / DiT for TTS?
F5-TTS uses flow matching (a diffusion-family method) rather than autoregressive generation. Instead of generating audio token by token, it denoises a full mel-spectrogram over 32 ODE steps. The DiT transformer conditions on both the text and the masked reference audio; it is the same architecture used for text-to-image diffusion, adapted for audio. See the original F5-TTS paper.
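The straight-path flow-matching objective can be sketched in a few lines (a numpy toy, not the real model; the actual DiT also conditions on text and the masked reference audio):

```python
import numpy as np

# Toy flow-matching training target: interpolate noise -> data along a
# straight path and regress the constant velocity of that path.
rng = np.random.default_rng(0)

x1 = rng.normal(size=(8, 100))   # "data": a batch of mel-frame vectors
x0 = rng.normal(size=x1.shape)   # noise sample
t = rng.uniform(size=(8, 1))     # random time in [0, 1], one per example

x_t = (1 - t) * x0 + t * x1      # point on the straight path at time t
v_target = x1 - x0               # path velocity: the regression target

def loss(v_pred):
    """Mean-squared error against the flow-matching velocity target."""
    return float(np.mean((v_pred - v_target) ** 2))

print(loss(np.zeros_like(v_target)))   # nonzero for a dummy predictor
print(loss(v_target))                  # 0.0 for a perfect velocity predictor
```

At inference, integrating the predicted velocity from pure noise over 32 Euler steps (the `nfe_step` parameter in the usage section) yields the output mel-spectrogram.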
## Training
### Data
Combined ~18.4 hours of Najdi and Saudi Arabic speech from five HuggingFace datasets, filtered to clips of 3–10 seconds in duration:
- Total training clips: ~13,614
- Total audio duration: ~18.4 hours
- Sample rate: resampled to 24 kHz mono
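The 3–10 s duration filter described above can be sketched as follows (the actual dataset pipeline is not published; the clip durations below are hypothetical):

```python
# Duration filter matching the 3-10 s training window described above.
MIN_S, MAX_S = 3.0, 10.0

def keep_clip(duration_s: float) -> bool:
    """True if a clip falls inside the 3-10 s training window."""
    return MIN_S <= duration_s <= MAX_S

durations = [1.2, 4.7, 9.9, 12.3, 6.0]   # hypothetical clip lengths, seconds
kept = [d for d in durations if keep_clip(d)]
print(kept)   # [4.7, 9.9, 6.0]
```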
### Hyperparameters
| Parameter | Value |
|---|---|
| Mixed precision | bf16 |
| Batch type | frame-packed |
| Frames per GPU | 38,400 |
| Max samples per batch | 64 |
| Gradient accumulation steps | 2 |
| Learning rate | 5e-5 |
| Epochs | 20 |
| Warmup updates | 300 |
| Max grad norm | 1.0 |
| Save interval | every 500 updates |
| Total updates | ~2,180 |
| Hardware | NVIDIA A100 80GB |
See training_config.json for the exact config used.
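The frame-packed batch size translates into audio time as follows, combining numbers from the two tables above (illustrative arithmetic only):

```python
# How much audio one optimizer update sees, from the tables above.
FRAMES_PER_GPU = 38_400   # frame-packed batch size, mel frames
HOP = 256                 # mel hop length, samples
SR = 24_000               # sample rate, Hz
GRAD_ACCUM = 2            # gradient accumulation steps

seconds_per_batch = FRAMES_PER_GPU * HOP / SR
seconds_per_update = seconds_per_batch * GRAD_ACCUM
print(seconds_per_batch)    # 409.6 s of audio per micro-batch
print(seconds_per_update)   # 819.2 s (~13.7 min) per optimizer update
```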
## Usage
### Quick start
```python
from huggingface_hub import hf_hub_download
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import load_model, load_vocoder, preprocess_ref_audio_text
from habibi_tts.infer.utils_infer import infer_process
import torch, soundfile as sf

# Download files
ckpt_path = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="model_last.pt")
vocab_path = hf_hub_download(repo_id="NAMAA-Space/NAMAA-Saudi-TTS-V2", filename="vocab.txt")

# Load model
V1_BASE_CFG = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
model = load_model(DiT, V1_BASE_CFG, ckpt_path, vocab_file=vocab_path, device="cuda")
model = model.to(torch.float32).eval()
vocoder = load_vocoder()

# Voice clone
REF_AUDIO = "path/to/najdi_reference_clip.wav"  # 5-8 s of clean Najdi speech
REF_TEXT = "exact transcript of that clip"
ref_audio, ref_text = preprocess_ref_audio_text(REF_AUDIO, REF_TEXT)

# Generate
wave, sr, _ = infer_process(
    ref_audio, ref_text,
    "مرحبا، كيف حالك اليوم؟",  # text to speak ("Hello, how are you today?")
    model, vocoder,
    nfe_step=32, speed=1.0,
)
sf.write("output.wav", wave, sr)
```
Or use the inference.py script included in this repo.
### Reference clip guidelines
Quality of the generated output is dominated by the quality of the reference clip.
- Duration: 5–8 seconds. Less than 3 s is too little prosody; more than 10 s costs VRAM with no quality gain.
- Clean audio: no background music, no overlapping speakers, no heavy room reverb. Clipped or noisy references produce clipped or noisy outputs.
- Single speaker, full phrases: the reference should contain one person speaking complete sentences, not isolated words.
- Accurate transcript: `REF_TEXT` must be what's actually said in `REF_AUDIO`. Even small drift here degrades the output.
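The checks above can be partially automated. A minimal sketch, assuming the clip is already loaded as a float array in [-1, 1] (the thresholds are heuristics, not part of the model):

```python
import numpy as np

# Heuristic sanity checks for a reference clip, following the guidelines above.
def check_reference(samples: np.ndarray, sr: int) -> list[str]:
    """Return a list of detected problems; empty means the clip looks usable."""
    problems = []
    duration = len(samples) / sr
    if not 3.0 <= duration <= 10.0:
        problems.append(f"duration {duration:.1f}s outside the 3-10 s window")
    if samples.ndim != 1:
        problems.append("not mono: mix down to a single channel first")
    elif np.max(np.abs(samples)) >= 0.999:
        problems.append("clipping detected: re-record or attenuate")
    return problems

# Example: 6 s of quiet mono audio at 24 kHz passes every check.
ok_clip = 0.1 * np.sin(np.linspace(0, 1000, 24_000 * 6))
print(check_reference(ok_clip, 24_000))   # []
```

Note this cannot catch background music, overlapping speakers, or a wrong transcript; those still need a human listen.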
### Generation parameters
- `nfe_step=32`: diffusion ODE steps. 32 is the quality/speed sweet spot; raise to 64 for a marginal quality gain (2x slower), lower to 16 for faster iteration.
- `speed=1.0`: speech speed multiplier. 0.8 for slower, 1.2 for faster.
## Intended Use
- Text-to-speech for Najdi Arabic content creation where you have a clean reference clip of the target voice.
- Research on Arabic dialect TTS.
## Out-of-Scope / Limitations
- Not a multi-dialect model. Fine-tuned on Najdi; will produce Najdi-flavored output even for other Arabic dialect inputs. Use vanilla Habibi-TTS for other dialects.
- Voice clone requires reference audio. This is not a "read arbitrary text in a fixed voice" model. Without a reference clip, there is no output.
- Training data includes podcast audio. Some inherited background characteristics (room tone, occasional distant music) may appear in outputs, especially when the reference clip is itself podcast-sourced.
- Non-commercial. License is CC-BY-NC-SA-4.0 (inherited from the base Habibi-TTS model). You may not use this for commercial purposes without working out separate licensing with SWivid.
- No safeguards against voice cloning misuse. Do not clone a person's voice without their permission. Do not generate deceptive or impersonating audio.
## Evaluation
No formal evaluation metrics are reported. Subjective quality improvement over vanilla Habibi-TTS SAU was observed for Najdi-accented prompts. A/B testing against the base model is recommended for your specific use case.
## Files
| File | Description |
|---|---|
| model_last.pt | Model weights for inference (step last) |
| model_last.pt | Full training state (weights + optimizer) for resuming |
| vocab.txt | Character-level tokenizer vocab (2704 tokens) |
| config.json | Architecture hyperparameters |
| training_config.json | Training hyperparameters used |
| inference.py | Standalone inference script |
## Citation
If you use this model, please cite both this fine-tune and the base Habibi-TTS work:
```bibtex
@misc{habibi-tts-najdi-ft,
  author       = {NAMAA community},
  title        = {Habibi-TTS Najdi Fine-Tuned},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/NAMAA-Saudi-TTS-V2}},
}

@misc{habibi-tts,
  author       = {SWivid},
  title        = {Habibi-TTS},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SWivid/Habibi-TTS}},
}

@article{f5tts,
  title   = {F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author  = {Chen, Yushen and others},
  journal = {arXiv:2410.06885},
  year    = {2024},
}
```
## Acknowledgments
- SWivid for the Habibi-TTS SAU pretrained base
- F5-TTS authors for the underlying architecture and training framework
- charactr/vocos-mel-24khz for the neural vocoder