Fine-tuned Spark-TTS 0.5B: German Emotional Speech Synthesis (16-bit Standalone)
This repository contains a fully merged, standalone 16-bit version of the fine-tuned Spark-TTS (0.5B) model, specialized for German speech synthesis with support for emotion tags and non-verbal audio tokens.
No separate adapters or base model are needed: load and run directly.
The model was fine-tuned using LoRA (Low-Rank Adaptation) on the curated Vishalshendge3198/Dataset_eleven_v3 German dataset.
Key Highlights
- 57.14% Loss Improvement: Reduced test loss from 10.0074 (Base) to 4.2891 (Fine-tuned)
- No Quantization Loss: Weights are merged in 16-bit, so no quantization rounding errors are introduced
- Standalone: No base model or adapters needed
- Architecture: Spark-TTS 0.5B (Qwen2-based LLM backbone)
- Emotional Support: [happy], [angry], [sad], [thoughtful], [sleepy], [whisper], and more
- Non-Verbal Tokens: [sighs], [laughter], [yawn], [growl], [pause], etc.
Supported Tags
Use square brackets [tag] in your prompts:
| Category | Tags |
|---|---|
| Emotions | [happy], [angry], [sad], [thoughtful], [neutral], [sleepy], [whisper], [worried], [annoyed], [surprised], [fearful], [contemptuous], [disgusted] |
| Paralinguistic | [sighs], [laughter], [cry], [growl], [sob], [breath], [pause], [grit], [yawn], [mumble], [sniffles], [exhales sharply], [inhales deeply], [chuckles], [tremble], [sigh] |
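
Because an unsupported or misspelled tag would simply be passed to the model as ordinary text, it can help to check prompts before synthesis. The following is a minimal sketch, not part of this repository: `find_unsupported_tags` and `build_prompt` are hypothetical helper names, the tag sets are copied from the table above, and the prompt template is the one used in the inference example of this README.

```python
import re

# Tag sets copied from the "Supported Tags" table in this README.
EMOTION_TAGS = {
    "happy", "angry", "sad", "thoughtful", "neutral", "sleepy", "whisper",
    "worried", "annoyed", "surprised", "fearful", "contemptuous", "disgusted",
}
PARALINGUISTIC_TAGS = {
    "sighs", "laughter", "cry", "growl", "sob", "breath", "pause", "grit",
    "yawn", "mumble", "sniffles", "exhales sharply", "inhales deeply",
    "chuckles", "tremble", "sigh",
}
SUPPORTED_TAGS = EMOTION_TAGS | PARALINGUISTIC_TAGS

# Matches anything written as [tag], including multi-word tags.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unsupported_tags(text):
    """Return the bracketed tags in `text` that are not in the table (hypothetical helper)."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]


def build_prompt(text):
    """Validate the tags, then wrap `text` in the TTS prompt template (hypothetical helper)."""
    unknown = find_unsupported_tags(text)
    if unknown:
        raise ValueError(f"Unsupported tags: {unknown}")
    return f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"


if __name__ == "__main__":
    print(build_prompt("[happy] Das ist ja wunderbar, [laughter] endlich klappt es!"))
```

The check is deliberately strict (exact, lowercase match), on the assumption that the model only saw the tags in the form listed in the table.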
Training Details
| Parameter | Value |
|---|---|
| Base Model | SparkAudio/Spark-TTS-0.5B |
| Dataset | Vishalshendge3198/Dataset_eleven_v3 |
| Train / Val / Test | 1926 / 241 / 241 samples |
| Learning Rate | 5e-4 |
| LoRA Rank (R) | 64 |
| LoRA Alpha | 64 |
| Epochs | 3 |
| Framework | Unsloth 2026.1 + Hugging Face Transformers |
| Training Time | ~17.5 minutes |
| Peak GPU Memory | 3.22 GB |
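
To put the LoRA settings in context, the sketch below works through two standard LoRA quantities for rank 64 and alpha 64: the scaling factor applied to the low-rank update, and the number of trainable parameters added per adapted weight matrix. It assumes conventional (not rank-stabilized) LoRA scaling of alpha / rank. The 896 x 896 matrix shape is an illustrative assumption only, since this README does not state which modules were adapted or their dimensions.

```python
def lora_scaling(rank, alpha):
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added to one d_out x d_in weight matrix.

    LoRA adds A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 64, 64  # values from the training table

print(lora_scaling(RANK, ALPHA))         # 1.0
# Hypothetical 896 x 896 projection; the real layer shapes are an assumption.
print(lora_param_count(896, 896, RANK))  # 114688
```

With alpha equal to the rank, the scaling factor is 1.0, so the learned update is added to the frozen weights without extra amplification or damping.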
Inference Example
Requires: the Spark-TTS repo cloned locally (for BiCodecTokenizer, which handles audio detokenization) and a local copy of the Spark-TTS-0.5B weights folder.
```python
import re
import sys

import soundfile as sf
import torch

sys.path.insert(0, "Spark-TTS")  # clone from https://github.com/SparkAudio/Spark-TTS
from unsloth import FastLanguageModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_NAME = "Vishalshendge3198/spark_tts_finetune"
SPARK_TTS_DIR = "Spark-TTS-0.5B"  # local Spark-TTS weights folder
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (standalone, no adapters needed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,  # 16-bit, no quantization needed
)
FastLanguageModel.for_inference(model)

# Load BiCodec audio tokenizer
audio_tokenizer = BiCodecTokenizer(SPARK_TTS_DIR, device=DEVICE)

# Generate speech
text = "[happy] Das ist ja wunderbar, endlich klappt es!"
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.8, top_k=50, top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)
# Drop the prompt tokens and keep special tokens, since they carry the audio codes
generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

# Extract semantic and global token IDs from the decoded text
semantic_list = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)]
if not semantic_list:
    raise RuntimeError("No semantic tokens were generated; try generating again.")
semantic_ids = torch.tensor(semantic_list).long().unsqueeze(0)
global_ids = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)][:32]
global_ids += [0] * (32 - len(global_ids))  # pad to 32 global tokens
global_ids = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0)

# Decode to audio
waveform = audio_tokenizer.detokenize(global_ids.squeeze(0).to(DEVICE), semantic_ids.to(DEVICE))
sf.write("output.wav", waveform, audio_tokenizer.config.get("sample_rate", 16000))
print("Saved to output.wav")
```
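
The token-extraction step in the example above can be checked without a GPU or model weights, because it only parses the decoded string. Here is a small sketch of that step as a standalone function. `parse_bicodec_tokens` is a hypothetical name; the regular expressions and the padding to 32 global tokens mirror the example, and the sample string is made up for illustration.

```python
import re

NUM_GLOBAL_TOKENS = 32  # same padding/truncation length as in the inference example


def parse_bicodec_tokens(decoded, num_global=NUM_GLOBAL_TOKENS):
    """Split decoded model output into (semantic_ids, global_ids) lists.

    Global IDs are truncated or zero-padded to `num_global` entries.
    Raises ValueError if no semantic tokens are present.
    """
    semantic = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", decoded)]
    global_ids = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", decoded)]
    if not semantic:
        raise ValueError("No semantic tokens found in the decoded output.")
    global_ids = global_ids[:num_global]
    global_ids += [0] * (num_global - len(global_ids))
    return semantic, global_ids


if __name__ == "__main__":
    # Made-up output fragment, for illustration only.
    sample = (
        "<|bicodec_global_7|><|bicodec_global_12|>"
        "<|bicodec_semantic_101|><|bicodec_semantic_5|>"
    )
    semantic, global_ids = parse_bicodec_tokens(sample)
    print(semantic)        # [101, 5]
    print(global_ids[:4])  # [7, 12, 0, 0]
```

Raising on an empty semantic list makes a failed or truncated generation visible early, rather than as an opaque error inside the detokenizer.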
Performance
| Metric | Base Model (0.5B) | Fine-tuned (German) | Improvement |
|---|---|---|---|
| Validation Loss | ~10.0074 | 4.3125 | 56.91% |
| Test Loss | ~10.0074 | 4.2891 | 57.14% |
| German Emotional Prosody | Basic | Advanced | High |
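
The improvement percentages in the table are relative loss reductions, (base − fine-tuned) / base. The short sketch below recomputes them from the reported losses; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(base_loss, tuned_loss):
    """Relative loss reduction in percent: (base - tuned) / base * 100."""
    return (base_loss - tuned_loss) / base_loss * 100


BASE_LOSS = 10.0074  # base model loss reported in the table

val_gain = relative_improvement(BASE_LOSS, 4.3125)
test_gain = relative_improvement(BASE_LOSS, 4.2891)

print(f"Validation: {val_gain:.2f}%")  # Validation: 56.91%
print(f"Test: {test_gain:.2f}%")       # Test: 57.14%
```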
Credits
Developed by Vishal Shendge as part of a German TTS fine-tuning research project using the Spark-TTS architecture by SparkAudio. Special thanks to the Unsloth team for the efficient fine-tuning framework.