Fine-tuned Spark-TTS 0.5B: German Emotional Speech Synthesis (16-bit Standalone)

This repository contains a fully merged 16-bit standalone version of the fine-tuned Spark-TTS (0.5B) model, specialized for German speech synthesis with advanced support for emotional cues and non-verbal audio tokens.

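No LoRA adapters or separate base LLM weights are needed: the language model loads and runs directly. Decoding the generated tokens to audio still requires the BiCodec audio tokenizer described under Inference Example.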

The model was fine-tuned using LoRA (Low-Rank Adaptation) on the curated Vishalshendge3198/Dataset_eleven_v3 German dataset.
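To illustrate what "merged" means here, the following is a minimal pure-Python sketch of folding a LoRA update into a base weight matrix, W' = W + (alpha / r) * B * A. It assumes standard LoRA scaling (alpha / r). The toy matrices and the helper names `matmul` and `merge_lora` are hypothetical and are not taken from the training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank  # assumption: standard (not rank-stabilized) LoRA scaling
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: a 2x2 base weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, rank)
lora_a = [[0.5, 0.5]]     # shape (rank, 2)
merged = merge_lora(w, lora_b, lora_a, alpha=1, rank=1)
print(merged)
```

With the rank and alpha listed under Training Details (64 and 64), the scaling factor under this assumption would be 1.0. After merging, inference only needs the single merged matrix per layer, which is why no adapter files are loaded.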

🚀 Key Highlights

57.14% Loss Improvement: Reduced test loss from 10.0074 (Base) to 4.2891 (Fine-tuned)
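No Quantization Loss: Merged weights are kept in 16-bit precision, so no 4-bit or 8-bit quantization rounding is introduced
Standalone: No base LLM or adapters needed (the BiCodec audio tokenizer is loaded separately for decoding)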
Architecture: Spark-TTS 0.5B (Qwen2-based LLM backbone)
Emotional Support: [happy], [angry], [sad], [thoughtful], [sleepy], [whisper], and more
Non-Verbal Tokens: [sighs], [laughter], [yawn], [growl], [pause], etc.

🎭 Supported Tags

Use square brackets [tag] in your prompts:

| Category | Tags |
| --- | --- |
| Emotions | `[happy]`, `[angry]`, `[sad]`, `[thoughtful]`, `[neutral]`, `[sleepy]`, `[whisper]`, `[worried]`, `[annoyed]`, `[surprised]`, `[fearful]`, `[contemptuous]`, `[disgusted]` |
| Paralinguistic | `[sighs]`, `[laughter]`, `[cry]`, `[growl]`, `[sob]`, `[breath]`, `[pause]`, `[grit]`, `[yawn]`, `[mumble]`, `[sniffles]`, `[exhales sharply]`, `[inhales deeply]`, `[chuckles]`, `[tremble]`, `[sigh]` |
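Since only the tags listed above are documented, it can help to check a prompt before synthesis. The sketch below is a hypothetical helper (`find_unknown_tags` is not part of Spark-TTS or this repository) that extracts bracketed tags with a regular expression and reports any that do not appear in the lists. It matches tags exactly as written; how the model treats an unlisted or differently cased tag is not documented here.

```python
import re

EMOTION_TAGS = {
    "happy", "angry", "sad", "thoughtful", "neutral", "sleepy", "whisper",
    "worried", "annoyed", "surprised", "fearful", "contemptuous", "disgusted",
}
PARALINGUISTIC_TAGS = {
    "sighs", "laughter", "cry", "growl", "sob", "breath", "pause", "grit",
    "yawn", "mumble", "sniffles", "exhales sharply", "inhales deeply",
    "chuckles", "tremble", "sigh",
}
SUPPORTED_TAGS = EMOTION_TAGS | PARALINGUISTIC_TAGS

# Matches the text between one pair of square brackets, e.g. "[happy]" -> "happy".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_unknown_tags(prompt):
    """Return the bracketed tags in `prompt` that are not in the supported lists."""
    return [tag for tag in TAG_PATTERN.findall(prompt) if tag not in SUPPORTED_TAGS]


print(find_unknown_tags("[happy] Das ist ja wunderbar! [laughter]"))
print(find_unknown_tags("[excited] Endlich klappt es! [exhales sharply]"))
```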

🏋️ Training Details

| Parameter | Value |
| --- | --- |
| Base Model | SparkAudio/Spark-TTS-0.5B |
| Dataset | Vishalshendge3198/Dataset_eleven_v3 |
| Train / Val / Test | 1926 / 241 / 241 samples |
| Learning Rate | 5e-4 |
| LoRA Rank (R) | 64 |
| LoRA Alpha | 64 |
| Epochs | 3 |
| Framework | Unsloth 2026.1 + HuggingFace Transformers |
| Training Time | ~17.5 minutes |
| Peak GPU Memory | 3.22 GB |

🔊 Inference Example

Requires: the Spark-TTS repo cloned locally (it provides BiCodecTokenizer) and a local copy of the Spark-TTS-0.5B weights folder, which the audio tokenizer loads for detokenization.
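One possible way to prepare these prerequisites is sketched below. The function name `fetch_spark_tts_assets` is hypothetical; the directory names follow the inference example. The sketch assumes `git` is on the PATH, that `huggingface_hub` is installed, and that the BiCodec weights are taken from the base checkpoint listed under Training Details (SparkAudio/Spark-TTS-0.5B).

```python
import subprocess
from pathlib import Path


def fetch_spark_tts_assets(repo_dir="Spark-TTS", weights_dir="Spark-TTS-0.5B"):
    """Clone the Spark-TTS code and download the base checkpoint if missing."""
    if not Path(repo_dir).exists():
        subprocess.run(
            ["git", "clone", "https://github.com/SparkAudio/Spark-TTS", repo_dir],
            check=True,
        )
    if not Path(weights_dir).exists():
        # Imported lazily so the function can be defined without the package installed.
        from huggingface_hub import snapshot_download

        snapshot_download("SparkAudio/Spark-TTS-0.5B", local_dir=weights_dir)
    return Path(repo_dir), Path(weights_dir)
```

Calling `fetch_spark_tts_assets()` once before running the inference code leaves both folders in the working directory; existing folders are left untouched.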

import sys, os, re, torch, soundfile as sf

sys.path.insert(0, "Spark-TTS")  # clone from https://github.com/SparkAudio/Spark-TTS

from unsloth import FastLanguageModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_NAME    = "Vishalshendge3198/spark_tts_finetune"
SPARK_TTS_DIR = "Spark-TTS-0.5B"   # local Spark-TTS weights folder
DEVICE        = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (standalone — no adapters needed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,   # 16-bit, no quantization needed
)
FastLanguageModel.for_inference(model)

# Load BiCodec audio tokenizer
audio_tokenizer = BiCodecTokenizer(SPARK_TTS_DIR, device=DEVICE)

# Generate speech
text   = "[happy] Das ist ja wunderbar, endlich klappt es!"
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.8, top_k=50, top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

# Extract BiCodec token IDs from the decoded text
semantic_ids = torch.tensor([int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)]).long().unsqueeze(0)
if semantic_ids.numel() == 0:
    raise ValueError("No semantic tokens were generated; try sampling again.")
# Keep exactly 32 global tokens: truncate extras, pad missing ones with 0
global_ids   = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)][:32]
global_ids  += [0] * (32 - len(global_ids))
global_ids   = torch.tensor(global_ids).long().unsqueeze(0)  # shape (1, 32)

# Decode to audio
waveform = audio_tokenizer.detokenize(global_ids.to(DEVICE), semantic_ids.to(DEVICE))
sf.write("output.wav", waveform, audio_tokenizer.config.get("sample_rate", 16000))
print("✅ Saved to output.wav")
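The prompt format and the token extraction step can be exercised without a GPU. The sketch below repackages them as two small functions; `build_tts_prompt`, `extract_bicodec_ids`, and the `fake_output` string are hypothetical names and data made up for illustration, while the prompt template and the regular expressions are copied from the example above. It raises an error when no semantic tokens are found, which usually indicates that generation did not produce usable audio tokens.

```python
import re

NUM_GLOBAL_TOKENS = 32  # the inference example keeps the first 32 global tokens


def build_tts_prompt(text):
    """Wrap the input text in the control tokens used by the inference example."""
    return f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"


def extract_bicodec_ids(decoded, num_global=NUM_GLOBAL_TOKENS):
    """Return (global_ids, semantic_ids) parsed from decoded model output."""
    semantic = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", decoded)]
    global_ = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", decoded)]
    if not semantic:
        raise ValueError("no semantic tokens found in the decoded output")
    # Truncate extra global tokens and pad missing ones with 0, as in the example.
    global_ = global_[:num_global]
    global_ += [0] * (num_global - len(global_))
    return global_, semantic


# Made-up output string, shortened to 4 global tokens for readability.
fake_output = (
    "<|bicodec_global_7|><|bicodec_global_9|>"
    "<|bicodec_semantic_101|><|bicodec_semantic_5|>"
)
global_ids, semantic_ids = extract_bicodec_ids(fake_output, num_global=4)
print(global_ids, semantic_ids)
```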

📊 Performance

| Metric | Base Model (0.5B) | Fine-tuned (German) | Improvement |
| --- | --- | --- | --- |
| Validation Loss | ~10.0074 | 4.3125 | 56.9% |
| Test Loss | ~10.0074 | 4.2891 | 57.14% |
| German Emotional Prosody | Basic | Advanced | High |
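The percentages in the Improvement column are consistent with a relative loss reduction, (base - fine-tuned) / base. The short check below assumes that definition and uses the loss values from the table; note that the base-model values are marked as approximate. The function name `relative_improvement` is illustrative only.

```python
def relative_improvement(base_loss, tuned_loss):
    """Percentage reduction of tuned_loss relative to base_loss."""
    return (base_loss - tuned_loss) / base_loss * 100


validation_gain = relative_improvement(10.0074, 4.3125)
test_gain = relative_improvement(10.0074, 4.2891)
print(f"validation: {validation_gain:.2f}%  test: {test_gain:.2f}%")
```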

📜 Credits

Developed by Vishal Shendge as part of a German TTS fine-tuning research project using the Spark-TTS architecture by SparkAudio. Special thanks to the Unsloth team for the efficient fine-tuning framework.
