Fine-tuned Spark-TTS 0.5B: German Emotional Speech Synthesis (16-bit Standalone)

This repository contains a fully merged 16-bit standalone version of the fine-tuned Spark-TTS (0.5B) model, specialized for German speech synthesis with advanced support for emotional cues and non-verbal audio tokens.

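No separate LoRA adapters or base LLM checkpoint are needed: the merged weights load and run directly. Audio decoding still requires the BiCodecTokenizer from the Spark-TTS repository (see Inference Example).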

The model was fine-tuned using LoRA (Low-Rank Adaptation) on the curated Vishalshendge3198/Dataset_eleven_v3 German dataset.


🚀 Key Highlights

  • 57.14% Loss Improvement: Reduced test loss from 10.0074 (Base) to 4.2891 (Fine-tuned)
  • Zero Quality Loss: 16-bit merged weights, so no quantization rounding errors
  • Standalone: No base model or adapters needed
  • Architecture: Spark-TTS 0.5B (Qwen2-based LLM backbone)
  • Emotional Support: [happy], [angry], [sad], [thoughtful], [sleepy], [whisper], and more
  • Non-Verbal Tokens: [sighs], [laughter], [yawn], [growl], [pause], etc.

🎭 Supported Tags

Use square brackets [tag] in your prompts:

| Category | Tags |
|---|---|
| Emotions | [happy], [angry], [sad], [thoughtful], [neutral], [sleepy], [whisper], [worried], [annoyed], [surprised], [fearful], [contemptuous], [disgusted] |
| Paralinguistic | [sighs], [laughter], [cry], [growl], [sob], [breath], [pause], [grit], [yawn], [mumble], [sniffles], [exhales sharply], [inhales deeply], [chuckles], [tremble], [sigh] |
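
A prompt can be checked against the tags listed above before synthesis. The following is a minimal sketch, not part of the model's tooling: `find_unsupported_tags` and the constant names are hypothetical, and the tag sets are copied from this card, so they are only as complete as that list (which ends some entries with "and more" elsewhere).

```python
import re

# Tags copied from the list above; the model may accept others not listed here.
EMOTION_TAGS = {
    "happy", "angry", "sad", "thoughtful", "neutral", "sleepy", "whisper",
    "worried", "annoyed", "surprised", "fearful", "contemptuous", "disgusted",
}
PARALINGUISTIC_TAGS = {
    "sighs", "laughter", "cry", "growl", "sob", "breath", "pause", "grit",
    "yawn", "mumble", "sniffles", "exhales sharply", "inhales deeply",
    "chuckles", "tremble", "sigh",
}
SUPPORTED_TAGS = EMOTION_TAGS | PARALINGUISTIC_TAGS


def find_unsupported_tags(text: str) -> list[str]:
    """Return bracketed tags in `text` that are not in SUPPORTED_TAGS (hypothetical helper)."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in found if tag.strip().lower() not in SUPPORTED_TAGS]


print(find_unsupported_tags("[happy] Das ist ja wunderbar! [sighs]"))  # []
print(find_unsupported_tags("[excited] Endlich klappt es!"))           # ['excited']
```

An unlisted tag is not necessarily an error; the check only flags tags this card does not document.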

🏋️ Training Details

| Parameter | Value |
|---|---|
| Base Model | SparkAudio/Spark-TTS-0.5B |
| Dataset | Vishalshendge3198/Dataset_eleven_v3 |
| Train / Val / Test | 1926 / 241 / 241 samples |
| Learning Rate | 5e-4 |
| LoRA Rank (R) | 64 |
| LoRA Alpha | 64 |
| Epochs | 3 |
| Framework | Unsloth 2026.1 + HuggingFace Transformers |
| Training Time | ~17.5 minutes |
| Peak GPU Memory | 3.22 GB |
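
For reference, the standard LoRA update rule shows what "merged" means for the rank and alpha listed above. This is the textbook formulation with default scaling, assumed (not confirmed by this card) to match what the merge step applied; $W_0$, $A$, and $B$ are generic symbols, not names from the training code.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{64}{64} = 1
```

With a scaling factor of 1, the low-rank product $BA$ is added to each adapted weight matrix unscaled, which is why the merged checkpoint needs no adapter files at load time.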

🔊 Inference Example

Requires: The Spark-TTS repo cloned locally for BiCodecTokenizer (audio detokenization).
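
One way to obtain the local weights folder is to download the base model repository with `huggingface_hub`. This is a sketch under two assumptions that the card does not confirm: that the BiCodec files ship inside the base model repository named under Training Details, and that the target folder should be `Spark-TTS-0.5B`. The function name `fetch_spark_tts_weights` is hypothetical. The Spark-TTS source code itself still has to be cloned separately with git.

```python
# Assumption: the BiCodec weights live in the base model repository, and the
# inference script reads them from a local folder named "Spark-TTS-0.5B".
BASE_REPO_ID = "SparkAudio/Spark-TTS-0.5B"
LOCAL_DIR = "Spark-TTS-0.5B"


def fetch_spark_tts_weights(repo_id: str = BASE_REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download a snapshot of the base Spark-TTS repo and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Call once before running inference (requires network access):
# fetch_spark_tts_weights()
```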

import re
import sys

import soundfile as sf
import torch

sys.path.insert(0, "Spark-TTS")  # clone from https://github.com/SparkAudio/Spark-TTS

from unsloth import FastLanguageModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_NAME    = "Vishalshendge3198/spark_tts_finetune"
SPARK_TTS_DIR = "Spark-TTS-0.5B"   # local Spark-TTS weights folder
DEVICE        = "cuda" if torch.cuda.is_available() else "cpu"

# Load model (standalone, no adapters needed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,   # 16-bit, no quantization needed
)
FastLanguageModel.for_inference(model)

# Load BiCodec audio tokenizer
audio_tokenizer = BiCodecTokenizer(SPARK_TTS_DIR, device=DEVICE)

# Generate speech
text   = "[happy] Das ist ja wunderbar, endlich klappt es!"
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.8, top_k=50, top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

# Extract tokens
semantic_ids = torch.tensor([int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)]).long().unsqueeze(0)
global_ids   = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)][:32]
global_ids  += [0] * (32 - len(global_ids))
global_ids   = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0)

# Decode to audio
waveform = audio_tokenizer.detokenize(global_ids.squeeze(0).to(DEVICE), semantic_ids.to(DEVICE))
sf.write("output.wav", waveform, audio_tokenizer.config.get("sample_rate", 16000))
print("✅ Saved to output.wav")
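
The token-extraction step above can be isolated into a small function that also fails loudly when the model produced no semantic tokens, a case the script otherwise passes to the detokenizer as an empty tensor. This is a hedged sketch: `parse_bicodec_tokens` is a hypothetical helper, and the sample string is synthetic, not real model output.

```python
import re

N_GLOBAL_TOKENS = 32  # same fixed length the script above truncates/pads to


def parse_bicodec_tokens(decoded: str, n_global: int = N_GLOBAL_TOKENS):
    """Split decoded model output into (global_ids, semantic_ids) integer lists.

    Hypothetical helper: raises ValueError when no semantic tokens are present,
    since there would be nothing to detokenize.
    """
    semantic_ids = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", decoded)]
    global_ids = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", decoded)][:n_global]
    if not semantic_ids:
        raise ValueError("no <|bicodec_semantic_N|> tokens found in model output")
    # Zero-pad short global sequences, mirroring the script above.
    global_ids += [0] * (n_global - len(global_ids))
    return global_ids, semantic_ids


# Synthetic example string (not real model output).
sample = "<|bicodec_global_7|><|bicodec_global_12|><|bicodec_semantic_101|><|bicodec_semantic_5|>"
g, s = parse_bicodec_tokens(sample)
print(len(g), g[:3], s)  # 32 [7, 12, 0] [101, 5]
```

The returned lists can be wrapped with `torch.tensor(...).long()` and reshaped exactly as in the script above.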

📊 Performance

| Metric | Base Model (0.5B) | Fine-tuned (German) | Improvement |
|---|---|---|---|
| Validation Loss | ~10.0074 | 4.3125 | 56.9% |
| Test Loss | ~10.0074 | 4.2891 | 57.14% |
| German Emotional Prosody | Basic | Advanced | High |
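
The improvement percentages follow from the reported losses as a relative reduction, (base - tuned) / base. A short check, using only the figures above (the function name is illustrative):

```python
def relative_improvement(base_loss: float, tuned_loss: float) -> float:
    """Percentage reduction of tuned_loss relative to base_loss."""
    return (base_loss - tuned_loss) / base_loss * 100.0


BASE_LOSS = 10.0074  # reported base-model loss, used for both splits

print(f"validation: {relative_improvement(BASE_LOSS, 4.3125):.2f}%")  # 56.91%
print(f"test:       {relative_improvement(BASE_LOSS, 4.2891):.2f}%")  # 57.14%
```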

📜 Credits

Developed by Vishal Shendge as part of a German TTS fine-tuning research project using the Spark-TTS architecture by SparkAudio. Special thanks to the Unsloth team for the efficient fine-tuning framework.
