Fine-tuning NeMo Sortformer for Custom Speaker Diarization

#12
by daikooo - opened

Hi all,

I’m fine-tuning NeMo Sortformer on my own datasets (mostly 1–2 speakers per audio, Indian language). I froze the Sortformer module
sortformer_modules:
  _target_: nemo.collections.asr.modules.sortformer_modules.SortformerModules
  num_spks: ${model.max_num_of_spks}  # Number of speakers per model; currently fixed at 4
  dropout_rate: 0.5  # Dropout rate
  fc_d_model: ${model.model_defaults.fc_d_model}
  tf_d_model: ${model.model_defaults.tf_d_model}  # Hidden size for linear layers in the Sortformer diarizer module

and fine-tuned only the embedding layers. Here's a sample output I'm getting:

0.64 665.59 Speaker1
0.64 665.59 Speaker2
0.64 665.59 Speaker3
0.64 665.59 Speaker4
599.068 1265.138 Speaker1
599.068 1265.138 Speaker2
599.068 1265.138 Speaker3
599.068 1265.138 Speaker4
1198.533 1835.323 Speaker1
1198.533 1835.323 Speaker2
1198.533 1835.323 Speaker3
1198.533 1835.323 Speaker4
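For context, the freezing itself was done at the PyTorch level, roughly like this (a minimal sketch; the helper names and the `encoder` attribute are my own stand-ins, not NeMo API):

```python
# Sketch of "freeze everything, fine-tune only embeddings".
# The attribute name "encoder" is an assumption for illustration; the
# real sub-module name depends on the NeMo model class being used.
import torch.nn as nn


def freeze_module(module: nn.Module) -> None:
    """Disable gradients for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()  # also fix dropout / norm statistics


def freeze_all_but_embeddings(model: nn.Module, embedding_attr: str = "encoder") -> None:
    """Freeze the whole model, then re-enable gradients on one sub-module."""
    freeze_module(model)
    embedding = getattr(model, embedding_attr)
    for param in embedding.parameters():
        param.requires_grad = True
    embedding.train()
```
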

Observations:
With the post-processing YAML file enabled, the output collapses (all four speakers marked active for the entire audio).
Without the post-processing file, diarization output exists but is inaccurate.
The confidence scores are flattened and nearly identical across all four speakers.
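One way to quantify the flattened-confidence symptom is to measure how far the per-frame speaker probabilities spread apart (a hypothetical diagnostic I'm using; `probs` as a `(frames, num_spks)` sigmoid-output array is an assumption about the model's raw output shape):

```python
import numpy as np


def speaker_spread(probs: np.ndarray) -> float:
    """Mean per-frame gap between the most- and least-active speaker.

    Values near 0 mean all speaker channels move together (the collapse
    symptom above); healthy 2-speaker audio should show a clear gap.
    """
    per_frame_gap = probs.max(axis=1) - probs.min(axis=1)
    return float(per_frame_gap.mean())
```

On my outputs this comes out close to zero, consistent with all four speakers firing together.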

Looking for advice on:
Fine-tuning Sortformer with fewer than 4 speakers.
Post-processing parameter ranges for better separation.
Freezing layers vs. full model fine-tuning.
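On the second point, the post-processing parameters I've been experimenting with look roughly like this (a sketch; the parameter names follow NeMo's VAD-style post-processing convention and the values are my guesses, not a verified config):

```yaml
parameters:
  onset: 0.5             # probability threshold to start a speech segment
  offset: 0.3            # probability threshold to end a speech segment
  pad_onset: 0.1         # seconds added before each detected segment
  pad_offset: 0.1        # seconds added after each detected segment
  min_duration_on: 0.2   # drop segments shorter than this (seconds)
  min_duration_off: 0.2  # merge gaps shorter than this (seconds)
```

Suggested ranges for these (especially onset/offset) with 1–2 speaker audio would be very helpful.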

Thanks a lot!
