Ultra Diar Streaming Sortformer (8-Speaker)
This model extends NVIDIA's Streaming Sortformer speaker diarization model from 4 to 8 speakers. The original diar_streaming_sortformer_4spk-v2.1 supports at most 4 speakers; this model raises that limit to 8 through fine-tuning and architectural modifications.
Model Details
- Base model: nvidia/diar_streaming_sortformer_4spk-v2.1
- Extension: 4spk → 8spk
- Framework: NeMo (NVIDIA)
- Version: 1.0
Code & Training
The experimental pipeline, training scripts, and inference code will be published on GitHub at a later date. For now, the model is available only on Hugging Face.
Training
- Hardware: 2× NVIDIA H100 GPUs
Usage
This model requires the NVIDIA NeMo toolkit to train, fine-tune, or perform diarization. Install NeMo after installing Cython and the latest PyTorch.
Install NeMo
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
Quick Start: Run Diarization
from nemo.collections.asr.models import SortformerEncLabelModel
# Load model from Hugging Face
diar_model = SortformerEncLabelModel.from_pretrained("devsy0117/ultra_diar_streaming_sortformer_8spk_v1")
diar_model.eval()
# Streaming parameters (recommended for best performance)
diar_model.sortformer_modules.chunk_len = 340
diar_model.sortformer_modules.chunk_right_context = 40
diar_model.sortformer_modules.fifo_len = 40
diar_model.sortformer_modules.spkcache_update_period = 300
# Run diarization
predicted_segments = diar_model.diarize(audio=["/path/to/your/audio.wav"], batch_size=1)
for segment in predicted_segments[0]:
    print(segment)
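Each entry printed above is one diarized segment. As a minimal sketch, assuming every segment is a whitespace-separated string of the form "<start_sec> <end_sec> <speaker_label>" (an assumption about the output format, not something this card states), the hypothetical helper `talk_time_per_speaker` below totals speaking time per speaker. The example segments are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def talk_time_per_speaker(segments: Iterable[str]) -> Dict[str, float]:
    """Sum segment durations per speaker label.

    Assumes each segment looks like "<start_sec> <end_sec> <speaker_label>".
    """
    totals: Dict[str, float] = defaultdict(float)
    for segment in segments:
        start, end, speaker = segment.split()[:3]
        totals[speaker] += float(end) - float(start)
    return dict(totals)


# Made-up segments in the assumed format, not real model output.
example_segments = [
    "0.00 2.50 speaker_0",
    "2.50 4.00 speaker_1",
    "4.00 5.00 speaker_0",
]
talk_time = talk_time_per_speaker(example_segments)
for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.2f} s")
```

If the installed NeMo version returns segments in a different structure, only the line that unpacks `start`, `end`, and `speaker` should need to change.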
Loading the Model
from nemo.collections.asr.models import SortformerEncLabelModel
# Option 1: Load directly from Hugging Face
diar_model = SortformerEncLabelModel.from_pretrained("devsy0117/ultra_diar_streaming_sortformer_8spk_v1")
# Option 2: Load from a downloaded .nemo file
diar_model = SortformerEncLabelModel.restore_from(
    restore_path="/path/to/ultra_diar_streaming_sortformer_8spk_v1.nemo",
    map_location="cuda",
    strict=False,
)
diar_model.eval()
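Option 2 hard-codes map_location="cuda", which will not work on a machine without a GPU. A small sketch of one way to choose the device at run time follows; `pick_map_location` is a hypothetical helper, not part of NeMo, and it falls back to "cpu" when PyTorch is missing or reports no CUDA device.

```python
import importlib.util


def pick_map_location() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so there is nothing to query.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


map_location = pick_map_location()
print(f"Loading on: {map_location}")
```

The resulting string can then be passed as map_location in the restore_from call shown above.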
Input Format
- Single audio file: audio_input="/path/to/multispeaker_audio.wav"
- Multiple files: audio_input=["/path/to/audio1.wav", "/path/to/audio2.wav"]
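Since the input may be a single path or a list of paths, calling code sometimes normalizes it first so that later steps can always iterate over a list. The sketch below does that with a hypothetical helper, `as_audio_list`, which is not part of NeMo.

```python
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[str, Path]


def as_audio_list(audio_input: Union[PathLike, Iterable[PathLike]]) -> List[str]:
    """Return a list of path strings for either a single path or many."""
    if isinstance(audio_input, (str, Path)):
        return [str(audio_input)]
    return [str(path) for path in audio_input]


single = as_audio_list("/path/to/multispeaker_audio.wav")
several = as_audio_list(["/path/to/audio1.wav", "/path/to/audio2.wav"])
print(single)
print(several)
```

The normalized list can be passed as the audio argument of diar_model.diarize, as in the Quick Start.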
Evaluation Results
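Comparison with the base model (diar_streaming_sortformer_4spk-v2.1) on the AliMeeting, AMI, and CallHome benchmarks. DER is the diarization error rate, FA the false-alarm rate, MISS the missed-speech rate, CER the speaker confusion error rate, and Spk_Count_Acc the speaker-count accuracy. Metrics follow the same internal evaluation pipeline as other Ultra Sortformer releases.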
Evaluation Parameters
| Parameter | Value |
|---|---|
| Post-processing | None |
| Collar | 0.25 s |
| Ignore overlap | False |
| Chunk size | 340 frames |
| Batch size | 1 |
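The chunk size above is given in frames. As a rough conversion, assuming one frame spans 80 ms (an assumption taken from the upstream Streaming Sortformer setup, not stated in this card), the sketch below turns the Quick Start values for chunk_len and chunk_right_context into seconds. `frames_to_seconds` is a hypothetical helper.

```python
FRAME_MS = 80  # assumption: one Sortformer output frame spans 80 ms


def frames_to_seconds(frames: int, frame_ms: int = FRAME_MS) -> float:
    """Convert a frame count to seconds for a given frame duration."""
    return frames * frame_ms / 1000


chunk_len = 340            # frames, as in the Quick Start and the table above
chunk_right_context = 40   # frames, as in the Quick Start

chunk_seconds = frames_to_seconds(chunk_len)
lookahead_seconds = frames_to_seconds(chunk_len + chunk_right_context)
print(f"chunk: {chunk_seconds:.1f} s, chunk + right context: {lookahead_seconds:.1f} s")
# prints "chunk: 27.2 s, chunk + right context: 30.4 s"
```

Under that assumption, each chunk covers about 27 seconds of audio, so this configuration favors accuracy over low latency.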
AliMeeting (test)
| Model | DER | FA | MISS | CER | Spk_Count_Acc |
|---|---|---|---|---|---|
| diar_streaming_sortformer_4spk-v2.1 (base) | 11.03% | 0.40% | 9.93% | 0.70% | 95.00% |
| ultra_diar_streaming_sortformer_8spk_v1 (ours) | 5.69% | 1.12% | 3.89% | 0.68% | 100.00% |
AMI IHM (test)
| Model | DER | FA | MISS | CER | Spk_Count_Acc |
|---|---|---|---|---|---|
| diar_streaming_sortformer_4spk-v2.1 (base) | 26.05% | 0.50% | 23.51% | 2.03% | 93.75% |
| ultra_diar_streaming_sortformer_8spk_v1 (ours) | 10.87% | 1.53% | 7.89% | 1.44% | 81.25% |
AMI SDM (test)
| Model | DER | FA | MISS | CER | Spk_Count_Acc |
|---|---|---|---|---|---|
| diar_streaming_sortformer_4spk-v2.1 (base) | 28.29% | 0.82% | 23.76% | 3.72% | 93.75% |
| ultra_diar_streaming_sortformer_8spk_v1 (ours) | 15.61% | 2.33% | 8.23% | 5.05% | 75.00% |
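In the AliMeeting and AMI tables, the reported DER appears to equal FA + MISS + CER up to rounding. The sketch below copies the six rows from those tables and checks the residual; it is a consistency check on the published numbers, not a re-evaluation.

```python
# (DER, FA, MISS, CER) in percent, copied from the tables above.
rows = {
    ("AliMeeting", "base"): (11.03, 0.40, 9.93, 0.70),
    ("AliMeeting", "ours"): (5.69, 1.12, 3.89, 0.68),
    ("AMI IHM", "base"): (26.05, 0.50, 23.51, 2.03),
    ("AMI IHM", "ours"): (10.87, 1.53, 7.89, 1.44),
    ("AMI SDM", "base"): (28.29, 0.82, 23.76, 3.72),
    ("AMI SDM", "ours"): (15.61, 2.33, 8.23, 5.05),
}

# Residual between the reported DER and the sum of its three components.
residuals = {
    key: round(der - (fa + miss + cer), 2)
    for key, (der, fa, miss, cer) in rows.items()
}

for (dataset, model), residual in residuals.items():
    print(f"{dataset:10s} {model:4s} residual = {residual:+.2f} pp")
```

Every residual is within 0.01 percentage points, which is consistent with each component having been rounded separately.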
CallHome (test)
| Model | eng DER | deu DER | jpn DER | spa DER | zho DER | eng Spk_Acc | deu Spk_Acc | jpn Spk_Acc | spa Spk_Acc | zho Spk_Acc |
|---|---|---|---|---|---|---|---|---|---|---|
| diar_streaming_sortformer_4spk-v2.1 (base) | 4.94% | 6.70% | 10.03% | 23.27% | 7.15% | 83.57% | 80.83% | 79.17% | 63.57% | 72.86% |
| ultra_diar_streaming_sortformer_8spk_v1 (ours) | 8.20% | 7.70% | 11.11% | 18.24% | 10.16% | 92.86% | 90.00% | 89.17% | 70.00% | 75.00% |
Note: The base model is limited to 4 speakers. Extending to 8 speakers changes speaker-count behavior on short or low-speaker sessions; interpret Spk_Count_Acc together with DER. This release prioritizes strong DER on challenging multi-speaker settings.
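One way to read the two metrics together on CallHome is to compute the change from the base model to this model per language. The sketch below uses only the numbers from the CallHome table; a negative DER delta is an improvement, while a positive Spk_Acc delta is an improvement.

```python
# (base, ours) in percent, copied from the CallHome table above.
callhome = {
    "eng": {"der": (4.94, 8.20), "spk_acc": (83.57, 92.86)},
    "deu": {"der": (6.70, 7.70), "spk_acc": (80.83, 90.00)},
    "jpn": {"der": (10.03, 11.11), "spk_acc": (79.17, 89.17)},
    "spa": {"der": (23.27, 18.24), "spk_acc": (63.57, 70.00)},
    "zho": {"der": (7.15, 10.16), "spk_acc": (72.86, 75.00)},
}

# Delta = ours - base, in percentage points.
deltas = {
    lang: {metric: round(ours - base, 2) for metric, (base, ours) in metrics.items()}
    for lang, metrics in callhome.items()
}

for lang, delta in deltas.items():
    print(f"{lang}: DER {delta['der']:+.2f} pp, Spk_Acc {delta['spk_acc']:+.2f} pp")
```

On these figures, speaker-count accuracy rises for every language, while DER improves only for Spanish and is higher than the base model for the other four.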
License
This repository’s model weights and documentation are released under the Apache License 2.0.
The upstream base model may be subject to separate terms; see nvidia/diar_streaming_sortformer_4spk-v2.1 for its license and attribution requirements.