Benchmarks¶

Evaluated on the VoxConverse dev set (216 files, 1--20 speakers per file).

Speaker Count Estimation¶

GT Speakers	Files	Exact Match	Within +/-1
1	22	91%	95%
2	44	70%	91%
3	35	69%	97%
4	24	54%	88%
5	31	32%	87%
6--7	29	45%	79%
8+	31	0%	26%
Overall	216	51%	81%

The algorithm works best for 1--4 speakers (88--97% within +/-1). Accuracy drops for 8 or more speakers --- see Limitations below.

Diarization Error Rate (DER)¶

DER is the standard metric for speaker diarization, computed with collar=0.25 and skip_overlap=True.

System	Weighted DER	Median DER	Notes
pyannote precision-2	~8.5%	--	Commercial license
diarize	~10.8%	~3.7%	Apache 2.0, CPU-only, no API key
pyannote community-1	~11.2%	--	CC-BY-4.0, needs HF token
pyannote 3.1 (legacy)	~11.2%	--	MIT, needs HF token

pyannote DER numbers are self-reported from the pyannote benchmark page on VoxConverse v0.3.

Better than pyannote 3.1 on VoxConverse

diarize achieves lower DER than both pyannote 3.1 (legacy) and community-1 on VoxConverse, while requiring no HuggingFace token or account registration.

CPU Speed (Real Time Factor)¶

RTF = processing_time / audio_duration. Lower is faster; RTF < 1.0 means faster than real-time.

System	Mean RTF	Median RTF	Notes
diarize	0.12	0.12	~7x faster than community-1
pyannote community-1	0.82	0.86	~2x faster than 3.1
pyannote 3.1 (legacy)	1.74	1.83	Slower than real-time on CPU

Measured on VoxConverse dev files on Apple M2 Pro / M2 Max (CPU only, no GPU). All systems were warm-started (models pre-loaded).

Apples-to-apples

All systems ran on the same files with torch.device("cpu"). diarize uses ONNX Runtime for speaker embeddings; pyannote uses PyTorch neural networks (segmentation + embedding models).

pyannote 3.1 is slower than real-time on CPU

With RTF > 1.0, pyannote 3.1 cannot process audio in real-time on CPU. A 10-minute recording takes ~18 minutes to diarize vs ~1.2 minutes with diarize. Community-1 is faster (RTF ~0.86) but still ~7x slower than diarize.

Methodology¶

Dataset: VoxConverse dev set --- 216 audio files recorded from YouTube debates, news shows, and other multi-speaker media.
Ground truth: RTTM annotations from the official repository.
Evaluation: pyannote.metrics DiarizationErrorRate with standard parameters.
Speed benchmark: 25 files from VoxConverse dev set, stratified by duration. Wall-clock time measured with time.time() after model warm-up. RTF = processing_time / audio_duration.
Hardware: Apple M2 Pro, macOS, CPU only (no GPU).

Limitations¶

Speaker count > 7

The GMM BIC speaker-count estimator with silhouette refinement works well for 1--5 speakers and degrades gradually for 6--7. For 8 or more speakers it tends to undercount and produces higher DER. If you know your audio has many speakers, pass num_speakers explicitly:

result = diarize("panel.wav", num_speakers=12)

Known limitations:

Many speakers (8+): Automatic speaker count estimation degrades --- GMM BIC with silhouette refinement reaches 26% within-one accuracy for 8+ speakers. Use num_speakers when the speaker count is known.
Overlapping speech: DER is computed with skip_overlap=True. The pipeline does not model overlapping speech --- when two people talk simultaneously, only one is labelled.
Short utterances (< 0.4 s): Segments shorter than 0.4 seconds are not embedded directly; they are assigned the label of the nearest speaker, which can cause errors at speaker boundaries.

Future Work¶

Single-dataset disclaimer

All results above are from VoxConverse dev set only. We are actively expanding evaluation to ensure the algorithm generalises well and is not overfit to a single benchmark.

Planned evaluation:

Cross-dataset validation --- AMI, DIHARD III, CALLHOME, and other standard benchmarks, run in isolated environments with controlled CPU/memory limits.
Speaker count estimation comparison --- dedicated benchmarks comparing speaker counting accuracy against pyannote and other systems across datasets.
Broader system comparison --- benchmark against NeMo, WhisperX, and other open-source diarization solutions with verified, reproducible results.

Planned features:

Streaming / real-time diarization --- process live audio streams with real-time speaker detection and embedding extraction.
Speaker identification --- store and compare speaker embeddings to recognise known speakers across sessions.