Benchmarks
Primary published numbers are evaluated on the VoxConverse dev set (216 files, 1--20 speakers per file). We also run preliminary cross-dataset checks on AMI meetings to track generalisation.
Speaker Count Estimation
| Metric | Result |
|---|---|
| Files | 216 |
| Exact match | 125/216 (58%) |
| Within +/-1 | 178/216 (82%) |
The automatic estimator is usually close, but exact counting remains the main weak spot. Accuracy drops for many-speaker files --- see Limitations below.
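The two rows in the table above are simple tallies over paired counts. The following is a minimal sketch of that tally, not the project's evaluation code; the function name `count_accuracy` and the example counts are hypothetical.

```python
def count_accuracy(true_counts, predicted_counts):
    """Return (exact, within_one, total) for paired speaker counts."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("count lists must have the same length")
    pairs = list(zip(true_counts, predicted_counts))
    exact = sum(1 for t, p in pairs if t == p)
    within_one = sum(1 for t, p in pairs if abs(t - p) <= 1)
    return exact, within_one, len(pairs)


# Hypothetical counts for five files (not taken from the benchmark).
true_counts = [2, 3, 5, 8, 4]
predicted_counts = [2, 4, 5, 5, 3]

exact, within_one, total = count_accuracy(true_counts, predicted_counts)
print(f"Exact match: {exact}/{total} ({exact / total:.0%})")
print(f"Within +/-1: {within_one}/{total} ({within_one / total:.0%})")
```

An exact match always counts as within +/-1, so the second figure can never be lower than the first.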
Diarization Error Rate (DER)
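DER is the standard metric for speaker diarization: the fraction of
reference speech time that is missed, falsely detected, or attributed to
the wrong speaker. All values here are computed with collar=0.25 and
skip_overlap=True.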
| System | Weighted DER | Median DER | Notes |
|---|---|---|---|
| pyannote precision-2 | ~8.5% | -- | Commercial license |
| diarize | ~4.8% | ~2.1% | Apache 2.0, CPU-only, no API key |
| pyannote community-1 | ~11.2% | -- | CC-BY-4.0, needs HF token |
| pyannote 3.1 (legacy) | ~11.2% | -- | MIT, needs HF token |
pyannote DER numbers are self-reported from the pyannote benchmark page on VoxConverse v0.3.
Dataset-specific result
On this VoxConverse dev evaluation, diarize reports lower weighted
DER than the published pyannote VoxConverse figures, while requiring
no HuggingFace token or account registration. Treat this as a
VoxConverse-specific benchmark and compare on your own audio when accuracy
is the top priority.
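To make the table columns concrete, here is a small sketch of how per-file DER and its aggregates can be derived from error durations. It assumes that "weighted" means pooling error and reference durations across files, which this page does not state explicitly. The function names and the sample durations are hypothetical.

```python
from statistics import mean, median


def der(missed, false_alarm, confusion, total_reference):
    """DER for one file; all arguments are durations in seconds."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


def aggregate(files):
    """Summarise DER over files given per-file component durations."""
    per_file = [
        der(f["missed"], f["false_alarm"], f["confusion"], f["total"])
        for f in files
    ]
    total_error = sum(
        f["missed"] + f["false_alarm"] + f["confusion"] for f in files
    )
    total_reference = sum(f["total"] for f in files)
    return {
        # Assumption: "weighted" pools durations, so long files count more.
        "weighted": total_error / total_reference,
        "mean": mean(per_file),
        "median": median(per_file),
    }


# Hypothetical component durations in seconds (not benchmark data).
files = [
    {"missed": 1.0, "false_alarm": 1.0, "confusion": 2.0, "total": 100.0},
    {"missed": 2.0, "false_alarm": 0.0, "confusion": 4.0, "total": 50.0},
    {"missed": 5.0, "false_alarm": 5.0, "confusion": 30.0, "total": 1000.0},
]

for name, value in aggregate(files).items():
    print(f"{name} DER: {value:.2%}")
```

Under this assumption a single long file can pull the weighted figure away from the mean, which is one reason the weighted and median columns can differ noticeably.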
Cross-Dataset Check: AMI
Preliminary AMI test-set evaluation uses 16 Mix-Headset meeting
recordings (4--9 speakers per file), RTTM annotations from the
standard AMI speaker-diarization benchmark, and the same DER settings
(collar=0.25, skip_overlap=True).
| Metric | Result |
|---|---|
| Files | 16 |
| Weighted DER | 14.96% |
| Mean DER | 14.63% |
| Median DER | 14.18% |
| Speaker count exact match | 4/16 (25%) |
| Speaker count within +/-1 | 8/16 (50%) |
These preliminary results suggest that meeting-domain audio is a harder case for automatic speaker counting. The estimator often collapses meetings with 6+ speakers to 4--5 speakers. Aggregate DER can still remain moderate in those cases, because some ground-truth speakers have little speaking time.
CPU Speed (Real Time Factor)
RTF = processing_time / audio_duration. Lower is faster; RTF < 1.0 means faster than real-time.
| System | Mean RTF | Median RTF | Notes |
|---|---|---|---|
| diarize | 0.12 | 0.12 | ~7x faster than community-1 |
| pyannote community-1 | 0.82 | 0.86 | ~2x faster than 3.1 |
| pyannote 3.1 (legacy) | 1.74 | 1.83 | Slower than real-time on CPU |
Measured on VoxConverse dev files on Apple M2 Pro / M2 Max (CPU only, no GPU). All systems were warm-started (models pre-loaded).
Apples-to-apples
All systems ran on the same files with torch.device("cpu").
diarize uses ONNX Runtime for speaker embeddings; pyannote uses
PyTorch neural networks (segmentation + embedding models).
pyannote 3.1 is slower than real-time on CPU
With RTF > 1.0, pyannote 3.1 cannot process audio in real-time
on CPU. It needs ~18 minutes for a 10-minute recording, versus
~1.2 minutes for diarize. Community-1 is faster (median RTF ~0.86)
but still ~7x slower than diarize.
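The figures above follow directly from the RTF definition. As a quick check, the sketch below projects processing time for a 10-minute recording from the median RTF values in the table. The helper names are hypothetical.

```python
def projected_minutes(audio_minutes, rtf):
    """Processing time implied by RTF = processing_time / audio_duration."""
    return audio_minutes * rtf


def speedup(rtf_slow, rtf_fast):
    """How many times faster the second system is than the first."""
    return rtf_slow / rtf_fast


# Median RTF values copied from the table above.
MEDIAN_RTF = {
    "diarize": 0.12,
    "pyannote community-1": 0.86,
    "pyannote 3.1 (legacy)": 1.83,
}

for system, rtf in MEDIAN_RTF.items():
    minutes = projected_minutes(10, rtf)
    print(f"{system}: {minutes:.1f} min for 10 min of audio")

ratio = speedup(MEDIAN_RTF["pyannote community-1"], MEDIAN_RTF["diarize"])
print(f"community-1 vs diarize: ~{ratio:.1f}x")
```

These are projections from median values; actual times for a given file will vary with its duration and content.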
Methodology
- Dataset: VoxConverse dev set --- 216 audio files recorded from YouTube debates, news shows, and other multi-speaker media.
- Ground truth: RTTM annotations from the official repository.
- Evaluation: pyannote.metrics DiarizationErrorRate with collar=0.25 and skip_overlap=True.
- Speed benchmark: 25 files from VoxConverse dev set, stratified by duration. Wall-clock time measured with time.time() after model warm-up. RTF = processing_time / audio_duration.
- Hardware: Apple M2 Pro, macOS, CPU only (no GPU).
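The speed measurement described in the list above can be sketched as follows. This is an illustration, not the project's benchmark script. It assumes warm-up can be approximated by one untimed run, and it swaps in `time.perf_counter()` for `time.time()` because it is monotonic. `measure_rtf` and `fake_diarize` are hypothetical names.

```python
import time


def measure_rtf(process, audio_path, audio_duration_s, warmup=True):
    """Return processing_time / audio_duration for one timed call."""
    if audio_duration_s <= 0:
        raise ValueError("audio_duration_s must be positive")
    if warmup:
        process(audio_path)  # untimed run so model loading is excluded
    start = time.perf_counter()
    process(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s


def fake_diarize(audio_path):
    """Hypothetical stand-in for a real diarization call."""
    time.sleep(0.05)


rtf = measure_rtf(fake_diarize, "example.wav", audio_duration_s=10.0)
print(f"RTF = {rtf:.4f}")
```

To benchmark a real system, replace `fake_diarize` with a callable that runs the pipeline on the given path and pass the true audio duration in seconds.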
Reproducing and Extending Benchmarks
The repository includes a dataset-agnostic RTTM runner for local experiments:
python scripts/benchmark_rttm.py \
--dataset voxconverse-dev \
--audio-dir /path/to/voxconverse/dev/audio \
--rttm-dir /path/to/voxconverse/rttm_annotations/dev \
--output results_voxconverse_dev.json
It also supports combined RTTM files and targeted diagnostics:
python scripts/benchmark_rttm.py \
--dataset ami-test \
--audio-dir /path/to/ami/mix-headset/test \
--rttm-file /path/to/AMI.SpeakerDiarization.Benchmark.test.rttm \
--oracle-speakers \
--file-id IS1009a
Use --oracle-speakers to isolate speaker assignment and clustering
quality when the true speaker count is known. Use --list-only to
verify audio/RTTM matching without running inference.
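When preparing a new dataset, it can help to inspect the RTTM annotations before running the benchmark. The sketch below parses standard RTTM SPEAKER records and counts reference speakers per file ID. It is independent of scripts/benchmark_rttm.py and does not reproduce its logic. The function names and sample lines are hypothetical.

```python
from collections import defaultdict


def parse_rttm(lines):
    """Group SPEAKER turns by file ID.

    Returns {file_id: [(onset_s, duration_s, speaker), ...]}.
    """
    turns = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        file_id = fields[1]
        onset, duration = float(fields[3]), float(fields[4])
        speaker = fields[7]
        turns[file_id].append((onset, duration, speaker))
    return dict(turns)


def speakers_per_file(turns):
    """Number of distinct reference speakers for each file ID."""
    return {
        file_id: len({speaker for _, _, speaker in segments})
        for file_id, segments in turns.items()
    }


# Made-up records in RTTM layout (not real annotations).
sample = [
    "SPEAKER IS1009a 1 0.00 3.50 <NA> <NA> spk_A <NA> <NA>",
    "SPEAKER IS1009a 1 3.50 2.25 <NA> <NA> spk_B <NA> <NA>",
    "SPEAKER IS1009a 1 6.00 1.00 <NA> <NA> spk_A <NA> <NA>",
    "SPEAKER meeting_b 1 1.20 4.00 <NA> <NA> spk_C <NA> <NA>",
]

turns = parse_rttm(sample)
print(speakers_per_file(turns))
```

Because records are grouped by the file ID field, the same function works for per-file RTTMs and for a combined RTTM covering a whole dataset; pass an open file object in place of `sample`.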
Limitations
Speaker count > 7
The GMM BIC speaker-count estimator with silhouette refinement is
usually close on VoxConverse dev, but many-speaker files remain the
hardest case. For 8 or more speakers it can undercount and
produce higher DER.
If you know your audio has many speakers, pass num_speakers
explicitly.
Known limitations:
- Many speakers (8+): Automatic speaker count estimation degrades. Use num_speakers when the speaker count is known.
- Speaker label switching / fragmentation: Temporal smoothing reduces short label jumps, but on noisy real-world audio one actual speaker can still be split across multiple SPEAKER_XX labels. This is mostly a clustering and embedding-assignment limitation, and it is visible in transcripts even when aggregate DER looks acceptable.
- Overlapping speech: DER is computed with skip_overlap=True. The pipeline does not model overlapping speech: when two people talk simultaneously, only one is labelled.
- Short utterances (< 0.4 s): Segments shorter than 0.4 seconds are not embedded directly; they are assigned the label of the nearest speaker, which can cause errors at speaker boundaries.
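The short-utterance limitation is easier to see with a toy example. The sketch below assumes "nearest" means nearest in time to an already labelled segment, which this page does not specify. It is not the pipeline's implementation, and all names are hypothetical.

```python
MIN_DURATION_S = 0.4  # threshold stated in the limitations list


def gap(start, end, other):
    """Temporal gap between two intervals (0.0 if they touch or overlap)."""
    return max(other[0] - end, start - other[1], 0.0)


def label_short_segments(segments, min_duration=MIN_DURATION_S):
    """Give each short segment the label of the nearest long segment.

    segments: list of (start_s, end_s, label); short segments may carry
    label None. Assumption: "nearest" is measured in time.
    """
    long_segments = [s for s in segments if s[1] - s[0] >= min_duration]
    if not long_segments:
        raise ValueError("need at least one segment of min_duration or more")
    labelled = []
    for start, end, label in segments:
        if end - start >= min_duration:
            labelled.append((start, end, label))
            continue
        nearest = min(long_segments, key=lambda s: gap(start, end, s))
        labelled.append((start, end, nearest[2]))
    return labelled


# Made-up segments: a 0.2 s interjection between two longer turns.
segments = [
    (0.0, 2.0, "SPEAKER_00"),
    (2.1, 2.3, None),
    (2.9, 5.0, "SPEAKER_01"),
]

for start, end, label in label_short_segments(segments):
    print(f"{start:.1f}-{end:.1f}s {label}")
```

Here the interjection inherits SPEAKER_00 because that turn ends closest to it. If it was actually spoken by SPEAKER_01, the result is exactly the kind of boundary error described above.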
Future Work
Cross-dataset validation in progress
VoxConverse remains the primary published benchmark. AMI is now used as an additional meeting-domain check, and more datasets are needed before making broad accuracy claims.
Planned evaluation:
- Cross-dataset validation --- DIHARD III, CALLHOME, and other standard benchmarks, run in isolated environments with controlled CPU/memory limits.
- Speaker count estimation comparison --- dedicated benchmarks comparing speaker counting accuracy against pyannote and other systems across datasets.
- Broader system comparison --- benchmark against NeMo, WhisperX, and other open-source diarization solutions with verified, reproducible results.
Planned features:
- Streaming / real-time diarization --- process live audio streams with real-time speaker detection and embedding extraction.
- Speaker identification --- store and compare speaker embeddings to recognise known speakers across sessions.