Skip to content

Benchmarks

Primary published numbers are evaluated on the VoxConverse dev set (216 files, 1--20 speakers per file). We also run preliminary cross-dataset checks on AMI meetings to track generalisation.

Speaker Count Estimation

Metric Result
Files 216
Exact match 125/216 (58%)
Within +/-1 178/216 (82%)

The automatic estimator is usually close, but exact counting remains the main weak spot. Accuracy drops for many-speaker files --- see Limitations below.

Diarization Error Rate (DER)

DER is the standard metric for speaker diarization, computed with collar=0.25 and skip_overlap=True.

System Weighted DER Median DER Notes
pyannote precision-2 ~8.5% -- Commercial license
diarize ~4.8% ~2.1% Apache 2.0, CPU-only, no API key
pyannote community-1 ~11.2% -- CC-BY-4.0, needs HF token
pyannote 3.1 (legacy) ~11.2% -- MIT, needs HF token

pyannote DER numbers are self-reported from the pyannote benchmark page on VoxConverse v0.3.

Dataset-specific result

On this VoxConverse dev evaluation, diarize reports lower weighted DER than the published pyannote VoxConverse figures, while requiring no HuggingFace token or account registration. Treat this as a VoxConverse-specific benchmark and compare on your own audio when accuracy is the top priority.

Cross-Dataset Check: AMI

Preliminary AMI test-set evaluation uses 16 Mix-Headset meeting recordings (4--9 speakers per file), RTTM annotations from the standard AMI speaker-diarization benchmark, and the same DER settings (collar=0.25, skip_overlap=True).

Metric Result
Files 16
Weighted DER 14.96%
Mean DER 14.63%
Median DER 14.18%
Speaker count exact match 4/16 (25%)
Speaker count within +/-1 8/16 (50%)

This confirms that meeting-domain audio is a harder case for automatic speaker counting. The estimator often collapses 6+ speaker meetings to 4--5 speakers, even when aggregate DER remains moderate because some ground-truth speakers have little speaking time.

CPU Speed (Real Time Factor)

RTF = processing_time / audio_duration. Lower is faster; RTF < 1.0 means faster than real-time.

System Mean RTF Median RTF Notes
diarize 0.12 0.12 ~7x faster than community-1
pyannote community-1 0.82 0.86 ~2x faster than 3.1
pyannote 3.1 (legacy) 1.74 1.83 Slower than real-time on CPU

Measured on VoxConverse dev files on Apple M2 Pro / M2 Max (CPU only, no GPU). All systems were warm-started (models pre-loaded).

Apples-to-apples

All systems ran on the same files with torch.device("cpu"). diarize uses ONNX Runtime for speaker embeddings; pyannote uses PyTorch neural networks (segmentation + embedding models).

pyannote 3.1 is slower than real-time on CPU

With RTF > 1.0, pyannote 3.1 cannot process audio in real-time on CPU. A 10-minute recording takes ~18 minutes to diarize vs ~1.2 minutes with diarize. Community-1 is faster (RTF ~0.86) but still ~7x slower than diarize.

Methodology

  • Dataset: VoxConverse dev set --- 216 audio files recorded from YouTube debates, news shows, and other multi-speaker media.
  • Ground truth: RTTM annotations from the official repository.
  • Evaluation: pyannote.metrics DiarizationErrorRate with standard parameters.
  • Speed benchmark: 25 files from VoxConverse dev set, stratified by duration. Wall-clock time measured with time.time() after model warm-up. RTF = processing_time / audio_duration.
  • Hardware: Apple M2 Pro, macOS, CPU only (no GPU).

Reproducing and Extending Benchmarks

The repository includes a dataset-agnostic RTTM runner for local experiments:

python scripts/benchmark_rttm.py \
  --dataset voxconverse-dev \
  --audio-dir /path/to/voxconverse/dev/audio \
  --rttm-dir /path/to/voxconverse/rttm_annotations/dev \
  --output results_voxconverse_dev.json

It also supports combined RTTM files and targeted diagnostics:

python scripts/benchmark_rttm.py \
  --dataset ami-test \
  --audio-dir /path/to/ami/mix-headset/test \
  --rttm-file /path/to/AMI.SpeakerDiarization.Benchmark.test.rttm \
  --oracle-speakers \
  --file-id IS1009a

Use --oracle-speakers to isolate speaker assignment and clustering quality when the true speaker count is known. Use --list-only to verify audio/RTTM matching without running inference.

Limitations

Speaker count > 7

The GMM BIC speaker-count estimator with silhouette refinement is usually close on VoxConverse dev, but many-speaker files remain the hardest case. For 8 or more speakers it can undercount and produce higher DER. If you know your audio has many speakers, pass num_speakers explicitly:

result = diarize("panel.wav", num_speakers=12)

Known limitations:

  • Many speakers (8+): Automatic speaker count estimation degrades. Use num_speakers when the speaker count is known.
  • Speaker label switching / fragmentation: Temporal smoothing reduces short label jumps, but on noisy real-world audio one actual speaker can still be split across multiple SPEAKER_XX labels. This is mostly a clustering and embedding-assignment limitation, and it is visible in transcripts even when aggregate DER looks acceptable.
  • Overlapping speech: DER is computed with skip_overlap=True. The pipeline does not model overlapping speech --- when two people talk simultaneously, only one is labelled.
  • Short utterances (< 0.4 s): Segments shorter than 0.4 seconds are not embedded directly; they are assigned the label of the nearest speaker, which can cause errors at speaker boundaries.

Future Work

Cross-dataset validation in progress

VoxConverse remains the primary published benchmark. AMI is now used as an additional meeting-domain check, and more datasets are needed before making broad accuracy claims.

Planned evaluation:

  • Cross-dataset validation --- DIHARD III, CALLHOME, and other standard benchmarks, run in isolated environments with controlled CPU/memory limits.
  • Speaker count estimation comparison --- dedicated benchmarks comparing speaker counting accuracy against pyannote and other systems across datasets.
  • Broader system comparison --- benchmark against NeMo, WhisperX, and other open-source diarization solutions with verified, reproducible results.

Planned features:

  • Streaming / real-time diarization --- process live audio streams with real-time speaker detection and embedding extraction.
  • Speaker identification --- store and compare speaker embeddings to recognise known speakers across sessions.