How It Works
diarize answers "who spoke when?" using a four-stage pipeline,
all running on CPU.
Pipeline Overview
Audio File
    |
    v
[1] Silero VAD ------------> Speech segments (start, end)
    |
    v
[2] WeSpeaker ResNet34-LM -> 256-dim speaker embeddings
    |
    v
[3] GMM BIC ---------------> Estimated speaker count (k)
    |
    v
[4] Spectral Clustering ---> Speaker labels
    |
    v
DiarizeResult
Stage 1: Voice Activity Detection
Silero VAD (MIT licensed) identifies which portions of the audio contain speech. Non-speech regions (silence, music, noise) are discarded before embedding extraction.
The VAD returns a list of SpeechSegment objects with start/end timestamps in seconds.
Key parameters: threshold (default 0.45), minimum speech duration, minimum silence duration, padding.
See: run_vad()
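To make the role of these parameters concrete, here is a minimal sketch of how per-frame speech probabilities could be turned into segments. It is not Silero VAD's actual implementation: the function probs_to_segments, the Segment class, the 32 ms frame length, and every default other than the 0.45 threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for SpeechSegment: start/end in seconds."""
    start: float
    end: float


def probs_to_segments(probs, frame_s=0.032, threshold=0.45,
                      min_speech_s=0.25, min_silence_s=0.1, pad_s=0.03):
    """Turn per-frame speech probabilities into padded speech segments."""
    # 1. Collect raw runs of consecutive frames at or above the threshold.
    runs, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            runs.append([start * frame_s, i * frame_s])
            start = None
    if start is not None:
        runs.append([start * frame_s, len(probs) * frame_s])

    # 2. Bridge silences shorter than min_silence_s.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_silence_s:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Drop runs shorter than min_speech_s, then pad and clamp to the audio.
    total = len(probs) * frame_s
    return [Segment(max(0.0, s - pad_s), min(total, e + pad_s))
            for s, e in merged if e - s >= min_speech_s]
```

Raising the threshold or the minimum speech duration makes the detector more conservative, while a longer minimum silence joins utterances separated by brief pauses into one segment.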
Stage 2: Speaker Embedding Extraction
WeSpeaker ResNet34-LM (Apache 2.0) extracts a 256-dimensional embedding vector for each speech segment. Long segments are split using a sliding window (1.2s window, 0.6s step) so each window gets its own embedding. Segments shorter than 0.4s are skipped during extraction and assigned the nearest speaker label later.
The model runs via ONNX Runtime, so no GPU is needed.
See: extract_embeddings()
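The windowing rule described above can be sketched as a small helper. The function name sliding_windows and the handling of the final partial window (one extra full-size window aligned to the segment end) are assumptions; only the 1.2s window, 0.6s step, and 0.4s minimum come from the description.

```python
def sliding_windows(start, end, window=1.2, step=0.6, min_len=0.4):
    """Split one speech segment into (start, end) windows for embedding.

    Returns an empty list for segments shorter than min_len, mirroring the
    rule that such segments are skipped and labelled later.
    """
    duration = end - start
    if duration < min_len:
        return []                       # too short to embed
    if duration <= window:
        return [(start, end)]           # fits in a single window
    eps = 1e-9                          # guards against float rounding
    n = int((duration - window) / step + eps) + 1
    windows = [(start + i * step, start + i * step + window) for i in range(n)]
    # Assumed tail handling: cover the remainder with one last full window.
    if windows[-1][1] < end - eps:
        windows.append((end - window, end))
    return windows
```

With these defaults consecutive windows overlap by half, so every instant of a long segment (apart from its edges) contributes to two embeddings.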
Stage 3: Speaker Count Estimation
Unless the user provides num_speakers, the pipeline estimates how many
speakers are present in two steps:
Step 1: Cosine similarity pre-check. Compute the pairwise cosine similarities of the L2-normalised embeddings. If their 10th percentile is above a threshold (0.16), even the least similar pairs are still close, so the embeddings are treated as coming from a single speaker and k = 1 is returned immediately, skipping the more expensive GMM fitting entirely.
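A minimal NumPy sketch of this pre-check, assuming embeddings arrive as an (n, d) array. The function name looks_like_single_speaker is hypothetical, and the library's own implementation may differ in detail (for example in how it treats very small inputs).

```python
import numpy as np


def looks_like_single_speaker(embeddings, percentile=10, threshold=0.16):
    """Return True when the low tail of pairwise cosine similarity is high."""
    x = np.asarray(embeddings, dtype=float)
    if len(x) < 2:
        return True                                    # nothing to compare
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # L2-normalise rows
    sim = x @ x.T                                      # cosine similarities
    iu = np.triu_indices(len(x), k=1)                  # each pair once, no diagonal
    return bool(np.percentile(sim[iu], percentile) > threshold)
```

Using a low percentile rather than the minimum keeps a handful of noisy windows from forcing the pipeline into the slower GMM step.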
Step 2: GMM BIC. If the pre-check does not trigger, the pipeline fits Gaussian Mixture Models and compares them using the Bayesian Information Criterion (BIC):
- Project to 8 dimensions via PCA (optimal for GMM with full covariance)
- For each candidate k (from min_speakers to max_speakers), fit GaussianMixture(k, covariance_type="full")
- Select k with the lowest BIC score
The PCA=8 setting provides a good balance: stable estimation for 2 to 7 speakers while keeping computational cost low.
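The three steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the body of estimate_speakers(): the function name estimate_k_by_bic, the fixed random seed, and the clamping of the PCA dimension for tiny inputs are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def estimate_k_by_bic(embeddings, min_k=1, max_k=10, n_dims=8, seed=0):
    """Pick the number of GMM components with the lowest BIC after PCA."""
    x = np.asarray(embeddings, dtype=float)
    n_dims = min(n_dims, x.shape[0], x.shape[1])   # PCA cannot exceed these
    reduced = PCA(n_components=n_dims, random_state=seed).fit_transform(x)
    scores = {}
    for k in range(min_k, min(max_k, len(x)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(reduced)
        scores[k] = gmm.bic(reduced)               # lower is better
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

BIC penalises each additional component for its extra parameters, and a full covariance matrix in 8 dimensions already carries 36 free parameters per component. That penalty is one plausible reason a low-dimensional projection helps here, since it keeps the per-component cost small relative to the number of windows available.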
Warning
For 8 or more speakers the estimator systematically undercounts.
Pass num_speakers explicitly when the speaker count is known.
See the Limitations section of Benchmarks.
See: estimate_speakers()
Stage 4: Spectral Clustering
With the number of speakers determined, Spectral Clustering groups
the embedding vectors using cosine similarity as the affinity metric.
The cosine similarity matrix is rescaled to [0, 1] and passed to
scikit-learn's SpectralClustering.
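A minimal sketch of this step, assuming the rescaling is the linear map (cos + 1) / 2; the exact formula used by cluster_spectral() is not stated here, and the function name cluster_by_cosine is hypothetical. Passing affinity="precomputed" tells scikit-learn to treat the input as a ready-made similarity matrix.

```python
import numpy as np
from sklearn.cluster import SpectralClustering


def cluster_by_cosine(embeddings, k, seed=0):
    """Spectral clustering on a cosine affinity rescaled from [-1, 1] to [0, 1]."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # L2-normalise rows
    cosine = np.clip(x @ x.T, -1.0, 1.0)               # guard against rounding
    affinity = (cosine + 1.0) / 2.0                    # assumed linear rescaling
    model = SpectralClustering(n_clusters=k, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(affinity)
```

The rescaling matters because spectral clustering expects non-negative affinities, whereas raw cosine similarity can be negative.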
Adjacent subsegments assigned to the same speaker are merged, and short segments that were skipped during embedding extraction are assigned the label of the nearest speaker.
See: cluster_spectral()
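The post-processing described above (merging same-speaker neighbours and labelling skipped short segments) can be illustrated with two small helpers. Both names, the max_gap parameter, and the use of midpoint distance to define "nearest" are assumptions for the sake of the sketch.

```python
def merge_adjacent(subsegments, labels, max_gap=0.0):
    """Merge consecutive (start, end) windows that share a speaker label."""
    merged = []
    for (start, end), label in sorted(zip(subsegments, labels)):
        same = merged and merged[-1][2] == label
        if same and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)    # extend the open turn
        else:
            merged.append([start, end, label])
    return [tuple(m) for m in merged]


def nearest_label(segment, turns):
    """Label a skipped short segment with the speaker of the closest turn."""
    mid = (segment[0] + segment[1]) / 2

    def distance(turn):
        start, end, _ = turn
        if start <= mid <= end:
            return 0.0                                 # inside the turn
        return min(abs(mid - start), abs(mid - end))

    return min(turns, key=distance)[2]
```

Because the embedding windows overlap, merged turns from different speakers may overlap slightly at their boundaries; this sketch leaves such overlaps untouched.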
Using Individual Stages
Each stage is available as a standalone function for advanced use cases:
from diarize.vad import run_vad
from diarize.embeddings import extract_embeddings
from diarize.clustering import estimate_speakers, cluster_spectral
# Run stages individually
speech_segments = run_vad("meeting.wav")
embeddings, subsegments = extract_embeddings("meeting.wav", speech_segments)
k, details = estimate_speakers(embeddings, min_k=2, max_k=10)
labels = cluster_spectral(embeddings, k)
# Use results
for sub, label in zip(subsegments, labels):
    print(f"[{sub.start:.1f}s - {sub.end:.1f}s] SPEAKER_{label:02d}")
Dependencies
All components are permissively licensed:
| Component | License | Purpose |
|---|---|---|
| Silero VAD | MIT | Voice Activity Detection |
| WeSpeaker | Apache 2.0 | Speaker Embeddings (ONNX) |
| scikit-learn | BSD | Spectral Clustering, GMM, PCA |
| PyTorch | BSD | Audio loading, VAD inference |
| Pydantic | MIT | Data validation and models |