API Reference¶
Main Pipeline¶
diarize
¶
diarize(audio_path: str | Path, *, min_speakers: int = 1, max_speakers: int = 20, num_speakers: int | None = None) -> DiarizeResult
Run the full speaker diarization pipeline on an audio file.
Pipeline stages:
- Silero VAD — detect speech segments
- WeSpeaker ResNet34-LM — extract 256-dim speaker embeddings
- GMM BIC — estimate number of speakers (unless num_speakers is provided)
- Spectral Clustering — assign speaker labels
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `audio_path` | `str \| Path` | Path to an audio file (wav, mp3, flac, etc.). |
| `min_speakers` | `int` | Minimum number of speakers for auto-detection. Default: `1`. |
| `max_speakers` | `int` | Maximum number of speakers for auto-detection. Default: `20`. |
| `num_speakers` | `int \| None` | If set, skip auto-detection and use this exact number of speakers. Default: `None`. |
| RETURNS | DESCRIPTION |
|---|---|
| `DiarizeResult` | Diarization result with labeled segments and export methods. |
Example:

```python
from diarize import diarize

result = diarize("meeting.wav")
print(f"Found {result.num_speakers} speakers")
for seg in result.segments:
    print(f"  [{seg.start:.1f} - {seg.end:.1f}] {seg.speaker}")
result.to_rttm("meeting.rttm")
```
Result Types¶
DiarizeResult
¶
Bases: BaseModel
Result of speaker diarization.
This is the main object returned by `diarize.diarize`.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `segments` | `list[Segment]` | Diarization segments sorted by start time. |
| `audio_path` | `str` | Path to the source audio file. |
| `audio_duration` | `float` | Duration of the source audio in seconds. |
| `estimation_details` | `SpeakerEstimationDetails \| None` | Diagnostic info from speaker count estimation. |
Example:

```python
result = diarize("meeting.wav")
print(result.num_speakers)  # 3
print(result.speakers)      # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02']
result.to_rttm("meeting.rttm")  # export RTTM
result.model_dump()             # full dict serialization
```
to_rttm
¶
Export segments as RTTM (Rich Transcription Time Marked) format.
RTTM is the standard interchange format for diarization results,
used by evaluation tools like pyannote.metrics and dscore.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `path` | `str \| Path \| None` | If provided, write RTTM to this file path; otherwise just return the RTTM string. |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | RTTM-formatted string. |
Example:

```python
result.to_rttm("output.rttm")  # write to file
rttm_str = result.to_rttm()    # get as string
```
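For reference, a `SPEAKER` record in RTTM carries ten space-separated fields, with `<NA>` placeholders for the unused ones. A minimal writer for one segment might look like this (a sketch, not the library's implementation):

```python
def rttm_line(file_id: str, start: float, end: float, speaker: str) -> str:
    """Format one diarization segment as an RTTM SPEAKER line.

    Fields: record type, file id, channel, onset, duration, two <NA>
    placeholders (orthography, subtype), the speaker name, and two
    more <NA> placeholders (confidence, lookahead).
    """
    duration = end - start
    return (
        f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )

print(rttm_line("meeting", 0.5, 3.2, "SPEAKER_00"))
# SPEAKER meeting 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
```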
to_list
¶
Export segments as a list of plain dicts (JSON-friendly).
| RETURNS | DESCRIPTION |
|---|---|
| `list[dict[str, float \| str]]` | List of dicts with `start`, `end`, and `speaker` keys, one per segment. |
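Because the returned structure is plain data, it serializes directly. A sketch of round-tripping it through JSON (the `start`/`end`/`speaker` keys are assumed from the `Segment` attributes):

```python
import json

# Hypothetical output of result.to_list(): one plain dict per segment.
segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.4, "end": 7.1, "speaker": "SPEAKER_01"},
]

payload = json.dumps(segments, indent=2)  # JSON-friendly by construction
restored = json.loads(payload)
assert restored == segments  # lossless round trip
```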
Segment
¶
Bases: BaseModel
A single diarization segment with start/end times and speaker label.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `start` | `float` | Segment start time in seconds. |
| `end` | `float` | Segment end time in seconds. Must be greater than `start`. |
| `speaker` | `str` | Speaker label, e.g. `"SPEAKER_00"`. |
Example:

```python
seg = Segment(start=0.5, end=3.2, speaker="SPEAKER_00")
print(seg.duration)  # 2.7
```
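The duration is derived from the bounds, and the `end > start` invariant is enforced at construction. A dataclass sketch of the same behavior (the library uses a pydantic `BaseModel`; this stand-in is just for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Seg:
    """Illustrative stand-in for Segment: start/end in seconds plus a label."""
    start: float
    end: float
    speaker: str

    def __post_init__(self) -> None:
        # Mirror the documented invariant: end must be greater than start.
        if self.end <= self.start:
            raise ValueError("end must be greater than start")

    @property
    def duration(self) -> float:
        return self.end - self.start

seg = Seg(start=0.5, end=3.2, speaker="SPEAKER_00")
print(round(seg.duration, 1))  # 2.7
```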
SpeechSegment
¶
Bases: BaseModel
A speech segment detected by VAD (no speaker label yet).
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `start` | `float` | Segment start time in seconds. |
| `end` | `float` | Segment end time in seconds. |
SubSegment
¶
Bases: BaseModel
An embedding window within a speech segment.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `start` | `float` | Window start time in seconds. |
| `end` | `float` | Window end time in seconds. |
| `parent_idx` | `int` | Index of the parent `SpeechSegment` in the VAD output. |
SpeakerEstimationDetails
¶
Bases: BaseModel
Diagnostic details from speaker count estimation.
| ATTRIBUTE | TYPE | DESCRIPTION |
|---|---|---|
| `method` | `str` | Estimation method used (e.g. GMM BIC). |
| `best_k` | `int` | Estimated number of speakers. |
| `pca_dim` | `int` | Number of PCA dimensions used. |
| `k_bics` | `dict[int, float]` | Mapping of candidate speaker count `k` to its BIC score. |
| `reason` | `str \| None` | Short description if estimation was skipped. |
| `cosine_sim_p10` | `float \| None` | 10th percentile of pairwise cosine similarities (populated when the single-speaker pre-check is evaluated). |
Voice Activity Detection¶
run_vad
¶
run_vad(audio_path: str | Path, *, threshold: float = 0.45, min_speech_duration_ms: int = 200, min_silence_duration_ms: int = 50, speech_pad_ms: int = 20) -> list[SpeechSegment]
Detect speech segments using Silero VAD.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `audio_path` | `str \| Path` | Path to the audio file. |
| `threshold` | `float` | VAD probability threshold (0.0 to 1.0). Higher values produce fewer, more confident detections. Default: `0.45`. |
| `min_speech_duration_ms` | `int` | Minimum speech segment duration in milliseconds; shorter segments are discarded. Default: `200`. |
| `min_silence_duration_ms` | `int` | Minimum silence duration in milliseconds required to split speech into separate segments. Default: `50`. |
| `speech_pad_ms` | `int` | Padding added around each detected speech segment, in milliseconds. Default: `20`. |
| RETURNS | DESCRIPTION |
|---|---|
| `list[SpeechSegment]` | List of `SpeechSegment` objects sorted by start time. |
Example:

```python
segments = run_vad("meeting.wav")
for seg in segments:
    print(f"Speech: {seg.start:.2f}s - {seg.end:.2f}s ({seg.duration:.2f}s)")
```
Embedding Extraction¶
extract_embeddings
¶
extract_embeddings(audio_path: str | Path, speech_segments: list[SpeechSegment]) -> tuple[np.ndarray, list[SubSegment]]
Extract 256-dim speaker embeddings using WeSpeaker ResNet34-LM (ONNX).
Long segments are split using a sliding window for more accurate clustering. Each window produces its own embedding.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `audio_path` | `str \| Path` | Path to the audio file (wav, mp3, flac, etc.). |
| `speech_segments` | `list[SpeechSegment]` | Speech segments detected by VAD. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[ndarray, list[SubSegment]]` | A `(N, 256)` float array of embeddings and the corresponding sub-segments, one per embedding row. |

| RAISES | DESCRIPTION |
|---|---|
| `FileNotFoundError` | If `audio_path` does not exist. |
Example:

```python
from diarize.vad import run_vad
from diarize.embeddings import extract_embeddings

segments = run_vad("meeting.wav")
embeddings, subs = extract_embeddings("meeting.wav", segments)
print(embeddings.shape)  # (N, 256)
```
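The sliding-window split over long segments can be sketched as follows. The 1.5 s window and 0.75 s hop are illustrative values only, not the library's actual settings:

```python
def split_segment(start: float, end: float, parent_idx: int,
                  window: float = 1.5, hop: float = 0.75):
    """Yield (start, end, parent_idx) windows covering [start, end].

    Segments shorter than one window yield a single window spanning
    the whole segment; longer ones are tiled with overlap, and the
    final window is clamped flush with the segment end.
    """
    if end - start <= window:
        yield (start, end, parent_idx)
        return
    t = start
    while t + window < end:
        yield (t, t + window, parent_idx)
        t += hop
    yield (end - window, end, parent_idx)  # final window ends exactly at `end`

windows = list(split_segment(0.0, 4.0, parent_idx=0))
```

Each yielded window would become one `SubSegment` and produce its own embedding row.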
Clustering¶
estimate_speakers
¶
estimate_speakers(embeddings: ndarray, min_k: int = 1, max_k: int = 20) -> tuple[int, SpeakerEstimationDetails]
Estimate the number of speakers using GMM BIC.
Algorithm:
- L2-normalise embeddings.
- PCA projection to 8 dimensions (optimal for GMM with full covariance).
- For each k from `min_k` to `max_k`, fit `GaussianMixture(k, covariance_type="full")`.
- Select the k with the minimum BIC.
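The steps above can be sketched with scikit-learn. This is a minimal illustration of the technique, not the library's actual code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def estimate_k(embeddings: np.ndarray, min_k: int = 1,
               max_k: int = 10) -> tuple[int, dict[int, float]]:
    """Pick the speaker count whose full-covariance GMM minimises BIC."""
    # L2-normalise so Euclidean structure reflects cosine geometry.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Project down (to at most 8 dims) before fitting the GMMs.
    dim = min(8, X.shape[0], X.shape[1])
    X = PCA(n_components=dim, random_state=0).fit_transform(X)
    bics: dict[int, float] = {}
    for k in range(min_k, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=0)
        gmm.fit(X)
        bics[k] = gmm.bic(X)
    best_k = min(bics, key=bics.get)  # lowest BIC wins
    return best_k, bics
```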
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embeddings` | `ndarray` | Speaker embeddings of shape `(N, 256)`. |
| `min_k` | `int` | Minimum number of speakers to consider. Default: `1`. |
| `max_k` | `int` | Maximum number of speakers to consider. Default: `20`. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[int, SpeakerEstimationDetails]` | The estimated speaker count and a `SpeakerEstimationDetails` with diagnostic information. |
Example:

```python
k, details = estimate_speakers(embeddings, min_k=1, max_k=10)
print(f"Estimated {k} speakers (PCA dim={details.pca_dim})")
```
cluster_spectral
¶
Cluster embeddings into k speakers using Spectral Clustering.
Uses cosine similarity as the affinity metric, rescaled to [0, 1].
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embeddings` | `ndarray` | Speaker embeddings of shape `(N, 256)`. |
| `k` | `int` | Number of clusters (speakers). |
| RETURNS | DESCRIPTION |
|---|---|
| `ndarray` | Integer label array of shape `(N,)`. |
Example:

```python
labels = cluster_spectral(embeddings, k=3)
print(set(labels))  # {0, 1, 2}
```
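The cosine-affinity construction can be sketched with scikit-learn. An illustrative version of the technique, not the library's exact code:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_cosine(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Cluster rows into k groups using cosine similarity rescaled to [0, 1]."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                 # cosine similarity, in [-1, 1]
    affinity = (sim + 1.0) / 2.0  # rescale to [0, 1] for use as an affinity
    sc = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(affinity)
```

Passing `affinity="precomputed"` lets us hand scikit-learn the rescaled similarity matrix directly instead of its default RBF kernel.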
cluster_auto
¶
cluster_auto(embeddings: ndarray, min_speakers: int = 1, max_speakers: int = 20) -> tuple[np.ndarray, SpeakerEstimationDetails]
Automatically determine speaker count and cluster embeddings.
Combines `estimate_speakers` and `cluster_spectral` in a single call.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embeddings` | `ndarray` | Speaker embeddings of shape `(N, 256)`. |
| `min_speakers` | `int` | Minimum number of speakers. Default: `1`. |
| `max_speakers` | `int` | Maximum number of speakers. Default: `20`. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[ndarray, SpeakerEstimationDetails]` | Label array of shape `(N,)` and the estimation diagnostics. |
cluster_speakers
¶
cluster_speakers(embeddings: ndarray, min_speakers: int = 1, max_speakers: int = 20, num_speakers: int | None = None) -> tuple[np.ndarray, SpeakerEstimationDetails | None]
Cluster speaker embeddings into groups.
If num_speakers is provided, uses that exact number. Otherwise automatically estimates the number of speakers via GMM BIC.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `embeddings` | `ndarray` | Speaker embeddings of shape `(N, 256)`. |
| `min_speakers` | `int` | Minimum number of speakers for auto-detection. Default: `1`. |
| `max_speakers` | `int` | Maximum number of speakers for auto-detection. Default: `20`. |
| `num_speakers` | `int \| None` | If set, skip auto-detection and use this exact number. Default: `None`. |
| RETURNS | DESCRIPTION |
|---|---|
| `tuple[ndarray, SpeakerEstimationDetails \| None]` | Label array of shape `(N,)`, plus estimation details, or `None` if `num_speakers` is explicitly provided (no estimation performed). |
Example:

```python
labels, details = cluster_speakers(embeddings, num_speakers=3)
# or
labels, details = cluster_speakers(embeddings, min_speakers=2, max_speakers=10)
```
Utilities¶
get_audio_duration
¶
Return audio duration in seconds.
Tries soundfile first, falls back to torchaudio.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `audio_path` | `str \| Path` | Path to an audio file. |
| RETURNS | DESCRIPTION |
|---|---|
| `float` | Duration in seconds. |
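The try-one-backend-then-fall-back pattern looks roughly like this. The fallback is shown with the stdlib `wave` module purely for illustration; the actual fallback backend is torchaudio:

```python
import contextlib
import wave

def duration_seconds(path: str) -> float:
    """Return audio duration in seconds, trying soundfile first."""
    try:
        import soundfile as sf  # preferred backend, if installed
        info = sf.info(path)
        return info.frames / info.samplerate
    except Exception:
        # Fallback backend (stdlib wave here, for illustration only).
        with contextlib.closing(wave.open(path, "rb")) as w:
            return w.getnframes() / w.getframerate()
```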
format_timestamp
¶
Format a number of seconds as HH:MM:SS or MM:SS.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| `seconds` | `float` | Time in seconds (non-negative). |
| RETURNS | DESCRIPTION |
|---|---|
| `str` | Human-readable timestamp string. |
Examples:

```python
format_timestamp(45)    # "00:45"
format_timestamp(3661)  # "01:01:01"
```
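The described behavior fits in a few lines; a sketch of it (an illustration, not the library's code):

```python
def fmt_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or HH:MM:SS once an hour is reached."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"{m:02d}:{s:02d}"

print(fmt_timestamp(45))    # 00:45
print(fmt_timestamp(3661))  # 01:01:01
```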