API Reference

Main Pipeline

diarize

diarize(audio_path: str | Path, *, min_speakers: int = 1, max_speakers: int = 20, num_speakers: int | None = None) -> DiarizeResult

Run the full speaker diarization pipeline on an audio file.

Pipeline stages:

  1. Silero VAD — detect speech segments
  2. WeSpeaker ResNet34-LM — extract 256-dim speaker embeddings
  3. GMM BIC — estimate number of speakers (unless num_speakers is provided)
  4. Spectral Clustering — assign speaker labels
PARAMETER DESCRIPTION
audio_path

Path to an audio file (wav, mp3, flac, etc.).

TYPE: str | Path

min_speakers

Minimum number of speakers for auto-detection.

TYPE: int DEFAULT: 1

max_speakers

Maximum number of speakers for auto-detection.

TYPE: int DEFAULT: 20

num_speakers

If set, skip auto-detection and use this exact number of speakers.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
DiarizeResult

DiarizeResult containing segments, speaker info, and export methods.

Example::

from diarize import diarize

result = diarize("meeting.wav")
print(f"Found {result.num_speakers} speakers")
for seg in result.segments:
    print(f"  [{seg.start:.1f} - {seg.end:.1f}] {seg.speaker}")
result.to_rttm("meeting.rttm")

Result Types

DiarizeResult

Bases: BaseModel

Result of speaker diarization.

This is the main object returned by diarize.diarize.

ATTRIBUTE DESCRIPTION
segments

Diarization segments sorted by start time.

TYPE: list[Segment]

audio_path

Path to the source audio file.

TYPE: str

audio_duration

Duration of the source audio in seconds.

TYPE: float

estimation_details

Diagnostic info from speaker count estimation.

TYPE: SpeakerEstimationDetails | None

Example::

result = diarize("meeting.wav")
print(result.num_speakers)      # 3
print(result.speakers)          # ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02']
result.to_rttm("meeting.rttm")  # export RTTM
result.model_dump()             # full dict serialization

segments class-attribute instance-attribute

segments: list[Segment] = Field(default_factory=list)

audio_path class-attribute instance-attribute

audio_path: str = ''

audio_duration class-attribute instance-attribute

audio_duration: float = Field(default=0.0, ge=0)

estimation_details class-attribute instance-attribute

estimation_details: SpeakerEstimationDetails | None = None

speakers property

speakers: list[str]

Sorted list of unique speaker labels.

num_speakers property

num_speakers: int

Number of unique speakers detected.

to_rttm

to_rttm(path: str | Path | None = None) -> str

Export segments as RTTM (Rich Transcription Time Marked) format.

RTTM is the standard interchange format for diarization results, used by evaluation tools like pyannote.metrics and dscore.

PARAMETER DESCRIPTION
path

If provided, write RTTM to this file path. Otherwise just return the RTTM string.

TYPE: str | Path | None DEFAULT: None

RETURNS DESCRIPTION
str

RTTM-formatted string.

Example::

result.to_rttm("output.rttm")  # write to file
rttm_str = result.to_rttm()    # get as string
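Each segment becomes one RTTM SPEAKER record: type, file id, channel, onset, duration, two <NA> placeholders, the speaker label, and two more <NA>s. A minimal sketch of such a writer — the file id ("meeting") and the three-decimal formatting are illustrative assumptions, not guaranteed to match this library's output byte-for-byte:

```python
def to_rttm_lines(segments: list[dict], file_id: str) -> str:
    """Render segment dicts as RTTM SPEAKER records (illustrative sketch)."""
    lines = [
        # SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <label> <NA> <NA>
        f"SPEAKER {file_id} 1 {s['start']:.3f} {s['end'] - s['start']:.3f} "
        f"<NA> <NA> {s['speaker']} <NA> <NA>"
        for s in segments
    ]
    return "\n".join(lines) + "\n"

print(to_rttm_lines([{"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"}], "meeting"))
# SPEAKER meeting 1 0.500 2.700 <NA> <NA> SPEAKER_00 <NA> <NA>
```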

to_list

to_list() -> list[dict[str, float | str]]

Export segments as a list of plain dicts (JSON-friendly).

RETURNS DESCRIPTION
list[dict[str, float | str]]

List of {"start": float, "end": float, "speaker": str} dicts.
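Because the dicts contain only floats and strings, the output can be passed straight to json.dumps. A quick round-trip sketch with hand-written dicts in the documented shape (the values are illustrative):

```python
import json

# Dicts in the shape produced by to_list(); values here are made up.
segments = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.4, "end": 7.1, "speaker": "SPEAKER_01"},
]

payload = json.dumps(segments, indent=2)
assert json.loads(payload) == segments  # lossless round trip
```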

Segment

Bases: BaseModel

A single diarization segment with start/end times and speaker label.

ATTRIBUTE DESCRIPTION
start

Segment start time in seconds.

TYPE: float

end

Segment end time in seconds. Must be greater than start.

TYPE: float

speaker

Speaker label, e.g. "SPEAKER_00".

TYPE: str

Example::

seg = Segment(start=0.5, end=3.2, speaker="SPEAKER_00")
print(seg.duration)  # 2.7

duration property

duration: float

Duration of the segment in seconds.

SpeechSegment

Bases: BaseModel

A speech segment detected by VAD (no speaker label yet).

ATTRIBUTE DESCRIPTION
start

Segment start time in seconds.

TYPE: float

end

Segment end time in seconds.

TYPE: float

duration property

duration: float

Duration in seconds.

SubSegment

Bases: BaseModel

An embedding window within a speech segment.

ATTRIBUTE DESCRIPTION
start

Window start time in seconds.

TYPE: float

end

Window end time in seconds.

TYPE: float

parent_idx

Index of the parent SpeechSegment.

TYPE: int

SpeakerEstimationDetails

Bases: BaseModel

Diagnostic details from speaker count estimation.

ATTRIBUTE DESCRIPTION
method

Estimation method used (e.g. "gmm_bic").

TYPE: str

best_k

Estimated number of speakers.

TYPE: int

pca_dim

Number of PCA dimensions used.

TYPE: int | None

k_bics

Mapping of k -> BIC values evaluated.

TYPE: dict[int, float]

reason

Short description if estimation was skipped.

TYPE: str | None

cosine_sim_p10

10th percentile of pairwise cosine similarities (populated when single-speaker pre-check is evaluated).

TYPE: float | None
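Since k_bics records the BIC for every candidate k, best_k is simply the key with the lowest BIC. A sketch with hypothetical BIC values:

```python
# Hypothetical k -> BIC values, as stored in SpeakerEstimationDetails.k_bics.
k_bics = {1: 5123.4, 2: 4890.1, 3: 4951.7, 4: 5010.2}

best_k = min(k_bics, key=k_bics.get)  # k with the minimum BIC
print(best_k)  # 2
```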


Voice Activity Detection

run_vad

run_vad(audio_path: str | Path, *, threshold: float = 0.45, min_speech_duration_ms: int = 200, min_silence_duration_ms: int = 50, speech_pad_ms: int = 20) -> list[SpeechSegment]

Detect speech segments using Silero VAD.

PARAMETER DESCRIPTION
audio_path

Path to the audio file.

TYPE: str | Path

threshold

VAD probability threshold (0.0 to 1.0). Higher values produce fewer, more confident detections.

TYPE: float DEFAULT: 0.45

min_speech_duration_ms

Minimum speech segment duration in milliseconds. Segments shorter than this are discarded.

TYPE: int DEFAULT: 200

min_silence_duration_ms

Minimum silence duration in milliseconds required to split speech into separate segments.

TYPE: int DEFAULT: 50

speech_pad_ms

Padding added around each detected speech segment in milliseconds.

TYPE: int DEFAULT: 20

RETURNS DESCRIPTION
list[SpeechSegment]

List of SpeechSegment with timestamps in seconds, sorted by start time.

Example::

segments = run_vad("meeting.wav")
for seg in segments:
    print(f"Speech: {seg.start:.2f}s - {seg.end:.2f}s ({seg.duration:.2f}s)")

Embedding Extraction

extract_embeddings

extract_embeddings(audio_path: str | Path, speech_segments: list[SpeechSegment]) -> tuple[np.ndarray, list[SubSegment]]

Extract 256-dim speaker embeddings using WeSpeaker ResNet34-LM (ONNX).

Long segments are split using a sliding window for more accurate clustering. Each window produces its own embedding.

PARAMETER DESCRIPTION
audio_path

Path to the audio file (wav, mp3, flac, etc.).

TYPE: str | Path

speech_segments

Speech segments detected by VAD.

TYPE: list[SpeechSegment]

RETURNS DESCRIPTION
tuple[ndarray, list[SubSegment]]

A (embeddings, subsegments) tuple where:

  • embeddings --- np.ndarray of shape (N, 256) with raw speaker embeddings (not yet L2-normalised; normalisation is applied later during clustering).
  • subsegments --- list of SubSegment objects that record the time window and parent segment index for each embedding row.
RAISES DESCRIPTION
FileNotFoundError

If audio_path does not exist.

Example::

from diarize.vad import run_vad
from diarize.embeddings import extract_embeddings

segments = run_vad("meeting.wav")
embeddings, subs = extract_embeddings("meeting.wav", segments)
print(embeddings.shape)  # (N, 256)

Clustering

estimate_speakers

estimate_speakers(embeddings: ndarray, min_k: int = 1, max_k: int = 20) -> tuple[int, SpeakerEstimationDetails]

Estimate the number of speakers using GMM BIC.

Algorithm:

  1. L2-normalise embeddings.
  2. PCA projection to 8 dimensions (optimal for GMM with full covariance).
  3. For each k from min_k to max_k, fit GaussianMixture(k, covariance_type="full").
  4. Select k with the minimum BIC.
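The four steps above can be sketched with scikit-learn. This is a simplified re-implementation on synthetic data, not the library's exact code; edge cases (fewer samples than PCA dimensions, degenerate fits) are glossed over:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def estimate_k(embeddings: np.ndarray, min_k: int = 1, max_k: int = 5) -> int:
    # 1. L2-normalise each embedding.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # 2. PCA projection to at most 8 dimensions.
    X = PCA(n_components=min(8, X.shape[0], X.shape[1])).fit_transform(X)
    # 3. Fit a full-covariance GMM for each candidate k and record its BIC.
    bics = {}
    for k in range(min_k, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
        gmm.fit(X)
        bics[k] = gmm.bic(X)
    # 4. Select the k with the minimum BIC.
    return min(bics, key=bics.get)

# Two tight, well-separated clusters -> should estimate 2 speakers.
rng = np.random.default_rng(0)
a = rng.normal([5.0, 0.0, 0.0, 0.0], 0.05, size=(40, 4))
b = rng.normal([0.0, 5.0, 0.0, 0.0], 0.05, size=(40, 4))
print(estimate_k(np.vstack([a, b])))  # 2
```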
PARAMETER DESCRIPTION
embeddings

Speaker embeddings of shape (N, D).

TYPE: ndarray

min_k

Minimum number of speakers to consider.

TYPE: int DEFAULT: 1

max_k

Maximum number of speakers to consider.

TYPE: int DEFAULT: 20

RETURNS DESCRIPTION
tuple[int, SpeakerEstimationDetails]

A (best_k, details) tuple where best_k is the estimated speaker count and details is a SpeakerEstimationDetails instance with diagnostic information.

Example::

k, details = estimate_speakers(embeddings, min_k=1, max_k=10)
print(f"Estimated {k} speakers (PCA dim={details.pca_dim})")

cluster_spectral

cluster_spectral(embeddings: ndarray, k: int) -> np.ndarray

Cluster embeddings into k speakers using Spectral Clustering.

Uses cosine similarity as the affinity metric, rescaled to [0, 1].

PARAMETER DESCRIPTION
embeddings

Speaker embeddings of shape (N, D).

TYPE: ndarray

k

Number of clusters (speakers).

TYPE: int

RETURNS DESCRIPTION
ndarray

Integer label array of shape (N,).

Example::

labels = cluster_spectral(embeddings, k=3)
print(set(labels))  # {0, 1, 2}
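The cosine-affinity construction described above can be sketched with scikit-learn's SpectralClustering. A simplified re-implementation, not the library's exact code:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_cosine_spectral(embeddings: np.ndarray, k: int) -> np.ndarray:
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                 # cosine similarity, in [-1, 1]
    affinity = (sim + 1.0) / 2.0  # rescale to [0, 1]
    sc = SpectralClustering(n_clusters=k, affinity="precomputed", random_state=0)
    return sc.fit_predict(affinity)

# Two groups of near-identical directions -> two clean clusters.
rng = np.random.default_rng(0)
a = rng.normal([5.0, 0.0, 0.0], 0.1, size=(10, 3))
b = rng.normal([0.0, 5.0, 0.0], 0.1, size=(10, 3))
labels = cluster_cosine_spectral(np.vstack([a, b]), k=2)
```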

cluster_auto

cluster_auto(embeddings: ndarray, min_speakers: int = 1, max_speakers: int = 20) -> tuple[np.ndarray, SpeakerEstimationDetails]

Automatically determine speaker count and cluster embeddings.

Combines :func:estimate_speakers and :func:cluster_spectral in a single call.

PARAMETER DESCRIPTION
embeddings

Speaker embeddings of shape (N, D).

TYPE: ndarray

min_speakers

Minimum number of speakers.

TYPE: int DEFAULT: 1

max_speakers

Maximum number of speakers.

TYPE: int DEFAULT: 20

RETURNS DESCRIPTION
tuple[ndarray, SpeakerEstimationDetails]

A (labels, details) tuple where labels is an integer array of shape (N,) and details is a SpeakerEstimationDetails instance.

cluster_speakers

cluster_speakers(embeddings: ndarray, min_speakers: int = 1, max_speakers: int = 20, num_speakers: int | None = None) -> tuple[np.ndarray, SpeakerEstimationDetails | None]

Cluster speaker embeddings into groups.

If num_speakers is provided, uses that exact number. Otherwise automatically estimates the number of speakers via GMM BIC.

PARAMETER DESCRIPTION
embeddings

Speaker embeddings of shape (N, D).

TYPE: ndarray

min_speakers

Minimum number of speakers for auto-detection.

TYPE: int DEFAULT: 1

max_speakers

Maximum number of speakers for auto-detection.

TYPE: int DEFAULT: 20

num_speakers

If set, skip auto-detection and use this exact number.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
tuple[ndarray, SpeakerEstimationDetails | None]

A (labels, details) tuple. details is None when num_speakers is explicitly provided (no estimation performed).

Example::

labels, details = cluster_speakers(embeddings, num_speakers=3)
# or
labels, details = cluster_speakers(embeddings, min_speakers=2, max_speakers=10)

Utilities

get_audio_duration

get_audio_duration(audio_path: str | Path) -> float

Return audio duration in seconds.

Tries soundfile first, falls back to torchaudio.

PARAMETER DESCRIPTION
audio_path

Path to an audio file.

TYPE: str | Path

RETURNS DESCRIPTION
float

Duration in seconds, or 0.0 if the file cannot be read.
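The underlying arithmetic is frames / sample_rate. It can be shown with the stdlib wave module (WAV-only; the library itself uses soundfile with a torchaudio fallback):

```python
import wave

def wav_duration(path: str) -> float:
    # duration = number of frames / frames per second
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Demo: write one second of 16 kHz mono silence, then measure it.
with wave.open("silence.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)            # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(wav_duration("silence.wav"))  # 1.0
```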

format_timestamp

format_timestamp(seconds: float) -> str

Format a number of seconds as HH:MM:SS or MM:SS.

PARAMETER DESCRIPTION
seconds

Time in seconds (non-negative).

TYPE: float

RETURNS DESCRIPTION
str

Human-readable timestamp string.

Examples::

format_timestamp(45)    # "00:45"
format_timestamp(3661)  # "01:01:01"
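The HH:MM:SS / MM:SS switch comes down to two divmod calls. A sketch matching the documented examples (how the library rounds fractional seconds is an open assumption here; this version truncates):

```python
def format_timestamp(seconds: float) -> str:
    total = int(seconds)                    # truncate fractional seconds
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"      # omit hours when zero

print(format_timestamp(45))    # 00:45
print(format_timestamp(3661))  # 01:01:01
```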