Speaker Diarization for AI Meeting Agents - Who Said What
Recording a meeting is easy. Knowing who said what is the hard part. Speaker diarization - the process of segmenting audio by speaker identity - is what separates a raw transcript from an actually useful meeting record.
Why Attribution Matters
"We should ship this next week" means very different things depending on whether the CEO or an intern said it. AI meeting agents that produce a wall of undifferentiated text lose the most important context: who committed to what, who disagreed, and who asked the question that changed the direction.
Action item extraction, decision tracking, and follow-up assignment all depend on knowing which speaker said which words.
How Diarization Works
Modern speaker diarization combines several techniques:
Voice embeddings. Each speaker's voice gets converted to a numerical representation (embedding) that captures their unique vocal characteristics - pitch, tone, speaking rate, formants. Models such as ECAPA-TDNN, or the embedding models shipped with the pyannote toolkit, generate these embeddings.
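Once each segment has an embedding, comparing speakers reduces to comparing vectors, typically with cosine similarity. A minimal sketch with synthetic vectors (the 192-dimensional size mirrors common ECAPA-TDNN output; the embeddings here are random stand-ins, not real model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 192-dim embeddings standing in for model output.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=192)
speaker_a_later = speaker_a + rng.normal(scale=0.1, size=192)  # same voice, slight variation
speaker_b = rng.normal(size=192)

print(cosine_similarity(speaker_a, speaker_a_later))  # high: same speaker
print(cosine_similarity(speaker_a, speaker_b))        # near zero: different speakers
```

Two segments from the same speaker score close to 1.0; unrelated voices land near 0, which is what makes the clustering step possible.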
Clustering. Audio segments get grouped by speaker similarity. The system does not need to know who the speakers are in advance - it just identifies that segments A, C, and F sound like the same person, and segments B, D, and E sound like another.
Voice Activity Detection (VAD). Before diarization, the system identifies which parts of the audio contain speech versus silence, background noise, or music. This prevents the model from trying to attribute non-speech audio.
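Production VAD uses learned models, but the core idea can be shown with a naive energy threshold: split the audio into short frames and flag the ones loud enough to be speech. A toy sketch (the frame size and threshold are illustrative assumptions):

```python
import numpy as np

def simple_energy_vad(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Naive energy-based VAD: flag frames whose RMS exceeds a threshold.
    Real systems use trained models, but the framing structure is the same."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        flags.append(rms > threshold)
    return flags

# Synthetic audio: 1 s of near-silence, then 1 s of a louder tone.
sr = 16000
t = np.arange(sr) / sr
silence = 0.001 * np.random.default_rng(2).normal(size=sr)
tone = 0.2 * np.sin(2 * np.pi * 220 * t)
flags = simple_energy_vad(np.concatenate([silence, tone]), sr)
```

Only the flagged frames are passed on to embedding and clustering, which keeps keyboard clatter and silence from being assigned to a speaker.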
The Overlap Problem
Real conversations have interruptions, cross-talk, and simultaneous speech. Single-channel recordings make these segments extremely difficult to separate. The best results come from multi-channel audio (separate microphones per speaker) or spatial audio that preserves directional information.
Practical Accuracy
Current state-of-the-art diarization achieves roughly 90-95% accuracy on clean recordings with 2-4 speakers. Accuracy drops with more speakers, background noise, and similar-sounding voices. For meeting agents, this means occasional misattribution that requires human correction on critical decisions.
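Diarization accuracy is usually reported as diarization error rate (DER): missed speech, false-alarm speech, and speaker confusion, summed and divided by total reference speech time. A frame-level sketch (simplified: it assumes hypothesis labels are already mapped to reference labels, and skips the forgiveness collar real scoring tools apply):

```python
def frame_der(ref, hyp):
    """Frame-level diarization error rate (simplified sketch).
    ref/hyp are per-frame speaker labels; None marks non-speech.
    Assumes hypothesis labels are pre-mapped to reference labels;
    real DER scoring uses an optimal assignment and a collar."""
    speech = sum(1 for r in ref if r is not None)
    missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    confusion = sum(1 for r, h in zip(ref, hyp)
                    if r is not None and h is not None and r != h)
    return (missed + false_alarm + confusion) / speech

ref = ["A", "A", "A", "B", "B", None, "B", "B", "A", "A"]
hyp = ["A", "A", "B", "B", "B", None, "B", None, "A", "A"]
print(frame_der(ref, hyp))  # one confused frame + one missed frame over 9 speech frames
```

A DER around 5-10% corresponds to the 90-95% figure above: tolerable for skimming a meeting, but enough misattributed seconds that critical quotes deserve a manual check.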
The takeaway: diarization is good enough to be useful, but not good enough to be trusted blindly on high-stakes attribution.
Fazm is an open source macOS AI agent, available on GitHub.