Voice recognition transcription: what an action-bound transcript needs that a notes-app one doesn't
Voice recognition transcription is the process of converting captured audio into editable text in real time. The textbook answer stops there. The interesting part is what comes after: the transcript only becomes useful when it lands in the right downstream system, and what that system needs from the transcription layer changes the shape of the whole pipeline.
Direct answer, verified 2026-05-12
Voice recognition transcription converts spoken audio into written text. Production pipelines stream raw PCM (typically 16-bit, 16 kHz mono or stereo) over a WebSocket to a hosted speech model. The model returns interim and final segments with per-word timestamps, confidence scores, and speaker labels. Fazm's implementation, in Desktop/Sources/TranscriptionService.swift in the public repo at github.com/mediar-ai/fazm, uses Deepgram Nova-3 with the parameter set documented at developers.deepgram.com/docs/models-languages-overview. Specifics, including the curated vocabulary and the spoken-form rewrite table, are below.
Two jobs that share a name
Most pages about voice recognition transcription assume the destination is a text file. You speak, words land in a document, a human cleans them up later. The pipeline is a one-way pipe: audio in, prose out, done.
A growing share of the pipelines I see deployed do not work that way at all. The transcript is not the product. The transcript is a prompt that, a few milliseconds later, will be parsed by an LLM and used to drive UI moves on a real desktop: clicks, keystrokes, scrolls, accessibility actions. The user's intent (“reply to that email”, “send the file to Marwan”, “open the last Notion page I had open this morning”) gets satisfied by an agent that reads the macOS accessibility tree and acts inside the app you already have focused.
Both jobs look identical from one foot away: audio streams in, text streams out. At six inches, though, the transcription layer has to make different choices for each one. The rest of this page is about those choices, drawn from the actual production code of one such agent.
Same audio, two different transcripts
ok so push the new claude sonnet config to vercel add matthew at fazm dot ai to the team and check post hog for empty pee call errors
- "empty pee" instead of MCP
- "matthew at fazm dot ai" instead of an email
- "post hog" split into two words
- Fine for a meeting note, useless as a prompt
The pipeline, end to end
Here is the actual call graph between the user holding the Option key and the agent performing an action. Every leg of this is in the public repo at github.com/mediar-ai/fazm and the file names are real.
VOICE -> TEXT -> ACTION
The audio buffer size, 3200 bytes, is exactly 100 ms of 16-bit mono PCM at 16 kHz (16000 samples per second times 2 bytes per sample times 0.1 s). It lives as private let audioBufferSize = 3200 on line 119 of TranscriptionService.swift. The buffer flushes the moment it fills, so latency from microphone to first chunk on the wire is bounded at ~100 ms regardless of how chatty the user is.
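As a sketch of that rule (not the repo's code), the buffering logic reduces to a few lines. The names here, PCMChunker, pendingAudio, and sendFrame, are illustrative stand-ins; only the 3200-byte constant and the flush-on-fill behaviour come from the text above.

```swift
import Foundation

/// Accumulates 16-bit mono PCM at 16 kHz and flushes in ~100 ms chunks.
/// `sendFrame` stands in for whatever writes a binary frame to the socket.
final class PCMChunker {
    /// 16_000 samples/s * 2 bytes/sample * 0.1 s = 3_200 bytes per 100 ms.
    private let audioBufferSize = 3_200
    private var pendingAudio = Data()
    private let sendFrame: (Data) -> Void

    init(sendFrame: @escaping (Data) -> Void) {
        self.sendFrame = sendFrame
    }

    func append(_ samples: Data) {
        pendingAudio.append(samples)
        // Flush the moment a full 100 ms chunk is available, so
        // mic-to-wire latency stays bounded regardless of speech rate.
        while pendingAudio.count >= audioBufferSize {
            sendFrame(pendingAudio.prefix(audioBufferSize))
            pendingAudio.removeFirst(audioBufferSize)
        }
    }

    /// Called on stop: ship whatever partial chunk remains.
    func flushRemainder() {
        guard !pendingAudio.isEmpty else { return }
        sendFrame(pendingAudio)
        pendingAudio.removeAll()
    }
}
```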
Choice #1: vocabulary bias for product nouns
Deepgram Nova-3 lets a client pass keyterm= query parameters to bias the language model toward specific tokens. The hard cap is 500 terms, but a comment in Fazm's source says effectiveness drops past roughly 30, so the shipped seed list is deliberately small and curated. Every name on it is a token the model otherwise mishears with high probability and that has near-zero collision with common English vocabulary.
The 19-term seed vocabulary
- Fazm
- Claude
- Sonnet
- Opus
- Haiku
- Anthropic
- MCP
- ACP
- Supabase
- Firestore
- PostHog
- Sentry
- Stripe
- Vercel
- Deepgram
- Whisper
- Xcode
- SwiftUI
- Tauri
The list is rendered straight from the Swift code, in DeletedTypeStubs.swift near line 657 as static let systemVocabulary: [String]. Users add their own terms from a Dictionary settings panel; user terms go first, then de-duplicated system terms get appended. Anything the user removes from the system list gets persisted into disabledSystemVocabulary so an upgrade does not silently restore a term they killed on purpose.
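For orientation, here is a hedged sketch of that merge-and-attach step in Swift. The property names (userVocabulary, disabledSystemVocabulary) echo the prose above, but the function shape and signatures are mine, not lifted from the repo.

```swift
import Foundation

/// Builds the keyterm list: user terms first, then system terms the user
/// has not disabled, de-duplicated case-insensitively.
func keytermQueryItems(
    userVocabulary: [String],
    systemVocabulary: [String],
    disabledSystemVocabulary: Set<String>
) -> [URLQueryItem] {
    var seen = Set<String>()
    var merged: [String] = []
    let enabledSystemTerms = systemVocabulary.filter { !disabledSystemVocabulary.contains($0) }
    for term in userVocabulary + enabledSystemTerms {
        if seen.insert(term.lowercased()).inserted { merged.append(term) }
    }
    // Nova-3 accepts up to 500 keyterms, but the shipped list stays
    // small because effectiveness reportedly drops past roughly 30.
    return merged.prefix(500).map { URLQueryItem(name: "keyterm", value: $0) }
}
```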
Choice #2: spoken-form rewriting
A notes app can leave “dot com” alone; the writer will fix it. An agent that is about to type a URL into a browser cannot. Deepgram exposes a replace=spoken:written parameter that runs find-and-replace inside the model output. Fazm passes the table below on every English (or multi-language) connection.
Spoken-form replacements (18 rules)
The rules only get attached when the language is en or the auto-detect mode is multi. Pushing “dot com” -> “.com” on a German or Russian transcript would corrupt unrelated words. That conditional is right at the bottom of private func connect() in TranscriptionService.swift, around line 301.
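A minimal sketch of that conditional, assuming the rules are attached as replace= query items the way the prose describes. Only the "dot com" -> ".com" pair is confirmed above; the other pairs in the sketch are illustrative stand-ins, not the shipped 18-rule table.

```swift
import Foundation

/// Attaches spoken-form rewrite rules only for English or auto-detect
/// ("multi") sessions, since the patterns are English-specific.
func replaceQueryItems(language: String) -> [URLQueryItem] {
    guard language == "en" || language == "multi" else { return [] }
    let spokenForms: [(spoken: String, written: String)] = [
        ("dot com", ".com"),
        ("dot ai", ".ai"),   // illustrative, not quoted from the repo
        ("slash", "/"),      // illustrative, not quoted from the repo
    ]
    return spokenForms.map {
        URLQueryItem(name: "replace", value: "\($0.spoken):\($0.written)")
    }
}
```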
Choice #3: separating the user from the meeting
When the agent is supposed to listen during a meeting (not just to the operator), the WebSocket is opened with channels=2 and multichannel=true. Channel 0 carries the mic, channel 1 carries the loopback of system audio (the other people's voices from Zoom, Meet, FaceTime, whatever). Deepgram returns segments tagged with a channel_index array, and Fazm extracts channelIndex = response.channel_index?.first ?? 0 into the transcript struct.
Push-to-talk uses channels=1 instead. The whole prompt is yours, the agent does not need to know who else was on the call, and a single-channel stream is cheaper. The two configurations share the same TranscriptionService class; the only thing the calling code does differently is pass channels: 1 or channels: 2 at init time.
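Sketched out, the caller-side difference and the channel tag extraction look roughly like this. The enum and helper names are illustrative; only the channel counts and the channel_index fallback come from the text above.

```swift
import Foundation

/// Mode selection from the calling code's point of view: push-to-talk is
/// mic only, live conversation adds a loopback channel for system audio.
enum ListeningMode {
    case pushToTalk       // operator's mic only
    case liveConversation // mic on channel 0, system audio on channel 1

    var channels: Int {
        switch self {
        case .pushToTalk: return 1
        case .liveConversation: return 2
        }
    }
}

/// Minimal shape of the per-segment channel tag Deepgram returns.
struct SegmentResponse: Decodable {
    let channel_index: [Int]?
}

func channelIndex(for response: SegmentResponse) -> Int {
    // 0 = the user's microphone, 1 = everyone else via system audio.
    response.channel_index?.first ?? 0
}
```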
Choice #4: surviving the silent dropout
The thing nobody warns you about with streaming transcription: the server-side connection can die silently. The socket reports as open, you keep flushing frames, nothing comes back. The textbook answer is TCP keepalive, but its defaults on macOS are measured in hours, which is several lifetimes for a live caption.
Fazm runs two cooperating timers on top of the WebSocket. Every state transition below comes from named constants near the top of TranscriptionService.swift.
0 ms: connect
WebSocket opens to wss://api.deepgram.com/v1/listen with the model, language, audio format, and parameter set described above. After 500 ms with no socket error, the service marks isConnected = true and starts both timers.
every 8 s: keepalive
Send {"type": "KeepAlive"} as a string frame. If the send callback returns an error, treat the connection as dead and trigger reconnection. If the send succeeds, record lastKeepaliveSuccessAt = Date().
every 30 s: watchdog
Check lastDataReceivedAt. If more than 60 seconds (the staleThreshold) have passed AND the most recent keepalive ALSO failed, the connection is genuinely dead and a forced reconnect fires. The second clause matters: a silent room is not a dead socket.
on disconnect: exponential reconnect
min(pow(2.0, attempts), 32.0) seconds before retry. Up to 10 attempts. After 10, surface an error to the UI instead of looping forever.
on stop: flush then close
Send any remaining audio in the buffer (discarding a half-second of speech would drop the tail of the prompt). Then send CloseStream so Deepgram emits a final result. Then cancel keepalive and watchdog tasks and let the socket close on its own.
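Put together, the two timers and the backoff rule sketch out like this in Swift concurrency. The four constants match the prose; the actor shape, method names, and closures are illustrative, not the repo's.

```swift
import Foundation

/// A sketch of the keepalive + watchdog pair and the reconnect backoff.
actor ConnectionHealth {
    let keepaliveInterval: TimeInterval = 8
    let watchdogInterval: TimeInterval = 30
    let staleThreshold: TimeInterval = 60
    let maxReconnectAttempts = 10

    private var lastDataReceivedAt = Date()
    private var lastKeepaliveFailed = false

    /// Call whenever a transcript frame arrives.
    func noteDataReceived() { lastDataReceivedAt = Date() }

    func runKeepalive(send: @escaping (String) async throws -> Void,
                      onDead: @escaping () -> Void) -> Task<Void, Never> {
        Task {
            while !Task.isCancelled {
                try? await Task.sleep(nanoseconds: UInt64(keepaliveInterval * 1_000_000_000))
                do {
                    try await send(#"{"type": "KeepAlive"}"#)
                    lastKeepaliveFailed = false
                } catch {
                    lastKeepaliveFailed = true
                    onDead() // a failed send means the socket is gone
                }
            }
        }
    }

    func runWatchdog(onDead: @escaping () -> Void) -> Task<Void, Never> {
        Task {
            while !Task.isCancelled {
                try? await Task.sleep(nanoseconds: UInt64(watchdogInterval * 1_000_000_000))
                let stale = Date().timeIntervalSince(lastDataReceivedAt) > staleThreshold
                // Both clauses must hold: silence alone is not a dead socket.
                if stale && lastKeepaliveFailed { onDead() }
            }
        }
    }

    /// Exponential backoff before the next reconnect attempt, capped at 32 s.
    func backoff(attempt: Int) -> TimeInterval {
        min(pow(2.0, Double(attempt)), 32.0)
    }
}
```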
Choice #5: interim vs final
A naive client either renders every segment Deepgram sends (caption jitters wildly as the model revises its guess) or waits for the final segment of an utterance (the caption looks frozen for two seconds). Both feel broken.
Fazm splits the difference. Interim segments (is_final=false) paint a temporary caption strip. The strip clears and gets replaced when the same segment arrives again with is_final=true. The prompt only commits to the LLM on speech_final=true (the model thinks the user finished speaking) OR when the user releases the Option key (push-to-talk explicit end). The endpointing query parameter is set to 300 ms and utterance_end_ms to 1000 ms, so the model uses voice activity detection (300 ms of silence -> segment boundary) as the primary signal and a 1-second all-silence fallback as the backup.
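A compressed sketch of that commit policy, assuming segments arrive carrying Deepgram's is_final and speech_final flags as described; the type names and the commit callback are illustrative.

```swift
import Foundation

struct Segment {
    let text: String
    let isFinal: Bool      // this segment's wording is stable
    let speechFinal: Bool  // the model thinks the utterance ended
}

final class CaptionCommitter {
    private(set) var captionStrip = ""   // what the UI paints right now
    private var committedUtterance = ""  // what eventually goes to the LLM

    func handle(_ segment: Segment, optionKeyReleased: Bool, commit: (String) -> Void) {
        if segment.isFinal {
            // Replace the provisional strip with the stable wording.
            committedUtterance += segment.text + " "
            captionStrip = ""
        } else {
            // Interim guess: paint it, but do not act on it.
            captionStrip = segment.text
        }
        // Only dispatch to the agent when the model says the user is done,
        // or when push-to-talk ends explicitly.
        if segment.speechFinal || optionKeyReleased {
            commit(committedUtterance.trimmingCharacters(in: .whitespaces))
            committedUtterance = ""
        }
    }
}
```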
“Voice recognition transcription is the easy part. The transcription pipeline that produces text an agent can act on is the harder part.”
The vocabulary terms and spoken-form rewrites ship in Desktop/Sources/TranscriptionService.swift and DeletedTypeStubs.swift.
What this means if you are building your own
You can take a generic speech-to-text API and ship a working dictation field in an afternoon. The boring parts are what take the next two weeks: the vocabulary list, the spoken-form rewrites, the channel split, the keepalive-plus-watchdog pair, the interim/final commit logic, and the buffering scheme that keeps latency at ~100 ms without drowning the wire in 20 ms packets. None of those are in the API docs. Most are not in any product's marketing page either, because they sit one layer down from the feature description.
If you are evaluating a voice-driven product, those are the seams to look for. Ask the vendor what their seed vocabulary is. Ask whether spoken URLs come back normalized. Ask how they detect a silent dropout. The answer either exists (in their git history if they are open source, in the engineer's head if not) or the product will have a frustrating quality floor that does not improve with a model upgrade.
For Fazm specifically, every constant in this post is grep-able in the public repo. If something here is wrong or out of date, the file path tells you exactly where to send the pull request.
Want to see the voice path in person?
Bring a real workflow you would automate by voice and we will walk through whether the transcription layer described here actually serves it, live, on a Mac, with the source open in another window.
Common questions about voice recognition transcription
What is voice recognition transcription, in one sentence?
It is the process of converting captured microphone (or system) audio into editable text in real time, usually by streaming raw PCM samples to a hosted speech-to-text model and receiving interim and final segments back over a long-lived connection. Modern pipelines return per-word timestamps, per-word confidence scores, speaker labels (diarization), and "is this segment final yet" flags so a calling app can decide when to commit the text.
Is "voice recognition" the same thing as "transcription"?
They get used interchangeably but they are not exactly the same. Voice recognition (or speech recognition) is the general capability of mapping audio to characters. Transcription is one product built on top of that capability, the one that emits a written record of what was said. Voice command interfaces ("Hey Siri"), voice biometrics (verifying the speaker), and dictation into a focused text field are all also built on voice recognition but are not transcription in the strict sense. When a vendor says "voice recognition transcription," they usually mean speech-to-text optimised for producing readable text.
What model and parameters does Fazm actually use for transcription?
Deepgram Nova-3 over a WebSocket at wss://api.deepgram.com/v1/listen. The audio side is 16-bit PCM, 16 kHz sample rate, 2 channels (mic on channel 0, system audio on channel 1) for live conversation mode, or 1 channel for push-to-talk. Chunks are buffered to ~100 ms (3200 bytes of int16 at 16 kHz) before sending. The connection adds smart_format, punctuate, diarize, vad_events, interim_results, endpointing=300 ms, and utterance_end_ms=1000 ms. The model and these constants live in Desktop/Sources/TranscriptionService.swift, lines 94 through 99 and 274 through 305, in the public repo at github.com/mediar-ai/fazm.
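For the curious, those parameters would assemble into the streaming URL roughly like this; encoding=linear16 is Deepgram's name for raw 16-bit PCM and is my assumption here rather than a value quoted from the repo.

```swift
import Foundation

// A sketch of the connection URL, using only the values named above.
var components = URLComponents(string: "wss://api.deepgram.com/v1/listen")!
components.queryItems = [
    URLQueryItem(name: "model", value: "nova-3"),
    URLQueryItem(name: "encoding", value: "linear16"),   // raw 16-bit PCM (assumed value)
    URLQueryItem(name: "sample_rate", value: "16000"),
    URLQueryItem(name: "channels", value: "2"),          // 1 for push-to-talk
    URLQueryItem(name: "multichannel", value: "true"),
    URLQueryItem(name: "smart_format", value: "true"),
    URLQueryItem(name: "punctuate", value: "true"),
    URLQueryItem(name: "diarize", value: "true"),
    URLQueryItem(name: "vad_events", value: "true"),
    URLQueryItem(name: "interim_results", value: "true"),
    URLQueryItem(name: "endpointing", value: "300"),
    URLQueryItem(name: "utterance_end_ms", value: "1000"),
]
let url = components.url!  // keyterm= and replace= items get appended per session
```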
Why does an action agent need different transcription tuning than a notes app?
Because the transcript is going straight into a prompt that controls a real machine, not into a markdown file a human will edit. Two specific failure modes get expensive fast. First, technical proper nouns like MCP, Sonnet, Supabase, or Anthropic get heard as "Empty Pee," "sonnet" lowercased into a Shakespeare reference, "super base," or "Anthropics" with an s. Second, spoken URL forms like "go to fazm dot ai slash download" get transcribed literally and then the agent types a query like that into the address bar. Fazm fixes both at the transcription layer by passing keyterm= for a curated 19-term vocabulary and replace= for 18 spoken-form rewrites, so the text the LLM sees has already been normalised.
What are the 19 terms in Fazm's system vocabulary?
Fazm, Claude, Sonnet, Opus, Haiku, Anthropic, MCP, ACP, Supabase, Firestore, PostHog, Sentry, Stripe, Vercel, Deepgram, Whisper, Xcode, SwiftUI, Tauri. The full list is in Desktop/Sources/DeletedTypeStubs.swift, around line 657, as systemVocabulary: [String]. A comment a few lines above says Nova-3 caps total keyterms at 500 but effectiveness drops past roughly 30 terms, which is why the seed list stays curated. Users can disable any of them from the Dictionary panel in Settings, and removals get persisted to disabledSystemVocabulary so an upgrade does not undo a deliberate edit.
How does the connection stay alive during silence?
Two timers. A keepalive task wakes every 8 seconds and sends {"type": "KeepAlive"} over the socket so the server does not reap the connection on idle. A watchdog task wakes every 30 seconds and checks the timestamp of the last received data; if more than 60 seconds have passed AND the last keepalive also failed, the connection is treated as silently dead and a reconnect is forced. Reconnection is exponential backoff capped at 32 seconds, up to 10 attempts, before giving up. All four constants (keepaliveInterval=8.0, watchdogInterval=30.0, staleThreshold=60.0, maxReconnectAttempts=10) are visible at the top of TranscriptionService.swift.
Does Fazm send audio to anyone other than Deepgram?
No. The Swift app opens a single WebSocket from your Mac to api.deepgram.com and writes raw PCM frames into it. There is no Fazm-operated audio relay. The Deepgram API key is resolved at runtime either from a DEEPGRAM_API_KEY environment variable or via a small backend key endpoint, but the backend only hands a key back; it never sees the audio. If you want to swap Deepgram out (for Whisper, an on-device WhisperKit build, or an internal STT gateway), the integration is one Swift file.
Is interim vs final transcription a real distinction or marketing?
Real. With interim_results=true, Deepgram emits provisional transcripts every few hundred milliseconds while you are still speaking, then later sends the same segment again with is_final=true once it has decided that segment is stable. A speech_final=true flag fires when the model thinks you have stopped speaking entirely. Fazm uses interim segments to render the live caption strip during push-to-talk (so the UI does not feel laggy) but only commits the prompt to the agent once speech_final arrives or the user releases the Option key. Without that distinction you either dispatch every guess (the agent loops on partial commands) or you wait for the full final string (the bar feels frozen).
What language coverage does the pipeline support?
About 38 language codes for single-language mode and 14 codes for the multi (auto-detect) mode (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, Dutch, plus regional variants like en-AU, en-GB, fr-CA, es-419, pt-BR). The full list is in DeletedTypeStubs.swift around lines 702-730. The spoken-form replacement table ("dot com" -> ".com") is only applied for English and the multi mode, because the underlying patterns are English-specific and would be noise in other languages.
What is the practical difference between push-to-talk and live mode?
Push-to-talk holds the Option key (or double-taps Option to lock-listen), captures only the user's microphone (1 channel), and sends the captured prompt as a discrete query to the agent when the key is released. Live mode opens a 2-channel stream where channel 0 is mic and channel 1 is loopback system audio (your speakers), so the transcript carries both sides of a meeting. Live mode also turns on diarize=true so the words from each speaker get labelled separately. Both modes share the same TranscriptionService.swift class, just instantiated with channels=1 versus channels=2.
Can I see this in real code or do I have to take the post's word for it?
Real code. Fazm is MIT-licensed at github.com/mediar-ai/fazm. The transcription class is Desktop/Sources/TranscriptionService.swift (711 lines). The vocabulary and language tables are in Desktop/Sources/DeletedTypeStubs.swift (search for systemVocabulary). The push-to-talk state machine is Desktop/Sources/FloatingControlBar/PushToTalkManager.swift (866 lines). The audio capture front end is Desktop/Sources/AudioCaptureService.swift. Every constant cited in this guide can be grep'd by name in those files.
Related, from the same source repo
Open source AI voice agent: how the same Deepgram WebSocket becomes an action prompt
Companion piece on the open-source side of the voice path: every file in the chain, the swap-in seam if you want to replace Deepgram, and how the transcript becomes a tool call.
Voice agent desktop workflow handoff: the three code paths nobody describes
Once the transcript becomes a prompt, what happens to the long-running desktop run that was already in flight? Enqueue, interrupt-and-replace, or stop-without-replace.
Parakeet vs Whisper on a Mac voice agent: which engine for which job
If you are tempted to swap Deepgram for an on-device model, this is the parameter-by-parameter comparison between NVIDIA Parakeet and OpenAI Whisper for a desktop agent.