Voice to text transcription software in 2026: the two axes every shortlist forgets
Every guide to voice to text transcription software ranks the same eight names on the same four columns: word accuracy, language count, monthly price, and the list of apps it integrates with. Useful, but incomplete. There are two other questions that decide whether you will be happy with the product six months in, and almost no shortlist asks them.
Direct answer, verified 2026-05-12
There is no single best voice to text transcription software. There are roughly seven categories of product, and the right one depends on two questions the popular reviews skip. First, will a human read the transcript, or will another machine act on it? Second, can you open the source and read the vocabulary and replacement rules? For human reading on async media, Descript or Sonix. For real-time meetings, Otter. For cross-app dictation, Wispr Flow. For an action layer that does the work the words imply, Fazm (the source is at github.com/mediar-ai/fazm).
The seven categories that map to actual jobs
Strip the brand names away and the market for voice to text transcription software falls into seven categories. Most reviews mix products from three or four of them into a single ranked list, which is how a piece of software optimized for editing a podcast ends up in the same row as one optimized for typing into Slack.
| Category | Canonical pick | Also in category | Transcript target | Source-readable |
|---|---|---|---|---|
| Real-time meetings | Otter | Fireflies, Read.ai, Zoom AI Companion | Human reading the transcript afterwards | No |
| Async media (podcast, video) | Descript | Sonix, Trint, Rev | Human editor or producer | No |
| Cross-app dictation | Wispr Flow | Superwhisper, MacWhisper | A focused text field in a foreground app | Partial (Superwhisper, MacWhisper expose model choice) |
| Professional dictation (legal, medical) | Dragon Professional | Philips SpeechLive, M*Modal | Domain-specific document templates | No |
| On-device privacy | Apple Dictation | WhisperKit-based apps, Apple Voice Control | Human, never leaves the device | Apple is closed; WhisperKit is open |
| Vendor speech-to-text APIs | Deepgram Nova-3 | OpenAI Whisper, AssemblyAI, Google STT | An app you build yourself | Closed model, open parameters |
| Action agents (transcript drives a machine) | Fazm | Cursor voice mode, custom Whisper + LLM stacks | Another machine that acts on the text | Yes (Fazm is open source) |
The two columns on the right are the ones the popular shortlists skip. The next two sections take each in turn.
Axis 1: human reader, or machine actor
Voice to text reviews implicitly assume a human is going to read the output. That assumption used to be safe. It is not safe in 2026. A growing share of dictated text is going straight into another model, or into a desktop automation layer, or into the address bar of a browser the agent is driving. Once the consumer of the transcript stops being a human, the failure modes change.
Two specific failures get expensive fast. The first is technical proper nouns. When the speaker says “use the Anthropic SDK with the Sonnet model”, a base model with no biasing tends to write “use the Anthropics S D K with the sonnet model”. A human reading those words pattern-matches in a tenth of a second; a downstream agent treats the strings literally and looks up the wrong package or the wrong model identifier. The second is spoken URL syntax. When the speaker says “go to fazm dot ai slash download”, the transcript reads literally “go to fazm dot ai slash download”. A human reads it as fazm.ai/download and moves on; an agent types the literal string into the URL bar and fails.
Why this matters for shopping
If you ever plan to feed the transcript into another machine (a coding agent, a CRM importer, a desktop automation layer), you need software that exposes two specific knobs: vocabulary biasing for proper nouns, and a spoken-form replacement table for URL and email syntax. Most consumer transcription apps expose neither.
Axis 2: can you read the rules
The closed-source SaaS options dominate the popular shortlists, and for note-bound use they are fine. They become a liability the moment your transcript drives anything important downstream, because you cannot answer the question “why did it transcribe it like that” without filing a support ticket and waiting.
Open-source options give you a different deal. The model itself can still be a hosted service, but the wiring around it (which parameters get sent, which words get pre-loaded, which spoken forms get rewritten before the LLM sees them) is in a file you can grep. For Fazm, that file is Desktop/Sources/TranscriptionService.swift in the public repository at github.com/mediar-ai/fazm. The seed vocabulary lives in Desktop/Sources/DeletedTypeStubs.swift around line 657 as systemVocabulary: [String].
The actual parameters one open-source pipeline sends
Below is the WebSocket URL the desktop app builds when it connects to its speech-to-text provider. Reading it is the fastest way to understand what knobs are exposed and why each one is there. The same parameter shape works for any Deepgram Nova-3 client; only the vocabulary list and the replacement rules are product choices.
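(The version shown here is reconstructed from Deepgram's documented Nova-3 parameters and the vocabulary and replacement rules quoted further down this page, rather than pasted verbatim from TranscriptionService.swift; treat the endpointing value and the parameter order as illustrative.)

```text
wss://api.deepgram.com/v1/listen
  ?model=nova-3
  &language=multi
  &encoding=linear16
  &sample_rate=16000
  &channels=2
  &multichannel=true
  &diarize=true
  &interim_results=true
  &endpointing=300
  &smart_format=true
  &keyterm=Fazm&keyterm=Claude&keyterm=Anthropic&keyterm=...
  &replace=dot%20com:.com&replace=at%20sign:@&replace=...
```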
Two things to notice. The audio side is unremarkable: 16-bit linear PCM, 16 kHz, two channels (mic on channel 0, system audio on channel 1), buffered into 100 ms chunks before each send. The transcript side carries every knob a real consumer would want to inspect: which model is being called, whether interim results are returned, how aggressive endpointing is, which custom vocabulary is being biased for, which spoken forms are being rewritten before the calling app ever sees the text.
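As a back-of-envelope check on what that audio format costs in bandwidth (a sketch, not code from the repo):

```swift
// Size of one 100 ms send at the format described above:
// 16-bit linear PCM, 16 kHz, two channels.
let sampleRate = 16_000              // samples per second, per channel
let bytesPerSample = 2               // 16-bit linear PCM
let channels = 2                     // mic on channel 0, system audio on channel 1
let chunkSeconds = 0.1               // 100 ms buffer per WebSocket send

let bytesPerChunk = Int(Double(sampleRate) * chunkSeconds) * bytesPerSample * channels
print(bytesPerChunk)                 // 6400 bytes per send, roughly 64 KB/s upstream
```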
The 19-term seed vocabulary
Here is the full list of system terms the open-source pipeline ships with. The list is deliberately small; Deepgram Nova-3 accepts up to 500 keyterm entries, but in practice past about 30 the bias starts colliding with itself, so the seed list covers only words the base model gets wrong with high probability and that do not collide with common English. The user is expected to add their own 5 to 10 names on top.
systemVocabulary in DeletedTypeStubs.swift, line 657
- Fazm
- Claude
- Sonnet
- Opus
- Haiku
- Anthropic
- MCP
- ACP
- Supabase
- Firestore
- PostHog
- Sentry
- Stripe
- Vercel
- Deepgram
- Whisper
- Xcode
- SwiftUI
- Tauri
Source: github.com/mediar-ai/fazm. The list is editable per user; disabled terms drop out via disabledSystemVocabulary.
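A minimal sketch of how that filtering could look, assuming a user-supplied `userVocabulary` array alongside the `disabledSystemVocabulary` set named above; the repo's actual code may differ.

```swift
// Illustrative only: combine the seed terms with the user's additions,
// then drop anything the user has explicitly disabled.
let systemVocabulary = ["Fazm", "Claude", "Anthropic"]    // ...the 19 seed terms listed above
let userVocabulary = ["Mediar", "Nova"]                   // hypothetical user additions
let disabledSystemVocabulary: Set<String> = ["Tauri"]     // hypothetical disabled term

let effectiveVocabulary = (systemVocabulary + userVocabulary)
    .filter { !disabledSystemVocabulary.contains($0) }
// Each remaining term becomes one keyterm= entry on the listen URL shown earlier.
```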
The 18 spoken-form replacement rules
Deepgram’s replace= parameter accepts a colon-separated find and replacement string, and the URL builder appends one per rule. Ten of the rules collapse spoken domain TLDs, one converts the spoken email symbol, and seven cover the source-file extensions a developer actually says out loud. The rules are only added when the language is English or in multi-language mode, because the spoken patterns themselves are English.
| Spoken | Written |
|---|---|
| dot com | .com |
| dot org | .org |
| dot net | .net |
| dot io | .io |
| dot ai | .ai |
| dot dev | .dev |
| dot app | .app |
| dot co | .co |
| dot me | .me |
| dot gg | .gg |
| at sign | @ |
| dot json | .json |
| dot js | .js |
| dot ts | .ts |
| dot py | .py |
| dot swift | .swift |
| dot css | .css |
| dot html | .html |
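A minimal sketch of how each row above could become a `replace=` query item, assuming Deepgram's colon-separated find-and-replace format; the builder shown here is illustrative, not quoted from TranscriptionService.swift.

```swift
import Foundation

// Each spoken-form rule becomes one replace= query item in the
// "find:replacement" form; URLComponents percent-encodes the spaces.
// Per the text above, these are only appended when the session language
// is English or multi.
let spokenFormRules: [(spoken: String, written: String)] = [
    ("dot com", ".com"),
    ("at sign", "@"),
    ("dot swift", ".swift"),
    // ...the remaining rules from the table above
]

var components = URLComponents(string: "wss://api.deepgram.com/v1/listen")!
components.queryItems = spokenFormRules.map { rule in
    URLQueryItem(name: "replace", value: "\(rule.spoken):\(rule.written)")
}
// e.g. ?replace=dot%20com:.com&replace=at%20sign:@&replace=dot%20swift:.swift
```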
“On clean English with a decent mic, every name on the popular shortlists lands in the same word-accuracy band. The interesting differences live in vocabulary biasing, replacement rules, and whether you can read either.”
Field test, May 2026
How to choose, in five honest steps
Decide who reads the transcript
If a human reads it, the popular shortlists are fine. If another machine reads it, you need vocabulary biasing and spoken-form rewriting, and most popular options do not expose either.
Decide whether you need to audit the rules
If the transcripts feed a regulated workflow, an action layer, or a coding agent, the ability to read the vocabulary list and the replacement table out of source code stops being a nice-to-have.
Run a 60-second test of your own audio
Record yourself speaking the way you actually speak, jargon included. Run the same clip through three candidates. Count substitutions on proper nouns. Vendor-published word error rate (WER) figures do not survive contact with your microphone and your vocabulary.
Dictate a URL and an email
Say `go to fazm dot ai slash download and email matt at fazm dot ai`. Look at the transcripts. The number of products that emit `fazm.ai/download` and `matt@fazm.ai` instead of the literal spoken form is small.
Pick the smallest box that fits the job
If you only need to type into apps, pick a dictation product. If you need to act through apps, pick an agent that includes transcription. Buying both separately almost never produces the joined-up experience the agent path implies.
What is honest to say about Fazm here
Fazm is not a transcription product. It is a macOS computer-use agent that uses transcription as the front end. If your job is to produce a polished podcast transcript or to caption a recorded webinar, do not buy Fazm; buy Descript or Sonix. If your job is to drive your laptop with your voice (run a browser action, file an invoice, update a CRM record, write and ship code), then a tuned transcription pipeline is a dependency, not a product, and the differentiator becomes whether the agent's transcription was tuned for that job or bolted on afterwards. Fazm's is tuned for it, which is why the configuration is in the open-source repo and why this entire page can quote it line by line.
Walk through your voice-to-action workflow with us
If you are evaluating voice-controlled desktop automation for a small business, book a working session and we will set up the vocabulary biasing, the replacement rules, and a couple of agent flows against your real apps.
Frequently asked questions
What does voice to text transcription software actually do under the hood in 2026?
Almost all of the modern options stream raw PCM audio (typically 16-bit, 16 kHz, mono or stereo) over a long-lived connection to a hosted speech-to-text model and receive interim and final segments back. The model returns per-word timestamps, per-word confidence scores, speaker labels (diarization), and an `is_final` flag so the calling app can decide when to commit text. Older Windows-era apps like Dragon still ship with on-device acoustic models, but the cloud-streamed shape is the default for any 2024 or newer product. The differentiator between apps is mostly which model they call (Deepgram Nova, OpenAI Whisper, Google STT, AssemblyAI, Apple SpeechAnalyzer, Apple Speech, WhisperKit on-device) and how the resulting text is post-processed before display.
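As a rough picture of that shape, a simplified Deepgram-style final-result message could be modeled like this; field names vary by vendor, and this is illustrative, not any product's exact schema.

```swift
// Simplified, Deepgram-style shape of one streaming result message.
struct StreamingResult: Decodable {
    struct Word: Decodable {
        let word: String
        let start: Double        // seconds from stream start
        let end: Double
        let confidence: Double
        let speaker: Int?        // present when diarization is enabled
    }
    struct Alternative: Decodable {
        let transcript: String
        let confidence: Double
        let words: [Word]
    }
    struct Channel: Decodable {
        let alternatives: [Alternative]
    }
    let channel: Channel
    let isFinal: Bool            // the flag the calling app uses to commit text

    enum CodingKeys: String, CodingKey {
        case channel
        case isFinal = "is_final"
    }
}
```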
What is a realistic accuracy floor today?
On clean, single-speaker English in a quiet room with a decent USB or laptop mic, every name on the popular shortlists (Otter, Sonix, Rev, Wispr Flow, Apple Dictation, Deepgram Nova-3, OpenAI Whisper Large) lands in the 93-97 percent word accuracy range. On noisy audio, multi-speaker, or audio with technical proper nouns, the spread widens fast. The vendors that publish their own benchmarks (Deepgram, AssemblyAI, OpenAI) are not lying about their numbers; they just measured on the friendliest possible audio. For your own audio the only reliable benchmark is to run a 20-minute sample through three candidates and count substitution errors yourself.
What is the difference between dictation software and transcription software?
Dictation software targets a focused text field in a foreground app and types into it as you speak. Transcription software ingests pre-recorded or streaming audio and produces a saved text artifact, usually with timestamps. The boundary blurs in 2026: Wispr Flow is sold as dictation but emits a polished transcript, while Otter is sold as transcription but ships a desktop dictation feature. The boundary that still matters is whether the output is destined for a human reader or for another machine to act on. That is the axis that changes the engineering, not the marketing label.
Is open-source voice to text transcription software actually viable?
Yes for the model layer (OpenAI Whisper, NVIDIA NeMo Parakeet, WhisperKit on Apple Silicon all run locally with respectable accuracy), and yes for the app layer if you accept that the consumer-grade product polish is shallower than the SaaS options. The advantage is not price; cloud Whisper API is cheap, and Apple Dictation is free. The advantage is auditability. With an open-source app you can read the vocabulary list, the spoken-form replacement table, the parameters passed to whichever model is being called, and the post-processing logic. With Sonix or Otter you cannot. If your transcripts feed an action layer or sit inside a regulated workflow, that visibility becomes load-bearing.
What are spoken-form replacement rules and why do they matter?
When a person dictates a URL, they say `fazm dot ai slash download`. The base model transcribes that literally. If the transcript is for a human, the human silently reads `fazm.ai/download` and moves on. If the transcript is going to drive a browser, the agent types `fazm dot ai slash download` into the address bar and fails. Spoken-form replacement rules collapse spoken phrases into their written equivalents at the transcription layer, before any downstream component sees the text. Deepgram exposes this as a `replace=` URL parameter; OpenAI Whisper has prompt seeding that performs a similar function less precisely. Fazm ships 18 default replacements (ten domain TLDs, the at sign, seven source-file extensions). The list is in the public repository and is editable per user.
What is keyterm or vocabulary biasing and when does it pay off?
Vocabulary biasing pre-loads the model with terms it would otherwise treat as unfamiliar word salad. The most common failures are technical proper nouns (`Anthropic`, `Supabase`, `MCP`, `ACP`, `Claude`, `Sonnet`, `PostHog`), product names that collide with common English (`Sonnet` becoming the Shakespeare reference, `Stripe` becoming the verb), and names of people on your team. Deepgram Nova-3 calls these `keyterm` and accepts up to 500 per request, but effectiveness drops sharply past about 30 because the bias collides with itself. Fazm seeds 19 system terms and lets the user add their own. The seeded list is visible in the source as `systemVocabulary: [String]`. If your transcripts include any technical jargon at all, this single feature usually moves accuracy more than picking a different vendor.
Which option is best if privacy is the gating concern?
Apple Dictation (built into macOS and iOS) and any WhisperKit-based app run on-device with no audio leaving the machine. Apple Voice Control is also fully on-device. The trade is that on-device options have a smaller vocabulary surface than cloud Nova-3 or Whisper Large and tend to be worse on technical jargon. If privacy is the gate but you still want hosted-grade accuracy, look for a product that lets you bring your own API key for an EU-resident inference endpoint, or one that supports a custom API endpoint so you can route through a privacy-respecting proxy.
Which option is best for a small business that wants to dictate emails, invoices, and CRM updates?
If the work is purely typing into apps, Wispr Flow and Apple Dictation cover most of it; Wispr Flow is easier to live with day to day on macOS in 2026. If the work is `do this thing for me` rather than `type this out`, you have crossed into desktop agent territory, and the right product is one that pairs transcription with computer control (Fazm, or a chained setup of dictation app plus separate agent). The category boundary matters because dictation apps stop after the text is typed; an agent runs the actions implied by the text.
What does Fazm actually do differently from a dictation app?
Fazm transcribes voice into intent, then drives the rest of macOS through accessibility APIs to carry that intent out. The transcription layer is tuned for action: keyterm vocabulary biasing for the names the user mentions most, spoken-form replacement so URLs and email addresses become typeable strings, multichannel capture so the user's mic and the system audio land on separate channels with separate speaker labels. The repository is `github.com/mediar-ai/fazm` and the relevant Swift file is `Desktop/Sources/TranscriptionService.swift`. The vocabulary seed is in `Desktop/Sources/DeletedTypeStubs.swift` around line 657.
How do I evaluate a voice to text transcription product without a marketing trial?
Three concrete tests. First, record a one-minute clip of yourself speaking the way you actually speak, jargon included, and run it through every candidate. Compare substitution errors on the proper nouns. Second, dictate a short email that includes a URL, an email address, and a code-style token like `app.json`. Verify whether the spoken-form rewrites happened. Third, look at whether you can read the vocabulary and replacement rules without filing a support ticket. The third test eliminates most of the popular shortlist and surfaces the open-source candidates plus the few SaaS products with a real custom-vocabulary settings panel.
Other guides on transcription and voice control on macOS
Related field notes
Voice recognition transcription: action-bound vs notes-bound
Field notes from one shipping macOS agent on why a transcript that controls a real machine needs different tuning than a transcript a human will read.
Voice message transcription on Mac in 2026
Per-platform field guide for transcribing voice notes from WhatsApp, Telegram, iMessage, Discord, and Signal on macOS, plus the universal share-extension fallback.
Parakeet vs Whisper for a Mac voice agent
Two on-device transcription models compared on the actual job a desktop agent has to do.