Voice message transcription on Mac in 2026: which apps actually work, per platform

The honest answer is that voice message transcription on macOS is not one problem with one solution. It is six problems with six different states of readiness, one per messaging app you use. The Mac story is worse than the iPhone story for most of them. Here is what each platform actually does in 2026, plus the side nobody writes about, dictating an outbound voice reply without mangling the product names and URLs your message is about.

Matthew Diakonov · 10 min read

Direct answer, verified 2026-05-12

Voice message transcription on Mac is per-platform in 2026. WhatsApp and Telegram have it, but both run on the mobile client and only sync the result to your Mac. Slack has it for paid workspaces. iMessage, Discord, and Signal have nothing built-in. The universal fallback for any platform is a share-extension app on the Mac App Store (Transcriptor, Voicepop), which runs a per-message transcription through the system Share menu. The platform-by-platform breakdown is below, sourced from each vendor's own docs and re-verified on 2026-05-12.

Authoritative references: WhatsApp Help Center on Voice Message Transcripts, Telegram's transcribe API documentation, Transcriptor on the Mac App Store.

The platform-by-platform reality

Most guides on this topic pick one app, usually WhatsApp, and pretend the answer generalises. It does not. Each messaging app has made a different bet on whether to ship transcription, where to run it, and how to gate it. The table below captures the state I verified on 2026-05-12 by reading each vendor's own docs and product pages. The Mac column is the one that matters for the question this page answers, and it is the one most other guides skip.

Platform | Built-in transcription | Works on Mac | Cost
WhatsApp | On-device, mobile-only initiation | Sees synced transcripts from the phone, cannot trigger new ones | Free
Telegram | Cloud, mobile-only initiation | Sees results, cannot initiate | Telegram Premium
Slack | Cloud, all clients | Yes | Paid workspace plans (Pro and above for the full feature; varies)
iMessage | None for voice notes | No | n/a
Discord | None for DM voice messages | No | n/a
Signal | None | No | n/a

WhatsApp

Long-press a voice message on iOS or Android and tap Transcribe. The toggle lives under Settings > Chats > Voice message transcripts.

Telegram

Per Telegram's API docs, desktop clients have no mic capture layer and so cannot run the feature themselves.

Slack

Slack ships audio clip transcription server-side; the macOS client renders the result inline. Free workspaces see clips without transcripts.

iMessage

Live Voicemail and related transcription features are iOS-only. Voice notes shared in a thread are not transcribed by Apple, on Mac or elsewhere.

Discord

Community-run bots (VoiceScriber, others) can transcribe voice channels server-side, but Discord itself does not transcribe DM voice messages.

Signal

This is by design: Signal does not run cloud transcription on user content on any platform.

The universal fallback: share-extension apps

For the three platforms with nothing built-in on Mac (iMessage, Discord, Signal), and for the two that only render mobile-initiated results (WhatsApp, Telegram), the practical workaround is a share-extension app on the Mac App Store. The two most commonly installed are Transcriptor and Voicepop. The workflow is identical for either one.

Per-message share-extension workflow

1. Install the app once
Install Transcriptor or Voicepop from the Mac App Store. Both are free with paid tiers for cloud accuracy or unlimited length.

2. Right-click the voice message
Inside Messages, WhatsApp, Telegram, Signal, or Discord on macOS, right-click the voice clip and pick Share. iMessage exposes Save Audio or Share Audio, WhatsApp Desktop exposes Forward and Share, and Discord lets you Save As to a local .ogg.

3. Pick the transcription app from the Share menu
The share-extension target you installed appears as a system Share destination. Selecting it hands the audio file to the app and opens its window.

4. Wait for the transcript
The app runs either an on-device model (Apple Speech, the default for short clips) or a cloud model (typically the Whisper API, sometimes Deepgram, opt-in for longer clips); a sketch of the on-device path follows these steps. A 30-second clip takes 2 to 6 seconds on-device, 1 to 3 seconds in the cloud.

5. Copy or save the text
The result is a plain transcript, with timestamps optional. Copy it out, paste it into your reply, archive it in Notes, whatever the downstream is. There is no auto-archive-by-conversation feature in either app; that is the structural limit of the share-extension model.
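For concreteness, here is a minimal sketch of the on-device path those apps default to, using Apple's Speech framework on a clip already exported to disk in step 2. The function is illustrative, not either app's source, and it assumes speech-recognition permission has already been granted.

import Speech

// Minimal sketch: transcribe an exported voice-message file with Apple's
// on-device Speech framework, the same model class the share-extension
// apps default to for short clips. Assumes SFSpeechRecognizer.requestAuthorization
// has already returned .authorized.
func transcribeClip(at url: URL) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.isAvailable else {
        print("Speech recognizer unavailable")
        return
    }
    let request = SFSpeechURLRecognitionRequest(url: url)
    request.requiresOnDeviceRecognition = true  // keep the audio off the network

    recognizer.recognitionTask(with: request) { result, error in
        if let result, result.isFinal {
            print(result.bestTranscription.formattedString)
        } else if let error {
            print("Transcription failed: \(error.localizedDescription)")
        }
    }
}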

The other half of the workflow nobody writes about

The pages that rank for this question all focus on the receive side. They assume you are the person reading a voice message someone else sent. That is one half of the workflow. The other half is the reply. A surprising number of users record a voice reply specifically because typing the response would be slower, especially when the message is peppered with product names, URLs, and short technical strings.

When you receive a voice note transcribed as “the empty pee server crashed, ping matthew at fazm dot ai”, you parse it in your head and move on. When you dictate that same sentence into a Mac dictation tool and send it back, your recipient sees word salad. So the inverse problem has a different shape: the transcript has to be producible directly, with product nouns and URL forms intact and no mental cleanup step.

Most dictation tools do not handle this. Apple's built-in dictation has no custom vocabulary API. WhisperKit (the on-device Whisper port that ships in several Mac dictation apps) accepts an initial-prompt hint but its effect on rare proper nouns is weak. The hosted vendors (Deepgram Nova-3, AssemblyAI, OpenAI Whisper API) all offer a vocabulary parameter, but the dictation app on top has to actually wire it in and ship a sensible seed list.

Same dictated sentence, two different transcripts

the empty pee server crashed ping matthew at fazm dot ai and push the new claude sonnet config to versel

  • "empty pee" instead of MCP
  • "matthew at fazm dot ai" stays as words
  • "versel" instead of Vercel
  • "sonnet" lowercased into a poem reference
  • Unsendable as-is; you would retype it by hand

What the pipeline actually has to do

Outbound dictation that produces sendable transcripts has three moving parts. The mic capture layer feeds 16 kHz mono PCM samples into a streaming speech model. The model is pre-biased with a curated vocabulary so it knows that “MCP” is a real word in your dialect. And after the model, a find-and-replace pass rewrites known spoken-form patterns into their text form before any of it lands in a UI (a client-side sketch of that pass follows the diagram below).

Outbound dictation pipeline that survives technical vocabulary

Microphone → streaming speech model (pre-biased with the custom vocabulary) → replacement table → tuned transcript → recipient app
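Here is a minimal client-side sketch of that replacement leg, assuming you run the pass locally rather than as the server-side replace= rules Fazm uses. The rule list is abbreviated and the spacing convention is illustrative.

import Foundation

// Spoken-form rules with embedded spaces so "x at sign y" collapses to "x@y".
let rules: [(find: String, replace: String)] = [
    (" dot com", ".com"),
    (" dot ai", ".ai"),
    (" at sign ", "@"),
]

func applyReplacements(to transcript: String) -> String {
    var text = transcript
    for rule in rules {
        // Case-insensitive so "Dot com" at a sentence start still collapses.
        text = text.replacingOccurrences(of: rule.find, with: rule.replace, options: .caseInsensitive)
    }
    return text
}

// applyReplacements(to: "ping matthew at sign fazm dot ai")
// -> "ping matthew@fazm.ai"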

Fazm's open-source code shows what each leg of this looks like in production. The audio capture front end at Desktop/Sources/AudioCaptureService.swift targets 16000 Hz mono Float32 PCM (see targetSampleRate near line 55). The streaming layer at Desktop/Sources/TranscriptionService.swift uses Deepgram Nova-3 over a WebSocket at wss://api.deepgram.com/v1/listen (line 276) with smart_format, punctuate, interim_results, and endpointing=300 (lines 280 through 288). The vocabulary and the replacement rules attach to the same request as repeated query items:

// TranscriptionService.swift, around line 295
// Each seed term becomes its own repeated keyterm query item.
for term in vocabulary {
    queryItems.append(URLQueryItem(name: "keyterm", value: term))
}

// Spoken-form rules attach only for English and the multi-language mode,
// encoded as find:replace pairs in the replace= parameter.
if language == "multi" || language.hasPrefix("en") {
    for rule in Self.defaultReplacements {
        queryItems.append(URLQueryItem(name: "replace", value: "\(rule.find):\(rule.replace)"))
    }
}
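To see how those items combine, here is a condensed sketch of assembling the full streaming URL. The query parameter names (model, language, smart_format, punctuate, interim_results, endpointing, keyterm) are Deepgram's documented ones; the helper function itself is a simplification, not the Fazm source.

import Foundation

// Condensed sketch of the streaming URL described above.
func makeListenURL(vocabulary: [String], language: String) -> URL? {
    guard var components = URLComponents(string: "wss://api.deepgram.com/v1/listen") else {
        return nil
    }
    var queryItems = [
        URLQueryItem(name: "model", value: "nova-3"),
        URLQueryItem(name: "language", value: language),
        URLQueryItem(name: "smart_format", value: "true"),
        URLQueryItem(name: "punctuate", value: "true"),
        URLQueryItem(name: "interim_results", value: "true"),
        URLQueryItem(name: "endpointing", value: "300"),
    ]
    // One repeated keyterm item per seed term, exactly as in the loop above.
    for term in vocabulary {
        queryItems.append(URLQueryItem(name: "keyterm", value: term))
    }
    components.queryItems = queryItems
    return components.url
}

The API key does not travel in the URL; Deepgram expects it in the WebSocket handshake's Authorization header.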

The 18 spoken-form replacements Fazm ships by default

These are the “dot com” and “at sign” rules that get attached to every English or multi-language transcription request. The table lives as the static let defaultReplacements: [(find: String, replace: String)] property on TranscriptionService, lines 16 through 31. They run server-side at Deepgram, so the segments coming back over the WebSocket already have the substitutions applied. The same rules generalise to any dictation pipeline you build on Nova-3.

Spoken form | Written form
dot com | .com
dot org | .org
dot net | .net
dot io | .io
dot ai | .ai
dot dev | .dev
dot app | .app
dot co | .co
dot me | .me
dot gg | .gg
at sign | @
dot json | .json
dot js | .js
dot ts | .ts
dot py | .py
dot swift | .swift
dot css | .css
dot html | .html

Two notes on this set. The TLDs are not exhaustive: “dot ru” and “dot uk” are missing because the seed list targets the TLDs that show up most often in tech speech. The file extensions (.json, .js, .ts, .py, .swift, .css, .html) only matter if you actually dictate file names; if you do not, they are harmless but unused.
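If your dialect needs more TLDs, the extension is mechanical. A hypothetical sketch, with the shipped list abbreviated:

// Hypothetical: extend the shipped spoken-form rules with your own TLDs.
// defaultReplacements here stands in for the static property on TranscriptionService.
let defaultReplacements: [(find: String, replace: String)] = [
    ("dot com", ".com"), ("dot org", ".org"),  // ...the 16 other shipped rules
]
let customRules: [(find: String, replace: String)] = [
    ("dot ru", ".ru"),
    ("dot uk", ".uk"),
]
// Shipped rules first, custom rules appended; each becomes one replace= item.
let activeRules = defaultReplacements + customRules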

The 18 product nouns in the seed vocabulary

Every name on this list is a token that base Nova-3 mis-hears with high probability and that has near-zero collision with common English. The seed list is small on purpose. Deepgram's docs note the limit is 500 entries per request, but a comment in the Swift source flags that effectiveness drops past ~30 terms because the bias starts colliding with itself. Users add their own terms (colleagues' names, internal tools, CRM stages), and the user list goes first; removing a system term persists into disabledSystemVocabulary so an upgrade does not silently restore it.

Fazm, Claude, Sonnet, Opus, Haiku, Anthropic, MCP, ACP, Supabase, Firestore, PostHog, Sentry, Stripe, Vercel, Deepgram, Whisper, Xcode, SwiftUI, Tauri

The list lives in Desktop/Sources/DeletedTypeStubs.swift near line 657 in github.com/mediar-ai/fazm. Pull it as a starting seed, prune the half that are not in your dialect, add 5 to 10 of your own, and you are at the ~30-term sweet spot.
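The merge behaviour described above (user terms first, system terms filtered by the disabled set) comes down to a few lines. This is an inference from the prose, not a quote from the Fazm source; the property names follow the article, the values are hypothetical.

// Inferred sketch: building the effective keyterm list the way the
// article describes. Property names follow the Fazm source; the logic
// is reconstructed, not quoted.
let systemVocabulary = ["Fazm", "Claude", "Anthropic", "MCP", "Vercel", "Tauri"]  // abbreviated seed
let userVocabulary = ["Acme CRM", "Priya", "LaunchDarkly"]   // hypothetical user additions
let disabledSystemVocabulary: Set<String> = ["Tauri"]        // system terms the user removed

// User terms go first so their bias wins; disabled system terms stay out
// even after an upgrade restores the shipped list.
let effectiveVocabulary = userVocabulary + systemVocabulary.filter {
    !disabledSystemVocabulary.contains($0)
}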

So which side of this matters for you

If the workflow you are trying to fix is “a colleague sends me a 90-second voice note and I want to read it”, install a share-extension app and call it done. The platform-by-platform breakdown above shows you where to wire it in for each app you use. You do not need a fancier tool than that.

If the workflow you are trying to fix is “I keep sending voice replies because typing is too slow but the transcripts I hand off look bad”, the share-extension model is not the right shape. You need a dictation pipeline with custom vocabulary and a replacement table, running close enough to the keyboard that the tuned transcript becomes the thing you actually send. That is the gap a voice-first agent like Fazm sits in, with the caveat that the agent layer does more than dictation: it also routes the transcript to whichever app you have focused (Messages, Slack, the browser, etc.) via macOS accessibility APIs. If all you wanted was a tuned dictation surface and not a full agent, the same Deepgram parameters above will give you that on their own; you just have to build the surrounding app.

The honest summary, then: on the receive side this is a solved problem with off-the-shelf apps. On the send side it is still mostly hand-rolled, and the work is in the vocabulary seed list and the replacement table, not the speech model.

Talk through your dictation workflow

If you are trying to wire voice-driven actions into your daily Mac use, book a 20-minute call and we will look at what your current pipeline is doing wrong.

Frequently asked questions

Does WhatsApp transcribe voice messages on Mac in 2026?

Not directly. WhatsApp shipped Voice Message Transcripts in 2024, but the feature initiates on the mobile client. The Mac client respects the same toggle (Settings, Chats, Voice message transcripts) and will display a transcript that was generated by your phone, but you cannot long-press a voice message inside WhatsApp for Mac and trigger a fresh transcription. If you want the transcript on Mac, the practical workflow is to enable the toggle on your phone and let it sync, or use a share-extension app on macOS to transcribe individual files.

Does Telegram Premium do voice transcription on macOS?

Telegram Premium subscribers can transcribe voice messages, but per Telegram's own API docs the desktop clients (macOS, Windows, Linux, Web K) cannot initiate transcription because they do not have the mic capture layer the feature is wired through. They display transcripts produced by a mobile client. If you are a Telegram Premium user on Mac and you have a Mac-only setup, the transcription simply will not run on your messages until you open them on a phone with Premium signed in.

What about iMessage, Discord, and Signal?

None of the three ship a native voice message transcription feature on macOS as of mid-2026. iMessage has Live Voicemail and various AI features in Messages for iOS, but no on-thread voice-note transcription. Discord has voice channel transcription bots (community-installed) and a server admin can enable AutoMod features, but inbound voice messages in DMs are not transcribed by Discord itself. Signal does not ship any cloud-bound transcription by design, on any platform.

What is the universal fallback if my messaging app does not have it?

Share-extension apps on the Mac App Store. The two most-installed are Transcriptor (also called Transcriptor for WhatsApp and similar variants) and Voicepop. The workflow is identical for each: select a voice message inside the messaging app, hit the system Share menu, choose the transcription app, wait a few seconds for the transcript to appear inside the app's window, copy the text out. Both apps support most languages and most input platforms (WhatsApp, Telegram, Signal, Voice Memos, Slack export files). The catch is that the share-menu approach is per-message and manual; there is no auto-transcribe-on-arrival mode.

Why is dictating voice messages a different problem than transcribing received ones?

Because the transcript is going to a recipient, not to you. If your friend's voice note arrives transcribed as "so I pushed the empty pee config to versel and ping matthew at fazm dot ai", you can mentally repair the spelling and product names while you read. If you dictate that same sentence and it gets sent to your recipient verbatim, they see word salad. Outbound dictation needs vocabulary bias for product nouns and a spoken-form replacement table that collapses "dot com" into ".com" and "at sign" into "@" before the message leaves your Mac. Most consumer transcription apps do neither.

Where can I see exactly what a tuned dictation pipeline looks like in code?

Fazm's transcription class is open-source at github.com/mediar-ai/fazm. The file is Desktop/Sources/TranscriptionService.swift. The keyterm parameters get appended as Deepgram URL query items (line 296), the spoken-form replacements are applied as replace= rules only for English and the multi-language mode (lines 299-305), and the 18 default replacements live as a static defaultReplacements property on the class (lines 16-31). The 18-term seed vocabulary lives in Desktop/Sources/DeletedTypeStubs.swift around line 657, as systemVocabulary: [String]. Everything cited in this guide can be grep'd by name in those files.

Can a single Mac app transcribe voice messages from every platform at once?

Not in the sense of intercepting them automatically as they arrive. macOS does not give third-party apps a system-wide hook to read incoming voice notes from WhatsApp, Telegram, Signal, and Discord all at once. The closest you can get today is one of the share-extension apps acting as a uniform export target, so you have one transcription UX regardless of which app the message came from. The transcription still runs per-message and on-demand.

Are these transcripts private?

Mixed. WhatsApp's built-in transcription runs on-device; the audio never leaves the phone. Telegram Premium and Slack's paid transcription send audio to their respective servers. Most third-party share-extension apps default to an offline model (Apple's on-device Speech framework) but offer a cloud option for better accuracy; the choice is in their settings. Outbound dictation pipelines that use a hosted speech-to-text vendor (Deepgram, OpenAI Whisper API, Google Speech) send the audio to that vendor by definition; the Mac app is just a thin streaming client.

What is the highest-accuracy option for messages with technical jargon?

A hosted model that supports custom vocabulary (Deepgram Nova-3 with keyterm, OpenAI Whisper with a prompt seed, AssemblyAI with word_boost), paired with a list of the proper nouns you actually use. The on-device options on macOS (Apple's Speech framework, WhisperKit) are competitive on common speech but fall apart on names like "Anthropic", "Supabase", "PostHog", "MCP", "ACP", which dominate a tech worker's vocabulary. Hosted vendors plus a curated vocabulary list is the combination that produces sendable transcripts in this niche.

How big should my custom vocabulary be?

Smaller than you would think. Deepgram Nova-3 accepts up to 500 keyterm entries in a single request, but in practice effectiveness drops sharply past about 30 terms because the bias starts colliding with itself. Fazm ships an 18-term seed list deliberately curated to cover only the words that the base model gets wrong with high probability and that do not collide with common English. Adding your own 5 to 10 names on top, the people, companies, internal tools you actually mention, is usually enough.
