APRIL 2026 / TTS ON APPLE SILICON

Local text-to-speech AI: three paths, one keyword

Every top SERP result for this phrase names Piper, Coqui, Bark, Tortoise and hand-waves past what you actually want. You are asking for one of three things (privacy, offline capability, or low latency) and the right stack differs for each. This guide covers all three, names the macOS-native path most articles skip, and ends with the exact file paths where Fazm chose cloud TTS instead, along with how you would swap in a local model.

Matthew Diakonov
10 min read
Written from the Fazm source tree
AVSpeechSynthesizer + Personal Voice
Piper, Kokoro, Coqui, Bark compared
Exact Deepgram Aura call (line 943)
24 kHz linear16 with 0.25 to 2.0x rate
Swap-in path for a local TTS server

STEP 1, SPLIT THE KEYWORD

Three intents that share one search

Before picking a model or a library, pick which problem you are solving. The three intents point at different stacks, and mixing them is how people end up running a 4 GB model on a laptop that had a perfectly good built-in voice ready to go.

Local for privacy

The text you speak must never touch a network. Medical notes, legal memos, personal journals. You want the bytes to stay on the same machine they were typed on. Here, any TTS that runs fully on-device qualifies: AVSpeechSynthesizer, Piper, Kokoro via MLX, Personal Voice. Cloud TTS is out, even with a data-processing agreement.

Local for offline

The audio must keep playing on a plane, in a tunnel, in a locked-down network. Same set of tools as 'privacy', but the bar is the absence of the network, not the absence of a vendor contract.

Local for latency

You want time-to-first-audio under 200 ms. Local can hit this. So can a well-streamed cloud endpoint on a fiber connection. This is the intent where 'local' is often the wrong answer.

What this guide does with the three

Walks the macOS-native path first (AVSpeechSynthesizer, premium voices, Personal Voice), then the open-weights path (Piper, Kokoro, Coqui, Bark), then the honest cloud comparison (Deepgram, ElevenLabs, Cartesia, OpenAI). At the end, shows where Fazm landed and why, with file paths you can read.

STEP 2, THE MAC PATH

AVSpeechSynthesizer is the one nobody writes about

The most-overlooked local TTS on a Mac is the one that shipped with the OS. Since macOS Sonoma (14), the premium and enhanced voice downloads are neural, and the API lets you pipe text into the audio unit with a single call. Personal Voice, also from Sonoma on, lets a user record about fifteen minutes of read-aloud material and then synthesize their own voice, fully on-device, after the training pass. Zero dependencies, zero downloads, zero network. This path gets mentioned in almost no "local TTS AI" SERP result.

swift-say.swift (50 lines, fully local, no dependencies)
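The referenced script did not survive into this page; a minimal sketch of that shape, using only the system framework, might look like the following (the fallback string and the run-loop polling are illustrative choices, not the original file's contents):

```swift
import AVFoundation
import Foundation

// swift-say.swift: speak the command-line arguments with the built-in voice.
// Build and run: swiftc swift-say.swift -o swift-say && ./swift-say "Hello there"
let text = CommandLine.arguments.dropFirst().joined(separator: " ")
let utterance = AVSpeechUtterance(
    string: text.isEmpty ? "Hello from a fully local voice." : text)

// Premium and enhanced voices downloaded in System Settings are picked up
// by the same API; "en-US" selects the best available English voice.
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate

let synthesizer = AVSpeechSynthesizer()
synthesizer.speak(utterance)

// A command-line process exits immediately unless we keep the run loop
// alive while the synthesizer streams to the audio unit.
RunLoop.current.run(until: Date().addingTimeInterval(0.5))
while synthesizer.isSpeaking {
    RunLoop.current.run(until: Date().addingTimeInterval(0.1))
}
```

No package manager, no model download: the whole local pipeline is the `AVFoundation` import.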

The reasons you might still reach past this into an open-weights model are real but specific: cross-platform portability (you want the same pipeline on Linux servers), SOTA voice quality that beats even the Apple premium voices, or specific languages that Apple does not ship well. For an English-speaking Mac user, the built-in route is the fastest path from "string" to "audio that sounds human", and it is entirely local.

STEP 3, THE OPEN-WEIGHTS PATH

Piper, Kokoro, Coqui, Bark

When you need portability, voice variety, or a model you can audit, the open-weights world has four names that come up in every serious conversation. They are not equal; picking the right one depends on latency requirements, voice quality expectations, and whether you are running on CPU or Apple Silicon accelerators.

1. Piper (Rhasspy)

ONNX-based. Voices are single 50 to 100 MB files. Runs fast on CPU. Sound is good for conversational content, a hair robotic compared to the SOTA. The most deployed local TTS in homelabs for a reason: boring, reliable, cheap. Easy to expose behind a tiny HTTP server.

2. Kokoro TTS

The 2025/2026 open model that changed the conversation. Natural prosody, multiple voices, ships as ONNX and has MLX and Core ML conversions that run happily on Apple Silicon. Slightly heavier first-chunk latency than Piper but much better voice character.

3. Coqui XTTS

Voice cloning plus multilingual. Higher quality than Piper, slower. Good for asynchronous pipelines, content tooling, narration workflows. Less ideal for real-time agents because time-to-first-audio is measured in seconds, not milliseconds.

4. Bark, Tortoise, F5-TTS

Impressive on highlight reels, painful for production real-time. Long generation times, cold starts, sometimes unstable outputs. Fine as a research toy or a batch tool, wrong choice for a chat agent that speaks.

A practical modern stack is: Piper for low-latency general use, Kokoro when quality matters, AVSpeechSynthesizer when you are Mac only and want zero dependencies. If you are deploying a Mac AI agent and choosing among these, Kokoro and the built-in Apple voices are the two serious contenders in April 2026.

0 ms: AVSpeechSynthesizer import cost on macOS

A Mac AI agent that speaks has five options, and four of them are genuinely local. We chose the one that is not, and the decision is not the cliche you expect.

THE ANCHOR

The exact call Fazm makes, copied from the source

This is the half of the category nobody else in the SERP writes about: a Mac AI agent that has to speak in response to an LLM, in real time, and had to pick a TTS. Fazm did not pick local. The code is in Desktop/Sources/Providers/ChatToolExecutor.swift, in the executeSpeakResponse function, starting around line 943.

ChatToolExecutor.swift (line 943 to 1000, condensed)
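The condensed block did not make it into this page; the following sketch is reconstructed from the details called out below and in the FAQ (the endpoint, query parameters, UserDefaults key, clamping range, and the static player are all described in this article; the error strings and the placeholder auth header are illustrative):

```swift
import AVFoundation
import Foundation

enum ChatToolExecutor {
    // Held at class scope, not in the function, so ARC does not
    // deallocate the player before the first sample plays.
    private static var ttsAudioPlayer: AVAudioPlayer?

    static func executeSpeakResponse(_ args: [String: Any]) async -> String {
        guard let text = args["text"] as? String else { return "missing text" }

        // User-configurable speed from UserDefaults, clamped to 0.25...2.0x.
        let speed = UserDefaults.standard
            .object(forKey: "voiceResponseSpeed") as? Double ?? 1.0
        let clampedSpeed = min(max(speed, 0.25), 2.0)

        // Deepgram Aura: 24 kHz linear16, playable without further decoding.
        var components = URLComponents(string: "https://api.deepgram.com/v1/speak")!
        components.queryItems = [
            URLQueryItem(name: "model", value: "aura-luna-en"),
            URLQueryItem(name: "encoding", value: "linear16"),
            URLQueryItem(name: "sample_rate", value: "24000"),
        ]
        var request = URLRequest(url: components.url!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue("Token YOUR_DEEPGRAM_KEY", forHTTPHeaderField: "Authorization") // placeholder
        request.httpBody = try? JSONSerialization.data(withJSONObject: ["text": text])

        do {
            let (audioData, _) = try await URLSession.shared.data(for: request)
            let player = try AVAudioPlayer(data: audioData)
            player.enableRate = true           // disabled by default
            player.rate = Float(clampedSpeed)  // client-side, no round-trip
            ttsAudioPlayer = player            // retain beyond this function
            player.play()
            return "speaking"
        } catch {
            return "TTS failed: \(error.localizedDescription)"
        }
    }
}
```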

Three things in that block are worth calling out because they are the shape of every "local or not" decision in a real Mac AI agent:

model: aura-luna-en, sample_rate: 24000, encoding: linear16

Deepgram Aura is the voice. 24 kHz is the Goldilocks sample rate for speech (high enough to sound human, low enough to download fast), and linear16 means the response is raw PCM bytes the audio unit can play without further decoding. Time-to-first-byte on a reasonable connection is 150 to 250 ms, which is effectively the same as a warm local model and faster than any of the open-weights options cold.

player.enableRate = true; player.rate = Float(clampedSpeed)

AVAudioPlayer has a speed control, but it is disabled by default. The enableRate flag is what actually lets you play a file at 0.25 to 2.0x without re-rendering the audio. Users who want the AI to talk faster can do it client-side; nothing round-trips to the vendor for speed changes.

private static var ttsAudioPlayer: AVAudioPlayer?

The audio player is held at class scope, not in the function. If you build a local TTS player and do not retain the AVAudioPlayer beyond the function, ARC will deallocate it before the first sample plays and you will hear nothing. This is a footgun most local-TTS examples do not warn about, and it bites every Mac developer the first time.

Why cloud TTS, when Fazm is otherwise Mac-native

The visible logic: the agent loop is already networked. The reasoning model is Claude or a custom endpoint; the transcription step is Deepgram. Speech synthesis joining that group adds one more HTTP call to a pipeline that already has two, and in exchange buys a nicer voice and lower CPU usage. A local TTS inside a network-dependent agent does not actually protect privacy; the text was already sent to the reasoning endpoint. It only protects the audio bytes from Deepgram, which is a weaker property than most users think.

The voice response path, and where local would plug in

Claude response → speak_response → user text → TTS endpoint → AVAudioPlayer → rate control → user speaker

The hub is the swappable part. A local TTS server (Piper or Kokoro behind a small Swift or Python process, bound to localhost, returning linear16 PCM bytes) would drop in with essentially zero changes to the rest of the agent. The left side (tool call, text argument, LLM response) and the right side (AVAudioPlayer, rate, speaker) stay identical.

What you would see if you swapped to a local Piper voice

Practical example. Download a Piper ONNX voice, run the Piper HTTP server, and point the agent's speak endpoint at localhost. No Fazm source changes required beyond the URL.

swapping Deepgram Aura for a local Piper voice
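The swap itself can be sketched as the same client shape pointed at localhost. Everything here is an assumption about your local server, not something Fazm defines: the port, the `/speak` path, and the JSON request body are whatever contract your Piper (or Kokoro) wrapper exposes:

```swift
import AVFoundation
import Foundation

// Same POST-text-get-audio shape as the Deepgram call, aimed at a
// hypothetical local TTS server on port 5000. Adjust the URL and body
// to match your server's actual contract.
func speakLocally(_ text: String) async throws -> AVAudioPlayer {
    var request = URLRequest(url: URL(string: "http://127.0.0.1:5000/speak")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: ["text": text])

    let (audioData, _) = try await URLSession.shared.data(for: request)
    let player = try AVAudioPlayer(data: audioData)
    player.enableRate = true   // keep the client-side rate control working
    player.play()
    return player              // caller must retain this, or playback stops
}
```

Nothing on either side of the hub changes: the tool call still hands over a string, and AVAudioPlayer still receives bytes.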

The same pattern works with Kokoro (run it under MLX or Core ML on an M-series Mac) and with AVSpeechSynthesizer (wrap it in a tiny local HTTP shim if you want the same vendor-swap shape). Choose by voice quality, not by architecture.

The numbers that usually settle the debate

0 MB: AVSpeechSynthesizer install size
50 to 100 MB: Piper voice on disk
24 kHz: Fazm TTS sample rate
<200 ms: First-audio target

Put these next to each other and the pattern is clear. If you are already on a Mac, 0 MB to ship the built-in voice beats any model download. If you need portability or a specific voice character, 50 to 100 MB for a Piper voice is trivial. The hardware-level encoder in the Apple media engine (the same one Fazm uses for video capture) is not used for TTS, but the same rule applies: pick the path that leaves your CPU alone while sounding human.

Local TTS vs cloud TTS, head to head

The comparison the SERP does not make. Six dimensions that matter when you ship a real voice product, not six bullet points from a feature page.

Feature | Cloud TTS (Deepgram, ElevenLabs, Cartesia) | Local TTS (AVSpeech, Piper, Kokoro)
Time-to-first-audio (warm) | 150 to 300 ms (streamed) | 100 to 400 ms
Voice quality (SOTA) | SOTA for English in 2026 | Good, Kokoro nearly SOTA
Privacy of audio bytes | Sent to vendor, subject to policy | Never leaves the machine
Works offline | No | Yes, always
Cost per million characters | $15 to $150 depending on voice | $0 after install
Voice variety | Dozens, plus instant cloning | Limited per model
CPU usage during speech | None, audio is just downloaded | Low to medium, model dependent
Best fit | Quality-sensitive consumer apps | Privacy, offline, long sessions

The honest version: for a privacy-sensitive workflow or an offline laptop, local wins cleanly. For a voice-UX product targeting consumers, cloud wins on quality.

The names you will see in the same search

Short list of what shows up in the "local TTS AI" SERP with a one-line note, so you do not pick the wrong tool for your job.

AVSpeechSynthesizer: Mac-built-in, neural premium voices
Personal Voice: your voice, trained on-device
Piper: ONNX TTS, 50-100 MB voices, fast on CPU
Kokoro: SOTA-grade open voice, MLX friendly
Coqui XTTS: multilingual + voice cloning, slower
Bark: research-grade, long generation times
F5-TTS: newer, voice cloning, batch oriented
Deepgram Aura: streaming cloud, what Fazm uses
ElevenLabs: quality leader, cloud only
Cartesia Sonic: low-latency cloud streaming
OpenAI TTS: GPT-4o voices, cloud

Picking your stack in sixty seconds

You are on a Mac and want zero dependencies

AVSpeechSynthesizer with the premium voices. Zero install, zero network, audio starts in under 100 ms. If you also want your own voice, enable Personal Voice in System Settings and use the same API.
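If you enable Personal Voice, the "same API" claim can be sketched like this. The authorization call and the `isPersonalVoice` trait are the Apple-documented pieces (macOS 14+ / iOS 17+); the utterance text and the fallback choice are illustrative:

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()

// Personal Voice requires explicit one-time user authorization.
AVSpeechSynthesizer.requestPersonalVoiceAuthorization { status in
    // A trained Personal Voice shows up in the normal voice list,
    // flagged with the isPersonalVoice trait.
    let personal = AVSpeechSynthesisVoice.speechVoices()
        .first { $0.voiceTraits.contains(.isPersonalVoice) }

    let utterance = AVSpeechUtterance(
        string: "Spoken in my own voice, entirely on-device.")
    // Fall back to a system voice if nothing is trained or authorized.
    utterance.voice = (status == .authorized ? personal : nil)
        ?? AVSpeechSynthesisVoice(language: "en-US")
    synthesizer.speak(utterance)
}
```

The rest of the code path is identical to any other AVSpeechSynthesizer call, which is the point: one API, system voices and your own voice alike.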

You need a portable local stack

Piper for low-latency defaults, Kokoro for quality. Package the ONNX file with your app, expose a tiny localhost server, and point any client at it. Works on Mac, Linux, Windows.

You are building an AI agent that already talks to a vendor

Be honest about the privacy story. If the reasoning endpoint is cloud, adding local TTS does not make the product 'private'. Fazm made this call explicitly and uses Deepgram Aura for speech.

You want the SOTA voice and cost is secondary

Cloud wins in April 2026: ElevenLabs, Cartesia Sonic, Deepgram Aura. Route through whichever has the voice you want and the latency your UX demands.

Want to see Fazm speak, or route its voice through a local model?

Book 15 minutes. We can show the speak_response tool firing live, open ChatToolExecutor.swift at line 943, and sketch a Piper-behind-localhost swap for your setup.

Book a call

Questions readers ask after this page

What counts as 'local' in local text-to-speech AI?

Three different things, and most guides blur them. 'Local for privacy' means the text never leaves the machine: the model weights and inference run on your Mac. 'Local for offline' means the feature keeps working on a plane or in a Faraday bag: no network dependency, ever. 'Local for latency' means you want sub-second time-to-first-audio, which can actually be achieved by either a local model or a streaming cloud API, depending on hardware and bandwidth. Before picking a stack, pick which meaning you actually care about. Most people say 'local' but really want 'offline' or 'private'.

What are the best actually-local TTS options on a Mac in 2026?

Four families worth naming. First, Apple's AVSpeechSynthesizer with the 'premium' and 'enhanced' voice downloads, which are neural and have been quietly excellent since macOS 14. Second, Personal Voice, also from macOS Sonoma on, which records your own voice and synthesizes it offline after a one-time training pass. Third, Piper: an ONNX-based TTS from the Rhasspy project, voices are typically a single 50 to 100 MB file, runs comfortably on CPU, latency is low. Fourth, Kokoro TTS, a 2025/2026 open model that has become the go-to for self-hosted pipelines because the voices sound natural and it runs on Apple Silicon through MLX or Core ML converters. Coqui XTTS and Bark are still named in roundups but are heavier and slower; fine for batch, painful for real-time.

Does Fazm use a local TTS model?

No, and the honest answer is the interesting one. Fazm's voice mode calls Deepgram Aura, a cloud TTS API, using the 'aura-luna-en' voice at 24 kHz linear16. The code lives in Desktop/Sources/Providers/ChatToolExecutor.swift in the executeSpeakResponse function, starting around line 943. The choice was made deliberately: Aura's time-to-first-byte is under 200 ms in practice, the voice quality is higher than anything we could run on arbitrary consumer Macs without stepping on the user's CPU, and the whole agent loop is already network-dependent for reasoning. For a Mac AI agent that speaks in response to the LLM, adding a local TTS step to an otherwise networked pipeline buys very little privacy and costs a lot of CPU.

So if I want a Mac AI agent with fully local TTS, is that possible?

Yes, and the Fazm architecture makes it swappable. The speak_response tool in ChatToolExecutor.swift produces audio by hitting a single HTTPS endpoint and writing the returned bytes into AVAudioPlayer. Pointing it at a local TTS server (say a Piper or Kokoro model behind a FastAPI or a Swift-hosted process that exposes the same POST-text-get-audio contract) is a contained change. The rest of the agent, the screen understanding loop, the accessibility-API actions, the reasoning layer, keeps working. If the user demand for 'fully offline Fazm' grows, the TTS pluggability is the easiest half; the hard half is the reasoning model, not the voice.

Why pick AVSpeechSynthesizer over a third-party local model?

Three reasons. First, it is already installed: no 100 MB download, no ONNX runtime, no Core ML conversion step, it ships with the OS. Second, the premium voices on modern macOS are neural and genuinely good. Third, Apple's API is the only one that gets Personal Voice, which means the user can sound like themselves. If you do not need voice cloning and want the smallest code path from string to audio, the built-in API is hard to beat. The tradeoff is that it is Apple-only (no Linux, no Windows) and the voices, while good, are not the SOTA open voices like Kokoro.

When does local TTS beat cloud TTS for an AI agent?

Three scenarios. One, strict data locality requirements (regulated industries, air-gapped workflows): the text going into the TTS cannot leave the machine even if the reasoning does not. Two, very long sessions where cloud TTS cost at per-million-character pricing becomes real money: a Piper voice costs zero after download. Three, spotty network: any flight, any rural signal, any office firewall that blocks the vendor. In any other case, cloud TTS wins on voice quality and latency, and 'local TTS' becomes an engineering preference rather than a user-facing feature.

What are the practical latency numbers in 2026?

Rough, but honest, based on current hardware. AVSpeechSynthesizer with a premium voice: time-to-first-audio under 100 ms, because the API streams to the audio unit almost immediately. Piper on an M-series Mac: 150 to 300 ms depending on voice. Kokoro via MLX or ONNX Runtime: 300 to 600 ms for the first chunk, faster once warm. Deepgram Aura: 150 to 300 ms end-to-end including network. The idea that local is always faster than cloud is wrong; a well-streamed cloud TTS on a fiber connection can beat a cold-start local model, which is why the choice comes down to privacy and offline, not raw speed.

Where does the Fazm source actually show the TTS call?

Desktop/Sources/Providers/ChatToolExecutor.swift, lines 929 to 1007. The function signature is 'private static func executeSpeakResponse(_ args: [String: Any]) async -> String'. It reads a user-configurable speed from UserDefaults under the key 'voiceResponseSpeed' and clamps it between 0.25 and 2.0, builds a URLRequest to 'https://api.deepgram.com/v1/speak' with the query parameters model=aura-luna-en, encoding=linear16, sample_rate=24000, posts the JSON body {"text": text}, feeds the response bytes into AVAudioPlayer, flips 'enableRate = true' so the clamped speed takes effect, and plays. The static 'ttsAudioPlayer' reference is held at class level so ARC does not deallocate it mid-playback.

What about voice cloning, is that local or cloud in 2026?

Both, with very different bars. Apple's Personal Voice is fully local after the one-time training, but requires about 15 minutes of read-aloud recording and only clones you, not arbitrary targets. Open voice cloning (XTTS, F5-TTS, Bark variants) runs on M-series Macs but quality varies a lot by voice and language. Cloud voice cloning (ElevenLabs Pro, Cartesia) remains the quality leader for arbitrary-target cloning and zero-shot prompts. If the goal is 'speak in my own voice on my own machine with nothing leaving it', Personal Voice is the right answer in 2026. Everything else is cloud-first with honest tradeoffs.