ggml-large-v3.bin: the 3.1 GB download, and why bigger is the wrong axis for a voice agent

Matthew Diakonov, Written with AI

Published June 23, 2026 · 7 min read

You searched the exact /resolve/main/ path for ggml-large-v3.bin, so you want one of two things: the actual file, or a straight answer on whether the biggest, most accurate Whisper model is the one to build on. This page gives you both. The download is one line. The second half is the part the download guides skip, because I build a voice-first Mac agent and large-v3's real limit is not the one its accuracy score suggests.

Direct answer (verified 2026-06-23)

ggml-large-v3.bin lives in the ggerganov/whisper.cpp repository on Hugging Face. The direct download is:

https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin

As of 2026-06-23 the file is 3,095,033,483 bytes (~3.1 GB), sha256 64d182b440b98d5203c4f9bd541544d84c605196c4f7b845dfa11fb23594d1e2. The viewable model page is the blob view in the same repo. Or let the official script fetch it:

# from inside a cloned whisper.cpp checkout
./models/download-ggml-model.sh large-v3
# -> writes models/ggml-large-v3.bin (~3.1 GB)

# or fetch it directly (follow the 302 redirect to the Xet CDN):
curl -L -o ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin
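
Once the file is on disk, it is worth checking it against the byte count and sha256 quoted above before pointing whisper.cpp at it. What follows is a minimal sketch, assuming a POSIX-style shell with either sha256sum (Linux) or shasum (macOS) on the PATH. The helper names sha256_of and verify_model are made up for this example and are not part of whisper.cpp.

```sh
# Sketch: compare a downloaded model against an expected size and sha256.
# sha256_of and verify_model are hypothetical helper names.

sha256_of() {
  # Prefer sha256sum (Linux); fall back to shasum -a 256 (macOS).
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

verify_model() {
  local file="$1" want_bytes="$2" want_sha="$3"
  local got_bytes got_sha

  # Cheap check first: the byte count.
  got_bytes=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$got_bytes" != "$want_bytes" ]; then
    echo "size mismatch: got $got_bytes, want $want_bytes" >&2
    return 1
  fi

  # Then the expensive one: hash the whole file.
  got_sha=$(sha256_of "$file")
  if [ "$got_sha" != "$want_sha" ]; then
    echo "sha256 mismatch: got $got_sha" >&2
    return 1
  fi

  echo "ok: $file"
}

# Usage, with the values quoted on this page:
# verify_model ggml-large-v3.bin 3095033483 \
#   64d182b440b98d5203c4f9bd541544d84c605196c4f7b845dfa11fb23594d1e2
```

Checking the size first means a truncated download fails immediately instead of after hashing 3.1 GB.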

If you want most of this accuracy at roughly half the download size and a fraction of the decode cost, grab ggml-large-v3-turbo.bin (1,624,555,275 bytes, ~1.62 GB) instead. For an interactive workload that is almost always the better pick, for reasons the rest of this page gets into.

large-v3 in context, with exact byte counts

Everything below lives in the same whisper.cpp model store. The byte counts were read straight from the Hugging Face x-linked-size header on 2026-06-23, so you can sanity-check a download against them. large-v3 is the top row by size and the bottom row by speed.

File	Bytes	Size	What it is
ggml-base.bin	147,951,465	~148 MB	The default most people settle on. Real-time on Apple Silicon.
ggml-small.bin	487,601,967	~488 MB	Where serious dictation usually stops.
ggml-large-v3-turbo.bin	1,624,555,275	~1.62 GB	Pruned decoder. Most of large-v3's accuracy at a fraction of the decode cost.
ggml-large-v3.bin	3,095,033,483	~3.1 GB	The accuracy ceiling. About 1.55 billion parameters. Slowest to decode.
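
To reproduce a byte count from the table without downloading anything, you can request headers only and read x-linked-size. This is a sketch, assuming the /resolve/main/ response still carries that header on the day you run it. linked_size is a hypothetical helper that only parses header text from stdin.

```sh
# Sketch: pull the x-linked-size value out of raw HTTP response headers.
# linked_size is a hypothetical helper name.
linked_size() {
  # Strip carriage returns, then match the header name case-insensitively.
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-linked-size" { print $2 }'
}

# Network usage (HEAD request, redirect deliberately not followed):
# curl -sI https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin | linked_size
```

Leaving off -L is intentional in this sketch: the assumption is that the header of interest is on the first response, before the redirect to the CDN.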

Quantized large-v3 variants (the q5_0 and q8_0 files) exist in the same repo. They shrink the file on disk, not the underlying accuracy ceiling. Check the Files and versions tab before you hardcode a filename, and read the official models README for the canonical size ranking.

The trap: you reached for the biggest model

Almost everyone who downloads this specific file does it for the same reason. They want a voice interface, large-v3 is the most accurate checkpoint, so it must be the right one. That instinct is correct for batch transcription and wrong for an interactive agent, and the reason is not a small caveat. There are two failure modes where a bigger model actively hurts you, and large-v3 is the most exposed to both.

Failure one: the decode happens after you stop talking

whisper.cpp transcribes a window of audio once that window has closed. You finish a phrase, then the model runs, then text appears. The bigger the model, the longer that gap. With a 3.1 GB checkpoint at roughly 1.55 billion parameters, the forward pass is the slowest of any model in the repo. For dictation into a document, the gap is invisible. For an agent that should begin acting while you are still mid-sentence, the gap is the product.

What a local large-v3 decode actually does to your latency

You start talking

Audio buffers locally. Nothing is transcribed yet, because whisper.cpp works on a closed window of audio, not a live stream.

You stop, the window closes

Only now does the decode begin. The model has not seen a single word of your phrase until you finish it.

The 3.1 GB model runs a full forward pass

large-v3 is roughly 1.55 billion parameters. On an M-series Mac it decodes faster than real time for batch work, but a forward pass over your phrase still takes a beat you can feel before any text exists.

Text finally appears

The agent only now gets something to act on. For dictation into a notes app that pause is invisible. For an agent that should start moving while you are still mid-sentence, the pause is the entire experience.

This is why Fazm does not decode a local Whisper window at all. The relevant configuration is in Desktop/Sources/TranscriptionService.swift: we open a WebSocket to a streaming ASR with interim_results=true, so partial transcript segments come back while you are still speaking. That is a structural property of streaming, not of model size. No amount of accuracy from large-v3 buys it back, because the model never sees a word until the window closes.
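
For a sense of what that configuration amounts to, here is a minimal sketch that only assembles a streaming endpoint URL. It assumes Deepgram's live-transcription endpoint (the provider named later on this page) and its publicly documented query parameters. The linear16 encoding and the mono channel count are assumptions, not values taken from TranscriptionService.swift, and streaming_url is a hypothetical helper. Authentication and the WebSocket client itself are out of scope here.

```sh
# Sketch: build a streaming ASR URL with interim results enabled.
# streaming_url is a hypothetical helper; parameter values are assumptions
# except interim_results=true and the 16 kHz sample rate, which this page states.
streaming_url() {
  local base="wss://api.deepgram.com/v1/listen"
  local query="model=nova-3"
  query="$query&encoding=linear16"     # assumption: 16-bit PCM
  query="$query&sample_rate=16000"     # 16 kHz, as stated on this page
  query="$query&channels=1"            # assumption: mono microphone input
  query="$query&interim_results=true"  # partial segments arrive mid-utterance
  printf '%s?%s\n' "$base" "$query"
}

streaming_url
```

The only parameter that changes the latency story is interim_results; the rest describe the audio format.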

Failure two: the biggest model has the loudest hallucinations on silence

Every Whisper checkpoint can invent text when it is handed silence or low-energy, non-speech audio. The decoder latches onto something and emits it with full confidence. A common shape is a single token repeated several times with nothing else, the model looping on itself. large-v3 does not escape this by being accurate; accuracy is about transcribing speech, and there is no speech here to transcribe. In a push-to-talk voice agent, dead air between utterances is exactly the input that triggers it.

The fix is not a bigger model. It is a guard that recognizes degenerate output and throws it away before it reaches the agent. In Fazm that guard is isRepeatedTokenHallucination in the same TranscriptionService.swift. The rule is deliberately blunt: split the segment into tokens, and if there are four or more of them and they are all the same token, drop it. Real dictation is never one word repeated four times with nothing else, so the false-positive risk is near zero, and the caller falls back to the silence state instead of acting on a ghost phrase.
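
The rule is small enough to sketch in a few lines of shell. This is an illustration of the rule as described above, not the Swift source. It assumes tokens are whitespace-separated and compared as exact strings; whether the real guard normalizes case or punctuation is not stated here.

```sh
# Sketch of the rule: four or more tokens, all identical => hallucination.
# Returns 0 (true) when the segment should be dropped, 1 otherwise.
is_repeated_token_hallucination() {
  local segment="$1"
  local first tok

  set -f            # no glob expansion while word-splitting
  set -- $segment   # split on whitespace into $1, $2, ...
  set +f

  # Fewer than four tokens is never treated as a loop.
  [ "$#" -ge 4 ] || return 1

  first=$1
  for tok in "$@"; do
    [ "$tok" = "$first" ] || return 1
  done
  return 0
}

# Usage: drop the segment instead of handing it to the agent.
# if is_repeated_token_hallucination "$segment"; then segment=""; fi
```

The four-token floor is what keeps short, legitimate repeats such as "no no" from being discarded.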

A silence loop, before and after the guard

Before the guard: the decoder is handed a second of room tone between your phrases. With nothing to transcribe, it loops on a single token and returns it as a confident, final segment. The agent has no way to know this was not speech.

Four or more identical tokens, nothing else
Marked final, high confidence
Agent would act on a phantom command

After the guard: the segment is dropped before it reaches the agent, and the caller falls back to the silence state.

The point is not that this check, nine lines in Fazm's source, is clever. It is that any serious use of large-v3 in a live voice loop needs something like it, and reaching for the most accurate checkpoint does not give it to you. If you wire large-v3 into a push-to-talk flow, budget for the guard layer up front.

When large-v3 actually is the right call

None of this is an argument against the model. It is an argument against the default. large-v3 earns its 3.1 GB in the workloads it was built for. Here is the honest split.

Download large-v3 when these are true (and skip it when they are not)

Download it when:

You are batch-transcribing recorded audio where a few seconds of latency per file is invisible.
Accuracy on names, jargon, accents, and noisy or multilingual audio is the thing you are optimizing.
The audio must never leave the machine, so a local model is a hard requirement.
You have the disk and memory headroom to hold a 3.1 GB checkpoint resident.

Skip it when:

You need an interactive voice interface that reacts before you finish the sentence.
Your bottleneck is command and code dictation, where the model writes "dot com" instead of ".com".

Why Fazm streams Nova-3 instead of decoding large-v3 locally

Fazm is a voice-first agent for macOS: hold a hotkey, talk, and the same Claude Code or Codex agent loop acts on your machine. Voice is the front door, so transcription is on the hot path. You would expect us to ship a local Whisper model, and large-v3 would be the prestige choice. We chose a streaming ASR instead, and the reason is the two failures above, not a dislike of local models.

Concretely: TranscriptionService.swift streams 16 kHz PCM to Deepgram Nova-3 over a WebSocket, takes interim segments back mid-utterance so the agent can react early, runs the isRepeatedTokenHallucination guard on every segment, and applies a spoken-to-written rewrite table (so "dot com" becomes ".com") plus a custom keyterm vocabulary for command and app names. The model tier was never the interesting decision; the streaming and the guard layer were.
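
A spoken-to-written rewrite table can be as plain as an ordered list of substitutions. The sketch below is an assumption about the general shape, not Fazm's table: only the "dot com" rule comes from this page, the "dot org" rule is added for illustration, and spoken_to_written is a hypothetical helper.

```sh
# Sketch: rewrite spoken forms into written forms on stdin.
# The trailing ( |$) keeps "dot command" from becoming ".command".
spoken_to_written() {
  sed -E \
    -e 's/ dot com( |$)/.com\1/g' \
    -e 's/ dot org( |$)/.org\1/g'
}

# Usage:
# printf '%s\n' 'open example dot com' | spoken_to_written
```

Running the rewrite after the hallucination guard means degenerate segments never reach the table at all.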

That is a real tradeoff, stated plainly: streaming means audio leaves the machine. If your hard requirement is that it must not, then a local model is the correct choice, and large-v3 (or, more sensibly, large-v3-turbo) is where you start, with the guard layer bolted on. We optimized for latency in the agent loop; you may weight privacy higher. Both are defensible, and this is the page that tells you where to get the file either way.

ggml-large-v3.bin, answered

What is the exact download URL for ggml-large-v3.bin?

The direct file is https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin. The /resolve/main/ path streams the raw binary; the /blob/main/ path of the same name shows the viewable page with the Files and versions tab. As of 2026-06-23 the file is 3,095,033,483 bytes (about 3.1 GB). Note that /resolve/main/ issues a 302 redirect to a Hugging Face Xet CDN host before the bytes start, so follow redirects with curl -L.

How big is ggml-large-v3.bin and how do I verify the download?

3,095,033,483 bytes, roughly 3.1 GB on disk. The Hugging Face x-linked-etag header exposes the sha256 as 64d182b440b98d5203c4f9bd541544d84c605196c4f7b845dfa11fb23594d1e2, so you can run shasum -a 256 on the file you downloaded and compare. That is the f16 checkpoint at about 1.55 billion parameters, which is why it is roughly twenty-one times the size of ggml-base.bin (147,951,465 bytes).

ggml-large-v3.bin or ggml-large-v3-turbo.bin?

large-v3-turbo (1,624,555,275 bytes, ~1.62 GB) keeps most of large-v3's accuracy with a heavily pruned decoder, so it decodes far faster. For nearly every interactive or near-real-time use, turbo is the better tradeoff. Reach for the full large-v3 only when you are batch-processing and want the last increment of accuracy on hard audio, and you do not care that each decode is slower.

Why does whisper large-v3 sometimes hallucinate text on silence?

Every Whisper checkpoint, large-v3 included, can latch onto a phrase and invent output when it is fed silence or low-energy, non-speech audio. Common symptoms are a single token repeated several times, or stock phrases the model saw a lot in training. The accuracy ceiling does not protect you here; it is a property of how the decoder behaves on input that contains no speech. The fix is not a bigger model, it is a guard that recognizes degenerate output and drops it.

Is large-v3 the right model for a voice command or coding agent?

Usually no, and not because it is inaccurate. Two things bite an interactive agent that model size makes worse, not better. First, latency: whisper.cpp decodes a window of audio after it closes, so the bigger the model the longer you wait before the agent has anything to act on. Second, on non-speech audio the largest model produces the loudest hallucinations. A responsive voice agent needs streaming transcripts and a guard layer far more than it needs the top accuracy tier.

Do I have to download from Hugging Face, or can the script fetch large-v3?

Either works and gives the identical binary. The official whisper.cpp way is ./models/download-ggml-model.sh large-v3, which constructs the same Hugging Face /resolve/main/ URL and writes models/ggml-large-v3.bin. There is no extra conversion step; the .bin is already in GGML format.
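
The URL shape is regular enough to build by hand. This sketch mirrors the pattern visible in the links on this page; it is not the script's actual source, and model_url is a hypothetical helper.

```sh
# Sketch: build the /resolve/main/ URL for a whisper.cpp model name.
model_url() {
  printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
}

# Usage:
# curl -L -o ggml-large-v3-turbo.bin "$(model_url large-v3-turbo)"
```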

What does Fazm use instead of running ggml-large-v3.bin locally?

Fazm is a voice-first macOS agent, so transcription sits on the hot path and latency matters as much as accuracy. We stream audio over a WebSocket to a real-time ASR (Deepgram Nova-3) with interim_results enabled, so partial transcripts arrive mid-utterance instead of after the phrase ends. On top of that we run a guard in TranscriptionService.swift, isRepeatedTokenHallucination, that drops any segment which is four or more identical tokens with nothing else, the exact silence-loop failure large Whisper models emit. That guard layer is what you would have to build around large-v3 too; it is independent of which model size you pick.
