ggml-tiny.bin, the 77 MB whisper model: where it lives and when not to use it

Matthew Diakonov, Written with AI

Published June 17, 2026 · 7 min read

If you typed the path to ggml-tiny.bin into a search box, you want one of two things: the exact file, or a straight answer on whether the smallest Whisper model is good enough for what you are building. This page gives you both. The download is one line. The second half is the part most write-ups about this file skip: I build a voice-first Mac agent, and we chose not to ship this model.

Direct answer (verified 2026-06-17)

ggml-tiny.bin is hosted in the ggerganov/whisper.cpp repository on Hugging Face. The direct file URL is:

https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin

As of 2026-06-17 the file is 77,691,713 bytes (~77.7 MB). The viewable model page is the blob view in the same repo. Or let the official script fetch it for you:

# from inside a cloned whisper.cpp checkout
./models/download-ggml-model.sh tiny
# -> writes models/ggml-tiny.bin (77.7 MB)

What ggml-tiny.bin actually is

GGML is the on-disk tensor format used by whisper.cpp, the C/C++ port of OpenAI's Whisper. The ggerganov/whisper.cpp repository on Hugging Face is a model store: it holds pre-converted GGML binaries so you do not need Python, PyTorch, or any conversion step. You download ggml-tiny.bin, point the whisper.cpp binary at it, and it runs.
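As a rough sketch of what "point the binary at it" looks like in practice: the wrapper below assumes a CMake build of whisper.cpp that places the command-line tool at ./build/bin/whisper-cli (older checkouts built it as ./main, so adjust the path to match yours). The transcribe function name and its defaults are hypothetical conveniences, not part of whisper.cpp; only the -m (model) and -f (input file) flags come from the tool itself.

```shell
#!/bin/sh
# transcribe WAV [MODEL] [BIN]
# Hypothetical wrapper: checks that the model and binary exist, then runs
# whisper.cpp on one audio file. The default paths are assumptions about a
# standard checkout and build layout.
transcribe() {
    wav=$1
    model=${2:-models/ggml-tiny.bin}
    bin=${3:-./build/bin/whisper-cli}

    [ -f "$model" ] || { echo "model not found: $model" >&2; return 2; }
    [ -x "$bin" ]   || { echo "whisper.cpp binary not found: $bin" >&2; return 2; }

    # -m selects the GGML model file, -f the input audio file.
    "$bin" -m "$model" -f "$wav"
}

# Usage, from inside a built whisper.cpp checkout:
#   transcribe samples/jfk.wav
```

Failing early on a missing model file gives a clearer message than letting the binary report a load error.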

"tiny" is the smallest of the five standard Whisper checkpoints, about 39 million parameters. That is what buys you the 77.7 MB file and the fast decode, and it is also what costs you accuracy. The whisper.cpp models README ranks it bluntly as the fastest and least accurate option. Everything else is a tradeoff up from there.

If you came looking for ggml.ggerganov.com or ggml-model-whisper-tiny.bin

Those two strings come from older whisper.cpp docs and scripts, and both have moved. The original download-ggml-model.sh pulled models from a self-hosted CDN at https://ggml.ggerganov.com using the longer filename pattern ggml-model-whisper-tiny.bin. The CDN is no longer the active source. In the current script the source is set to https://huggingface.co/ggerganov/whisper.cpp, and the old ggml.ggerganov.com line survives only as a commented-out fallback.

The filename changed too. On Hugging Face the tiny model is just ggml-tiny.bin, not ggml-model-whisper-tiny.bin. It is the same weights, same GGML format, just renamed when the models moved to the Hugging Face repo. So the mapping you want is:

Legacy (ggml.ggerganov.com)	Today (Hugging Face)
ggml.ggerganov.com/ggml-model-whisper-tiny.bin	huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
ggml-model-whisper-tiny.en.bin	ggml-tiny.en.bin

Bottom line: drop the -model-whisper from the name, point at Hugging Face instead of the old CDN, and you get the identical 77.7 MB binary. If you just run ./models/download-ggml-model.sh tiny from a fresh checkout, the script already does both for you.
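If you have old scripts or configs full of legacy names, the rename is mechanical enough to automate. The helper below is a hypothetical sketch (legacy_to_hf_url is my name, not something whisper.cpp ships): it keeps only the filename, drops the -model-whisper infix, and prefixes the Hugging Face resolve URL from the table above.

```shell
#!/bin/sh
# Base URL of the current model store (from the mapping table above).
HF_BASE="https://huggingface.co/ggerganov/whisper.cpp/resolve/main"

# legacy_to_hf_url NAME_OR_URL
# Hypothetical helper: accepts a legacy filename or a full legacy URL and
# prints the matching Hugging Face download URL.
legacy_to_hf_url() {
    # Keep only the filename part of whatever was passed in.
    name=${1##*/}
    # ggml-model-whisper-tiny.bin -> ggml-tiny.bin
    name=$(printf '%s\n' "$name" | sed 's/^ggml-model-whisper-/ggml-/')
    printf '%s/%s\n' "$HF_BASE" "$name"
}

legacy_to_hf_url "https://ggml.ggerganov.com/ggml-model-whisper-tiny.bin"
# -> https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
legacy_to_hf_url "ggml-model-whisper-tiny.en.bin"
# -> https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin
```

Names that are already in the new form pass through unchanged, so the helper is safe to run over a mixed list.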

If you searched for ggml-tiny.bin.zip

There is no file named ggml-tiny.bin.zip in ggerganov/whisper.cpp. The model ships uncompressed as a plain ggml-tiny.bin (77,691,713 bytes, ~77.7 MB). A GGML file is already a single packed binary that whisper.cpp reads straight off disk, so zipping it would add an extraction step for little or no size win. Verified against the Files and versions tab on huggingface.co/ggerganov/whisper.cpp/tree/main on 2026-06-20.

The only .zip files sitting next to the tiny model are the optional Apple Core ML encoders: ggml-tiny-encoder.mlmodelc.zip and the English-only ggml-tiny.en-encoder.mlmodelc.zip, each about 15 MB. They are zipped because a .mlmodelc is a compiled Core ML model directory (weights plus a manifest in separate files), and a folder cannot be hosted as one downloadable object. You unzip that beside the .bin only if you want the encoder pass to run on the Apple Neural Engine; the .bin alone runs fine on CPU and Metal with no unzip step.

One more source of the confusion: the SourceForge mirror publishes whisper-bin-x64.zip (about 4 MB), which is the prebuilt Windows command-line binary, not a model. So three different zips live in this ecosystem, and none of them is a zipped tiny model. The same pattern holds for the base model, covered in the ggml-base.bin.zip writeup.

The model sizes, with exact byte counts

All of these live in the same repository. The byte counts below were read straight from the Hugging Face x-linked-size header on 2026-06-17, so you can sanity-check a download against them.

File	Size	Bytes	When it fits
ggml-tiny.bin	77.7 MB	77,691,713	Fastest, lowest accuracy. Multilingual. Wake words, rough drafts, throwaway transcripts on weak hardware.
ggml-tiny.en.bin	77.7 MB	77,704,715	English-only variant of tiny. Slightly better on English than the multilingual tiny at the same size.
ggml-base.bin	148 MB	147,951,465	The usual next step up when tiny mangles too many words. Still real-time on Apple Silicon.
ggml-small.bin	488 MB	487,601,967	Noticeably more accurate. The point where many people stop for offline dictation.
ggml-medium.bin	1.53 GB	1,533,763,059	Heavy. Good accuracy, but you feel it on latency and memory.

Quantized variants (the q5_1, q8_0 files) exist in the same repo and trade a little accuracy for a smaller footprint. Check the Files and versions tab on the repo for the current list before you hardcode a filename.
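One way to use the byte counts in the table is a quick post-download size check. The sketch below hardcodes the values listed above (read on 2026-06-17, so they may drift if the repo is updated); expected_bytes and check_model_size are hypothetical helper names. A matching size catches truncated downloads, but it is not a checksum and does not prove the contents are intact.

```shell
#!/bin/sh
# expected_bytes FILENAME: print the byte count from the table above.
expected_bytes() {
    case "$1" in
        ggml-tiny.bin)    echo 77691713 ;;
        ggml-tiny.en.bin) echo 77704715 ;;
        ggml-base.bin)    echo 147951465 ;;
        ggml-small.bin)   echo 487601967 ;;
        ggml-medium.bin)  echo 1533763059 ;;
        *) return 1 ;;
    esac
}

# check_model_size PATH
# Returns 0 on a match, 1 on a size mismatch, 2 if the model name is not in
# the table or the file does not exist.
check_model_size() {
    want=$(expected_bytes "${1##*/}") || { echo "unknown model: $1" >&2; return 2; }
    [ -f "$1" ] || { echo "missing file: $1" >&2; return 2; }
    # wc -c pads its output with spaces on some systems; strip them.
    have=$(wc -c < "$1" | tr -d ' ')
    if [ "$have" -eq "$want" ]; then
        echo "ok: $1 ($have bytes)"
    else
        echo "size mismatch: $1 has $have bytes, expected $want" >&2
        return 1
    fi
}

# Usage:
#   check_model_size models/ggml-tiny.bin
```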

When tiny is genuinely the right call

I do not want to talk you out of this model. There are real jobs where 77.7 MB and a fast decode beat everything heavier:

Wake-word and gating. You only need to know roughly what was said to decide whether to hand off to a bigger model. tiny is perfect as the cheap first pass.
Constrained hardware. Old laptops, Raspberry Pi, anything where loading a 500 MB model is a non-starter. tiny runs where small will not.
Strictly offline requirements. If audio is not allowed to leave the machine, full stop, then a local model is the only honest answer and tiny is the lightest one.
Batch transcription of clean audio. When latency does not matter and the audio is clean, tiny's mistakes are easy to skim past in a draft you are going to edit anyway.

“The whole reason to reach for tiny is that it is the only Whisper checkpoint small enough to load anywhere, instantly. Everything past that is you paying for accuracy you may not need.”

whisper.cpp models README, fastest / least-accurate tier

Why Fazm does not ship ggml-tiny.bin

Fazm is a voice-first agent for macOS: you hold a hotkey, talk, and the same Claude Code / Codex agent loop acts on your machine. Voice is the front door, so transcription is on the hot path. You would assume we run a local Whisper model. We tried; we did not ship it.

The main problem with tiny for a live agent is not word accuracy; it is when the words arrive. whisper.cpp decodes a window of audio after that window closes. You finish a phrase, then you wait for the model to run, then you get text. For a dictation app that is fine. For an agent that should start reacting while you are still mid-sentence, that pause is the whole experience. Even a model that nails accuracy loses if the user is staring at a frozen cursor.

So Fazm streams instead. Our TranscriptionService.swift hardcodes private let model = "nova-3" and opens a WebSocket to wss://api.deepgram.com/v1/listen. Audio goes up in chunks; interim (non-final) transcript segments come back mid-utterance, so the loop can move before you stop talking. We also feed a keyterm vocabulary list so command words and app names land, and we added a detector for the degenerate repeated-token loops that small ASR models emit on silence. That detector exists precisely because we saw how badly the lowest tier handles low-energy audio.

That is a real tradeoff, and I will state it plainly: streaming to Deepgram means audio leaves the machine. If your requirement is that it must not, ggml-tiny.bin (or a larger local Whisper model) is the correct choice and this is the page that tells you where to get it. We optimized for latency in the agent loop; you may weight privacy higher. Both are defensible.

Local ggml-tiny.bin vs streaming ASR, for an agent

Same goal, different bottleneck. This is the comparison the download guides skip.

Feature	Local ggml-tiny.bin	Fazm (streaming Nova-3)
Where it runs	On your Mac, fully offline once the 77 MB file is downloaded	Streams audio to Deepgram over a WebSocket, transcript comes back in chunks
Latency for a live agent	Decode happens after a chunk of audio; you wait for the window to close before you get text	Interim (non-final) results arrive mid-sentence, so the loop can react before you stop talking
Accuracy at this size	tiny is the lowest-accuracy whisper model; expect mistakes on names and jargon	Nova-3 with a keyterm vocabulary list, tuned for command-style speech
Privacy	Audio never leaves the machine	Audio is sent to a third-party ASR endpoint (the honest tradeoff we made for latency)
Hallucination on silence	tiny is especially prone to repeated-token loops on silent or low-energy audio	We added a degenerate-repeat detector in TranscriptionService.swift to drop those segments

Not a knock on whisper.cpp. tiny is a great offline first-pass model; it is just optimized for a different constraint than a real-time agent loop.

If you landed here because you are wiring voice into something on a Mac and you are weighing local Whisper against a hosted streaming model, that is exactly the call Fazm made in production. You can read the agent it feeds into, or talk it through.

ggml-tiny.bin, answered

What is the exact download URL for ggml-tiny.bin?

The direct file is https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin. The viewable page (with the Files and versions tab) is https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-tiny.bin. As of 2026-06-17 the file is 77,691,713 bytes (about 77.7 MB).

Is there a ggml-tiny.bin.zip file?

No. There is no file named ggml-tiny.bin.zip in ggerganov/whisper.cpp. The model ships uncompressed as ggml-tiny.bin (77,691,713 bytes, about 77.7 MB); a GGML file is already a single packed binary, so it is not zipped. The only zips next to the tiny model are the optional Core ML encoders ggml-tiny-encoder.mlmodelc.zip and ggml-tiny.en-encoder.mlmodelc.zip (about 15 MB each), and separately the whisper-bin-x64.zip Windows binary on SourceForge. None of those is a zipped model. Verified on the Files and versions tab on 2026-06-20.

How big is ggml-tiny.bin?

77,691,713 bytes, roughly 77.7 MB on disk. It is the smallest of the standard whisper.cpp GGML models. The English-only ggml-tiny.en.bin is almost identical at 77,704,715 bytes.

Why does my old script point to ggml.ggerganov.com/ggml-model-whisper-tiny.bin?

That is the legacy path. The original whisper.cpp download-ggml-model.sh fetched models from a self-hosted CDN at https://ggml.ggerganov.com using the longer ggml-model-whisper-tiny.bin filename. The models have since moved to Hugging Face and the file was renamed to ggml-tiny.bin. The current script sets its source to https://huggingface.co/ggerganov/whisper.cpp and keeps the ggml.ggerganov.com line only as a commented-out fallback. Same weights, new location and shorter name: download https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin instead.

Do I have to use Hugging Face, or can the script fetch it?

Either works. The official whisper.cpp way is ./models/download-ggml-model.sh tiny, which constructs the same Hugging Face URL and saves the file into models/. Manually downloading from the ggerganov/whisper.cpp repo gives you the identical binary; there is no extra conversion step.

What is the difference between ggml-tiny.bin and ggml-tiny.en.bin?

ggml-tiny.bin is multilingual; ggml-tiny.en.bin is trained only on English. At the same ~77.7 MB size, the .en variant tends to be a little more accurate on English audio because it does not spend parameters on other languages. If you only ever transcribe English, prefer .en.

Why does the tiny model produce repeated garbage on silent audio?

Small Whisper models, tiny most of all, are prone to degenerate repeated-token loops when fed silence or low-energy audio. They were trained on speech, so silence pushes them into a hallucination loop. The fix is to gate the model behind voice-activity detection and/or detect the repeated-token pattern and discard those segments. We do the latter in our transcription layer.
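To make the "detect the repeated-token pattern" option concrete, here is a minimal sketch. It is not the detector in our transcription layer; the is_degenerate name, the four-repeat threshold, and the phrase length limit of three words are all assumptions chosen for illustration. It flags a segment when a word or a phrase of up to three words repeats back to back at least that many times.

```shell
#!/bin/sh
# is_degenerate TEXT [REPS]
# Exits 0 when some phrase of 1 to 3 words occurs REPS or more times in a row
# (default 4), and 1 otherwise. Case and basic punctuation are ignored.
is_degenerate() {
    printf '%s\n' "$1" | awk -v reps="${2:-4}" '
        {
            for (i = 1; i <= NF; i++) {
                w = tolower($i)
                gsub(/[.,!?;:"]/, "", w)
                if (w != "") words[++n] = w
            }
        }
        END {
            # For phrase length p, a streak of matches words[i] == words[i-p]
            # of length p * (reps - 1) means the phrase repeated reps times.
            for (p = 1; p <= 3; p++) {
                streak = 0
                for (i = p + 1; i <= n; i++) {
                    streak = (words[i] == words[i - p]) ? streak + 1 : 0
                    if (streak >= p * (reps - 1)) exit 0
                }
            }
            exit 1
        }'
}

is_degenerate "Thank you. Thank you. Thank you. Thank you." && echo "drop segment"
is_degenerate "open the terminal and run the build" || echo "keep segment"
```

The threshold is deliberately forgiving: ordinary speech does repeat words ("no no no"), so a detector like this should only fire on runs that a person is unlikely to produce. Voice-activity gating in front of the model remains the more direct fix.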

Is ggml-tiny.bin good enough for a real-time voice agent?

For wake-word detection or rough notes, yes. For a voice agent that has to act on what you said the moment you said it, the bottleneck is not just word accuracy; it is when the text arrives. whisper.cpp decodes a window after it closes, so you wait. That is why Fazm streams to a real-time ASR (Deepgram Nova-3) that returns interim results mid-utterance instead of running tiny locally. Different jobs, different tools.

What does Fazm actually run for voice input?

Fazm's TranscriptionService.swift hardcodes the model nova-3 and opens a WebSocket to wss://api.deepgram.com/v1/listen, sending audio chunks and receiving streamed transcript segments with a custom keyterm vocabulary. It is a deliberate latency-over-locality choice for the agent loop, not a whisper.cpp build.
