ggml-tiny.bin, the 77 MB Whisper model: where it lives and when not to use it

Matthew Diakonov

If you typed the path to ggml-tiny.bin into a search box, you want one of two things: the exact file, or a straight answer on whether the smallest Whisper model is good enough for what you are building. This page gives you both. The download is one line. The second half is the part most write-ups about this file leave out: I build a voice-first Mac agent, and we chose not to ship this model.

Direct answer (verified 2026-06-17)

ggml-tiny.bin is hosted in the ggerganov/whisper.cpp repository on Hugging Face. The direct file URL is:

https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin

As of 2026-06-17 the file is 77,691,713 bytes (~77.7 MB). The viewable model page is the blob view in the same repo. Or let the official script fetch it for you:

# from inside a cloned whisper.cpp checkout
./models/download-ggml-model.sh tiny
# -> writes models/ggml-tiny.bin (77.7 MB)

  • 77.7 MB: on-disk size of ggml-tiny.bin
  • 39M: parameters in the tiny model
  • 5 sizes: tiny / base / small / medium / large

What ggml-tiny.bin actually is

GGML is the on-disk tensor format used by whisper.cpp, the C/C++ port of OpenAI's Whisper. The ggerganov/whisper.cpp repository on Hugging Face is a model store: it holds pre-converted GGML binaries so you do not need Python, PyTorch, or any conversion step. You download ggml-tiny.bin, point the whisper.cpp binary at it, and it runs.
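A quick way to confirm that what you downloaded is a GGML binary, and not an HTML error page saved under a .bin name, is to look at the first four bytes. This is a minimal sketch, assuming the legacy header that whisper.cpp's conversion script writes (the 32-bit value 0x67676d6c, "ggml" in hex) and a little-endian file. The function name looks_like_ggml is made up for illustration.

```python
import struct

# Magic number at the start of a legacy GGML file ("ggml" in hex).
# Assumption: the file was written on a little-endian machine.
GGML_MAGIC = 0x67676D6C


def looks_like_ggml(path):
    """Return True if the file begins with the GGML magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return False  # too short to hold a header at all
    (magic,) = struct.unpack("<I", head)
    return magic == GGML_MAGIC
```

This only checks the header. It says nothing about whether the rest of the file is complete, so it pairs well with a size check.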

"tiny" is the smallest of the five standard Whisper checkpoints, about 39 million parameters. That is what buys you the 77.7 MB file and the fast decode, and it is also what costs you accuracy. The whisper.cpp models README ranks it bluntly as the fastest and least accurate option. Everything else is a tradeoff up from there.

The model sizes, with exact byte counts

All of these live in the same repository. The byte counts below were read straight from the Hugging Face x-linked-size header on 2026-06-17, so you can sanity-check a download against them.

| File | Size | Bytes | When it fits |
|---|---|---|---|
| ggml-tiny.bin | 77.7 MB | 77,691,713 | Fastest, lowest accuracy. Multilingual. Wake words, rough drafts, throwaway transcripts on weak hardware. |
| ggml-tiny.en.bin | 77.7 MB | 77,704,715 | English-only variant of tiny. Slightly better on English than the multilingual tiny at the same size. |
| ggml-base.bin | 148 MB | 147,951,465 | The usual next step up when tiny mangles too many words. Still real-time on Apple Silicon. |
| ggml-small.bin | 488 MB | 487,601,967 | Noticeably more accurate. The point where many people stop for offline dictation. |
| ggml-medium.bin | 1.53 GB | 1,533,763,059 | Heavy. Good accuracy, but you feel it on latency and memory. |

Quantized variants (the q5_1, q8_0 files) exist in the same repo and trade a little accuracy for a smaller footprint. Check the Files and versions tab on the repo for the current list before you hardcode a filename.
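The byte counts in the table above can be turned into a sanity check for a finished download. This is a minimal sketch: the numbers are the ones listed on this page as of 2026-06-17, and the function name check_download is made up for illustration. If the repository is ever re-uploaded, the expected values would need refreshing.

```python
import os

# Byte counts copied from the table on this page (read 2026-06-17).
EXPECTED_BYTES = {
    "ggml-tiny.bin": 77_691_713,
    "ggml-tiny.en.bin": 77_704_715,
    "ggml-base.bin": 147_951_465,
    "ggml-small.bin": 487_601_967,
    "ggml-medium.bin": 1_533_763_059,
}


def check_download(path, expected=EXPECTED_BYTES):
    """Compare a file's size on disk with the expected byte count.

    Returns (matches, actual_size). Raises KeyError for unknown filenames.
    """
    name = os.path.basename(path)
    if name not in expected:
        raise KeyError(f"no expected size recorded for {name}")
    actual = os.path.getsize(path)
    return actual == expected[name], actual
```

A size mismatch usually means a truncated download or an error page saved under the model's filename. A match is not a cryptographic guarantee; it only rules out the common failure modes.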

When tiny is genuinely the right call

I do not want to talk you out of this model. There are real jobs where 77.7 MB and a fast decode beat everything heavier:

  • Wake-word and gating. You only need to know roughly what was said to decide whether to hand off to a bigger model. tiny is perfect as the cheap first pass.
  • Constrained hardware. Old laptops, Raspberry Pi, anything where loading a 500 MB model is a non-starter. tiny runs where small will not.
  • Strictly-offline requirements. If audio is not allowed to leave the machine, full stop, then a local model is the only honest answer and tiny is the lightest one.
  • Batch transcription of clean audio. When latency does not matter and the audio is clean, tiny's mistakes are easy to skim past in a draft you are going to edit anyway.

The whole reason to reach for tiny is that it is the only Whisper checkpoint small enough to load anywhere, instantly. Everything past that is you paying for accuracy you may not need.

whisper.cpp models README, fastest / least-accurate tier
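The wake-word and gating case from the list above can be sketched as a fuzzy match over a rough transcript. This is an illustration under assumptions, not a production wake-word engine: the phrase "hey computer", the 0.8 threshold, and the function names are all hypothetical, and the transcript is assumed to come from whatever cheap first-pass model you run.

```python
import difflib
import re


def _words(text):
    """Lowercase the text and keep only alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def heard_wake_phrase(transcript, wake_phrase, threshold=0.8):
    """Return True if some window of the transcript is close to the wake phrase.

    Fuzzy matching tolerates the small spelling errors a low-accuracy model
    tends to make (for example "hay" for "hey").
    """
    words = _words(transcript)
    target = _words(wake_phrase)
    n = len(target)
    if n == 0 or len(words) < n:
        return False
    target_str = " ".join(target)
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        ratio = difflib.SequenceMatcher(None, window, target_str).ratio()
        if ratio >= threshold:
            return True
    return False
```

Only when the gate returns True would you hand the audio to a larger model or a hosted service, which keeps the expensive path off for most of the day.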

Why Fazm does not ship ggml-tiny.bin

Fazm is a voice-first agent for macOS: you hold a hotkey, talk, and the same Claude Code / Codex agent loop acts on your machine. Voice is the front door, so transcription is on the hot path. You would assume we run a local Whisper model. We tried; we did not ship it.

The problem with tiny for a live agent is not mainly word accuracy, it is when the words arrive. whisper.cpp decodes a window of audio after that window closes. You finish a phrase, then you wait for the model to run, then you get text. For a dictation app that is fine. For an agent that should start reacting while you are still mid-sentence, that pause is the whole experience. The model that nails accuracy still loses if the user is staring at a frozen cursor.
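The timing argument is simple arithmetic, sketched below. Every number in the usage note is an illustrative assumption, not a measurement of whisper.cpp, tiny, or any hosted service; the point is only which terms appear in each sum.

```python
def first_text_windowed(window_s, decode_s):
    """Seconds from the start of speech until the first text appears
    when decoding only starts once the audio window has closed."""
    return window_s + decode_s


def first_text_streaming(chunk_s, round_trip_s):
    """Seconds from the start of speech until the first interim result
    when audio is sent in small chunks and results stream back."""
    return chunk_s + round_trip_s
```

With an assumed 3-second window and 0.3 seconds of decode time, the windowed path shows nothing for about 3.3 seconds. With assumed 0.1-second chunks and a 0.2-second round trip, the streaming path shows a first interim result after about 0.3 seconds. A faster local model shrinks decode_s, but window_s stays in the sum, which is the part that matters for an agent.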

So Fazm streams instead. Our TranscriptionService.swift hardcodes private let model = "nova-3" and opens a WebSocket to wss://api.deepgram.com/v1/listen. Audio goes up in chunks; interim (non-final) transcript segments come back mid-utterance, so the loop can move before you stop talking. We also feed a keyterm vocabulary list so command words and app names land, and we added a detector for the degenerate repeated-token loops that small ASR models emit on silence. That detector exists precisely because we saw how badly the lowest tier handles low-energy audio.
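For orientation, here is roughly what the connection URL for such a setup looks like. This is a sketch, not Fazm's code: the source only confirms the nova-3 model, the endpoint, interim results, and a keyterm list. The encoding and sample_rate values, the example terms, and the function name build_listen_url are assumptions for illustration.

```python
from urllib.parse import quote, urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(keyterms, model="nova-3", sample_rate=16000):
    """Build a streaming URL with interim results and repeated keyterm params."""
    params = [
        ("model", model),
        ("interim_results", "true"),      # ask for non-final segments mid-utterance
        ("encoding", "linear16"),         # assumption: raw 16-bit PCM audio
        ("sample_rate", str(sample_rate)),  # assumption: 16 kHz capture
    ]
    # One keyterm parameter per vocabulary entry.
    params += [("keyterm", term) for term in keyterms]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params, quote_via=quote)}"
```

Calling build_listen_url(["Fazm", "Claude Code"]) yields a URL that a WebSocket client would open with an API key in the Authorization header; the audio chunks and the JSON results then flow over that one connection.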

That is a real tradeoff, and I will state it plainly: streaming to Deepgram means audio leaves the machine. If your requirement is that it must not, ggml-tiny.bin (or a larger local Whisper model) is the correct choice and this is the page that tells you where to get it. We optimized for latency in the agent loop; you may weight privacy higher. Both are defensible.

Local ggml-tiny.bin vs streaming ASR, for an agent

Same goal, different bottleneck. This is the comparison the download guides skip.

| Feature | Local ggml-tiny.bin | Fazm (streaming Nova-3) |
|---|---|---|
| Where it runs | On your Mac, fully offline once the 77 MB file is downloaded | Streams audio to Deepgram over a WebSocket; transcript comes back in chunks |
| Latency for a live agent | Decode happens after a chunk of audio; you wait for the window to close before you get text | Interim (non-final) results arrive mid-sentence, so the loop can react before you stop talking |
| Accuracy at this size | tiny is the lowest-accuracy Whisper model; expect mistakes on names and jargon | Nova-3 with a keyterm vocabulary list, tuned for command-style speech |
| Privacy | Audio never leaves the machine | Audio is sent to a third-party ASR endpoint (the honest tradeoff we made for latency) |
| Hallucination on silence | tiny is especially prone to repeated-token loops on silent or low-energy audio | We added a degenerate-repeat detector in TranscriptionService.swift to drop those segments |

Not a knock on whisper.cpp. tiny is a great offline first-pass model; it is just optimized for a different constraint than a real-time agent loop.

If you landed here because you are wiring voice into something on a Mac and you are weighing local Whisper against a hosted streaming model, that is exactly the call Fazm made in production. You can read the agent it feeds into, or talk it through.


ggml-tiny.bin, answered

What is the exact download URL for ggml-tiny.bin?

The direct file is https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin. The viewable page (with the Files and versions tab) is https://huggingface.co/ggerganov/whisper.cpp/blob/main/ggml-tiny.bin. As of 2026-06-17 the file is 77,691,713 bytes (about 77.7 MB).

How big is ggml-tiny.bin?

77,691,713 bytes, roughly 77.7 MB on disk. It is the smallest of the standard whisper.cpp GGML models. The English-only ggml-tiny.en.bin is almost identical at 77,704,715 bytes.

Do I have to use Hugging Face, or can the script fetch it?

Either works. The official whisper.cpp way is ./models/download-ggml-model.sh tiny, which constructs the same Hugging Face URL and saves the file into models/. Manually downloading from the ggerganov/whisper.cpp repo gives you the identical binary; there is no extra conversion step.
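If you would rather fetch the file from your own code, the same URL pattern is easy to reproduce. This is a sketch rather than a port of the official script: the function names and the models/ default are assumptions, and the naming pattern is inferred from the ggml-tiny.bin and ggml-tiny.en.bin URLs on this page.

```python
import os
import shutil
import urllib.request

HF_BASE = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main"


def model_url(name):
    """Return the direct download URL for a model name such as 'tiny' or 'tiny.en'."""
    return f"{HF_BASE}/ggml-{name}.bin"


def download_model(name, dest_dir="models"):
    """Stream the model file to dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, f"ggml-{name}.bin")
    # urlopen follows redirects by default; copyfileobj streams the body
    # to disk instead of holding the whole file in memory.
    with urllib.request.urlopen(model_url(name)) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Calling download_model("tiny") would write models/ggml-tiny.bin, which you can then compare against the published byte count.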

What is the difference between ggml-tiny.bin and ggml-tiny.en.bin?

ggml-tiny.bin is multilingual; ggml-tiny.en.bin is trained only on English. At the same ~77.7 MB size, the .en variant tends to be a little more accurate on English audio because it does not spend parameters on other languages. If you only ever transcribe English, prefer .en.

Why does the tiny model produce repeated garbage on silent audio?

Small Whisper models, tiny most of all, are prone to degenerate repeated-token loops when fed silence or low-energy audio. They were trained on speech, so silence pushes them into a hallucination loop. The fix is to gate the model behind voice-activity detection and/or detect the repeated-token pattern and discard those segments. We do the latter in our transcription layer.
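The repeated-token check can be sketched in a few lines. This is not the detector in Fazm's transcription layer, whose details are not shown here. It is a minimal illustration that flags any short phrase repeated back to back, with a made-up function name and thresholds (max_ngram, min_repeats) you would tune on your own audio.

```python
import re


def is_degenerate_repeat(text, max_ngram=4, min_repeats=4):
    """Return True if a phrase of 1..max_ngram words repeats
    at least min_repeats times in a row."""
    tokens = re.findall(r"\w+", text.lower())
    for n in range(1, max_ngram + 1):
        span = n * min_repeats
        if len(tokens) < span:
            continue
        for start in range(len(tokens) - span + 1):
            unit = tokens[start:start + n]
            # Check that each following n-word slice equals the first one.
            if all(
                tokens[start + k * n:start + (k + 1) * n] == unit
                for k in range(1, min_repeats)
            ):
                return True
    return False
```

A segment such as "thank you thank you thank you thank you" is flagged, while an ordinary command is not. Legitimate emphasis ("really really really really") would also be flagged, so a real system would likely combine this with voice-activity detection rather than rely on it alone.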

Is ggml-tiny.bin good enough for a real-time voice agent?

For wake-word detection or rough notes, yes. For a voice agent that has to act on what you said the moment you said it, the bottleneck is not just word accuracy, it is when the text arrives. whisper.cpp decodes a window after it closes, so you wait. That is why Fazm streams to a real-time ASR (Deepgram Nova-3) that returns interim results mid-utterance instead of running tiny locally. Different jobs, different tools.

What does Fazm actually run for voice input?

Fazm's TranscriptionService.swift hardcodes the model nova-3 and opens a WebSocket to wss://api.deepgram.com/v1/listen, sending audio chunks and receiving streamed transcript segments with a custom keyterm vocabulary. It is a deliberate latency-over-locality choice for the agent loop, not a whisper.cpp build.
