ggml-large-v3-turbo.bin: The Fast Whisper Model for Real-Time Transcription

Matthew Diakonov · 9 min read


The ggml-large-v3-turbo.bin file is the GGML-format version of OpenAI's distilled Whisper large-v3-turbo model, built for use with whisper.cpp. At approximately 1.5 GB, it runs roughly 6x faster than the full large-v3 model while retaining most of its accuracy. If you need local speech recognition that can keep up with real-time audio on consumer hardware, this is the model to use.

What Makes the Turbo Model Different

The large-v3-turbo model is a distilled version of the full large-v3. OpenAI kept the full 32-layer encoder (the part that processes audio) but compressed the decoder from 32 layers down to just 4. The decoder is the part that generates text tokens from the encoded audio representation, and it turns out you can aggressively shrink it without losing much accuracy.

Architecture Comparison: large-v3 vs large-v3-turbo

| | large-v3 (full) | large-v3-turbo |
|---|---|---|
| Encoder | 32 layers, 1280 dim, 20 heads | 32 layers, 1280 dim, 20 heads |
| Decoder | 32 layers, 1280 dim, 20 heads | 4 layers, 1280 dim, 20 heads |
| Parameters | 1550M | 809M |
| File size | 3.1 GB | 1.5 GB |
| Speed | ~3x real-time | ~10x real-time |

Same encoder = same audio understanding. Smaller decoder = faster text generation.

This design means the turbo model hears audio just as well as the full model. The only place it loses fidelity is in the text generation step, and in practice that difference is small enough that most users will not notice it.
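The parameter gap is easy to sanity-check with back-of-envelope arithmetic. Assuming each decoder layer is a standard transformer block with self-attention (~4·d²), cross-attention (~4·d²), and a 4x-wide feed-forward network (~8·d²), and ignoring biases and layer norms:

```python
# Rough parameter accounting for the Whisper large decoder (d_model = 1280).
# Assumption: ~16*d^2 parameters per decoder layer (self-attention,
# cross-attention, and a 4x feed-forward block), biases and norms ignored.
d = 1280
per_decoder_layer = 16 * d * d          # ~26.2M parameters per layer
saved = 28 * per_decoder_layer          # turbo drops 32 - 4 = 28 layers
print(f"per decoder layer: {per_decoder_layer / 1e6:.1f}M")  # per decoder layer: 26.2M
print(f"removed by turbo:  {saved / 1e6:.0f}M")              # removed by turbo:  734M
```

The ~734M parameters removed lines up closely with the published gap between 1550M (full) and 809M (turbo).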

How to Download ggml-large-v3-turbo.bin

The fastest method uses the built-in download script from whisper.cpp:

# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Download the turbo model
./models/download-ggml-model.sh large-v3-turbo

This fetches ggml-large-v3-turbo.bin (approximately 1.5 GB) into the models/ directory. You can also download directly:

# Direct download via curl
curl -L -o models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

Verify the download:

ls -lh models/ggml-large-v3-turbo.bin
# Expected: approximately 1.5 GB

# Quick test
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav

If the file is significantly smaller than 1.5 GB, the download was likely interrupted or you received an HTML error page from Hugging Face. Delete the file and try again.
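You can automate that sanity check. This sketch verifies both the size and the magic bytes whisper.cpp checks when loading a GGML file; `looks_like_ggml_model` is a hypothetical helper, and the magic constant is an assumption based on whisper.cpp's loader:

```python
import struct
from pathlib import Path

GGML_MAGIC = 0x67676D6C  # "ggml" -- the magic whisper.cpp checks at load time

def looks_like_ggml_model(path, min_bytes=1_400_000_000):
    """Cheap sanity check: plausible size and the right magic bytes.

    min_bytes is a rough lower bound for the f16 turbo model (~1.5 GB);
    lower it for quantized variants.
    """
    p = Path(path)
    if p.stat().st_size < min_bytes:
        return False  # truncated download or an HTML error page
    with p.open("rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic == GGML_MAGIC
```

An HTML error page from a failed redirect fails both checks, so this catches the most common bad-download cases before you waste time on a load error.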

Performance Benchmarks

Here are real-world numbers for ggml-large-v3-turbo.bin across different Apple Silicon chips, processing a 10-minute English audio file:

| Chip | Real-time Factor | 10 min Audio | RAM Usage | Metal GPU |
|---|---|---|---|---|
| M1 | ~6x real-time | ~1.7 min | ~1.7 GB | Yes |
| M1 Pro | ~8x real-time | ~1.3 min | ~1.7 GB | Yes |
| M2 Pro | ~10x real-time | ~1.0 min | ~1.7 GB | Yes |
| M3 Max | ~14x real-time | ~0.7 min | ~1.7 GB | Yes |
| M4 Pro | ~16x real-time | ~0.6 min | ~1.7 GB | Yes |

"Real-time factor" means how many seconds of audio the model processes per second. An M2 Pro at 10x processes 10 seconds of audio in 1 second. Anything above 1x means the model can transcribe faster than you can speak, which is the minimum for live dictation.

Tip

On any Apple Silicon Mac from M1 onward, ggml-large-v3-turbo.bin is fast enough for real-time voice input. Even the base M1 chip at ~6x real-time leaves plenty of headroom for streaming transcription while running other applications.

Quantized Variants

If 1.5 GB is still more than you want, whisper.cpp supports quantization to reduce the file size and memory footprint:

# Download full turbo model first
./models/download-ggml-model.sh large-v3-turbo

# Quantize to q8_0 (recommended for turbo)
./build/bin/quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q8_0.bin q8_0

# Quantize to q5_0 (smaller, slight accuracy trade-off)
./build/bin/quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0

| Variant | File Size | RAM | Speed Impact | Accuracy Impact |
|---|---|---|---|---|
| f16 (default .bin) | 1.5 GB | ~1.7 GB | Baseline | Baseline |
| q8_0 | ~800 MB | ~900 MB | ~5% faster | Negligible |
| q5_0 | ~550 MB | ~650 MB | ~10% faster | Minor on non-English |
| q4_0 | ~450 MB | ~550 MB | ~15% faster | Noticeable on edge cases |
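To give a sense of why q8_0 loses so little: ggml's q8_0 format groups weights into blocks of 32 values that share one scale, and stores each value as a signed 8-bit integer. Here is a rough sketch of the idea (illustrative only; ggml's actual implementation differs in details, e.g. it stores the scale as fp16):

```python
import numpy as np

def quantize_q8_0(block):
    """Sketch of q8_0: one scale per 32-value block, int8 per value."""
    assert block.size == 32
    amax = float(np.abs(block).max())
    d = amax / 127.0 if amax else 1.0     # per-block scale
    q = np.round(block / d).astype(np.int8)
    return d, q

def dequantize_q8_0(d, q):
    """Reconstruct approximate float weights from scale + int8 values."""
    return d * q.astype(np.float32)
```

Because every block gets its own scale, the worst-case rounding error per weight is half a quantization step, which is why the accuracy impact is negligible in practice.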

The q8_0 variant is the sweet spot for most users. It cuts memory usage nearly in half with no perceptible accuracy loss. Pre-quantized versions are also available on Hugging Face, so you can skip the quantization step entirely:

curl -L -o models/ggml-large-v3-turbo-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q5_0.bin

Using the Turbo Model

Command Line

# Basic transcription
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav

# Auto-detect language
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -l auto

# Real-time microphone input (whisper-stream; requires building with -DWHISPER_SDL2=ON)
./build/bin/whisper-stream -m models/ggml-large-v3-turbo.bin --step 500 --length 5000

# SRT subtitle output
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -osrt

# JSON with word-level timestamps
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -oj -ml 1
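The -osrt flag writes SubRip subtitle files, whose timestamps use the HH:MM:SS,mmm convention. As a sketch of that convention (`srt_timestamp` is a hypothetical helper, not part of whisper.cpp):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as SubRip's HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(83.5))  # 00:01:23,500
```

Note that SRT uses a comma before the milliseconds, unlike the dot most other timestamp formats use.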

Python (pywhispercpp)

from pywhispercpp.model import Model

model = Model("models/ggml-large-v3-turbo.bin")
segments = model.transcribe("meeting.wav")
for seg in segments:
    print(f"[{seg.t0:.2f} -> {seg.t1:.2f}] {seg.text}")

Swift (whisper.cpp bindings)

import whisper

guard let ctx = whisper_init_from_file("models/ggml-large-v3-turbo.bin") else {
    fatalError("Failed to load model")
}

var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
// strdup keeps the C string alive for the whole inference call; a pointer
// captured inside withCString would dangle as soon as the closure returns
params.language = UnsafePointer(strdup("en"))
params.n_threads = 4

// Run inference on a 16 kHz mono PCM float array; non-zero means failure
guard whisper_full(ctx, params, pcmData, Int32(pcmData.count)) == 0 else {
    fatalError("whisper_full failed")
}

let nSegments = whisper_full_n_segments(ctx)
for i in 0..<nSegments {
    let text = String(cString: whisper_full_get_segment_text(ctx, i))
    print(text)
}

whisper_free(ctx)

Common Pitfalls

Confusing the turbo model with the full large-v3. Both model files start with ggml-large-v3, so it is easy to load the wrong one. Check the file size: 1.5 GB is the turbo, 3.1 GB is the full model. If your file is 3.1 GB, you downloaded ggml-large-v3.bin (without -turbo).

ls -lh models/ggml-large-v3*.bin
# ggml-large-v3.bin       ~3.1 GB  (full, slow)
# ggml-large-v3-turbo.bin ~1.5 GB  (turbo, fast)

Building without Metal support. On macOS, whisper.cpp uses Apple's Metal framework to run inference on the GPU. If you build without -DGGML_METAL=ON, transcription is roughly 3x slower. Look for ggml_metal_init in the runtime log output to confirm GPU acceleration is active.

Using the wrong audio format. whisper.cpp expects 16-bit WAV at 16kHz. If you pass a different format (MP3, M4A, higher sample rate), the model will either refuse the file or produce garbage output. Convert first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
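If you want to catch format problems before handing a file to the model, Python's standard-library wave module can inspect the header (`is_whisper_ready` is an illustrative helper):

```python
import wave

def is_whisper_ready(path):
    """True if the file is 16-bit PCM mono at 16 kHz, the format whisper.cpp expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

Anything that fails this check (or is not a WAV at all) should go through the ffmpeg conversion above first.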

Expecting identical output to the full model. The turbo variant may differ on specific words in challenging audio (heavy background noise, thick accents, code-switching between languages). For most English content, the difference is under 0.5% word error rate. For rare languages with less training data, the gap can be closer to 1-2%.
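Word error rate, the metric behind those percentages, is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch for comparing two transcripts yourself:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)
```

Running both models over the same file and comparing their outputs this way is a quick check of whether the turbo/full gap matters for your audio.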

Warning

If you see whisper_init: invalid model data after downloading, the file is likely corrupted or incomplete. Delete it and re-download. This happens most often when curl follows a redirect to an HTML error page instead of the actual model binary.

When to Choose Turbo vs Full

| Scenario | Recommended Model | Why |
|---|---|---|
| Live dictation, voice commands | turbo | Must run faster than real-time |
| Meeting transcription (batch) | turbo | Good enough accuracy, finishes in minutes |
| Archival transcription, legal, medical | full large-v3 | Maximum accuracy on every word |
| Non-English, rare languages | full large-v3 | Turbo loses more on underrepresented languages |
| 8 GB RAM Mac, multitasking | turbo (q8_0) | Only ~900 MB RAM vs 3.3 GB |
| Raspberry Pi, embedded devices | turbo (q5_0 or q4_0) | Smallest footprint that still works well |

For the majority of users running whisper.cpp on a Mac for everyday transcription, ggml-large-v3-turbo.bin is the right default. You get large-v3 quality audio encoding with a decoder fast enough for real-time use.

Wrapping Up

ggml-large-v3-turbo.bin gives you 95%+ of the full large-v3 model's accuracy at a fraction of the compute cost. The 4-layer decoder runs fast enough for real-time transcription on any Apple Silicon Mac, and quantized variants bring it down to under 1 GB of RAM. For local, private speech recognition with whisper.cpp, it is the best starting point.

Fazm is an open-source macOS AI agent that uses whisper.cpp with the turbo model for real-time voice input. Learn more, or browse the source on GitHub.
