ggml-large-v3-turbo.bin: The Fast Whisper Model for Real-Time Transcription

Matthew Diakonov · 9 min read


The ggml-large-v3-turbo.bin file is the GGML-format version of OpenAI's distilled Whisper large-v3-turbo model, built for use with whisper.cpp. At approximately 1.5 GB, it runs roughly 6x faster than the full large-v3 model while retaining most of its accuracy. If you need local speech recognition that can keep up with real-time audio on consumer hardware, this is the model to use.

What Makes the Turbo Model Different

The large-v3-turbo model is a distilled version of the full large-v3. OpenAI kept the full 32-layer encoder (the part that processes audio) but compressed the decoder from 32 layers down to just 4. The decoder is the part that generates text tokens from the encoded audio representation, and it turns out you can aggressively shrink it without losing much accuracy.

Architecture Comparison: large-v3 vs large-v3-turbo

| | large-v3 (full) | large-v3-turbo |
|---|---|---|
| Encoder | 32 layers, 1280 dim, 20 heads | 32 layers, 1280 dim, 20 heads |
| Decoder | 32 layers, 1280 dim, 20 heads | 4 layers, 1280 dim, 20 heads |
| Parameters | 1550M | 809M |
| File size | 3.1 GB | 1.5 GB |
| Speed | ~3x real-time | ~10x real-time |

Same encoder = same audio understanding. Smaller decoder = faster text generation.

This design means the turbo model hears audio just as well as the full model. The only place it loses fidelity is in the text generation step, and in practice that difference is small enough that most users will not notice it.
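The parameter gap is easy to sanity-check with back-of-envelope arithmetic. Assuming each decoder layer is a standard transformer block with self-attention (~4·d²), cross-attention (~4·d²), and a 4x-wide feed-forward network (~8·d²), and ignoring biases and layer norms:

```python
# Rough parameter accounting for the Whisper large decoder (d_model = 1280).
# Assumption: ~16*d^2 parameters per decoder layer (self-attention,
# cross-attention, and a 4x feed-forward block), biases and norms ignored.
d = 1280
per_decoder_layer = 16 * d * d          # ~26.2M parameters per layer
saved = 28 * per_decoder_layer          # turbo drops 32 - 4 = 28 layers
print(f"per decoder layer: {per_decoder_layer / 1e6:.1f}M")  # per decoder layer: 26.2M
print(f"removed by turbo:  {saved / 1e6:.0f}M")              # removed by turbo:  734M
```

The ~734M parameters removed lines up closely with the published gap between 1550M (full) and 809M (turbo).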

How to Download ggml-large-v3-turbo.bin

The fastest method uses the built-in download script from whisper.cpp:

# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Download the turbo model
./models/download-ggml-model.sh large-v3-turbo

This fetches ggml-large-v3-turbo.bin (approximately 1.5 GB) into the models/ directory. You can also download directly:

# Direct download via curl
curl -L -o models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

Verify the download:

ls -lh models/ggml-large-v3-turbo.bin
# Expected: approximately 1.5 GB

# Quick test
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav

If the file is significantly smaller than 1.5 GB, the download was likely interrupted or you received an HTML error page from Hugging Face. Delete the file and try again.
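You can automate that sanity check. This sketch verifies both the size and the magic bytes whisper.cpp checks when loading a GGML file; `looks_like_ggml_model` is a hypothetical helper, and the magic constant is an assumption based on whisper.cpp's loader:

```python
import struct
from pathlib import Path

GGML_MAGIC = 0x67676D6C  # "ggml" -- the magic whisper.cpp checks at load time

def looks_like_ggml_model(path, min_bytes=1_400_000_000):
    """Cheap sanity check: plausible size and the right magic bytes.

    min_bytes is a rough lower bound for the f16 turbo model (~1.5 GB);
    lower it for quantized variants.
    """
    p = Path(path)
    if p.stat().st_size < min_bytes:
        return False  # truncated download or an HTML error page
    with p.open("rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return magic == GGML_MAGIC
```

An HTML error page from a failed redirect fails both checks, so this catches the most common bad-download cases before you waste time on a load error.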

Performance Benchmarks

Here are real-world numbers for ggml-large-v3-turbo.bin across different Apple Silicon chips, processing a 10-minute English audio file:

| Chip | Real-time Factor | 10 min Audio | RAM Usage | Metal GPU |
|---|---|---|---|---|
| M1 | ~6x real-time | ~1.7 min | ~1.7 GB | Yes |
| M1 Pro | ~8x real-time | ~1.3 min | ~1.7 GB | Yes |
| M2 Pro | ~10x real-time | ~1.0 min | ~1.7 GB | Yes |
| M3 Max | ~14x real-time | ~0.7 min | ~1.7 GB | Yes |
| M4 Pro | ~16x real-time | ~0.6 min | ~1.7 GB | Yes |

"Real-time factor" means how many seconds of audio the model processes per second. An M2 Pro at 10x processes 10 seconds of audio in 1 second. Anything above 1x means the model can transcribe faster than you can speak, which is the minimum for live dictation.

Tip

On any Apple Silicon Mac from M1 onward, ggml-large-v3-turbo.bin is fast enough for real-time voice input. Even the base M1 chip at ~6x real-time leaves plenty of headroom for streaming transcription while running other applications.

Quantized Variants

If 1.5 GB is still more than you want, whisper.cpp supports quantization to reduce the file size and memory footprint:

# Download full turbo model first
./models/download-ggml-model.sh large-v3-turbo

# Quantize to q8_0 (recommended for turbo)
./build/bin/quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q8_0.bin q8_0

# Quantize to q5_0 (smaller, slight accuracy trade-off)
./build/bin/quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0

| Variant | File Size | RAM | Speed Impact | Accuracy Impact |
|---|---|---|---|---|
| f16 (default .bin) | 1.5 GB | ~1.7 GB | Baseline | Baseline |
| q8_0 | ~800 MB | ~900 MB | ~5% faster | Negligible |
| q5_0 | ~550 MB | ~650 MB | ~10% faster | Minor on non-English |
| q4_0 | ~450 MB | ~550 MB | ~15% faster | Noticeable on edge cases |
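To give a sense of why q8_0 loses so little: ggml's q8_0 format groups weights into blocks of 32 values that share one scale, and stores each value as a signed 8-bit integer. Here is a rough sketch of the idea (illustrative only; ggml's actual implementation differs in details, e.g. it stores the scale as fp16):

```python
import numpy as np

def quantize_q8_0(block):
    """Sketch of q8_0: one scale per 32-value block, int8 per value."""
    assert block.size == 32
    amax = float(np.abs(block).max())
    d = amax / 127.0 if amax else 1.0     # per-block scale
    q = np.round(block / d).astype(np.int8)
    return d, q

def dequantize_q8_0(d, q):
    """Reconstruct approximate float weights from scale + int8 values."""
    return d * q.astype(np.float32)
```

Because every block gets its own scale, the worst-case rounding error per weight is half a quantization step, which is why the accuracy impact is negligible in practice.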

The q8_0 variant is the sweet spot for most users. It cuts memory usage nearly in half with no perceptible accuracy loss. Pre-quantized versions are also available on Hugging Face, so you can skip the quantization step entirely:

curl -L -o models/ggml-large-v3-turbo-q5_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q5_0.bin

Using the Turbo Model

Command Line

# Basic transcription
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav

# Auto-detect language
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -l auto

# Real-time microphone input (whisper-stream; requires building with -DWHISPER_SDL2=ON)
./build/bin/whisper-stream -m models/ggml-large-v3-turbo.bin --step 500 --length 5000

# SRT subtitle output
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -osrt

# JSON with word-level timestamps
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -oj -ml 1
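The -osrt flag writes SubRip subtitle files, whose timestamps use the HH:MM:SS,mmm convention. As a sketch of that convention (`srt_timestamp` is a hypothetical helper, not part of whisper.cpp):

```python
def srt_timestamp(seconds):
    """Format a time in seconds as SubRip's HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

print(srt_timestamp(83.5))  # 00:01:23,500
```

Note that SRT uses a comma before the milliseconds, unlike the dot most other timestamp formats use.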

Python (pywhispercpp)

from pywhispercpp.model import Model

model = Model("models/ggml-large-v3-turbo.bin")
segments = model.transcribe("meeting.wav")
for seg in segments:
    print(f"[{seg.t0:.2f} -> {seg.t1:.2f}] {seg.text}")

Swift (whisper.cpp bindings)

import whisper

guard let ctx = whisper_init_from_file("models/ggml-large-v3-turbo.bin") else {
    fatalError("Failed to load model")
}

var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
// strdup keeps the C string alive for the whole inference call; a pointer
// captured inside withCString would dangle as soon as the closure returns
params.language = UnsafePointer(strdup("en"))
params.n_threads = 4

// Run inference on a 16 kHz mono PCM float array; non-zero means failure
guard whisper_full(ctx, params, pcmData, Int32(pcmData.count)) == 0 else {
    fatalError("whisper_full failed")
}

let nSegments = whisper_full_n_segments(ctx)
for i in 0..<nSegments {
    let text = String(cString: whisper_full_get_segment_text(ctx, i))
    print(text)
}

whisper_free(ctx)

Common Pitfalls

Confusing the turbo model with the full large-v3. Both model files start with ggml-large-v3, so it is easy to load the wrong one. Check the file size: 1.5 GB is the turbo, 3.1 GB is the full model. If your file is 3.1 GB, you downloaded ggml-large-v3.bin (without -turbo).

ls -lh models/ggml-large-v3*.bin
# ggml-large-v3.bin       ~3.1 GB  (full, slow)
# ggml-large-v3-turbo.bin ~1.5 GB  (turbo, fast)

Building without Metal support. On macOS, whisper.cpp uses Apple's Metal framework to run inference on the GPU. If you build without -DGGML_METAL=ON, transcription is roughly 3x slower. Look for ggml_metal_init in the runtime log output to confirm GPU acceleration is active.

Using the wrong audio format. whisper.cpp expects 16-bit WAV at 16kHz. If you pass a different format (MP3, M4A, higher sample rate), the model will either refuse the file or produce garbage output. Convert first:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
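If you want to catch format problems before handing a file to the model, Python's standard-library wave module can inspect the header (`is_whisper_ready` is an illustrative helper):

```python
import wave

def is_whisper_ready(path):
    """True if the file is 16-bit PCM mono at 16 kHz, the format whisper.cpp expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

Anything that fails this check (or is not a WAV at all) should go through the ffmpeg conversion above first.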

Expecting identical output to the full model. The turbo variant may differ on specific words in challenging audio (heavy background noise, thick accents, code-switching between languages). For most English content, the difference is under 0.5% word error rate. For rare languages with less training data, the gap can be closer to 1-2%.
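Word error rate, the metric behind those percentages, is the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal sketch for comparing two transcripts yourself:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)
```

Running both models over the same file and comparing their outputs this way is a quick check of whether the turbo/full gap matters for your audio.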

Warning

If you see whisper_init: invalid model data after downloading, the file is likely corrupted or incomplete. Delete it and re-download. This happens most often when curl follows a redirect to an HTML error page instead of the actual model binary.

When to Choose Turbo vs Full

| Scenario | Recommended Model | Why |
|---|---|---|
| Live dictation, voice commands | turbo | Must run faster than real-time |
| Meeting transcription (batch) | turbo | Good enough accuracy, finishes in minutes |
| Archival transcription, legal, medical | full large-v3 | Maximum accuracy on every word |
| Non-English, rare languages | full large-v3 | Turbo loses more on underrepresented languages |
| 8 GB RAM Mac, multitasking | turbo (q8_0) | Only ~900 MB RAM vs 3.3 GB |
| Raspberry Pi, embedded devices | turbo (q5_0 or q4_0) | Smallest footprint that still works well |

For the majority of users running whisper.cpp on a Mac for everyday transcription, ggml-large-v3-turbo.bin is the right default. You get large-v3 quality audio encoding with a decoder fast enough for real-time use.

Wrapping Up

ggml-large-v3-turbo.bin gives you 95%+ of the full large-v3 model's accuracy at a fraction of the compute cost. The 4-layer decoder runs fast enough for real-time transcription on any Apple Silicon Mac, and quantized variants bring it down to under 1 GB of RAM. For local, private speech recognition with whisper.cpp, it is the best starting point.

Fazm is an open-source macOS AI agent that uses whisper.cpp with the turbo model for real-time voice input. Learn more, or browse the source on GitHub.
