download-ggml-model.sh large-v3-turbo: Complete Guide to Downloading Whisper Models

Matthew Diakonov · 9 min read

If you are setting up whisper.cpp for local speech recognition, the first thing you need after compiling is a model file. The download-ggml-model.sh script is the official way to fetch pre-converted GGML models from Hugging Face. Running ./models/download-ggml-model.sh large-v3-turbo gives you the best balance of speed and accuracy for real-time transcription on Apple Silicon.

What the Script Actually Does

The download-ggml-model.sh script lives in the models/ directory of the whisper.cpp repository. When you run it, here is what happens under the hood:

  1. It validates the model name you passed (in this case, large-v3-turbo)
  2. It constructs the download URL pointing to the Hugging Face ggerganov/whisper.cpp repository
  3. It downloads the GGML-format model file (ggml-large-v3-turbo.bin) using curl or wget
  4. It places the file in the models/ directory alongside the script itself

The downloaded file is a pre-converted GGML binary. You do not need Python, PyTorch, or any conversion tools. The file is ready to use with whisper.cpp immediately.
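In shell terms, the sequence above boils down to roughly the following (a simplified sketch; the real script also validates the name against a hardcoded list and picks curl or wget):

```shell
# Construct the Hugging Face URL and fetch the model (simplified sketch)
model="large-v3-turbo"
src="https://huggingface.co/ggerganov/whisper.cpp"
url="$src/resolve/main/ggml-$model.bin"
echo "$url"
# curl -L -o "models/ggml-$model.bin" "$url"   # the actual fetch, commented out here
```

Everything else in the script is bookkeeping around this one download.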

Diagram: download-ggml-model.sh fetches the pre-converted GGML binary (ggml-large-v3-turbo.bin, ~1.5 GB) from the Hugging Face ggerganov/whisper.cpp repository.

Step-by-Step Setup

# Clone whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Build with Metal support (automatic on Apple Silicon)
make -j

# Download the large-v3-turbo model
./models/download-ggml-model.sh large-v3-turbo

After the download completes, verify the file exists:

ls -lh models/ggml-large-v3-turbo.bin
# Expected: ~1.5 GB file

Then test it with a sample audio file:

./main -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav

You should see the transcription output within a few seconds on any M-series Mac.

Available Model Variants

The script accepts any of the following model names. Choosing the right one depends on your hardware, latency requirements, and accuracy needs.

| Model | Size | Speed (M2 Pro) | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 75 MB | ~30x real-time | Low | Quick prototyping, testing |
| tiny.en | 75 MB | ~30x real-time | Low (English only) | English-only prototypes |
| base | 142 MB | ~20x real-time | Fair | Lightweight apps |
| base.en | 142 MB | ~20x real-time | Fair (English only) | English dictation, low-end hardware |
| small | 466 MB | ~12x real-time | Good | Balanced for mid-range hardware |
| small.en | 466 MB | ~12x real-time | Good (English only) | English apps on constrained devices |
| medium | 1.5 GB | ~6x real-time | Very good | High accuracy, multilingual |
| medium.en | 1.5 GB | ~6x real-time | Very good (English only) | High accuracy English |
| large-v3 | 3.1 GB | ~3x real-time | Excellent | Maximum accuracy |
| large-v3-turbo | 1.5 GB | ~10x real-time | Near-excellent | Best speed/accuracy ratio |

The large-v3-turbo model is roughly half the size of large-v3 and runs 3x faster, while losing less than 1% accuracy on standard benchmarks. For real-time desktop use, it is the clear winner.

Tip

If you only need English transcription, large-v3-turbo still outperforms the English-only medium.en model in both speed and accuracy on Apple Silicon. The .en variants were more useful before the turbo model existed.

Why large-v3-turbo Over Other Models

The turbo variant was introduced by OpenAI as a distilled version of large-v3. The key differences:

  • Decoder layers reduced from 32 to 4, which is why it runs so much faster
  • Encoder unchanged, preserving the acoustic understanding of the full large model
  • File size halved from 3.1 GB to 1.5 GB, fitting comfortably in unified memory on any M-series Mac
  • Multilingual support preserved, unlike the .en variants which only handle English

For a desktop AI agent processing voice commands, the speed improvement matters more than the marginal accuracy difference. A command spoken to your computer needs to be transcribed in under a second to feel responsive. The full large-v3 model cannot do this reliably on Apple Silicon; the turbo variant can.

Script Internals and Options

The script is a straightforward Bash file. Here are the key details:

# The script constructs this URL pattern:
# https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-{model}.bin

# Usage:
./models/download-ggml-model.sh <model-name>

# Examples:
./models/download-ggml-model.sh large-v3-turbo
./models/download-ggml-model.sh base.en
./models/download-ggml-model.sh small

The script checks for curl first, then falls back to wget. On macOS, curl is always available, so this works out of the box. On minimal Linux containers, you may need to install one of them.
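The fallback logic looks roughly like this (a simplified sketch, not the script's exact code):

```shell
# Pick a downloader: prefer curl, fall back to wget (simplified sketch)
if command -v curl >/dev/null 2>&1; then
    downloader="curl -L -o"
elif command -v wget >/dev/null 2>&1; then
    downloader="wget -O"
else
    echo "Error: neither curl nor wget found" >&2
    exit 1
fi
echo "using: $downloader"
```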

If the download is interrupted, simply re-run the command. The script will overwrite the partial file and start fresh. There is no resume support built in, but for a 1.5 GB file on a modern connection this rarely matters.
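If you do want resume behavior, you can bypass the script and call curl yourself: `-C -` tells curl to continue from whatever partial file already exists. Shown here as a dry run (the `echo` prints the command instead of running it); the URL pattern is the one the script uses:

```shell
# Resume-capable manual download via curl -C - (dry run; drop `echo` to execute)
model="large-v3-turbo"
url="https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-$model.bin"
echo curl -L -C - -o "models/ggml-$model.bin" "$url"
```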

Common Pitfalls

  • Running the script from the wrong directory. The script expects to be run from the whisper.cpp root, or at least from a path where models/ exists as a subdirectory. If you run bash download-ggml-model.sh large-v3-turbo directly inside the models/ folder, it will still work, but the file will land in a nested models/models/ path. Always run it as ./models/download-ggml-model.sh large-v3-turbo from the repo root.

  • Typos in the model name. The script does not give a helpful error for invalid model names. If you type large-v3turbo (missing the hyphen) or largev3-turbo, you will get a 404 from Hugging Face that shows up as a small HTML error page saved as the .bin file. If your model file is under 1 KB, you downloaded an error page, not a model. Delete it and retry with the correct name.

  • Firewall or proxy blocking Hugging Face. Corporate networks sometimes block huggingface.co. The symptom is a timeout or connection refused error. You can download the file manually from the Hugging Face web UI and place it in the models/ directory yourself.

  • Disk space. The large-v3-turbo model is 1.5 GB. If your disk is nearly full, the download will fail partway through. Check with df -h . before downloading.

  • Confusing GGML with other formats. If you download a PyTorch .pt file or a Core ML .mlmodelc from somewhere else, whisper.cpp will not load it. The script downloads the correct GGML format. Stick with it.

Warning

If whisper.cpp crashes with "invalid model data" after downloading, the file is likely corrupted or is an HTML error page. Delete models/ggml-large-v3-turbo.bin and re-run the download script. Check the file size first: it should be approximately 1.5 GB.
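A quick size check catches both failure modes before whisper.cpp ever loads the file. A minimal sketch (the 1 MB threshold is an arbitrary cutoff chosen because every real Whisper model is far larger):

```shell
# Sanity-check a downloaded model file: a real GGML model is ~1.5 GB,
# while a failed download often leaves a tiny HTML error page with a .bin name.
check_model() {
    if [ ! -f "$1" ]; then
        echo "missing"
    elif [ "$(wc -c < "$1")" -lt 1000000 ]; then
        echo "too small: likely an HTML error page, delete and re-download"
    else
        echo "size looks plausible"
    fi
}
check_model models/ggml-large-v3-turbo.bin
```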

Quantized Alternatives

If you want to reduce memory usage further, whisper.cpp supports quantized model formats. These are not available through the download script directly, but you can convert them yourself:

# After downloading the standard model
./models/download-ggml-model.sh large-v3-turbo

# Quantize to Q5_0 (reduces size by ~40%, minimal accuracy loss)
./quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0

| Format | Size | Memory Usage | Accuracy Impact |
|---|---|---|---|
| Full (f16) | 1.5 GB | ~1.6 GB RAM | Baseline |
| Q8_0 | ~800 MB | ~900 MB RAM | Negligible |
| Q5_0 | ~550 MB | ~600 MB RAM | Very slight |

For most desktop use cases, the full f16 model is fine because Apple Silicon Macs have plenty of unified memory. Quantization is more useful on constrained devices like Raspberry Pi or older Intel machines.

Using the Model After Download

Once the file is in place, here is a minimal working example for transcribing an audio file:

# Transcribe a WAV file
./main -m models/ggml-large-v3-turbo.bin -f input.wav

# Also write the transcript to a text file (input.wav.txt)
./main -m models/ggml-large-v3-turbo.bin -f input.wav -otxt

# Transcribe from microphone (requires ffmpeg for audio capture)
ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f wav - | ./main -m models/ggml-large-v3-turbo.bin -f -

# Output as JSON
./main -m models/ggml-large-v3-turbo.bin -f input.wav -oj
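The same invocation extends naturally to batch work. A dry-run sketch over throwaway files (the `echo` prints each command instead of running it; `-of` sets the output file basename in recent whisper.cpp builds):

```shell
# Batch-transcribe every WAV in a directory (dry run; drop `echo` to execute)
mkdir -p demo && : > demo/a.wav && : > demo/b.wav
for f in demo/*.wav; do
    echo ./main -m models/ggml-large-v3-turbo.bin -f "$f" -otxt -of "${f%.wav}"
done
```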

For integration into applications, whisper.cpp also provides a C API and bindings for Python (pywhispercpp), Swift, Go, and other languages. The model path is the same regardless of which binding you use.

Wrapping Up

The download-ggml-model.sh large-v3-turbo command is the fastest path to high-quality local speech recognition. It fetches a pre-converted model that works immediately with whisper.cpp, no conversion pipeline needed. For real-time voice input on Apple Silicon, the turbo variant gives you near-large-v3 accuracy at 10x real-time speed, which is fast enough for interactive use in desktop agents and voice-controlled workflows.

Fazm is an open-source macOS AI agent that uses local whisper.cpp for voice input; the code is on GitHub.
