whisper.cpp Metal on Apple Silicon: GPU Acceleration for Local Speech-to-Text
Metal is Apple's GPU framework, and whisper.cpp uses it to offload matrix multiplications to the Apple Silicon GPU during inference. The result: speech-to-text that runs 2x to 4x faster than CPU-only mode, depending on your chip and model size. If you are running whisper.cpp on any M-series Mac and not using Metal, you are leaving performance on the table.
How Metal Acceleration Works in whisper.cpp
whisper.cpp uses ggml as its tensor computation backend. When you build with Metal enabled, ggml compiles a set of Metal Shading Language (MSL) kernels that handle the heavy linear algebra operations on the GPU. The CPU still orchestrates the inference loop, manages memory, and runs operations that are not worth offloading (small element-wise ops, for example). The large matrix multiplications in the encoder and decoder layers go to the GPU.
The key advantage on Apple Silicon is unified memory. On a discrete GPU system (NVIDIA + CUDA), tensors must be copied across the PCIe bus between CPU RAM and GPU VRAM. On Apple Silicon, the CPU and GPU share the same physical memory pool. This means the Metal backend avoids the copy overhead entirely, which matters especially for smaller models where transfer time would dominate compute time.
Building whisper.cpp with Metal
Metal support is enabled by default when you build on macOS with an M-series chip. Here is the standard build:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
```
The `GGML_METAL=ON` flag is technically the default on macOS, but specifying it explicitly makes your build script self-documenting. The build process compiles the Metal shaders from `ggml/src/ggml-metal/ggml-metal.metal` into a `metallib` that gets loaded at runtime.
To verify Metal is active, run any transcription and look for this line in the output:
```
ggml_metal_init: found device: Apple M2 Pro
```
If you see that, the GPU is in play. If you do not see any `ggml_metal_init` lines, something went wrong with the build.
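If you capture the run output to a log file, this check can be scripted. The sketch below inlines a sample log for illustration; in a real run you would redirect whisper-cli's output to the file instead:

```shell
# Sample log for illustration; a real one comes from e.g.:
#   ./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav 2> /tmp/whisper_run.log
cat > /tmp/whisper_run.log <<'EOF'
whisper_init_from_file_with_params_no_state: loading model
ggml_metal_init: found device: Apple M2 Pro
EOF

# The presence of the ggml_metal_init line confirms the GPU backend loaded
if grep -q 'ggml_metal_init: found device' /tmp/whisper_run.log; then
  echo "Metal backend active"
else
  echo "Metal backend NOT active (CPU fallback)"
fi
```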
Warning
If you build with `make` instead of `cmake`, Metal is still supported but the flags differ: use `GGML_METAL=1 make -j`. The cmake path is recommended for consistency with upstream changes.
Disabling Metal (for comparison or debugging)
If you want to benchmark CPU-only performance or debug an issue:
```bash
cmake -B build-cpu -DGGML_METAL=OFF
cmake --build build-cpu --config Release -j$(sysctl -n hw.logicalcpu)
```
This is useful for isolating whether a problem is Metal-specific or a general inference bug.
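A small timing wrapper makes the comparison repeatable. This is a sketch, not part of whisper.cpp; it assumes you built both `build/` (Metal) and `build-cpu/` as above, and it uses whole-second resolution via `date +%s`, which is adequate for multi-second transcriptions:

```shell
# Time a command and print the elapsed wall-clock seconds
bench () {
  label=$1; shift
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  end=$(date +%s)
  echo "$label: $(( end - start ))s"
}

# Hypothetical usage, assuming both builds and an audio file exist:
# bench "metal" ./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav
# bench "cpu"   ./build-cpu/bin/whisper-cli -m models/ggml-base.bin -f audio.wav
bench "noop" true
```

Run each variant a few times and take the best result; the first run pays one-time model-load and shader-load costs.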
Performance: Metal vs. CPU-Only
The speedup from Metal depends on the model size and your specific Apple Silicon chip. Larger models benefit more because they have more matrix multiplication work to offload. Here are representative numbers for transcribing a 60-second English audio clip:
| Chip | Model | CPU-only (seconds) | Metal GPU (seconds) | Speedup |
|---|---|---|---|---|
| M1 | tiny | 1.2 | 0.9 | 1.3x |
| M1 | base | 2.8 | 1.5 | 1.9x |
| M1 | large-v3-turbo | 12.1 | 5.8 | 2.1x |
| M2 Pro | tiny | 0.8 | 0.6 | 1.3x |
| M2 Pro | base | 1.9 | 0.9 | 2.1x |
| M2 Pro | large-v3-turbo | 8.4 | 3.2 | 2.6x |
| M3 Max | large-v3-turbo | 6.2 | 1.9 | 3.3x |
| M4 Pro | large-v3-turbo | 5.5 | 1.6 | 3.4x |
A few patterns to note:
- The tiny model gets minimal benefit from Metal (1.3x). The model is so small that the overhead of dispatching to the GPU nearly matches the compute savings.
- The large-v3-turbo model consistently sees 2x+ speedup across all chips. This is the sweet spot for Metal acceleration.
- Newer chips with more GPU cores (M3 Max with 40 cores, M4 Pro with 20 cores) see larger speedups because more work can run in parallel on the GPU.
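Two derived numbers are worth computing from a table like this: the speedup (CPU time divided by GPU time) and the real-time factor (clip length divided by GPU time). Using the M2 Pro large-v3-turbo row above (8.4 s CPU, 3.2 s Metal, 60 s clip) as a worked example:

```shell
# Speedup and real-time factor for the M2 Pro / large-v3-turbo row above
awk 'BEGIN {
  clip = 60.0; cpu = 8.4; gpu = 3.2
  printf "speedup: %.1fx\n", cpu / gpu            # how much faster Metal is
  printf "real-time factor: %.1fx\n", clip / gpu  # audio seconds per wall second
}'
```

This reports a 2.6x speedup and a real-time factor around 19x: with Metal, the chip transcribes 60 seconds of audio in 3.2 seconds of wall time, leaving plenty of headroom for live use.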
Choosing the Right Model for Metal
Not every whisper model is worth running with Metal. The choice depends on your use case:
| Model | Parameters | Metal benefit | Best for |
|---|---|---|---|
| tiny / tiny.en | 39M | Minimal (1.3x) | Quick prototyping, CI pipelines |
| base / base.en | 74M | Moderate (1.9x) | Realtime on constrained devices |
| small / small.en | 244M | Good (2.2x) | Balanced accuracy and speed |
| medium / medium.en | 769M | Good (2.5x) | High accuracy, can tolerate slight delay |
| large-v3-turbo | 809M | Strong (2.6x+) | Best accuracy-to-speed ratio |
| large-v3 | 1.5B | Strong (3x+) | Maximum accuracy, batch processing |
For a voice-controlled desktop agent that needs to respond in real time, large-v3-turbo with Metal is the recommended choice. It delivers near large-v3 accuracy at roughly half the inference time.
Download the model:
```bash
./models/download-ggml-model.sh large-v3-turbo
```
Tuning Metal Performance
Thread count
whisper.cpp uses CPU threads for operations that are not offloaded to Metal. The default thread count is 4. On machines with many performance cores, increasing this can help:
```bash
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -t 8
```
On an M2 Pro (8 performance cores + 4 efficiency cores), -t 8 is a good starting point. Going beyond your performance core count provides diminishing returns because the efficiency cores are slower.
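Rather than hard-coding the thread count, you can query it on macOS. A sketch: `hw.perflevel0.physicalcpu` is the sysctl key for the performance-core count on Apple Silicon, and the fallback of 4 matches whisper.cpp's default:

```shell
# Use one thread per performance core; fall back to 4 if the key is unavailable
PCORES=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || echo 4)
echo "using -t $PCORES"

# Hypothetical invocation with the detected count:
# ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -t "$PCORES"
```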
GPU offloading
When Metal is enabled, whisper.cpp offloads all supported operations to the GPU by default. Unlike llama.cpp, whisper-cli has no per-layer `-ngl` knob; GPU use is effectively all-or-nothing. To force CPU inference in a Metal-enabled build, use the `-ng` (`--no-gpu`) flag:
```bash
# Force CPU-only inference even in a Metal-enabled build
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -ng
```
Partial offloading would rarely matter on Apple Silicon anyway, because unified memory means there is no VRAM limit to stay under. The main reasons to disable the GPU are CPU-baseline benchmarks and leaving GPU bandwidth for other GPU-heavy applications running at the same time.
Flash attention
whisper.cpp supports flash attention on Metal, which reduces memory usage and can improve throughput for longer audio segments:
```bash
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -fa
```
The `-fa` flag enables flash attention. On Apple Silicon with Metal, this is especially effective because it reduces the memory bandwidth pressure on the unified memory bus.
Tip
Combine Metal with flash attention for the best throughput: `-fa` reduces memory pressure while the GPU handles the heavy compute. On an M2 Pro with large-v3-turbo, this combination processes 60 seconds of audio in about 2.8 seconds.
Common Pitfalls
- **Building without Xcode Command Line Tools.** Metal shader compilation requires the Xcode CLT. If you see errors about missing `metal` or `metallib` during the build, run `xcode-select --install` first.
- **Running on Intel Macs and expecting Metal.** For whisper.cpp, Metal compute shaders work on Apple Silicon only. Intel Macs with AMD GPUs have Metal support in theory, but the ggml Metal backend targets Apple Silicon's unified memory architecture specifically. On Intel Macs, you will fall back to CPU-only inference.
- **Assuming more threads always helps.** Setting `-t 16` on an M1 with 4 performance cores will not speed things up. The efficiency cores are roughly half the speed of performance cores for this workload. Match your thread count to your performance core count.
- **Forgetting to check the model format.** whisper.cpp uses GGML-format model files, not the original PyTorch weights. If you download a `.pt` file from Hugging Face, you need to convert it with `convert-pt-to-ggml.py` before whisper.cpp can load it.
- **Not updating whisper.cpp.** The Metal backend gets regular performance improvements. A build from six months ago could be 20-30% slower than current `master` on the same hardware. Pull and rebuild periodically.
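The model-format pitfall can be caught with a four-byte check. This sketch assumes the GGML magic number 0x67676d6c ("ggml") used by whisper.cpp's loader, which appears on disk in little-endian order as the bytes `lmgg`; the file below is a stand-in created purely for illustration:

```shell
# Stand-in file for illustration; point this at a real downloaded model instead
printf 'lmgg' > /tmp/model-to-check.bin

# GGML magic 0x67676d6c stored little-endian reads as "lmgg" on disk (assumed)
magic=$(head -c 4 /tmp/model-to-check.bin)
if [ "$magic" = "lmgg" ]; then
  echo "looks like a ggml model"
else
  echo "not a ggml model (a .pt file needs conversion first)"
fi
```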
Minimal Working Example
A complete example from clone to transcription:
```bash
# Clone and build with Metal
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Download the large-v3-turbo model
./models/download-ggml-model.sh large-v3-turbo

# Record 5 seconds of audio (or use any .wav file)
# sox is available via: brew install sox
rec -r 16000 -c 1 -b 16 /tmp/test.wav trim 0 5

# Transcribe with Metal + flash attention
./build/bin/whisper-cli \
    -m models/ggml-large-v3-turbo.bin \
    -f /tmp/test.wav \
    -t 8 \
    -fa \
    --print-realtime
```
You should see `ggml_metal_init: found device: Apple M...` in the output, followed by the transcription. On an M2 Pro, a 5-second clip transcribes in under 0.5 seconds with this configuration.
Metal vs. CoreML vs. WhisperKit
If you are on Apple Silicon, you have three options for local Whisper inference. Here is how they compare:
| Feature | whisper.cpp (Metal) | whisper.cpp (CoreML) | WhisperKit |
|---|---|---|---|
| Language | C/C++ | C/C++ + CoreML model | Swift |
| GPU backend | Metal compute shaders | Apple Neural Engine + GPU | CoreML (ANE + GPU) |
| Setup complexity | Low (cmake + build) | Medium (needs CoreML model conversion) | Low (Swift package) |
| Streaming support | Yes (real-time mode) | Yes | Yes |
| Model flexibility | Any GGML model | Requires CoreML conversion per model | Pre-converted models only |
| Best for | CLI tools, servers, cross-platform | Maximum throughput on ANE | Native Swift/iOS apps |
whisper.cpp with Metal is the most flexible option. It works with any GGML model, supports all whisper.cpp flags, and does not require model conversion. CoreML can be faster for specific models because it can use the Apple Neural Engine (ANE) in addition to the GPU, but it requires converting each model to CoreML format first. WhisperKit is the best choice if you are building a native Swift application and want a clean API.
For a command-line tool or a background daemon that processes audio files, whisper.cpp with Metal is the practical choice.
Wrapping Up
Metal acceleration turns whisper.cpp from "fast enough" to "real-time with headroom" on Apple Silicon. The build is straightforward, the unified memory architecture eliminates the copy overhead that plagues discrete GPU setups, and the speedup scales with model size. For local speech-to-text on a Mac, this is the setup that actually delivers.
Fazm is an open source macOS AI agent that uses local whisper.cpp for voice input; the code is available on GitHub.