whisper.cpp Metal on Apple Silicon: GPU Acceleration for Local Speech-to-Text
Metal is Apple's GPU framework, and whisper.cpp uses it to offload matrix multiplications to the Apple Silicon GPU during inference. The result: speech-to-text that runs 2x to 4x faster than CPU-only mode, depending on your chip and model size. If you are running whisper.cpp on any M-series Mac and not using Metal, you are leaving performance on the table.
How Metal Acceleration Works in whisper.cpp
whisper.cpp uses ggml as its tensor computation backend. When you build with Metal enabled, ggml compiles a set of Metal Shading Language (MSL) kernels that handle the heavy linear algebra operations on the GPU. The CPU still orchestrates the inference loop, manages memory, and runs operations that are not worth offloading (small element-wise ops, for example). The large matrix multiplications in the encoder and decoder layers go to the GPU.
The key advantage on Apple Silicon is unified memory. On a discrete GPU system (NVIDIA + CUDA), tensors must be copied across the PCIe bus between CPU RAM and GPU VRAM. On Apple Silicon, the CPU and GPU share the same physical memory pool. This means the Metal backend avoids the copy overhead entirely, which matters especially for smaller models where transfer time would dominate compute time.
Building whisper.cpp with Metal
Metal support is enabled by default when you build on macOS with an M-series chip. Here is the standard build:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
```
The `GGML_METAL=ON` flag is technically the default on macOS, but specifying it explicitly makes your build script self-documenting. The build process compiles the Metal shaders from `ggml/src/ggml-metal/ggml-metal.metal` into a `metallib` that gets loaded at runtime.
To verify Metal is active, run any transcription and look for this line in the output:
```
ggml_metal_init: found device: Apple M2 Pro
```
If you see that, the GPU is in play. If you do not see any `ggml_metal_init` lines, something went wrong with the build.
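If you capture the run output to a log file, this check can be scripted. The sketch below inlines a sample log for illustration; in a real run you would redirect whisper-cli's output to the file instead:

```shell
# Sample log for illustration; a real one comes from e.g.:
#   ./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav 2> /tmp/whisper_run.log
cat > /tmp/whisper_run.log <<'EOF'
whisper_init_from_file_with_params_no_state: loading model
ggml_metal_init: found device: Apple M2 Pro
EOF

# The presence of the ggml_metal_init line confirms the GPU backend loaded
if grep -q 'ggml_metal_init: found device' /tmp/whisper_run.log; then
  echo "Metal backend active"
else
  echo "Metal backend NOT active (CPU fallback)"
fi
```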
Warning
If you build with `make` instead of `cmake`, Metal is still supported but the flags differ: use `GGML_METAL=1 make -j`. The cmake path is recommended for consistency with upstream changes.
Disabling Metal (for comparison or debugging)
If you want to benchmark CPU-only performance or debug an issue:
```bash
cmake -B build-cpu -DGGML_METAL=OFF
cmake --build build-cpu --config Release -j$(sysctl -n hw.logicalcpu)
```
This is useful for isolating whether a problem is Metal-specific or a general inference bug.
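A small timing wrapper makes the comparison repeatable. This is a sketch, not part of whisper.cpp; it assumes you built both `build/` (Metal) and `build-cpu/` as above, and it uses whole-second resolution via `date +%s`, which is adequate for multi-second transcriptions:

```shell
# Time a command and print the elapsed wall-clock seconds
bench () {
  label=$1; shift
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  end=$(date +%s)
  echo "$label: $(( end - start ))s"
}

# Hypothetical usage, assuming both builds and an audio file exist:
# bench "metal" ./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav
# bench "cpu"   ./build-cpu/bin/whisper-cli -m models/ggml-base.bin -f audio.wav
bench "noop" true
```

Run each variant a few times and take the best result; the first run pays one-time model-load and shader-load costs.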
Performance: Metal vs. CPU-Only
The speedup from Metal depends on the model size and your specific Apple Silicon chip. Larger models benefit more because they have more matrix multiplication work to offload. Here are representative numbers for transcribing a 60-second English audio clip:
| Chip | Model | CPU-only (seconds) | Metal GPU (seconds) | Speedup |
|---|---|---|---|---|
| M1 | tiny | 1.2 | 0.9 | 1.3x |
| M1 | base | 2.8 | 1.5 | 1.9x |
| M1 | large-v3-turbo | 12.1 | 5.8 | 2.1x |
| M2 Pro | tiny | 0.8 | 0.6 | 1.3x |
| M2 Pro | base | 1.9 | 0.9 | 2.1x |
| M2 Pro | large-v3-turbo | 8.4 | 3.2 | 2.6x |
| M3 Max | large-v3-turbo | 6.2 | 1.9 | 3.3x |
| M4 Pro | large-v3-turbo | 5.5 | 1.6 | 3.4x |
A few patterns to note:
- The tiny model gets minimal benefit from Metal (1.3x). The model is so small that the overhead of dispatching to the GPU nearly matches the compute savings.
- The large-v3-turbo model consistently sees 2x+ speedup across all chips. This is the sweet spot for Metal acceleration.
- Newer chips with more GPU cores (M3 Max with 40 cores, M4 Pro with 20 cores) see larger speedups because more work can run in parallel on the GPU.
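Two derived numbers are worth computing from a table like this: the speedup (CPU time divided by GPU time) and the real-time factor (clip length divided by GPU time). Using the M2 Pro large-v3-turbo row above (8.4 s CPU, 3.2 s Metal, 60 s clip) as a worked example:

```shell
# Speedup and real-time factor for the M2 Pro / large-v3-turbo row above
awk 'BEGIN {
  clip = 60.0; cpu = 8.4; gpu = 3.2
  printf "speedup: %.1fx\n", cpu / gpu            # how much faster Metal is
  printf "real-time factor: %.1fx\n", clip / gpu  # audio seconds per wall second
}'
```

This reports a 2.6x speedup and a real-time factor around 19x: with Metal, the chip transcribes 60 seconds of audio in 3.2 seconds of wall time, leaving plenty of headroom for live use.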
Choosing the Right Model for Metal
Not every whisper model is worth running with Metal. The choice depends on your use case:
| Model | Parameters | Metal benefit | Best for |
|---|---|---|---|
| tiny / tiny.en | 39M | Minimal (1.3x) | Quick prototyping, CI pipelines |
| base / base.en | 74M | Moderate (1.9x) | Realtime on constrained devices |
| small / small.en | 244M | Good (2.2x) | Balanced accuracy and speed |
| medium / medium.en | 769M | Good (2.5x) | High accuracy, can tolerate slight delay |
| large-v3-turbo | 809M | Strong (2.6x+) | Best accuracy-to-speed ratio |
| large-v3 | 1.5B | Strong (3x+) | Maximum accuracy, batch processing |
For a voice-controlled desktop agent that needs to respond in real time, large-v3-turbo with Metal is the recommended choice. It delivers near large-v3 accuracy at roughly half the inference time.
Download the model:
```bash
./models/download-ggml-model.sh large-v3-turbo
```
Tuning Metal Performance
Thread count
whisper.cpp uses CPU threads for operations that are not offloaded to Metal. The default thread count is 4. On machines with many performance cores, increasing this can help:
```bash
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -t 8
```
On an M2 Pro (8 performance cores + 4 efficiency cores), -t 8 is a good starting point. Going beyond your performance core count provides diminishing returns because the efficiency cores are slower.
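Rather than hard-coding the thread count, you can query it on macOS. A sketch: `hw.perflevel0.physicalcpu` is the sysctl key for the performance-core count on Apple Silicon, and the fallback of 4 matches whisper.cpp's default:

```shell
# Use one thread per performance core; fall back to 4 if the key is unavailable
PCORES=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || echo 4)
echo "using -t $PCORES"

# Hypothetical invocation with the detected count:
# ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -t "$PCORES"
```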
GPU offloading
When Metal is enabled, whisper.cpp offloads all supported operations to the GPU by default. Unlike llama.cpp, whisper-cli has no per-layer `-ngl` knob; GPU use is effectively all-or-nothing. To force CPU inference in a Metal-enabled build, use the `-ng` (`--no-gpu`) flag:
```bash
# Force CPU-only inference even in a Metal-enabled build
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -ng
```
Partial offloading would rarely matter on Apple Silicon anyway, because unified memory means there is no VRAM limit to stay under. The main reasons to disable the GPU are CPU-baseline benchmarks and leaving GPU bandwidth for other GPU-heavy applications running at the same time.
Flash attention
whisper.cpp supports flash attention on Metal, which reduces memory usage and can improve throughput for longer audio segments:
```bash
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f audio.wav -fa
```
The `-fa` flag enables flash attention. On Apple Silicon with Metal, this is especially effective because it reduces the memory bandwidth pressure on the unified memory bus.
Tip
Combine Metal with flash attention for the best throughput: `-fa` reduces memory pressure while the GPU handles the heavy compute. On an M2 Pro with large-v3-turbo, this combination processes 60 seconds of audio in about 2.8 seconds.
Common Pitfalls
- **Building without Xcode Command Line Tools.** Metal shader compilation requires the Xcode CLT. If you see errors about missing `metal` or `metallib` during the build, run `xcode-select --install` first.
- **Running on Intel Macs and expecting Metal.** For whisper.cpp, Metal compute shaders work on Apple Silicon only. Intel Macs with AMD GPUs have Metal support in theory, but the ggml Metal backend targets Apple Silicon's unified memory architecture specifically. On Intel Macs, you will fall back to CPU-only inference.
- **Assuming more threads always helps.** Setting `-t 16` on an M1 with 4 performance cores will not speed things up. The efficiency cores are roughly half the speed of performance cores for this workload. Match your thread count to your performance core count.
- **Forgetting to check the model format.** whisper.cpp uses GGML-format model files, not the original PyTorch weights. If you download a `.pt` file from Hugging Face, you need to convert it with `convert-pt-to-ggml.py` before whisper.cpp can load it.
- **Not updating whisper.cpp.** The Metal backend gets regular performance improvements. A build from six months ago could be 20-30% slower than current `master` on the same hardware. Pull and rebuild periodically.
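The model-format pitfall can be caught with a four-byte check. This sketch assumes the GGML magic number 0x67676d6c ("ggml") used by whisper.cpp's loader, which appears on disk in little-endian order as the bytes `lmgg`; the file below is a stand-in created purely for illustration:

```shell
# Stand-in file for illustration; point this at a real downloaded model instead
printf 'lmgg' > /tmp/model-to-check.bin

# GGML magic 0x67676d6c stored little-endian reads as "lmgg" on disk (assumed)
magic=$(head -c 4 /tmp/model-to-check.bin)
if [ "$magic" = "lmgg" ]; then
  echo "looks like a ggml model"
else
  echo "not a ggml model (a .pt file needs conversion first)"
fi
```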
Minimal Working Example
A complete example from clone to transcription:
```bash
# Clone and build with Metal
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

# Download the large-v3-turbo model
./models/download-ggml-model.sh large-v3-turbo

# Record 5 seconds of audio (or use any .wav file)
# sox is available via: brew install sox
rec -r 16000 -c 1 -b 16 /tmp/test.wav trim 0 5

# Transcribe with Metal + flash attention
./build/bin/whisper-cli \
    -m models/ggml-large-v3-turbo.bin \
    -f /tmp/test.wav \
    -t 8 \
    -fa \
    --print-realtime
```
You should see `ggml_metal_init: found device: Apple M...` in the output, followed by the transcription. On an M2 Pro, a 5-second clip transcribes in under 0.5 seconds with this configuration.
Metal vs. CoreML vs. WhisperKit
If you are on Apple Silicon, you have three options for local Whisper inference. Here is how they compare:
| Feature | whisper.cpp (Metal) | whisper.cpp (CoreML) | WhisperKit |
|---|---|---|---|
| Language | C/C++ | C/C++ + CoreML model | Swift |
| GPU backend | Metal compute shaders | Apple Neural Engine + GPU | CoreML (ANE + GPU) |
| Setup complexity | Low (cmake + build) | Medium (needs CoreML model conversion) | Low (Swift package) |
| Streaming support | Yes (real-time mode) | Yes | Yes |
| Model flexibility | Any GGML model | Requires CoreML conversion per model | Pre-converted models only |
| Best for | CLI tools, servers, cross-platform | Maximum throughput on ANE | Native Swift/iOS apps |
whisper.cpp with Metal is the most flexible option. It works with any GGML model, supports all whisper.cpp flags, and does not require model conversion. CoreML can be faster for specific models because it can use the Apple Neural Engine (ANE) in addition to the GPU, but it requires converting each model to CoreML format first. WhisperKit is the best choice if you are building a native Swift application and want a clean API.
For a command-line tool or a background daemon that processes audio files, whisper.cpp with Metal is the practical choice.
Wrapping Up
Metal acceleration turns whisper.cpp from "fast enough" to "real-time with headroom" on Apple Silicon. The build is straightforward, the unified memory architecture eliminates the copy overhead that plagues discrete GPU setups, and the speedup scales with model size. For local speech-to-text on a Mac, this is the setup that actually delivers.
Fazm is an open source macOS AI agent that uses local whisper.cpp for voice input; the code is available on GitHub.