First Speculative Decoding Across GPU and Neural Engine on Apple Silicon
Speculative decoding is a technique where a small, fast model generates draft tokens that a larger model then verifies in a single forward pass. The key insight for Apple Silicon: you can run both models simultaneously on different parts of the same chip.
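The draft-then-verify loop can be sketched in a few lines. This is a toy illustration, not a real inference stack: `draft_model` and `target_accepts` are hypothetical stand-ins for the small and large models.

```python
# Toy sketch of one speculative decoding round. In a real system the
# draft model proposes k tokens and the target model verifies them all
# in one forward pass, keeping the longest accepted prefix.
def draft_model(prefix, k):
    # Stand-in for the small, fast model: proposes k draft tokens.
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_accepts(prefix, token):
    # Stand-in for the large model's verification of a single token.
    return token % 2 == 0

def speculative_step(prefix, k=4):
    """Draft k tokens, keep the accepted run; on the first rejection the
    target model contributes one corrected token (a 0 placeholder here)."""
    accepted = []
    for t in draft_model(prefix, k):
        if target_accepts(prefix + accepted, t):
            accepted.append(t)
        else:
            accepted.append(0)  # placeholder for the target's own token
            break
    return prefix + accepted
```

Even on a rejection the step still makes progress, which is why acceptance rate, not perfection, is what matters.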
Two Models, One Chip
Apple Silicon has three compute units: the CPU, the GPU, and the Apple Neural Engine (ANE). Most local AI inference uses the GPU; the ANE sits mostly idle for LLM workloads because most model architectures are not optimized for it.
Speculative decoding changes this equation. The small draft model (around 1B parameters) can run on the ANE while the large verification model runs on the GPU. Both execute in parallel on the same chip, using different silicon.
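The parallelism pattern looks roughly like the sketch below: while the large model verifies one batch of drafts, the small model is already drafting the next. The `run_draft_on_ane` and `run_verify_on_gpu` functions are hypothetical stand-ins for real dispatch (e.g. a Core ML model configured for the Neural Engine, and a GPU inference runtime); Python threads here just model the overlap, assuming the next draft can be speculated before verification finishes.

```python
import concurrent.futures

def run_draft_on_ane(step):
    # Stand-in for the 1B draft model running on the Neural Engine.
    return [f"draft{step}.{i}" for i in range(4)]

def run_verify_on_gpu(tokens):
    # Stand-in for the large model's verification pass on the GPU;
    # here it arbitrarily rejects the last draft token of each batch.
    return [t for t in tokens if not t.endswith(".3")]

def pipelined_decode(steps=3):
    out = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        next_draft = pool.submit(run_draft_on_ane, 0)
        for step in range(steps):
            drafts = next_draft.result()
            if step + 1 < steps:
                # Kick off the next draft batch while verification runs.
                next_draft = pool.submit(run_draft_on_ane, step + 1)
            out.extend(run_verify_on_gpu(drafts))
    return out
```

The point of the pattern is that the two compute units are never waiting on each other for long: drafting and verifying overlap instead of alternating.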
Why 1B on ANE Is the Sweet Spot
The Neural Engine is designed for matrix operations at lower precision. A 1B parameter model fits well: it is small enough to run entirely on the ANE without memory pressure, yet fast enough to stay ahead of the larger model when generating draft tokens.
Models larger than 1B start exceeding what the ANE can handle efficiently. Models smaller than 1B produce weaker drafts, so the larger model rejects too many tokens and the speed benefit evaporates.
The 1B sweet spot gives you 2-3x speedup over running just the large model alone. That is the difference between 15 tokens per second and 40 tokens per second on an M3 Pro.
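The draft-quality tradeoff can be made concrete with the standard speculative decoding analysis: if each draft token is accepted independently with probability `a` and the draft length is `k`, the expected number of tokens produced per verification pass is `(1 - a**(k+1)) / (1 - a)`. The specific acceptance rates below are illustrative assumptions, not measurements from the article.

```python
def expected_tokens_per_step(a, k):
    """Expected tokens per verification pass, assuming each of k draft
    tokens is accepted independently with probability a (0 < a < 1)."""
    return (1 - a ** (k + 1)) / (1 - a)

# A well-matched ~1B draft model (hypothetical a = 0.8) vs a too-small
# draft model with a lower acceptance rate (hypothetical a = 0.4):
print(round(expected_tokens_per_step(0.8, 4), 2))  # 3.36 tokens/step
print(round(expected_tokens_per_step(0.4, 4), 2))  # 1.65 tokens/step

# The article's M3 Pro numbers (15 -> 40 tokens/s) imply roughly:
print(round(40 / 15, 2))  # 2.67x
```

Halving the acceptance rate cuts the per-step yield by more than half, which is why a draft model below the sweet spot loses the speedup even though it drafts faster.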
What This Means for Desktop Agents
Faster inference means more responsive desktop agents. When your agent needs to make 50 tool calls to complete a task, the difference between 2 seconds and 5 seconds per call compounds into minutes of total completion time.
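The arithmetic from the paragraph above, using the article's own figures:

```python
calls = 50
slow, fast = 5.0, 2.0  # seconds per tool call, from the article

saved = calls * (slow - fast)
print(saved)  # 150.0 seconds saved across the 50-call task
```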
Local speculative decoding makes on-device AI agents practical for interactive workflows where latency matters. The technology is still early, but the hardware capability is already in every modern Mac.
Fazm is an open-source macOS AI agent, available on GitHub.