Apple Silicon for local AI
Local AI hardware tradeoffs on Apple Silicon: bandwidth, memory, and the third axis no one mentions
Every Apple Silicon buyer's guide for local AI argues bandwidth versus unified memory. That framing is correct for chat workloads and wrong the moment your local AI does any actual computer-use work. The third axis is per-turn input size, which is set by your screen-state representation, not by your chip.
Direct answer (verified 2026-05-11)
There is no single right chip. There are three workloads and three answers.
- Chat-only LLM: memory bandwidth wins. An M4 Max (546 GB/s) decodes roughly 4.5x faster than an M4 base (120 GB/s) on the same model.
- Desktop agent: prefill compute and per-turn input size win. A Pro tier with an accessibility-tree-based agent often beats a Max running a screenshot-based one.
- Diffusion / video: total unified memory wins. Bandwidth is secondary; sustained thermal headroom matters.
Bandwidth specs verified against Apple's M4 Pro and Max announcement and the M4 family page.
The bandwidth-versus-memory framing is incomplete
The standard advice on local AI hardware for Apple Silicon goes like this. Memory bandwidth decides how fast you generate tokens, because LLM decode is memory-bandwidth bound: every new token requires streaming the entire weight matrix through the chip. Total unified memory decides what model you can fit at all. Pick the chip tier that lets your target model fit, then spend any extra money on the highest-bandwidth tier you can afford. Done.
That advice is correct for one specific workload: chat. One user message in, a few hundred tokens out, repeat. Decode tokens per second is the visible number. Bandwidth dominates. The Max ranking on every benchmark you see for local LLMs is real and it reflects what those benchmarks actually measure.
It stops describing reality the moment your local AI does anything except chat. Two specific things break the model. The input-to-output ratio inverts. And the cost of the input is set by something outside the chip.
Three workloads, three different chips
Chat
Short prompt in, a few hundred tokens out. Decode bound. Memory bandwidth is the variable that moves the felt-speed needle. A Max generates roughly 4.5x faster than a base on the same 13B model. This is the workload every Apple Silicon local-AI buyer's guide is implicitly written for.
Agent
Looped tool calls. Each turn re-sends system prompt, tool schemas, conversation history, and screen state. Per-turn input is 2,000 to 15,000 tokens depending on representation. Prefill bound. Compute matters more than bandwidth. Per-turn input size matters more than either.
Diffusion and video
Total unified memory plus sustained GPU thermal headroom. Bandwidth is secondary because the per-step work is matmul-heavy across the latents, not weight-streaming per token. A 64 GB Max sustains generation that an 18 GB Pro cannot start.
Why prefill is the agent's wall
Decode is one weight pass per output token. Prefill is one weight pass plus matrix multiplications across the entire input length. On Apple Silicon, prefill scales with GPU TFLOPS, not with memory bandwidth. The Max has more cores than the Pro and the Pro has more than the base, so the rankings happen to look similar to the bandwidth rankings, but they are not the same curve. Prefill compute is a separate axis.
For a chat workload this barely matters because the input is a few hundred tokens. For an agent loop the input balloons. Each turn carries the full system prompt, every tool schema the agent might call, the history of every prior turn, and the current screen state. Three or four turns into a real task, per-turn input lives in the 10,000 to 15,000 token range. With prefill running at a few hundred tokens per second on consumer hardware, that is many seconds of waiting before the model picks its next action. Multiply by the number of turns to finish the task and the gap between snappy and unusable collapses fast.
This is why decode tokens per second, the number every benchmark publishes, is misleading for the agent case. Two chips with identical decode can have very different prefill curves, and the prefill curve is what the user feels.
Where the seconds go in one agent turn
| Phase | What it costs |
|---|---|
| Prefill | Input tokens / prefill TPS. Dominates on long inputs. |
| Decode | Tool-call JSON, usually under 200 output tokens. |
| Tool execution | Click, type, navigate, wait for the app. |
| Next prefill | Re-encode of the new screen state plus the loop tail. |
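To make the breakdown concrete, here is a back-of-envelope sketch of one turn. Every throughput number in it is an illustrative assumption, not a benchmark; substitute your runtime's measured prefill and decode speeds.

```swift
// One agent turn, term by term. All throughputs are assumed,
// illustrative values; measure your own runtime and substitute.
let inputTokens  = 12_000.0  // system prompt + tool schemas + history + screenshot state
let prefillTPS   =    900.0  // assumed prefill speed on a consumer chip
let outputTokens =    150.0  // tool-call JSON
let decodeTPS    =     30.0  // assumed decode speed
let toolSeconds  =      1.5  // click/type/navigate + app settle time

let prefill = inputTokens / prefillTPS        // ~13.3 s, dominates the turn
let decode  = outputTokens / decodeTPS        // ~5.0 s
let total   = prefill + decode + toolSeconds  // ~19.8 s for a single turn

// Same chip, accessibility-tree representation: input drops to ~2,500 tokens.
let axPrefill = 2_500.0 / prefillTPS          // ~2.8 s, the only term that changed
```

Decode and tool time are identical in both cases; the entire gap between the two turns is prefill.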
The third axis: how you describe the screen
Once you accept that prefill is the bottleneck for an agent, the next question is how to shrink the per-turn input. Most of the per-turn payload is fixed (system prompt, tool schemas, a few prior turns). The variable part is the screen-state dump that gets attached on every turn. There are two ways to do it.
One: send a screenshot to a vision-capable model. Roughly 1,500 to 3,000 input tokens per image depending on resolution and tile policy. The model gets pixels and has to figure out what is interactive. Reliable on visually clean apps, brittle on dense ones, expensive every turn.
Two: walk the macOS accessibility tree and serialize each visible element as a small object with role, optional text, optional bounding box. A serialized accessibility tree of the same window typically lands in the 200 to 400 token range. The model gets structured semantics directly. Click targets come with names and coordinates, not pixel guesses.
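A minimal sketch of what that walk can look like, using the public ApplicationServices accessibility APIs. This is an illustration of the technique, not Fazm's mcp-server-macos-use implementation, and it assumes the process has been granted Accessibility permission.

```swift
import ApplicationServices

// One compact element for the model to reason over.
struct AXNode: Codable {
    let role: String     // e.g. "AXButton"
    let title: String?   // visible text, if any
}

// Copy a single AX attribute, or nil if the element lacks it.
func attribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name as CFString, &value) == .success
    else { return nil }
    return value
}

// Depth-first walk of an AX subtree. A production serializer would also
// prune invisible and non-interactive nodes (that pruning is what keeps
// the payload in the 200 to 400 token range) and attach frames via
// kAXPositionAttribute / kAXSizeAttribute.
func walk(_ element: AXUIElement, into nodes: inout [AXNode]) {
    let role  = attribute(element, kAXRoleAttribute) as? String ?? "unknown"
    let title = attribute(element, kAXTitleAttribute) as? String
    nodes.append(AXNode(role: role, title: title))
    if let children = attribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        for child in children { walk(child, into: &nodes) }
    }
}

// Serialize the focused window of a given app (by pid).
func screenState(forApp pid: pid_t) -> [AXNode] {
    let app = AXUIElementCreateApplication(pid)
    var nodes: [AXNode] = []
    if let window = attribute(app, kAXFocusedWindowAttribute) {
        walk(window as! AXUIElement, into: &nodes)
    }
    return nodes
}
```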
Across a ten-step task the difference compounds to roughly 6x cumulative input tokens. On a Pro tier that is the difference between a few seconds per turn and ten or fifteen. The chip did not change. The representation did.
“The accessibility-tree path is what makes a local LLM viable at all for an agent loop on consumer hardware. Same chip, same model, different representation, six times less per-turn input.”
Fazm engineering notes
How Fazm wires this together
Fazm is a macOS computer-use agent that defaults to the accessibility-tree path. The reason it can drive any local LLM runtime you point it at is one UserDefault and three lines of Swift.
The Settings page exposes a Custom API Endpoint field. The field writes to a UserDefault called customApiEndpoint at SettingsPage.swift:885. When the agent subprocess spawns, ACPBridge.swift:468-470 reads that value and exports it as ANTHROPIC_BASE_URL into the bridge environment. From the agent's point of view it is still talking to the Anthropic Messages API. From your machine's point of view it is talking to a local llama.cpp server, vLLM in Anthropic mode, an OpenRouter Anthropic proxy, or LM Studio. The agent code does not change. Only the URL does.
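The mechanism is small enough to sketch in full. This is a hand-written illustration of the pattern, not Fazm's actual source; the UserDefault key and environment variable name come from the description above, and the process paths are placeholders.

```swift
import Foundation

// Read the endpoint the Settings page stored (keyed, per the article,
// off a UserDefault named "customApiEndpoint").
let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint")

// Spawn the bridge subprocess with the override exported into its
// environment. The agent keeps speaking the Anthropic Messages API;
// only the base URL underneath it changes.
let bridge = Process()
bridge.executableURL = URL(fileURLWithPath: "/usr/bin/env")  // placeholder
bridge.arguments = ["node", "acp-bridge/dist/index.js"]      // hypothetical entry point
var env = ProcessInfo.processInfo.environment
if let endpoint, !endpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = endpoint  // e.g. http://127.0.0.1:8080 for a local llama.cpp shim
}
bridge.environment = env
try bridge.run()
```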
The accessibility-tree screen state itself is supplied by a bundled binary called mcp-server-macos-use, wired up at acp-bridge/src/index.ts:100. That binary walks the AX tree of the focused window and returns the compact element list the model reasons over. No screenshot is taken on a normal turn. The 6x token reduction is a consequence of which binary is in the loop, not which chip is underneath.
Verify either by reading the source on GitHub or by toggling Settings > AI Chat > Custom API Endpoint in the app and watching the bridge subprocess in Activity Monitor switch its outbound traffic to the URL you typed.
A decision matrix that survives the buyer's-guide framing
| If you mostly do | The bottleneck is | Spend extra on | Reasonable floor |
|---|---|---|---|
| Chat with a 13B local model | Decode (bandwidth) | Higher chip tier (Pro or Max) for bandwidth | M4 Pro 24 GB |
| Chat with a 70B local model | Fitting the model + bandwidth | Max with 64 GB or 128 GB | M4 Max 64 GB |
| Desktop agent on accessibility tree | Prefill compute, per-turn input | Pro tier; representation matters more than chip | M3 Pro or M4 Pro 18 to 36 GB |
| Desktop agent on screenshots | Prefill (input is 6x larger) | Either change the representation, or buy a Max | M4 Max 36 GB |
| Diffusion image generation | Unified memory + thermal | Memory ceiling; Studio over MacBook | M4 Max 64 GB or Mac Studio |
| Whisper / on-device ASR | Negligible at modern chip tiers | Anything you already have | M1 base or newer |
The honest version of the buyer's guide is six rows, not one.
What this changes about how you spec a Mac for local AI
If you are buying for chat, the existing advice holds. Pick the highest-bandwidth tier you can stomach for the model size you want to run. The Max is the right answer if you can afford it.
If you are buying for an agent, stop reading the chat benchmarks and decide on the screen-state representation first. If your agent is going to be screenshot-based, you need a Max-tier prefill envelope to make it feel reasonable and you may still find the per-turn cost annoying. If your agent is going to be accessibility-tree-based, a Pro tier with a 13B-class model is enough for most of the desktop workloads people actually run, and the spare budget should go to memory headroom for occasionally swapping in a larger model rather than to bandwidth you will not feel.
If you are buying for diffusion or video, the chip tier choice is mostly a thermal and memory choice. The desktop form factor matters more than the SKU number.
“I had been about to upgrade to a Max because every guide said bandwidth. Realizing the agent loop is prefill bound, not decode bound, saved me about $1,500 and the Pro is still snappy.”
Pointing Fazm at your local runtime?
Show me your stack. We can walk through the screen-state and per-turn cost together and figure out whether you actually need to upgrade the chip.
Frequently asked questions
Which Apple Silicon chip do I actually need for local AI in 2026?
Honest answer: it depends on what you mean by local AI. For chat-only LLM inference, memory bandwidth decides decode speed and the order is base 120 GB/s, Pro 273 GB/s, Max 546 GB/s on M4. A Max generates roughly 4.5x faster on the same model than a base. For a desktop agent that drives apps and the browser, the bottleneck flips to prefill compute and to per-turn input size, and a Pro tier is often enough if your agent uses an accessibility tree instead of feeding the model a full screenshot every turn. For diffusion image generation, total unified memory matters more than bandwidth because the bottleneck is fitting the model and the latents, not streaming weights per token.
Why does memory bandwidth show up everywhere as the only number that matters?
Because LLM decode (the per-token generation phase) is memory-bandwidth bound. Each new token requires streaming the entire weight matrix through the chip. Bandwidth in GB/sec divided by model size in GB gives a hard ceiling on tokens per second. That is true and the existing buyer's guides cover it well. What they leave out is that the decode phase is only one of three things a real local AI workload spends time on. The other two are prefill (processing the input prompt before the first token) and tool execution (everything that happens between turns of an agent loop). Once those enter the picture the chip choice changes.
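That ceiling is worth computing once yourself. A sketch with the M4 bandwidths from this page and an assumed ~7 GB of weights for a 13B model at 4-bit quantization:

```swift
// Decode ceiling: every output token streams the full weights once,
// so tokens/sec cannot exceed bandwidth divided by model size.
// Real throughput lands below this ceiling.
func decodeCeiling(bandwidthGBps: Double, modelGB: Double) -> Double {
    bandwidthGBps / modelGB
}

let base = decodeCeiling(bandwidthGBps: 120, modelGB: 7)  // M4 base: ~17 tok/s
let pro  = decodeCeiling(bandwidthGBps: 273, modelGB: 7)  // M4 Pro:  ~39 tok/s
let max  = decodeCeiling(bandwidthGBps: 546, modelGB: 7)  // M4 Max:  ~78 tok/s
```

The 546-to-120 ratio is where the roughly 4.5x decode gap between Max and base comes from.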
What is the difference between prefill bound and decode bound?
Decode is one weight pass per token output. Prefill is one weight pass plus matrix multiplications across the entire input length. On a long input, prefill is the wall you hit before the model emits anything. On Apple Silicon, prefill scales with GPU TFLOPS (compute), not with memory bandwidth. The Max has more cores than the Pro and the Pro has more than the base, which is why the bandwidth ranking and the prefill ranking happen to look similar, but they are not the same number. A 10,000 token input on a 13B model on an M4 base is dominated by prefill time. A short chat reply on the same chip is dominated by decode.
Why does an agent care about prefill more than chat does?
Chat is one round trip. The system message is small, the user message is small, the model writes a few hundred tokens, done. An agent is a loop. Each turn re-sends the system prompt, the tool schemas, the conversation history, and a fresh dump of the screen state. After three or four turns the per-turn input lives in the 10,000 to 15,000 token range. With prefill running at a few hundred tokens per second on consumer hardware, that is many seconds of waiting before the model can pick its next action. Multiply by the number of turns it takes to finish a task and you get the difference between an agent that feels responsive and one that you give up on.
What is the screen-state representation and why does it set the per-turn cost?
There are two ways to tell a model what is on screen. One is to send a screenshot to a vision-capable model, which costs roughly 1,500 to 3,000 input tokens per image depending on resolution and tile policy. The other is to walk the macOS accessibility tree and serialize each visible element as a small object (role, optional text, optional bounding box). A serialized accessibility tree of the same window typically lands in the 200 to 400 token range. Across a ten-step task the difference compounds to roughly 6x the cumulative input tokens. That 6x sits on top of every other hardware choice you make. If you pick the screenshot path, you have effectively undone the bandwidth advantage you just paid extra for.
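The compounding is plain arithmetic. Picking one point inside each range above (an assumption, not a measurement):

```swift
// Cumulative screen-state tokens over a ten-step task, using one
// assumed point inside each range: 1,800 per screenshot, 300 per tree.
let steps = 10
let screenshotTotal = steps * 1_800  // 18,000 cumulative input tokens
let treeTotal       = steps * 300    //  3,000
let ratio = Double(screenshotTotal) / Double(treeTotal)  // 6.0
```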
Can a Pro tier really keep up with an agent loop?
Yes, if the agent is built around an accessibility tree rather than a screenshot. An M3 Pro or M4 Pro has enough prefill compute to chew through a 2,000 to 3,000 token per-turn input in a few seconds, and the decode at 25 to 35 tokens per second is comfortable for short tool-call emissions. The same Pro chip falls over running a screenshot-based agent because each turn balloons to 10,000 plus tokens of input and the prefill cost scales linearly. The hardware did not change. The representation did.
Where do I point Fazm at a local LLM runtime?
Settings > AI Chat > Custom API Endpoint. The text field writes to a UserDefault called customApiEndpoint at SettingsPage.swift line 885. When the agent subprocess spawns, ACPBridge.swift lines 468 to 470 read that value and export it as ANTHROPIC_BASE_URL into the bridge environment. Any local runtime that exposes an Anthropic-compatible Messages API works: a small shim in front of llama.cpp's server, an Anthropic mode on vLLM, an OpenRouter Anthropic proxy, LM Studio with the right plugin. The agent code does not change. Only the URL does.
What about diffusion models, video, and Whisper? Same rules?
Different bottlenecks for each. Diffusion image generation is bound by total unified memory plus GPU compute, because the model and the latent buffers all need to live in RAM and the per-step work is matmul-heavy. Bandwidth matters less than for LLMs. Video generation is dominated by VRAM (so unified memory size) and by sustained thermal headroom; this is a Max or Studio category problem. Whisper and other ASR models are small and prefill bound on a single audio buffer, so any current Apple Silicon chip handles them in real time. Lumping all of these into a single 'local AI' bucket and asking which chip is best is what makes most buyer's guides unhelpful.
Is Apple Silicon better than a discrete GPU for local AI on a per-dollar basis?
For models that fit in consumer GPU VRAM (say up to a 13B at 4-bit on a 16 GB card), no. A discrete card with 1,000 plus GB/sec of bandwidth runs them faster. For models that do not fit, Apple Silicon wins by being able to load them at all. A 70B at 4-bit needs roughly 40 GB of memory and there is no consumer single-GPU answer to that other than a Mac with 64 plus GB of unified memory. The honest framing is not better or worse but coverage of the model size range you actually plan to run.
Is there a single number that summarizes whether my chip is enough?
Per-turn end-to-end latency. Take your typical per-turn input size in tokens, divide by your runtime's measured prefill speed, add a typical decode time for the tool-call output, then add tool execution time. If that lands under five seconds you have a snappy agent. Five to fifteen seconds is workable but you will notice. Over fifteen seconds and people stop using it. If the prefill term dominates and you cannot change chips, change the screen-state representation before you change anything else.
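As a sketch, folding those thresholds into one function. The parameter names are generic and the example throughputs are assumptions; nothing here is chip-specific.

```swift
// Per-turn end-to-end latency, reduced to the verdict described above.
enum Verdict { case snappy, workable, abandoned }

func verdict(inputTokens: Double, prefillTPS: Double,
             outputTokens: Double, decodeTPS: Double,
             toolSeconds: Double) -> Verdict {
    let seconds = inputTokens / prefillTPS
                + outputTokens / decodeTPS
                + toolSeconds
    switch seconds {
    case ..<5.0:  return .snappy     // under five seconds
    case ..<15.0: return .workable   // noticeable but usable
    default:      return .abandoned  // people stop using it
    }
}

// Accessibility-tree turn on a Pro-tier chip (assumed throughputs):
let axTurn = verdict(inputTokens: 2_500, prefillTPS: 900,
                     outputTokens: 150, decodeTPS: 30, toolSeconds: 1.5)
// .workable, ~9.3 s

// Screenshot turn on the same chip:
let pixelTurn = verdict(inputTokens: 12_000, prefillTPS: 900,
                        outputTokens: 150, decodeTPS: 30, toolSeconds: 1.5)
// .abandoned, ~19.8 s
```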
Related on this site
Local LLM desktop agent throughput: the number that matters is not generation tok/s
Why prefill on screen state, not decode tok/s, is what decides whether a local LLM keeps up inside an agent loop.
Screenshot vs accessibility tree: the per-turn token cost
A direct comparison of the two screen-state representations, with token counts measured on real Mac apps.
Local LLM vs local AI agent on macOS
Running the model on your Mac is half the story. Running the agent loop on your Mac is the other half, and the constraints are different.