Local agents / throughput
The number that matters for a local LLM desktop agent is not the one in your chat benchmark
For desktop agents the throughput question reorders. Generation tok/s, the number every local-LLM benchmark prints, sits inside a tiny window at the end of each turn. The bigger window, on every single turn of the agent loop, is prefill of a large input that mostly contains a fresh dump of your screen state. The variable that decides whether a local LLM is usable as a desktop agent is how compact that screen-state representation is, not how fast the GPU shaves the last decode token.
Throughput for a local LLM desktop agent is governed by prompt-processing speed multiplied by per-turn input length, not generation speed. On Apple Silicon, prefill on a 13-30B quantized model lands in the low hundreds of tok/s while a typical agent turn re-sends the system prompt, full tool schemas, conversation history, and the current screen state. A screenshot of that screen costs roughly 1,500-3,000 input tokens. A compact macOS accessibility tree of the same screen costs roughly 200-400 input tokens, a 6x cumulative reduction over a 10-step task. That ratio, not the model size, is what decides whether the agent feels usable. Source: cap constants in MacosUseSDK / AccessibilityTraversal.swift verified 2026-04-29; benchmark numbers from the public Fazm blog post linked below.
Why chat benchmarks mislead you about agent throughput
Chat is one prompt followed by one stream of tokens. You see the first token, you start reading, and the visible throughput is whatever the runtime emits per second once prefill has ended. Prefill happens once, and on a short input the user does not feel it.
An agent is a loop. Every turn the runtime re-receives a prompt that contains the system message, the full tool schema, the message history so far, and the latest snapshot of the world. The runtime then runs prefill across that entire input from scratch (modulo prefix caching, which most consumer setups do not have configured), produces a small JSON tool call, your runtime executes that tool, the result is appended, and the loop runs again. Every iteration repeats the prefill cost.
On consumer Apple Silicon the relevant numbers look roughly like this: a 13B class quantized model on an M3 Pro decodes at 15-22 tok/s, while prefill on the same machine usually lands somewhere between 150 and 400 tok/s depending on backend and quantization. On an M4 Max with 64-128 GB you can push prefill higher, but the ratio of prefill speed to decode speed stays in the same ballpark across the consumer range.
Now imagine a turn that re-reads 12,000 input tokens and emits 150 output tokens. At 200 tok/s prefill and 18 tok/s decode that is 60 seconds of prefill and 8 seconds of decode. The user feels a 68-second wait. Halve the input and the prefill term collapses faster than anything you can do to the model.
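That arithmetic is worth keeping as a reusable check. A minimal Swift sketch, using only the example numbers from this paragraph (nothing here comes from Fazm itself):
// Per-turn latency floor: prefill time plus decode time.
// Tool execution is left out; it sits outside the model's budget.
func turnLatencySeconds(inputTokens: Double, outputTokens: Double,
                        prefillTokPerSec: Double, decodeTokPerSec: Double) -> Double {
    inputTokens / prefillTokPerSec + outputTokens / decodeTokPerSec
}

// 12,000 input tokens at 200 tok/s prefill, 150 output tokens at 18 tok/s decode.
let fullTurn = turnLatencySeconds(inputTokens: 12_000, outputTokens: 150,
                                  prefillTokPerSec: 200, decodeTokPerSec: 18)    // ≈ 68.3 s
// Halve the input and the prefill term drops by 30 s; decode tuning cannot give that back.
let halvedInput = turnLatencySeconds(inputTokens: 6_000, outputTokens: 150,
                                     prefillTokPerSec: 200, decodeTokPerSec: 18) // ≈ 38.3 s
print(fullTurn, halvedInput)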
Screen state is the dominant variable, not model size
The agent has to know what is on screen to decide what to do next. There are two ways to give the model that information.
One: take a screenshot, send it as an image. A vision-capable model tiles that image and produces input tokens for each tile. A 1440x900 region typically lands in the 1,500 to 3,000 token range. The model has to attend over every pixel: text, icons, decoration, ads, scrollbar chrome, the OS menu bar. None of that is interactable, but all of it consumes the prefill budget.
Two: walk the accessibility tree of the focused window and serialize it. macOS exposes a structured representation of every visible interactable element through the AXUIElement API. The model receives a JSON array of objects with one role, an optional label, and a four-number bounding box per element. A typical app screen serializes to 200-400 tokens. There is nothing about decoration, no rendering chrome, no menu-bar noise unless you ask for it.
Over a 10-step task the public Fazm benchmark on browser-tool comparison measured a 150,000 vs 25,000 input-token gap between the two approaches. Six times. That gap rides on top of every other optimization you do. Speculative decoding and prefix caching help the screenshot path too, and the screenshot path is still the slow one because the input was always larger.
The accessibility tree shape, with the actual cap constants
The exact shape Fazm uses comes from the open-source MacosUseSDK. The traversal lives in Sources/MacosUseSDK/AccessibilityTraversal.swift. Three caps keep the per-call output bounded:
- maxElements = 2000, past which the traversal returns with truncated = true in stats.
- maxDepth = 100, past which child recursion stops.
- maxTraversalSeconds = 5.0, past which whatever was collected is returned with a warning.
The element shape itself is six fields:
public struct ElementData: Codable, Hashable, Sendable {
public var role: String
public var text: String?
public var x: Double?
public var y: Double?
public var width: Double?
public var height: Double?
}
That is the entire payload per element. No parent pointers, no styling, no z-index, no DOM-style attribute bag. Most elements serialize to 30-60 JSON tokens; the median screen lands well inside the 200-400 token band even when the underlying app has a deep view hierarchy. The cap constants exist precisely so a pathological app cannot blow up the prefill budget on you.
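To see what one element costs on the wire, here is a minimal sketch that encodes a single ElementData, assuming the struct above is in scope; the button and its coordinates are invented for illustration:
import Foundation

// One invented element: a "Send" button with its bounding box.
let element = ElementData(role: "AXButton", text: "Send", x: 412, y: 88, width: 72, height: 28)
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]
let json = String(data: try! encoder.encode(element), encoding: .utf8)!
print(json)
// {"height":28,"role":"AXButton","text":"Send","width":72,"x":412,"y":88}
// Roughly 30-60 tokens once tokenized; the typical screen's tree stays in the 200-400 token band.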
How a single turn actually runs against a local LLM
The wiring inside Fazm: when you toggle Custom API Endpoint in Settings (SettingsPage.swift line 887, the text field at line 983), the value lands in a UserDefault called customApiEndpoint. When the agent process spawns, the bridge layer (ACPBridge.swift lines 406-408) reads it and exports it as ANTHROPIC_BASE_URL in the agent's environment.
From that point the agent does not know it is talking to a local model. It calls the Messages API, the call hits whatever URL you pointed at: an Anthropic-compatible vLLM server on localhost:8000, an LM Studio instance behind a small shim, the Z.ai Anthropic-compatible endpoint, GitHub Copilot's Anthropic bridge. The reasoner is hot-swappable through one env var. The agent's tool surface (browser via Playwright extension, Mac apps via the macos-use MCP, terminal, file system, Google Workspace) stays identical.
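The pattern itself is small enough to sketch. A simplified illustration of that hand-off, not the actual Fazm source; spawnAgent is a hypothetical name, while customApiEndpoint and ANTHROPIC_BASE_URL are the key and variable named above:
import Foundation

// Read the user-configured endpoint and export it into the agent's environment.
func spawnAgent(executable: URL, arguments: [String] = []) throws -> Process {
    let process = Process()
    process.executableURL = executable
    process.arguments = arguments
    var environment = ProcessInfo.processInfo.environment
    if let endpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
       !endpoint.isEmpty {
        // Anything Anthropic-compatible works: vLLM on localhost:8000, a shim in
        // front of LM Studio, a hosted Anthropic-compatible proxy.
        environment["ANTHROPIC_BASE_URL"] = endpoint
    }
    process.environment = environment
    try process.run()
    return process
}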
One agent turn against a local LLM
1. Capture. macos-use traverses the focused window's accessibility tree under the 2000-element / 100-depth / 5-second caps.
2. Serialize. ResponseData is JSON-encoded: role, optional text, four-number bounding box per element. Typical: 200-400 tokens.
3. Prefill. Local runtime processes system + tools + history + tree. With prefix caching on, only the tree and last user turn are new.
4. Decode. Model emits a tool-call JSON, usually under 200 output tokens. This is where decode tok/s actually shows up in the latency budget.
5. Execute. Fazm runs the tool (click, type, AppleScript, browser action), captures the next tree, and the loop repeats (a schematic sketch of the loop follows below).
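A schematic Swift sketch of that loop; every type and function here is a hypothetical stub, and only the shape of the loop reflects the steps above:
// Hypothetical placeholder types and functions; none of this is the Fazm API.
struct ToolCall { let name: String; let arguments: String }

func captureTree() -> String { "[]" }                             // steps 1-2: capped AX traversal, serialized JSON
func callModel(_ transcript: [String]) async -> ToolCall? { nil } // steps 3-4: prefill + decode of a tool call
func execute(_ call: ToolCall) -> String { "" }                   // step 5: click, type, AppleScript

func runTurnLoop(goal: String) async {
    var transcript = [goal]
    while true {
        transcript.append(captureTree())                          // fresh screen state, ~200-400 tokens
        guard let call = await callModel(transcript) else { break } // no tool call means the task is done
        transcript.append("tool result: " + execute(call))        // append the result, then the loop repeats
    }
}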
Prefix caching is the second lever, and it is free
Most of the per-turn input does not change. The system prompt is identical across turns. The tool schemas are identical. The first N messages of the conversation are identical. Only the trailing piece (the latest tool result + the new accessibility tree) is new.
Runtimes that support automatic prefix caching skip the prefill cost on the unchanged part of the prompt. llama.cpp via cache_prompt, vLLM via its automatic prefix cache, MLX servers via their KV reuse hooks. Turn it on, configure your model server to keep enough KV memory available for the active session, and your effective per-turn prefill collapses to the size of the diff between turn N and turn N+1. With the accessibility-tree path, that diff is small because the tree is small to begin with.
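What "turn it on" looks like against a llama.cpp server, as a small Swift sketch; the /completion endpoint and the n_predict and cache_prompt fields follow the llama.cpp server API as I understand it, so treat the exact payload as an assumption to verify against your server version:
import Foundation

var request = URLRequest(url: URL(string: "http://localhost:8080/completion")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
let body: [String: Any] = [
    "prompt": "<system prompt + tool schemas + history + new accessibility tree>",
    "n_predict": 200,      // tool calls are short; cap the decode window
    "cache_prompt": true   // reuse the KV cache for the unchanged prefix across turns
]
request.httpBody = try! JSONSerialization.data(withJSONObject: body)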
On the screenshot path, prefix caching helps less. Each new screenshot is a different image, and image tokens after re-tiling do not align with the previous turn's image tokens, so the cache does not get a hit on the changing piece. You eat full image-token prefill on every turn.
What actually fits on consumer Apple Silicon
The realistic operating envelope, assuming Anthropic-compatible serving and tool-use-trained models:
- M2/M3 base, 16 GB: 7B-class instruct at 4-bit, decode 25-40 tok/s, prefill 200-400 tok/s. Fine for one-app workflows where the tree stays small. Will choke on long task histories without aggressive context trimming.
- M3 Pro, 18-36 GB: 13B-class instruct at 4-bit, decode 15-22 tok/s, prefill 150-300 tok/s. The practical floor for cross-app workflows. With prefix caching on, per-turn latency lands in the 5-8 second range.
- M4 Max, 64-128 GB: 30-70B at 4-bit, decode 30-45 tok/s, prefill 400-800 tok/s. The tool-call quality is what makes this band actually useful, not the throughput. The throughput is fine for any honest desktop-agent task at this size.
Below the M3 Pro band, the limiter usually flips from throughput to tool-call accuracy: the small models miss the right element to click. That is a model-quality problem, not a throughput problem, and no amount of optimization fixes it.
What the runtime actually prints when you watch one turn
A representative turn against a 13B-class model behind an Anthropic shim, with the accessibility-tree path on, breaks down roughly like this. The numbers below are an illustrative reconstruction from the figures discussed in this post, not a verbatim runtime log:
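input_tokens:  11,847 total
  cache hit:   10,540  (system prompt, tool schemas, earlier turns)
  processed:    1,307  (latest tool result plus the new accessibility tree; tree: 307)
prefill:        1,307 tokens at 218 tok/s, about 6 seconds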
Two things to notice. First, the cache-hit term is most of the prefill budget. Without prefix caching the same turn would have eaten 11,847 / 218 = 54 seconds of prefill instead of 6. Second, the tree column is 307 tokens. If that line said 2,400 because we sent a screenshot, prefill would be 11 seconds even with the cache hit, and on a fresh conversation (no cache) we would be back at the minute-plus regime.
Where the per-turn time actually goes
In a turn like the one above, prefill owns most of the budget: a handful of seconds of prompt processing with the cache hit, close to a minute without it, against a few seconds of decode plus whatever the tool itself takes to execute.
The actual operating checklist
If your local desktop agent feels slow, work the list in this order:
- Audit per-turn input length. Log input_tokens on every Messages API call (a minimal logging sketch follows this list). If it is over 15K, you have a screen-state representation problem before you have a runtime problem.
- Switch from screenshots to the accessibility tree for any agent step where the model only needs to know about interactable elements. Reserve screenshots for genuine vision steps (reading a PDF, identifying a chart, OCR fallback when accessibility fails).
- Turn on prefix caching in your runtime: llama.cpp cache_prompt: true, vLLM --enable-prefix-caching. Confirm cache hits in the runtime log.
- Trim tool schemas you do not use. A typical agent ships 30+ tools and the model only needs 3-5 for any given session. Tool schemas are deceptively heavy because every parameter description costs tokens.
- Cap conversation history. Past 10-15 turns the model rarely benefits from the full transcript. Summarize or window aggressively.
- Only then look at the model. Quantization choice, decode tok/s, speculative decoding. These all matter, but in the agent regime they fight for the small slice of the latency budget that decode actually owns.
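A minimal sketch of the first item on that list, assuming an Anthropic-style Messages response body; the struct and function names are illustrative, and the usage field shape follows Anthropic's Messages API:
import Foundation

// The usage block an Anthropic-compatible Messages API returns per call.
struct Usage: Decodable { let input_tokens: Int; let output_tokens: Int }
struct MessagesResponse: Decodable { let usage: Usage }

func logTurnUsage(_ responseBody: Data, turn: Int) throws {
    let usage = try JSONDecoder().decode(MessagesResponse.self, from: responseBody).usage
    print("turn \(turn): input_tokens=\(usage.input_tokens) output_tokens=\(usage.output_tokens)")
    if usage.input_tokens > 15_000 {
        print("turn \(turn): screen-state representation problem, not a runtime problem")
    }
}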
The cases where this rule breaks
Three honest counterexamples. First, on canvases where the accessibility tree is genuinely thin (Figma, Photoshop, Blender, custom Electron apps that ship no AX hierarchy), the tree is unhelpful and a screenshot is the only path. The argument here only applies when the app actually exposes its semantic structure.
Second, for short tasks (under 3 turns), the prefill cost is paid only a few times and the difference between 2 seconds and 12 seconds of total prefill is annoying but not disqualifying. The argument compounds with task length, and most useful desktop-agent tasks are 10+ turns.
Third, with a real frontier-class hosted model, prefill is so fast that the screenshot/tree distinction matters far less for latency, though it still matters for cost. The local-LLM regime is where the math tilts hardest.
Local runtimes that the rule applies to
Anything that exposes an Anthropic-compatible Messages API can sit behind ANTHROPIC_BASE_URL and run the agent loop above.
llama.cpp
Server mode with cache_prompt for prefix caching.
vLLM
Automatic prefix caching, plus a community Metal backend on Apple Silicon.
MLX
Apple's native runtime, fastest decode on M-series for many model sizes.
LM Studio
OpenAI-shaped server; needs a small Anthropic shim in front.
Ollama
Easiest setup; behind an Anthropic-compatible proxy it slots in cleanly.
vllm-mlx
MLX-backed vLLM with a built-in Anthropic mode.
Walk through your throughput numbers with us
Bring a recent agent log and we will look at where the prefill budget actually went.
FAQ
Why is my local LLM agent slow even though my chat throughput is fine?
Chat is one prompt with a small system message; agents are a loop. Every turn re-sends the system prompt, the full tool schema, the conversation so far, and a fresh dump of screen state. On consumer Apple Silicon, prefill (prompt processing) runs in the low hundreds of tok/s on typical quantized models, and with a large per-turn input it dominates. If a turn is 12,000 input tokens and your prefill runs at 300 tok/s, that is a 40-second wait before the model emits its first action. The model does not need to be slow for the agent to feel slow.
Should I be looking at prefill tok/s or decode tok/s when I pick a runtime?
Both, but prefill matters more for desktop agents. A useful sanity check: take your typical per-turn input length (system prompt + tool schemas + last few turns + current screen state), divide by your runtime's measured prefill speed, and that is your floor on time-to-first-tool-call. Decode (generation) speed only matters in the small window where the model is emitting the tool-call JSON, which is usually well under 200 tokens.
How much does the screen-state representation actually cost in tokens?
A single screenshot fed to a vision-capable model typically lands in the 1,500 to 3,000 input-token range depending on resolution and tile policy. A compact accessibility tree of the same window, with role plus optional text plus a four-number bounding box per element, lands in the 200 to 400 input-token range for a typical app screen. Over a 10-step task the public benchmark from the Fazm blog measured 150,000 vs 25,000 tokens, a 6x cumulative reduction. That is the difference between a local 13B model finishing the task in two minutes and finishing in twelve.
What stops the accessibility tree from blowing up on a complex app?
MacosUseSDK caps every traversal in three ways: maxElements = 2000, maxDepth = 100, and maxTraversalSeconds = 5.0 (see AccessibilityTraversal.swift line 103-105). Past those limits the traversal returns truncated and logs a warning. The element shape itself is six fields (role, optional text, optional x, y, width, height), which keeps each element under roughly 30 to 60 tokens of JSON regardless of how deep the underlying view hierarchy is.
Can I point Fazm at LM Studio, Ollama, or a vLLM server?
Yes, if it speaks Anthropic's Messages API. Fazm has a Custom API Endpoint toggle in Settings (SettingsPage.swift line 887, 983) that writes to a UserDefault called customApiEndpoint. At bridge spawn time that value is exported as ANTHROPIC_BASE_URL (ACPBridge.swift line 406-408) for the agent process. So the agent process is the same; the reasoner behind it is whatever you pointed the bridge at. LM Studio's OpenAI-compatible endpoint and llama.cpp's server need a small Anthropic shim in front; vllm-mlx and Anthropic-compatible proxies (Z.ai, OpenRouter Anthropic mode, GitHub Copilot bridge) work directly.
What model size is realistic for desktop-agent work on a Mac?
On an M3 Pro at 18-36 GB, a 13B-class instruct model with strong tool-use training is the practical floor; you will get 15 to 22 decode tok/s and prefill that is fast enough for short turns. On an M4 Max at 64-128 GB, 30B to 70B at 4-bit quantization is the sweet spot, with 30 to 45 decode tok/s and prefill that comfortably handles the 10-15K input tokens a typical agent turn produces. Models without tool-use training underperform regardless of size; the bottleneck flips from throughput to wrong-action-count.
Does running the model locally actually save round-trip time?
Not by much. Keeping the model local removes the network round trip, but a cloud frontier model has prefill speeds in the thousands of tok/s and serializes a tool call in under a second, while a local 13B does prefill at hundreds of tok/s. The local win is privacy and cost, not raw speed, and the win is real only if you keep per-turn input small. The accessibility-tree path is what makes a local LLM viable at all for an agent loop on consumer hardware.
What about speculative decoding and KV-cache reuse?
Both help. Speculative decoding accelerates the decode phase, so it shaves a second or two off the tool-call emission. KV-cache reuse across turns (when the runtime supports prefix caching) is a much bigger win because the system prompt and tool schemas at the start of every turn are identical; the only changing prefix is the new screen state. llama.cpp 'cache_prompt' and vLLM's automatic prefix caching both implement this. With prefix caching on, the per-turn prefill cost collapses to whatever's new, which is exactly the screen-state diff.
Is there a single number that summarizes 'usable' for a local desktop agent?
I use this rough rule: per-turn end-to-end latency under 5 seconds is fine, 5 to 15 seconds is workable, over 15 seconds is unusable. Per-turn latency is roughly (input_tokens / prefill_tps) + (output_tokens / decode_tps) + tool execution. Plug in your numbers. If the first term dominates, look at your screen-state representation before you look at the model.
Other notes on running the agent locally
Keep reading
Run vLLM locally on Mac and plug it into an AI agent that drives any Mac app
The one-field setting that rewrites ANTHROPIC_BASE_URL so a Metal-backed vLLM server can drive Finder, Calendar, and any signed Mac app.
Local-first AI coding agent, when local means the agent and not just the model weights
Most guides argue about which model to run on your Mac. This one is about the agent process itself running locally as a desktop app.
Open-source local desktop agent
What it actually means to have the entire agent surface (model swap, tool runtime, screen-state capture) sitting on your machine.