Notes from building one shipping macOS agent
The heterogeneous local AI scheduler gap is three gaps, and one tool will not close all three
People on X say it like one phrase. The compute on an Apple Silicon Mac is heterogeneous, no consumer runtime schedules across it, the gap is obvious. Yes. But there are two more layers above silicon that also lack a scheduler, and pretending the problem is silicon-only is the reason your local agent feels worse than a chat window even though the chip is fast.
Direct answer (verified 2026-05-12)
The "heterogeneous local AI scheduler gap" is the missing layer between "I have an M-series Mac and I want a local AI agent" and "the agent reliably does the next right thing". The gap exists at three layers, not one: silicon (nobody co-schedules CPU, GPU, and ANE for a single model), work-type (nobody routes reasoning, ASR, OCR, classification, and embeddings to different compute targets), and quality-vs-latency (nobody picks small-fast vs big-slow per request). The honest move is to stop trying to ship one tool that does all three. In Fazm the seam is two lines of Swift: customApiEndpoint at SettingsPage.swift line 885, exported as ANTHROPIC_BASE_URL at ACPBridge.swift line 469. Whatever runtime is behind that URL owns silicon scheduling. The agent owns work-type.
The thesis, in one paragraph
Every other guide on this topic argues silicon: should the model run on Metal, on the ANE, on the AMX CPU matmul units. That argument is correct and not enough. A local-AI agent is a system, not a model. A system has more than one compute consumer. A scheduler that only routes inside a single model graph is the silicon layer scheduler. A scheduler that routes between work types (reason, transcribe, read screen, classify) is a different scheduler. A scheduler that picks between a 1.5B and a 70B per request is a third. They want to live in three different processes. The reason the gap feels unfillable is that we keep trying to fill all three from one tool.
Gap one, silicon: the ANE is idle and Metal is the default
On an M-series chip there are three compute blocks. The CPU has AMX matrix units that are good for small dense matmuls and prefill on short inputs. The GPU is the high-bandwidth target for streaming a large weight matrix through attention and feed-forward layers, and it is what Metal-backed inference uses. The Apple Neural Engine is the highest-throughput, lowest-power compute, but it expects CoreML-converted graphs with quantizations it supports, and it is dimensioned for vision and ASR shapes, not for streaming a 13B weight matrix per token.
A real silicon scheduler would split a single transformer graph across all three. The embedding and the small adapter passes go to the ANE. The big matmuls in attention and feed-forward go to the GPU. Some prefill bookkeeping goes to the CPU. Apple themselves ship their on-device foundation model with roughly this split. Third-party consumer runtimes do not. llama.cpp lives on Metal. MLX lives on Metal. Ollama is a wrapper around runtimes that live on Metal. The ANE goes idle. That is gap one.
Where the work goes today vs where a real silicon scheduler would put it
MLX has experimental ANE paths. llama.cpp has a Metal Performance Shaders bridge that can route ops through CoreML. None of those are production for arbitrary models. The honest claim today is that the silicon scheduler is the runtime's job, and your job is to pick a runtime that is getting close, not to build the scheduler yourself inside an agent.
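To see how thin the third-party control surface is, here is a minimal sketch of the one silicon-placement knob CoreML exposes: a per-model hint, not a per-op scheduler. The model path is hypothetical, and nothing about this turns a Metal-resident LLM into an ANE-resident one.
import CoreML
import Foundation
// Hedged sketch: computeUnits is a placement hint for a whole model graph.
let config = MLModelConfiguration()
config.computeUnits = .all                    // let CoreML place ops on CPU, GPU, or ANE
// config.computeUnits = .cpuAndNeuralEngine  // bias toward the ANE where ops qualify
let modelURL = URL(fileURLWithPath: "/path/to/SomeConvertedModel.mlmodelc")  // hypothetical
do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    _ = model  // ops the ANE cannot run fall back to GPU or CPU silently at compile time
} catch {
    print("CoreML load failed: \(error)")
}
The knob lives in the runtime, not the agent. That is the whole point of gap one.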
Gap two, work-type: an agent has five compute consumers, not one
Open a desktop agent and watch what it actually does in a minute of use. It reasons (forward pass on a big LLM). It transcribes what you said (ASR, ideally a streaming Whisper or Parakeet). It reads screen state (either OCR via Vision framework, or, better, the accessibility tree). It classifies your request to pick a route (small model, sometimes regex). It might embed something for memory (a small sentence-transformer). Five compute consumers. Five different ideal targets. Five different latency budgets.
Consumer local-AI tools treat this as one consumer because the local-AI conversation came from the local-LLM community, and a local LLM is one component. The work-type scheduler is the piece that says "transcription goes to a streaming endpoint, reasoning goes to messages.create, screen-state is the AX tree (no model at all), classification is a 3B local, embeddings are a CPU batch". That router is small. It is also missing from almost every local-agent tool I have looked at.
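How small is that router? Roughly this small. A hedged sketch, not Fazm's shipping code; the enum cases, endpoint strings, and model names are illustrative placeholders.
// Hedged sketch of a work-type router. Names and endpoints are illustrative.
enum WorkType {
    case reason, transcribe, screenState, classify, embed
}
enum ComputeTarget {
    case bigModelHTTP(path: String)      // whatever sits behind ANTHROPIC_BASE_URL
    case streamingASR(endpoint: String)  // a local Whisper/Parakeet server (hypothetical)
    case accessibilityTree               // structured text from macOS, no model at all
    case smallLocalModel(name: String)   // 1.5B-3B classifier (hypothetical name)
    case cpuBatch(name: String)          // embeddings on a CPU batch (hypothetical name)
}
func route(_ work: WorkType) -> ComputeTarget {
    switch work {
    case .reason:      return .bigModelHTTP(path: "/v1/messages")
    case .transcribe:  return .streamingASR(endpoint: "ws://localhost:8765/asr")
    case .screenState: return .accessibilityTree
    case .classify:    return .smallLocalModel(name: "qwen2.5-3b-instruct")
    case .embed:       return .cpuBatch(name: "bge-small")
    }
}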
Work-type routing inside one agent turn
The interesting fact about this diagram is the second destination on the right. The accessibility tree is structured text that macOS already maintains for every running app. Reading it costs zero model inference. A serialized window of a typical macOS app lands in 200 to 400 tokens, versus 1,500 to 3,000 tokens for a screenshot fed to a vision model. The work-type scheduler that knows the AX path exists makes the silicon scheduler argument partly moot for this branch: you do not need to schedule a vision model across CPU/GPU/ANE if you do not run a vision model.
You can see this concretely in the open source bridge. The 19 stdio tools at acp-bridge/src/fazm-tools-stdio.ts include capture_screenshot as an option for the canvas/WebGL fallback case, but the default screen-state input to the model is the AX tree. The work-type router lives in the loop, not the runtime.
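If you want to see what the no-model branch costs to build, here is a minimal sketch of reading a window's accessibility tree with the public AX API. It assumes the Accessibility permission is granted; the pid, depth cap, and fan-out cap are arbitrary illustrations, not Fazm's values.
import ApplicationServices
// Hedged sketch: serialize a window's accessibility tree as indented text.
// Requires the Accessibility permission; error handling is omitted.
func attribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    AXUIElementCopyAttributeValue(element, name as CFString, &value)
    return value
}
func serialize(_ element: AXUIElement, depth: Int = 0, maxDepth: Int = 4) -> [String] {
    guard depth <= maxDepth else { return [] }
    let role  = attribute(element, kAXRoleAttribute)  as? String ?? "?"
    let title = attribute(element, kAXTitleAttribute) as? String ?? ""
    var lines = [String(repeating: "  ", count: depth) + "\(role) \(title)"]
    if let children = attribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        for child in children.prefix(20) {  // cap fan-out to keep the token budget small
            lines += serialize(child, depth: depth + 1, maxDepth: maxDepth)
        }
    }
    return lines
}
// Usage: dump the focused window of a running app (the pid is hypothetical).
let app = AXUIElementCreateApplication(12345)
if let window = attribute(app, kAXFocusedWindowAttribute) {
    print(serialize(window as! AXUIElement).joined(separator: "\n"))
}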
Gap three, quality vs latency: small model classify, big model reason
The user says "close the second tab". The right response is one tool call. The user says "summarize this 40-page PDF and email the gist to my accountant with a polite ask for a tax call". The right response is a long plan and several tool calls. Sending both to the same model is throwing capacity at the first one and starving the second one. Sending both to a small model fails on the second one. Sending both to a big model is slow on the first one and burns budget.
The classifier that picks the lane is the quality-vs-latency scheduler. It is also where most of the perceived speedup on a desktop agent lives, not in the underlying model inference speed. A 50ms classification that lets the loop short-circuit to a tool call without calling the big model at all wins on every UX metric a person can feel. The classifier itself can be local, on a 1.5B or 3B model, or even regex for the simplest cases. It is a different model from the reasoner. It is a different endpoint from the reasoner. It is missing in almost every consumer desktop agent, including the ones that beat their chest about being "fully local".
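The smallest version of that lane picker is a few lines. A hedged sketch assuming regex is good enough for a first cut; the patterns and the eight-word threshold are illustrative, and the same slot can hold a 1.5B or 3B local model instead.
import Foundation
// Hedged sketch of a quality-vs-latency router. Thresholds and patterns are
// illustrative; swap this for a small local model without touching the reasoner.
enum Lane {
    case directToolCall  // "close the second tab": short-circuit to a tool call
    case smallModel      // short request, cheap answer
    case bigModel        // long plan, several tool calls
}
func pickLane(_ request: String) -> Lane {
    let words = request.split(separator: " ").count
    let imperativeUI = request.range(
        of: #"^(close|open|click|scroll|switch|mute|pause)\b"#,
        options: [.regularExpression, .caseInsensitive]) != nil
    if words <= 8 && imperativeUI { return .directToolCall }
    if words <= 8 { return .smallModel }
    return .bigModel
}
// pickLane("close the second tab")                                   // .directToolCall
// pickLane("summarize this 40-page PDF and email the gist to ...")   // .bigModel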
“Across a ten-step task the screenshot path costs roughly 6x the cumulative input tokens of the accessibility-tree path. That 6x sits on top of every silicon choice you make. If you pick the screenshot path, you have undone the bandwidth advantage you just paid extra for.”
From the field notes at /t/local-ai-hardware-bandwidth-vs-memory
Counterargument: "just wait, Apple will ship it"
A reasonable reply is that Apple is the one party with the right incentives and the chip access to ship a real silicon scheduler, and they are obviously shipping pieces of it (the on-device foundation model, the new Vision Pro inference pipeline, the progressively richer CoreML toolchain). True, and irrelevant to the agent layer. Even if Apple ships a perfect silicon scheduler tomorrow, the work-type layer and the quality-vs-latency layer still need to exist somewhere. They are a different kind of code from a CoreML graph compiler. They live in the agent, not the chip.
The corollary is that an agent that decouples its model wire from its work-type router is not affected by which silicon scheduler you eventually settle on. The day a runtime ships true CPU+GPU+ANE co-scheduling, the agent points at that runtime via the same URL. The work-type and quality-vs-latency code underneath does not move. That is the design payoff for not building one monolithic local-AI agent.
Resolution: where the seam goes
Three layers, three places for the scheduler. Silicon scheduling belongs inside the inference runtime. It is the runtime's job to dispatch ops to CPU AMX, GPU Metal kernels, or the ANE for the parts of the graph that benefit. Pick a runtime that gets close. Work-type scheduling belongs inside the agent loop. The loop knows which compute consumer is firing on this turn, and it routes accordingly. Quality-vs-latency scheduling belongs in a top router that classifies the incoming request. It can be a regex, a 3B local, or a small cloud model.
In the Fazm codebase the seam between the agent loop and the runtime is two lines of Swift. The Settings UI writes a string; the bridge subprocess reads that string into an env var on spawn. Everything else (the 19 native tools, the cron scheduler, the accessibility-tree screen state, the conversation store, the permission system) sits above the seam and does not care what silicon scheduler runs on the other side of the URL.
// Desktop/Sources/MainWindow/Pages/SettingsPage.swift (line 885)
@AppStorage("customApiEndpoint") private var customApiEndpoint: String = ""
// Desktop/Sources/Chat/ACPBridge.swift (lines 467-470)
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}
You will not find a scheduler in those two lines. That is the point. The scheduler is whatever is behind the URL, and the agent does not care. Anthropic's production API is one answer. A local mlx-omni-server with --use-ane is another. A llama.cpp build that just gained a CoreML branch is a third. The agent stays the same.
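For completeness, the other half of the seam, the spawn itself, looks roughly like this. A hedged sketch with a hypothetical executable path and script argument, not Fazm's actual launcher.
import Foundation
// Hedged sketch of spawning the bridge with the env var set.
var env = ProcessInfo.processInfo.environment
if let customEndpoint = UserDefaults.standard.string(forKey: "customApiEndpoint"),
   !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint  // local runtime, proxy, or Anthropic's API
}
let bridge = Process()
bridge.executableURL = URL(fileURLWithPath: "/usr/local/bin/node")  // hypothetical
bridge.arguments = ["acp-bridge/dist/index.js"]                     // hypothetical
bridge.environment = env
do { try bridge.run() } catch { print("bridge spawn failed: \(error)") }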
Practical reading order if you are designing one of these
Start with the work-type layer, not the silicon layer. Decide what the agent reads and writes per turn before you decide what chip block it runs on. The work-type layer is where the AX-tree decision lives, and that decision changes your per-turn token budget by roughly 6x. Once the work-type layer is honest, the silicon layer is a runtime choice you can make in the morning. Then add the quality-vs-latency router at the top, because that is where most of your felt latency goes. Do not start with chips. Start with the routing table.
Want to talk through where the seam should go in your stack?
If you are building a desktop agent and trying to decide what stays local and what moves behind a configurable URL, book 30 minutes and walk through it with me.
Frequently asked questions
What does 'heterogeneous local AI scheduler gap' actually mean?
It is the gap between 'I have a Mac and I want a local AI agent' and 'the agent is running and clicking the right button right now'. The gap exists at three layers. First, silicon: an M-series chip has CPU cores, GPU cores, and the Apple Neural Engine, but consumer runtimes pick one and ignore the others (llama.cpp and MLX default to Metal on GPU, the ANE goes idle, the CPU AMX matrix units go idle). Second, work-type: an agent does five kinds of work (reason, transcribe, OCR or read screen state, classify, embed) and each has a different ideal compute target, but no consumer tool routes across them. Third, quality-vs-latency: the question 'is the user asking me to click a button or to write a 4,000-word report' has a different best answer, and almost no consumer agent has a router for it. The 'scheduler' that would close those three at once does not exist, and one tool will not ship it.
Why does everyone keep saying 'just use ANE'?
Because the ANE is the highest-throughput, lowest-power compute block on the chip, so on paper it should be the answer. In practice it has two strict limits that kill it for general LLM inference. First, the kernels need to be CoreML-converted with quantizations the ANE supports, which is a narrow path; full transformer attention with long context typically falls back to GPU. Second, the ANE is sized for vision and ASR-shaped workloads, not for streaming a 13B weight matrix per token. Apple themselves run their own foundation models with a split: the ANE handles the embedding and the small adapter passes, the GPU handles the big matmuls. That split is the thing third-party runtimes have not productized. Until they do, 'just use ANE' for a chat model is a bumper sticker, not a scheduler.
Is the gap closing? Is there a scheduler coming?
At the silicon layer, slowly. MLX has experimental ANE backends, llama.cpp has a CoreML-via-Metal-Performance-Shaders path for parts of a graph, and Apple keeps shipping more of their own model code that demonstrates the split. None of those are production for arbitrary models yet. At the work-type and quality-vs-latency layers, the gap is wider, because nobody is even building it. Most local-AI products try to be one runtime serving one model and call that 'the agent'. That is not a scheduler, it is one entry in a scheduling table.
How does Fazm sidestep all three layers?
By not being a runtime. The Fazm desktop app speaks the Anthropic Messages HTTP API and points at one URL. That URL can be api.anthropic.com, a corporate proxy, a GitHub Copilot gateway, or a local server like LM Studio, mlx-omni-server, or claude-code-mlx-proxy. Whatever sits behind that URL owns the silicon scheduling. Fazm's job is not to dispatch tensors to the ANE. Fazm's job is the layer above, which is what work the agent does this turn (read screen state, call a tool, ask the user to clarify, schedule a recurring run). Look at Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 885 for the UserDefault that stores the URL, and Desktop/Sources/Chat/ACPBridge.swift lines 467-470 for the env-var export. That is the whole seam.
What about screen state? Doesn't reading the screen need a vision model on the ANE?
Only if you build it that way. The other way is to read the macOS accessibility tree, which is structured text that the OS already maintains for every running app, and serialize it as a small JSON object. No model, no ANE, no CV pipeline. The accessibility-tree path is what removes the second-largest scheduler decision (where does the OCR/CV model live) from the problem entirely. Fazm uses the AX tree for the agent's screen-state input. capture_screenshot exists as one of the 19 tools in acp-bridge/src/fazm-tools-stdio.ts, but the loop does not feed a screenshot to the model every turn; it reaches for one only when the AX tree is empty, which is the canvas/WebGL/custom-drawn case.
If the agent does not run a local model, in what sense is it 'local AI'?
It is the part of the system that runs on your machine and touches your apps, your screen, your files, your logged-in browser, and the audio coming out of your speakers and into your microphone. That is the part that needs to be local. Whether the reasoner is api.anthropic.com or a 13B MLX model running on the same Mac is a wire choice. The Fazm code path is identical for both. People who say 'local AI' meaning 'the model weights live on disk and nothing leaves the box' are talking about one valid privacy posture. People who say 'local AI' meaning 'the agent that acts on my machine is on my machine, with a model wire that I control' are talking about another. Conflating them is what makes the scheduler question feel impossible.
Why is the work-type layer the one nobody talks about?
Because the local-AI conversation grew out of the local-LLM community, and a local LLM is one component. The agent on top is a system. A system has more than one compute consumer (the reasoner, the speech recognizer, the speech synthesizer, the screen-state reader, the embedding model for memory, the classifier for routing). Each one has its own latency budget and its own quality requirement. Lumping them under 'local AI' makes you talk about chips when you should be talking about the routing table. Until consumer tools render that table visible, the scheduler gap will keep getting framed as a silicon problem.
Where should the scheduler actually live?
Three places, one per layer. Silicon scheduling belongs in the inference runtime: it is the runtime's job to dispatch ops to CPU, GPU, or ANE in the right combination for a given graph. Work-type scheduling belongs in the agent loop: the loop should know that 'transcribe what the user said' goes to a different endpoint than 'plan the next click', and route accordingly. Quality-vs-latency scheduling belongs in a top router that classifies the incoming request: short clarifications go to a small fast model, long reports go to a big slow one. None of those three want to be one process. The scheduler is missing because we keep trying to make it one process.
What is the smallest concrete experiment to feel this gap on my own Mac?
Three commands. (1) Run llama-server with a 13B model on Metal, time a 200-token generation. (2) Run the same model via mlx-omni-server with --use-ane (where supported), time the same generation. (3) Run a much smaller classification model (a 1.5B or 3B) in parallel and have it pick between the two for short vs long requests. You will discover, in that order: the GPU is fine but the chip is two-thirds idle, the ANE accelerates parts of small graphs but is not a silver bullet for the whole model, and the router in step three is doing more work for end-user-perceived latency than either backend swap. That last finding is the one nobody on the buyer's-guide pages mentions, and it is where the gap lives.
Related field notes
Local LLM runtime done, agent loop missing: the six pieces the runtime never shipped
Ollama or llama.cpp gives you a forward pass and a KV cache. The other six pieces of a real agent loop (tool schema, tool sandbox, screen state, conversation state, scheduler, swappable endpoint) live above the runtime.
Local AI hardware tradeoffs on Apple Silicon: bandwidth, memory, and the third axis no one mentions
Every Apple Silicon buyer's guide for local AI argues bandwidth versus unified memory. That framing is correct for chat and wrong the moment your agent does any actual computer-use work.
Local MLX model for desktop loops: the one settings field that wires it in
How a 13B MLX model behind a small Anthropic-compatible HTTP shim becomes the reasoner for a real macOS computer-use agent, in one settings field.