Field notes from one shipping macOS agent
Local LLM runtime done, agent loop missing
You loaded a 13B model. Tokens stream out fine. Then you ask it to do something across two apps and there is no next step. The runtime is finished and the agent is still missing. The reason is structural: the runtime is one component, the agent loop is six components, and nobody ships the other five with the model server. Below is the list, with the file in one open source loop where each piece lives.
Direct answer (verified 2026-05-07)
Six things a local LLM runtime does not give you: a tool schema interface the model can call, a tool-execution sandbox with timeouts, a screen-state representation, persistent conversation state across turns, a scheduler for recurring runs, and a stable model endpoint your loop can swap. The runtime fills exactly the last slot when you point an agent loop at it. The other five are the loop, and you have to either build them or pick a loop that already has them.
Source: acp-bridge/src/index.ts, fazm-tools-stdio.ts, and ACPBridge.swift. Each piece below names a line.
Why the runtime feels finished but the agent still cannot work
A runtime is a model server. Ollama, LM Studio, llama.cpp, vllm-mlx, MLX itself: same shape under different skins. You hand it a request, it runs a forward pass, it streams tokens back, it forgets. That is the entire contract. The runtime cannot click a button, cannot read your screen, cannot remember yesterday, cannot wake itself up at 9 AM, cannot tell if a tool just hung for two minutes or finished successfully. None of those failures are bugs in the runtime. They are pieces of a different layer of the stack.
That layer is the agent loop. It is the program that sits between you and the runtime, calling the model in a cycle and feeding tool results back as the next user turn. Everything that makes the system feel intelligent (tool definitions, sandboxing, screen awareness, memory, scheduling) lives inside that loop. The runtime is one component the loop calls. Treating it as the whole thing is the same mistake as installing Postgres and waiting for an app to appear.
The cleanest way to make the missing parts concrete is to read a real loop and find each piece. The rest of this page does that against one open source agent on macOS. If you read another loop later, the shape will be the same; only the file names change.
What the runtime owns vs what the loop owns
A simple split that clears most of the confusion. The columns below are not the runtime against a competing runtime. They are the loop against the runtime, side by side, on a single agent system.
| Feature | Local LLM runtime | Agent loop |
|---|---|---|
| Forward pass: tokens in, tokens out | Yes, this is the runtime | Not its job |
| KV cache and prefix caching | Yes, vLLM and llama.cpp do this | Delegated to the runtime |
| Decoding strategy and JSON mode | Yes, the sampler and structured output mode | Delegated to the runtime |
| Tool schemas the model can call | No concept of a tool | 19 tools in fazm-tools-stdio.ts |
| Tool-execution sandbox with timeouts | Not even an interface | 10s / 120s / 300s tiers in index.ts |
| Screen-state representation | Runtime never sees the screen | Accessibility tree via macos-use MCP |
| Persistent conversation state | Stateless across calls | ChatMessageStore plus session chain |
| Scheduler and recurring runs | One request, one fire | cron-runner.mjs plus routines_* tools |
| Swappable model endpoint | Yes, you point the loop at it | customApiEndpoint, SettingsPage line 885 |
The six pieces that fill the gap
Walked top to bottom. Each one names the file you can open in the open source bridge. The point is not that you have to use this specific loop. The point is to recognize the shape so you can spot it, or its absence, in any other framework you are evaluating.
1. A tool schema interface the model can call
The runtime sees tokens. Tools live in the loop. In Fazm's open source bridge that means 19 native stdio tools registered at acp-bridge/src/fazm-tools-stdio.ts (execute_sql, capture_screenshot, request_permission, scan_files, ask_followup, routines_create, save_knowledge_graph, speak_response, set_user_preferences, and more), plus an MCP fan-out for macos-use, playwright, fazm_tools, whatsapp, and google-workspace. The model gets a JSON schema; the loop is the dispatcher.
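A minimal sketch of that interface in TypeScript, with hypothetical names (this is not Fazm's code): the registry holds the handler, the model only ever receives the schema, and the dispatcher closes the circle when a tool call comes back.

```ts
// Hypothetical registry sketch. The model only ever sees the JSON schema;
// the dispatcher maps an emitted tool name back to a real function.
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;

interface ToolDef {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema, sent in the prompt
  handler: ToolHandler;                 // never sent to the model
}

const tools = new Map<string, ToolDef>();

function registerTool(def: ToolDef) {
  tools.set(def.name, def);
}

registerTool({
  name: "capture_screenshot",
  description: "Capture the frontmost window as a PNG and return its path.",
  inputSchema: { type: "object", properties: {}, required: [] },
  handler: async () => "/tmp/shot-0001.png", // stub: real impl shells out
});

// What goes into the request: schemas only, handlers stripped.
export const toolSchemas = [...tools.values()].map(
  ({ name, description, inputSchema }) => ({ name, description, input_schema: inputSchema }),
);

// What runs when the model emits a tool call.
export async function dispatch(name: string, args: Record<string, unknown>): Promise<string> {
  const def = tools.get(name);
  if (!def) return `Error: unknown tool "${name}"`;
  return def.handler(args);
}
```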
2. A tool-execution sandbox with real timeouts
Tools that hang freeze the agent. The loop has to enforce wall-clock limits and synthesize a failure result the model can read. In acp-bridge/src/index.ts lines 114-116 there are three timeout tiers: 10 seconds for internal tools like ToolSearch, 120 seconds for MCP tools, 300 seconds for everything else. The watchdog in startToolTimer (around line 141) tracks every active call and emits a synthetic timeout event if the deadline passes. None of this is in your runtime; all of it is required.
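A hedged sketch of the same tiered-deadline idea. The constants mirror the bridge's numbers; the wrapper itself is illustrative, not Fazm's code. Race the tool against a timer and turn a hang into a failure result the model can read.

```ts
// Timeout tiers matching the numbers in acp-bridge/src/index.ts.
const TOOL_TIMEOUT_INTERNAL_MS = 10_000;  // internal tools like ToolSearch
const TOOL_TIMEOUT_MCP_MS = 120_000;      // MCP tools
const TOOL_TIMEOUT_DEFAULT_MS = 300_000;  // everything else

function timeoutFor(toolName: string, isMcp: boolean): number {
  if (toolName === "ToolSearch") return TOOL_TIMEOUT_INTERNAL_MS;
  return isMcp ? TOOL_TIMEOUT_MCP_MS : TOOL_TIMEOUT_DEFAULT_MS;
}

// Run a tool against a wall-clock deadline. On timeout, resolve with a
// synthetic failure the model reads as a normal tool result, so the loop
// keeps moving instead of freezing on a hung subprocess.
async function runWithDeadline(
  toolName: string,
  isMcp: boolean,
  run: () => Promise<string>,
): Promise<string> {
  const ms = timeoutFor(toolName, isMcp);
  let timer: ReturnType<typeof setTimeout>;
  const deadline = new Promise<string>((resolve) => {
    timer = setTimeout(
      () => resolve(`Error: tool "${toolName}" timed out after ${ms / 1000}s`),
      ms,
    );
  });
  try {
    return await Promise.race([run(), deadline]);
  } finally {
    clearTimeout(timer!);
  }
}
```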
3. A screen-state representation
The interesting tool calls happen against apps the model has never seen. The loop has to convert the current screen into something the model can reason about. The two paths are screenshots (1500-3000 vision tokens per turn) or a compact accessibility tree (200-400 tokens). On a local 13B model the difference is whether a 10-step task finishes in two minutes or twelve. The loop owns this choice. The runtime never sees the screen.
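To make the token difference concrete, here is an illustrative flattener over a hypothetical node shape (the real tree comes from the macos-use MCP server; roles and field names here are assumptions). Each visible element becomes one short line the model can cite by id when it emits a click.

```ts
// Hypothetical accessibility-node shape, flattened to compact lines.
interface AXNode {
  role: string;            // e.g. "AXButton"
  title?: string;
  bounds: [number, number, number, number]; // x, y, w, h
  children?: AXNode[];
}

export function serialize(node: AXNode, lines: string[] = [], id = { n: 0 }): string[] {
  const [x, y, w, h] = node.bounds;
  // Skip unlabeled pure containers; they cost tokens and carry no signal.
  if (node.title || node.role !== "AXGroup") {
    lines.push(`#${id.n++} ${node.role} "${node.title ?? ""}" @${x},${y} ${w}x${h}`);
  }
  for (const child of node.children ?? []) serialize(child, lines, id);
  return lines;
}

// A medium window flattens to a few hundred tokens, against a few
// thousand vision tokens for the equivalent screenshot.
```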
4. Persistent conversation state across turns
Three layers, normally conflated. The in-flight messages on this turn (system prompt, tool schemas, recent history, new input). The persisted thread that survives a restart. The longer-lived memory the loop reloads at session start so the model knows the user without burning tokens recapping every time. Fazm carries those as a SQLite-backed message store plus a session-chain pattern in UserDefaults so the upstream session id can roll forward without amputating prior context. The runtime is stateless. The loop holds memory.
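A sketch of how the three layers meet on one turn, with stubbed stores (the real thing is a SQLite-backed store plus a memory file; these names are assumptions):

```ts
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Layer 3: long-lived memory, reloaded once at session start.
async function loadMemory(): Promise<string> {
  return "User: Alex. Timezone: PST. Prefers terse answers."; // stub
}

// Layer 2: the persisted thread that survives a process restart.
async function loadThread(sessionId: string): Promise<Msg[]> {
  return []; // stub: real impl reads a SQLite-backed store
}

// Layer 1: the in-flight message list the model actually sees this turn.
async function buildTurn(sessionId: string, userInput: string): Promise<Msg[]> {
  const memory = await loadMemory();
  const recent = (await loadThread(sessionId)).slice(-20); // last N turns
  return [
    { role: "system", content: `You are a desktop agent.\n\n${memory}` },
    ...recent,
    { role: "user", content: userInput },
  ];
}
```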
5. A scheduler for recurring and triggered runs
Most useful agent work is not one-off. It is the same task on a cron, every morning, every hour, every Friday. A scheduler that fires the loop on a calendar trigger is part of an agent system, not the runtime. Fazm ships acp-bridge/src/cron-runner.mjs and exposes routines_create, routines_list, routines_update, routines_remove, and routines_runs as tools the agent can call to schedule its own future runs. Each scheduled fire spawns a fresh ACP session against the same bridge.
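A minimal sketch of the runner's shape, with hypothetical names and stubbed helpers (the real one is cron-runner.mjs; this just shows the cycle):

```ts
interface Routine { id: string; cronExpr: string; prompt: string }

async function loadRoutines(): Promise<Routine[]> {
  return []; // stub: real impl reads the routines store
}

function cronMatches(expr: string, at: Date): boolean {
  // Stub: a real impl parses the five cron fields; here only "every minute".
  return expr === "* * * * *";
}

async function spawnSession(prompt: string): Promise<void> {
  console.log(`would spawn a fresh ACP session: ${prompt}`); // stub
}

// Check once a minute; each due routine fires a fresh session against
// the same bridge, exactly as the prose above describes.
async function tick() {
  const now = new Date();
  for (const r of await loadRoutines()) {
    if (cronMatches(r.cronExpr, now)) await spawnSession(r.prompt);
  }
}

setInterval(tick, 60_000);
```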
6. A stable model endpoint the loop can swap
This is the seam where the runtime plugs in. In Fazm one @AppStorage line at SettingsPage.swift line 885 declares customApiEndpoint, the Settings UI writes to it (lines 983-998), and ACPBridge.swift lines 467-470 read it on bridge spawn and export it as ANTHROPIC_BASE_URL. Anything that speaks the Anthropic Messages API drops in: an LM Studio shim, vllm-mlx, an Anthropic-compatible proxy, GitHub Copilot bridge. The agent process and the 19 tools never change. Only the reasoner behind them does.
The seam that lets you keep the runtime swappable
One line you should be able to find in any agent loop you adopt. In Fazm it lives at SettingsPage.swift line 885:
```swift
// Desktop/Sources/MainWindow/Pages/SettingsPage.swift:885
@AppStorage("customApiEndpoint") private var customApiEndpoint: String = ""
```

The Settings UI writes a URL into that key. When the bridge spawns, it reads the same key and exports it as the env var the agent process picks up. From ACPBridge.swift lines 467 to 470:
```swift
// Desktop/Sources/Chat/ACPBridge.swift:467
// Custom API endpoint (allows proxying through Copilot, corporate gateways, etc.)
if let customEndpoint = defaults.string(forKey: "customApiEndpoint"), !customEndpoint.isEmpty {
    env["ANTHROPIC_BASE_URL"] = customEndpoint
}
```

That is the entire seam between the runtime and the loop. The agent process under the bridge is @agentclientprotocol/claude-agent-acp 0.29.2 (see acp-bridge/package.json), and it speaks the Anthropic Messages API. Anything that also speaks that API plugs straight in: a vllm-mlx server with an Anthropic shim, an LM Studio bridge, a corporate gateway, GitHub Copilot in Anthropic mode. The 19 stdio tools and the MCP fan-out do not change. Only the reasoner behind them does.
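For concreteness, here is a hedged sketch of what "an LM Studio shim" means: a tiny proxy that accepts the Anthropic Messages API on one side and forwards to an OpenAI-compatible local server on the other (port 1234 is LM Studio's default; the rest is illustrative). Non-streaming only, no tool-call translation, so it shows the shape rather than a drop-in replacement.

```ts
import http from "node:http";

const LOCAL = "http://127.0.0.1:1234/v1/chat/completions"; // LM Studio default

http.createServer(async (req, res) => {
  if (req.method !== "POST" || !req.url?.startsWith("/v1/messages")) {
    res.writeHead(404).end();
    return;
  }
  let body = "";
  for await (const chunk of req) body += chunk;
  const a = JSON.parse(body); // Anthropic-shaped request

  // Anthropic carries the system prompt as a top-level field;
  // OpenAI-compatible servers expect it as the first message.
  const messages = [
    ...(a.system ? [{ role: "system", content: a.system }] : []),
    ...a.messages.map((m: any) => ({
      role: m.role,
      content: typeof m.content === "string"
        ? m.content
        : m.content.map((b: any) => b.text ?? "").join(""),
    })),
  ];

  const upstream = await fetch(LOCAL, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model: a.model, max_tokens: a.max_tokens, messages }),
  });
  const o = (await upstream.json()) as any;

  // Re-wrap the completion in an Anthropic-shaped response.
  res.writeHead(200, { "content-type": "application/json" });
  res.end(JSON.stringify({
    id: o.id, type: "message", role: "assistant", model: a.model,
    content: [{ type: "text", text: o.choices[0].message.content }],
    stop_reason: "end_turn",
  }));
}).listen(8787, () => console.log("shim on http://127.0.0.1:8787"));
```

Point customApiEndpoint at http://127.0.0.1:8787 and the bridge exports it as ANTHROPIC_BASE_URL on the next spawn.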
When you go shopping for a loop, this is the field to look for. If there is no equivalent of customApiEndpoint and no env-var injection point, swapping the runtime later means rewriting the loop. That is the wrong end of the stack to be married to.
“The runtime ships you the tokens. Everything else is the loop, and the loop is six things, not one.”
Fazm bridge source
Two ways to fill the gap from here
One: build the loop yourself. Define a tool-call format the model can emit, write a parser, register tools, add timeouts, persist messages, expose a scheduler, design a screen-state extractor, build a settings field for the model endpoint. It is a real project. A weekend gets you the toy version; a month gets you a version that does not eat your data.
Two: pick a loop that already has the six pieces and point its model endpoint at your local runtime. Fazm does this on macOS; Cline and Goose do it on different transports. Read whichever is closest to your shape, find its customApiEndpoint equivalent, and swap the URL. The loop runs the same; the model behind it is now your local runtime.
Either way, the diagnosis flips. The thing that was missing was never the model. It was the five components around it that nobody packaged together until you went looking for them.
Bolting a local runtime onto a real agent loop
Walk through the six pieces against your stack. We will read the loop you are evaluating and point at the seam where your runtime drops in.
Frequently asked questions
I have Ollama running and a 13B model loaded. Why does it still feel like nothing works?
Because the runtime is one piece of the system, not the whole system. Ollama, LM Studio, llama.cpp, vLLM and friends are model servers. They take a request, run a forward pass, and stream tokens back. They do not define tools the model can call, do not execute those tools, do not maintain conversation state across turns, do not represent screen state, do not schedule recurring runs. An agent loop is the code that does all of those things and uses the runtime as one component, the way a web app uses a database. Loading a model and waiting for an agent to emerge is like installing Postgres and waiting for an app to appear.
What exactly is an agent loop, in two sentences?
A program that calls the model in a loop, passing back the output of the previous tool call as the next user turn, until the model emits a stop signal. That is it. Everything else (tool definitions, tool sandbox, state, memory, scheduling) is plumbing the loop needs to actually do useful work on a real machine.
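Those two sentences as code, a hedged sketch with stubbed helpers (callModel and dispatch are assumptions, not a real API):

```ts
interface Msg { role: "user" | "assistant"; content: string }
interface ModelOut {
  text: string;
  toolCall?: { name: string; args: Record<string, unknown> };
}

async function callModel(history: Msg[]): Promise<ModelOut> {
  return { text: "done" }; // stub: real impl POSTs to the runtime
}

async function dispatch(name: string, args: Record<string, unknown>): Promise<string> {
  return "ok"; // stub: real impl runs the tool under a timeout
}

// Call the model, run any tool it asks for, feed the result back as the
// next user turn, stop when it stops asking.
async function agentLoop(task: string): Promise<string> {
  const history: Msg[] = [{ role: "user", content: task }];
  for (let step = 0; step < 50; step++) {     // hard cap against runaways
    const out = await callModel(history);
    history.push({ role: "assistant", content: out.text });
    if (!out.toolCall) return out.text;       // stop signal: no tool call
    const result = await dispatch(out.toolCall.name, out.toolCall.args);
    history.push({ role: "user", content: `Tool result: ${result}` });
  }
  throw new Error("step limit reached");
}
```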
If the runtime is so thin, what gives a model the ability to call tools at all?
Two things outside the runtime. First, a tool schema sent to the model as part of the prompt, normally as a JSON schema that names each tool, its arguments, and what it does. Second, a parser on the host side that recognizes when the model has emitted a tool-call structure and dispatches it to a real function. The runtime just streams tokens. The schema design and the dispatcher live in the agent loop. In Fazm's open source bridge those 19 stdio tools are defined at acp-bridge/src/fazm-tools-stdio.ts (execute_sql, capture_screenshot, request_permission, scan_files, ask_followup, routines_create, and so on), and the dispatcher is wrapped by the watchdog in acp-bridge/src/index.ts around lines 109 to 240.
Can I just point my local runtime at a framework that already has the loop?
Yes, and that is the cleanest way to fill the gap. The seam to look for is whether the framework lets you swap the model endpoint without touching the rest. In Fazm that seam is one Settings field that writes to UserDefaults key customApiEndpoint at SettingsPage.swift line 885, and one env-var assignment at ACPBridge.swift line 469 that exports it as ANTHROPIC_BASE_URL when the bridge spawns. Anything that speaks the Anthropic Messages API plugs in there: an LM Studio shim, a vllm-mlx server, an Anthropic-compatible proxy. The loop and the 19 tools stay constant; only the reasoner behind them changes.
Why does the agent loop need a screen-state representation if the model can read text?
Because most desktop work is not in the chat window. It is in another app, behind a button the model has never seen, on a screen that is changing every few seconds. The loop needs a way to convert the current screen into a structure the model can reason about and reference by ID when it emits a click. The two real options are a screenshot (vision-tokenized, roughly 1500-3000 input tokens per turn) or a compact accessibility tree (200-400 tokens for the same window, with explicit roles and bounding boxes). The accessibility-tree path is what makes the loop affordable on a local 13B model; the screenshot path is what every framework defaults to and why local agents feel slow.
What is conversation state, beyond the chat history I already have?
Three layers most setups conflate into one. First, the in-flight message list the model sees on this turn (system prompt + tool schemas + last N turns + new user input). Second, the persisted thread that survives a process restart (a SQLite or file-backed store, so closing the app and reopening it does not lose the conversation). Third, longer-lived facts the loop reloads at the start of every session (a memory file like MEMORY.md plus topic files, so the model knows who the user is across sessions without burning tokens on every turn). A bare runtime owns none of these. The loop has to.
Where does scheduling fit? My model is for one-off prompts.
Most useful agent work is not one-off. It is running every morning at 9 to summarize unread Gmail, or every hour to check a Reddit thread, or every Friday to invoice clients. A scheduler that fires the loop on a cron or a calendar trigger is part of an agent system, not a plain LLM app. Fazm's bridge ships a runner at acp-bridge/src/cron-runner.mjs and exposes routines_create, routines_list, routines_update, routines_remove, and routines_runs as tools so the agent itself can schedule its own future runs. The runtime knows nothing about any of that.
What about MCP servers? I keep seeing them mentioned next to local LLM stuff.
MCP (Model Context Protocol) is a standard for exposing tools to a loop, not a replacement for the loop. An MCP server like macos-use, playwright, or fazm_tools defines a set of tools and a transport (stdio, HTTP, websocket). The loop has to spawn it, register the tools, route tool calls into it, surface the response, and time it out if it hangs. In Fazm's bridge, MCP tools have a 120s timeout and other tools have a 10s or 300s timeout (TOOL_TIMEOUT_INTERNAL_MS, TOOL_TIMEOUT_MCP_MS, TOOL_TIMEOUT_DEFAULT_MS at acp-bridge/src/index.ts lines 114-116). Without those timeouts a single hung subprocess freezes the agent. MCP is one piece, not the whole.
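The loop side of that, sketched with the official TypeScript SDK (@modelcontextprotocol/sdk; the server command is illustrative, and in practice you would wrap callTool with the 120s deadline from piece 2 so a hang cannot freeze the loop):

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Spawn: the loop owns the server subprocess and its transport.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["-y", "@playwright/mcp"], // any stdio MCP server works here
  });
  const client = new Client({ name: "my-agent-loop", version: "0.1.0" });
  await client.connect(transport);

  // Register: these schemas get merged into the prompt next to native tools.
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));

  // Route: a model-emitted call becomes a callTool over stdio.
  const result = await client.callTool({ name: tools[0].name, arguments: {} });
  console.log(result.content);
}

main().catch(console.error);
```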
Is there an open source agent loop I can read to fill this in for myself?
Two on macOS that are worth opening side by side. Fazm at github.com/mediar-ai/fazm wraps Anthropic's claude-agent-acp 0.29.2 (acp-bridge/package.json) and adds 19 native tools, a permission system, a scheduler, screen-state via accessibility APIs, and a customApiEndpoint seam for swapping in any Anthropic-compatible local runtime. Cline and Goose are also good references for the same shape on a different transport. Read whichever is closest to what you want; once you have seen the loop, the runtime stops feeling like the bottleneck.
What does the runtime actually still own once you separate the loop?
Three things and only three things. The forward pass (tokens in, tokens out). The KV cache and any prefix caching tricks that make repeated turns cheap. The decoding strategy (sampling, speculative decoding, structured output mode, JSON mode). Everything else (when to call the model, what to put in the prompt, how to parse the output, how to react to it, what to remember) belongs to the loop. Picking a runtime is mostly about prefill speed and prefix caching support; picking a loop is about the other five pieces on this page.
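What exercising the runtime's share looks like from the loop's side, as a sketch against an OpenAI-compatible local endpoint (the URL and model tag are placeholders, and JSON mode depends on the runtime supporting it): the loop never implements sampling or structured decoding, it just sets the knobs on the request.

```ts
async function constrainedCall(): Promise<unknown> {
  const res = await fetch("http://127.0.0.1:11434/v1/chat/completions", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: "llama3:13b",                      // placeholder model tag
      messages: [{ role: "user", content: "List 3 macOS apps as JSON." }],
      response_format: { type: "json_object" }, // runtime-side JSON mode
      temperature: 0,                           // decoding knob, runtime-owned
    }),
  });
  const data = (await res.json()) as any;
  return JSON.parse(data.choices[0].message.content);
}
```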
Other field notes on the same stack
Adjacent reading
Local LLM workflow literacy, the five primitives that turn a chatbox into work
The broader operational vocabulary: agent loop, screen-state representation, swappable reasoner, skills, persistent memory. Each anchored to a file in the open source app.
Local LLM desktop agent throughput, the number that matters is not generation tok/s
Once the loop is in place, the bottleneck flips to prefill speed. Per-turn input length divided by prefill tok/s is your floor on time-to-first-tool-call. Decode speed barely matters.
Accessibility tree limits beyond the browser
Where the AX-tree representation stops working, why canvas apps are blind, and which fallbacks earn their place when the tree returns nothing useful.