LOCAL MODEL / REAL MAC AGENT

Run vLLM locally on a Mac, then give it hands

Every guide on running vLLM on a Mac in 2026 ends at the same place: a curl against http://localhost:8000/v1/chat/completions that returns a JSON completion. Interesting for about thirty seconds. The step that turns a local model into something useful is wiring it to an agent that can actually do things inside your other Mac apps. That part takes one field in a signed Mac app's Settings. This page is about that field, the Swift line that reads it, and the small proxy shim between vLLM's OpenAI protocol and the Anthropic-shaped calls a real Mac agent wants to make.

Matthew Diakonov
9 min read
Written from the Fazm macOS source tree
  • Custom API Endpoint, one field in Settings
  • ANTHROPIC_BASE_URL override in ACPBridge.swift
  • LiteLLM or claude-code-router as the Anthropic shim
  • Drives any Mac app via AXUIElement
  • No screenshots, no cloud calls

What every other guide on this stops at

The pages that currently come up for this topic walk you through three install paths: build vLLM from source on the CPU (the official macOS path), install the vllm-project/vllm-metal plugin for Apple GPU acceleration via MLX, or use Docker Model Runner, which added vllm-metal as a first-class backend in April 2026. All three get you a running server on http://localhost:8000 that speaks the OpenAI Chat Completions protocol. The guides then run a curl, print the JSON response, and end.

What they do not tell you is what to do with the server. A model you cannot call from anything real is a toy. On a Mac, the interesting thing to do with a local model is to give it hands, the way Claude Sonnet does when it is connected to a desktop agent: read the open Calendar window, draft a WhatsApp reply, move files in Finder, add a row to a Google Sheet. None of that works from a curl.

The shortcut is that there is already a consumer Mac app where Claude does exactly that, and it exposes a single setting that lets you swap Claude for whatever is answering on your localhost. The rest of this page is that setting, where it lives in the source, and the exact shim you need between vLLM and the call shape the agent makes.

How a local vLLM call becomes a click in Finder

[Diagram: vLLM server → LiteLLM shim → Custom API Endpoint → ACP bridge (ACP v0.29.2) → Finder, Calendar, WhatsApp]

The anchor fact: one line of Swift, three lines of shim

The whole trick lives in four lines of code inside the Fazm macOS source. When the Swift app spawns the Node.js ACP bridge that talks to Claude over JSON-RPC, it checks a UserDefaults key called customApiEndpoint. If it is non-empty, the app sets env["ANTHROPIC_BASE_URL"] to whatever you pasted into the field. From that moment, every model call the bridge makes goes to your server instead of api.anthropic.com.

There is no dev mode toggle, no undocumented plist, no XPC gymnastics. The setting ships in every signed Fazm build.
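The mechanics are easy to mirror in any language. Here is a Python sketch of the same pattern, purely illustrative: the real code is a few lines of Swift in ACPBridge.swift, and the child process below is a stand-in for the Node ACP bridge.

```python
import os
import subprocess
import sys

def spawn_bridge(custom_endpoint: str) -> str:
    """Mirror of the override in ACPBridge.swift: copy the parent
    environment and, when the custom endpoint is non-empty, point
    ANTHROPIC_BASE_URL at it before spawning the child. The child
    here just echoes the value it received."""
    env = os.environ.copy()
    if custom_endpoint:  # non-empty -> override, same check as the Swift side
        env["ANTHROPIC_BASE_URL"] = custom_endpoint
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('ANTHROPIC_BASE_URL', 'unset'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(spawn_bridge("http://127.0.0.1:4000"))  # → http://127.0.0.1:4000
```

The child inherits everything else from the parent environment, which is why the bridge needs no other configuration to switch backends.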

  • Desktop/Sources/Chat/ACPBridge.swift
  • Desktop/Sources/MainWindow/Pages/SettingsPage.swift
  • 1 field in Settings

Settings > AI Chat > Custom API Endpoint. The placeholder reads 'https://your-proxy:8766'. Toggling it off clears the value and respawns the ACP bridge so the next query routes through the default Anthropic endpoint again.

Desktop/Sources/MainWindow/Pages/SettingsPage.swift:936

Serve vLLM on Apple Silicon, then front it with an Anthropic shim

vLLM itself speaks the OpenAI Chat Completions protocol. The ACP bridge inside Fazm speaks Anthropic's Messages protocol, because it runs claude-code-acp under the hood and uses the Anthropic SDK. That is a small translation problem, not a blocker. Three things solve it today.

LiteLLM's proxy is the most flexible option. It accepts Anthropic requests on the ingress, maps them to whatever model backend you configure (vLLM is just an OpenAI-compatible endpoint to LiteLLM), and forwards tool_use blocks through correctly. claude-code-router is the more opinionated, single-purpose choice, built specifically for this pattern and aware of claude-code-acp's shape. And if you are starting fresh, the waybarrios/vllm-mlx fork ships an Anthropic-compatible server out of the gate, so you skip the shim entirely.
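Conceptually, the shim's job is small: accept an Anthropic Messages request and re-emit it as an OpenAI Chat Completions request. A stripped-down Python sketch of that mapping (illustrative only; it ignores streaming, images, and tool_use blocks, all of which LiteLLM's real implementation handles):

```python
def anthropic_to_openai(req: dict) -> dict:
    """Map the core fields of an Anthropic Messages request onto the
    OpenAI Chat Completions shape, as a shim like LiteLLM does."""
    messages = []
    if "system" in req:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI carries it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    for msg in req["messages"]:
        content = msg["content"]
        if isinstance(content, list):  # Anthropic content-block form
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": req["model"],  # the shim may remap this to a served-model-name
        "max_tokens": req.get("max_tokens", 1024),
        "messages": messages,
    }

request = {
    "model": "local-qwen",
    "max_tokens": 256,
    "system": "You are a Mac desktop agent.",
    "messages": [{"role": "user",
                  "content": [{"type": "text", "text": "Summarise today's calendar."}]}],
}
print(anthropic_to_openai(request)["messages"])
```

The reverse direction (OpenAI response back into an Anthropic-shaped message) is the other half of the job, and it is where tool_use fidelity matters most.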

Serve Qwen on vLLM, shim it with LiteLLM, hand it to Fazm
run-local-agent.sh

Why screenshots would be the wrong primitive here

The other pattern for desktop agents is a vision-model loop: take a screenshot, send it to a multimodal model, ask it for x/y coordinates, click. That pattern exists because it works cross-platform and needs no platform APIs. It is also expensive in three ways that matter when your model is running on your own M-chip. Each bitmap is a long visual input that chews tokens. OCR on dark mode and Retina text is fragile. And the loop has to re-capture after every click because state has changed.

The macOS accessibility tree solves all three problems at once. It is the same tree VoiceOver walks: every element already has a role, a title, a value, a frame, and a list of children. AXUIElementCreateApplication(pid) plus AXUIElementCopyAttributeValue return structured data the model can reason about directly, usually a few hundred tokens per window, and the tree updates in place when the app state changes. That matters a lot more when the model doing the reasoning is a 7 to 14 billion parameter local model than when it is a frontier cloud model.
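To see why the tree is so much cheaper than pixels, it helps to look at what a flattened window costs in plain text. A toy Python sketch (the dict shape is invented for illustration; the real data comes from AXUIElement calls in Swift):

```python
def flatten_ax(node: dict, depth: int = 0) -> list[str]:
    """Flatten a toy accessibility tree into indented 'role title = value'
    lines, the kind of compact structured view a small local model can
    read instead of a screenshot."""
    line = f"{'  ' * depth}{node['role']}"
    if node.get("title"):
        line += f" '{node['title']}'"
    if node.get("value"):
        line += f" = {node['value']!r}"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_ax(child, depth + 1))
    return lines

# A hypothetical Finder window, a few hundred tokens at most once flattened.
window = {
    "role": "AXWindow", "title": "Downloads",
    "children": [
        {"role": "AXButton", "title": "Back"},
        {"role": "AXTextField", "title": "Search", "value": ""},
        {"role": "AXRow", "children": [
            {"role": "AXStaticText", "value": "report.pdf"}]},
    ],
}
print("\n".join(flatten_ax(window)))
```

Five short lines for a window that would cost thousands of image tokens as a Retina bitmap, and the structure can be re-walked in place after every action.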

AX tree instead of pixels

AXUIElementCreateApplication, AXUIElementCopyAttributeValue, kAXFocusedWindowAttribute. The same tree VoiceOver reads. Structured, small, current.

Works with any Mac app

Finder, Calendar, Messages, Notes, Xcode, WhatsApp, Slack, VS Code. If the app implements accessibility, the agent can see it and drive it.

Consumer app, not a framework

Signed .app from fazm.ai. No pip install, no Docker, no venv. The ACP bridge and five MCP servers are bundled inside the .app.

Five MCP servers bundled

fazm_tools, playwright, macos-use, whatsapp, google-workspace. Hardcoded in acp-bridge/src/index.ts around line 1266.

Bridge respawns on endpoint change

Toggling the Custom API Endpoint field calls restartBridgeForEndpointChange() so the Node subprocess picks up the new ANTHROPIC_BASE_URL without an app restart.

Zero outbound, zero per-token

When the endpoint points at localhost, every agent query stays on your machine. No Anthropic credits consumed, no network required after model download.

1 Settings field to switch models
3 Swift lines that do the override
5 Bundled MCP servers
0 Outbound calls on a local endpoint

End-to-end, in the order a normal Mac user would do it


1. Pick a backend for vLLM on your Mac

CPU-only build from source is the simplest. vllm-project/vllm-metal is faster on M-series via Metal and MLX. Docker Model Runner on Mac supports vllm-metal now if you prefer containers.


2. Serve a small, tool-use-friendly model

Qwen2.5-7B-Instruct or Llama-3.1-8B-Instruct at 4-bit on a 16 GB Mac. Larger quantized models for 32 GB or more. Expose on 127.0.0.1:8000, served-model-name 'local-qwen'.


3. Put an Anthropic-shaped shim in front

LiteLLM proxy is the flexible choice. claude-code-router is the purpose-built one. vllm-mlx skips the shim because it already speaks Anthropic. Whichever you pick, listen on 127.0.0.1:4000.


4. Install Fazm and enable the Custom API Endpoint

Download Fazm from fazm.ai. Open Settings, click AI Chat, toggle on Custom API Endpoint, paste http://127.0.0.1:4000, commit. The ACP bridge respawns with ANTHROPIC_BASE_URL set.


5. Grant accessibility permission, then try a task

On first prompt Fazm walks you through System Settings > Privacy & Security > Accessibility. Then ask something like 'open my calendar and summarise today'. The agent reads the AX tree of Calendar, not pixels.

What you should sanity-check before you declare victory

  • vllm serve is healthy: curl http://127.0.0.1:8000/v1/models returns your served-model-name.
  • The shim speaks Anthropic: curl -X POST http://127.0.0.1:4000/v1/messages with an x-api-key header returns a proper Anthropic-shaped response.
  • Fazm's bridge restarted: after saving the endpoint, the first chat reply streams from the local model (often slower first-token, different style).
  • Accessibility actually works: trigger a task against Finder; Fazm should call kAXFocusedWindowAttribute and summarise the window's children, not a screenshot description.
  • Tool calling made it through: ask 'what is on my calendar today?' If the agent calls the google-workspace MCP instead of guessing, the shim preserved tool_use blocks correctly.
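The first two checks lend themselves to a tiny script. A Python sketch that validates the two response shapes (the sample payloads below are invented stand-ins for what the curls against /v1/models and /v1/messages should return):

```python
import json

def models_ok(models_json: str, served_name: str) -> bool:
    """Check an OpenAI-style /v1/models response for the served-model-name."""
    ids = [m["id"] for m in json.loads(models_json)["data"]]
    return served_name in ids

def looks_anthropic(messages_json: str) -> bool:
    """Check a /v1/messages response for the Anthropic Messages shape:
    type 'message' and a list of content blocks."""
    body = json.loads(messages_json)
    return body.get("type") == "message" and isinstance(body.get("content"), list)

# Stand-in payloads; in practice these come from the two curl commands.
sample_models = '{"object": "list", "data": [{"id": "local-qwen", "object": "model"}]}'
sample_message = ('{"type": "message", "role": "assistant", '
                  '"content": [{"type": "text", "text": "hi"}]}')
print(models_ok(sample_models, "local-qwen"), looks_anthropic(sample_message))  # → True True
```

If the second check fails while the first passes, the problem is the shim, not vLLM.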

Reality check

A 7B model quantized to 4-bit will not behave like claude-sonnet-4-6 on long agent trajectories. It will choose the wrong tool sometimes and hallucinate element names. What it will do, for free, is handle routine desktop chores.

Treat local vLLM as the model that handles repetitive work with no network, and keep a cloud model on the bundled bridge for anything that needs hard reasoning. Flip between them by toggling the Custom API Endpoint switch off and on. No restart required; the bridge respawns on every commit.

Local vLLM vs the usual cloud agent loop

Feature | Cloud model + browser-only agent | Local vLLM + Fazm
Inference location | Remote API, billed per token | Your M-chip, no network required
How the agent sees the screen | Screenshot + vision model | AXUIElement tree, kAXFocusedWindowAttribute
Apps it can touch | One Chromium tab | Any Mac app that implements AX
Setup | Browser extension + cloud account | vLLM + LiteLLM + one field in Fazm Settings
Per-query cost | Anthropic or OpenAI token cost | Electricity
Privacy | Full prompt + screenshots go to the vendor | Nothing leaves the device
Model swap | Pick from vendor dropdown | Change the served-model-name in vLLM
Tool use | Whatever the extension exposes | Five MCP servers bundled, tool_use proxied by LiteLLM

Where this fits in the Fazm source tree, if you want to verify

The two files that matter are Desktop/Sources/Chat/ACPBridge.swift (1,612 lines; the env override lives at lines 380 to 382 inside the function that constructs the Node subprocess environment) and Desktop/Sources/MainWindow/Pages/SettingsPage.swift (the settings card titled "Custom API Endpoint" with the TextField placeholder "https://your-proxy:8766" lives at lines 906 to 952). The accompanying accessibility probe that makes the rest of the agent loop possible lives in Desktop/Sources/AppState.swift as testAccessibilityPermission() around line 433, which uses AXUIElementCreateApplication and kAXFocusedWindowAttribute to verify AX is alive on the frontmost app.

The bundled MCP servers are declared in acp-bridge/src/index.ts around line 1266, and the native Mach-O binary for macos-use is resolved earlier in that file around line 63 as join(contentsDir, "MacOS", "mcp-server-macos-use"). Those servers are what the local vLLM model will call once the shim forwards tool_use blocks correctly.

By the numbers

The vllm-mlx project reports 400+ tokens/sec on larger Macs for models in the Llama-3.1-8B class, and vllm-metal's v0.2.0 release claimed TTFT and throughput improvements over v0.1.0. Figures published by the vllm-metal and vllm-mlx projects in April 2026; your Mac will vary depending on memory and model size.

When local vLLM is the right choice, and when it is not

Running a local model as the brain of a desktop agent is a real win for a narrow set of tasks. Offline reliability (planes, trains, bad hotel wifi). Anything where the context contains material you are not allowed to ship to a third party. High-volume repetitive work where the per-token cost of a cloud call starts to matter. Any workflow that shouldn't leave a trace in vendor logs.

For complex, long-horizon agent trajectories, the gap between a local 7-14B model and Claude Sonnet 4.6 is still real. The correct move is usually hybrid: run the bundled Anthropic backend by default, and toggle the Custom API Endpoint on when you are doing something private or bulk. Fazm is built for that toggle; the bridge respawns cleanly on every commit, and bridgeMode = 'builtin' keeps the UI otherwise unchanged.

Want help wiring your local vLLM into a Mac agent?

Book 20 minutes. Bring your machine, your model, and your use case. We'll walk through the endpoint switch, the shim choice, and the accessibility loop live.

Book a call

Frequently asked questions

Can vLLM actually run on a Mac in 2026, or is it CUDA-only?

Both answers are true. The vLLM main repository ships CPU-only macOS support on Apple Silicon; you build from source with 'git clone https://github.com/vllm-project/vllm && cd vllm && uv pip install -r requirements/cpu.txt && uv pip install -e .' and get FP32 or FP16 inference on the M-series CPU. For GPU, the community-maintained vllm-project/vllm-metal plugin (April 2026 v0.2.0) brings Metal as the attention backend, and waybarrios/vllm-mlx wraps MLX with an OpenAI-compatible and Anthropic-compatible server hitting 400+ tokens per second on larger Macs. Docker Model Runner added vllm-metal as a first-class backend in the same window. Which one you pick depends on model size and how much of your RAM you want to give up.

Why bother wiring vLLM into a Mac agent when I can just curl localhost:8000?

A curl proves the server is up. It does not do the work you wanted to do. The reason to serve a model locally is usually one of two things: you have a document or an app you cannot upload to a cloud provider, or you do not want to pay per-token for repetitive tasks. Neither of those is satisfied by a REPL. An agent that can read your Calendar, drag a file in Finder, reply in WhatsApp, or rename twenty files in a folder is what turns a local endpoint into a tool. That is the step every ranking guide skips.

What does Fazm actually read from my Mac when the agent takes action?

The macOS accessibility tree, not screenshots. AXUIElementCreateApplication(pid) plus AXUIElementCopyAttributeValue walks the same tree VoiceOver reads: every element has a role (kAXButtonRole, kAXTextAreaRole), a title, a value, a frame, and a list of children. Fazm pulls that as structured data, usually a few hundred tokens per window. The probe function lives in Desktop/Sources/AppState.swift testAccessibilityPermission() around line 433, and the call against the frontmost app uses kAXFocusedWindowAttribute. Bitmap capture is a fallback, not the primary channel.

Where exactly is the Custom API Endpoint setting, and what does it touch?

Fazm main window > Settings > AI Chat. The card is titled 'Custom API Endpoint' with an icon of a server rack. Toggle it on and a text field appears with placeholder 'https://your-proxy:8766'. Under the hood, '@AppStorage("customApiEndpoint")' in Desktop/Sources/MainWindow/Pages/SettingsPage.swift line 840 persists the value. When the value is non-empty, Desktop/Sources/Chat/ACPBridge.swift at lines 380 to 382 sets env['ANTHROPIC_BASE_URL'] = customEndpoint on the Node.js ACP bridge subprocess before spawning it. Changing the field calls chatProvider?.restartBridgeForEndpointChange(), which respawns the bridge so the new URL takes effect on the next query.

vLLM speaks OpenAI, not Anthropic. How does Fazm's endpoint override help me?

You need a thin shim. Three options ship today: LiteLLM proxy (pip install litellm && litellm --model openai/your-vllm-model --api_base http://localhost:8000/v1) exposes an Anthropic-compatible endpoint on localhost:4000. claude-code-router (npm i -g @musistudio/claude-code-router) does the same and was purpose-built for this flow. waybarrios/vllm-mlx skips the shim entirely because it serves both OpenAI and Anthropic shapes from one process. Whichever you pick, point Fazm's Custom API Endpoint at the shim's base URL and the ACP bridge will route every Claude call through your local server.

Does tool-calling work when I run a local model, or does the agent just lose its hands?

Tool calling works if the shim implements Anthropic's tools schema, which LiteLLM, claude-code-router, and vllm-mlx all do. Fazm ships five MCP servers hardcoded in acp-bridge/src/index.ts around line 1266: fazm_tools (internal tool dispatch), playwright (browser via the Playwright MCP Bridge), macos-use (the native Mach-O binary at Contents/MacOS/mcp-server-macos-use inside the signed .app), whatsapp (Catalyst app driven through accessibility), and google-workspace (Gmail, Calendar, Drive via a bundled Python MCP server). The model you run locally decides which tool to call; the shim forwards the tool_use block; Fazm dispatches it. The quality of tool selection depends on the model, not the transport.

Which local model should I try first on an M-series Mac?

For a 16 GB machine, Qwen2.5-7B-Instruct or Llama-3.1-8B-Instruct quantized to 4-bit MLX fits comfortably. Both understand tool-use JSON well enough to drive a Mac app for short tasks. For 32 GB or more, Qwen2.5-14B, Mistral-Small-24B-2501, or Llama-3.3-70B at aggressive quantization are the common picks. None of them match claude-sonnet-4-6 on long agentic trajectories yet, but for routine clicks (renaming files, composing a draft reply, pulling calendar events into a note) they work fine and never leave your machine.

What do I lose by swapping the bundled Fazm endpoint for a local vLLM server?

Three things. First, Fazm's bundled Claude credits stop getting consumed, because requests no longer hit Anthropic. Second, the model list dropdown stops reflecting Anthropic's catalog; you see whatever your shim advertises. Third, the 'Your Claude Account' OAuth path (bridgeMode = 'personal') only makes sense against api.anthropic.com, so you either leave bridgeMode on 'builtin' and let the Custom API Endpoint field override, or you leave both blank and point only at the shim. You gain: zero per-token cost, zero outbound traffic, full control of the model running the agent.