
Best Local AI 2026: the Access-Layer Stack the Model Roundups Never Discuss

Every 'best local AI' list ranks the same models and runners: Llama 4 Scout, Qwen 2.5, Mistral Small 3, Ollama, LM Studio, Jan. Those are the first two layers of a four-layer problem. This guide is about the other two, because they are what separates a chat box running on your GPU from a local AI that can actually touch the apps on your Mac.

Fazm · 12 min read
  • Bring your own local LLM via Custom API Endpoint
  • Real AXUIElement access layer, not screenshots
  • Open source at github.com/mediar-ai/fazm

What the 'Best Local AI' Roundups Actually Cover

I pulled the top ten results for 'best local ai' this month. Five of them are model leaderboards (Llama 4 Scout, Qwen 2.5 Coder 14B, Mistral Small 3, DeepSeek Coder, Phi-4-mini). Three are runner comparisons (Ollama, LM Studio, Jan, GPT4All, text-generation-webui). Two are PC buying guides. Not one of them answers the question a Mac user actually asks when they type 'best local AI' into Google.

That question is: 'what local AI can I run on my Mac that will actually help me with the work on my screen?' The benchmarks do not measure that. MMLU does not care whether the model can read a Slack thread, click a send button, or rename twenty files based on content. Those are not model capabilities; they are stack capabilities, and the model is only one layer in the stack.

The purpose of this guide is to describe the whole stack, name what is missing from every roundup, and show where that missing part lives in Fazm's source code. Line numbers included, because an anchor fact that cannot be verified is just branding.

Every 'Best Local AI' Roundup, Summarized

These are the names that appear over and over. All of them are models or model runners. None of them is an access layer.

  • Llama 4 Scout
  • Qwen 2.5 Coder
  • Mistral Small 3
  • DeepSeek Coder
  • Phi-4-mini
  • Gemma 3
  • Ollama
  • LM Studio
  • Jan
  • GPT4All
  • text-generation-webui
  • LocalAI
  • llama.cpp
  • MLX
  • vLLM
  • Msty

A local chat model that only sees what you paste into it is less 'local AI' than a cloud model with a full accessibility-tree view of your Mac. Local is about what the AI sees, not where the weights live.

Fazm engineering notes

The Four Layers of an Actually Useful Local AI

Swap any one of these and the whole experience changes. The 'best local AI' roundups almost never name the last two, which is why their picks feel impressive in a chat window and useless in a work session.

1. The model

Llama 4 Scout, Qwen 2.5 Coder, Mistral Small, Claude, GPT. The reasoning engine. The only layer most roundups cover. Swappable.

2. The runner

Ollama, LM Studio, MLX, llama.cpp, vLLM. Serves the model over a local HTTP endpoint. Also swappable. Also widely covered.

3. The access layer

How the AI sees and touches your apps. Accessibility APIs (fast, structured) or screenshots (slow, lossy) or nothing (chat only). This is where Fazm lives.

4. The agent loop

What plans the action, calls the tools, posts the results. Claude-Agent-SDK, OpenAI tool-use, or a custom loop. Without this, the model is a chat box.
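Stripped to its skeleton, the loop in layer 4 is small: call the model, execute any tool call it returns, feed the result back, repeat until it answers in text. Here is a toy Python sketch with a stubbed model and a canned access-layer tool. The names (fake_model, read_ax_tree) are illustrative only; Fazm's actual loop runs on the Claude-Agent-SDK, not code like this:

```python
# Minimal sketch of layer 4, the agent loop. Hypothetical throughout:
# the model is stubbed and the one "tool" returns canned AX-tree text.

def fake_model(messages):
    """Stand-in for layers 1+2 (a model behind a local runner).
    Returns a tool call on the first turn, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "read_ax_tree", "input": {}}
    return {"type": "text", "text": "The frontmost window is a Slack thread."}

TOOLS = {
    # Layer 3, the access layer: here a canned stand-in for an AX-tree read.
    "read_ax_tree": lambda _input: "AXWindow 'Slack' > AXTextArea 'Reply'",
}

def agent_turn(user_prompt, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "text":  # model is done: surface the answer
            return reply["text"]
        result = TOOLS[reply["name"]](reply["input"])  # run the tool call
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent loop did not terminate")

print(agent_turn("What am I looking at?"))
# -> The frontmost window is a Slack thread.
```

Swap fake_model for a real endpoint and read_ax_tree for a real accessibility read and the shape is the same: without this loop, the model is a chat box.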

How the Layers Talk to Each Other

This is the call graph for a single turn of a local AI on your Mac. Every read on the input side passes a macOS permission gate; every action on the output side is a synthetic event.

A single turn of a useful local AI:

  • Inputs: accessibility tree (AX), focused window title, selected text + clipboard, your voice or typed prompt
  • Reasoning: the model, reached via the Custom API Endpoint
  • Outputs: AX actions (press, set value), synthetic keystrokes, file edits via agent tools, a reply in Fazm

The Anchor Fact: One TextField That Makes Fazm BYO-Local-Model

Fazm ships with Claude as the default model. Most people want that, because at the time of writing Claude is the best multi-step tool-use model available. But if you want a fully local brain and are willing to run a proxy, the switch is one TextField in Settings. Here is the code, verbatim, with file path and line numbers.

Desktop/Sources/MainWindow/Pages/SettingsPage.swift

The placeholder text is https://your-proxy:8766. The onSubmit handler calls restartBridgeForEndpointChange(), which stops the ACP bridge so the next chat call picks up the new endpoint. That function is defined in ChatProvider.swift and is two pages from here in the source tree.

Desktop/Sources/Providers/ChatProvider.swift

The endpoint is stored in UserDefaults under the key customApiEndpoint. Every subsequent chat call uses it as the base URL for the Anthropic-format request. Fazm still speaks the Anthropic protocol (for tool-use parity), so in practice you front your local runner with LiteLLM or Claude-Code-Router. That is the one detail the product copy does not spell out, and it is why this guide has file paths in it.
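To make "same protocol, different base URL" concrete, here is a hedged Python sketch of the routing decision. The helper name build_request and the field values are hypothetical; Fazm's real assembly happens inside the ACP bridge, in TypeScript, not in code like this:

```python
# Illustrative sketch of the endpoint swap. Assumption: the request body
# keeps the Anthropic /v1/messages shape whether it goes to Anthropic or
# to your proxy, which is why the proxy must speak the Anthropic format.
DEFAULT_BASE = "https://api.anthropic.com"

def build_request(prompt, custom_endpoint=None):
    """Assemble an Anthropic-format request. A custom endpoint (Fazm
    stores it under the UserDefaults key 'customApiEndpoint') replaces
    only the base URL; the body is untouched."""
    base = custom_endpoint or DEFAULT_BASE
    return {
        "url": base.rstrip("/") + "/v1/messages",
        "body": {
            "model": "local-model",  # the proxy maps this to a real model
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

req = build_request("Summarize this window", "http://localhost:8766")
print(req["url"])  # -> http://localhost:8766/v1/messages
```

The point of the sketch: the body never changes, so a translating proxy (LiteLLM, Claude-Code-Router) is what bridges the Anthropic shape to Ollama or LM Studio.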

The Access Layer, in Five Lines

This is what 'local AI' actually means if you care about privacy. The accessibility tree of the frontmost app is read in-process, with no screenshot, no network call, no external service. The model gets structured UI text. Whatever you point the model layer at, this step stays on your Mac.

Desktop/Sources/AppState.swift

Contrast this with screenshot-first desktop agents (Anthropic Computer Use, OpenAI Operator, most 'browser AI' startups). They capture a PNG and send it to a vision model. The PNG crosses the network, the vision model OCRs it every turn, and the tokens stack up. Fazm reads the tree directly and the model sees labeled roles instead of pixels.
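If the difference is hard to picture, here is a toy Python flattening of an AX-style tree into the structured text an access layer can hand a model. The tree and the flatten helper are invented for illustration; the real read is the Swift AXUIElement calls referenced above:

```python
# Illustrative only: why roles + labels beat pixels. A screenshot of this
# window is hundreds of image tokens per turn; the flattened tree is three
# short lines of text the model can act on directly.

def flatten(node, depth=0, out=None):
    """Depth-first walk emitting one 'Role "label"' line per element."""
    if out is None:
        out = []
    out.append("  " * depth + f'{node["role"]} "{node.get("label", "")}"')
    for child in node.get("children", []):
        flatten(child, depth + 1, out)
    return out

window = {
    "role": "AXWindow", "label": "Slack",
    "children": [
        {"role": "AXTextArea", "label": "Message #general"},
        {"role": "AXButton", "label": "Send"},
    ],
}

print("\n".join(flatten(window)))
# -> AXWindow "Slack"
#      AXTextArea "Message #general"
#      AXButton "Send"
```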

BYO Local Model: 4 Steps

The order matters. Run the model first, put a translating proxy in front of it, paste the proxy URL into Fazm. Nothing recompiles; the bridge picks up the new endpoint on the next query.

  1. Run Ollama: ollama serve, then ollama pull qwen2.5-coder:14b. Ollama listens on localhost:11434.

  2. Start a proxy: LiteLLM or Claude-Code-Router in Anthropic-compatible mode. Point it at localhost:11434. Bind it on localhost:8766.

  3. Open Fazm Settings: Settings > AI Chat > Custom API Endpoint. Toggle on, paste http://localhost:8766, press Return.

  4. Send a message: The bridge restarts with the new endpoint. Every subsequent query hits your local model. The access layer keeps working.
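Step 2 is the only part that needs a config file. A minimal LiteLLM proxy config might look like the fragment below. The keys follow LiteLLM's documented model_list format, but treat this as a sketch: verify the exact keys against the current LiteLLM docs, and note that the model_name must match whatever model name Fazm's requests carry.

```yaml
# Hypothetical LiteLLM config for fronting Ollama in Anthropic mode.
model_list:
  - model_name: claude-sonnet           # name the incoming requests use
    litellm_params:
      model: ollama/qwen2.5-coder:14b   # route to the local runner
      api_base: http://localhost:11434
```

Run it with something like litellm --config config.yaml --port 8766 and Fazm's Anthropic-format requests land on Ollama.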

Verify the Anchor Fact Yourself

Three commands. The first clones the source, the second prints the Custom API Endpoint settings card, the third prints the bridge restart function that reads UserDefaults.

Fact-checking the Custom API Endpoint claim

What Actually Happens On a Local-AI Turn

With Ollama + LiteLLM in front of Fazm, a single query walks through this exact chain. Every node is either on-device or controlled by you.

One chat turn with a local model + Fazm access layer

  1. You speak or type: Push-to-talk transcribes via Whisper or Groq (configurable). Typed queries skip transcription entirely.

  2. Fazm reads the AX tree: AXUIElementCreateApplication + AXUIElementCopyAttributeValue on the frontmost app. Structured text. No screenshot.

  3. ACP bridge assembles the prompt: Access-layer output + your prompt + tool schemas. All in the Anthropic messages format.

  4. Request hits your proxy URL: LiteLLM rewrites Anthropic -> Ollama native. The weights are on your SSD. The request never leaves your machine.

  5. Model returns tool calls: Fazm executes them: AXPress, AXSetValue, CGEvent keystrokes, file edits via the bundled Claude-Agent-SDK skills.

  6. Result rendered back to you: Chat response in the Fazm floating bar. Any side effects are visible in the apps you just had open.
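The rewrite at the proxy step is the proxy's whole job. Here is a heavily simplified Python sketch of the Anthropic-to-Ollama mapping; real proxies such as LiteLLM also translate system prompts, tool calls, and streaming, none of which is handled here:

```python
# Illustrative sketch of what a translating proxy does with one request:
# map an Anthropic /v1/messages body onto Ollama's /api/chat shape.

def anthropic_to_ollama(body, local_model="qwen2.5-coder:14b"):
    """Reduced field mapping: cloud model name is replaced with the
    local one, messages pass through, max_tokens becomes num_predict."""
    return {
        "model": local_model,
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in body["messages"]
        ],
        "stream": False,
        "options": {"num_predict": body.get("max_tokens", 1024)},
    }

anthropic_body = {
    "model": "claude-sonnet",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Rename these files"}],
}
print(anthropic_to_ollama(anthropic_body)["model"])  # -> qwen2.5-coder:14b
```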

What 'Local' Actually Covers in This Setup

The word 'local' is overloaded. Here is a plain map of which parts live on your machine and which do not, under the BYO-model configuration.

On-device in a Fazm + Ollama setup:

  • Model weights and inference (Ollama on your GPU/CPU)
  • Accessibility-tree reads of your active window
  • Clipboard and selected-text reads
  • File reads, writes, and edits by the agent loop
  • AX actions and synthetic keystrokes back to the OS
  • Prompt assembly in the ACP bridge (local Node process)

Still leaves your machine:

  • Anonymous PostHog telemetry (opt-out in Settings)
  • Subscription / rate-limit checks if on a paid plan

The Stack in Numbers

4 layers in a working local-AI stack
47 lines of Swift in the endpoint settings card
17 bundled Agent Skills in the loop layer
0 screenshot calls in the default access path

The 47 lines are the exact span of settingsCard(settingId: "aichat.endpoint") in SettingsPage.swift:906-952. The 17 skills live in Desktop/Sources/BundledSkills. The zero-screenshot figure is conditional on 'default access path': Fazm can capture a window screenshot when the user explicitly asks for visual context, but the accessibility-tree read is the primary channel.

The Four Layers, Compared Across 'Best Local AI' Picks

| Feature | Ollama / LM Studio chat | Fazm + local model |
| --- | --- | --- |
| Model choice (Llama, Qwen, Mistral, etc.) | Any (native) | Any, via Custom API Endpoint |
| Runner | Built-in | BYO (Ollama, LM Studio, MLX) |
| Reads the active window in any app | No | Yes (AX tree, no screenshot) |
| Clicks buttons in other apps | No | Yes (AXPress, AXSetValue) |
| Agent loop with tool use | Chat only | Yes (Claude-Agent-SDK, 17 skills) |
| Prompt/response on-device | Yes | Yes (with proxy to local runner) |
| Voice input (push-to-talk) | No | Yes |
| Open source | Ollama yes, LM Studio no | Yes (github.com/mediar-ai/fazm) |

Stop choosing between 'local model' and 'useful AI'

Fazm gives you an accessibility-API access layer plus a one-toggle Custom API Endpoint. Bring any local LLM behind a proxy and keep the full Mac agent loop.

Download Fazm

Frequently asked questions

What is the best local AI in 2026?

It depends on what 'local AI' means to you. If you want a chat box running on your GPU, the leaderboard answer is Llama 4 Scout (17B active of 109B MoE, 10M context) for general work and Qwen 2.5 Coder 14B for code. If you want a local AI that can actually do things on your Mac (click buttons, read your windows, edit files across real apps), the model is only one of four layers you need: a model, a runner (Ollama, LM Studio, MLX), an access layer, and an agent loop. Most 'best local AI' lists only cover the first two. Fazm ships the last two (real AXUIElement access layer, Claude-Agent-SDK-based loop) and a Custom API Endpoint setting in SettingsPage.swift:906-952 that routes requests to any Anthropic-compatible proxy, so you can front Ollama or LM Studio with a proxy and keep the access and agent layers.

Why does 'best local AI' only mean models and runners in most roundups?

Because that is what the benchmarks cover. MMLU, HumanEval, and the open-source leaderboards measure model quality at question-answering. They say nothing about whether the AI can read the Calendar app, click a button in Safari, or edit a Figma file. When a roundup says 'best local AI,' it almost always means 'best local chat.' Once you ask for work-on-your-Mac AI, the question changes from 'which model?' to 'how does the model see your apps, and what moves on the other side?' That is the access layer and the agent loop, and those are what this guide focuses on.

Can Fazm run fully offline with a local LLM?

Partially. Fazm's chat provider speaks the Anthropic API format through an ACP bridge, so you can point it at any Anthropic-compatible proxy (such as LiteLLM or Claude-Code-Router) in front of Ollama or LM Studio. Open Settings > AI Chat > Custom API Endpoint and paste the proxy URL (placeholder text is https://your-proxy:8766). The call chain routes through your proxy instead of api.anthropic.com. The access layer (accessibility tree reads, AX actions, file operations) runs fully locally regardless. The only remaining outbound calls are for feature flags, rate-limit telemetry, and, if you use it, the GeminiAnalysisService used for the passive observer loop.

Where in the Fazm source can I verify the Custom API Endpoint actually exists?

Open Desktop/Sources/MainWindow/Pages/SettingsPage.swift, lines 906 through 952. The TextField at line 936 has placeholder 'https://your-proxy:8766'. Submitting it fires Task { await chatProvider?.restartBridgeForEndpointChange() }, which is defined in Desktop/Sources/Providers/ChatProvider.swift:2100-2107. That method stops the ACP bridge so the next chat call picks up UserDefaults.standard.string(forKey: 'customApiEndpoint'). The endpoint is read by the bridge at startup and used as the base URL for all Anthropic-format requests.

Why does the access layer matter more than the model for 'best local AI'?

Because the access layer decides what the model can even see. A 671B-parameter model that only gets the text you paste into a chat box is a smaller-scope AI than a 7B model that has the accessibility tree of every window on your Mac and can click, type, and read in real apps. Fazm's access layer calls AXUIElementCreateApplication and AXUIElementCopyAttributeValue (Desktop/Sources/AppState.swift:439 and 441) to read structured UI without a screenshot. That data stays on your machine. The model could be local or cloud, but the 'local' part that matters most for privacy (what your AI sees) already is local.

Is a local model good enough to drive a Mac agent loop?

Depends on the loop. For simple tasks (summarize this window, rename these files, draft this reply), Qwen 2.5 14B or Llama 3.3 70B are more than capable when paired with a solid access layer. For complex multi-step work with tool use (the Claude-Agent-SDK pattern Fazm uses by default), the current tool-use benchmarks still favor Claude and GPT. The practical setup many Fazm users run is: local model for quick accessibility-tree queries through the Custom API Endpoint, Claude for heavy multi-step work. The endpoint toggle is per-session, so you can switch.

What are the actual files I need to modify to swap in a local LLM?

None, if you use a proxy. The path is: run Ollama or LM Studio locally, run LiteLLM or Claude-Code-Router as an Anthropic-compatible front proxy pointing at your local model, paste the proxy URL into Fazm's Settings > AI Chat > Custom API Endpoint. That is three UI steps and two terminals. If you want to skip the proxy and hit the model directly, you would need to change the ACP bridge bootstrap in Desktop/acp-bridge/src/index.ts to speak your chosen local protocol (OpenAI-compatible, Ollama native, or MLX). The endpoint-driven path is the supported one.

Does Fazm use a local model or Claude by default?

Claude by default, through either the user's Claude OAuth session (personal mode) or a bundled Anthropic API key (builtin mode). See Desktop/Sources/Providers/ChatProvider.swift, the modeOverride and bridgeMode properties around lines 433 through 496. The Custom API Endpoint setting overrides both modes. When empty, traffic goes to Anthropic; when set, traffic goes to your proxy. No recompile needed.

How does Fazm compare to Ollama's native chat UI for 'best local AI'?

Ollama is a model runner plus a chat UI. Fazm is not a model runner. Fazm is an access layer and agent loop that happens to be BYO-model through the Custom API Endpoint. They are complementary, not competitive. Run Ollama to serve Qwen 2.5 14B, front it with a proxy, and use Fazm as the thing that reads your Mac's UI and drives it. Ollama's own UI cannot click a button in Slack, and Fazm cannot serve model weights. The combination is the actual 'best local AI' stack for Mac work.

The model is 10% of a local AI. The stack is the other 90%.

Pick any model from the roundups. Pair it with Fazm's access layer and agent loop. That is what 'best local AI' should have meant all along.

Try Fazm free
fazm: AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
