Who calls local AI not ready, and what they actually mean
Nobody owns this debate with a single quote. The claim is a stance, not a tweet. Four loose camps push it publicly, and all four are arguing about the same narrow layer of the stack: local model inference. Below is who they are, what they each actually mean, and the layers of an agent that already run locally on a Mac whether the model layer is ready or not.
Direct answer · verified 2026-05-12
When people say local AI is not ready, they almost always mean local model inference on a normal laptop. They are usually right about that. They are usually wrong about the conclusion, because an agent is not a model. The screen-reading, app-control, audio-capture, and memory layers already run locally on macOS through accessibility APIs that have shipped for over a decade. The honest 2026 answer is a hybrid stack, not a binary.
Source: Fazm app source at github.com/m13v/fazm. Specific files cited inline below.
The four camps
These are archetypes, not specific individuals. Each one shows up in developer discourse with reliable enough cadence that you can predict the arguments before you read them.
Camp 1
Cloud-platform founders
If your company sells inference by the token, local AI maturing is an existential P&L event. Public commentary from this camp usually frames frontier capability as the only capability that matters, with latency, RAG quality, eval scores, and reliability all pointing at the same conclusion: do it in our datacenter. The argument is internally consistent. It is also load-bearing for the business. Discount accordingly.
Camp 2
Enterprise compliance buyers
A CISO buying for ten thousand seats does not care whether one engineer got Llama 3 working on a MacBook. They care whether the deployment can be centrally observed, patched, key-rotated, fleet-managed, and audited under SOC 2 controls. Most local stacks fail that bar today because nobody has built the enterprise wrappers. Camp 2 is not making a capability claim; it is making a procurement claim. Both can be true.
Camp 3
Frontier-benchmark researchers
The benchmark gap between a frontier cloud model and an 8B local model in 2026 is still wide on long-context reasoning, multi-step tool use, and code agents. Camp 3 points at MMLU-Pro, GPQA, SWE-bench, and the gap is real. The unstated assumption is that you need frontier-grade reasoning for every task. For chatting about quantum field theory, yes. For renaming files, filling a form, or moving a calendar invite, a 7B model is plenty. The argument is true and irrelevant to most desktop work.
Camp 4
First-time Ollama users who bounced
This is the loudest camp online. Someone with 16 GB of RAM tried to run a 13B model, hit four tokens per second, watched their fan spool up, and wrote a post titled something like “local AI is not ready for prime time.” The post was honest. The setup was wrong. A 4-bit 7B model on the same Mac runs at 35-60 tokens per second and does most tasks a desktop agent actually asks of it. The takeaway is real (defaults are bad) but the conclusion (not ready) generalizes from a misconfigured local setup, not from the ceiling.
The agent stack has four layers. Three of them are already local.
Pulled from the Fazm Mac app source at github.com/m13v/fazm. If you disagree with any line below, the repo is the receipt.
What runs locally on a working Mac agent in 2026
- Screen reading via AXUIElementCreateApplication + AXUIElementCopyAttributeValue against kAXFocusedWindowAttribute (AppState.swift:488-490)
- App control: mouse, keyboard, window focus via CGEvent and NSWorkspace, in-process, no network
- Audio capture via AVAudioEngine, byte-for-byte local until you opt into transcription
- Workflow memory and replay via local GRDB / SQLite (Desktop/.build/checkouts/GRDB.swift)
- Voice transcription: today routes to Deepgram. Local Whisper/Parakeet is on the work list, not shipped
- LLM reasoning: today routes to your chosen provider (Anthropic, OpenAI, corporate proxy). Local 7B-30B inference is not production-grade for tool chains in 2026
Read the list again. Four of the six lines already run locally. The two that do not (transcription and the LLM call) are exactly what the “not ready” discourse is about. That is the entire scope of the legitimate critique. Everything else, the boring 30-year-old craft of reading a window and clicking a button, already runs on your machine and always has.
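For anyone who wants to see what “screen reading via accessibility APIs” looks like in practice, here is a minimal Swift sketch, assuming Accessibility permission has been granted. The function and variable names are illustrative, not lifted from AppState.swift; the API calls are the ones cited in the list above.

```swift
import AppKit
import ApplicationServices

// Minimal sketch: ask the Accessibility API for the frontmost app's focused
// window and read its title. Requires the user to grant Accessibility
// permission in System Settings; names here are illustrative.
func focusedWindowTitle() -> String? {
    guard let app = NSWorkspace.shared.frontmostApplication else { return nil }
    let axApp = AXUIElementCreateApplication(app.processIdentifier)

    var windowRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(axApp,
                                        kAXFocusedWindowAttribute as CFString,
                                        &windowRef) == .success,
          let window = windowRef else { return nil }

    var titleRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(window as! AXUIElement,
                                        kAXTitleAttribute as CFString,
                                        &titleRef) == .success else { return nil }
    return titleRef as? String
}
```

Without the permission grant, the copy call returns an error instead of a window, which is why real agent code surfaces the permission prompt before doing anything else.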
What critics hear vs. what builders mean
Most of the public argument is a vocabulary mismatch.
When you say 'we built a local AI agent,' the listener pictures a 70B parameter LLM running on your laptop, alone, doing all the thinking with zero network calls. That picture is technically possible on an M-series Mac with 64 GB+ of unified memory, but it is slow, it is fragile under real tool chains, and almost nobody actually ships this. So the listener mentally rounds you down to 'this person is overclaiming' and reaches for the closest cached opinion: local AI is not ready.
- Assumes 'local' means model layer only
- Assumes you are claiming frontier capability on a laptop
- Reaches for cached skepticism instead of asking which layer
The right question to ask
“Is local AI ready” is the wrong question. It compresses four layers into one bit. The honest version of the question, the one that actually predicts whether a given agent will work for you, comes in two parts:
Question 1
Which layers of the stack does this agent run on my machine?
Screen reading? App control? Audio? Memory? Workflow replay? Most of those should be local for any agent that drives logged-in software. If they are not local, you are sending your screen and your microphone to a stranger.
Question 2
Where does the model live, and can I swap it?
If the model is cloud-only and locked to one provider, you are renting a dependency. If the model layer is pluggable (any Anthropic-compatible endpoint, corporate proxy, or, eventually, a local 7B), you keep the option to migrate as the local-inference ceiling rises.
Both questions are answerable in a five-minute conversation. Neither is answered by the headline word “local.” That is why the debate keeps eating its own tail.
How Fazm answers both questions
For Question 1, the local surface is roughly everything visible in the checklist above. Screen state, app control, audio capture, workflow memory, all in-process on your Mac. We use macOS accessibility APIs (AXUIElement, kAXFocusedWindowAttribute) rather than periodic screenshots, which means the agent sees real element identifiers, not pixels. Fewer misclicks, no need to ship the screen to a vision model just to find a button.
For Question 2, the model layer is your choice. The default is Anthropic, but the same provider abstraction accepts any Anthropic-compatible endpoint. That includes corporate proxies, GitHub Copilot endpoints, and self-hosted gateways. The day a local 7B with strong tool-use training becomes viable on a 32 GB Mac is the day you change one URL.
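As a sketch of what “change one URL” means, here is a hypothetical provider configuration. The type and field names below are not the actual Sources/Providers abstraction; they only illustrate that the model layer reduces to a base URL, a key, and a model id.

```swift
import Foundation

// Illustrative provider config: any endpoint that speaks the Anthropic API
// shape slots in behind the same struct. Names are hypothetical, not Fazm's.
struct ModelProviderConfig {
    var baseURL: URL      // Anthropic, a corporate proxy, or a self-hosted gateway
    var apiKey: String
    var model: String
}

// Today: the hosted default.
let hosted = ModelProviderConfig(
    baseURL: URL(string: "https://api.anthropic.com")!,
    apiKey: ProcessInfo.processInfo.environment["ANTHROPIC_API_KEY"] ?? "",
    model: "claude-sonnet-4-5"
)

// The day a local 7B with strong tool use is viable: same struct, different URL.
let local = ModelProviderConfig(
    baseURL: URL(string: "http://localhost:8080")!,   // hypothetical local gateway
    apiKey: "unused",
    model: "local-7b-tool-use"                        // hypothetical model id
)
```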
The transcription provider is also swappable. Deepgram is the current default for first-token latency. Local Whisper and Parakeet paths are on the work list. We do not ship them yet because the latency on a voice-driven interface is, today, worse than the cloud round-trip on a good network. That trade will flip on Apple Silicon. We are not going to pretend it already has.
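For the audio layer, the capture point is an AVAudioEngine tap. A minimal sketch, assuming microphone permission and using illustrative names; the closure is the boundary where bytes either stay in-process or get forwarded to whatever transcription sink you opted into.

```swift
import AVFoundation

// Minimal sketch of local audio capture: install a tap on the input node.
// Buffers arrive in-process; nothing leaves the machine unless the closure
// forwards them to a transcription sink (Deepgram today, local Whisper later).
func startCapture(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws -> AVAudioEngine {
    let engine = AVAudioEngine()
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        onBuffer(buffer)   // stays local until you decide otherwise
    }
    try engine.start()
    return engine
}
```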
Frequently asked questions
Is there one specific person who said local AI is not ready?
No single quote owns this debate. The claim is a stance, not a tweet. Four loose camps push it in public: cloud-platform founders whose business model needs cloud to win, enterprise IT buyers who cannot deploy what they cannot audit at scale, frontier-benchmark researchers comparing 8B local checkpoints to 400B+ cloud frontier models, and first-time Ollama users who installed it once, hit a 4-token-per-second wall, and never came back. Each camp is honest within its frame. The mistake is collapsing the four into one claim.
Does Fazm run fully local?
Fazm runs part-local and is honest about which part. Screen reading runs through the macOS Accessibility APIs (AXUIElement) directly. App control (clicks, keystrokes, window focus) runs through the same APIs. Voice transcription currently uses Deepgram over WebSocket. The reasoning model is whichever provider you connect (Claude, OpenAI, a corporate proxy, or any Anthropic-compatible gateway). The local layer is the screen and the keyboard. The cloud layer is the words and the thinking. We do not call the whole stack local.
Why not run the LLM locally too?
Today, a 7B or 13B model on a Mac with 16-32 GB of unified memory cannot drive a multi-step agent through real apps reliably. It loses track of long tool chains, misnames windows, and pauses awkwardly on every turn. A 70B model on a 64 GB+ Mac can, but its first-token latency is rough for voice-driven UX. The point critics are right about is that agent-grade local inference on a normal laptop is not solved in 2026. The point they miss is that the inference layer is one of four layers, and the other three are already local.
What does Fazm actually ship locally?
Screen state via AXUIElementCreateApplication and AXUIElementCopyAttributeValue against the focused window, including the accessibility tree of every visible element. Mouse and keyboard control via CGEvent. Window focus tracking via NSWorkspace. Audio capture via AVAudioEngine. All of that runs in-process on your Mac with no outbound network. The source for the screen-reading path is the AppState.swift file in github.com/m13v/fazm.
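A minimal sketch of the app-control mechanism described above, with illustrative helper names. CGEvent posts synthetic input to the system HID event tap; Accessibility permission is required for the events to land.

```swift
import CoreGraphics

// Post a synthetic left click at a screen point. Real code resolves the point
// from the accessibility tree rather than hard-coding coordinates.
func click(at point: CGPoint) {
    let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                       mouseCursorPosition: point, mouseButton: .left)
    let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                     mouseCursorPosition: point, mouseButton: .left)
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}

// Post a single key press by virtual key code (0 is 'a' on a US layout).
func pressKey(_ keyCode: CGKeyCode) {
    CGEvent(keyboardEventSource: nil, virtualKey: keyCode, keyDown: true)?
        .post(tap: .cghidEventTap)
    CGEvent(keyboardEventSource: nil, virtualKey: keyCode, keyDown: false)?
        .post(tap: .cghidEventTap)
}
```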
What does Fazm send to a cloud service?
Audio bytes for transcription go to Deepgram over WebSocket. The transcribed text plus a structured representation of the current screen go to the LLM provider you chose. Nothing else leaves the machine without a tool call you can see. Both endpoints are configurable. If you route through a corporate proxy or a self-hosted model gateway that speaks the Anthropic API, the cloud surface area shrinks to whatever your gateway forwards.
If local model inference is the only weak layer, why is the discourse so loud?
Because the model is the most exciting layer. It is the layer that makes Twitter threads, beats benchmarks, and ships keynotes. The unglamorous layers (accessibility APIs, screen capture, audio routing, window management) are 30-year-old craft. They work. They are not a story. So the discourse fixates on the one layer that is genuinely unfinished and treats it as the entire field.
When will local model inference catch up for agents on a normal Mac?
Honest answer: nobody knows. Apple Silicon memory bandwidth and unified memory size are the bottleneck for the model layer, and both are improving on every generation. MLX, llama.cpp, and the GGUF ecosystem are closing the wrapper gap. A reasonable bet is that small reasoning models with strong tool-use training (7B to 30B class, distilled from frontier models) will become viable for desktop agents on a 32 GB M-series Mac inside 18 months. Until then, the practical answer is the hybrid stack: local for screen and keyboard, cloud for words.
Is Fazm open source?
Yes. The Mac app source is at github.com/m13v/fazm. You can read the AXUIElement calls in AppState.swift, the Deepgram transcription path in TranscriptionService.swift, and the LLM provider abstraction in Sources/Providers. The license is MIT. If you disagree with any line in this guide, the repo is the receipt.
Talk through your stack, layer by layer
Bring the agent you are evaluating and we will walk the four layers together: where it reads your screen, where it sends your audio, where it does its thinking, and where it stores its memory. Honest answers in 20 minutes.
More honest notes
Related reading
Local LLM runtime is done. The agent loop is the missing piece.
Ollama and LM Studio solved the runtime. Nobody solved the loop above the model. The loop is what an agent actually is.
Local AI privacy beyond inference
Most local-AI privacy stories stop at the model. Real privacy is about every layer the model touches: screen, audio, files, telemetry.
Computer-use agent reliability
Where computer-use agent stacks actually fail in production, and the boring engineering that fixes it.