Argument
Agent scaffolding matters more than model. The proof is eight rules in a system prompt.
You can read the proof yourself. Open Desktop/Sources/Chat/ChatPrompts.swift in the Fazm repo. The desktopChat template runs from line 14 to 167. The bottom of it, lines 93 through 115, is a tool routing block that tells the underlying Claude model what to do differently than it would do on its own. There are eight explicit overrides. Each one is a turn the model lost in production, captured into a rule that ships with the harness regardless of which model the user picks.
What 'scaffolding' actually refers to here
For this page, the harness is everything around the model that turns next-token prediction into a usable Mac app. The system prompt and its templating. The MCP server registry that decides which tools the model sees this turn. The per-tool wall-clock watchdog. The session manager. The retry-on-interrupt. The user permission layer. The runtime guardrails on tool inputs and outputs. The diagnostic dump on user-cancel. The screenshot resize watcher. None of those are inside the model. All of them are inside the agent process running on the user's Mac.
The argument is not that the model is unimportant. The argument is that the marginal hour, on a shipped product, returns more when invested in the harness than when spent waiting for a model upgrade. And the easiest way to see why is to read one team's prompt and count the rules.
Section 1
Eight overrides in twenty-three lines
ChatPrompts.swift line 97 opens the routing table with 'Tool routing:' and the first bullet. The block ends at line 115 with the close of the tools tag. Inside that span there are eight rules in NEVER, ALWAYS, MUST, or CRITICAL form. Eight is not a target the team set. It is the count of distinct things the same model kept doing wrong on this stack until somebody patched the prompt. Each rule gets a bullet here, and a condensed paraphrase of the whole block follows the list.
- Screenshots route to capture_screenshot, never browser_take_screenshot. The latter only sees the browser viewport. The user almost always means the whole desktop, including the Excel window behind the chat bar.
- WhatsApp follows a four-step contract. search, open_chat, verify get_active_chat, then send_message. The verify step is what stops a wrong send when search returned two contacts with overlapping names.
- Telegram never goes through Playwright. The bundled telegram skill runs Python telethon via Bash. It is two orders of magnitude faster, dodges captchas, and does not spend an entire context window on the web Telegram DOM.
- Native Mac apps go through mcp__macos-use__*, not Playwright. Finder is not a webpage. Treating it like one was a category error the model made often enough to earn a line.
- CRITICAL: never type reasoning into the user's open application. The exact text in the prompt: 'Typing your chain-of-thought into a user's document (Word, Notes, etc.) is a critical failure.' This rule was written because it had happened.
- browser_navigate vs the macOS open command. If the user explicitly wants the agent to interact with a page, use Playwright. If the user just wants a URL opened in their own browser to look at, use 'open'. Mixing these wastes a turn loading the extension and produces a tab the user did not ask for.
- Tab hygiene: list before navigate. browser_tabs list is the first call, match by domain, switch with browser_tabs select if found, otherwise reuse the current tab. Close anything you opened. Without this rule the agent ends a five-minute task with eleven new tabs the user has to clean up.
- NEVER run find on the home root. The exact phrasing: 'NEVER run find ~ or any recursive search on the entire home directory, it scans millions of files and hangs for minutes.' This is a small rule and a large outcome.
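For flavor, here is a condensed paraphrase of the block, reconstructed from the eight rules above. It is not the verbatim prompt text, which lives in ChatPrompts.swift:

```text
Tool routing:
- Screenshots: ALWAYS capture_screenshot, NEVER browser_take_screenshot.
- WhatsApp: search -> open_chat -> verify with get_active_chat -> send_message.
- Telegram: NEVER Playwright. Use the telegram skill (Python telethon via Bash).
- Native Mac apps (Finder, Settings, Mail): mcp__macos-use__*, never Playwright.
- CRITICAL: never type reasoning, notes, or internal monologue into any application.
- browser_navigate only when the user wants interaction; to just show a URL, use open.
- Tabs: browser_tabs list first, reuse a matching tab, close anything you opened.
- NEVER run find ~ or any recursive search on the entire home directory.
```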
Two more rules elsewhere in the prompt carry the same kind of weight. The ask_followup rule on line 113 says the call must be the absolute last tool call of the turn and that no text or tool call may follow it, because the floating-bar UI has no way to render content that trails it. And the onboarding chat, on line 228, opens with 'ABSOLUTE LENGTH RULE: every message you send MUST be 1 sentence, MAX 20 words. No exceptions. Never write 2 sentences in one message. This is the number one rule.' That line is uppercase, repeated, and wrapped in 'no exceptions' for a reason. The model still tries to write paragraphs. The harness still has to override it.
Section 2
When the prompt fails, the runtime catches it
A system prompt is best-effort. The model reads it, attends to most of it, and still occasionally does something the prompt forbade. For one specific class of failure, Fazm has the runtime catch the model in the act. The file is acp-bridge/src/fazm-tools-stdio.ts and the relevant block is lines 587 through 687, where the execute_sql tool intercepts every write query before it reaches SQLite.
The model, when running in a chat observer session, has access to execute_sql against the user's local fazm.db. The system prompt tells it to use save_observer_card for the common case of saving an observation. The model sometimes ignores that and writes raw SQL anyway, often against a table the model invented from the shape of the schema (memories, user_facts, observations, the usual culprits). Without a guardrail, the INSERT silently fails against a non-existent table, the model believes the save succeeded, and the user observation is gone.
Line 617 of fazm-tools-stdio.ts declares the allowlist:
```ts
const KNOWN_TABLES = new Set([
  "observer_activity",
  "ai_user_profiles",
  "chat_messages",
  "indexed_files",
  "local_kg_nodes",
  "local_kg_edges",
  "grdb_migrations",
]);
```
On every write, the bridge regex-matches the target table from the INSERT INTO or UPDATE clause, lowercases it, and checks membership. If the table is not in the set, the call is rejected with a string the model will read on its next turn:
```text
Blocked: table "X" does not exist in the app database. To save observations,
use your file tools to write memory files or use save_observer_card instead,
do NOT write raw SQL.
```
The error message is part of the harness too. It is in a form the model can act on: 'do this instead, here is the name of the tool.' The next turn picks save_observer_card and the work moves on. Without that one block of code, every model in the family would lose user data on the same failure path.
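The interception itself is simple enough to sketch. The following is a minimal reconstruction of the logic described above, not the repo's code; only KNOWN_TABLES and the error text come from the source, the function name and return shape are assumptions:

```ts
// Minimal sketch of the write-query guardrail. The real implementation lives in
// acp-bridge/src/fazm-tools-stdio.ts (lines 587-687); names here are assumed.
function checkWriteQuery(sql: string, knownTables: Set<string>): string | null {
  // Pull the target table out of an INSERT INTO or UPDATE clause.
  const match = sql.match(/\b(?:insert\s+into|update)\s+["'`]?(\w+)/i);
  if (!match) return null; // not a write we recognize, let SQLite handle it
  const table = match[1].toLowerCase();
  if (knownTables.has(table)) return null; // allowlisted, pass through
  // Reject with a message the model can act on next turn.
  return (
    `Blocked: table "${table}" does not exist in the app database. ` +
    `To save observations, use your file tools to write memory files ` +
    `or use save_observer_card instead, do NOT write raw SQL.`
  );
}
```

A null return lets the query through; a string is the rejection the model reads on its next turn.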
Section 3
One concrete request, end to end
Pick the simplest example: the user asks Fazm to message someone on Telegram. The model, by default, will reach for whatever web automation tool is nearest. The harness intervenes at three distinct layers before the model ever picks a tool.
User: 'message Anya on Telegram, ask if Friday works'
The harness made three decisions the model did not. It picked which tools were even in the menu (Playwright was excluded from the Telegram path by the routing rule). It enforced that ask_followup must be the absolute last call of the turn (not a model-side preference, a UI contract). And it length-capped the reply through the response_style block on lines 76 to 91. None of those are model improvements. None of those move when the model upgrades.
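The second of those decisions is the easiest to make concrete in code. This is a hypothetical sketch of what checking the ask_followup contract looks like; Fazm states the rule in the prompt (line 113), and nothing here claims the runtime also checks it, the sketch only shows what 'a UI contract' means:

```ts
// Hypothetical turn-level check for the ask_followup contract: the call must be
// the absolute last event of the turn, with no text or tool call after it.
type TurnEvent =
  | { kind: "text"; body: string }
  | { kind: "tool_call"; name: string };

function violatesFollowupContract(turn: TurnEvent[]): boolean {
  const i = turn.findIndex(
    (e) => e.kind === "tool_call" && e.name === "ask_followup"
  );
  return i !== -1 && i !== turn.length - 1;
}
```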
Section 4
What the model does, what the harness enforces
Five concrete rows. Left column is what the same model does on a stack without these rules. Right column is what the harness forces. The model does not know which column it is in. The user only sees the result.
| Feature | Same model, no overrides | With Fazm harness |
|---|---|---|
| User asks for a screenshot of their open Excel file | Calls browser_take_screenshot, returns a screenshot of an empty Chrome tab | Routing table forces capture_screenshot with mode 'screen' and returns the actual desktop |
| User says 'message Anya on Telegram' | Opens web.telegram.org in Playwright, fights captcha and 2FA, takes 12 turns | Routing table forbids Playwright for Telegram, runs Python telethon via Bash, two turns |
| User asks for follow-up suggestions at end of reply | Calls ask_followup, then writes another sentence, then calls another tool | Hard rule that ask_followup is the absolute last tool call, turn ends cleanly |
| Model reaches for execute_sql on a hallucinated table | INSERT INTO memories silently fails, agent thinks the save worked, user data lost | KNOWN_TABLES allowlist at fazm-tools-stdio.ts:617 rejects the call and tells the model to use save_observer_card |
| User wants to think through a hard problem | Model types its chain of thought into the user's open Notes document | CRITICAL rule in routing table: never type reasoning into any application, only into the response |
Counterargument
But surely a smarter model fixes this on its own?
This is the strongest version of the objection. If the model were good enough, it would discover from the tool descriptions alone that telethon beats Playwright on Telegram and that capture_screenshot beats browser_take_screenshot on a desktop ask. Two answers.
First, even if a future model learns to converge on the right answer in five turns, a shipped product cannot afford five turns. Each turn is a tool call, a wait, a re-prompt. Five wrong turns is a thirty-second wait the user reads as 'this thing is broken.' The harness collapses those five turns into one because the rule is in the prompt at turn zero.
Second, several of the rules are not about model intelligence, they are about contracts the harness owns. ask_followup ending the turn is a UI promise. The 20-word length cap on onboarding messages is a floating-bar geometry constraint. The 'never-type-reasoning-into-the-user's-document' rule is about the user's trust, not about a model's reasoning. No improvement in pretraining quality fixes any of those because they are not facts, they are decisions.
The version of the argument that gives the model its full due is: the model picks better sentences, the harness picks better turns. Sentence quality is what improved most between every major frontier model release in the last two years. Turn quality, on a real product, is mostly carried by what the harness has memorized from production failures, and that does not transfer when you swap the model.
Resolution
How to read this if you are picking an agent
Three concrete things to do, all of them in fifteen minutes, without trusting any vendor's marketing copy.
- Read the system prompt. If the candidate is open source, the file is in the repo. If it is closed, ask. The presence of a long, specific routing table with NEVER and ALWAYS rules is a positive signal. A short, vague, encouraging prompt is a signal that nobody on the team has run this in production with users yet.
- Look at the tool registry and timeouts. In Fazm, the registry is buildMcpServers around line 992 of acp-bridge/src/index.ts. The watchdog tiers (10 seconds for internal tools, 120 for MCP, 300 for the rest) are constants at the top of that file; a sketch of that tiering follows this list. If the candidate has one global timeout or none at all, the agent will hang at some point and the user will see the spinner forever.
- Find one runtime guardrail. The KNOWN_TABLES allowlist at fazm-tools-stdio.ts:617 is the kind of code you only write after you have seen the model do the thing the guardrail prevents. A repo with at least one of those has been bitten by a real user. A repo without any has not.
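What the watchdog tiering looks like in practice, as a minimal sketch: the three timeout values come from the text above, the names and wiring are assumptions, not the repo's:

```ts
// Sketch of a tiered per-tool wall-clock watchdog. The real constants sit at
// the top of acp-bridge/src/index.ts; everything else here is an assumed shape.
const TIMEOUT_MS = {
  internal: 10_000, // internal tools
  mcp: 120_000,     // MCP tools
  default: 300_000, // everything else
} as const;

type ToolKind = keyof typeof TIMEOUT_MS;

async function callWithWatchdog<T>(kind: ToolKind, call: () => Promise<T>): Promise<T> {
  const ms = TIMEOUT_MS[kind];
  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`tool call timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever settles first wins: the tool call or the watchdog.
    return await Promise.race([call(), watchdog]);
  } finally {
    clearTimeout(timer);
  }
}
```

The race is the whole trick: a hung MCP call cannot hold the turn hostage, and the user sees a timeout error instead of a spinner that never stops.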
Once you have that picture, the question 'which model does it use' is interesting but secondary. Two products on the same model can produce wildly different user experiences because the harness is doing different work behind each one. That is the entire claim.
Closing
The model is the engine. The harness is the road.
A faster engine does not get you there sooner if the road is full of potholes. The frontier model arms race is real and the engine is getting better. Every frontier model release feeds every harness equally. The differentiation, on any given product, is what the road is paved with. Eight rules in a system prompt, one allowlist in a tool wrapper, one watchdog on a stuck call, one length cap on a floating-bar message: these are paving stones, and they all ship in the harness.
The Fazm harness is open source. Every line referenced on this page resolves to a real file on disk in github.com/m13v/fazm. ChatPrompts.swift is 818 lines, mostly routing tables and length rules. acp-bridge/src/index.ts is 2,914 lines, mostly timeouts, retries, and diagnostic dumps. Read either one for ten minutes and the argument on this page stops being a position and starts being a description.
Want to see the harness running on your Mac?
Fifteen minutes with the founder, no slides, a live walkthrough of the prompt and the bridge.
Frequently asked questions
What does 'scaffolding matters more than model' actually mean once we get specific?
It means that for any single product on top of a frontier model, the difference between version N and version N plus one of the model is usually smaller than the difference between a careful harness and a sloppy one on the same model. The model picks the next token. The harness decides which tools the model is allowed to see in this turn, what the system prompt asserts and forbids, what gets retried after a stuck call, what the user is shown when a tool fails, where the screen content is read from, and how a long task is resumed after an interrupt. Most of those decisions do not improve when the model gets better. They have to be coded.
What's the most concrete way to see this argument in practice without trusting marketing copy?
Open Desktop/Sources/Chat/ChatPrompts.swift in the open source Fazm repo. The desktopChat template runs from line 14 to line 167 and ends with a tool routing block (lines 93 to 115) that contains eight explicit instructions to the model in the form of NEVER, ALWAYS, MUST, or CRITICAL. Each one is a specific failure mode the model has on this stack: capturing the browser viewport instead of the desktop, skipping the WhatsApp verify step, picking Playwright for Telegram, treating Finder like a webpage, typing chain-of-thought into the user's open document, loading a page the user only wanted opened in their own browser, leaving a trail of new tabs, running find on the home root. Adjacent rules catch more of the same: calling another tool after ask_followup, replying with paragraphs in a 28-pixel-tall floating bar, writing SQL with backslash-escaped quotes. Eight defeats, eight overrides. None of those are model knowledge. They are harness knowledge.
Why hard-code these rules into a prompt instead of waiting for a smarter model that won't make those mistakes?
Because they are not really mistakes in the model's understanding. They are defaults that are correct for the model's training distribution but wrong for this product. Driving web Telegram through Playwright works fine in a vacuum, it just costs ten times more tokens than the Python telethon path Fazm bundles. browser_take_screenshot is the right tool when you actually want a screenshot of the browser viewport, but in a desktop app the user almost always means the whole screen. ask_followup ending a turn is a Fazm UI contract, not an LLM convention. A smarter model might converge on the right answer eventually, but it will get there by losing two or three turns in production every time, which a shipped product cannot afford.
What is the runtime SQL guardrail at fazm-tools-stdio.ts:617 and why is it on this page?
When the chat observer session lets the model run execute_sql on the local fazm.db, the bridge intercepts every write query and checks the target table against a literal allowlist: const KNOWN_TABLES = new Set(['observer_activity', 'ai_user_profiles', 'chat_messages', 'indexed_files', 'local_kg_nodes', 'local_kg_edges', 'grdb_migrations']). If the INSERT INTO target is anything else, the call is rejected with the message 'Blocked: table "X" does not exist in the app database. To save observations, use your file tools to write memory files or use save_observer_card instead, do NOT write raw SQL.' That message exists because models, including frontier ones, periodically hallucinate plausible-looking table names like 'memories' or 'user_facts' that do not exist in the schema. The system prompt already tells the model to use save_observer_card for that case. The runtime check is what catches the model when the prompt does not.
Is the underlying model irrelevant, then? It feels like the page is overcorrecting.
No, the model is the engine and the engine matters. The claim is narrower: for a fixed problem, on a fixed product, the marginal value of investing in the harness is higher than the marginal value of waiting for the next model release. Across the industry, model gains compound on top of every shipped harness equally. Harness gains compound only on the harness that owns them. A team that puts all its effort into the model bet ends up with the same model everyone else has and a thinner runtime than the next team over. The thesis is about how to spend your hours, not whether the model has a vote.
Where does this leave a small team picking an open source agent today?
Open the candidate's repo and read three things before reading the marketing page. The system prompt is the first, because that tells you whether the team has been bitten by real users yet (a sloppy prompt with two paragraphs of vague encouragement is a different signal than 818 lines of routing tables and length rules). The tool registry is the second, because that tells you whether the agent can do the work you actually need. The error path on a tool timeout is the third, because that tells you whether the agent recovers from one stuck call or hangs forever. Fazm publishes all three under MIT at github.com/m13v/fazm: Desktop/Sources/Chat/ChatPrompts.swift for the prompt, and acp-bridge/src/index.ts for the other two (buildMcpServers around line 992, the per-tool watchdog around line 72, and the in-flight diagnostic dump around line 165).
What are the eight overrides in the routing table, exactly?
From ChatPrompts.swift desktopChat, lines 97 to 115, in order: (1) ALWAYS use capture_screenshot, NEVER use browser_take_screenshot, because the latter only sees the browser viewport and not the desktop. (2) WhatsApp must follow the explicit search, open chat, verify, send sequence, no shortcuts. (3) NEVER use Playwright for Telegram, use the bundled telegram skill that runs Python telethon scripts via Bash. (4) Use mcp__macos-use__* for Finder, Settings, Mail and other native apps, never Playwright. (5) CRITICAL, never type your reasoning, thoughts, debugging notes, or internal monologue into any application. (6) Use browser_navigate (Playwright) only when the user explicitly asks you to interact with a web page; for opening a URL to view, use the macOS open command. (7) Before navigating, call browser_tabs list and reuse an existing tab; close any tabs you opened when the task is done. (8) NEVER run find ~ or any recursive search on the entire home directory. There is also a hard rule for ask_followup that it MUST be the absolute last tool call in the turn (line 113), and an ABSOLUTE LENGTH RULE in the onboardingChat at line 228 capping every assistant message at one sentence and twenty words. The exact line numbers all resolve in the open source repo.
What's an example of something the model is good at that the harness can't do?
Picking the right next sentence in a 30-line reply when the user asks for advice on a relationship problem or a code review. The harness can constrain how long the reply is, where it gets shown, and which tools were allowed to inform it. The harness cannot supply the judgment that turns 'your boss is being passive aggressive' into a paragraph that is empathetic without being saccharine. That is model work. Likewise, picking the right SQL given a schema and a question is mostly model work, even though the runtime guardrail catches the cases where the model hallucinates a table. The point is not that scaffolding can do the model's job. The point is that good model output through a careless harness produces a worse user experience than worse model output through a careful one, on tasks that are mostly about routing and recovery.
Other harness internals
Keep reading
AI agent harness scaffolding, told as the failure-recovery layer most write-ups skip
Three tool timeout tiers, one synthetic completion event, and an in-flight diagnostic dump on user interrupt, anchored to acp-bridge/src/index.ts.
Open source local desktop agent on macOS, the part nobody writes about
The four-step accessibility permission probe, with file paths and line numbers, in Desktop/Sources/AppState.swift.
Local first AI coding agent, when local means the agent and not just the model weights
What 'local first' should mean once you stop conflating it with 'where the model weights live', anchored to the ACP bridge.