Field notes from one shipping harness
The agent scaffolding bottleneck is a lossy pipeline, not a missing feature
Every essay on this topic argues, correctly, that the scaffolding around an LLM matters more than which model is plugged into it. None of them count what the scaffolding throws away. This is that count, taken from one shipping macOS computer-use agent, with file paths and line numbers you can open yourself.
Direct answer (verified 2026-04-30)
The bottleneck is the harness, not the model. Every layer of scaffolding between the model and the world is a lossy filter: screenshots resampled to 1920 pixels before the model sees them, MCP image content dropped on extraction, conversation history capped at 30 messages, session replay capped at 20 entries truncated to 4000 chars each, attachments above 20 MB rejected, skills loaded by name and fetched on demand, the Claude permission gate auto-approved. None of them is a bug. The cumulative loss is the bottleneck.
Source: github.com/m13v/fazm, files acp-bridge/src/index.ts and Desktop/Sources/Providers/ChatProvider.swift.
The framing
When an agent does something stupid, the standard read is “the model hallucinated” or “the prompt was wrong.” Both can be true. Both miss the more common case: the model never had the information needed to do the right thing because the harness threw it away on the way in.
The harness has to throw things away. Context windows are finite, multimodal tokens are expensive, the API has hard limits on image dimensions, long context degrades instruction following. Every shipping harness against a frontier model ends up with the same shape: a stack of filters that drop, downsample, summarize, or skip content before it reaches the model.
fazm is a macOS computer-use agent on top of the Claude Agent SDK. Its harness is one Node bridge plus one Swift host process. It is open source, so the filters are countable. We counted seven filters worth talking about plus one auto-approve gate. Each is below, in source order.
Filter 1, screenshots are downsampled to 1920 px
acp-bridge/src/index.ts, lines 838 to 886
Playwright on a Retina Mac produces PNGs around 2880 by 1800. Claude’s multimodal API rejects anything past 2000 pixels on either axis. The bridge spawns a watcher on /tmp/playwright-mcp/ at startup, and any new PNG or JPEG that exceeds the cap is resampled in place by the macOS sips binary before the model has a chance to read it.
// acp-bridge/src/index.ts
const MAX_SCREENSHOT_DIM = 1920; // stay under 2000px API limit
if (w > MAX_SCREENSHOT_DIM || h > MAX_SCREENSHOT_DIM) {
  execSync(`sips --resampleHeightWidthMax ${MAX_SCREENSHOT_DIM} "${filepath}" 2>/dev/null`);
  logErr(`Screenshot resized: ${filename} from ${w}x${h} to fit ${MAX_SCREENSHOT_DIM}px`);
}

What the model loses: the actual pixel detail. A 14-pixel-tall icon at native resolution becomes a 9-pixel-tall icon after the resize, which is past the threshold where reliable text recognition stops working in current vision models. We have shipped bug reports against this filter. The trade is unavoidable, the API will reject the image otherwise. The point is not that the filter is wrong, the point is that nobody outside the bridge knows it ran.
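The excerpt shows only the resize. Here is a minimal sketch of the full watcher pattern the paragraph describes, assuming Node's fs.watch and the sips property queries; it is illustrative, not the fazm source, which also handles debouncing and partially written files.

import { execSync } from "node:child_process";
import { watch } from "node:fs";

const WATCH_DIR = "/tmp/playwright-mcp"; // Playwright MCP output dir
const MAX_DIM = 1920;                    // stay under the ~2000 px API limit

watch(WATCH_DIR, (_event, filename) => {
  if (!filename || !/\.(png|jpe?g)$/i.test(filename)) return;
  const filepath = `${WATCH_DIR}/${filename}`;
  try {
    // `sips -g` prints lines like "pixelWidth: 2880" that we can parse.
    const out = execSync(`sips -g pixelWidth -g pixelHeight "${filepath}"`).toString();
    const w = Number(/pixelWidth:\s*(\d+)/.exec(out)?.[1] ?? 0);
    const h = Number(/pixelHeight:\s*(\d+)/.exec(out)?.[1] ?? 0);
    if (w > MAX_DIM || h > MAX_DIM) {
      // Resample the longest edge down to MAX_DIM, overwriting in place,
      // before the model's next Read call can reach the file.
      execSync(`sips --resampleHeightWidthMax ${MAX_DIM} "${filepath}"`);
    }
  } catch {
    // The file may still be mid-write; a production watcher would retry.
  }
});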
Filter 2, Playwright never inlines a screenshot
acp-bridge/src/index.ts, line 1172
The Playwright MCP server can return screenshots one of two ways: inline, as base64 inside the tool result content array, or written to disk with only the file path returned. The bridge forces the second mode by hard-coding --image-responses omit in the launch args.
// acp-bridge/src/index.ts:1172
playwrightArgs.push(
  "--output-mode", "file",
  "--image-responses", "omit",
  "--output-dir", "/tmp/playwright-mcp",
);
What the model loses: the screenshot, as an immediate observation, on the turn that took it. To actually look at the picture the model has to spend an extra tool call (Read on the file path) on the next turn. The cost saved is real, a single 1920 px PNG is roughly 600 KB of base64, and a session that takes ten screenshots and keeps them inline blows past the cheap input window. The cost paid is one round-trip of latency on every visual decision.
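The arithmetic behind that trade, taking the 600 KB figure from this paragraph as the assumed input; the session size is an assumption too, not a measurement.

// Back-of-envelope for the inline mode fazm turned off.
const base64PerShot = 600 * 1024;   // ~600 KB of base64 per 1920 px PNG
const shotsPerSession = 10;
const inContext = base64PerShot * shotsPerSession;
console.log(`${(inContext / 1024 / 1024).toFixed(1)} MB of base64 held in history`);
// ~5.9 MB re-sent as input on every later turn while the shots stay in
// context, versus one extra Read round-trip per shot in file mode.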
Filter 3, MCP tool results are text-only
acp-bridge/src/index.ts, lines 2678 to 2714
When the bridge unwraps a tool result coming back from any MCP server, it walks the content array, copies out text items, and skips everything else. The comment in the source spells out the trade in one sentence.
// acp-bridge/src/index.ts:2679
// ACP wraps MCP content items as {type:"content", content:{type:"text"|"image", ...}}.
// We extract only text items and skip images to keep context small.
for (const item of contentArr) {
  if (item.type === "text" && typeof item.text === "string") {
    texts.push(item.text as string);
  }
  // Skip images entirely
}

What the model loses: image output from any MCP server. The macOS-use accessibility binary, for example, can return a screenshot of the window it just interacted with. The bridge drops it, along with the visual confirmation any third-party MCP attaches thinking it is being helpful. The model sees the text summary and treats the action as opaque. On a turn where the visual would have mattered (a popup that appeared, a state change that did not), the agent continues as if nothing happened.
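What loosening this exact filter might look like, sketched with illustrative names (ContentItem and unwrap are not fazm identifiers); the image block shape follows the Anthropic messages format, and the one-image cap is an assumption chosen to keep context growth bounded.

type ContentItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

function unwrap(contentArr: ContentItem[], maxImages = 1) {
  const blocks: unknown[] = [];
  let imagesKept = 0;
  for (const item of contentArr) {
    if (item.type === "text") {
      blocks.push({ type: "text", text: item.text });
    } else if (item.type === "image" && imagesKept < maxImages) {
      imagesKept++; // keep the first image so the turn's observation survives
      blocks.push({
        type: "image",
        source: { type: "base64", media_type: item.mimeType, data: item.data },
      });
    }
    // Images past the cap are still dropped: loosened, not removed.
  }
  return blocks;
}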
Filter 4, the system prompt sees the last 30 messages, no more
Desktop/Sources/Providers/ChatProvider.swift, line 1812
On every new chat session the Swift host builds a system prompt that includes a recent-conversation block. That block is hard-capped at the last 30 non-empty messages from the local store, regardless of whether the conversation is six turns or six hundred.
// Desktop/Sources/Providers/ChatProvider.swift:1812
let recent = messages
    .filter { !$0.text.isEmpty }
    .suffix(30)
// Injected into the system prompt as:
prompt += "\n\n<conversation_history>\nBelow is recent conversation history. " +
    "The user can see these messages and expects you to be aware of them. " +
    "For older conversations, query chat_messages with execute_sql.\n" +
    "\(history)\n</conversation_history>"

What the model loses: anything older than the last 30 messages, on every fresh session. The prompt does tell the model that older context exists in a SQL table it can query. The model has to know to ask. In practice, on the failure cases that show up in support, it doesn’t. It assumes the snippet it sees is the whole conversation and fills in plausible context. The user, who can scroll back forever, sees the agent as forgetful.
Filter 5, session replay caps at 20 entries, 4000 chars each
acp-bridge/src/index.ts, lines 1783 to 1810
When the agent SDK loses a session (a process restart, a crashed bridge, a recovered cold start), the bridge tries to rebuild context by replaying recent turns into a single SESSION RESTORED preamble. The replay is capped on both axes.
// acp-bridge/src/index.ts:1783
const MAX_REPLAY = 20;
const replay = ctxEntries.slice(-MAX_REPLAY);
// ...
const text = (e.text ?? "").slice(0, 4000);
What the model loses: 80 percent or more of a long conversation any time a session resets. A turn that contained the actual instruction (a 12 KB system spec the user pasted in, say) gets clipped to its first 4000 characters. The model sees a synopsis of a synopsis and continues from there. This is the layer that produces the “why did the agent restart from scratch” bug reports. The cap is what saves the recovery from being unbounded; it is also what makes the recovery dishonest.
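The recovery shape the excerpt implies, as one self-contained sketch; the entry type and the preamble wording are illustrative, and only the two caps come from the source.

interface CtxEntry { role: "user" | "assistant"; text?: string }

function buildRestorePreamble(ctxEntries: CtxEntry[]): string {
  const MAX_REPLAY = 20;          // at most 20 turns survive a reset
  const MAX_ENTRY_CHARS = 4000;   // each clipped to its first 4000 chars
  const replay = ctxEntries.slice(-MAX_REPLAY);
  const lines = replay.map((e) => {
    const text = (e.text ?? "").slice(0, MAX_ENTRY_CHARS);
    return `[${e.role}] ${text}`;
  });
  return `SESSION RESTORED. Recent turns follow:\n${lines.join("\n")}`;
}
// A 600-message conversation re-enters the model as 20 clipped turns;
// everything else exists only in the local store.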
Filter 6, skills are advertised by name, not loaded
Desktop/Sources/Providers/ChatProvider.swift, lines 1707 to 1714
The harness ships skills as separate .skill.md files in ~/.claude/skills/. The system prompt does not include the body of any of them. It includes a comma-separated list of names and a one-line instruction to fetch the body via the Skill tool when one looks relevant.
// Desktop/Sources/Providers/ChatProvider.swift:1712
prompt += "\n\n<available_skills>\nAvailable skills: \(skillNames)\n" +
    "Use the Skill tool to load full instructions for any skill before using it.\n" +
    "</available_skills>"

What the model loses: the actual instructions, until it asks. The trade is deliberate, this is what keeps the boot prompt small enough to leave room for tools and context. The cost is that “does this skill apply to what I am doing” becomes a name-only judgment, which is a worse decision than the one the model would make if it could read the body. On a skill set the model has not seen recently, the wrong one gets loaded, the work gets done in a worse way, and the user sees a slower or weirder result without knowing why.
Filter 7, attachments are gated at 20 MB and 10 MB
acp-bridge/src/index.ts, lines 1892 to 1918
User-attached files go through a MIME-and-size gate before they ever reach the model. Images and PDFs above 20 MB are rejected. Text files above 10 MB are rejected. Binary types that are not image or PDF are not inlined at all, only their path is sent.
// acp-bridge/src/index.ts:1892
const MAX_INLINE_SIZE = 20 * 1024 * 1024; // 20 MB for images/PDFs
const MAX_TEXT_SIZE = 10 * 1024 * 1024; // 10 MB for text files
const sizeLimit = isInlineBinary ? MAX_INLINE_SIZE : MAX_TEXT_SIZE;
if (stats.size > sizeLimit) {
  // reject and tell the user to split the file
}

What the user loses: the ability to drop a 25 MB design doc, a long log file, or a recorded MP4 into the chat in one shot. The model never sees the file at all and the user gets an error. The gate is there because the API has its own per-request body limits and timing out at the API layer is a worse failure mode than a clear local rejection. It is still a filter, and the agent is dumber for it on the cases where the missing context was the whole point.
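A sketch that fills in the elided branch, under stated assumptions: the isInlineBinary derivation, the MIME split, and the return shape are illustrative; the two constants are fazm's.

import { statSync } from "node:fs";

const MAX_INLINE_SIZE = 20 * 1024 * 1024; // images and PDFs
const MAX_TEXT_SIZE = 10 * 1024 * 1024;   // text files

function gateAttachment(path: string, mime: string): { ok: boolean; note?: string } {
  // Assumed MIME split; the real bridge's type detection may differ.
  const isInlineBinary = mime.startsWith("image/") || mime === "application/pdf";
  if (!isInlineBinary && !mime.startsWith("text/")) {
    // Other binary types are never inlined; only the path reaches the model.
    return { ok: true, note: "path only, bytes not sent" };
  }
  const sizeLimit = isInlineBinary ? MAX_INLINE_SIZE : MAX_TEXT_SIZE;
  const { size } = statSync(path);
  if (size > sizeLimit) {
    return { ok: false, note: `file is ${size} bytes, limit ${sizeLimit}: split it` };
  }
  return { ok: true };
}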
Eighth thing, not a filter, the permission gate is auto-approved
acp-bridge/src/index.ts, lines 702 to 714
The Claude Agent SDK ships with a built-in safety mechanism: when a tool call would do something risky, the SDK emits a session/request_permission ACP message and waits for the host to approve or deny. fazm answers every one of those messages with allow_always, before the user sees them.
// acp-bridge/src/index.ts:702
if (method === "session/request_permission") {
  // Auto-approve all tool permissions
  const allowAlways = options.find((o) => o.kind === "allow_always");
  const allowOnce = options.find((o) => o.kind === "allow_once");
  const optionId = allowAlways?.optionId ?? allowOnce?.optionId ?? "allow";
  logErr(`Auto-approving permission for tool (id=${id})`);
  acpStdinWriter?.(JSON.stringify({
    jsonrpc: "2.0",
    id,
    result: { outcome: { outcome: "selected", optionId } },
  }));
}

This is not a lossy filter on the input. It is an auto-yes on the output side. It is listed here because it is the same shape of decision: a piece of the SDK designed to slow the agent down, and a deliberate harness choice to short-circuit it. fazm runs as a foreground app the user is watching, with a stop button visible at all times, and the design choice was that an extra modal per tool call would be unusable. Other harnesses make the opposite call. The point is that the decision is in the harness, not the model.
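What the opposite call looks like, sketched with the same ACP shapes as the excerpt; askUser stands in for whatever modal affordance the host provides, and any option kind beyond the two shown above is an assumption.

// Hypothetical human-in-the-loop variant of the handler above.
declare function askUser(question: string): Promise<boolean>;
declare const acpStdinWriter: ((s: string) => void) | undefined;

async function onRequestPermission(
  id: number,
  options: { kind: string; optionId: string }[],
) {
  const approved = await askUser("Allow this tool call?");
  const pick = approved
    ? options.find((o) => o.kind === "allow_once") ?? options[0]
    : options.find((o) => !o.kind.startsWith("allow")) ?? options[0];
  acpStdinWriter?.(JSON.stringify({
    jsonrpc: "2.0",
    id,
    result: { outcome: { outcome: "selected", optionId: pick?.optionId } },
  }));
}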
“Seven lossy filters and one auto-approve gate sit between you and the model in one shipping harness, before any per-product prompt engineering, before the model has even seen the first user token.”
github.com/m13v/fazm, acp-bridge/src/index.ts and ChatProvider.swift, counted on 2026-04-30
The leak inventory, at a glance
One card per stage. Each is small in isolation. Stack them and the model is reasoning over a heavily abridged copy of reality.
1920px screenshot resize
Every Playwright PNG in /tmp/playwright-mcp/ is downsampled by macOS sips before the model sees it. acp-bridge/src/index.ts:842, MAX_SCREENSHOT_DIM = 1920.
Playwright base64 stripped
Playwright MCP runs with --image-responses omit, so screenshots are written to disk and never inlined as base64. acp-bridge/src/index.ts:1172.
MCP results are text-only
When any MCP tool returns content, the bridge filters to text items and silently drops every image item. acp-bridge/src/index.ts:2678 to 2714.
30-message conversation cap
The system prompt injects only the last 30 messages of local chat history. ChatProvider.swift:1812, .suffix(30) on the message array.
20-message session replay
On session recovery the bridge replays at most 20 entries, each truncated to 4000 characters. acp-bridge/src/index.ts:1783 to 1810, MAX_REPLAY = 20.
Skills loaded by name only
The system prompt lists skill names; full instructions are fetched on demand via a Skill tool round-trip. ChatProvider.swift:1707 to 1714.
20MB / 10MB attachment cap
User attachments above 20 MB (image, PDF) or 10 MB (text) are rejected pre-flight. acp-bridge/src/index.ts:1892 to 1918.
Permission gate auto-approved
Every Anthropic session/request_permission ACP message is auto-approved with allow_always. acp-bridge/src/index.ts:702 to 714.
Why this is the bottleneck and not the model
Take a single fazm turn where the user asks the agent to click a button on a web page. The Playwright MCP screenshots the page (filter 1, downsampled) and returns a path, not the bytes (filter 2, no inline). The bridge unwraps the result and drops any image content (filter 3, text-only). The model decides the screenshot is worth looking at, calls Read, and gets a 1920 px PNG instead of the original 2880 px capture. It picks coordinates against the downsampled image. The action runs. If anything has gone wrong, the diagnostic dump available at the next interrupt has been truncated by filter 5 and capped by filter 4. None of this is the model failing. It is the model succeeding under conditions the harness chose for it.
Now substitute “a 10 percent smarter model” into that flow. The screenshot is still 1920 px. The image content is still dropped. The conversation history still ends at message 30. The only step the smarter model improves is the “pick coordinates” one, and even there the improvement is bounded by the resolution it was given. The next major model gives you maybe 5 percent of the win you would get by removing one filter from the pipeline.
That is the actual bottleneck. The harness is not slow, it is not buggy, it is not poorly written. It is doing exactly what it has to do given its budget. The ceiling on agent quality is the cumulative information loss across every stage of the harness, and you cannot lift the ceiling without rebuilding the stages.
What you actually do about it
Three moves, in order of how much they cost.
- Audit your own harness, file by file. Find the constants. There will be a screenshot cap. There will be a history cap. There will be an image strip somewhere. Write them down. The list is the audit. (A minimal script for this step is sketched after this list.)
- Pick the one that matches your worst failure case. If the agent forgets the conversation, the history cap is the suspect. If the agent “cannot find the button,” the screenshot cap or the image strip is the suspect. If a long task restarts from scratch, the session replay cap is the suspect. Match symptom to filter.
- Loosen exactly that one filter and remeasure. Do not loosen the whole stack. Each filter exists for a reason and the next failure mode is waiting to bite. fazm exposes one knob (FAZM_TOOL_TIMEOUT_SECONDS) and the rest you patch in source. That is what open source is for.
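The step-1 audit as a script, sketched under obvious assumptions: the regex is a heuristic for cap-shaped constants, not a parser, and it will need tuning per codebase.

import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Heuristic patterns for cap-shaped constants: MAX_*, *_LIMIT, .suffix(n),
// .slice(..., bigN). Expect false positives; the point is a starting list.
const SUSPECT =
  /\b(MAX|LIMIT|CAP|TRUNC)[A-Z_]*\s*=\s*\d+|\.suffix\(\d+\)|\.slice\([^)]*\d{3,}[^)]*\)/;

function* walk(dir: string): Generator<string> {
  for (const name of readdirSync(dir)) {
    if (name === "node_modules" || name.startsWith(".")) continue;
    const p = join(dir, name);
    if (statSync(p).isDirectory()) yield* walk(p);
    else if (/\.(ts|js|swift)$/.test(name)) yield p;
  }
}

for (const file of walk(process.argv[2] ?? ".")) {
  readFileSync(file, "utf8").split("\n").forEach((line, i) => {
    if (SUSPECT.test(line)) console.log(`${file}:${i + 1}: ${line.trim()}`);
  });
}

Run against the fazm repo, this surfaces MAX_SCREENSHOT_DIM, MAX_REPLAY, the .suffix(30), and the attachment constants in one pass.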
On a closed harness this audit is impossible from the outside. You can guess which filter is hurting you on a given run, and that is most of what “prompt engineering” turns into in practice: working around a filter you cannot see. The argument for an open scaffolding layer is the same as the argument for an open browser engine. You eventually need to read the source.
1920 px
screenshot cap
30
message history cap
4000
chars per replayed turn
Want to walk your harness through this audit on a 20-minute call?
Bring your own harness or look at fazm's. We will count the filters together and pick the one that is actually starving your agent.
Questions, answered specifically
Short version: where is the bottleneck in an AI agent's quality, the model or the scaffolding?
The scaffolding, by a wide margin, on any product that runs against a current frontier model. The model picks the next token. The scaffolding decides what the model is allowed to see in the first place. Every layer of the harness between the model and the world (screenshot resize, MCP image stripping, conversation-history cap, replay cap, attachment limits, lazy skill loading, auto-approved permission gates) is a filter that throws information away. A 10 percent smarter model helps less than removing one of those filters, because the model never gets the bits the filter dropped.
Why frame the bottleneck as 'lossy' instead of 'wrong' or 'slow'?
Because each filter is correct under its own constraint. The 1920 pixel screenshot cap exists because Claude's multi-image API rejects anything bigger. The MCP text-only extraction exists because base64 image bytes blow up context size. The 30-message conversation cap exists because session creation has a token budget. None of them is a bug. The bottleneck is that they all run, in series, on every turn, and the cumulative loss is invisible to anyone reading just the chat. The model is reasoning over a quietly degraded world.
Which file in fazm contains most of the lossy stages?
Two files. Most of the runtime filters live in acp-bridge/src/index.ts, the 3216-line Node bridge process between the Swift UI and the Claude agent. The system-prompt-side filters (history, skills) live in Desktop/Sources/Providers/ChatProvider.swift around the buildSystemPrompt path near line 1689. Both are MIT-licensed in the public repo at github.com/m13v/fazm and every constant in this guide can be checked by opening either file.
Are these filters specific to fazm or do other harnesses do the same thing?
The pattern is universal, the constants vary. Any harness that runs against a model with a finite context window and a multimodal token cost has to drop something. Cursor, Claude Code, Codex, Cline, Aider, OpenAI Operator, Devin: all of them have an analogous set of filters. The only difference is whether the filters are visible. fazm's are visible because the source is on GitHub. A closed harness has the same lossy stages, you just cannot count them. That is the actual reason this matters: the bottleneck is real, and unless you can read the harness you cannot tell which stage is starving your agent.
How is this page different from your other harness pages?
We have two adjacent pages. /t/ai-agent-harness-scaffolding goes deep on one corner of the harness, the per-tool wall-clock watchdog and synthetic completion event. /t/agent-scaffolding-vs-model-quality argues the same thesis from the system-prompt side, listing eight explicit overrides in ChatPrompts.swift. This page is the inventory page: it counts the lossy stages, with line numbers, so the reader can see how many filters sit between the model and a single user request.
If every filter loses information, why ship them at all? What happens with a no-filter harness?
Three things happen, in order. First, you blow the model's context window inside three turns and the API starts rejecting requests. Second, your bill 10x's because every Playwright screenshot is 6 to 10 megabytes of base64 multiplied by every turn it stays in context. Third, the model gets worse, not better, because long, dense context degrades instruction following (the 'context rot' effect well documented on million-token models). The filters are not optional. The actual question is which one is costing you the most quality on this run, and you cannot answer that without reading the harness.
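The first failure in numbers, taking the 6 to 10 MB figure above as the assumed per-capture cost; the midpoint is an assumption, not a measurement.

const base64PerCapture = 8 * 1024 * 1024; // midpoint of the 6-10 MB range
for (const turn of [1, 2, 3]) {
  const mb = (turn * base64PerCapture) / 1024 / 1024;
  console.log(`turn ${turn}: ~${mb} MB of image base64 carried in the request`);
}
// 24 MB of request body by turn three, before a single word of text, which
// is the failure mode the attachment gate (filter 7) exists to pre-empt.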
Can a user adjust any of these filters in fazm?
A few. The wall-clock per-tool timeout is overridable via the FAZM_TOOL_TIMEOUT_SECONDS env var (Settings, Advanced, Tool Timeout). The MCP server list is editable in ~/.fazm/mcp-servers.json, which can be used to add tools that bypass the Playwright image strip path. The skill loading mechanism can be controlled by adding or removing files from ~/.claude/skills. The conversation-history cap, the screenshot resize, and the per-session image cap are not user-facing settings. The whole bridge file is open source, so if one of those is your bottleneck you can patch it locally.
Are filters the only thing in the scaffolding, or is the loop also a bottleneck?
The loop, the prompt assembly, the tool routing, and the verification pass are all part of the scaffolding. They can also be bottlenecks. We are picking on filters here because they are the most measurable: every one has a constant value, a file, and a line number. A bad loop or a sloppy verification pass is harder to point at because the failure mode is 'the model did the wrong thing more often,' which is not a number. Filters are. If you can only audit one part of a harness, audit the filters first.
Is this only an issue for computer-use agents, or do coding agents have the same problem?
Coding agents have a different filter set but the same shape. A coding harness usually does not resize screenshots (no screenshots), does not strip MCP image content (no images), and does not cap conversation history at 30 messages (terminals scroll forever). It does cap diff context, summarize file reads past a size threshold, evict tool results from older turns, and apply a verification pass that drops or rewrites the agent's edits. Same lossy pipeline, different stages. If you are building or evaluating a coding agent, run the same audit: list the filters, count what each one drops, and see which one is starving the model on the failures you actually care about.
Where do I read the actual code if I want to verify any of this?
Clone github.com/m13v/fazm. The runtime filters are in acp-bridge/src/index.ts (3216 lines, MIT). The screenshot watcher and resize is lines 838 to 886, the Playwright image-omit flag is line 1172, the MCP text-only extraction is lines 2678 to 2714, the auto-approve permission handler is lines 702 to 714, the session recovery cap and per-entry truncation is lines 1783 to 1810, the attachment size gates are lines 1892 to 1918. The system-prompt filters are in Desktop/Sources/Providers/ChatProvider.swift, the conversation-history cap is line 1812, the lazy skill list is lines 1707 to 1714. Every line number in this page can be checked against the file.