Browser automation architecture: --image-responses omit
Source-verified, April 19 2026

Every top SERP result for this keyword treats a screenshot as the agent's perception channel. Fazm's bridge deliberately strips screenshots out of the model's context and feeds a 691-character text snapshot instead.

OpenAI Operator, Anthropic Computer Use, Microsoft OmniParser, Browser Use, Playwright, Selenium, LambdaTest, Axiom, Roborabbit, Scrapfly. Ten results, one assumption: the LLM needs the PNG. Fazm's Mac-side bridge makes the opposite bet and ships it to production. This guide walks the exact six-argv-element push at acp-bridge/src/index.ts line 1033, the two-branch content-filter loop at lines 2280-2292, and the sips-backed fs.watch at lines 709-758 that together reduce perception cost from one 500 KB Retina PNG per turn to zero PNGs per turn.

Fazm
11 min read
  • acp-bridge/src/index.ts line 1033 pushes --image-responses omit onto Playwright MCP argv
  • Lines 2280-2292 filter tool result content to text-only; type:'image' items are dropped
  • Lines 709-758 fs.watch /tmp/playwright-mcp and sips-resize any PNG > 1920 px in place
  • ~500 KB: typical size of one Retina Chrome screenshot as base64 in the model context
  • ~691: characters a Playwright text snapshot takes up after Fazm's bridge strips images
  • 1920 px: max pixel dimension sips resizes screenshots to, so fallback images stay under Claude's 2000 px cap
  • 20: MAX_IMAGE_TURNS, the hard cap on image-bearing turns per session before the bridge stops sending PNGs

Where the screenshot goes, where the text goes

Four input channels (the agent, real Chrome over CDP, a native Mac app through macos-use, and Playwright MCP itself) all converge on one filter in the bridge. The PNG lands on disk. Only text lands in the context.

The path a browser_snapshot takes through the bridge

Fazm agent, Chrome extension, macos-use binary, and Playwright MCP all converge on the acp-bridge session/prompt handler. From there the text snapshot (.yml) flows into the agent context; the PNG stays on disk. 0 PNGs enter the context.

The literal statement at acp-bridge/src/index.ts line 1033 is: playwrightArgs.push('--output-mode', 'file', '--image-responses', 'omit', '--output-dir', '/tmp/playwright-mcp'). Six argv elements. Lines 2280-2292 walk the tool result content array with two branches that only keep type:'text' items; type:'image' items are silently dropped. The net result: a browser_snapshot call that would have returned a 500 KB base64 PNG returns a ~691-char YAML text snapshot instead. Screenshots still get written to /tmp/playwright-mcp; they just never enter the model context.

/Users/matthewdi/fazm/acp-bridge/src/index.ts

What every top result for this keyword actually covers

OpenAI Operator (CUA) · Anthropic Computer Use · Microsoft OmniParser · Microsoft Foundry browser tool · Playwright screenshot() · Selenium WebDriver snapshots · LambdaTest visual diffs · Axiom.ai screenshot recipes · Roborabbit scheduled captures · Scrapfly headless screenshots · Browser Use (YC S24) · Vercel agent-browser CLI · HuggingFace smolagents web browser · Set-of-Mark annotated screenshots · deepsense browser AI agents

The top 10 all agree on one thing

Different vocabulary, same architecture. The screenshot is the observation; the LLM (or a vision parser on top of an LLM) reads the PNG; the agent gets back a pixel coordinate or a SoM index. Every page on the first SERP assumes this pipeline.

OpenAI Operator / Anthropic Computer Use

Both ship a Computer-Using Agent that takes a screenshot of the browser, sends the PNG as an image block to a vision LLM, and gets back a pixel coordinate plus an action. Screenshot is the perception channel by design.

Microsoft OmniParser

Vision model that parses UI screenshots into a list of interactive boxes. Input is a PNG, output is annotated rectangles. The pipeline does not work without the screenshot.

Playwright / Selenium / LambdaTest

Ship first-class screenshot APIs (page.screenshot(), TakesScreenshot, visual diff runners). Screenshots are a QA artifact at the end of a step, not stripped from it.

Axiom.ai / Roborabbit / Scrapfly

Recorders and scrapers whose top tutorials are literally 'how to automate website screenshots'. The screenshot is the deliverable, not a cost to minimize.

Browser Use / agent-browser / smolagents

Open-source agent frameworks that pass annotated screenshots (Set-of-Mark overlays) into the LLM each turn. Accessibility snapshots exist as a fallback, not the default.

What none of them mention

The five facts below describe the text-first, screenshot-last architecture Fazm ships in production. No page on the first SERP describes any of them, because their product, framework, or vision model requires the PNG to be in context. Fazm does not.

Specifically, what none of the top SERP pages cover

  • None of the top 10 results describe how to architecturally remove screenshots from the perception channel while keeping them on disk for verification.
  • None of them discuss the per-step context cost of a Retina Chrome PNG (~500 KB of base64 tokens) against a 32 K local-model context window.
  • None of them show the exact MCP server argv flags that neutralize image responses from Playwright.
  • None of them describe how to watch /tmp for oversized screenshots and in-place resize them with a built-in macOS tool (sips) before they hit the API.
  • None of them wire both the image-stripping AND the accessibility tree for a native Mac app (macos-use) into a single MCP server array so the same agent turn can read a browser snapshot and a Finder window without a screenshot ever entering the model context.

Two architectures, side by side

Same task, same Chrome tab, same question (“what should I click next?”). The difference is what enters the model context on each turn.

Screenshot perception vs text snapshot perception

// OpenAI Operator / Anthropic Computer Use style turn
// (what every top SERP result describes)

// Step 1: take a PNG of the current browser viewport
const png = await page.screenshot({ fullPage: false });
// png is ~500 KB of bytes on a Retina Mac

// Step 2: send the PNG as an image block to the vision LLM
const response = await llm.messages.create({
  model: "claude-sonnet-4-6",
  messages: [{
    role: "user",
    content: [
      { type: "image", source: { type: "base64", media_type: "image/png", data: png.toString("base64") } },
      { type: "text", text: "What should I click next?" },
    ],
  }],
});

// Step 3: model returns a pixel coordinate. Click it.
// Every single turn pays ~500 KB of image tokens.
// Local model with 32 K context? One screenshot blows it.
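For contrast, a hedged sketch of the text-first turn this guide describes. Item shapes and variable names here are illustrative, not the bridge's literal source; only the filtering behavior itself is from the article.

```typescript
// A turn's tool result as a mixed content array (shape is illustrative).
type ContentItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

const toolResult: ContentItem[] = [
  { type: "text", text: '- button [ref=e12] "Sign in"\n- textbox [ref=e13] "Email"' },
  // What the default pipeline would also carry: ~500 KB of base64.
  { type: "image", data: "iVBORw0KGgo..." },
];

// Keep only text items, mirroring the bridge's filter described in this guide.
const snapshotText = toolResult
  .filter((item): item is { type: "text"; text: string } => item.type === "text")
  .map((item) => item.text)
  .join("\n");

// The next LLM request carries the text snapshot and no image block.
const messages = [
  {
    role: "user",
    content: [{ type: "text", text: `What should I click next?\n\n${snapshotText}` }],
  },
];
```

Every turn under this model costs a few hundred text tokens instead of a screenshot's worth of image tokens.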

Four layers of defense against a PNG in the context

The bridge does not trust any single switch. Each layer catches a different failure mode, so a misconfigured MCP server, a stray image in the content array, a Retina-sized fallback, or a runaway session all get handled.

Layer 1: Playwright MCP argv

acp-bridge/src/index.ts line 1033 pushes --output-mode file --image-responses omit --output-dir /tmp/playwright-mcp onto the playwrightArgs array. The MCP server still writes the PNG to disk, but its response to the agent contains a file-path reference, not inline base64.

Layer 2: tool-result content extraction

index.ts lines 2280-2292 iterate the ACP tool result content array with two branches (item.type==='text' and the ACP-wrapped inner.type==='text'). Any item with type:'image' is skipped. The rawOutput fallback at 2294-2303 does the same.

Layer 3: Retina size cap

Lines 709-758: a fs.watch on /tmp/playwright-mcp that uses `sips -g pixelWidth -g pixelHeight` to detect any PNG > 1920 px and resamples it in-place with `sips --resampleHeightWidthMax 1920`. Exists as a safety valve for the rare case the model asks to Read a PNG.

Layer 4: per-session image budget

Lines 791-793 declare imageTurnCounts plus MAX_IMAGE_TURNS = 20 so sessions that do take images cannot drift past Claude's stricter 2000 px/image limit. On every session delete, the counter resets.
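A minimal sketch of such a budget. Only the MAX_IMAGE_TURNS = 20 constant and the imageTurnCounts name come from the article; the helper functions are illustrative wiring, not the bridge's literal code.

```typescript
// Per-session image budget, keyed by session.
const MAX_IMAGE_TURNS = 20;
const imageTurnCounts = new Map<string, number>();

// Returns true while the session still has image budget left,
// incrementing the counter for each image-bearing turn.
function mayIncludeImage(sessionKey: string): boolean {
  const used = imageTurnCounts.get(sessionKey) ?? 0;
  if (used >= MAX_IMAGE_TURNS) return false; // budget spent: text-only from here
  imageTurnCounts.set(sessionKey, used + 1);
  return true;
}

// Session deletion clears the counter, so a fresh session starts clean.
function onSessionDelete(sessionKey: string): void {
  imageTurnCounts.delete(sessionKey);
}
```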

Layer one, in code

This is the exact block from acp-bridge/src/index.ts that neutralizes Playwright MCP's screenshot response. Six argv elements do the work: --output-mode file routes the PNG to disk, --image-responses omit keeps base64 out of the tool result entirely, and --output-dir /tmp/playwright-mcp gives the bridge a known location to watch.

acp-bridge/src/index.ts (buildMcpServers, ~line 1027)
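A minimal sketch of that push, reconstructed from the literal statement quoted earlier. The array's initial contents and surrounding wiring are assumptions.

```typescript
// The push quoted in this guide, in context. Only the six pushed
// strings are from the source; the binary name is an assumption.
const playwrightArgs: string[] = ["@playwright/mcp@latest"];

playwrightArgs.push(
  "--output-mode", "file",               // route artifacts to disk
  "--image-responses", "omit",           // never inline base64 in tool results
  "--output-dir", "/tmp/playwright-mcp", // known location for the resize watcher
);
```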

Layer two, in code

Belt and braces. Even if an MCP server returned a type:'image' item anyway, the bridge drops it here before the session/prompt update becomes part of the agent's context. Two branches handle the ACP-wrapped format and the direct MCP format. No branch exists for images.

acp-bridge/src/index.ts (session/prompt handler, lines 2271-2307)
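A sketch of that two-branch filter, with item shapes inferred from the article's description; field names beyond type, text, and content are not from the source.

```typescript
// Either a direct MCP text item, or an ACP-wrapped item with inner content.
type McpItem = {
  type: string;
  text?: string;
  content?: { type: string; text?: string };
};

function extractText(items: McpItem[]): string {
  let out = "";
  for (const item of items) {
    if (item.type === "text" && item.text) {
      out += item.text; // branch 1: direct MCP format
    } else if (item.content?.type === "text" && item.content.text) {
      out += item.content.text; // branch 2: ACP-wrapped format
    }
    // No branch for type:'image': images are silently dropped.
  }
  return out;
}
```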

Layer three, in code

When the agent does deliberately Read a PNG for visual verification, it hits a file that has already been resized to ≤ 1920 px by a filesystem watcher. sips is built into macOS, so the bridge has no runtime dependency and no extra process to ship.

acp-bridge/src/index.ts (startScreenshotResizeWatcher, lines 709-758)
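A sketch of the watcher under those assumptions. The function and constant names follow the article; the exact sequencing of the two sips invocations is illustrative.

```typescript
import { watch } from "node:fs";
import { execFile } from "node:child_process";

const MAX_SCREENSHOT_DIM = 1920;

// Parse the output of `sips -g pixelWidth -g pixelHeight <file>`, which looks like:
//   /tmp/playwright-mcp/page.png
//     pixelWidth: 2880
//     pixelHeight: 1800
function needsResize(sipsOutput: string): boolean {
  const dims = [...sipsOutput.matchAll(/pixel(?:Width|Height):\s*(\d+)/g)]
    .map((m) => Number(m[1]));
  return dims.some((d) => d > MAX_SCREENSHOT_DIM);
}

// Watch the snapshot directory and shrink oversized images in place.
// sips ships with macOS, so this adds no runtime dependency.
function startScreenshotResizeWatcher(dir = "/tmp/playwright-mcp"): void {
  watch(dir, (_event, filename) => {
    if (!filename || !/\.(png|jpe?g)$/i.test(filename)) return;
    const path = `${dir}/${filename}`;
    execFile("sips", ["-g", "pixelWidth", "-g", "pixelHeight", path], (err, stdout) => {
      if (err || !needsResize(stdout)) return;
      execFile("sips", ["--resampleHeightWidthMax", String(MAX_SCREENSHOT_DIM), path], () => {});
    });
  });
}
```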

What this looks like on disk during a real run

The bridge writes snapshots into /tmp/playwright-mcp as it runs. Each browser_snapshot call creates a matched .yml + .png pair. The agent sees the .yml content; the .png sits on disk, resized to 1920 px if it came in bigger.

/tmp/playwright-mcp after three snapshot calls
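An illustrative listing under that scheme. The filenames are hypothetical; Playwright MCP picks its own names.

```
$ ls /tmp/playwright-mcp
page-2026-04-19T10-02-11.yml   page-2026-04-19T10-02-11.png
page-2026-04-19T10-03-40.yml   page-2026-04-19T10-03-40.png
page-2026-04-19T10-05-02.yml   page-2026-04-19T10-05-02.png
```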

One browser_snapshot turn, end to end

Five actors, nine messages. The image-drop happens entirely inside the acp-bridge process; neither Playwright MCP nor the LLM changes its protocol. The bridge is the component that enforces text-first perception for every agent turn.

Agent turn under text-first perception

1. Fazm agent → Playwright MCP: browser_snapshot()
2. Playwright MCP → the site: CDP request for the accessibility tree + DOM refs
3. The site → Playwright MCP: labels, roles, refs, bounds
4. Playwright MCP → acp-bridge: {type:'content', content:{type:'text', text:'- button [ref=e12] "Sign in"...'}}, {type:'image', data:'iVBOR...'}
5. acp-bridge, internal: filter keeps type:'text', drops type:'image' (index.ts 2280-2292)
6. acp-bridge → Fazm agent: ~691 char text snapshot only
7. Fazm agent → the LLM: prompt + text snapshot (no base64 PNG)
8. The LLM → Fazm agent: 'click [ref=e12]' via tool_use
9. Fazm agent → Playwright MCP: browser_click({ref:'e12'})

Walking the same turn as steps

Same trace, unwound into human steps. Each step maps to real code in acp-bridge/src/index.ts on the Fazm repo, not a marketing abstraction.

From browser_snapshot to browser_click

1

Agent calls browser_snapshot

MCP tool call from the Fazm agent. Playwright MCP runs inside the acp-bridge subprocess and talks to the user's real Chrome over CDP (or to its own Chromium if the user is not in extension mode).

2

Playwright serializes the accessibility tree

Chrome exposes the a11y tree: for every interactive node, role (button, textbox, link), name (the visible label), and a [ref=e_] handle the agent can click or type into later. This is what Playwright writes to the .yml snapshot file.

3

Playwright MCP returns content items

Without the omit flag, the default is to return BOTH the text snapshot and a base64 PNG as separate content items. With --image-responses omit on argv (line 1033), the MCP skips the base64 altogether; the PNG ends up on disk at /tmp/playwright-mcp.

4

acp-bridge strips any remaining image items

Defense in depth: the session/prompt update handler (lines 2271-2307) walks the content array, pushing only text items onto the output string. A stray type:'image' item from any MCP server is silently dropped before it enters the agent's context.

5

Agent receives the text snapshot

The agent's tool result for browser_snapshot is the .yml text: a ~691-char YAML block listing every interactive element by role, label, and [ref=e_] handle. The LLM uses labels like 'Sign in' and 'Email' to pick targets, not pixels.

6

If pixels are actually needed, Read the file

Fazm's agent system prompt tells the model that when visual verification matters, Read the PNG path from the .yml's screenshot field. The fs.watch at line 715 ensures that PNG is already resized to ≤ 1920 px by sips, so a single visual check is safe without re-triggering the full image-per-turn cycle.

What the next LLM request actually contains

The difference is not a tuning choice. It is a different architecture for what perception even means in a browser automation agent.

Model context at turn N, same Chrome tab

A 500 KB base64 PNG inlined as an image block, plus a text instruction. Every turn pays the same image cost. Vision LLM is required. Local models with 32 K context are not viable: one observation overflows the window.

  • ~500 KB base64 per turn
  • ~350 K input tokens once tokenized
  • One observation blows a 32 K local-model context
  • Retina PNGs trigger 'image too large' 413s on providers

Text-first vs screenshot-first, feature by feature

The “competitor” column is the consensus architecture across the top 10 SERP results. Not one specific product, but the shared assumption that screenshots are the observation channel.

What the agent sees on a browser_snapshot call
  Screenshot-first: a ~500 KB base64 PNG attached as an image block in the next LLM request, plus model-specific Set-of-Mark overlays.
  Fazm (text-first): a ~691-char YAML listing roles, labels, and [ref=e_] handles; type:'image' items dropped by acp-bridge lines 2280-2292.

Where screenshots go when captured
  Screenshot-first: sent directly into the model context every turn, as the primary observation; a vision LLM is required.
  Fazm (text-first): written to /tmp/playwright-mcp as .png files; the agent can Read them on demand, and the default path never touches them.

Per-step cost against a 32 K local context
  Screenshot-first: ~500 KB base64 → ~350 K input tokens per screenshot; one observation overflows a 32 K local-model context window.
  Fazm (text-first): ~691 chars (~200 tokens); a 32 K context holds 40+ turns of snapshot-plus-reasoning before anything rolls off.

How oversize Retina PNGs are handled
  Screenshot-first: either passes the full Retina PNG (API errors at high volume) or resizes via Pillow/Sharp inside the Python/JS agent loop.
  Fazm (text-first): fs.watch on /tmp/playwright-mcp triggers sips --resampleHeightWidthMax 1920 in place, keeping PNGs under Claude's 2000 px/image API cap (index.ts 709-758).

Hard cap on image-bearing turns per session
  Screenshot-first: no hard cap; the session carries the full screenshot history until it overflows or the provider returns a 413.
  Fazm (text-first): MAX_IMAGE_TURNS = 20 (index.ts line 793); a session that hits the cap stops sending images entirely and falls back to text-only tool results.

Cross-process perception (browser + native apps)
  Screenshot-first: 'browser' and 'desktop' are typically two pipelines with two vision passes.
  Fazm (text-first): macos-use MCP is registered alongside Playwright MCP in the same servers array (index.ts 1056-1064); same agent turn, same text-first perception model.

What the bridge actually forwards to the model
  Screenshot-first: whatever the MCP returned, often including inline base64 images, goes straight to the LLM message array.
  Fazm (text-first): only items where item.type === 'text' or inner.type === 'text'; anything else is dropped before the session/prompt update becomes part of the agent's context.

Default behavior on a consumer Retina Mac
  Screenshot-first: high-DPI PNGs around 2880 × 1800 that regularly trigger 'image too large' errors on Anthropic and OpenAI vision endpoints.
  Fazm (text-first): zero PNGs in the context, zero API 413s for oversized images.

The argv, the filter, the watcher, the budget

  • 6: argv elements pushed onto Playwright MCP to kill screenshots (--output-mode, file, --image-responses, omit, --output-dir, /tmp/playwright-mcp)
  • 2: places the bridge drops type:image items (the content array loop and the rawOutput fallback)
  • 1: watcher, an fs.watch on /tmp/playwright-mcp that auto-resizes any PNG > 1920 px in-place via sips
  • 0: screenshots the agent sees by default when it calls browser_snapshot on the page

Read the exact file yourself.

acp-bridge/src/index.ts is in the open Fazm repo. Line 1033 is the --image-responses omit push; lines 2271-2307 are the content array filter; lines 709-758 are the sips resize watcher. If you download Fazm, you can also watch /tmp/playwright-mcp fill up with .yml/.png pairs while the agent runs.

Download Fazm

Run text-first perception against your own site, on your own Mac

Book 20 minutes and we'll point Fazm at a page you pick, then show the ~691-char snapshot the agent gets back, side by side with the 500 KB PNG a screenshot agent would have sent.

Book a call

Browser automation agents and screenshot technology, answered against the Fazm source

What is different about browser automation agents and screenshot technology inside Fazm compared to OpenAI Operator, Anthropic Computer Use, or Browser Use?

All three of those feed screenshots to the LLM as the primary observation each turn. Fazm does not. Fazm's acp-bridge spawns Playwright MCP with --image-responses omit on argv (acp-bridge/src/index.ts line 1033) and, as a second line of defense, strips every type:'image' item out of the tool result content array at lines 2280-2292 before the update is handed to the agent. The agent's browser_snapshot tool result is a ~691-character YAML file listing interactive elements by role, label, and [ref=e_] handle. Screenshots still get written to /tmp/playwright-mcp for optional visual verification, but they are never inlined into the model context on the default path.

Why does Fazm strip screenshots from the perception loop at all?

Two reasons. First, a full Retina Chrome screenshot is roughly 500 KB of base64 per step, which is about 350 K input tokens once the vision model tokenizes it. That alone overflows a 32 K context window on a local Ollama model after one observation. Second, the exact same perception pattern has to work for native Mac apps through macos-use (index.ts lines 1056-1064) where the accessibility tree is genuinely richer than a screenshot, so keeping both surfaces text-first lets the same agent turn walk a Chrome tab and a Finder window with one architecture.

Where exactly are the argv elements that make Playwright MCP stop returning base64 screenshots?

acp-bridge/src/index.ts, inside buildMcpServers(sessionKey), around line 1033. The literal statement is playwrightArgs.push('--output-mode', 'file', '--image-responses', 'omit', '--output-dir', '/tmp/playwright-mcp'). This tells the @playwright/mcp binary to write any screenshot to disk and return a file path reference in its response, rather than inlining the PNG as a content item. The six elements land on the Playwright MCP child process argv when the session is warmed up; you can confirm with `ps aux | grep playwright` after Fazm starts.

What does the agent actually see if a screenshot is stripped, and how is that enough to click the right thing?

It sees the text snapshot Playwright writes next to the PNG: a .yml file under /tmp/playwright-mcp with one line per interactive element in the current tab, of the form `- role [ref=e_] 'label'`. For example, `- button [ref=e12] 'Sign in'`. The model picks a target by label and role instead of by pixel, then calls browser_click({ref:'e12'}). Playwright translates that ref back to a real DOM node. On sites with good semantic HTML this is strictly more reliable than pixel matching because a button labeled 'Sign in' stays labeled 'Sign in' even when the site redesigns and the pixel coordinates change.
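An illustrative snapshot in that shape. The elements are invented for the example; real snapshots come from the live page.

```
- heading [ref=e3] 'Welcome back'
- textbox [ref=e11] 'Email'
- textbox [ref=e12] 'Password'
- button [ref=e13] 'Sign in'
- link [ref=e14] 'Forgot password?'
```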

What happens on a Retina Mac where the screenshots Playwright takes are ~2880 x 1800 and blow through Claude's 2000 px/image API cap?

The bridge ships a safety valve for the rare cases where the agent does ask to Read a PNG. Lines 709-758 of index.ts register an fs.watch on /tmp/playwright-mcp that fires on every new .png or .jpeg file. It runs `sips -g pixelWidth -g pixelHeight` to measure the image, and if either dimension exceeds MAX_SCREENSHOT_DIM (1920), it calls `sips --resampleHeightWidthMax 1920` to shrink the file in place. sips is built into macOS, so there is no dependency. The resized PNG is then small enough for the API even on high-DPI machines.

How many image-bearing turns will a Fazm session actually tolerate before it refuses more screenshots?

MAX_IMAGE_TURNS = 20 (index.ts line 793). The bridge keeps a Map<sessionKey, number> at imageTurnCounts (line 791) and increments it every time a turn genuinely does include an image. When a session passes 20, the bridge stops including screenshots for that session, because Claude's API gets stricter on image dimensions once the session history contains many images. The counter is cleared whenever the session is deleted, so a fresh session starts clean.

Does the same architecture work for native Mac apps, or is this only a browser trick?

It is not a browser trick. The same buildMcpServers function that registers Playwright MCP also registers mcp-server-macos-use at lines 1056-1064 on the same servers array. macos-use is a 21 MB arm64 Mach-O at Fazm.app/Contents/MacOS/mcp-server-macos-use that walks the AXUIElementCreateApplication(pid) tree for any frontmost app and returns structured text: `[AXButton] 'Send' x:842 y:712 w:68 h:32 visible`. The bridge's type:'image' drop at lines 2280-2292 applies identically to both servers, so a single agent turn can call browser_click on a web page AND click_and_traverse on Slack without any PNG ever entering the model context.

Is any of this visible from the outside without reading Fazm's source?

Yes. After Fazm starts, `ls /tmp/playwright-mcp/` shows PNG and YML pairs; the agent's tool results in /tmp/fazm-dev.log show text-only content blocks for browser_snapshot; `ps aux | grep playwright` lists the --image-responses omit argv. The logErr line at index.ts 2551 also prints a canonical 'Playwright MCP config: ... outputMode=file, imageResponses=omit, outputDir=/tmp/playwright-mcp' to the bridge's stderr on boot, which the Fazm desktop app writes to the same log.

What happens if I want the agent to actually use a screenshot for something visual, like confirming a toast message appeared?

The agent calls its built-in Read tool on the file path from the .yml snapshot's screenshot: field. That PNG has already been resized to ≤ 1920 px by the fs.watch on /tmp/playwright-mcp, so it is safely under the API limit. Because you picked when a PNG enters the context instead of every turn including one by default, a vision verification step is a cheap one-off, not a constant tax. In a 32 K local-model context, one deliberate visual verification costs a fraction of what a screenshot-per-turn agent burns in the first three steps.

How does this compare to Set-of-Mark (SoM) annotated screenshots used by Browser Use and smolagents?

SoM overlays numbered labels on every interactive element in a PNG, so a vision model can say 'click element 12' and the framework maps 12 back to a DOM node. The text snapshot Fazm uses skips the overlay step entirely because the [ref=e12] token is already in the Playwright snapshot's YAML. The agent saw `- button [ref=e12] 'Sign in'` and calls browser_click({ref:'e12'}). No rendering pipeline is needed, and the perception is semantic rather than pixel-anchored, so a site redesign that moves the button does not break the reference. SoM costs you the PNG tokens PLUS the overlay CPU; text snapshots cost neither.

Could I build the same text-first perception on top of raw Playwright and an LLM?

Yes, and this is exactly what Fazm does in production. The building blocks are the @playwright/mcp binary (which already supports --image-responses omit and --output-mode file), a thin TypeScript process that spawns it and filters tool-result content items, and an agent loop that calls browser_snapshot at the start of each turn. The relevant source in Fazm is acp-bridge/src/index.ts lines 1027-1064 for server registration, lines 2271-2307 for the content item filter, and lines 709-758 for the screenshot resize watcher. The Fazm repo is open for inspection at github.com/mediar-ai/fazm; you can read the exact architecture rather than approximating from a marketing page.
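A compressed sketch of those building blocks. Only the flags come from the source; the npx invocation, stdio wiring, and function names are assumptions.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Six flag strings from the source, preceded by an assumed package name.
const args = [
  "@playwright/mcp@latest",
  "--output-mode", "file",
  "--image-responses", "omit",
  "--output-dir", "/tmp/playwright-mcp",
];

// Spawn the MCP server as a child process (wiring is illustrative).
function spawnPlaywrightMcp(): ChildProcess {
  return spawn("npx", args, { stdio: ["pipe", "pipe", "inherit"] });
}

// Reduce any tool result to its text items before it reaches the agent loop.
function textOnly(content: Array<{ type: string; text?: string }>): string {
  const parts: string[] = [];
  for (const item of content) {
    if (item.type === "text" && item.text) parts.push(item.text);
  }
  return parts.join("\n");
}
```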

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
