ACCESSIBILITY SNAPSHOTS / NO PIXEL BLOAT / 40+ TURN SESSIONS

Chrome browser automation that does not eat your context

Every tutorial on this topic tells you to pick a framework, write locators, and run the script. That is fine until the thing driving Chrome is an LLM agent, at which point the whole game becomes context economics. A single retina screenshot is roughly half a megabyte of base64, so a tab that needs ten visual confirmations collapses the model's 200K window before the task finishes. Fazm ships three hard defenses against that problem in one directory, /tmp/playwright-mcp. This page walks through all three, with line numbers.

Matthew Diakonov
11 min read
Written from the Fazm source tree
Accessibility snapshots, not screenshots
Playwright MCP in extension mode
sips auto-resize watcher
20 image turn per-session cap
Open source plumbing

THE PROBLEM NO TUTORIAL ADMITS

An LLM driving Chrome is bottlenecked on context, not on Chrome

When the driver is a human with a Playwright script, the cost of knowing what is on the page is zero; the script already knows what it is looking for. When the driver is an LLM agent, the cost is non-zero and paid in tokens. Every time the agent wants to confirm what is on the current tab, one of two things happens. Either the tool sends back a visual, which on a retina Mac is a PNG 2880 pixels wide encoded as base64 inside a tool result that can be 500 kilobytes on the wire. Or the tool sends back a structured description, which is 3 to 30 kilobytes as YAML.

Claude Sonnet has a 200K token window. That is large. It is also consumed quickly by half-megabyte screenshots. A task that involves ten page confirmations, plus the agent's own thinking and replies, plus the system prompt, plus the tool schemas, is cutting it close. A task that involves forty page confirmations never finishes.
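To make the gap concrete, here is a back-of-envelope comparison of raw tool-result payload for the two approaches. It is a rough sketch, not a measurement: the 375,000-byte PNG is an assumed size chosen so its base64 form lands on the "roughly half a megabyte" figure, the 30 KB snapshot is the top of the range quoted above, and it counts payload bytes only, not how any particular API converts images into tokens.

```typescript
// Back-of-envelope payload comparison. All sizes are illustrative assumptions
// based on the figures quoted in the text, not measurements.

/** Number of characters in the padded base64 encoding of `rawBytes` bytes. */
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

/** Total payload in kilobytes (1 KB = 1000 bytes) for `turns` page confirmations. */
function sessionPayloadKB(turns: number, perTurnBytes: number): number {
  return (turns * perTurnBytes) / 1000;
}

const SCREENSHOT_RAW_BYTES = 375_000; // assumed size of one retina PNG on disk
const SNAPSHOT_BYTES = 30_000;        // upper end of the 3 to 30 KB YAML range

const screenshotBase64 = base64Length(SCREENSHOT_RAW_BYTES); // 500_000 characters

for (const turns of [10, 40]) {
  const asImages = sessionPayloadKB(turns, screenshotBase64);
  const asYaml = sessionPayloadKB(turns, SNAPSHOT_BYTES);
  console.log(`${turns} turns: ${asImages} KB as screenshots vs ${asYaml} KB as YAML`);
}
```

Even with the YAML size pinned at its worst case, the screenshot path carries more than an order of magnitude more payload per confirmation.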

Fazm's chrome browser automation assumes this economy. The default tool call is browser_snapshot, which returns an ARIA tree as YAML and writes it to disk. The visual tool, browser_take_screenshot, is reserved for cases where the structure actually fails to describe the page. That one substitution, made at the MCP wiring layer rather than asked of the agent, is the difference between a session that completes and a session that does not.

1920px: MAX_SCREENSHOT_DIM pixels
20: MAX_IMAGE_TURNS per session
3: Defenses in /tmp/playwright-mcp
~500KB: base64 a single retina PNG costs

THE ANCHOR FACT

One filesystem watcher, one macOS binary, one dimension cap

The middle layer of the defense is a function called startScreenshotResizeWatcher, which opens at line 715 of acp-bridge/src/index.ts. It registers a Node watch() on /tmp/playwright-mcp and runs sips (the image utility built into every Mac since 10.3) against any PNG or JPEG Playwright writes there. If either pixel dimension exceeds 1920px, the file is re-encoded in place with sips --resampleHeightWidthMax 1920. No libraries, no dependencies, and no chance of the model receiving a screenshot that Claude's API would reject for being over its 2000-pixel-per-side limit.

acp-bridge/src/index.ts (lines 709 to 758, abridged)
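For readers who want the shape of the idea without opening the repository, here is a minimal sketch of a watcher like the one described above. It is written from the prose description on this page and is not the code from index.ts: the helper names (parseSipsDimensions, needsResize, isImage, startWatcherSketch) and the log text are my own, and it assumes a Mac with sips on the PATH.

```typescript
import { watch } from "node:fs";
import { execFile } from "node:child_process";
import { join } from "node:path";

const WATCH_DIR = "/tmp/playwright-mcp";
const MAX_SCREENSHOT_DIM = 1920;

type Dimensions = { width: number; height: number };

/** Parse the text printed by `sips -g pixelWidth -g pixelHeight <file>`. */
function parseSipsDimensions(output: string): Dimensions | null {
  const w = /pixelWidth:\s*(\d+)/.exec(output);
  const h = /pixelHeight:\s*(\d+)/.exec(output);
  if (!w || !h) return null;
  return { width: Number(w[1]), height: Number(h[1]) };
}

/** True when either side is larger than the cap. */
function needsResize(dim: Dimensions, max: number = MAX_SCREENSHOT_DIM): boolean {
  return dim.width > max || dim.height > max;
}

/** Only PNG and JPEG files are candidates; YAML snapshots are ignored. */
function isImage(filename: string): boolean {
  return /\.(png|jpe?g)$/i.test(filename);
}

/** Query the size with sips, then re-encode in place if the image is oversized. */
function resizeIfOversized(file: string): void {
  execFile("sips", ["-g", "pixelWidth", "-g", "pixelHeight", file], (err, stdout) => {
    if (err) return; // file vanished or is not readable yet
    const dim = parseSipsDimensions(stdout);
    if (!dim || !needsResize(dim)) return;
    execFile("sips", ["--resampleHeightWidthMax", String(MAX_SCREENSHOT_DIM), file], (resizeErr) => {
      if (!resizeErr) console.log(`resized ${file}`);
    });
  });
}

/** Hypothetical entry point: watch the directory and check every image event. */
function startWatcherSketch(dir: string = WATCH_DIR) {
  return watch(dir, (_event, filename) => {
    if (filename && isImage(filename)) resizeIfOversized(join(dir, filename));
  });
}
```

Rewriting the file in place fires another watch event, but the second pass finds both sides at or under the cap and returns early, so the sketch does not loop.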

THE THREE LAYERS

What actually protects the context window

Layer 1: Strip inline image responses at spawn time

When the Playwright MCP server is launched, the bridge appends --image-responses omit. That flag tells Playwright to route any screenshot it generates to disk and return only the file path in the tool result, never the inline base64. A tool result that used to be 500 kilobytes becomes a 60-character path. The agent can still open the file if it needs to, but the default path does not burn context on pixels the model never looked at.

Layer 2: Auto-resize anything that does end up on disk

When the agent does call browser_take_screenshot, the PNG lands in /tmp/playwright-mcp at whatever resolution Chrome produces. On retina Macs that is 2880 pixels wide, which fails Claude's 2000-pixel-per-side API check. startScreenshotResizeWatcher catches each new file within 200 milliseconds and runs sips --resampleHeightWidthMax 1920 on it before the agent ever references it.

Layer 3: Hard cap at 20 image turns per session

Claude's API tightens its image budget as a session accumulates history. The bridge tracks a per-session counter in a Map and stops passing screenshots through once MAX_IMAGE_TURNS = 20 is hit. After that, the session continues on accessibility snapshots only. Fresh sessions start with a clean counter.
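A per-session cap of this kind is small enough to sketch in a few lines. This is an illustration of the technique rather than the bridge's actual code: the function names tryConsumeImageTurn and deleteSession are hypothetical, and only the constant name and its value of 20 come from the source.

```typescript
const MAX_IMAGE_TURNS = 20;

// Image-bearing turns used so far, keyed by session id.
const imageTurns = new Map<string, number>();

/**
 * Returns true and counts the turn if the session is still under the cap.
 * Returns false once the cap is reached, so the caller can drop the image
 * and carry on with accessibility snapshots only.
 */
function tryConsumeImageTurn(sessionId: string): boolean {
  const used = imageTurns.get(sessionId) ?? 0;
  if (used >= MAX_IMAGE_TURNS) return false;
  imageTurns.set(sessionId, used + 1);
  return true;
}

/** Deleting a session drops its counter, so a new session starts clean. */
function deleteSession(sessionId: string): void {
  imageTurns.delete(sessionId);
}
```

Keeping the counter in a Map keyed by session id means one runaway session cannot spend another session's image budget.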

Why all three are needed

Layer 1 handles the common case (agent asks for the page state, gets a tiny YAML back). Layer 2 handles the edge case (agent explicitly asks for a screenshot and Chrome returns a 2880-wide PNG). Layer 3 handles the degenerate case (an agent that, for whatever reason, keeps asking for screenshots even when snapshots would do). Each one on its own leaks; stacked, they make 40+ turn chrome browser automation reliable.

MAX_SCREENSHOT_DIM = 1920 · MAX_IMAGE_TURNS = 20 · --image-responses omit · --output-mode file · /tmp/playwright-mcp · browser_snapshot · browser_click(ref=e12) · sips --resampleHeightWidthMax · ARIA role:button · PLAYWRIGHT_USE_EXTENSION=true · testPlaywrightConnection() · Playwright MCP Bridge · extensionId=mmlmfjhmonkocbjadbfplnigmagldckm

WHAT THE DIRECTORY LOOKS LIKE MID-SESSION

A real /tmp/playwright-mcp during a 30-step task

This is what the working directory actually looks like part-way through a chrome browser automation task that involves filling a form, waiting for a redirect, confirming a modal, and pulling a value out of a rendered chart. Note the ratio: lots of tiny YAML snapshots, a handful of PNGs, every PNG already logged as resized.

inside /tmp/playwright-mcp during an active session

THE AGENT LOOP

Three tool calls per step, forever

On top of the image pipeline sits a repeating loop that every chrome browser automation step goes through. The agent asks Chrome what is on the page, reads the structured answer, identifies the element it needs, and acts on it. No selectors, no XPath, no CSS classes. The only interface is the ARIA tree, and refs change on every snapshot so the loop is forced to re-verify before every action.

Snapshot ref click loop, one step

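Participants: Agent, Playwright MCP, /tmp/playwright-mcp, Your Chrome

  • Agent to Playwright MCP: browser_snapshot()
  • Playwright MCP to Your Chrome: read ARIA tree from tab
  • Your Chrome to Playwright MCP: tree with role+name+ref
  • Playwright MCP to /tmp/playwright-mcp: write 1747...yml
  • Playwright MCP to Agent: path to yml (no inline image)
  • Agent: scan yml for [ref=eN] matching intent
  • Agent to Playwright MCP: browser_click(ref=e12)
  • Playwright MCP to Your Chrome: dispatch click to element e12
  • Your Chrome to Playwright MCP: ok
  • Playwright MCP to Agent: ok (agent plans next step)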
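One step of that loop can be sketched as follows. Treat it as a simplified illustration under stated assumptions: BrowserTools is a hypothetical stand-in for the real MCP tool calls (here snapshot() hands back the YAML text directly instead of a file path), and the line pattern assumes snapshot entries shaped like - button "Submit" [ref=e12], which may not cover every entry a real snapshot contains.

```typescript
/** Hypothetical minimal tool surface; not the real MCP client API. */
interface BrowserTools {
  snapshot(): Promise<string>; // YAML text of the current ARIA tree
  click(ref: string): Promise<void>;
}

/** Return the ref of the first entry whose role and accessible name both match. */
function findRef(snapshotYaml: string, role: string, name: string): string | undefined {
  for (const line of snapshotYaml.split("\n")) {
    // Assumed entry shape: - role "accessible name" [attr] [ref=eN]
    const m = /^\s*-\s*(\w+)\s+"([^"]*)".*\[ref=(e\d+)\]/.exec(line);
    if (m && m[1] === role && m[2] === name) return m[3];
  }
  return undefined;
}

/**
 * One snapshot-ref-click step. A fresh snapshot is taken every time because
 * refs from an earlier snapshot are not valid after the page changes.
 */
async function clickByRoleAndName(
  tools: BrowserTools,
  role: string,
  name: string,
): Promise<boolean> {
  const yaml = await tools.snapshot();
  const ref = findRef(yaml, role, name);
  if (ref === undefined) return false; // nothing matched; caller decides what to do next
  await tools.click(ref);
  return true;
}
```

In the real loop the model does the matching by reading the YAML itself; the regular expression here only stands in for that judgment so the control flow is visible.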

HOW THE PIECES FIT

What feeds /tmp/playwright-mcp, and who reads it

The working directory is the single integration point. Chrome tabs feed it (through the Playwright MCP Bridge extension if installed, or through a Playwright-launched Chromium otherwise), the resize watcher normalizes it, and the agent reads from it. Every component in the chain speaks files, not in-memory objects, which is why the context-saving is durable even across restarts.

Data flow into the working directory

Chrome tab A, Chrome tab B, Chrome tab C -> /tmp/playwright-mcp -> Fazm agent, sips watcher, Your log

AGAINST THE GRAIN OF EVERY OTHER GUIDE

Agent-driven vs script-driven, compared honestly

Most writing on this topic walks through picking a framework (Selenium, Puppeteer, Playwright), installing its driver, writing selectors, and running the script. That is the right answer when the driver is a human who knows exactly what they want to automate. It is the wrong answer when the driver is an LLM that needs to figure out what it wants mid-task.

Feature | Scripted framework (Puppeteer, Selenium, bare Playwright) | Fazm (agent driven)
How the driver identifies elements | CSS selectors, XPath, or data-testid hand-authored per site | Numbered ARIA refs from the accessibility tree on every snapshot
How a page describes itself to the driver | It doesn't; the script already knows what it is looking for | YAML snapshot written to /tmp/playwright-mcp, read by the agent
What a site redesign costs | Broken selectors; you rewrite the script | Nothing; the agent reads the new ARIA tree and adapts
Default context budget per step | N/A (no model in the loop) | Kilobytes (YAML) not megabytes (PNG base64)
When it actually needs a screenshot | Rarely; the script knows where to click | Routed to disk, auto-resized to 1920px, capped at 20 per session
Where the task can continue outside the browser | It can't; scripts are browser-bound | macos-use MCP takes over for any Mac app with an accessibility tree
What the user writes | Hundreds of lines of code per flow | A single English sentence per flow

THE ONE-LINE PROOF

The shortest chrome browser automation prompt Fazm can run

The connection check Fazm sends after you paste the Playwright MCP Bridge token is a one-shot version of the same loop. It ships as a real prompt, not a marketing demo. You can read it in Desktop/Sources/Chat/ACPBridge.swift at line 1503 inside a function called testPlaywrightConnection.

Desktop/Sources/Chat/ACPBridge.swift (lines 1503 to 1513)

That is the whole setup: one English instruction that names one tool, one systemPrompt that tells the agent how to respond, one round trip through the three-layer pipeline, and a boolean back. If it returns true, your Chrome is now drivable by an English sentence, with every turn protected by the same image budget guardrails as the connection test itself.

What you get from this architecture

  • Chrome automation that runs against the tab you are already signed into, with your real cookies and session
  • Tool results that fit in kilobytes of context, not megabytes
  • Automatic PNG downscaling the moment a screenshot lands on disk
  • A hard cap on image-bearing turns per session so one bad flow can't break the context budget
  • Zero selectors, zero XPath, zero data-testid maintenance
  • A fallback to a fresh Chromium when the extension is not installed, same image pipeline either way
  • An escape hatch to macos-use when the task leaves Chrome for Finder, Mail, or a native Mac app
  • Every constant and line number on this page is grep-able in the Fazm source tree

WHY THIS PAGE IS HARD TO CLONE

The numbers are real, and they are on disk

Every other guide about this is written from the outside, walking through a framework API. This one walks through three constants that decide whether a long chrome browser automation session actually completes: MAX_SCREENSHOT_DIM = 1920, MAX_IMAGE_TURNS = 20, and the --image-responses omit flag.

You do not need to trust any of this. Install Fazm, run a task that touches Chrome, then ls -lh /tmp/playwright-mcp. You will see mostly YAML and a few PNGs, and sips -g pixelWidth -g pixelHeight on any of those PNGs will report 1920 or less in both dimensions. That is chrome browser automation sized for an LLM, not for a human-written script.

Frequently asked questions

Why does chrome browser automation by an AI agent run out of context so quickly?

Because the default way to tell a model what is on the page is to hand it a screenshot, and a full-page PNG from a retina Mac is between 400 and 800 kilobytes as base64 inline in the model's tool result. Claude's 200K context window sounds large, but after a handful of screenshots plus the model's own replies plus the system prompt, you are out. Fazm avoids this entirely by driving Chrome through its accessibility tree instead of its pixels. The Playwright MCP server is launched with --output-mode file --image-responses omit, which routes snapshots to disk as YAML and strips any inline base64 image responses before they reach the model. You can see it in acp-bridge/src/index.ts on line 1033.

What actually lives in /tmp/playwright-mcp and why does it matter?

That directory is the working set for every chrome browser automation turn. When the agent asks Chrome what is on the screen, Playwright writes a YAML file there like 1747881132103_snapshot.yml. The YAML is the page's ARIA tree, flattened: each interactive element has a role, an accessible name, and a numbered ref like [ref=e47]. The agent reads the YAML, finds the ref for the thing it wants to click, and calls browser_click with that ref. When Playwright does produce a PNG — usually for a browser_take_screenshot call — it lands in the same directory. Fazm watches the directory on boot (acp-bridge/src/index.ts line 715, startScreenshotResizeWatcher) and re-encodes anything over 1920 pixels with the built-in macOS sips binary before the model ever sees it.

What is MAX_IMAGE_TURNS = 20 and why is it there?

It is the third line of defense. Even with screenshots written to disk instead of returned inline, the model will occasionally request a real image (a captcha, a chart, a page that does not expose a good accessibility tree). Claude's API applies a stricter 2000-pixel-per-side limit to images once a session has served many of them, and after enough image turns the session fails in ways that are hard to recover from. The constant lives at acp-bridge/src/index.ts line 793 as MAX_IMAGE_TURNS = 20. After twenty image-bearing turns in a given session, the bridge suppresses further screenshots so the agent can keep driving the tab on accessibility snapshots alone. Sessions reset the counter when they are deleted.

How is this different from writing a Puppeteer or Selenium script?

A script is deterministic and brittle. You write CSS selectors, the site redesigns in three months, your script breaks. Chrome browser automation by an agent is the opposite shape: the agent reads the live accessibility tree and picks the right element on the fly. There is no selector to maintain. Fazm takes that further by letting you type the goal in English, running Playwright MCP in extension mode so the automation happens in the Chrome you are already signed into, and scoping the same agent loop beyond Chrome (any macOS app with an accessibility tree) through a second MCP called macos-use. If you are coming from Puppeteer, you are not replacing your script, you are replacing the act of writing scripts.

Does Fazm actually use screenshots at all, or is everything accessibility snapshots?

It prefers accessibility snapshots. browser_snapshot is the first call on every turn and it returns the ARIA tree as YAML written to disk. When the agent explicitly needs a visual (for example, to verify a chart actually rendered, or to pass a visual captcha, or when a page sets aria-hidden on everything interactive so the snapshot is useless), it can call browser_take_screenshot, which produces a PNG in the same directory. The screenshot resize watcher kicks in the moment the file appears: it reads pixelWidth and pixelHeight with sips -g pixelWidth -g pixelHeight, and if either dimension exceeds 1920 it runs sips --resampleHeightWidthMax 1920 in place. That keeps the image under Claude's 2000-pixel-per-side API limit without the agent having to plan around it.

What is the 'snapshot-ref-click' loop exactly?

Three tool calls per step. One: browser_snapshot writes the YAML tree to /tmp/playwright-mcp as a file. Two: the agent reads the file, scans for the element whose role and accessible name match its intent, and pulls out the ref (for example [ref=e12]). Three: the agent calls browser_click with that ref, or browser_type if it needs to fill a field, or browser_navigate if it needs a new URL. Because refs are stable within a snapshot but re-issued on the next snapshot, the loop has to take a fresh snapshot after any action that mutates the page. The upside is that the agent never has to reason about pixel coordinates, CSS classes, or XPath; those are all problems of the snapshot, not the agent. Fazm's connection test, which lives at Desktop/Sources/Chat/ACPBridge.swift line 1503, is literally the minimal version of this loop: one browser_snapshot call, one pass or fail.

If the extension is not installed, what does Fazm do instead?

It falls back to the standard Playwright chromium launch. The gate is the environment variable PLAYWRIGHT_USE_EXTENSION, read at acp-bridge/src/index.ts line 1029. When it is the string 'true', the bridge appends --extension to the Playwright args, which tells Playwright MCP to connect to the Playwright MCP Bridge Chrome extension (ID mmlmfjhmonkocbjadbfplnigmagldckm). When it is unset or false, Playwright launches its own Chromium instance. Both code paths use the same --output-mode file --image-responses omit plumbing, so the three-layer image defense works identically either way. The difference is whether the agent is driving the Chrome you are already signed into (extension mode) or a fresh Chromium with empty cookies (default mode).
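The gate described above amounts to a few lines of argument building. This is a sketch of the logic as this answer describes it, not a copy of the bridge code: buildPlaywrightArgs is a hypothetical name, and only the flags and the environment variable come from the text.

```typescript
/**
 * Build the extra CLI flags for the Playwright MCP server.
 * The file-output flags are always present; --extension is added only when
 * PLAYWRIGHT_USE_EXTENSION is exactly the string "true".
 */
function buildPlaywrightArgs(env: Record<string, string | undefined>): string[] {
  const args = ["--output-mode", "file", "--image-responses", "omit"];
  if (env.PLAYWRIGHT_USE_EXTENSION === "true") {
    args.push("--extension");
  }
  return args;
}

// Typical call site (Node): buildPlaywrightArgs(process.env)
```

Because the image flags sit outside the conditional, both the extension path and the fresh-Chromium path get the same file-based output.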

How does this affect tasks that really do need to look at a screenshot?

They still work. The three layers are stacking defenses, not a ban. --image-responses omit strips inline base64 from MCP responses — the screenshot still exists as a PNG on disk, referenced by path in the tool output. The resize watcher scales it down if it is oversized, so Claude's API does not reject it. MAX_IMAGE_TURNS caps how many image-bearing turns per session. Together they mean a 40-step chrome browser automation task spends its context budget on accessibility snapshots (cheap, structured, tiny) and only burns image budget when the task genuinely needs pixels. That is the asymmetry that makes long tasks complete.

Is this specific to Chrome, or does it work for other browsers?

The extension side is Chrome-only right now. The Playwright MCP Bridge is a Chrome Web Store extension, so the 'attach to my real browser' mode requires Chrome or a Chromium fork that can install Chrome Web Store extensions. The fallback path uses Playwright's default chromium download and works on any Mac. If your task involves a different browser (Safari, Firefox) Fazm switches to the macos-use MCP and drives that browser through the macOS accessibility tree instead, which is a different code path but the same underlying 'no screenshots, structured snapshots only' philosophy.

How do I verify any of this on my own machine?

Install Fazm, grant it the Accessibility permission, install the Playwright MCP Bridge extension, and run any task that touches a Chrome tab. While it is running, ls /tmp/playwright-mcp; you will see a growing set of YAML files (one per snapshot) and a smaller number of PNGs (only when the agent asked for an image). Run sips -g pixelWidth -g pixelHeight on any PNG in there; none will exceed 1920 in either dimension. Open the Fazm log (~/Library/Logs/Fazm/app.log on a packaged build, /tmp/fazm-dev.log on a dev build) and grep for 'Screenshot resized'; you will see the watcher logging each re-encoding. With the source checked out, you can also read the exact code in acp-bridge/src/index.ts lines 709 to 758.
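If you would rather script the first check than eyeball the ls output, a small tally of the directory does the job. This is a convenience sketch and not part of Fazm: summarize and summarizeDir are hypothetical helpers, and the extension patterns are my assumption about what lands in the directory.

```typescript
import { readdirSync } from "node:fs";

type Tally = { snapshots: number; images: number; other: number };

/** Count YAML snapshots and PNG/JPEG images in a list of file names. */
function summarize(filenames: string[]): Tally {
  const tally: Tally = { snapshots: 0, images: 0, other: 0 };
  for (const name of filenames) {
    if (/\.ya?ml$/i.test(name)) tally.snapshots += 1;
    else if (/\.(png|jpe?g)$/i.test(name)) tally.images += 1;
    else tally.other += 1;
  }
  return tally;
}

/** Tally the working directory; throws if the directory does not exist yet. */
function summarizeDir(dir: string = "/tmp/playwright-mcp"): Tally {
  return summarize(readdirSync(dir));
}
```

Calling summarizeDir() part-way through a task should show the snapshot count well ahead of the image count if the pipeline is behaving as described.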