Browser Automation: Accessibility Snapshots vs Screenshots - Saving Tokens by Skipping Pixels

Matthew Diakonov

Updated March 19, 2026

browser-automation accessibility tokens optimization playwright

When we first built browser automation into our agent, we did what everyone does: take a screenshot, send it to a vision model, and ask "what do you see?" It worked. It was also absurdly expensive.

A single screenshot consumes thousands of tokens when processed by a vision model. A typical web task might need 10 to 20 screenshots. That is tens of thousands of tokens burned just on understanding what is on the screen, before the model even starts reasoning about what to do.

The Accessibility Snapshot Approach

Playwright MCP introduced us to accessibility snapshots. Instead of capturing pixels, you capture the accessibility tree - a structured representation of every element on the page. Buttons, links, text fields, headings, and their labels all come through as clean structured data.
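To make "clean structured data" concrete, here is a minimal sketch of what a snapshot can look like once it is flattened into text for the model. The node shape (`role`, `name`, `children`), the sample page, and the `render_snapshot` helper are all assumptions made for illustration, not the exact format Playwright MCP emits.

```python
from typing import Any

# Hypothetical node shape: {"role": str, "name": str, "children": [...]}.
# Real snapshot formats differ; this only illustrates the idea.


def render_snapshot(node: dict[str, Any], depth: int = 0) -> str:
    """Render an accessibility-style tree as an indented text outline."""
    label = node["role"]
    if node.get("name"):
        label += f' "{node["name"]}"'
    lines = ["  " * depth + "- " + label]
    for child in node.get("children", []):
        lines.append(render_snapshot(child, depth + 1))
    return "\n".join(lines)


# A made-up sign-in page, described by roles and labels instead of pixels.
page = {
    "role": "document",
    "name": "Sign in",
    "children": [
        {"role": "heading", "name": "Welcome back"},
        {"role": "textbox", "name": "Email"},
        {"role": "textbox", "name": "Password"},
        {"role": "button", "name": "Sign in"},
        {"role": "link", "name": "Forgot password?"},
    ],
}

print(render_snapshot(page))
```

The whole page comes out as six short lines of text, which is the kind of input a language model can read directly without a vision pass.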

An accessibility snapshot for a typical page takes a few hundred tokens, while a screenshot of the same page takes several thousand. The savings are dramatic: often 10x or more per step.
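The arithmetic behind that comparison is easy to sketch. The per-step token counts below are illustrative assumptions picked to match "a few hundred" and "several thousand", not measurements, and `observation_cost` is a hypothetical helper.

```python
# All figures here are illustrative assumptions, not measurements.
SCREENSHOT_TOKENS = 3000  # assumed stand-in for "several thousand" tokens
SNAPSHOT_TOKENS = 300     # assumed stand-in for "a few hundred" tokens


def observation_cost(steps: int, tokens_per_step: int) -> int:
    """Tokens spent just on observing the page over a whole task."""
    return steps * tokens_per_step


for steps in (10, 20):
    shots = observation_cost(steps, SCREENSHOT_TOKENS)
    snaps = observation_cost(steps, SNAPSHOT_TOKENS)
    print(f"{steps} steps: {shots} vs {snaps} tokens ({shots // snaps}x)")
```

Under these assumed numbers, a 10 to 20 step task drops from tens of thousands of observation tokens to a few thousand. Your own ratio will depend on the page and the model.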

Why It Works Better

Structured data is not just cheaper. It is also more reliable. A vision model looking at a screenshot might misread a button label or miss a small link. An accessibility snapshot gives you the exact text of every interactive element, along with its role and state.

This means fewer errors, fewer retries, and faster task completion. The agent spends less time figuring out what is on the page and more time deciding what to do next.
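Because every element arrives with its exact label, role, and state, the agent can look things up instead of guessing from pixels. The sketch below assumes a hypothetical dictionary-based tree with `role`, `name`, `disabled`, and `children` keys; `walk` and `find` are illustrative helpers, not part of any library.

```python
from typing import Any, Iterator, Optional

Node = dict[str, Any]  # hypothetical node shape, assumed for this sketch


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node with an exact role and name match, if any."""
    for node in walk(root):
        if node.get("role") == role and node.get("name") == name:
            return node
    return None


# A made-up checkout form whose submit button is currently disabled.
form = {
    "role": "form",
    "name": "Checkout",
    "children": [
        {"role": "textbox", "name": "Card number"},
        {"role": "button", "name": "Place order", "disabled": True},
    ],
}

button = find(form, "button", "Place order")
if button is not None and not button.get("disabled", False):
    print("click Place order")
else:
    print("button missing or disabled, fill in the form first")
```

An exact match on role and name either succeeds or fails cleanly, which is where the drop in misreads and retries comes from.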

When Screenshots Still Win

Accessibility snapshots do not capture visual layout or styling. If the task requires understanding how something looks - like verifying a design or reading a chart - you still need screenshots. The practical approach is to default to accessibility snapshots and fall back to screenshots only when visual context is genuinely needed.
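One way to encode that default-and-fall-back policy is a small routing function. This is a sketch under loud assumptions: the `Observation` enum, the `VISUAL_HINTS` keyword list, and `choose_observation` are hypothetical, and a keyword match is only a crude placeholder for however your agent actually decides that visual context is needed.

```python
from enum import Enum


class Observation(Enum):
    SNAPSHOT = "accessibility_snapshot"
    SCREENSHOT = "screenshot"


# Hypothetical hint words; a real agent might let the model decide instead.
VISUAL_HINTS = ("design", "layout", "chart", "color", "looks")


def choose_observation(task: str) -> Observation:
    """Default to the cheap snapshot; use pixels only for visual tasks."""
    text = task.lower()
    if any(hint in text for hint in VISUAL_HINTS):
        return Observation.SCREENSHOT
    return Observation.SNAPSHOT


print(choose_observation("Fill in the signup form").value)
print(choose_observation("Read the revenue chart").value)
```

The important design choice is the direction of the default: the snapshot is the normal path, and the screenshot has to be justified by the task.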

The lesson is simple: do not send pixels when structured data will do. Your token budget will thank you.

Fazm is an open source macOS AI agent, available on GitHub.
