Benchmarked 4 AI Browser Tools - Native APIs Are More Token-Efficient

Fazm Team · 3 min read

Benchmarked 4 AI Browser Tools for Token Efficiency

Not all browser automation approaches cost the same. We tested four common methods for AI browser control and measured token consumption per task. The results strongly favor native accessibility APIs over screenshot-based approaches.

The Four Approaches

  1. Screenshot + Vision Model - Take a screenshot, send it to a vision-capable LLM, get back coordinates to click
  2. DOM Extraction - Pull the full DOM tree, send it as text to the LLM, get back CSS selectors
  3. Accessibility Tree - Read the browser's accessibility tree, send a structured representation to the LLM
  4. Hybrid (Accessibility + Selective Screenshots) - Use the accessibility tree for most interactions, fall back to screenshots only when needed

Token Consumption per Task

For a standard task - "log into a website, navigate to settings, change a preference" - the token counts varied dramatically:

  • Screenshot-based: ~15,000-25,000 tokens per step (image tokens are expensive, and you need a new screenshot after every action)
  • Full DOM: ~8,000-40,000 tokens per step (depends heavily on page complexity - a simple form is fine, a complex SPA can blow past context limits)
  • Accessibility tree: ~1,500-4,000 tokens per step (structured, compact, includes only interactive elements)
  • Hybrid: ~2,000-5,000 tokens per step (accessibility tree baseline plus occasional screenshot)

Over a 10-step task, that works out to roughly 150,000 tokens for the screenshot-based approach (the low end of its range) versus about 25,000 tokens for the accessibility tree (near the middle of its range). That is a 6x cost difference.
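The 10-step totals can be reproduced from the per-step ranges listed above. This is a minimal sketch, assuming every step costs the same and ignoring any conversation history that accumulates between steps; the function and dictionary names are illustrative only.

```python
# Per-step token ranges (low, high) as quoted in the list above.
PER_STEP_TOKENS = {
    "screenshot": (15_000, 25_000),
    "full_dom": (8_000, 40_000),
    "accessibility_tree": (1_500, 4_000),
    "hybrid": (2_000, 5_000),
}


def task_token_range(approach: str, steps: int) -> tuple[int, int]:
    """Scale the per-step range linearly to a task of `steps` steps.

    Assumption: cost per step is constant across the task.
    """
    low, high = PER_STEP_TOKENS[approach]
    return low * steps, high * steps


if __name__ == "__main__":
    for approach in PER_STEP_TOKENS:
        low, high = task_token_range(approach, steps=10)
        print(f"{approach}: {low:,}-{high:,} tokens")
```

Under these assumptions a 10-step screenshot run lands between 150,000 and 250,000 tokens and an accessibility-tree run between 15,000 and 40,000, so the 150,000 and 25,000 figures quoted above both fall inside their respective ranges.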

Why Accessibility APIs Win

The accessibility tree contains exactly what an agent needs: interactive elements, their labels, their states, and their relationships. A screenshot contains everything visible on the page - text, images, decorations, ads - and the model has to parse all of it.

Think of it this way: reading a menu versus looking at a photo of a menu. The structured data is faster to process and less ambiguous.
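To make the "menu versus photo of a menu" point concrete, here is a rough sketch of turning an accessibility tree into the compact text an agent might send to the model. The node format (dicts with role, name, states, and children keys) and the list of interactive roles are assumptions chosen for illustration, not the output of any specific browser or OS API.

```python
# Hypothetical set of roles treated as interactive; real trees expose more.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def flatten_interactive(node, out=None):
    """Walk the tree depth-first and keep one short line per interactive node."""
    if out is None:
        out = []
    if node.get("role") in INTERACTIVE_ROLES:
        line = f'{node["role"]} "{node.get("name", "")}"'
        states = ", ".join(node.get("states", []))
        if states:
            line += f" ({states})"
        out.append(line)
    for child in node.get("children", []):
        flatten_interactive(child, out)
    return out


# A made-up page: decorative content is present in the tree but dropped.
page = {
    "role": "document",
    "name": "Settings",
    "children": [
        {"role": "image", "name": "Promotional banner"},
        {"role": "paragraph", "name": "Manage your preferences here."},
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Submit", "states": ["enabled"]},
    ],
}

if __name__ == "__main__":
    print("\n".join(flatten_interactive(page)))
```

The banner and paragraph never reach the model; only the two elements the agent can act on do, each with its label and state.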

Reliability Advantage

Beyond token efficiency, accessibility APIs are more reliable. They return semantic labels ("Submit button, enabled") rather than pixel coordinates that break on window resize. The agent can reference elements by name rather than position, making actions reproducible across sessions.
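The difference between addressing an element by name and by position can be sketched as follows. The tree shape, the bounds field, and the make_page helper are hypothetical and exist only to simulate the same page at two window sizes.

```python
def find_element(node, role, name):
    """Depth-first search for the first node matching role and accessible name."""
    if node.get("role") == role and node.get("name") == name:
        return node
    for child in node.get("children", []):
        match = find_element(child, role, name)
        if match is not None:
            return match
    return None


def make_page(button_x):
    """Build a made-up page whose Submit button sits at a given x position."""
    return {
        "role": "document",
        "name": "Settings",
        "children": [
            {
                "role": "button",
                "name": "Submit",
                "states": ["enabled"],
                "bounds": (button_x, 300, 80, 32),  # x, y, width, height
            },
        ],
    }


# The same page rendered in a wide and a narrow window.
wide = make_page(button_x=900)
narrow = make_page(button_x=420)

if __name__ == "__main__":
    for page in (wide, narrow):
        button = find_element(page, "button", "Submit")
        print(button["name"], button["bounds"])
```

A click recorded at x=900 would miss the button in the narrow window, while the lookup by role and name finds it in both.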

The tradeoff: some web applications have poor accessibility markup. For those, screenshots remain a necessary fallback. The hybrid approach handles this gracefully - try accessibility first, fall back to vision only when the tree is insufficient.
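One way to decide when the tree is "insufficient" is to check how many interactive elements actually carry a label. The heuristic below is an assumption made for illustration: the article does not say how the fallback is triggered, and the 0.8 threshold and function names are invented.

```python
def labeled_fraction(elements):
    """Share of interactive elements that have a non-empty accessible name."""
    if not elements:
        return 0.0
    labeled = sum(1 for e in elements if (e.get("name") or "").strip())
    return labeled / len(elements)


def choose_observation(elements, min_labeled_fraction=0.8):
    """Prefer the accessibility tree; fall back to a screenshot when labels are sparse.

    Assumption: `elements` is a list of dicts for interactive nodes only.
    An empty list is treated as an unusable tree.
    """
    if labeled_fraction(elements) >= min_labeled_fraction:
        return "accessibility_tree"
    return "screenshot"


if __name__ == "__main__":
    well_marked = [
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Save"},
    ]
    poorly_marked = [
        {"role": "button", "name": ""},
        {"role": "button"},
        {"role": "link", "name": "Help"},
    ]
    print(choose_observation(well_marked))
    print(choose_observation(poorly_marked))
```

Because the check runs on data the agent has already fetched, the screenshot cost is only paid on pages where the cheaper representation is not good enough.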

Fazm is an open source macOS AI agent, available on GitHub.
