Benchmarked 4 AI Browser Tools - Native APIs Are More Token-Efficient
Not all browser automation approaches cost the same. We tested four common methods for AI browser control and measured token consumption per task. The results strongly favor native accessibility APIs over screenshot-based approaches.
The Four Approaches
- Screenshot + Vision Model - Take a screenshot, send it to a vision-capable LLM, get back coordinates to click
- DOM Extraction - Pull the full DOM tree, send it as text to the LLM, get back CSS selectors
- Accessibility Tree - Read the browser's accessibility tree, send a structured representation to the LLM
- Hybrid (Accessibility + Selective Screenshots) - Use the accessibility tree for most interactions, fall back to screenshots only when needed
Token Consumption per Task
For a standard task - "log into a website, navigate to settings, change a preference" - the token counts varied dramatically:
- Screenshot-based: ~15,000-25,000 tokens per step (image tokens are expensive, and you need a new screenshot after every action)
- Full DOM: ~8,000-40,000 tokens per step (depends heavily on page complexity - a simple form is fine, a complex SPA can blow past context limits)
- Accessibility tree: ~1,500-4,000 tokens per step (structured, compact, includes only interactive elements)
- Hybrid: ~2,000-5,000 tokens per step (accessibility tree baseline plus occasional screenshot)
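Over a 10-step task, the screenshot-based approach costs at least 150,000 tokens (10 x 15,000 at the low end of its range), while the accessibility-tree approach costs about 25,000 tokens, assuming roughly 2,500 tokens per step, which sits inside its range. That is a 6x cost difference.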
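The 10-step totals follow directly from the per-step ranges listed above. Here is a minimal sketch of that arithmetic. It assumes cost scales linearly with the number of steps and ignores any growth in conversation history, and the names `PER_STEP_TOKENS` and `task_total` are made up for illustration.

```python
# Per-step token ranges (low, high) as quoted in the list above.
PER_STEP_TOKENS = {
    "screenshot": (15_000, 25_000),
    "full_dom": (8_000, 40_000),
    "accessibility_tree": (1_500, 4_000),
    "hybrid": (2_000, 5_000),
}


def task_total(approach: str, steps: int = 10) -> tuple[int, int]:
    """Return the (low, high) token total for a task with `steps` steps.

    Assumption: every step costs the same and nothing else accumulates.
    """
    low, high = PER_STEP_TOKENS[approach]
    return low * steps, high * steps


if __name__ == "__main__":
    for name in PER_STEP_TOKENS:
        low, high = task_total(name)
        print(f"{name:>20}: {low:>7,} - {high:>7,} tokens")
```

Comparing the low end of the screenshot range with a mid-range accessibility-tree figure gives the 6x gap. Comparing like with like (low against low, or high against high) gives a ratio of roughly 6x to 10x.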
Why Accessibility APIs Win
The accessibility tree contains exactly what an agent needs: interactive elements, their labels, their states, and their relationships. A screenshot contains everything visible on the page - text, images, decorations, ads - and the model has to parse all of it.
Think of it this way: reading a menu versus looking at a photo of a menu. The structured data is faster to process and less ambiguous.
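To make the "menu versus photo of a menu" point concrete, here is a rough sketch of how an accessibility tree might be flattened into a few lines of text for the model. The node shape (a dict with `role`, `name`, `states`, and `children`) and the `INTERACTIVE_ROLES` set are assumptions made for illustration, not the format of any particular browser or tool.

```python
# Hypothetical node shape:
#   {"role": str, "name": str, "states": [str], "children": [node, ...]}
# Roles treated as interactive here are an illustrative choice.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def flatten(node, lines=None):
    """Walk the tree depth-first and keep one line per interactive element."""
    if lines is None:
        lines = []
    if node.get("role") in INTERACTIVE_ROLES:
        label = f'{node["role"]} "{node.get("name", "")}"'
        states = ", ".join(node.get("states", []))
        lines.append(f"{label} ({states})" if states else label)
    for child in node.get("children", []):
        flatten(child, lines)
    return lines


page = {
    "role": "document",
    "name": "Settings",
    "children": [
        {"role": "heading", "name": "Preferences"},
        {"role": "checkbox", "name": "Dark mode", "states": ["unchecked"]},
        {
            "role": "group",
            "name": "",
            "children": [
                {"role": "button", "name": "Save", "states": ["enabled"]},
            ],
        },
    ],
}

if __name__ == "__main__":
    print("\n".join(flatten(page)))
```

The heading and the unnamed group are dropped, so the model sees two short lines instead of a full rendering of the page.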
Reliability Advantage
Beyond token efficiency, accessibility APIs are more reliable. They return semantic labels ("Submit button, enabled") rather than pixel coordinates that break on window resize. The agent can reference elements by name rather than position, making actions reproducible across sessions.
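The sketch below illustrates why a name-based reference survives a window resize while a coordinate does not. The `Element` class and `find` helper are hypothetical, and the two element lists stand in for the same page read at two window sizes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    role: str
    name: str
    enabled: bool
    x: int  # screen position, which changes with layout
    y: int


def find(elements: list[Element], role: str, name: str) -> Optional[Element]:
    """Look up an element by its semantic label instead of its position."""
    for el in elements:
        if el.role == role and el.name == name:
            return el
    return None


# The same page at two window sizes: the button moves, its label does not.
wide_layout = [Element("button", "Submit", True, 900, 640)]
narrow_layout = [Element("button", "Submit", True, 310, 980)]

if __name__ == "__main__":
    for layout in (wide_layout, narrow_layout):
        el = find(layout, "button", "Submit")
        print(f"{el.role} {el.name!r} at ({el.x}, {el.y})")
```

A click recorded as (900, 640) would miss the button in the narrow layout, whereas the `("button", "Submit")` lookup resolves in both.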
The tradeoff: some web applications have poor accessibility markup. For those, screenshots remain a necessary fallback. The hybrid approach handles this gracefully - try accessibility first, fall back to vision only when the tree is insufficient.
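One way the "accessibility first, vision as fallback" decision could look is sketched here. The sufficiency test (the share of elements that carry a non-empty name) and the 0.8 threshold are assumptions chosen for illustration. `read_tree` and `take_screenshot` are placeholder callables supplied by the caller, not functions from any real library.

```python
from typing import Callable


def tree_is_sufficient(elements: list[dict], min_labeled_ratio: float = 0.8) -> bool:
    """Treat the tree as usable when most elements have a non-empty name.

    Assumption: unlabeled elements are the main sign of poor markup.
    """
    if not elements:
        return False
    labeled = sum(1 for el in elements if (el.get("name") or "").strip())
    return labeled / len(elements) >= min_labeled_ratio


def observe(
    read_tree: Callable[[], list[dict]],
    take_screenshot: Callable[[], bytes],
) -> tuple[str, object]:
    """Return ("accessibility", elements) or ("screenshot", image_bytes)."""
    elements = read_tree()
    if tree_is_sufficient(elements):
        return "accessibility", elements
    # Only pay for image tokens when the tree cannot be trusted.
    return "screenshot", take_screenshot()


if __name__ == "__main__":
    good = [{"role": "button", "name": "Save"}, {"role": "link", "name": "Home"}]
    poor = [{"role": "button", "name": ""}, {"role": "button", "name": None}]
    print(observe(lambda: good, lambda: b"png-bytes")[0])
    print(observe(lambda: poor, lambda: b"png-bytes")[0])
```

The screenshot callable is never invoked when the tree passes the check, which is what keeps the hybrid figure close to the accessibility-tree baseline.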
Fazm is an open source macOS AI agent, available on GitHub.