
Accessibility APIs vs OCR - Two Approaches to Desktop Agent Vision

Fazm Team · 2 min read
accessibility-api · ocr · desktop-agent · vision · automation


When building a desktop agent that interacts with applications, the fundamental question is: how does the agent see what is on screen? There are two main approaches, and they are more different than they look.

The OCR/Screenshot Approach

Take a screenshot, run it through a vision model or OCR engine, and extract text and element positions from the pixels. This is how most early computer-use agents work - send a screenshot to a multimodal LLM and ask it to identify buttons, text fields, and other interactive elements.

The advantage is universality. It works with any application, any custom UI framework, any weird rendering engine. If a human can see it, the OCR approach can see it too.

The downside is speed and reliability. Screenshots are expensive to process, especially through a vision API. And pixel-based element detection is inherently fuzzy - you get approximate bounding boxes, not exact click targets. A button that says "Submit" next to another button that says "Cancel" can be hard to distinguish at lower resolutions.
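To make the fuzziness concrete, here is a minimal sketch of how an agent might turn OCR output into a click target. It assumes the OCR engine returns word-level bounding boxes as `(text, x, y, width, height)` tuples; the data layout and the `find_click_target` helper are illustrative, not any real engine's API.

```python
# Hypothetical sketch: turning OCR word boxes into click targets.
# Assumes the OCR engine returns (text, x, y, width, height) tuples;
# the field layout is illustrative, not a real engine's output format.

def find_click_target(ocr_boxes, label):
    """Return the center point of the first box whose text matches label."""
    for text, x, y, w, h in ocr_boxes:
        if text == label:
            # Best guess: click the center of the detected bounding box.
            return (x + w // 2, y + h // 2)
    return None  # Label not recognized in this screenshot

boxes = [
    ("Submit", 100, 400, 80, 30),
    ("Cancel", 200, 400, 80, 30),
]

print(find_click_target(boxes, "Submit"))  # (140, 415)
```

Even in this idealized form, the result is only as good as the detected box: if OCR merges "Submit" and "Cancel" into one region, or misreads the text, the click lands in the wrong place.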

The Accessibility API Approach

macOS, Windows, and Linux all expose accessibility APIs that give you the actual UI tree - every button, text field, label, and menu item with its exact position, role, and state. On macOS this is the Accessibility framework (AXUIElement), originally built for screen readers.

This gives you structured data instead of pixels. You know exactly where every element is, what it does, and whether it is enabled or focused. Clicking is precise because you have exact coordinates.
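The kind of query this enables can be sketched with a simple tree walk. The nested dicts below are a stand-in for real AXUIElement nodes, and the field names (`role`, `title`, `position`, `enabled`) are illustrative approximations of the attributes accessibility APIs expose.

```python
# Hypothetical sketch: querying a UI tree of the kind an accessibility
# API exposes. The dicts stand in for real AXUIElement nodes.

def find_elements(node, role):
    """Depth-first search for every element with the given role."""
    matches = []
    if node.get("role") == role:
        matches.append(node)
    for child in node.get("children", []):
        matches.extend(find_elements(child, role))
    return matches

window = {
    "role": "window", "children": [
        {"role": "button", "title": "Submit",
         "position": (100, 400), "enabled": True, "children": []},
        {"role": "button", "title": "Cancel",
         "position": (200, 400), "enabled": False, "children": []},
    ],
}

for b in find_elements(window, "button"):
    print(b["title"], b["position"], b["enabled"])
```

Note the contrast with the OCR path: "Submit" and "Cancel" are distinct nodes with exact coordinates and state, so there is nothing to disambiguate.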

The trade-off is coverage. Not every app exposes a complete accessibility tree. Custom-rendered content, such as canvas-based apps or games, is invisible to accessibility APIs. And some apps ship broken or incomplete accessibility metadata.

The Practical Answer

Use both. Start with accessibility APIs for the structural information - element positions, types, and states. Fall back to OCR when the accessibility tree is missing information or when you need to read content from custom-rendered views. The combination is more reliable than either approach alone.
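A minimal sketch of that fallback logic, reusing the same illustrative data shapes as above (a flat list of accessibility elements and OCR word boxes; the `locate` helper and its field names are assumptions, not a real API):

```python
# Hypothetical sketch of the hybrid strategy: prefer the accessibility
# tree, fall back to OCR when it comes up empty.

def locate(label, ax_elements, ocr_boxes):
    # 1. Exact hit from the accessibility tree, if the app exposes one.
    for el in ax_elements:
        if el.get("title") == label and el.get("enabled", False):
            return ("accessibility", el["position"])
    # 2. Fuzzy hit from OCR over a screenshot: click the box center.
    for text, x, y, w, h in ocr_boxes:
        if text == label:
            return ("ocr", (x + w // 2, y + h // 2))
    return None  # Not found by either method

ax_elements = [{"title": "Submit", "position": (140, 415), "enabled": True}]
ocr_boxes = [("Export", 300, 120, 90, 28)]  # custom-rendered toolbar text

print(locate("Submit", ax_elements, ocr_boxes))  # ('accessibility', (140, 415))
print(locate("Export", ax_elements, ocr_boxes))  # ('ocr', (345, 134))
```

The ordering matters: the structured lookup is cheap and exact, so the expensive, fuzzy pixel path only runs for the elements the accessibility tree cannot see.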

Fazm is an open source macOS AI agent, available on GitHub.
