Choosing Native Accessibility APIs Over OCR - The Decision Everyone Said Was Wrong

Fazm Team · 2 min read

Everyone Said OCR Was the Future

When we started building desktop automation, the industry consensus was clear: screenshot the screen, run OCR or a vision model, identify elements, click on coordinates. OpenAI was pushing this with its computer-use demos. Every blog post said vision-based agents were the path forward.

We went with native accessibility APIs instead. People thought we were making a mistake.

Why OCR Seemed Like the Right Choice

OCR-based approaches have an appealing property: they work on any visual interface. Take a screenshot, send it to a vision model, and you can theoretically control any application on any platform. No special API access needed.

This universality is seductive. Why limit yourself to apps that expose accessibility trees when you could control everything through screenshots?

Why OCR Falls Apart in Practice

Screenshots create a cascade of problems. First, every frame has to be interpreted by a vision model - expensive in tokens and slow in latency. Second, coordinates are fragile - a window resize or a display-scaling change breaks everything. Third, overlapping windows and notifications create ambiguity that vision models handle poorly.

But the biggest problem is confidence. When an OCR system identifies a "Submit" button, it's making a probabilistic guess. When the accessibility API returns a button with the label "Submit," it's returning a fact. The difference between 95% confidence and 100% confidence matters enormously when you're chaining 20 actions together.

The Compound Reliability Effect

If each step in a 20-step workflow has 95% reliability, the overall success rate is 36%. With 99% reliability per step, it's 82%. With 99.9%, it's 98%.
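The arithmetic above is just independent per-step success compounding multiplicatively; a minimal sketch (step count and rates taken from the text):

```python
# Probability that an n-step workflow succeeds when each step
# must succeed independently with probability p_step.
def workflow_success(p_step: float, n_steps: int = 20) -> float:
    return p_step ** n_steps

for p in (0.95, 0.99, 0.999):
    print(f"{p:.1%} per step -> {workflow_success(p):.0%} over 20 steps")
# 95.0% per step -> 36% over 20 steps
# 99.0% per step -> 82% over 20 steps
# 99.9% per step -> 98% over 20 steps
```

The independence assumption is a simplification - real failures correlate - but it captures why small per-step error rates dominate long workflows.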

Accessibility APIs give you near-perfect reliability per step. OCR gives you good-but-not-great reliability that compounds into frequent failures over multi-step workflows.

What We Gained

Speed: no screenshot-processing delay. Privacy: no images sent to cloud APIs. Reliability: semantic element references that survive UI changes. Lower cost: no vision-model tokens burned per action.

The one trade-off: we only work on macOS and iOS where accessibility APIs are excellent. But building something that works perfectly on one platform beats building something that works unreliably everywhere.

More on This Topic

Fazm is an open-source macOS AI agent, available on GitHub.
