ChatGPT Can Use Your Computer - Screenshot vs Accessibility API Approaches
OpenAI's computer use feature lets ChatGPT see your screen and interact with it. It is impressive as a demo, but it reveals a fundamental design decision that matters more than most people realize: how should an AI agent perceive and interact with your desktop?
The Screenshot Approach
Screenshot-based control works by capturing your screen as an image, sending it to a vision model, and getting back coordinates to click. This is what most AI computer control tools use today, including OpenAI's implementation.
The appeal is obvious - it works with any application on any platform, no special integration needed. The AI sees what you see.
The problems are also obvious once you use it. Each interaction takes 1-3 seconds because you are round-tripping a full image to a cloud model. The AI constantly misidentifies UI elements - similar-looking buttons, overlapping menus, dynamic content that changes between screenshot and click. Dark mode alters the pixels the model expects to see. High-DPI displays confuse the mapping between logical points and physical pixels.
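The high-DPI failure mode is simple arithmetic. On a Retina display the backing scale is typically 2.0, so a coordinate in physical pixels is double the logical point value; mix the two up and the click lands in the wrong quadrant of the screen. A sketch of the conversion (illustrative, not tied to any particular API):

```python
def logical_to_physical(x: float, y: float, backing_scale: float) -> tuple[int, int]:
    """Map logical (point) coordinates to physical pixel coordinates.

    A vision model sees the screenshot in physical pixels, but many
    input APIs expect logical points - if the agent forgets to divide
    (or multiplies twice), the click is off by the scale factor.
    """
    return round(x * backing_scale), round(y * backing_scale)
```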
The Accessibility API Approach
macOS exposes a complete tree of every UI element in every application through its accessibility framework. Instead of looking at pixels, an agent using accessibility APIs gets structured data - button labels, text field values, menu hierarchies, available actions.
"Press the button labeled Submit" is fundamentally more reliable than "click at pixel coordinates (847, 423)." It works regardless of screen resolution, window position, or visual theme.
The tradeoff is that accessibility APIs are platform-specific and some apps expose incomplete accessibility trees. But on macOS, coverage is broad because Apple's standard UI frameworks expose accessibility information by default.
Why This Matters
The industry will converge on hybrid approaches - using accessibility APIs as the primary interface with vision as a fallback. Pure screenshot approaches are a dead end for production-quality automation.
Speed matters too. Local accessibility API calls take milliseconds. Cloud vision model calls take seconds. For multi-step workflows, this difference compounds into minutes of waiting versus instant execution.
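Using the illustrative figures above (2 seconds per cloud round trip versus roughly 5 milliseconds per local call), the compounding is easy to quantify:

```python
def workflow_seconds(steps: int, per_call_seconds: float) -> float:
    """Total wall-clock time for a sequential multi-step workflow."""
    return steps * per_call_seconds

# A 50-step workflow, with per-step costs taken from the text:
cloud = workflow_seconds(50, 2.0)    # screenshot round trip to a vision model
local = workflow_seconds(50, 0.005)  # local accessibility API call
```

Fifty cloud round trips cost on the order of a hundred seconds of waiting; the same workflow over local accessibility calls finishes in well under a second.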
Fazm is an open source macOS AI agent, available on GitHub.