ChatGPT Can Use Your Computer Now - But Screenshot-Based Control Is Still Fragile
ChatGPT can now see your screen and click things. It takes a screenshot, feeds it to a vision model, identifies UI elements, and clicks at the predicted coordinates. Impressive demo. Terrible in practice.
The problem is fundamental. Screenshots are pixels. Pixels change constantly. A button that was at coordinates (340, 220) moves to (340, 280) when a notification banner appears. A dropdown menu overlaps the element you wanted to click. Dark mode changes the visual appearance of everything. The vision model has to re-identify every element from scratch each time.
Why This Breaks
Screenshot-based control has a cascading failure mode. The agent takes a screenshot, identifies a button, clicks slightly wrong, gets a different screen than expected, takes another screenshot, and now it is completely lost. Each step compounds the error.
We have seen this happen with form filling - the agent clicks a text field, starts typing, but a tooltip appears and shifts the field down by 20 pixels. The next click lands on the wrong element. The agent does not know it clicked the wrong thing because the screenshot looks "close enough" to what it expected.
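That failure mode can be reproduced with a toy model. The sketch below is not ChatGPT's actual pipeline, just a minimal simulation: an agent remembers coordinates from a stale screenshot, a banner shifts the layout, and the click lands on the wrong element.

```python
# Toy model of a screenshot-driven agent: it clicks remembered
# coordinates while the layout shifts underneath it.

def click(ui, point):
    """Return the name of the element whose bounding box contains `point`."""
    px, py = point
    for name, (x, y, w, h) in ui.items():
        if x <= px <= x + w and y <= py <= y + h:
            return name
    return None

# Layout at screenshot time: "Save" occupies (300, 200, 80, 40).
ui = {"Save": (300, 200, 80, 40)}
target = (340, 220)  # center of "Save" in the stale screenshot
assert click(ui, target) == "Save"

# A notification banner appears, pushing the button down 60 pixels.
ui = {"Banner": (0, 180, 800, 60), "Save": (300, 260, 80, 40)}

print(click(ui, target))  # the stale coordinates now hit the banner
```

The agent has no signal that anything went wrong: the click succeeded, just on the wrong element, which is exactly how the compounding starts.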
Accessibility APIs Solve This Differently
macOS provides an accessibility API that gives you the actual UI tree. Every button, text field, menu item, and label has a programmatic identity. You do not need to visually locate a "Save" button - you ask the system for the button with the role "AXButton" and title "Save" and it gives you a direct reference.
This approach does not care about:
- Screen resolution or scaling
- Dark mode vs light mode
- Overlapping windows or tooltips
- Where elements are positioned on screen
The API returns structured data with roles, labels, values, and available actions. You interact with the element directly rather than clicking coordinates.
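The shape of that lookup can be sketched with a toy UI tree. This is Python standing in for the real ApplicationServices calls (`AXUIElementCopyAttributeValue` and friends), so the class and method names here are illustrative, not the actual API; the point is that the query matches on role and title, with no coordinates anywhere.

```python
# Toy sketch of an accessibility-style lookup (NOT the real macOS AX API):
# elements carry roles and titles, so queries are position-independent.

from dataclasses import dataclass, field

@dataclass
class Element:
    role: str                      # e.g. "AXButton", "AXTextField"
    title: str = ""
    children: list = field(default_factory=list)

    def find(self, role, title):
        """Depth-first search for a descendant matching role and title."""
        if self.role == role and self.title == title:
            return self
        for child in self.children:
            found = child.find(role, title)
            if found:
                return found
        return None

window = Element("AXWindow", "Document", [
    Element("AXToolbar", children=[Element("AXButton", "Save")]),
    Element("AXTextField", "Title"),
])

save = window.find("AXButton", "Save")
print(save.role, save.title)       # direct reference, no pixels involved
```

Move the toolbar, resize the window, switch to dark mode: the query still returns the same element, because identity lives in the tree, not in the pixels.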
The Practical Difference
Screenshot-based agents need retries, error recovery, and visual verification at every step. Accessibility-based agents just ask for the element and interact with it. One approach fights the UI. The other works with it.
The screenshot approach will keep improving as vision models get better. But accessibility APIs already work reliably today - no vision model needed.
Fazm is an open-source macOS AI agent, available on GitHub.