ChatGPT Can Use Your Computer Now - But Screenshot-Based Control Is Still Fragile
ChatGPT can now see your screen and click things. It takes a screenshot, feeds it to a vision model, identifies UI elements, and clicks at the predicted coordinates. Impressive demo. Terrible in practice.
The problem is fundamental. Screenshots are pixels. Pixels change constantly. A button that was at coordinates (340, 220) moves to (340, 280) when a notification banner appears. A dropdown menu overlaps the element you wanted to click. Dark mode changes the visual appearance of everything. The vision model has to re-identify every element from scratch each time.
Why This Breaks
Screenshot-based control has a cascading failure mode. The agent takes a screenshot, identifies a button, clicks slightly wrong, gets a different screen than expected, takes another screenshot, and now it is completely lost. Each step compounds the error.
We have seen this happen with form filling - the agent clicks a text field, starts typing, but a tooltip appears and shifts the field down by 20 pixels. The next click lands on the wrong element. The agent does not know it clicked the wrong thing because the screenshot looks "close enough" to what it expected.
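That failure mode can be reproduced with a toy model. The sketch below is not ChatGPT's actual pipeline, just a minimal simulation: an agent remembers coordinates from a stale screenshot, a banner shifts the layout, and the click lands on the wrong element.

```python
# Toy model of a screenshot-driven agent: it clicks remembered
# coordinates while the layout shifts underneath it.

def click(ui, point):
    """Return the name of the element whose bounding box contains `point`."""
    px, py = point
    for name, (x, y, w, h) in ui.items():
        if x <= px <= x + w and y <= py <= y + h:
            return name
    return None

# Layout at screenshot time: "Save" occupies (300, 200, 80, 40).
ui = {"Save": (300, 200, 80, 40)}
target = (340, 220)  # center of "Save" in the stale screenshot
assert click(ui, target) == "Save"

# A notification banner appears, pushing the button down 60 pixels.
ui = {"Banner": (0, 180, 800, 60), "Save": (300, 260, 80, 40)}

print(click(ui, target))  # the stale coordinates now hit the banner
```

The agent has no signal that anything went wrong: the click succeeded, just on the wrong element, which is exactly how the compounding starts.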
Accessibility APIs Solve This Differently
macOS provides an accessibility API that gives you the actual UI tree. Every button, text field, menu item, and label has a programmatic identity. You do not need to visually locate a "Save" button - you ask the system for the button with the role "AXButton" and title "Save" and it gives you a direct reference.
This approach does not care about:
- Screen resolution or scaling
- Dark mode vs light mode
- Overlapping windows or tooltips
- Where elements are positioned on screen
The API returns structured data with roles, labels, values, and available actions. You interact with the element directly rather than clicking coordinates.
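The shape of that lookup can be sketched with a toy UI tree. This is Python standing in for the real ApplicationServices calls (`AXUIElementCopyAttributeValue` and friends), so the class and method names here are illustrative, not the actual API; the point is that the query matches on role and title, with no coordinates anywhere.

```python
# Toy sketch of an accessibility-style lookup (NOT the real macOS AX API):
# elements carry roles and titles, so queries are position-independent.

from dataclasses import dataclass, field

@dataclass
class Element:
    role: str                      # e.g. "AXButton", "AXTextField"
    title: str = ""
    children: list = field(default_factory=list)

    def find(self, role, title):
        """Depth-first search for a descendant matching role and title."""
        if self.role == role and self.title == title:
            return self
        for child in self.children:
            found = child.find(role, title)
            if found:
                return found
        return None

window = Element("AXWindow", "Document", [
    Element("AXToolbar", children=[Element("AXButton", "Save")]),
    Element("AXTextField", "Title"),
])

save = window.find("AXButton", "Save")
print(save.role, save.title)       # direct reference, no pixels involved
```

Move the toolbar, resize the window, switch to dark mode: the query still returns the same element, because identity lives in the tree, not in the pixels.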
The Practical Difference
Screenshot-based agents need retries, error recovery, and visual verification at every step. Accessibility-based agents just ask for the element and interact with it. One approach fights the UI. The other works with it.
The screenshot approach will keep improving as vision models get better. But accessibility APIs already work reliably today - no vision model needed.
Fazm is an open-source macOS AI agent, available on GitHub.