We Tested 5 AI Desktop Agents on 100 Real Tasks - Here's What Actually Works

Matthew Diakonov · 9 min read

Every major AI company now has a "computer use" product. OpenAI has Operator. Google has Project Mariner. Anthropic has Claude Computer Use. Simular raised $21.5M for their screenshot-based agent. And we built Fazm, which uses accessibility APIs instead of screenshots.

We decided to stop arguing about approaches and just measure them. We ran 100 real desktop tasks across all five agents and tracked success rates, speed, token consumption, and privacy exposure.

The results are not close.

The 5 Agents We Tested

| Agent | Control Method | Runs Where | Price |
|-------|---------------|------------|-------|
| OpenAI Operator | Screenshot + vision model | Cloud VM | $20/month |
| Google Project Mariner | Screenshot + vision model | Chrome extension | $249.99/month |
| Simular AI | Screenshot + vision model | Local desktop | Paid plans |
| Claude Computer Use | Screenshot + vision model | API (developer tool) | Per-token |
| Fazm | macOS accessibility APIs | Local desktop | Free |

Notice a pattern? Four of the five use the same fundamental approach: take a screenshot, send it to a vision model, get back coordinates, click. They differ in where they run and what they cost, but the core loop is identical.
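That core loop can be sketched in a few lines. This is a language-agnostic Python sketch, not any vendor's actual code; all four callables (`capture`, `vision_model`, `click`, and the `goal` string) are hypothetical stand-ins:

```python
def screenshot_agent_step(goal, capture, vision_model, click):
    """One iteration of the capture -> process -> act loop used by
    screenshot-based agents. Each argument is a hypothetical stand-in
    for a vendor-specific implementation."""
    image = capture()                  # encode the full screen as an image
    x, y = vision_model(image, goal)   # vision model returns pixel coordinates
    click(x, y)                        # act on those (possibly stale) coordinates
```

Everything downstream of `capture()` operates on a frozen picture of the screen, which is the root of several failure modes discussed below.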

Fazm uses a completely different approach. Instead of looking at pixels, it reads the accessibility tree - the same structured UI data that screen readers use. Every button, text field, and menu item comes with its label, role, and available actions.

The 100 Tasks

We designed tasks across five categories that represent actual daily work:

  • Email management (20 tasks): Compose, reply, search, organize, attach files
  • Web research (20 tasks): Multi-tab research, extract data, compare products, fill forms
  • Document editing (20 tasks): Create docs, edit spreadsheets, format presentations, export PDFs
  • Code workflows (20 tasks): Open files, run commands, commit changes, review diffs
  • Cross-app workflows (20 tasks): Copy data between apps, schedule from email, file from browser to local folder

Each task was attempted three times per agent. A task counts as "successful" only if at least two of the three attempts completed the intended action without human intervention.

Overall Success Rates

| Agent | Success Rate | Avg Time per Task | Avg Tokens per Task |
|-------|-------------|-------------------|---------------------|
| Fazm | 82% | 8.3 seconds | 3,200 |
| Claude Computer Use | 71% | 18.6 seconds | 22,400 |
| Simular AI | 67% | 21.2 seconds | 19,800 |
| OpenAI Operator | 58% | 24.7 seconds | 26,100 |
| Google Project Mariner | 41% | 31.4 seconds | 28,300 |

Google Project Mariner scores lowest because it only controls Chrome - it cannot touch desktop apps at all. For the 40 tasks that required desktop applications, Mariner scored 0%. On browser-only tasks it did considerably better, at 68%.

Where Screenshot Agents Break

The failures cluster around specific patterns. After analyzing over 500,000 desktop agent actions in our telemetry data, we already knew what to expect - but the head-to-head comparison confirmed it.

Wrong element clicked: 34% of failures

This is the dominant failure mode for screenshot-based agents. The vision model identifies a region of the screen and clicks it, but hits the wrong element. Common causes:

  • Similar buttons in close proximity. A settings page with "Save", "Save as Draft", and "Save and Close" next to each other. The vision model picks the wrong one.
  • Dynamic overlays. A notification toast or dropdown appears between screenshot capture and click execution. The agent clicks the notification instead of the intended target.
  • Resolution sensitivity. Screenshot agents trained on one resolution struggle on Retina displays or non-standard DPI settings.

Accessibility API agents do not have this problem. They reference elements by label and role, not by pixel coordinates. "Click the button labeled Save" is unambiguous regardless of screen layout.
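To make that concrete, here is a minimal sketch of label-based element resolution. The `AXNode` class is a simplified stand-in for an accessibility-tree element, not the real macOS `AXUIElement` API, and the role/label/action names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Simplified stand-in for one accessibility-tree element."""
    role: str
    label: str
    actions: list = field(default_factory=list)
    children: list = field(default_factory=list)

def find_element(node, role, label):
    """Depth-first search by role and label. No pixel coordinates are
    involved, so layout, resolution, and overlays do not matter."""
    if node.role == role and node.label == label:
        return node
    for child in node.children:
        found = find_element(child, role, label)
        if found:
            return found
    return None

# "Click the button labeled Save" resolves unambiguously even with
# "Save as Draft" and "Save and Close" sitting right next to it:
tree = AXNode("window", "Settings", children=[
    AXNode("button", "Save", actions=["AXPress"]),
    AXNode("button", "Save as Draft", actions=["AXPress"]),
    AXNode("button", "Save and Close", actions=["AXPress"]),
])
target = find_element(tree, "button", "Save")
```

The lookup key is the element's identity, not its position, so moving the button or changing the window size cannot cause a mis-click.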

Timing failures: 22% of failures

Screenshot agents operate in a loop: capture, process, act. Each cycle takes 1-3 seconds. During that time, the UI can change - a page finishes loading, an animation completes, a modal appears. The agent acts on stale information.

Accessibility APIs reflect the current UI state in real time. There is no capture-process-act delay because the agent reads the live element tree, not a frozen image.

Token cost explosion

Screenshot-based approaches consume 5-8x more tokens than accessibility tree approaches per task step. Each screenshot encodes as 15,000-25,000 image tokens. The accessibility tree for the same screen is 1,500-4,000 text tokens containing only the information the agent needs: interactive elements, their labels, their states.
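The 5-8x figure follows directly from those ranges. Taking the midpoints as a back-of-the-envelope illustration (illustrative arithmetic, not additional measured data):

```python
# Midpoints of the per-step token ranges quoted above.
screenshot_tokens_per_step = (15_000 + 25_000) / 2  # image tokens per screenshot
ax_tree_tokens_per_step = (1_500 + 4_000) / 2       # text tokens per tree snapshot

ratio = screenshot_tokens_per_step / ax_tree_tokens_per_step
print(f"screenshots cost ~{ratio:.1f}x more tokens per step")  # ~7.3x
```

Multiply that per-step gap by the dozens of steps in a multi-app task and the per-task totals in the table below follow.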

Over 100 tasks, the token cost difference is dramatic:

| Agent | Total Tokens (100 tasks) | Estimated Cost |
|-------|-------------------------|----------------|
| Fazm | ~320,000 | ~$0.48 |
| Claude Computer Use | ~2,240,000 | ~$6.72 |
| Simular AI | ~1,980,000 | ~$5.94 |
| OpenAI Operator | ~2,610,000 | Included in $20/month |
| Google Project Mariner | ~2,830,000 | Included in $249.99/month |

The Privacy Problem Nobody Talks About

Every screenshot-based agent sends images of your screen to cloud servers. Every screenshot. Your email content, your bank balance, your medical records, your private messages, your code with API keys in environment variables - all of it goes to OpenAI, Google, or Anthropic servers for processing.

We monitored outbound network traffic during our test runs:

| Agent | Data Sent to Cloud per Task | Contains Screen Images |
|-------|---------------------------|----------------------|
| OpenAI Operator | 2.1 MB avg | Yes - full screenshots |
| Google Project Mariner | 1.8 MB avg | Yes - tab screenshots |
| Simular AI | 1.6 MB avg | Yes - desktop screenshots |
| Claude Computer Use | 2.4 MB avg | Yes - full screenshots |
| Fazm | 0.02 MB avg | No - text intent only |

Fazm sends 100x less data to the cloud because it processes screen content locally using the accessibility tree and ScreenCaptureKit. Only the text-based intent (what the user wants to do) goes to the AI model. The screen content itself never leaves the machine.

This is not a feature choice. It is an architectural consequence of using accessibility APIs instead of screenshots.

Category Breakdown

Email Management (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 90% |
| Claude Computer Use | 80% |
| Simular AI | 75% |
| OpenAI Operator | 65% |
| Google Mariner | 55% |

Email tasks are relatively structured - compose windows, recipient fields, and send buttons are well-labeled in accessibility trees. Screenshot agents struggle more with rich text formatting and attachment dialogs.

Cross-App Workflows (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 75% |
| Simular AI | 55% |
| Claude Computer Use | 50% |
| OpenAI Operator | 35% |
| Google Mariner | 15% |

This is where the gap widens. Tasks like "find this invoice in Gmail, download the PDF, rename it, and file it in the Finance folder" require switching between apps, handling file dialogs, and navigating Finder. Cloud-based agents and browser-only tools struggle here because they lack native desktop access.

Code Workflows (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 85% |
| Claude Computer Use | 80% |
| Simular AI | 70% |
| OpenAI Operator | 60% |
| Google Mariner | 45% |

Terminal and code editor interactions benefit from the accessibility tree's ability to identify code blocks, line numbers, and cursor positions with precision.

Why Accessibility APIs Are Underused

If accessibility APIs are clearly superior for desktop automation, why does every major AI company use screenshots instead?

Three reasons:

1. Cross-platform ambition. Accessibility APIs are platform-specific. macOS has one API, Windows has another (UI Automation), Linux has AT-SPI. Screenshot-based approaches work on any platform with a display. Companies building for maximum reach choose the lowest common denominator.

2. Training data. Large vision models are trained on billions of images. The "look at a screenshot and identify UI elements" capability comes almost for free from pre-training. Accessibility APIs require platform-specific integration work with no pre-trained shortcut.

3. Control over the stack. Cloud-based agents using screenshots run entirely in the vendor's infrastructure. They do not need permission to access the user's accessibility framework, do not need to be a native app, and do not need to navigate OS permission dialogs.

These are business reasons, not technical ones. The screenshot approach is easier to ship across platforms and easier to monetize as a cloud service. But it is objectively worse at the actual job of controlling a computer.

Methodology Notes

  • All tests run on a 2024 MacBook Pro M3, macOS 15.3
  • Each task attempted 3 times per agent, best 2 of 3 counted
  • Timeout: 120 seconds per task attempt
  • Human evaluation for success/failure (did the intended action complete correctly?)
  • Token counts measured via API usage logs (Operator and Mariner estimated from network traffic)
  • Network traffic measured with Little Snitch

What We Would Change

Fazm is not perfect. The 82% success rate means 18% of tasks still fail. Our failures cluster around:

  • Apps with poor accessibility markup (Electron apps are the worst offenders)
  • Complex drag-and-drop operations
  • Multi-step workflows where one failure cascades

We are working on a hybrid approach - accessibility APIs as the primary method with selective screenshot fallback for apps with insufficient accessibility support. Early tests show this pushes success rates above 90%.
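A hybrid dispatch like that can be sketched in a few lines. Both callables here are hypothetical stand-ins (the source describes the approach, not this code), and the real fallback condition would be richer than a single exception:

```python
def run_action(element_query, ax_click, screenshot_click):
    """Hybrid sketch: try the accessibility tree first, and fall back
    to a screenshot-based click only when the app exposes no usable
    markup. Both callables are hypothetical stand-ins."""
    try:
        return ax_click(element_query)        # fast, precise, local
    except LookupError:                       # no matching element in the tree
        return screenshot_click(element_query)  # slower vision-based fallback
```

The common case keeps the speed, cost, and privacy properties of the accessibility path; only poorly-marked-up apps pay the screenshot penalty.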

The Bottom Line

The AI industry is converging on screenshot-based computer control because it is the easiest path to a cross-platform product. But the data shows it is 3x less reliable, 6x more expensive in tokens, and sends 100x more private data to the cloud than accessibility API approaches.

The best AI desktop agent is the one that reads UI structure instead of staring at pixels. The same insight that made screen readers reliable 20 years ago makes AI agents reliable today.

Fazm is free and open source. Try it or read the code.
