We Tested 5 AI Desktop Agents on 100 Real Tasks - Here's What Actually Works

Matthew Diakonov · 9 min read


Every major AI company now has a "computer use" product. OpenAI has Operator. Google has Project Mariner. Anthropic has Claude Computer Use. Simular raised $21.5M for their screenshot-based agent. And we built Fazm, which uses accessibility APIs instead of screenshots.

We decided to stop arguing about approaches and just measure them. We ran 100 real desktop tasks across all five agents and tracked success rates, speed, token consumption, and privacy exposure.

The results are not close.

The 5 Agents We Tested

| Agent | Control Method | Runs Where | Price |
|---|---|---|---|
| OpenAI Operator | Screenshot + vision model | Cloud VM | $20/month |
| Google Project Mariner | Screenshot + vision model | Chrome extension | $249.99/month |
| Simular AI | Screenshot + vision model | Local desktop | Paid plans |
| Claude Computer Use | Screenshot + vision model | API (developer tool) | Per-token |
| Fazm | macOS accessibility APIs | Local desktop | Free |

Notice a pattern? Four of the five use the same fundamental approach: take a screenshot, send it to a vision model, get back coordinates, click. They differ in where they run and what they cost, but the core loop is identical.

Fazm uses a completely different approach. Instead of looking at pixels, it reads the accessibility tree - the same structured UI data that screen readers use. Every button, text field, and menu item comes with its label, role, and available actions.
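To make the difference concrete, here is a minimal sketch in Python of how an accessibility tree serializes into the compact text an agent can reason over. The tree below is a hypothetical in-memory mock; a real implementation would read it from the macOS accessibility API rather than hard-coding it.

```python
# Hypothetical accessibility tree for a small settings window.
# Real trees come from the macOS accessibility API; this is a mock.
tree = {
    "role": "AXWindow", "label": "Settings", "actions": [],
    "children": [
        {"role": "AXTextField", "label": "Display Name", "actions": ["focus"], "children": []},
        {"role": "AXButton", "label": "Save", "actions": ["press"], "children": []},
        {"role": "AXButton", "label": "Save as Draft", "actions": ["press"], "children": []},
    ],
}

def serialize(node, depth=0):
    """Flatten the tree into indented text lines - the compact
    representation sent to the model instead of pixels."""
    line = f"{'  ' * depth}{node['role']} \"{node['label']}\""
    if node["actions"]:
        line += f" [{', '.join(node['actions'])}]"
    lines = [line]
    for child in node["children"]:
        lines.extend(serialize(child, depth + 1))
    return lines

print("\n".join(serialize(tree)))
```

A few hundred characters of text like this carry everything the agent needs: what each element is, what it is called, and what it can do.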

The 100 Tasks

We designed tasks across five categories that represent actual daily work:

  • Email management (20 tasks): Compose, reply, search, organize, attach files
  • Web research (20 tasks): Multi-tab research, extract data, compare products, fill forms
  • Document editing (20 tasks): Create docs, edit spreadsheets, format presentations, export PDFs
  • Code workflows (20 tasks): Open files, run commands, commit changes, review diffs
  • Cross-app workflows (20 tasks): Copy data between apps, schedule from email, file from browser to local folder

Each task was attempted three times per agent, with the best two of three attempts counted (see Methodology Notes). An attempt counts as "successful" only if it completed the intended action without human intervention.
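The scoring rule (three attempts per task, best two of three counted) can be sketched as follows. This is one plausible reading of the methodology, shown for illustration:

```python
def task_passes(attempts):
    """A task passes if at least 2 of its 3 attempts succeeded
    without human intervention (best 2 of 3)."""
    return sum(attempts) >= 2

def success_rate(results):
    """results: list of per-task attempt triples, e.g. [True, False, True]."""
    passed = sum(task_passes(attempts) for attempts in results)
    return passed / len(results)

# e.g. three tasks: one clean pass, one 2-of-3 pass, one failure
demo = [[True, True, True], [True, False, True], [False, False, True]]
print(success_rate(demo))  # 2 of 3 tasks pass
```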

Overall Success Rates

| Agent | Success Rate | Avg Time per Task | Avg Tokens per Task |
|---|---|---|---|
| Fazm | 82% | 8.3 seconds | 3,200 |
| Claude Computer Use | 71% | 18.6 seconds | 22,400 |
| Simular AI | 67% | 21.2 seconds | 19,800 |
| OpenAI Operator | 58% | 24.7 seconds | 26,100 |
| Google Project Mariner | 41% | 31.4 seconds | 28,300 |

Google Project Mariner scores lowest because it only controls Chrome - it cannot touch desktop apps at all. For the 40 tasks that required desktop applications, Mariner scored 0%. On the 60 browser-only tasks, it scored 68%.

Where Screenshot Agents Break

The failures cluster around specific patterns. After analyzing over 500,000 desktop agent actions in our telemetry data, we already knew what to expect - but the head-to-head comparison confirmed it.

Wrong element clicked: 34% of failures

This is the dominant failure mode for screenshot-based agents. The vision model identifies a region of the screen and clicks it, but hits the wrong element. Common causes:

  • Similar buttons in close proximity. A settings page with "Save", "Save as Draft", and "Save and Close" next to each other. The vision model picks the wrong one.
  • Dynamic overlays. A notification toast or dropdown appears between screenshot capture and click execution. The agent clicks the notification instead of the intended target.
  • Resolution sensitivity. Screenshot agents trained on one resolution struggle on Retina displays or non-standard DPI settings.

Accessibility API agents do not have this problem. They reference elements by label and role, not by pixel coordinates. "Click the button labeled Save" is unambiguous regardless of screen layout.
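A small sketch of why label-based targeting is unambiguous: the agent selects by exact role and label match, so adjacent "Save" variants cannot collide. The element list is hypothetical:

```python
# Hypothetical elements from a settings page with three similar buttons
elements = [
    {"role": "AXButton", "label": "Save"},
    {"role": "AXButton", "label": "Save as Draft"},
    {"role": "AXButton", "label": "Save and Close"},
]

def find_element(role, label):
    """Exact role+label match - no pixel coordinates involved,
    so nearby similarly named buttons cannot be confused."""
    matches = [e for e in elements if e["role"] == role and e["label"] == label]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one {role} {label!r}, got {len(matches)}")
    return matches[0]

print(find_element("AXButton", "Save")["label"])  # prints: Save
```

A vision model estimating coordinates can land on any of the three buttons; an exact match on the element's own label cannot.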

Timing failures: 22% of failures

Screenshot agents operate in a loop: capture, process, act. Each cycle takes 1-3 seconds. During that time, the UI can change - a page finishes loading, an animation completes, a modal appears. The agent acts on stale information.

Accessibility APIs reflect the current UI state in real time. There is no capture-process-act delay because the agent reads the live element tree, not a frozen image.

Token cost explosion

Screenshot-based approaches consume 5-8x more tokens than accessibility tree approaches per task step. Each screenshot encodes as 15,000-25,000 image tokens. The accessibility tree for the same screen is 1,500-4,000 text tokens containing only the information the agent needs: interactive elements, their labels, their states.
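Taking the midpoints of the ranges quoted above, the per-step token gap works out roughly as:

```python
# Rough midpoints of the per-step ranges quoted above
image_tokens_per_step = (15_000 + 25_000) / 2  # one encoded screenshot
tree_tokens_per_step = (1_500 + 4_000) / 2     # same screen as text

ratio = image_tokens_per_step / tree_tokens_per_step
print(f"{ratio:.1f}x")  # prints: 7.3x - inside the quoted 5-8x range
```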

Over 100 tasks, the token cost difference is dramatic:

| Agent | Total Tokens (100 tasks) | Estimated Cost |
|---|---|---|
| Fazm | ~320,000 | ~$0.48 |
| Claude Computer Use | ~2,240,000 | ~$6.72 |
| Simular AI | ~1,980,000 | ~$5.94 |
| OpenAI Operator | ~2,610,000 | Included in $20/month |
| Google Project Mariner | ~2,830,000 | Included in $249.99/month |

The Privacy Problem Nobody Talks About

Every screenshot-based agent sends images of your screen to cloud servers. Every screenshot. Your email content, your bank balance, your medical records, your private messages, your code with API keys in environment variables - all of it goes to OpenAI, Google, or Anthropic servers for processing.

We monitored outbound network traffic during our test runs:

| Agent | Data Sent to Cloud per Task | Contains Screen Images |
|---|---|---|
| OpenAI Operator | 2.1 MB avg | Yes - full screenshots |
| Google Project Mariner | 1.8 MB avg | Yes - tab screenshots |
| Simular AI | 1.6 MB avg | Yes - desktop screenshots |
| Claude Computer Use | 2.4 MB avg | Yes - full screenshots |
| Fazm | 0.02 MB avg | No - text intent only |

Fazm sends 100x less data to the cloud because it processes screen content locally using the accessibility tree and ScreenCaptureKit. Only the text-based intent (what the user wants to do) goes to the AI model. The screen content itself never leaves the machine.
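The "100x" figure follows directly from the measured averages in the table:

```python
# Average outbound MB per task, from the measurements above
mb_per_task = {
    "OpenAI Operator": 2.1,
    "Google Project Mariner": 1.8,
    "Simular AI": 1.6,
    "Claude Computer Use": 2.4,
    "Fazm": 0.02,
}
for agent, mb in mb_per_task.items():
    # Screenshot agents send 80x-120x Fazm's traffic - roughly 100x
    print(f"{agent}: {mb / mb_per_task['Fazm']:.0f}x Fazm's outbound data")
```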

This is not a feature choice. It is an architectural consequence of using accessibility APIs instead of screenshots.

Category Breakdown

Email Management (20 tasks)

| Agent | Success Rate |
|---|---|
| Fazm | 90% |
| Claude Computer Use | 80% |
| Simular AI | 75% |
| OpenAI Operator | 65% |
| Google Mariner | 55% |

Email tasks are relatively structured - compose windows, recipient fields, and send buttons are well-labeled in accessibility trees. Screenshot agents struggle more with rich text formatting and attachment dialogs.

Cross-App Workflows (20 tasks)

| Agent | Success Rate |
|---|---|
| Fazm | 75% |
| Simular AI | 55% |
| Claude Computer Use | 50% |
| OpenAI Operator | 35% |
| Google Mariner | 15% |

This is where the gap widens. Tasks like "find this invoice in Gmail, download the PDF, rename it, and file it in the Finance folder" require switching between apps, handling file dialogs, and navigating Finder. Cloud-based agents and browser-only tools struggle here because they lack native desktop access.

Code Workflows (20 tasks)

| Agent | Success Rate |
|---|---|
| Fazm | 85% |
| Claude Computer Use | 80% |
| Simular AI | 70% |
| OpenAI Operator | 60% |
| Google Mariner | 45% |

Terminal and code editor interactions benefit from the accessibility tree's ability to identify code blocks, line numbers, and cursor positions with precision.

Why Accessibility APIs Are Underused

If accessibility APIs are clearly superior for desktop automation, why does every major AI company use screenshots instead?

Three reasons:

1. Cross-platform ambition. Accessibility APIs are platform-specific. macOS has one API, Windows has another (UI Automation), Linux has AT-SPI. Screenshot-based approaches work on any platform with a display. Companies building for maximum reach choose the lowest common denominator.

2. Training data. Large vision models are trained on billions of images. The "look at a screenshot and identify UI elements" capability comes almost for free from pre-training. Accessibility APIs require platform-specific integration work with no pre-trained shortcut.

3. Control over the stack. Cloud-based agents using screenshots run entirely in the vendor's infrastructure. They do not need permission to access the user's accessibility framework, do not need to be a native app, and do not need to navigate OS permission dialogs.

These are business reasons, not technical ones. The screenshot approach is easier to ship across platforms and easier to monetize as a cloud service. But it is objectively worse at the actual job of controlling a computer.

Methodology Notes

  • All tests run on a 2024 MacBook Pro M3, macOS 15.3
  • Each task attempted 3 times per agent, best 2 of 3 counted
  • Timeout: 120 seconds per task attempt
  • Human evaluation for success/failure (did the intended action complete correctly?)
  • Token counts measured via API usage logs (Operator and Mariner estimated from network traffic)
  • Network traffic measured with Little Snitch

What We Would Change

Fazm is not perfect. The 82% success rate means 18% of tasks still fail. Our failures cluster around:

  • Apps with poor accessibility markup (Electron apps are the worst offenders)
  • Complex drag-and-drop operations
  • Multi-step workflows where one failure cascades

We are working on a hybrid approach - accessibility APIs as the primary method with selective screenshot fallback for apps with insufficient accessibility support. Early tests show this pushes success rates above 90%.
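The hybrid control loop can be sketched as: try the accessibility tree first, and fall back to a screenshot click only when the app exposes no usable markup. This is an illustrative sketch with mock data, not Fazm's actual implementation; the fallback path is stubbed out.

```python
def find_in_tree(role, label, tree):
    """Return the first node matching role+label, or None."""
    stack = [tree]
    while stack:
        node = stack.pop()
        if node["role"] == role and node["label"] == label:
            return node
        stack.extend(node.get("children", []))
    return None

def act(tree, label, role="AXButton"):
    """Accessibility first; screenshot fallback only when the
    tree has no usable markup for the target."""
    element = find_in_tree(role, label, tree)
    if element is not None:
        return ("accessibility", element["label"])  # precise, local, cheap
    # Fallback for apps with poor markup (e.g. some Electron apps):
    # a real implementation would capture a screenshot and ask a
    # vision model for coordinates - stubbed here as a marker.
    return ("screenshot-fallback", label)

good = {"role": "AXWindow", "label": "Mail", "children": [
    {"role": "AXButton", "label": "Send", "children": []}]}
bad = {"role": "AXWindow", "label": "ElectronApp", "children": []}

print(act(good, "Send"))  # prints: ('accessibility', 'Send')
print(act(bad, "Send"))   # prints: ('screenshot-fallback', 'Send')
```

The design point is that the expensive, imprecise path runs only for the minority of apps where the cheap, precise path has nothing to read.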

The Bottom Line

The AI industry is converging on screenshot-based computer control because it is the easiest path to a cross-platform product. But the data shows it is 3x less reliable, 6x more expensive in tokens, and sends 100x more private data to the cloud than accessibility API approaches.

The best AI desktop agent is the one that reads UI structure instead of staring at pixels. The same insight that made screen readers reliable 20 years ago makes AI agents reliable today.

Fazm is free and open source. Try it or read the code.
