We Tested 5 AI Desktop Agents on 100 Real Tasks - Here's What Actually Works

Matthew Diakonov · 9 min read

Every major AI company now has a "computer use" product. OpenAI has Operator. Google has Project Mariner. Anthropic has Claude Computer Use. Simular raised $21.5M for their screenshot-based agent. And we built Fazm, which uses accessibility APIs instead of screenshots.

We decided to stop arguing about approaches and just measure them. We ran 100 real desktop tasks across all five agents and tracked success rates, speed, token consumption, and privacy exposure.

The results are not close.

The 5 Agents We Tested

| Agent | Control Method | Runs Where | Price |
|-------|---------------|------------|-------|
| OpenAI Operator | Screenshot + vision model | Cloud VM | $20/month |
| Google Project Mariner | Screenshot + vision model | Chrome extension | $249.99/month |
| Simular AI | Screenshot + vision model | Local desktop | Paid plans |
| Claude Computer Use | Screenshot + vision model | API (developer tool) | Per-token |
| Fazm | macOS accessibility APIs | Local desktop | Free |

Notice a pattern? Four of the five use the same fundamental approach: take a screenshot, send it to a vision model, get back coordinates, click. They differ in where they run and what they cost, but the core loop is identical.
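That core loop can be sketched in a few lines. This is a language-agnostic Python sketch, not any vendor's actual code; all four callables (`capture`, `vision_model`, `click`, and the `goal` string) are hypothetical stand-ins:

```python
def screenshot_agent_step(goal, capture, vision_model, click):
    """One iteration of the capture -> process -> act loop used by
    screenshot-based agents. Each argument is a hypothetical stand-in
    for a vendor-specific implementation."""
    image = capture()                  # encode the full screen as an image
    x, y = vision_model(image, goal)   # vision model returns pixel coordinates
    click(x, y)                        # act on those (possibly stale) coordinates
```

Everything downstream of `capture()` operates on a frozen picture of the screen, which is the root of several failure modes discussed below.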

Fazm uses a completely different approach. Instead of looking at pixels, it reads the accessibility tree - the same structured UI data that screen readers use. Every button, text field, and menu item comes with its label, role, and available actions.

The 100 Tasks

We designed tasks across five categories that represent actual daily work:

  • Email management (20 tasks): Compose, reply, search, organize, attach files
  • Web research (20 tasks): Multi-tab research, extract data, compare products, fill forms
  • Document editing (20 tasks): Create docs, edit spreadsheets, format presentations, export PDFs
  • Code workflows (20 tasks): Open files, run commands, commit changes, review diffs
  • Cross-app workflows (20 tasks): Copy data between apps, schedule from email, file from browser to local folder

Each task was attempted three times per agent. A task counts as "successful" only if at least two of the three attempts completed the intended action without human intervention.

Overall Success Rates

| Agent | Success Rate | Avg Time per Task | Avg Tokens per Task |
|-------|-------------|-------------------|---------------------|
| Fazm | 82% | 8.3 seconds | 3,200 |
| Claude Computer Use | 71% | 18.6 seconds | 22,400 |
| Simular AI | 67% | 21.2 seconds | 19,800 |
| OpenAI Operator | 58% | 24.7 seconds | 26,100 |
| Google Project Mariner | 41% | 31.4 seconds | 28,300 |

Google Project Mariner scores lowest because it only controls Chrome - it cannot touch desktop apps at all. For the 40 tasks that required desktop applications, Mariner scored 0%. On browser-only tasks it did considerably better, at 68%.

Where Screenshot Agents Break

The failures cluster around specific patterns. After analyzing over 500,000 desktop agent actions in our telemetry data, we already knew what to expect - but the head-to-head comparison confirmed it.

Wrong element clicked: 34% of failures

This is the dominant failure mode for screenshot-based agents. The vision model identifies a region of the screen and clicks it, but hits the wrong element. Common causes:

  • Similar buttons in close proximity. A settings page with "Save", "Save as Draft", and "Save and Close" next to each other. The vision model picks the wrong one.
  • Dynamic overlays. A notification toast or dropdown appears between screenshot capture and click execution. The agent clicks the notification instead of the intended target.
  • Resolution sensitivity. Screenshot agents trained on one resolution struggle on Retina displays or non-standard DPI settings.

Accessibility API agents do not have this problem. They reference elements by label and role, not by pixel coordinates. "Click the button labeled Save" is unambiguous regardless of screen layout.
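To make that concrete, here is a minimal sketch of label-based element resolution. The `AXNode` class is a simplified stand-in for an accessibility-tree element, not the real macOS `AXUIElement` API, and the role/label/action names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """Simplified stand-in for one accessibility-tree element."""
    role: str
    label: str
    actions: list = field(default_factory=list)
    children: list = field(default_factory=list)

def find_element(node, role, label):
    """Depth-first search by role and label. No pixel coordinates are
    involved, so layout, resolution, and overlays do not matter."""
    if node.role == role and node.label == label:
        return node
    for child in node.children:
        found = find_element(child, role, label)
        if found:
            return found
    return None

# "Click the button labeled Save" resolves unambiguously even with
# "Save as Draft" and "Save and Close" sitting right next to it:
tree = AXNode("window", "Settings", children=[
    AXNode("button", "Save", actions=["AXPress"]),
    AXNode("button", "Save as Draft", actions=["AXPress"]),
    AXNode("button", "Save and Close", actions=["AXPress"]),
])
target = find_element(tree, "button", "Save")
```

The lookup key is the element's identity, not its position, so moving the button or changing the window size cannot cause a mis-click.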

Timing failures: 22% of failures

Screenshot agents operate in a loop: capture, process, act. Each cycle takes 1-3 seconds. During that time, the UI can change - a page finishes loading, an animation completes, a modal appears. The agent acts on stale information.

Accessibility APIs reflect the current UI state in real time. There is no capture-process-act delay because the agent reads the live element tree, not a frozen image.

Token cost explosion

Screenshot-based approaches consume 5-8x more tokens than accessibility tree approaches per task step. Each screenshot encodes as 15,000-25,000 image tokens. The accessibility tree for the same screen is 1,500-4,000 text tokens containing only the information the agent needs: interactive elements, their labels, their states.
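The 5-8x figure follows directly from those ranges. Taking the midpoints as a back-of-the-envelope illustration (illustrative arithmetic, not additional measured data):

```python
# Midpoints of the per-step token ranges quoted above.
screenshot_tokens_per_step = (15_000 + 25_000) / 2  # image tokens per screenshot
ax_tree_tokens_per_step = (1_500 + 4_000) / 2       # text tokens per tree snapshot

ratio = screenshot_tokens_per_step / ax_tree_tokens_per_step
print(f"screenshots cost ~{ratio:.1f}x more tokens per step")  # ~7.3x
```

Multiply that per-step gap by the dozens of steps in a multi-app task and the per-task totals in the table below follow.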

Over 100 tasks, the token cost difference is dramatic:

| Agent | Total Tokens (100 tasks) | Estimated Cost |
|-------|-------------------------|----------------|
| Fazm | ~320,000 | ~$0.48 |
| Claude Computer Use | ~2,240,000 | ~$6.72 |
| Simular AI | ~1,980,000 | ~$5.94 |
| OpenAI Operator | ~2,610,000 | Included in $20/month |
| Google Project Mariner | ~2,830,000 | Included in $249.99/month |

The Privacy Problem Nobody Talks About

Every screenshot-based agent sends images of your screen to cloud servers. Every screenshot. Your email content, your bank balance, your medical records, your private messages, your code with API keys in environment variables - all of it goes to OpenAI, Google, or Anthropic servers for processing.

We monitored outbound network traffic during our test runs:

| Agent | Data Sent to Cloud per Task | Contains Screen Images |
|-------|---------------------------|----------------------|
| OpenAI Operator | 2.1 MB avg | Yes - full screenshots |
| Google Project Mariner | 1.8 MB avg | Yes - tab screenshots |
| Simular AI | 1.6 MB avg | Yes - desktop screenshots |
| Claude Computer Use | 2.4 MB avg | Yes - full screenshots |
| Fazm | 0.02 MB avg | No - text intent only |

Fazm sends 100x less data to the cloud because it processes screen content locally using the accessibility tree and ScreenCaptureKit. Only the text-based intent (what the user wants to do) goes to the AI model. The screen content itself never leaves the machine.

This is not a feature choice. It is an architectural consequence of using accessibility APIs instead of screenshots.

Category Breakdown

Email Management (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 90% |
| Claude Computer Use | 80% |
| Simular AI | 75% |
| OpenAI Operator | 65% |
| Google Mariner | 55% |

Email tasks are relatively structured - compose windows, recipient fields, and send buttons are well-labeled in accessibility trees. Screenshot agents struggle more with rich text formatting and attachment dialogs.

Cross-App Workflows (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 75% |
| Simular AI | 55% |
| Claude Computer Use | 50% |
| OpenAI Operator | 35% |
| Google Mariner | 15% |

This is where the gap widens. Tasks like "find this invoice in Gmail, download the PDF, rename it, and file it in the Finance folder" require switching between apps, handling file dialogs, and navigating Finder. Cloud-based agents and browser-only tools struggle here because they lack native desktop access.

Code Workflows (20 tasks)

| Agent | Success Rate |
|-------|-------------|
| Fazm | 85% |
| Claude Computer Use | 80% |
| Simular AI | 70% |
| OpenAI Operator | 60% |
| Google Mariner | 45% |

Terminal and code editor interactions benefit from the accessibility tree's ability to identify code blocks, line numbers, and cursor positions with precision.

Why Accessibility APIs Are Underused

If accessibility APIs are clearly superior for desktop automation, why does every major AI company use screenshots instead?

Three reasons:

1. Cross-platform ambition. Accessibility APIs are platform-specific. macOS has one API, Windows has another (UI Automation), Linux has AT-SPI. Screenshot-based approaches work on any platform with a display. Companies building for maximum reach choose the lowest common denominator.

2. Training data. Large vision models are trained on billions of images. The "look at a screenshot and identify UI elements" capability comes almost for free from pre-training. Accessibility APIs require platform-specific integration work with no pre-trained shortcut.

3. Control over the stack. Cloud-based agents using screenshots run entirely in the vendor's infrastructure. They do not need permission to access the user's accessibility framework, do not need to be a native app, and do not need to navigate OS permission dialogs.

These are business reasons, not technical ones. The screenshot approach is easier to ship across platforms and easier to monetize as a cloud service. But it is objectively worse at the actual job of controlling a computer.

Methodology Notes

  • All tests run on a 2024 MacBook Pro M3, macOS 15.3
  • Each task attempted 3 times per agent, best 2 of 3 counted
  • Timeout: 120 seconds per task attempt
  • Human evaluation for success/failure (did the intended action complete correctly?)
  • Token counts measured via API usage logs (Operator and Mariner estimated from network traffic)
  • Network traffic measured with Little Snitch

What We Would Change

Fazm is not perfect. The 82% success rate means 18% of tasks still fail. Our failures cluster around:

  • Apps with poor accessibility markup (Electron apps are the worst offenders)
  • Complex drag-and-drop operations
  • Multi-step workflows where one failure cascades

We are working on a hybrid approach - accessibility APIs as the primary method with selective screenshot fallback for apps with insufficient accessibility support. Early tests show this pushes success rates above 90%.
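A hybrid dispatch like that can be sketched in a few lines. Both callables here are hypothetical stand-ins (the source describes the approach, not this code), and the real fallback condition would be richer than a single exception:

```python
def run_action(element_query, ax_click, screenshot_click):
    """Hybrid sketch: try the accessibility tree first, and fall back
    to a screenshot-based click only when the app exposes no usable
    markup. Both callables are hypothetical stand-ins."""
    try:
        return ax_click(element_query)        # fast, precise, local
    except LookupError:                       # no matching element in the tree
        return screenshot_click(element_query)  # slower vision-based fallback
```

The common case keeps the speed, cost, and privacy properties of the accessibility path; only poorly-marked-up apps pay the screenshot penalty.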

The Bottom Line

The AI industry is converging on screenshot-based computer control because it is the easiest path to a cross-platform product. But the data shows it is 3x less reliable, 6x more expensive in tokens, and sends 100x more private data to the cloud than accessibility API approaches.

The best AI desktop agent is the one that reads UI structure instead of staring at pixels. The same insight that made screen readers reliable 20 years ago makes AI agents reliable today.

Fazm is free and open source. Try it or read the code.
