The Seven Verbs of Desktop AI - What an Agent Actually Does

Fazm Team · 2 min read

ai-agent · ui-automation · accessibility-api · desktop-agent · macos

The Seven Verbs

Strip away the marketing and an AI desktop agent does exactly seven things: click, scroll, type, read, open, press, and traverse. That is the entire vocabulary. Every workflow you can imagine - from filing expenses to organizing your desktop - reduces to some combination of these primitives.
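The vocabulary is small enough to write down as a type. A minimal sketch in Swift (the `Verb` enum and the `invoiceWorkflow` sequence are illustrative names, not Fazm's actual API):

```swift
// The seven primitives a desktop agent executes.
// Illustrative sketch; names are hypothetical, not Fazm's API.
enum Verb: String, CaseIterable {
    case click, scroll, type, read, open, press, traverse
}

// Every workflow reduces to a sequence of these verbs.
let invoiceWorkflow: [Verb] = [.open, .traverse, .read, .type, .click, .traverse]

print(Verb.allCases.count)  // 7: the whole vocabulary
print(invoiceWorkflow.map(\.rawValue).joined(separator: " -> "))
```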

This matters because it grounds the conversation. When someone says "my AI agent automated my invoice workflow," what actually happened is the agent opened an app, read some text fields, typed values into a form, clicked submit, and traversed to the next screen.

Why Primitives Matter

Most AI agent frameworks try to abstract away these operations. They talk about "goals" and "plans" and "reasoning." But the actual interface with your computer is always physical - a click at coordinates, a keystroke, a scroll event.

The agents that work well are the ones that execute these primitives reliably through native APIs rather than trying to guess from screenshots. On macOS, the accessibility API provides structured access to every UI element - buttons, text fields, menus, labels. The agent does not need to "see" a button. It reads the accessibility tree and knows exactly where every interactive element is.
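What the accessibility tree hands back can be modeled as plain structured data. A hedged sketch, where `AXNode` and `Frame` are illustrative stand-ins for live `AXUIElement` values (the real API goes through the ApplicationServices framework and requires accessibility permissions):

```swift
// Illustrative model of one accessibility element: the kind of
// structured data macOS exposes through attributes like
// kAXRoleAttribute, kAXTitleAttribute, and kAXPositionAttribute.
struct Frame {
    let x, y, width, height: Double
}

struct AXNode {
    let role: String        // e.g. "AXButton", "AXTextField"
    let title: String       // e.g. "Submit"
    let frame: Frame        // exact on-screen geometry: no pixel guessing
    var children: [AXNode] = []
}

// Depth-first search for the first element matching role and title.
func find(_ root: AXNode, role: String, title: String) -> AXNode? {
    if root.role == role && root.title == title { return root }
    for child in root.children {
        if let hit = find(child, role: role, title: title) { return hit }
    }
    return nil
}
```

With data like this, "click the Submit button" is a lookup plus a coordinate, not an image-recognition problem.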

The Traversal Step

The most underrated verb is "traverse." Before an agent can click or type, it needs to build a map of what is on screen. Every action triggers a fresh traversal of the accessibility tree - reading every element, its role, its value, its position. This is perception, not reasoning. The agent is constantly rebuilding its understanding of the UI state.
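The rebuild-on-every-action loop can be sketched as a pure function from tree to flat snapshot. `UINode` here is a hypothetical stand-in for live accessibility elements:

```swift
// One perception pass: flatten the current UI tree into indented rows.
// `UINode` is a hypothetical stand-in for live accessibility elements;
// in practice the agent reruns this after every action.
struct UINode {
    let role: String
    let title: String
    var children: [UINode] = []
}

func traverse(_ root: UINode, depth: Int = 0) -> [String] {
    var rows = ["\(String(repeating: "  ", count: depth))\(root.role) \"\(root.title)\""]
    for child in root.children {
        rows += traverse(child, depth: depth + 1)
    }
    return rows
}
```

The point of the sketch: traversal produces a fresh description of the screen each time, so the agent never acts on a stale map.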

What This Means for Reliability

Keeping the vocabulary small and well-defined is what makes desktop agents reliable. Each primitive can be tested independently. Failures are easy to diagnose - either the click landed or it did not. Either the text field contained the expected value or it did not.
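Because each primitive has a binary, checkable outcome, a post-condition check is a one-liner. A hypothetical sketch (not Fazm's actual result type):

```swift
// Outcome of one primitive: it verifiably succeeded or it did not.
// Hypothetical sketch of a post-condition check, not Fazm's API.
enum ActionResult: Equatable {
    case success
    case failure(reason: String)
}

// After `type`, confirm the field actually holds the expected value.
func verifyTyped(expected: String, fieldValue: String) -> ActionResult {
    fieldValue == expected ? .success : .failure(reason: "field contains \"\(fieldValue)\"")
}
```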

Compare this to browser agents that try to interpret rendered pixels or parse HTML. The accessibility API gives you structured, semantic data. A button is labeled "Submit." A text field has a value. There is no guessing.

Fazm is an open-source macOS AI agent, available on GitHub.
