The Seven Verbs of Desktop AI - What an Agent Actually Does

Fazm Team · 2 min read

ai-agent · ui-automation · accessibility-api · desktop-agent · macos

The Seven Verbs

Strip away the marketing and an AI desktop agent does exactly seven things: click, scroll, type, read, open, press, and traverse. That is the entire vocabulary. Every workflow you can imagine - from filing expenses to organizing your desktop - reduces to some combination of these primitives.
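The vocabulary is small enough to write down as a type. A minimal sketch in Swift (the `Verb` enum and the `invoiceWorkflow` sequence are illustrative names, not Fazm's actual API):

```swift
// The seven primitives a desktop agent executes.
// Illustrative sketch; names are hypothetical, not Fazm's API.
enum Verb: String, CaseIterable {
    case click, scroll, type, read, open, press, traverse
}

// Every workflow reduces to a sequence of these verbs.
let invoiceWorkflow: [Verb] = [.open, .traverse, .read, .type, .click, .traverse]

print(Verb.allCases.count)  // 7: the whole vocabulary
print(invoiceWorkflow.map(\.rawValue).joined(separator: " -> "))
```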

This matters because it grounds the conversation. When someone says "my AI agent automated my invoice workflow," what actually happened is the agent opened an app, read some text fields, typed values into a form, clicked submit, and traversed to the next screen.

Why Primitives Matter

Most AI agent frameworks try to abstract away these operations. They talk about "goals" and "plans" and "reasoning." But the actual interface with your computer is always physical - a click at coordinates, a keystroke, a scroll event.

The agents that work well are the ones that execute these primitives reliably through native APIs rather than trying to guess from screenshots. On macOS, the accessibility API provides structured access to every UI element - buttons, text fields, menus, labels. The agent does not need to "see" a button. It reads the accessibility tree and knows exactly where every interactive element is.
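What the accessibility tree hands back can be modeled as plain structured data. A hedged sketch, where `AXNode` and `Frame` are illustrative stand-ins for live `AXUIElement` values (the real API goes through the ApplicationServices framework and requires accessibility permissions):

```swift
// Illustrative model of one accessibility element: the kind of
// structured data macOS exposes through attributes like
// kAXRoleAttribute, kAXTitleAttribute, and kAXPositionAttribute.
struct Frame {
    let x, y, width, height: Double
}

struct AXNode {
    let role: String        // e.g. "AXButton", "AXTextField"
    let title: String       // e.g. "Submit"
    let frame: Frame        // exact on-screen geometry: no pixel guessing
    var children: [AXNode] = []
}

// Depth-first search for the first element matching role and title.
func find(_ root: AXNode, role: String, title: String) -> AXNode? {
    if root.role == role && root.title == title { return root }
    for child in root.children {
        if let hit = find(child, role: role, title: title) { return hit }
    }
    return nil
}
```

With data like this, "click the Submit button" is a lookup plus a coordinate, not an image-recognition problem.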

The Traversal Step

The most underrated verb is "traverse." Before an agent can click or type, it needs to build a map of what is on screen. Every action triggers a fresh traversal of the accessibility tree - reading every element, its role, its value, its position. This is perception, not reasoning. The agent is constantly rebuilding its understanding of the UI state.
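The rebuild-on-every-action loop can be sketched as a pure function from tree to flat snapshot. `UINode` here is a hypothetical stand-in for live accessibility elements:

```swift
// One perception pass: flatten the current UI tree into indented rows.
// `UINode` is a hypothetical stand-in for live accessibility elements;
// in practice the agent reruns this after every action.
struct UINode {
    let role: String
    let title: String
    var children: [UINode] = []
}

func traverse(_ root: UINode, depth: Int = 0) -> [String] {
    var rows = ["\(String(repeating: "  ", count: depth))\(root.role) \"\(root.title)\""]
    for child in root.children {
        rows += traverse(child, depth: depth + 1)
    }
    return rows
}
```

The point of the sketch: traversal produces a fresh description of the screen each time, so the agent never acts on a stale map.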

What This Means for Reliability

Keeping the vocabulary small and well-defined is what makes desktop agents reliable. Each primitive can be tested independently. Failures are easy to diagnose - either the click landed or it did not. Either the text field contained the expected value or it did not.
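Because each primitive has a binary, checkable outcome, a post-condition check is a one-liner. A hypothetical sketch (not Fazm's actual result type):

```swift
// Outcome of one primitive: it verifiably succeeded or it did not.
// Hypothetical sketch of a post-condition check, not Fazm's API.
enum ActionResult: Equatable {
    case success
    case failure(reason: String)
}

// After `type`, confirm the field actually holds the expected value.
func verifyTyped(expected: String, fieldValue: String) -> ActionResult {
    fieldValue == expected ? .success : .failure(reason: "field contains \"\(fieldValue)\"")
}
```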

Compare this to browser agents that try to interpret rendered pixels or parse HTML. The accessibility API gives you structured, semantic data. A button is labeled "Submit." A text field has a value. There is no guessing.

Fazm is an open-source macOS AI agent, available on GitHub.
