FM Agent: How Foundation Model Agents Actually Work on Your Desktop
An FM agent is a software system that uses a foundation model (GPT, Claude, Gemini, Llama, or similar) as its reasoning core while interacting with real applications on your computer. Unlike chatbots that just generate text, FM agents can click buttons, fill forms, run terminal commands, and chain multi-step workflows across apps. Here is how they work, where they fail, and what you can build with one today.
What Makes an Agent an "FM Agent"
The term "FM agent" distinguishes agents powered by foundation models from older rule-based automation (RPA, AppleScript macros, Selenium scripts). The difference is not cosmetic.
| Property | Rule-based agent | FM agent |
|---|---|---|
| Input handling | Fixed selectors, pixel coordinates | Natural language, accessibility trees, screenshots |
| Failure mode | Breaks when UI changes | Adapts to layout shifts, new button labels |
| Task specification | Scripted step-by-step | Describe the goal, agent figures out steps |
| Learning | Manual script updates | Context window, memory files, tool discovery |
| Setup cost | Hours of recording/scripting per workflow | Minutes of natural language instruction |
Rule-based agents are brittle but predictable. FM agents are flexible but probabilistic. Neither is strictly better; the right choice depends on whether your workflow changes frequently.
The Architecture of an FM Agent
Every FM agent follows the same basic loop: perceive, reason, act, verify.
Perception layer
The agent needs to understand what is on screen. Two main approaches exist:
- Screenshot analysis: Send a screenshot to a vision-capable FM. Works everywhere but is slow (500ms+ per frame) and imprecise for small UI elements.
- Accessibility tree: Query the OS accessibility API to get a structured representation of every button, text field, and label. Much faster (under 50ms on macOS) and more reliable for clicking.
Most production FM agents combine both. The accessibility tree handles structured interactions while screenshots catch visual context the tree misses (charts, images, notification badges).
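The "filter the tree before prompting" step can be sketched in Python. The node schema here (`role`, `title`, `visible`, `children`) is an assumption for illustration, not any specific platform's API:

```python
def compact_tree(node, depth=0, max_depth=10):
    """Flatten a (hypothetical) accessibility-tree dict into one short line
    per interactive element, so the FM sees a compact summary of the screen
    instead of raw nested JSON."""
    lines = []
    if depth > max_depth or not node.get("visible", True):
        return lines  # skip hidden subtrees entirely
    role = node.get("role", "")
    # Only surface roles the agent can act on; containers add noise.
    if role in {"button", "textField", "link", "checkBox", "menuItem"}:
        lines.append(f"{'  ' * depth}{role}: {node.get('title', '')!r}")
    for child in node.get("children", []):
        lines.extend(compact_tree(child, depth + 1, max_depth))
    return lines
```

Feeding the model a few dozen such lines instead of the full tree keeps prompts small and clicks unambiguous.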
Reasoning layer
This is where the foundation model earns its name. Given the current screen state and the user's goal, the FM:
- Identifies which element to interact with
- Decides what type of action to take (click, type, scroll, wait)
- Generates the specific parameters (coordinates, text content, key combination)
- Evaluates whether the current state requires a different plan
The quality of this reasoning determines everything. A model that reliably identifies "the Send button in the lower right of the compose window" versus "any blue button on screen" is the difference between a useful agent and a frustrating one.
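Because the model returns its decision as text, a thin parsing layer keeps the loop robust. A minimal sketch; the fence-stripping heuristic is an assumption about typical model behavior, not a guarantee:

```python
import json

VALID_ACTIONS = {"click", "type", "scroll", "wait", "done"}

def parse_action(raw: str) -> dict:
    """Parse and validate the model's JSON reply. Models sometimes wrap
    JSON in prose or code fences, so extract the first {...} span first."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object in model reply: {raw!r}")
    action = json.loads(raw[start:end + 1])
    if action.get("action") not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action.get('action')!r}")
    return action
```

Rejecting malformed or unknown actions here, before anything touches the OS, is cheap insurance against the model improvising.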
Action layer
FM agents execute actions through OS-level APIs:
```swift
// macOS: click via the accessibility API (simplified; the real calls are
// C-style, e.g. AXUIElementPerformAction(element, kAXPressAction as CFString))
let element = findElement(role: .button, title: "Send")
element.performAction(.press)

// macOS: type text
let textField = findElement(role: .textField, identifier: "search-input")
textField.setValue("fm agent architecture")
```
On macOS, this means AXUIElement from the Accessibility framework. On Windows, it is UI Automation. On Linux, AT-SPI2. The browser equivalent is the DOM, which is why browser agents matured faster: the DOM is a cleaner API than any OS accessibility tree.
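A dispatch table keeps the action layer decoupled from the platform API. A minimal sketch, assuming the handler functions themselves wrap the AXUIElement, UI Automation, or AT-SPI2 calls:

```python
def make_dispatcher(handlers):
    """Map validated action dicts onto OS-level handlers. `handlers` is a
    dict like {"click": fn, "type": fn}; each fn receives the full action
    dict and performs the platform-specific call."""
    def dispatch(action):
        kind = action["action"]
        if kind not in handlers:
            raise KeyError(f"no handler for action {kind!r}")
        return handlers[kind](action)
    return dispatch
```

Swapping the handler dict is then all it takes to port the same agent loop from macOS to Windows or Linux.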
Verification layer
After every action, the agent checks whether it worked. This is the step most toy implementations skip and production agents cannot afford to. You re-read the screen, compare against the expected state, and decide whether to proceed, retry, or ask for help.
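The verify step can be folded into a small retry wrapper. A sketch, with `do_action` and `read_state` standing in for the platform-specific pieces:

```python
import time

def act_and_verify(do_action, read_state, expected, retries=2, delay=0.5):
    """Run an action, then re-read the screen and confirm the expected
    change appeared; retry the action a bounded number of times.
    `expected` is a predicate over the freshly-read state."""
    for attempt in range(retries + 1):
        do_action()
        time.sleep(delay)  # give the UI a moment to update
        if expected(read_state()):
            return True
    return False  # caller escalates to the human
```

Bounding the retries matters: an unbounded loop against a button that never works is how agents burn tokens for nothing.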
Where FM Agents Break Down
Warning
FM agents are not magic. They fail in predictable ways, and understanding those failure modes is the difference between a useful tool and a time sink.
Authentication walls. OAuth popups, 2FA prompts, and CAPTCHAs require human intervention. An FM agent can navigate to the login page but cannot solve a reCAPTCHA or read your authenticator app. Design workflows that handle auth separately.
Dynamic content timing. The agent clicks "Load More" and immediately reads the page before new content renders. Race conditions between action and observation are the single most common failure mode. The fix is explicit waits: re-read the screen after each action and confirm the expected change appeared.
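An explicit wait is a few lines. A minimal helper, assuming `condition` is a zero-argument callable that re-reads the screen on each call:

```python
import time

def wait_for(condition, timeout=5.0, poll=0.25):
    """Poll `condition` until it returns truthy or the timeout elapses.
    Avoids the read-before-render race after actions like clicking
    "Load More"."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("expected UI change never appeared")
```

Raising on timeout, rather than returning a falsy value, forces the caller to decide between retrying and escalating.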
Ambiguous instructions. "Clean up my desktop" could mean organize files into folders, delete old screenshots, or close open windows. FM agents guess, and they guess wrong roughly 30% of the time on ambiguous commands. Be specific: "Move all .png files from Desktop to ~/Screenshots and delete files older than 30 days."
Cost at scale. Each perceive-reason-act cycle costs tokens. A simple 5-step workflow might use 10,000-50,000 tokens. Running that hourly adds up. For high-frequency automations, consider whether a traditional script handles the stable parts while the FM agent handles only the variable decisions.
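A back-of-the-envelope cost check is worth doing before scheduling a workflow. The token count and per-million-token price below are illustrative placeholders, not any provider's real rates:

```python
def estimate_cost(steps, tokens_per_step=5_000, price_per_mtok=3.0):
    """Rough USD cost of one workflow run: total tokens consumed across
    all perceive-reason-act cycles, priced per million tokens."""
    tokens = steps * tokens_per_step
    return tokens * price_per_mtok / 1_000_000
```

At these placeholder rates a 5-step run costs about $0.075, which is negligible once but adds up to roughly $54 per month if run hourly.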
Running an FM Agent Locally
Local execution matters for FM agents because it keeps your screen data, keystrokes, and file contents off remote servers. Here is what a minimal setup looks like on macOS:
```shell
# Install Fazm (open source macOS FM agent)
brew install m13v/tap/fazm

# Grant accessibility permissions:
# System Settings > Privacy & Security > Accessibility > Enable Fazm

# Run with a local model (Ollama)
fazm --model ollama/llama3.2 --voice

# Or with the Claude API
export ANTHROPIC_API_KEY="sk-ant-..."
fazm --model claude-sonnet --voice
```
The key requirements:
- Accessibility permission, so the agent can read the UI tree and perform clicks
- Screen Recording permission, if the agent uses screenshot analysis
- A model backend: a local model served by Ollama, or an API key for a hosted FM
FM Agent vs. Browser Agent vs. RPA
People often conflate these three categories. They solve different problems:
| Capability | FM agent (desktop) | Browser agent | Traditional RPA |
|---|---|---|---|
| Scope | Any app on the OS | Browser tabs only | Scripted apps only |
| UI understanding | Accessibility tree + vision | DOM + selectors | Pixel matching / selectors |
| Handles UI changes | Yes, via FM reasoning | Partially (DOM shifts) | No, breaks immediately |
| Speed per action | 200-800ms | 100-500ms | 10-50ms |
| Setup effort | Low (natural language) | Low (natural language) | High (recording/scripting) |
| Reliability (stable UI) | 85-92% | 90-95% | 99%+ |
| Reliability (changing UI) | 80-88% | 70-85% | 0-20% |
| Privacy | Can run fully local | Usually cloud-based | Local |
The sweet spot for FM agents is cross-app workflows on a desktop where the UI changes periodically. If your task lives entirely in the browser, a browser agent is more reliable. If the UI never changes, a traditional script is faster and cheaper.
Building Your Own FM Agent
If you want to build an FM agent from scratch rather than using an existing one, here is the minimal architecture:
```python
import subprocess
import json

def perceive():
    """Get structured UI state via accessibility API."""
    result = subprocess.run(
        ["swift", "-e", """
        import Cocoa
        let app = NSWorkspace.shared.frontmostApplication!
        let appRef = AXUIElementCreateApplication(app.processIdentifier)
        // ... traverse tree, output JSON
        """],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def reason(state, goal, history):
    """Ask the FM what to do next."""
    prompt = f"""
    Current screen state: {json.dumps(state)}
    Goal: {goal}
    Actions taken so far: {history}
    What is the next single action to take?
    Respond as JSON: {{"action": "click|type|scroll|done", "target": "...", "value": "..."}}
    """
    return call_foundation_model(prompt)

def act(action):
    """Execute the action via OS APIs."""
    if action["action"] == "click":
        click_element(action["target"])
    elif action["action"] == "type":
        type_text(action["target"], action["value"])

def run_agent(goal, max_steps=20):
    history = []
    for step in range(max_steps):
        state = perceive()
        action = reason(state, goal, history)
        if action["action"] == "done":
            return True
        act(action)
        history.append(action)
    return False
```
This is a skeleton. A production agent adds retry logic, screenshot fallbacks, permission handling, cost tracking, and persistent memory. But the core loop is always perceive, reason, act.
Common Pitfalls
- Skipping verification after actions. The agent clicks a button, assumes it worked, and moves on. Ten steps later the workflow fails because the click didn't register. Always re-read the screen after acting.
- Oversized context windows. Sending the full accessibility tree (thousands of elements) to the FM wastes tokens and confuses the model. Filter to the active window and visible elements only.
- No cost tracking. A runaway FM agent can burn through $50 of API credits in an hour if it gets stuck in a retry loop. Add per-session spending caps.
- Ignoring the model's uncertainty. When the FM says "I'm not sure which button to click," that is a signal to ask the human, not to guess. Surface confidence scores and halt on low confidence.
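A per-session spending cap can be a few lines. A sketch; the per-call costs would come from your provider's usage metadata:

```python
class SpendingCap:
    """Abort the session once cumulative model spend crosses a hard cap.
    Charge this after every FM call so a retry loop stops itself."""
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        self.spent += cost_usd
        if self.spent > self.cap_usd:
            raise RuntimeError(
                f"spending cap hit: ${self.spent:.2f} > ${self.cap_usd:.2f}"
            )
```

Raising an exception, rather than silently skipping calls, guarantees a stuck agent stops instead of continuing with a degraded loop.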
Checklist for Deploying an FM Agent
Tip
Start with a single, well-defined workflow before expanding. An FM agent that reliably handles one task is more valuable than one that occasionally handles twenty.
- Define the workflow in plain language with clear start and end conditions
- Grant the minimum OS permissions needed (accessibility, screen recording)
- Choose an FM provider (local for privacy, API for capability)
- Set a spending cap per session and per day
- Add logging for every action the agent takes
- Test on your actual apps with your actual data
- Run supervised for the first week before enabling autonomous mode
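The logging item from the checklist can be as simple as append-only JSONL, one line per action, so failed runs can be replayed and audited. The path and record schema here are illustrative:

```python
import json
import time

def log_action(path, step, action, ok):
    """Append one JSON record per agent action: timestamp, step index,
    the action dict the FM produced, and whether verification passed."""
    record = {"ts": time.time(), "step": step, "action": action, "ok": ok}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

During the supervised first week, this log is what tells you whether the agent is ready for autonomous mode.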
Wrapping Up
FM agents combine foundation model reasoning with OS-level control to automate real desktop workflows. They are not replacements for scripts or RPA when those tools work, but they handle the messy, variable, cross-app tasks that traditional automation cannot touch. Start with a specific workflow, verify every action, and keep a human in the loop until you trust the results.
Fazm is an open source macOS AI agent that runs FM-powered automation locally. The code is available on GitHub.