FM Agent: How Foundation Model Agents Actually Work on Your Desktop
An FM agent is a software system that uses a foundation model (GPT, Claude, Gemini, Llama, or similar) as its reasoning core while interacting with real applications on your computer. Unlike chatbots that just generate text, FM agents can click buttons, fill forms, run terminal commands, and chain multi-step workflows across apps. Here is how they work, where they fail, and what you can build with one today.
What Makes an Agent an "FM Agent"
The term "FM agent" distinguishes agents powered by foundation models from older rule-based automation (RPA, AppleScript macros, Selenium scripts). The difference is not cosmetic.
| Property | Rule-based agent | FM agent |
|---|---|---|
| Input handling | Fixed selectors, pixel coordinates | Natural language, accessibility trees, screenshots |
| Failure mode | Breaks when UI changes | Adapts to layout shifts, new button labels |
| Task specification | Scripted step-by-step | Describe the goal, agent figures out steps |
| Learning | Manual script updates | Context window, memory files, tool discovery |
| Setup cost | Hours of recording/scripting per workflow | Minutes of natural language instruction |
Rule-based agents are brittle but predictable. FM agents are flexible but probabilistic. Neither is strictly better; the right choice depends on whether your workflow changes frequently.
The Architecture of an FM Agent
Every FM agent follows the same basic loop: perceive, reason, act, verify.
Perception layer
The agent needs to understand what is on screen. Two main approaches exist:
- Screenshot analysis: Send a screenshot to a vision-capable FM. Works everywhere but is slow (500ms+ per frame) and imprecise for small UI elements.
- Accessibility tree: Query the OS accessibility API to get a structured representation of every button, text field, and label. Much faster (under 50ms on macOS) and more reliable for clicking.
Most production FM agents combine both. The accessibility tree handles structured interactions while screenshots catch visual context the tree misses (charts, images, notification badges).
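The "filter the tree before prompting" step can be sketched in Python. The node schema here (`role`, `title`, `visible`, `children`) is an assumption for illustration, not any specific platform's API:

```python
def compact_tree(node, depth=0, max_depth=10):
    """Flatten a (hypothetical) accessibility-tree dict into one short line
    per interactive element, so the FM sees a compact summary of the screen
    instead of raw nested JSON."""
    lines = []
    if depth > max_depth or not node.get("visible", True):
        return lines  # skip hidden subtrees entirely
    role = node.get("role", "")
    # Only surface roles the agent can act on; containers add noise.
    if role in {"button", "textField", "link", "checkBox", "menuItem"}:
        lines.append(f"{'  ' * depth}{role}: {node.get('title', '')!r}")
    for child in node.get("children", []):
        lines.extend(compact_tree(child, depth + 1, max_depth))
    return lines
```

Feeding the model a few dozen such lines instead of the full tree keeps prompts small and clicks unambiguous.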
Reasoning layer
This is where the foundation model earns its name. Given the current screen state and the user's goal, the FM:
- Identifies which element to interact with
- Decides what type of action to take (click, type, scroll, wait)
- Generates the specific parameters (coordinates, text content, key combination)
- Evaluates whether the current state requires a different plan
The quality of this reasoning determines everything. A model that reliably identifies "the Send button in the lower right of the compose window" versus "any blue button on screen" is the difference between a useful agent and a frustrating one.
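Because the model returns its decision as text, a thin parsing layer keeps the loop robust. A minimal sketch; the fence-stripping heuristic is an assumption about typical model behavior, not a guarantee:

```python
import json

VALID_ACTIONS = {"click", "type", "scroll", "wait", "done"}

def parse_action(raw: str) -> dict:
    """Parse and validate the model's JSON reply. Models sometimes wrap
    JSON in prose or code fences, so extract the first {...} span first."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"no JSON object in model reply: {raw!r}")
    action = json.loads(raw[start:end + 1])
    if action.get("action") not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action.get('action')!r}")
    return action
```

Rejecting malformed or unknown actions here, before anything touches the OS, is cheap insurance against the model improvising.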
Action layer
FM agents execute actions through OS-level APIs:
```swift
// macOS: click via the accessibility API (simplified; the real calls are
// C-style, e.g. AXUIElementPerformAction(element, kAXPressAction as CFString))
let element = findElement(role: .button, title: "Send")
element.performAction(.press)

// macOS: type text
let textField = findElement(role: .textField, identifier: "search-input")
textField.setValue("fm agent architecture")
```
On macOS, this means AXUIElement from the Accessibility framework. On Windows, it is UI Automation. On Linux, AT-SPI2. The browser equivalent is the DOM, which is why browser agents matured faster: the DOM is a cleaner API than any OS accessibility tree.
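A dispatch table keeps the action layer decoupled from the platform API. A minimal sketch, assuming the handler functions themselves wrap the AXUIElement, UI Automation, or AT-SPI2 calls:

```python
def make_dispatcher(handlers):
    """Map validated action dicts onto OS-level handlers. `handlers` is a
    dict like {"click": fn, "type": fn}; each fn receives the full action
    dict and performs the platform-specific call."""
    def dispatch(action):
        kind = action["action"]
        if kind not in handlers:
            raise KeyError(f"no handler for action {kind!r}")
        return handlers[kind](action)
    return dispatch
```

Swapping the handler dict is then all it takes to port the same agent loop from macOS to Windows or Linux.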
Verification layer
After every action, the agent checks whether it worked. This is the step most toy implementations skip and production agents cannot afford to. You re-read the screen, compare against the expected state, and decide whether to proceed, retry, or ask for help.
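The verify step can be folded into a small retry wrapper. A sketch, with `do_action` and `read_state` standing in for the platform-specific pieces:

```python
import time

def act_and_verify(do_action, read_state, expected, retries=2, delay=0.5):
    """Run an action, then re-read the screen and confirm the expected
    change appeared; retry the action a bounded number of times.
    `expected` is a predicate over the freshly-read state."""
    for attempt in range(retries + 1):
        do_action()
        time.sleep(delay)  # give the UI a moment to update
        if expected(read_state()):
            return True
    return False  # caller escalates to the human
```

Bounding the retries matters: an unbounded loop against a button that never works is how agents burn tokens for nothing.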
Where FM Agents Break Down
Warning
FM agents are not magic. They fail in predictable ways, and understanding those failure modes is the difference between a useful tool and a time sink.
Authentication walls. OAuth popups, 2FA prompts, and CAPTCHAs require human intervention. An FM agent can navigate to the login page but cannot solve a reCAPTCHA or read your authenticator app. Design workflows that handle auth separately.
Dynamic content timing. The agent clicks "Load More" and immediately reads the page before new content renders. Race conditions between action and observation are the single most common failure mode. The fix is explicit waits: re-read the screen after each action and confirm the expected change appeared.
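An explicit wait is a few lines. A minimal helper, assuming `condition` is a zero-argument callable that re-reads the screen on each call:

```python
import time

def wait_for(condition, timeout=5.0, poll=0.25):
    """Poll `condition` until it returns truthy or the timeout elapses.
    Avoids the read-before-render race after actions like clicking
    "Load More"."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("expected UI change never appeared")
```

Raising on timeout, rather than returning a falsy value, forces the caller to decide between retrying and escalating.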
Ambiguous instructions. "Clean up my desktop" could mean organize files into folders, delete old screenshots, or close open windows. FM agents guess, and they guess wrong roughly 30% of the time on ambiguous commands. Be specific: "Move all .png files from Desktop to ~/Screenshots and delete files older than 30 days."
Cost at scale. Each perceive-reason-act cycle costs tokens. A simple 5-step workflow might use 10,000-50,000 tokens. Running that hourly adds up. For high-frequency automations, consider whether a traditional script handles the stable parts while the FM agent handles only the variable decisions.
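A back-of-the-envelope cost check is worth doing before scheduling a workflow. The token count and per-million-token price below are illustrative placeholders, not any provider's real rates:

```python
def estimate_cost(steps, tokens_per_step=5_000, price_per_mtok=3.0):
    """Rough USD cost of one workflow run: total tokens consumed across
    all perceive-reason-act cycles, priced per million tokens."""
    tokens = steps * tokens_per_step
    return tokens * price_per_mtok / 1_000_000
```

At these placeholder rates a 5-step run costs about $0.075, which is negligible once but adds up to roughly $54 per month if run hourly.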
Running an FM Agent Locally
Local execution matters for FM agents because it keeps your screen data, keystrokes, and file contents off remote servers. Here is what a minimal setup looks like on macOS:
```shell
# Install Fazm (open source macOS FM agent)
brew install m13v/tap/fazm

# Grant accessibility permissions:
# System Settings > Privacy & Security > Accessibility > Enable Fazm

# Run with a local model (Ollama)
fazm --model ollama/llama3.2 --voice

# Or with the Claude API
export ANTHROPIC_API_KEY="sk-ant-..."
fazm --model claude-sonnet --voice
```
The key requirements:
- Accessibility permission, so the agent can read the UI tree and perform clicks
- Screen Recording permission, if the agent uses screenshot analysis
- A model backend: a local model served by Ollama, or an API key for a hosted FM
FM Agent vs. Browser Agent vs. RPA
People often conflate these three categories. They solve different problems:
| Capability | FM agent (desktop) | Browser agent | Traditional RPA |
|---|---|---|---|
| Scope | Any app on the OS | Browser tabs only | Scripted apps only |
| UI understanding | Accessibility tree + vision | DOM + selectors | Pixel matching / selectors |
| Handles UI changes | Yes, via FM reasoning | Partially (DOM shifts) | No, breaks immediately |
| Speed per action | 200-800ms | 100-500ms | 10-50ms |
| Setup effort | Low (natural language) | Low (natural language) | High (recording/scripting) |
| Reliability (stable UI) | 85-92% | 90-95% | 99%+ |
| Reliability (changing UI) | 80-88% | 70-85% | 0-20% |
| Privacy | Can run fully local | Usually cloud-based | Local |
The sweet spot for FM agents is cross-app workflows on a desktop where the UI changes periodically. If your task lives entirely in the browser, a browser agent is more reliable. If the UI never changes, a traditional script is faster and cheaper.
Building Your Own FM Agent
If you want to build an FM agent from scratch rather than using an existing one, here is the minimal architecture:
```python
import subprocess
import json

def perceive():
    """Get structured UI state via accessibility API."""
    result = subprocess.run(
        ["swift", "-e", """
        import Cocoa
        let app = NSWorkspace.shared.frontmostApplication!
        let appRef = AXUIElementCreateApplication(app.processIdentifier)
        // ... traverse tree, output JSON
        """],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def reason(state, goal, history):
    """Ask the FM what to do next."""
    prompt = f"""
    Current screen state: {json.dumps(state)}
    Goal: {goal}
    Actions taken so far: {history}
    What is the next single action to take?
    Respond as JSON: {{"action": "click|type|scroll|done", "target": "...", "value": "..."}}
    """
    return call_foundation_model(prompt)

def act(action):
    """Execute the action via OS APIs."""
    if action["action"] == "click":
        click_element(action["target"])
    elif action["action"] == "type":
        type_text(action["target"], action["value"])

def run_agent(goal, max_steps=20):
    history = []
    for step in range(max_steps):
        state = perceive()
        action = reason(state, goal, history)
        if action["action"] == "done":
            return True
        act(action)
        history.append(action)
    return False
```
This is a skeleton. A production agent adds retry logic, screenshot fallbacks, permission handling, cost tracking, and persistent memory. But the core loop is always perceive, reason, act.
Common Pitfalls
- Skipping verification after actions. The agent clicks a button, assumes it worked, and moves on. Ten steps later the workflow fails because the click didn't register. Always re-read the screen after acting.
- Oversized context windows. Sending the full accessibility tree (thousands of elements) to the FM wastes tokens and confuses the model. Filter to the active window and visible elements only.
- No cost tracking. A runaway FM agent can burn through $50 of API credits in an hour if it gets stuck in a retry loop. Add per-session spending caps.
- Ignoring the model's uncertainty. When the FM says "I'm not sure which button to click," that is a signal to ask the human, not to guess. Surface confidence scores and halt on low confidence.
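A per-session spending cap can be a few lines. A sketch; the per-call costs would come from your provider's usage metadata:

```python
class SpendingCap:
    """Abort the session once cumulative model spend crosses a hard cap.
    Charge this after every FM call so a retry loop stops itself."""
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        self.spent += cost_usd
        if self.spent > self.cap_usd:
            raise RuntimeError(
                f"spending cap hit: ${self.spent:.2f} > ${self.cap_usd:.2f}"
            )
```

Raising an exception, rather than silently skipping calls, guarantees a stuck agent stops instead of continuing with a degraded loop.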
Checklist for Deploying an FM Agent
Tip
Start with a single, well-defined workflow before expanding. An FM agent that reliably handles one task is more valuable than one that occasionally handles twenty.
- Define the workflow in plain language with clear start and end conditions
- Grant the minimum OS permissions needed (accessibility, screen recording)
- Choose an FM provider (local for privacy, API for capability)
- Set a spending cap per session and per day
- Add logging for every action the agent takes
- Test on your actual apps with your actual data
- Run supervised for the first week before enabling autonomous mode
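The logging item from the checklist can be as simple as append-only JSONL, one line per action, so failed runs can be replayed and audited. The path and record schema here are illustrative:

```python
import json
import time

def log_action(path, step, action, ok):
    """Append one JSON record per agent action: timestamp, step index,
    the action dict the FM produced, and whether verification passed."""
    record = {"ts": time.time(), "step": step, "action": action, "ok": ok}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

During the supervised first week, this log is what tells you whether the agent is ready for autonomous mode.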
Wrapping Up
FM agents combine foundation model reasoning with OS-level control to automate real desktop workflows. They are not replacements for scripts or RPA when those tools work, but they handle the messy, variable, cross-app tasks that traditional automation cannot touch. Start with a specific workflow, verify every action, and keep a human in the loop until you trust the results.
Fazm is an open source macOS AI agent that runs FM-powered automation locally. The code is available on GitHub.