# Fazm AI Desktop Agent: Open Source Automation That Controls Your Entire Computer
An AI desktop agent is software that sees your screen, understands what you are doing, and takes action on your behalf. It clicks buttons, types text, navigates between apps, and handles multi-step workflows. Fazm is an open source AI desktop agent built specifically for macOS that does all of this through native system APIs, not browser hacks or cloud screenshots.
This post explains what makes Fazm different from other desktop agents, how the architecture works, what you can actually automate with it, and how to get it running on your Mac in under five minutes.
## Why a Desktop Agent Instead of Browser Automation
Most AI automation tools live inside the browser. They can fill forms, click links, and scrape data. But your actual work happens across dozens of native apps: Finder, Mail, Calendar, Slack, Figma, Xcode, Terminal, Preview. A browser agent cannot touch any of them.
A desktop agent operates at the OS level. It can move files, compose emails, manage calendar events, control Spotify, resize images in Preview, and run terminal commands. All from a single voice command.
| Capability | Browser Agent | Desktop Agent (Fazm) |
|---|---|---|
| Web form filling | Yes | Yes (via browser control) |
| Native app control | No | Yes (Accessibility API) |
| File management | No | Yes (Finder, filesystem) |
| Multi-app workflows | No | Yes (cross-app coordination) |
| Voice activation | Rarely | Yes (always-on hotkey) |
| Screen understanding | DOM only | Full screen (ScreenCaptureKit) |
| Works offline | No | Partial (local models via Ollama) |
## How Fazm Works Under the Hood
Fazm combines three components to perceive and act on your desktop.
**Voice input** runs WhisperKit on your Apple Silicon chip. No audio leaves your machine. Latency is around 200-400 ms for a typical command, depending on your hardware (an M1 sits at the slower end; M3 Pro and above are near-instant).
**ScreenCaptureKit** grabs a frame of your display at the moment you speak. This gives the LLM visual context: what app is open, what content is visible, and where UI elements are positioned.
**Accessibility API** provides the structured data layer. Every button, text field, menu item, and label in every running app exposes its role, value, and position through the macOS accessibility tree. This is how Fazm knows exactly which element to click: not by guessing from pixels, but by reading the semantic structure of the UI.
The LLM planner receives the screenshot, the accessibility tree, and your transcribed command. It produces a sequence of actions: click this button at coordinates (x, y), type this text, press this keyboard shortcut. The action executor carries them out through CGEvent and the AXUIElement API.
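To make the planner-to-executor handoff concrete, here is a sketch of what a plan could look like as a flat list of primitive actions. The schema is hypothetical, invented for illustration; Fazm's actual plan format may differ.

```shell
# Hypothetical action plan (illustrative only; not Fazm's actual schema)
cat <<'EOF' > /tmp/plan.json
[
  {"action": "click", "x": 512, "y": 384},
  {"action": "type",  "text": "Quarterly report"},
  {"action": "key",   "combo": "cmd+s"}
]
EOF
grep -c '"action"' /tmp/plan.json   # prints 3: three primitive steps
```

The important property is that every step is dumb and mechanical; all the intelligence lives in producing this list, not executing it.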
## What You Can Automate
Here are workflows people actually run with Fazm daily:
**Email triage:** "Read my unread emails and draft replies to anything from the engineering team." Fazm opens Mail, scans the inbox, identifies senders, composes contextual replies, and leaves them as drafts for your review.
**File organization:** "Move all PDFs from Downloads to a folder called Tax 2025 on the Desktop." Fazm opens Finder, filters by file type, creates the folder if it does not exist, and moves the files.
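Stripped of the Finder UI steps, the file-moving half of that command reduces to a few shell operations. This sketch runs inside temp directories standing in for `~/Downloads` and `~/Desktop`, so it is safe to execute anywhere:

```shell
# Temp dirs standing in for ~/Downloads and ~/Desktop
downloads=$(mktemp -d)
desktop=$(mktemp -d)
touch "$downloads/w2.pdf" "$downloads/1099.pdf" "$downloads/notes.txt"

mkdir -p "$desktop/Tax 2025"            # create the folder if it does not exist
mv "$downloads"/*.pdf "$desktop/Tax 2025/"

ls "$desktop/Tax 2025"                  # the two PDFs; notes.txt stays put
```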
**Research workflows:** "Open Safari, search for the latest App Store review guidelines, and paste a summary into my Notes app." This crosses three apps in a single command.
**Development tasks:** "Open Terminal, run the test suite, and if any tests fail, open the failing test file in VS Code." Fazm can chain conditional logic across apps.
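The conditional half of that workflow is ordinary shell logic. In this sketch, `run_tests` is a stub standing in for the project's real test command (e.g. `swift test`), deliberately failing so the branch is visible, and the final step echoes instead of launching an editor:

```shell
# Stub test runner; replace with the real command (e.g. `swift test`)
run_tests() { echo "FAIL: LoginTests.swift"; return 1; }

if ! output=$(run_tests); then
  # Extract the failing file from the (stubbed) test output
  failing_file=$(printf '%s\n' "$output" | sed -n 's/^FAIL: //p')
  echo "would open: $failing_file"   # Fazm would instead run: code "$failing_file"
fi
```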
**Meeting prep:** "Check my calendar for today, find the Zoom link for the next meeting, and open it 2 minutes before start." Calendar reading plus browser automation plus timing.
> **Tip:** Start with simple, single-app commands to build confidence. "Open Finder and create a new folder called Projects" is a good first test. Once you see that working, chain multi-app workflows.
## Fazm vs Other AI Desktop Agents
The desktop agent space is growing. Here is how Fazm compares to alternatives:
| Feature | Fazm | Manus Desktop | Perplexity Comet | ChatGPT Atlas |
|---|---|---|---|---|
| Open source | Yes (MIT) | No | No | No |
| Runs locally | Yes | Partial | No (cloud VM) | No |
| Voice control | Built-in | No | No | No |
| macOS native APIs | Yes (AX + SCKit) | Limited | Screenshot only | Screenshot only |
| Privacy | On-device audio, local screen | Cloud processing | Cloud VM recording | Cloud processing |
| Custom LLM support | Yes (Ollama, any API) | GPT only | Proprietary | GPT only |
| Price | Free | Subscription | Subscription | Plus/Pro |
The main differentiator is that Fazm uses both the accessibility tree and screen capture together. Screenshot-only agents guess where to click based on pixel recognition. That breaks when the UI changes slightly, when dark mode is on, or when a modal covers part of the screen. The accessibility tree gives Fazm the exact coordinates and roles of every interactive element, making actions reliable even when the visual layout shifts.
## Getting Started in Five Minutes
### Prerequisites
- macOS 14.0 (Sonoma) or later
- Apple Silicon (M1 or newer)
- An API key for Claude, GPT, or a local Ollama model
### Install
```shell
brew install m13v/tap/fazm
```
Or clone and build from source:
```shell
git clone https://github.com/m13v/fazm.git
cd fazm
swift build -c release
```
### Grant Permissions
Fazm needs two macOS permissions to function:
- Accessibility: System Settings > Privacy & Security > Accessibility > enable Fazm
- Screen Recording: System Settings > Privacy & Security > Screen & Audio Recording > enable Fazm
Without these, the agent can hear you but cannot see or interact with your desktop.
### Configure Your LLM
On first launch, Fazm opens a settings panel. Pick your provider:
```shell
# For Claude (recommended)
export ANTHROPIC_API_KEY="sk-ant-..."

# For local inference via Ollama
ollama pull llama3.2
# Then select "Ollama" in Fazm settings
```
### Your First Command
Press the hotkey (default: Cmd+Shift+Space), speak your command, and watch Fazm work. The overlay shows each planned action before execution, so you can cancel if something looks wrong.
## Common Pitfalls
- **Forgetting to grant Screen Recording permission:** Fazm will still launch, but every screenshot will be blank. The LLM will hallucinate actions because it has no visual context. Always check permissions first.
- **Using a slow LLM for complex tasks:** A 7B local model works for simple file operations but struggles with multi-step cross-app workflows. For anything involving more than 3 steps, use Claude or GPT-4 class models.
- **Running Fazm on Intel Macs:** WhisperKit requires Apple Silicon. On Intel, voice input will not work. You can still use text-based commands through the menu bar interface, but the voice experience is Apple Silicon only.
- **Expecting pixel-perfect clicks immediately:** On first run, Fazm may take a second to index the accessibility tree of unfamiliar apps. If a click misses, try the same command again. The second attempt usually lands because the tree is now cached.
## Privacy and Security
Every AI desktop agent has access to sensitive data. Fazm mitigates this by keeping as much as possible on-device: audio is transcribed locally by WhisperKit, and screenshots go only to the LLM provider you configure, which can be a local Ollama model.
The source code is MIT licensed and fully auditable. You can inspect exactly what data is sent where by reading the network layer in Sources/Fazm/LLM/.
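One quick way to start that audit is to grep the LLM layer for outbound endpoints. The URL pattern below is a rough heuristic, and the demo builds a stand-in directory with one illustrative endpoint so the command is runnable anywhere; point it at a real checkout's `Sources/Fazm/LLM/` instead:

```shell
# Stand-in for a real checkout (file and endpoint are illustrative)
mkdir -p /tmp/fazm-audit/Sources/Fazm/LLM
printf 'let endpoint = "https://api.anthropic.com/v1/messages"\n' \
  > /tmp/fazm-audit/Sources/Fazm/LLM/Client.swift

# List every unique URL referenced in the LLM layer
grep -rEoh 'https://[A-Za-z0-9./_-]+' /tmp/fazm-audit/Sources/Fazm/LLM/ | sort -u
```

If a URL shows up here that you did not configure, you know exactly which file to read next.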
> **Warning:** Any desktop agent with Accessibility and Screen Recording permissions can see and do everything on your machine. Only install agents you trust. Fazm is open source, so you can verify the code before granting access.
## Wrapping Up
Fazm is an AI desktop agent that gives you voice-controlled automation across every app on your Mac. It combines ScreenCaptureKit for vision, the Accessibility API for precise interaction, and your choice of LLM for planning. Open source, local-first, and designed for real daily workflows, not demos.
Fazm is open source and available on GitHub.