Fazm AI Desktop Agent: Open Source Automation That Controls Your Entire Computer

Matthew Diakonov · 10 min read


An AI desktop agent is software that sees your screen, understands what you are doing, and takes action on your behalf. It clicks buttons, types text, navigates between apps, and handles multi-step workflows. Fazm is an open source AI desktop agent built specifically for macOS that does all of this through native system APIs, not browser hacks or cloud screenshots.

This post explains what makes Fazm different from other desktop agents, how the architecture works, what you can actually automate with it, and how to get it running on your Mac in under five minutes.

Why a Desktop Agent Instead of Browser Automation

Most AI automation tools live inside the browser. They can fill forms, click links, scrape data. But your actual work happens across dozens of native apps: Finder, Mail, Calendar, Slack, Figma, Xcode, Terminal, Preview. A browser agent cannot touch any of them.

A desktop agent operates at the OS level. It can move files, compose emails, manage calendar events, control Spotify, resize images in Preview, and run terminal commands. All from a single voice command.

| Capability | Browser Agent | Desktop Agent (Fazm) |
|---|---|---|
| Web form filling | Yes | Yes (via browser control) |
| Native app control | No | Yes (Accessibility API) |
| File management | No | Yes (Finder, filesystem) |
| Multi-app workflows | No | Yes (cross-app coordination) |
| Voice activation | Rarely | Yes (always-on hotkey) |
| Screen understanding | DOM only | Full screen (ScreenCaptureKit) |
| Works offline | No | Partial (local models via Ollama) |

How Fazm Works Under the Hood

Fazm combines three macOS subsystems to perceive and act on your desktop.

Voice Input (WhisperKit, on-device) → Screen Capture (ScreenCaptureKit) + Accessibility Tree (AX API, structured) → LLM Planner (Claude / GPT / Ollama) → Action Executor (clicks, types, keystrokes)

Voice Input runs WhisperKit on your Apple Silicon chip. No audio leaves your machine. Latency is around 200-400ms for a typical command depending on your hardware (M1 sits at the slower end, M3 Pro and above are near-instant).

ScreenCaptureKit grabs a frame of your display at the moment you speak. This gives the LLM visual context: what app is open, what content is visible, where UI elements are positioned.

Accessibility API provides the structured data layer. Every button, text field, menu item, and label in every running app exposes its role, value, and position through the macOS accessibility tree. This is how Fazm knows exactly which element to click, not by guessing from pixels, but by reading the semantic structure of the UI.
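The lookup can be pictured as a depth-first search over a nested tree of (role, title, frame) nodes. Here is a minimal Python sketch using a hypothetical dict snapshot of an accessibility tree; Fazm itself reads the real tree through the AXUIElement API in Swift, and real AX nodes carry many more attributes:

```python
# Hypothetical, simplified snapshot of an accessibility tree.
# Frames are (x, y, width, height).
AX_TREE = {
    "role": "AXWindow", "title": "Mail", "frame": (0, 0, 1440, 900),
    "children": [
        {"role": "AXToolbar", "title": "", "frame": (0, 0, 1440, 52), "children": [
            {"role": "AXButton", "title": "New Message", "frame": (12, 10, 90, 32), "children": []},
            {"role": "AXButton", "title": "Reply", "frame": (110, 10, 70, 32), "children": []},
        ]},
    ],
}

def find_element(node, role, title):
    """Depth-first search for the first node matching role and title."""
    if node["role"] == role and node["title"] == title:
        return node
    for child in node["children"]:
        found = find_element(child, role, title)
        if found:
            return found
    return None

def click_point(node):
    """Center of the element's frame, where a click would be issued."""
    x, y, w, h = node["frame"]
    return (x + w // 2, y + h // 2)

button = find_element(AX_TREE, "AXButton", "Reply")
print(click_point(button))  # center of the Reply button: (145, 26)
```

Because the target is resolved by role and title rather than by pixels, the same lookup keeps working when the window moves or the theme changes.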

The LLM planner receives the screenshot, the accessibility tree, and your transcribed command. It produces a sequence of actions: click this button at coordinates (x, y), type this text, press this keyboard shortcut. The action executor carries them out through CGEvent and the AXUIElement API.
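One way to picture the planner's output is a flat list of typed actions that the executor dispatches one by one. The action names and schema below are hypothetical illustrations, not Fazm's actual wire format; the real executor issues CGEvent and AXUIElement calls from Swift, while this sketch only records what it would do:

```python
# Hypothetical action schema a planner might emit.
plan = [
    {"type": "click", "x": 145, "y": 26},
    {"type": "type_text", "text": "Thanks, will review today."},
    {"type": "keystroke", "keys": "cmd+shift+d"},
]

def execute(plan, log):
    """Dispatch each action to a handler; handlers here just record the call."""
    handlers = {
        "click":     lambda a: log.append(f"click at ({a['x']}, {a['y']})"),
        "type_text": lambda a: log.append(f"type {a['text']!r}"),
        "keystroke": lambda a: log.append(f"press {a['keys']}"),
    }
    for action in plan:
        handlers[action["type"]](action)

log = []
execute(plan, log)
print(log[0])  # "click at (145, 26)"
```

Keeping the plan as data means the overlay can display every pending action before anything touches the screen.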

What You Can Automate

Here are workflows people actually run with Fazm daily:

Email triage: "Read my unread emails and draft replies to anything from the engineering team." Fazm opens Mail, scans the inbox, identifies senders, composes contextual replies, and leaves them as drafts for your review.

File organization: "Move all PDFs from Downloads to a folder called Tax 2025 on the Desktop." Fazm opens Finder, filters by file type, creates the folder if it does not exist, and moves the files.

Research workflows: "Open Safari, search for the latest App Store review guidelines, and paste a summary into my Notes app." This crosses three apps in a single command.

Development tasks: "Open Terminal, run the test suite, and if any tests fail, open the failing test file in VS Code." Fazm can chain conditional logic across apps.

Meeting prep: "Check my calendar for today, find the Zoom link for the next meeting, and open it 2 minutes before start." Calendar reading plus browser automation plus timing.
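The conditional "run the tests, then branch" pattern from the development example can be sketched as a plan whose steps carry an on-failure branch, interpreted one step at a time. This format is illustrative only, not Fazm's actual plan schema:

```python
def run_plan(steps, runner):
    """Execute steps in order; if a checked step fails, follow its on_failure branch."""
    executed = []
    for step in steps:
        executed.append(step["action"])
        if step.get("check") and not runner(step["action"]):
            executed.extend(run_plan(step["on_failure"], runner))
            break
    return executed

plan = [
    {"action": "open Terminal"},
    {"action": "run test suite", "check": True,
     "on_failure": [{"action": "open failing test in VS Code"}]},
    {"action": "report all green"},
]

# Pretend the test suite failed.
runner = lambda action: action != "run test suite"
print(run_plan(plan, runner))
```

With a passing run, the same plan falls through to the final step instead of opening the editor.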

Tip

Start with simple, single-app commands to build confidence. "Open Finder and create a new folder called Projects" is a good first test. Once you see that working, chain multi-app workflows.

Fazm vs Other AI Desktop Agents

The desktop agent space is growing. Here is how Fazm compares to alternatives:

| Feature | Fazm | Manus Desktop | Perplexity Comet | ChatGPT Atlas |
|---|---|---|---|---|
| Open source | Yes (MIT) | No | No | No |
| Runs locally | Yes | Partial | No (cloud VM) | No |
| Voice control | Built-in | No | No | No |
| macOS native APIs | Yes (AX + SCKit) | Limited | Screenshot only | Screenshot only |
| Privacy | On-device audio, local screen | Cloud processing | Cloud VM recording | Cloud processing |
| Custom LLM support | Yes (Ollama, any API) | GPT only | Proprietary | GPT only |
| Price | Free | Subscription | Subscription | Plus/Pro |

The main differentiator is that Fazm uses both the accessibility tree and screen capture together. Screenshot-only agents guess where to click based on pixel recognition. That breaks when the UI changes slightly, when dark mode is on, or when a modal covers part of the screen. The accessibility tree gives Fazm the exact coordinates and roles of every interactive element, making actions reliable even when the visual layout shifts.

Getting Started in Five Minutes

Prerequisites

  • macOS 14.0 (Sonoma) or later
  • Apple Silicon (M1 or newer)
  • An API key for Claude, GPT, or a local Ollama model

Install

brew install m13v/tap/fazm

Or clone and build from source:

git clone https://github.com/m13v/fazm.git
cd fazm
swift build -c release

Grant Permissions

Fazm needs two macOS permissions to function:

  1. Accessibility: System Settings > Privacy & Security > Accessibility > enable Fazm
  2. Screen Recording: System Settings > Privacy & Security > Screen & Audio Recording > enable Fazm

Without these, the agent can hear you but cannot see or interact with your desktop.

Configure Your LLM

On first launch, Fazm opens a settings panel. Pick your provider:

# For Claude (recommended)
export ANTHROPIC_API_KEY="sk-ant-..."

# For local inference via Ollama
ollama pull llama3.2
# Then select "Ollama" in Fazm settings
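When Fazm talks to Ollama, planning requests go to the local server's /api/generate endpoint. As a rough sketch, a request body could be assembled like this; the prompt layout and function name are illustrative, not Fazm's actual format:

```python
import json

def build_plan_request(command, ax_tree_summary, model="llama3.2"):
    """Assemble a request body for Ollama's /api/generate endpoint
    (POST to http://localhost:11434/api/generate). Prompt layout is illustrative."""
    prompt = (
        "You control a macOS desktop. Plan the next actions.\n"
        f"User command: {command}\n"
        f"Visible UI elements: {ax_tree_summary}\n"
        "Respond with a JSON list of actions."
    )
    return {"model": model, "prompt": prompt, "stream": False}

body = build_plan_request(
    "Create a folder called Projects",
    "Finder window, AXButton 'New Folder'",
)
print(json.dumps(body)[:60])
```

Everything in that request stays on localhost, which is what makes the Ollama path fully offline.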

Your First Command

Press the hotkey (default: Cmd+Shift+Space), speak your command, and watch Fazm work. The overlay shows each planned action before execution, so you can cancel if something looks wrong.

Common Pitfalls

  • Forgetting to grant Screen Recording permission: Fazm will still launch but every screenshot will be blank. The LLM will hallucinate actions because it has no visual context. Always check permissions first.
  • Using a slow LLM for complex tasks: A 7B local model works for simple file operations but struggles with multi-step cross-app workflows. For anything involving more than 3 steps, use Claude or GPT-4 class models.
  • Running Fazm on Intel Macs: WhisperKit requires Apple Silicon. On Intel, voice input will not work. You can still use text-based commands through the menu bar interface, but the voice experience is Apple Silicon only.
  • Expecting pixel-perfect clicks immediately: On first run, Fazm may take a second to index the accessibility tree of unfamiliar apps. If a click misses, try the same command again. The second attempt usually lands because the tree is now cached.

Privacy and Security

Every AI desktop agent has access to sensitive data. Fazm handles this by keeping as much as possible on-device:

  • Voice transcription runs on-device via WhisperKit. No audio leaves your Mac.
  • Screenshots are processed in memory and discarded after each action cycle.
  • When using Ollama, the entire pipeline is offline. Nothing touches any server.
  • When using cloud LLMs (Claude, GPT), screenshots are sent to the API for reasoning. Use local models if this is a concern for your workflow.

The source code is MIT licensed and fully auditable. You can inspect exactly what data is sent where by reading the network layer in Sources/Fazm/LLM/.

Warning

Any desktop agent with Accessibility and Screen Recording permissions can see and do everything on your machine. Only install agents you trust. Fazm is open source so you can verify the code before granting access.

Wrapping Up

Fazm is an AI desktop agent that gives you voice-controlled automation across every app on your Mac. It combines ScreenCaptureKit for vision, the Accessibility API for precise interaction, and your choice of LLM for planning. Open source, local-first, and designed for real daily workflows, not demos.

Fazm is an open source macOS AI agent, available on GitHub.
