Fazm macOS AI Agent: Open Source Desktop Automation That Actually Works

Matthew Diakonov · 11 min read


Fazm is an open source macOS AI agent that controls your desktop through native Apple APIs. It reads your screen with ScreenCaptureKit, interacts with apps through the Accessibility framework, and responds to voice commands, all running locally on your Mac.

If you have tried other desktop AI tools and found them locked to a single browser or unable to interact with native macOS apps, Fazm takes a different approach. It operates at the OS level, which means it can work with any application that macOS exposes through its accessibility tree.

How Fazm Works on macOS

Fazm combines three macOS technologies to give the AI agent full desktop awareness:

  1. ScreenCaptureKit captures what is on screen in real time, allowing the agent to "see" exactly what you see
  2. Accessibility APIs provide structured data about every UI element (buttons, text fields, menus) and the ability to interact with them programmatically
  3. Speech recognition (on-device via Apple's Speech framework) converts your voice commands into actions

This combination means Fazm does not rely on pixel matching or fragile coordinate clicking. It understands the semantic structure of the UI and can target elements by their role and label.

Architecture overview: voice input flows through the Speech API into the Fazm agent (LLM reasoning, action planning, tool execution), which uses ScreenCaptureKit for vision and the Accessibility API for interaction with any macOS app.

What Makes Fazm Different from Browser-Only Agents

Most AI agent tools either live inside a browser extension or require you to share your screen with a cloud service. Fazm works differently because it runs as a native macOS application with direct access to system frameworks.

| Feature | Fazm (macOS native) | Browser-based agents | Cloud screen-sharing agents |
|---|---|---|---|
| Works with native macOS apps | Yes | No | Partially (via screenshots) |
| Accessibility tree access | Direct, structured | DOM only | None |
| Voice control | On-device Speech API | Varies | Usually cloud-based |
| Privacy | Local processing | Extension sandboxed | Screen sent to cloud |
| App interaction method | Accessibility actions | DOM manipulation | Coordinate clicking |
| Performance overhead | Low (native APIs) | Medium | High (streaming) |

The key difference is that Fazm can interact with Finder, Terminal, Xcode, Preview, Mail, and any other macOS app because it operates through the accessibility framework that Apple provides for assistive technology.

Setting Up Fazm on macOS

Getting started takes a few steps because macOS requires explicit permissions for the APIs Fazm uses.

1. Install from source

git clone https://github.com/m13v/fazm
cd fazm
open fazm.xcodeproj

Build and run from Xcode (Cmd+R). Fazm targets macOS 14.0 (Sonoma) and later.

2. Grant permissions

macOS will prompt you for two critical permissions on first launch:

Important

Both permissions are required. Without Accessibility access, Fazm cannot click buttons or read UI elements. Without Screen Recording, it cannot see what is on your display. Go to System Settings, then Privacy and Security to enable both.

  • Accessibility: System Settings > Privacy & Security > Accessibility > enable Fazm
  • Screen Recording: System Settings > Privacy & Security > Screen Recording > enable Fazm
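If you want to verify both permissions programmatically (for example, at app launch), Apple's public APIs expose checks for each. A minimal sketch follows; whether Fazm performs exactly this check at startup is an assumption:

```swift
import ApplicationServices
import CoreGraphics

// Check (and optionally prompt for) Accessibility access.
// Passing kAXTrustedCheckOptionPrompt = true shows the system dialog on first call.
let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: true] as CFDictionary
let hasAccessibility = AXIsProcessTrustedWithOptions(options)

// Check Screen Recording access without prompting;
// CGRequestScreenCaptureAccess() would trigger the system dialog instead.
let hasScreenRecording = CGPreflightScreenCaptureAccess()

if !hasAccessibility || !hasScreenRecording {
    print("Missing permissions: enable both in System Settings > Privacy & Security")
}
```

Note that macOS caches these grants per code signature, which is why rebuilding from source can silently revoke them (see Common Pitfalls below).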

3. Configure your LLM provider

Fazm needs an LLM backend to reason about what it sees on screen. You can use OpenAI, Anthropic, or a local model through Ollama. Set your API key in the app's settings panel or via an environment variable:

export ANTHROPIC_API_KEY="sk-ant-..."

Core Capabilities

Screen Reading and Understanding

Fazm captures the current screen state using ScreenCaptureKit, Apple's modern screen capture framework introduced in macOS 12.3. Unlike the older CGWindowListCreateImage API, ScreenCaptureKit provides:

  • Per-window capture without capturing the entire display
  • Hardware-accelerated encoding with minimal CPU impact
  • Content filtering (exclude specific windows or apps)
  • Capture at up to 60fps when needed, though Fazm typically samples at 1-2fps to conserve resources

The captured frames are sent to the LLM for visual understanding, giving the agent context about what is currently displayed.
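As a rough sketch of how low-rate capture can be configured with ScreenCaptureKit: the 1 fps interval mirrors the sampling rate mentioned above, but `frameHandler` is a hypothetical `SCStreamOutput` implementation, not something confirmed from Fazm's source:

```swift
import ScreenCaptureKit
import CoreMedia

// Enumerate shareable content and capture the main display at ~1 fps
let content = try await SCShareableContent.excludingDesktopWindows(false, onScreenWindowsOnly: true)
guard let display = content.displays.first else { return }

let filter = SCContentFilter(display: display, excludingWindows: [])
let config = SCStreamConfiguration()
config.minimumFrameInterval = CMTime(value: 1, timescale: 1)  // at most 1 frame per second
config.width = display.width
config.height = display.height

let stream = SCStream(filter: filter, configuration: config, delegate: nil)
// frameHandler is a hypothetical SCStreamOutput that forwards frames to the LLM
try stream.addStreamOutput(frameHandler, type: .screen, sampleHandlerQueue: .main)
try await stream.startCapture()
```

Setting `minimumFrameInterval` is what keeps CPU and API costs low: the system simply never delivers frames faster than the configured rate.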

Accessibility-Based Interaction

Rather than clicking at pixel coordinates (which breaks when windows move or resolution changes), Fazm uses the macOS Accessibility API to:

  • Enumerate all UI elements in the focused application
  • Read element properties (role, title, value, enabled state)
  • Perform actions (press buttons, set text field values, select menu items)
  • Navigate the element hierarchy to find specific controls

// Example: finding and clicking a button by its label
let app = AXUIElementCreateApplication(pid)
var children: CFTypeRef?
AXUIElementCopyAttributeValue(app, kAXChildrenAttribute as CFString, &children)

// Walk the tree depth-first until an element with role kAXButtonRole and
// title "Send" is found (call it `element`), then perform the press action
AXUIElementPerformAction(element, kAXPressAction as CFString)

This approach is the same one VoiceOver uses, so any app that works with VoiceOver works with Fazm.
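The tree walk elided in the snippet above can be sketched as a recursive search. The attribute and role constants are Apple's; the `findElement` helper itself is hypothetical, not taken from Fazm's source:

```swift
import ApplicationServices

// Depth-first search for a UI element matching a given role and title
func findElement(in root: AXUIElement, role: String, title: String) -> AXUIElement? {
    var roleValue: CFTypeRef?
    var titleValue: CFTypeRef?
    AXUIElementCopyAttributeValue(root, kAXRoleAttribute as CFString, &roleValue)
    AXUIElementCopyAttributeValue(root, kAXTitleAttribute as CFString, &titleValue)
    if roleValue as? String == role, titleValue as? String == title {
        return root
    }
    var children: CFTypeRef?
    AXUIElementCopyAttributeValue(root, kAXChildrenAttribute as CFString, &children)
    for child in (children as? [AXUIElement]) ?? [] {
        if let match = findElement(in: child, role: role, title: title) {
            return match
        }
    }
    return nil
}

// Usage: locate and press the "Send" button
// if let button = findElement(in: app, role: kAXButtonRole as String, title: "Send") {
//     AXUIElementPerformAction(button, kAXPressAction as CFString)
// }
```

Searching by role and title rather than position is what makes this robust to window moves, resizes, and resolution changes.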

Voice Commands

Fazm accepts natural language voice commands through Apple's on-device speech recognition. You can say things like:

  • "Open Safari and go to my calendar"
  • "Find the latest email from Alex and reply saying I will be there"
  • "Close all Finder windows"
  • "Take a screenshot of this window and save it to Desktop"

The speech-to-text runs entirely on-device (no audio leaves your Mac), and the recognized text is passed to the LLM for intent parsing and action planning.
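The on-device constraint maps directly onto Apple's Speech framework. A minimal sketch of forcing local-only recognition follows; the handoff to the LLM in the comment is an assumption about Fazm's pipeline:

```swift
import Speech

// Transcribe microphone audio entirely on-device
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true  // no audio leaves the Mac

let task = recognizer.recognitionTask(with: request) { result, error in
    guard let result else { return }
    if result.isFinal {
        // Hand the transcript to the LLM for intent parsing and action planning
        print("Command: \(result.bestTranscription.formattedString)")
    }
}
// Audio buffers from AVAudioEngine would be appended to `request` here
```

With `requiresOnDeviceRecognition` set, recognition fails outright rather than falling back to Apple's servers, which is the privacy guarantee described above.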

Real-World Use Cases

Automating Repetitive Workflows

Fazm excels at multi-app workflows that would normally require switching between several applications manually. For example, you can ask it to:

  1. Open a CSV file in Numbers
  2. Copy specific columns
  3. Switch to a web form
  4. Fill in the form fields from the copied data
  5. Submit and move to the next row

Because Fazm understands both the spreadsheet UI (through accessibility) and the web form (through the browser's accessibility tree), it can bridge applications that have no built-in integration.

Testing macOS Applications

Developers building macOS apps can use Fazm as an automated testing agent. Point it at your app, describe a user flow in natural language, and let it execute the steps while reporting what it observes. This is particularly useful for testing accessibility compliance since Fazm uses the same APIs that assistive technology relies on.

System Administration Tasks

For tasks like batch-renaming files in Finder, configuring System Settings, or managing multiple Terminal sessions, Fazm can handle multi-step operations that would otherwise require either manual clicking or writing AppleScript.

Fazm vs Other macOS AI Agents

| Agent | Open source | Native macOS APIs | Voice control | Local processing | Active development |
|---|---|---|---|---|---|
| Fazm | Yes (MIT) | ScreenCaptureKit + Accessibility | Yes (on-device) | Yes | Yes |
| Highlight AI | No | Screenshot-based | No | No (cloud) | Yes |
| Apple Intelligence | No | System-level | Siri | Partial | Yes |
| Generic browser agents | Varies | No (browser only) | Varies | Varies | Varies |

For a deeper comparison with Highlight AI specifically, see our detailed breakdown.

Common Pitfalls

  • Forgetting to grant permissions after updates: macOS sometimes resets Screen Recording and Accessibility permissions after app updates or re-signing. If Fazm stops working after rebuilding from source, check System Settings > Privacy & Security and re-enable both permissions.

  • Trying to control apps without accessibility support: Some apps (particularly Electron apps with custom renderers, or Java-based tools) expose minimal accessibility information. Fazm will fall back to screenshot-based interaction in these cases, but results are less reliable. Check if VoiceOver can navigate the app first; if VoiceOver struggles, Fazm will too.

  • Running on macOS versions below 14.0: ScreenCaptureKit's content filtering APIs improved significantly in Sonoma. While basic capture works on Monterey (12.3+), the full Fazm feature set requires macOS 14.0 or later.

  • Setting capture framerate too high: Capturing at 30fps+ and sending every frame to an LLM will burn through your API budget fast and create latency. The default 1-2fps sampling rate is deliberate. Only increase it for time-sensitive automation tasks.

Quick Start Checklist

Here is everything you need to get Fazm running as your macOS AI agent:

  • Clone the repo: git clone https://github.com/m13v/fazm
  • Open in Xcode and build (Cmd+R)
  • Grant Accessibility and Screen Recording permissions
  • Set your LLM API key (Anthropic, OpenAI, or Ollama)
  • Start talking or typing commands

Wrapping Up

Fazm gives you a macOS AI agent that works with every application on your Mac through native Apple APIs. Instead of being limited to browser tabs or relying on cloud screen sharing, it uses ScreenCaptureKit for vision and the Accessibility framework for interaction, the same tools Apple built for assistive technology. If you want an AI agent that treats your entire desktop as its workspace, Fazm is the open source option that does exactly that.

Fazm is open source (MIT) on GitHub.
