Fazm AI Mac Agent - Open Source Desktop Automation for macOS
Fazm is an open source AI agent that runs on your Mac and controls your desktop. You talk to it, it sees your screen, understands what you are working on, and takes action through native macOS APIs. No browser extensions, no cloud screenshots, no sandboxed environment. It operates your Mac the same way you do: clicking, typing, navigating between apps, reading what is on screen.
This post covers what Fazm does, how it works under the hood, what sets it apart from other Mac AI agents, and how to get started.
What Fazm Actually Does
Fazm sits in your menu bar and listens for a keyboard shortcut. When you activate it, you speak a command in natural language. The agent captures your screen, interprets it, plans a sequence of actions, and executes them.
Some real examples:
- "Open Safari and search for the latest Next.js release notes"
- "Move all the screenshots from my Desktop to a new folder called Screenshots April"
- "Reply to the last email from Sarah saying I will be 10 minutes late"
- "Find the Figma file I was working on yesterday and export the hero section as PNG"
Each of these involves multiple apps, multiple steps, and context the agent gathers from your screen in real time. You do not need to script anything. You do not need to configure per-app integrations.
Architecture: How It Works
Fazm combines three macOS technologies to see, understand, and act on your desktop.
ScreenCaptureKit for Vision
Fazm uses Apple's ScreenCaptureKit framework to capture what is on your screen. This is the same API that powers screen sharing in FaceTime and screen recording in QuickTime. It captures at the compositor level, which means it sees exactly what you see, including overlays, transparency, and multi-window layouts.
The captured frames go to the LLM for visual understanding. The model identifies UI elements, reads text, and understands spatial relationships between windows and controls.
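A minimal sketch of what that capture path can look like with ScreenCaptureKit (the `FrameReceiver` type and `startCapture` function are illustrative, not Fazm's actual code):

```swift
import ScreenCaptureKit
import CoreMedia

// Illustrative sketch: capture the main display and hand frames to a handler.
// FrameReceiver is a hypothetical output type, not a class from Fazm.
final class FrameReceiver: NSObject, SCStreamOutput {
    func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
                of type: SCStreamOutputType) {
        guard type == .screen, sampleBuffer.isValid else { return }
        // Hand the frame to the vision model here.
    }
}

func startCapture() async throws -> SCStream {
    // Enumerate the displays and windows the app is allowed to capture.
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let display = content.displays.first else {
        throw NSError(domain: "capture", code: 1)
    }

    // Capture the whole display, excluding nothing.
    let filter = SCContentFilter(display: display, excludingWindows: [])
    let config = SCStreamConfiguration()
    config.width = display.width
    config.height = display.height
    // An agent does not need 60 fps; a low frame rate keeps inference cheap.
    config.minimumFrameInterval = CMTime(value: 1, timescale: 2)

    let stream = SCStream(filter: filter, configuration: config, delegate: nil)
    let receiver = FrameReceiver() // keep a strong reference in real code
    try stream.addStreamOutput(receiver, type: .screen, sampleHandlerQueue: .main)
    try await stream.startCapture()
    return stream
}
```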
Accessibility APIs for Action
Once the LLM decides what to do, Fazm executes actions through macOS accessibility APIs. These are the same APIs that screen readers like VoiceOver use. They let Fazm:
- Click buttons and menu items in any application
- Type text into fields
- Read the contents of UI elements programmatically
- Navigate app hierarchies (menus, toolbars, sidebars)
- Trigger keyboard shortcuts
This is different from mouse/keyboard simulation. Accessibility APIs interact with the semantic structure of an app, not just pixel coordinates. If a button moves because the window resized, the accessibility approach still finds it by its role and label.
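As a sketch of what "find it by role and label" means in practice (the helper below is illustrative, not Fazm's actual code), the accessibility C API lets you walk an app's element tree and press a button by its semantic attributes rather than its pixel position:

```swift
import ApplicationServices

// Illustrative sketch: find a button by its accessibility title and press it.
// Assumes the calling process already has Accessibility permission.
func pressButton(titled title: String, inAppWithPID pid: pid_t) -> Bool {
    let appElement = AXUIElementCreateApplication(pid)

    // Recursively walk the element tree looking for an AXButton whose
    // AXTitle matches, regardless of where it sits on screen.
    func search(_ element: AXUIElement) -> AXUIElement? {
        var roleRef: CFTypeRef?, titleRef: CFTypeRef?
        AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &roleRef)
        AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &titleRef)
        if roleRef as? String == kAXButtonRole as String,
           titleRef as? String == title {
            return element
        }
        var kidsRef: CFTypeRef?
        guard AXUIElementCopyAttributeValue(element, kAXChildrenAttribute as CFString,
                                            &kidsRef) == .success,
              let kids = kidsRef as? [AXUIElement] else { return nil }
        for kid in kids {
            if let found = search(kid) { return found }
        }
        return nil
    }

    guard let button = search(appElement) else { return false }
    return AXUIElementPerformAction(button, kAXPressAction as CFString) == .success
}
```

Because the lookup is by role and title, the same call keeps working after the window resizes or the button moves.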
Local Speech Recognition
Voice input runs through a local Whisper model. Your audio never leaves your machine. Fazm downloads the model on first launch (the large-v3-turbo variant, around 1.5GB) and runs inference on your Mac's GPU via Core ML.
Recognition latency is typically under 500ms for short commands on Apple Silicon.
Fazm vs Other Mac AI Agents
The Mac AI agent space has grown fast. Here is how Fazm compares to the main alternatives.
| Feature | Fazm | Apple Intelligence / Siri | Highlight AI | OpenAI Operator |
|---|---|---|---|---|
| Open source | Yes (MIT) | No | No | No |
| Works offline | Partial (local models) | Partial | No | No |
| Controls any Mac app | Yes (AX APIs) | Limited to Siri intents | Screen-based | Browser only |
| Voice input | Yes (local Whisper) | Yes | No | No |
| Screen understanding | Yes (ScreenCaptureKit) | Limited | Yes (screenshots) | Browser DOM |
| Runs locally | Yes | Partially | Cloud-dependent | Cloud-only |
| Price | Free | Free (with Mac) | Subscription | Subscription |
| Multi-step workflows | Yes | Very limited | Yes | Yes (browser) |
Tip
Fazm works alongside Siri, not instead of it. Siri handles quick system commands (volume, timers, HomeKit) well. Fazm picks up where Siri stops: multi-app workflows, screen context, and tasks that require actually seeing what is on your display.
Getting Started
Fazm requires macOS 14 (Sonoma) or later and an Apple Silicon Mac (M1 or newer). Intel Macs can run it but speech recognition will be significantly slower without the Neural Engine.
Installation
```shell
# Clone the repository
git clone https://github.com/m13v/fazm.git
cd fazm

# Open in Xcode and build
open fazm.xcodeproj

# Or build from the command line
xcodebuild -scheme fazm -configuration Release
```
On first launch, macOS will ask for two permissions:
- Screen Recording (for ScreenCaptureKit to capture your display)
- Accessibility (for the agent to interact with UI elements in other apps)
Both are required. Without Screen Recording, the agent cannot see your screen. Without Accessibility, it cannot click or type.
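For anyone building on the codebase, the standard way to check both permissions at launch looks roughly like this (a sketch; the function name is illustrative):

```swift
import ApplicationServices
import CoreGraphics

// Illustrative sketch: verify Screen Recording and Accessibility access,
// prompting the user via the system dialogs when either is missing.
func checkPermissions() -> (screenRecording: Bool, accessibility: Bool) {
    // Screen Recording: preflight is non-interactive; the request call
    // triggers the one-time system prompt.
    let screen = CGPreflightScreenCaptureAccess()
    if !screen { CGRequestScreenCaptureAccess() }

    // Accessibility: passing the prompt option shows the system dialog
    // if the process is not yet trusted.
    let promptKey = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    let trusted = AXIsProcessTrustedWithOptions([promptKey: true] as CFDictionary)
    return (screen, trusted)
}
```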
First Command
After granting permissions and waiting for the Whisper model to download:
- Press the activation shortcut (default: `Option + Space`)
- Say your command: "Open Notes and create a new note called Meeting Agenda"
- Watch the agent navigate to Notes, create the note, and type the title
The first command after launch may take a second or two longer while the models warm up. Subsequent commands run faster because the models stay loaded in memory.
How Fazm Plans Multi-Step Tasks
When you say "move all PDFs from Downloads to a folder called Invoices on the Desktop," the agent does not just blindly execute a file move. It plans:
- Open Finder (or switch to it if already open)
- Navigate to the Downloads folder
- Identify all PDF files by reading the file list through accessibility APIs
- Select them
- Check if a folder called "Invoices" exists on the Desktop
- If not, create it
- Move the selected files to that folder
Each step includes a screen capture to verify the previous action completed successfully. If something unexpected happens (a dialog box appears, the folder already exists with files in it), the agent re-plans based on what it sees.
This observe-plan-act loop runs until the task is complete or the agent determines it cannot proceed (for example, if a required app is not installed).
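The loop above can be sketched in a few lines (the `Observation`, `Action`, and `Planner` types are hypothetical simplifications, not Fazm's actual interfaces):

```swift
// Illustrative sketch of the observe-plan-act loop described above.
struct Observation { let screenDescription: String }

enum Action {
    case click(String)   // press a named UI element
    case type(String)    // enter text
    case done            // planner verified the goal is complete
    case abort(String)   // cannot proceed (e.g. required app missing)
}

protocol Planner {
    func nextAction(goal: String, observation: Observation) -> Action
}

func runTask(goal: String, planner: Planner,
             observe: () -> Observation,
             execute: (Action) -> Void,
             maxSteps: Int = 20) -> Bool {
    for _ in 0..<maxSteps {
        let obs = observe()  // capture the screen and describe it
        switch planner.nextAction(goal: goal, observation: obs) {
        case .done:
            return true
        case .abort:
            return false
        case let action:
            execute(action)  // act, then loop back to re-observe and verify
        }
    }
    return false             // safety cap against runaway loops
}
```

The key property is that every iteration re-observes the screen, so an unexpected dialog or a pre-existing folder changes the next plan rather than breaking the task.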
Common Pitfalls
- **Denying accessibility permissions after the first prompt.** macOS only asks once. If you click "Don't Allow," you need to manually enable it in System Settings, then restart Fazm. The agent will not remind you; it will simply fail silently when trying to interact with apps.
- **Running on macOS 13 or earlier.** ScreenCaptureKit's content filter APIs that Fazm relies on were introduced in macOS 14. The app will build on Ventura, but screen capture will fail at runtime with a cryptic `SCStreamError` code 2.
- **Expecting instant execution on complex tasks.** Each step involves a screen capture, an LLM inference call, and an accessibility action. A 10-step workflow takes 15 to 30 seconds depending on model speed. This is not a scripting engine; it is an agent that thinks between steps.
- **Using Fazm for repetitive batch operations.** If you need to rename 500 files or process a CSV, use a shell script. Fazm is built for ad-hoc workflows where you do not want to write code, not for batch processing where a `for` loop is faster.
Privacy and Security
Fazm processes everything locally by default. Screen captures stay on your machine. Voice audio is transcribed locally by Whisper.
The one exception is the LLM planner. If you use Claude or another cloud model for planning, screen captures are sent to the API. Fazm strips sensitive regions (password fields, banking apps) using accessibility metadata before sending, but you should be aware that screen content reaches the model provider.
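As a sketch of how that stripping can work (the helper below is illustrative, not Fazm's actual code): secure password fields advertise themselves through their accessibility subrole, and the same element exposes the on-screen frame to mask before a capture leaves the machine.

```swift
import ApplicationServices

// Illustrative sketch: if an element is a secure (password) field, return
// its on-screen frame so that region can be redacted from screenshots
// before they are sent to a cloud model.
func secureFieldFrame(_ element: AXUIElement) -> CGRect? {
    // Password fields report the AXSecureTextField subrole.
    var subroleRef: CFTypeRef?
    AXUIElementCopyAttributeValue(element, kAXSubroleAttribute as CFString, &subroleRef)
    guard subroleRef as? String == kAXSecureTextFieldSubrole as String else { return nil }

    // Position and size come back as AXValue-wrapped CGPoint / CGSize.
    var posRef: CFTypeRef?, sizeRef: CFTypeRef?
    var origin = CGPoint.zero
    var size = CGSize.zero
    guard AXUIElementCopyAttributeValue(element, kAXPositionAttribute as CFString,
                                        &posRef) == .success,
          AXUIElementCopyAttributeValue(element, kAXSizeAttribute as CFString,
                                        &sizeRef) == .success,
          AXValueGetValue(posRef as! AXValue, .cgPoint, &origin),
          AXValueGetValue(sizeRef as! AXValue, .cgSize, &size) else { return nil }
    return CGRect(origin: origin, size: size)
}
```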
For fully offline operation, you can configure Fazm to use a local model. Performance depends on your hardware: an M1 Pro handles 7B parameter models comfortably, while M2 Ultra or M3 Max can run larger models at reasonable speed.
| Configuration | Privacy | Performance | Setup Complexity |
|---|---|---|---|
| Cloud LLM (Claude) | Screenshots sent to API | Fastest planning (~1s) | Lowest (API key only) |
| Local 7B model | Fully offline | Moderate (~3s per step) | Medium (model download) |
| Local 70B model | Fully offline | Fast (~1.5s per step) | High (needs 48GB+ RAM) |
Extending Fazm
Fazm is MIT licensed and built in Swift/SwiftUI. The codebase is structured around a few core modules:
```
fazm/
├── Capture/        # ScreenCaptureKit integration
├── Accessibility/  # AX API wrappers for app interaction
├── Voice/          # Whisper speech recognition
├── Agent/          # LLM planning and action execution
├── UI/             # SwiftUI menu bar interface
└── Models/         # Data types and configuration
```
If you want to add support for a specific workflow or app integration, the Agent/ module is where planning logic lives. Custom tools can be registered there to give the LLM new capabilities beyond the default click/type/read primitives.
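A custom tool registration might look roughly like this (the `Tool` protocol, registry, and example tool are hypothetical; Fazm's actual API may differ):

```swift
// Illustrative sketch of how a custom tool could plug into the Agent module.
protocol Tool {
    var name: String { get }
    var description: String { get }  // surfaced to the LLM so it knows when to call the tool
    func run(arguments: [String: String]) throws -> String
}

enum ToolError: Error { case unknown(String) }

struct ToolRegistry {
    private var tools: [String: Tool] = [:]

    mutating func register(_ tool: Tool) { tools[tool.name] = tool }

    func invoke(_ name: String, arguments: [String: String]) throws -> String {
        guard let tool = tools[name] else { throw ToolError.unknown(name) }
        return try tool.run(arguments: arguments)
    }
}

// Hypothetical example: a tool that reveals a file in Finder.
struct RevealInFinderTool: Tool {
    let name = "reveal_in_finder"
    let description = "Reveal a file at the given path in a Finder window."
    func run(arguments: [String: String]) throws -> String {
        guard let path = arguments["path"] else { return "missing path" }
        // On macOS this would call NSWorkspace.shared.selectFile(_:inFileViewerRootedAtPath:)
        return "revealed \(path)"
    }
}
```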
Quick Start Checklist
Here is the minimum path from zero to a working Fazm setup:
- Verify macOS 14+ and Apple Silicon (`sw_vers` and `uname -m` in Terminal)
- Clone the repo: `git clone https://github.com/m13v/fazm.git`
- Build with Xcode or `xcodebuild`
- Grant Screen Recording permission when prompted
- Grant Accessibility permission when prompted
- Wait for the Whisper model download (about 1.5GB, one time)
- Press `Option + Space` and speak your first command
- Watch the agent work
Wrapping Up
Fazm turns your Mac into something closer to what Siri promised but never delivered: an AI that actually sees your screen, understands context, and takes action across any app. It is open source, runs locally, and does not require an account or subscription.
Fazm is open source under the MIT license and available on GitHub.