
What We Learned Building a macOS AI Agent in Swift (ScreenCaptureKit, Accessibility APIs, Async Pipelines)

Fazm Team · 5 min read

swift · screencapturekit · accessibility-api · engineering · macos

What We Learned Building a macOS AI Agent in Swift

We have been building Fazm for about six months: a macOS desktop agent that uses voice input to control your computer. Along the way we ran into a lot of Swift-specific challenges that we had not seen documented anywhere. Here is what we learned.

ScreenCaptureKit for Real-Time Screen Capture

The first problem is capturing what is on the screen. We needed something fast enough for real-time use but lightweight enough to run continuously without killing battery life.

ScreenCaptureKit was the answer. Apple introduced it in macOS 12.3 as a modern replacement for the older CGWindowListCreateImage approach. The key advantages for an AI agent:

  • Per-window filtering. You can capture specific windows instead of the entire screen, which means less data to process and better privacy.
  • Hardware-accelerated. The capture pipeline runs on the GPU, so CPU overhead stays minimal even at high frame rates.
  • CMSampleBuffer output. You get raw pixel buffers that you can feed directly to vision models without image format conversion overhead.

The tricky part was getting the update frequency right. Too slow and the agent misses UI state changes. Too fast and you waste compute processing identical frames. We settled on adaptive capture: high frequency during active automation, low frequency during idle monitoring.
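
The adaptive-rate idea can be sketched in a few lines of pure Swift. The names below are illustrative, not Fazm's actual code; the chosen interval would feed `SCStreamConfiguration.minimumFrameInterval` (a `CMTime`) when configuring the ScreenCaptureKit stream.

```swift
import Foundation

// Sketch of adaptive capture (names are illustrative, not Fazm's actual code).
enum AgentMode {
    case automating  // actively clicking/typing; needs fresh frames
    case idle        // just monitoring; stale frames are fine
}

/// Returns the capture interval in seconds for a given mode.
/// The result would be converted to a CMTime and assigned to
/// SCStreamConfiguration.minimumFrameInterval.
func captureInterval(for mode: AgentMode) -> Double {
    switch mode {
    case .automating: return 1.0 / 10.0  // ~10 fps during active automation
    case .idle:       return 1.0         // ~1 fps while idle
    }
}
```

The exact frequencies are tunable; the point is that the stream is reconfigured when the agent transitions between modes, rather than capturing at one fixed rate.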

Accessibility APIs Beat Screenshots Every Time

Early on we tried the screenshot-and-OCR approach that most AI computer agents use. Take a picture of the screen, send it to a vision model, get back coordinates of where to click. It works in demos. It breaks in production.

The problems are fundamental:

  • Resolution dependence. A button that is 40px wide on a Retina display looks different at every scale factor.
  • Visual ambiguity. Two buttons that look identical to a vision model might do completely different things.
  • Fragility. Every UI update, theme change, or notification banner shifts pixel coordinates.

macOS accessibility APIs solve all of this. Instead of guessing what a UI element is based on its appearance, you query it directly by role, label, and position in the element hierarchy. A "Save" button is always AXButton with AXTitle: "Save", regardless of what it looks like.

The accessibility tree gives you:

  • Element roles (button, text field, menu item)
  • Labels and values
  • Hierarchical relationships (this button is inside this toolbar inside this window)
  • Actionable operations (press, set value, select)
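
In code, that structured representation looks roughly like the model below. This is a simplified pure-Swift sketch with hypothetical names; the real data comes from `AXUIElement` queries (`AXUIElementCopyAttributeValue` and friends), but the shape the agent reasons over is the same.

```swift
import Foundation

// Simplified model of the structured UI snapshot the agent works with.
// (Real data comes from AXUIElement queries; these names are illustrative.)
struct UIElement {
    let role: String          // e.g. "AXButton", "AXTextField"
    let title: String?        // e.g. "Save"
    let actions: [String]     // e.g. ["AXPress"]
    let children: [UIElement]
}

/// Depth-first search for the first element matching a role and title.
/// This is how "click the Save button" resolves to a concrete target,
/// independent of pixels, themes, or scale factors.
func find(role: String, title: String, in root: UIElement) -> UIElement? {
    if root.role == role && root.title == title { return root }
    for child in root.children {
        if let match = find(role: role, title: title, in: child) { return match }
    }
    return nil
}
```

A lookup like `find(role: "AXButton", title: "Save", in: windowTree)` succeeds regardless of what the button looks like on screen, which is exactly the stability the screenshot approach lacks.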

This is the same interface screen readers use, which means it is designed to be stable across visual changes. When an app redesigns its UI, the accessibility tree usually stays the same. We wrote a detailed comparison of this approach versus screenshot-based vision in how AI agents see your screen.

Swift Concurrency for the Pipeline

A desktop AI agent has a natural pipeline: capture screen, process with LLM, execute actions, verify result. Each stage has different latency characteristics and failure modes.

Swift's structured concurrency turned out to be a great fit:

  • AsyncStream for the capture pipeline - frames flow continuously from ScreenCaptureKit.
  • Task groups for parallel action execution - clicking a button and verifying the result can happen concurrently.
  • Actor isolation for state management - the agent's understanding of the current screen state needs to be thread-safe.
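
The first and third bullets can be sketched together in a few lines. This is a simplified, self-contained illustration with made-up names: in the real pipeline the `AsyncStream` continuation is yielded to from the `SCStream` sample-handler callback, and the actor holds a parsed UI snapshot rather than a frame ID.

```swift
import Foundation

// An actor owns the agent's view of the screen, so updates from the capture
// stream and reads from the planner cannot race.
actor ScreenState {
    private(set) var latestFrameID = 0
    func update(frameID: Int) { latestFrameID = frameID }
}

/// Stand-in for the ScreenCaptureKit output: frames arrive as an AsyncStream.
/// Here we just emit a few fake frame IDs and finish.
func makeFrameStream(count: Int) -> AsyncStream<Int> {
    AsyncStream { continuation in
        for id in 1...count { continuation.yield(id) }
        continuation.finish()
    }
}

/// Consumes the stream and returns the last frame the actor saw.
func runCaptureLoop() async -> Int {
    let state = ScreenState()
    for await frameID in makeFrameStream(count: 3) {
        await state.update(frameID: frameID)  // actor hop keeps this thread-safe
    }
    return await state.latestFrameID
}
```

The `for await` loop is the whole consumption story: no delegate callbacks, no dispatch queues, and cancellation of the enclosing task tears the loop down cleanly.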

The biggest lesson was to not fight the concurrency model. Early versions tried to use completion handlers and dispatch queues, and the code was a mess. Rewriting to async/await made the pipeline logic readable and the error handling straightforward. For a broader look at how voice, LLMs, and local inference fit together in a desktop agent, see how LLMs can control your computer.

Voice Input with WhisperKit

For voice control, we use WhisperKit running locally on Apple Silicon. The key decision was local vs cloud transcription. Cloud gives you better accuracy. Local gives you zero network latency and complete privacy.

For a desktop agent, privacy wins. The agent already sees everything on your screen - sending audio to a cloud service on top of that was a non-starter. WhisperKit on an M1 gives us usable transcription speed with good-enough accuracy for command input. We go deeper into Apple Silicon's unified memory advantages for local AI in on-device AI on Apple Silicon.

The Architecture Summary

The full pipeline looks like:

  1. WhisperKit captures and transcribes voice input locally
  2. ScreenCaptureKit grabs the current screen state
  3. Accessibility APIs build a structured representation of the UI
  4. LLM (Claude or local model via Ollama) plans the actions
  5. Accessibility APIs execute the actions (click, type, select)
  6. ScreenCaptureKit verifies the result

Each step is an async operation in a structured concurrency pipeline. Failures at any stage cancel downstream work cleanly. For more on how we used Claude to write much of this pipeline, see building a macOS desktop agent with Claude.
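
The cancel-on-failure behavior falls out of structured concurrency almost for free. Below is a minimal sketch of one turn of the loop; every function is a stub standing in for the real framework call, and all names are illustrative.

```swift
import Foundation

// One turn of the agent loop, sketched with stubs. Each stub stands in for
// the real framework call named in the comment; all names are illustrative.
enum AgentError: Error { case verificationFailed }

func transcribe() async throws -> String { "open the save dialog" }   // WhisperKit
func snapshotScreen() async throws -> String { "ui-tree" }            // ScreenCaptureKit + AX
func plan(command: String, screen: String) async throws -> [String] { // LLM
    ["press AXButton 'Save'"]
}
func execute(_ actions: [String]) async throws {}                     // AX actions
func verify() async throws -> Bool { true }                           // re-capture + compare

/// Because every stage is awaited inside the same task, a throw at any
/// stage skips the rest, and cancelling the enclosing task cancels
/// whichever stage is in flight.
func runTurn() async throws -> Bool {
    let command = try await transcribe()
    let screen  = try await snapshotScreen()
    let actions = try await plan(command: command, screen: screen)
    try await execute(actions)
    guard try await verify() else { throw AgentError.verificationFailed }
    return true
}
```

There is no manual cleanup path: if `plan` throws, `execute` and `verify` simply never run, which is the "failures cancel downstream work" property described above.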


This post is based on our experience shared in r/swift. Fazm is open source on GitHub - MIT licensed, written entirely in Swift/SwiftUI.
