Building a macOS AI Agent with Accessibility APIs and ScreenCaptureKit

Fazm Team · 2 min read

When we set out to build a macOS AI agent, we had two main options for how the agent interacts with the operating system: screenshot-based (take a picture, analyze it, click coordinates) or API-based (use Accessibility APIs to read UI elements directly). We chose the API approach, supplemented by ScreenCaptureKit for visual context. Here is why, and how.

Accessibility APIs for Control

The macOS Accessibility API (AX API) exposes every UI element as a structured object. Buttons have labels, text fields have values, menus have items. The agent reads this tree to understand what is on screen and what actions are available, then performs actions by calling methods on those elements.
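As a rough sketch of what this looks like in practice, the snippet below walks one level of an app's accessibility tree and presses a button by its label. The function name, the single-level (non-recursive) search, and the force-cast of the focused window are simplifications for illustration, not Fazm's actual implementation:

```swift
import ApplicationServices

// Find a button by title in an app's focused window and press it.
// Requires the Accessibility permission (System Settings > Privacy & Security).
func pressButton(titled title: String, inAppWithPID pid: pid_t) -> Bool {
    let app = AXUIElementCreateApplication(pid)

    var focusedWindow: CFTypeRef?
    guard AXUIElementCopyAttributeValue(app, kAXFocusedWindowAttribute as CFString,
                                        &focusedWindow) == .success else { return false }
    let window = focusedWindow as! AXUIElement  // force-cast for brevity

    var childrenRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(window, kAXChildrenAttribute as CFString,
                                        &childrenRef) == .success,
          let children = childrenRef as? [AXUIElement] else { return false }

    for element in children {
        var role: CFTypeRef?
        var label: CFTypeRef?
        AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &role)
        AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &label)
        if role as? String == kAXButtonRole, label as? String == title {
            // Act on the element itself -- no pixel coordinates involved.
            return AXUIElementPerformAction(element, kAXPressAction as CFString) == .success
        }
    }
    return false
}
```

A real agent would recurse through the tree rather than inspect only the window's direct children, but the element-level action is the same.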

The advantage over screenshot analysis is precision. The agent does not need to guess where a button is based on pixel coordinates. It knows exactly which element it wants to interact with and can click it reliably every time.

ScreenCaptureKit for Context

While Accessibility APIs provide structure, they miss visual context. A chart, an image, the overall layout of a page - these matter for understanding what the user is looking at. ScreenCaptureKit captures specific windows or screen regions efficiently, giving the agent visual context without the overhead of full-screen recording.

We use ScreenCaptureKit selectively - only when the agent needs visual understanding that the accessibility tree cannot provide. This keeps resource usage low while still enabling visual reasoning when needed.
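A minimal sketch of a targeted capture, assuming macOS 14+ for the one-shot `SCScreenshotManager` API (the function name and window-by-title lookup are illustrative, not Fazm's code):

```swift
import ScreenCaptureKit
import CoreGraphics

// Capture a single window rather than the whole screen.
// Requires the Screen Recording permission.
func captureWindow(titled title: String) async throws -> CGImage? {
    // Enumerate currently shareable on-screen windows.
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let window = content.windows.first(where: { $0.title == title }) else {
        return nil
    }

    // The filter restricts capture to just this window.
    let filter = SCContentFilter(desktopIndependentWindow: window)
    let config = SCStreamConfiguration()
    config.width = Int(window.frame.width)
    config.height = Int(window.frame.height)

    // One-shot screenshot instead of a continuous capture stream.
    return try await SCScreenshotManager.captureImage(
        contentFilter: filter, configuration: config)
}
```

Because the capture is scoped to one window and taken on demand, the cost is a single frame rather than a running recording session.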

The Technical Stack

Our agent combines:

  • AX API for reading UI elements and performing actions
  • ScreenCaptureKit for targeted visual captures
  • CoreML for on-device processing of visual data
  • Swift for native macOS integration with minimal overhead
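To show how the visual half of the stack might fit together, here is a hedged sketch of running a CoreML model over a captured window image via the Vision framework. `UIClassifier` is a placeholder model name, not a real Fazm component:

```swift
import Vision
import CoreML

// Classify a captured window image entirely on-device.
// "UIClassifier" is a hypothetical compiled CoreML model.
func classify(_ image: CGImage) throws -> [String] {
    let model = try VNCoreMLModel(
        for: UIClassifier(configuration: MLModelConfiguration()).model)
    let request = VNCoreMLRequest(model: model)

    // Vision handles scaling/cropping the image to the model's input size.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let observations = request.results as? [VNClassificationObservation] ?? []
    // Top labels only; no pixels leave the machine.
    return observations.prefix(3).map { $0.identifier }
}
```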

Why Native Matters

Browser-based agents run in a sandboxed environment and can only interact with web content. A native macOS agent can interact with any application - Finder, Terminal, Xcode, Figma, anything that runs on your Mac. This is the difference between automating a website and automating your entire computer.

Fazm is an open-source macOS AI agent. The code is available on GitHub.
