Building a macOS AI Agent with Accessibility APIs and ScreenCaptureKit

Fazm Team · 2 min read

When we set out to build a macOS AI agent, we had two main options for how the agent interacts with the operating system: screenshot-based (take a picture, analyze it, click coordinates) or API-based (use Accessibility APIs to read UI elements directly). We chose the API approach, supplemented by ScreenCaptureKit for visual context. Here is why, and how.

Accessibility APIs for Control

The macOS Accessibility API (AX API) exposes every UI element as a structured object. Buttons have labels, text fields have values, menus have items. The agent reads this tree to understand what is on screen and what actions are available, then performs actions by calling methods on those elements.
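As a rough sketch of what this looks like in practice, the snippet below walks one level of an app's accessibility tree and presses a button by its label. The function name, the single-level (non-recursive) search, and the force-cast of the focused window are simplifications for illustration, not Fazm's actual implementation:

```swift
import ApplicationServices

// Find a button by title in an app's focused window and press it.
// Requires the Accessibility permission (System Settings > Privacy & Security).
func pressButton(titled title: String, inAppWithPID pid: pid_t) -> Bool {
    let app = AXUIElementCreateApplication(pid)

    var focusedWindow: CFTypeRef?
    guard AXUIElementCopyAttributeValue(app, kAXFocusedWindowAttribute as CFString,
                                        &focusedWindow) == .success else { return false }
    let window = focusedWindow as! AXUIElement  // force-cast for brevity

    var childrenRef: CFTypeRef?
    guard AXUIElementCopyAttributeValue(window, kAXChildrenAttribute as CFString,
                                        &childrenRef) == .success,
          let children = childrenRef as? [AXUIElement] else { return false }

    for element in children {
        var role: CFTypeRef?
        var label: CFTypeRef?
        AXUIElementCopyAttributeValue(element, kAXRoleAttribute as CFString, &role)
        AXUIElementCopyAttributeValue(element, kAXTitleAttribute as CFString, &label)
        if role as? String == kAXButtonRole, label as? String == title {
            // Act on the element itself -- no pixel coordinates involved.
            return AXUIElementPerformAction(element, kAXPressAction as CFString) == .success
        }
    }
    return false
}
```

A real agent would recurse through the tree rather than inspect only the window's direct children, but the element-level action is the same.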

The advantage over screenshot analysis is precision. The agent does not need to guess where a button is based on pixel coordinates. It knows exactly which element it wants to interact with and can click it reliably every time.

ScreenCaptureKit for Context

While Accessibility APIs provide structure, they miss visual context. A chart, an image, the overall layout of a page - these matter for understanding what the user is looking at. ScreenCaptureKit captures specific windows or screen regions efficiently, giving the agent visual context without the overhead of full-screen recording.

We use ScreenCaptureKit selectively - only when the agent needs visual understanding that the accessibility tree cannot provide. This keeps resource usage low while still enabling visual reasoning when needed.
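A minimal sketch of a targeted capture, assuming macOS 14+ for the one-shot `SCScreenshotManager` API (the function name and window-by-title lookup are illustrative, not Fazm's code):

```swift
import ScreenCaptureKit
import CoreGraphics

// Capture a single window rather than the whole screen.
// Requires the Screen Recording permission.
func captureWindow(titled title: String) async throws -> CGImage? {
    // Enumerate currently shareable on-screen windows.
    let content = try await SCShareableContent.excludingDesktopWindows(
        false, onScreenWindowsOnly: true)
    guard let window = content.windows.first(where: { $0.title == title }) else {
        return nil
    }

    // The filter restricts capture to just this window.
    let filter = SCContentFilter(desktopIndependentWindow: window)
    let config = SCStreamConfiguration()
    config.width = Int(window.frame.width)
    config.height = Int(window.frame.height)

    // One-shot screenshot instead of a continuous capture stream.
    return try await SCScreenshotManager.captureImage(
        contentFilter: filter, configuration: config)
}
```

Because the capture is scoped to one window and taken on demand, the cost is a single frame rather than a running recording session.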

The Technical Stack

Our agent combines:

  • AX API for reading UI elements and performing actions
  • ScreenCaptureKit for targeted visual captures
  • CoreML for on-device processing of visual data
  • Swift for native macOS integration with minimal overhead
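To show how the visual half of the stack might fit together, here is a hedged sketch of running a CoreML model over a captured window image via the Vision framework. `UIClassifier` is a placeholder model name, not a real Fazm component:

```swift
import Vision
import CoreML

// Classify a captured window image entirely on-device.
// "UIClassifier" is a hypothetical compiled CoreML model.
func classify(_ image: CGImage) throws -> [String] {
    let model = try VNCoreMLModel(
        for: UIClassifier(configuration: MLModelConfiguration()).model)
    let request = VNCoreMLRequest(model: model)

    // Vision handles scaling/cropping the image to the model's input size.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let observations = request.results as? [VNClassificationObservation] ?? []
    // Top labels only; no pixels leave the machine.
    return observations.prefix(3).map { $0.identifier }
}
```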

Why Native Matters

Browser-based agents run in a sandboxed environment and can only interact with web content. A native macOS agent can interact with any application - Finder, Terminal, Xcode, Figma, anything that runs on your Mac. This is the difference between automating a website and automating your entire computer.

Fazm is an open-source macOS AI agent. The code is available on GitHub.
