Building an MCP Server That Combines macOS Accessibility APIs With Screen Capture
The biggest unlock in building Fazm was creating an MCP server that wraps the macOS accessibility APIs together with screen capture. The AI can see what is on screen and click things.
This is the bridge between "AI that talks about your computer" and "AI that uses your computer."
Why Both APIs Together
Accessibility APIs alone give you structured data: element names, types, positions, and hierarchy. But they miss visual context. The accessibility tree can tell you there is a button labeled "Submit", but not whether it is grayed out, partially hidden behind a modal, or surrounded by a red error border.
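That structured view can be modeled roughly as a tree of typed nodes. A minimal sketch, assuming illustrative names (in the real server the values would come from `AXUIElement` queries; `AXNode` and the role set here are hypothetical, not Fazm's actual types):

```swift
// A minimal model of what the accessibility tree provides: role, label,
// position, hierarchy. Visual state (grayed out, occluded) is NOT here.
struct AXNode {
    let role: String          // e.g. "AXButton"
    let label: String         // e.g. "Submit"
    let x: Double, y: Double  // on-screen position reported by the AX API
    var children: [AXNode] = []
}

// Flatten the hierarchy to every element the agent could act on.
func interactiveElements(in node: AXNode) -> [AXNode] {
    let actionable: Set<String> = ["AXButton", "AXTextField", "AXCheckBox", "AXLink"]
    var found: [AXNode] = []
    if actionable.contains(node.role) { found.append(node) }
    for child in node.children {
        found += interactiveElements(in: child)
    }
    return found
}
```

This is exactly the data a screenshot cannot give you directly: every actionable element, with a label and a position, no vision model required.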
Screen capture alone gives you visual context but no structure. A screenshot shows you exactly what the screen looks like, but the AI has to use vision models to figure out what is clickable and where.
Combining them gives the agent the best of both:
- Structure from accessibility. The agent knows every interactive element, its type, label, and position.
- Visual context from screen capture. The agent can verify what it sees matches what the accessibility tree reports.
- Confidence scoring. When the accessibility tree says a button is at position (200, 300) and the screenshot confirms a button is visually present at that location, the agent can act with high confidence.
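The confidence-scoring idea can be sketched as a simple overlap check between the frame the accessibility tree reports and the regions a screenshot-based detector confirms. This is a sketch under assumptions: the detector itself is not shown, and the plain `Rect` type stands in for `CGRect`; none of these names are from Fazm's code.

```swift
// Stand-in for CGRect so the sketch is self-contained.
struct Rect {
    var x: Double, y: Double, width: Double, height: Double
}

// Intersection-over-union of two rectangles: 1.0 = identical, 0.0 = disjoint.
func iou(_ a: Rect, _ b: Rect) -> Double {
    let w = max(0, min(a.x + a.width,  b.x + b.width)  - max(a.x, b.x))
    let h = max(0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    let inter = w * h
    let union = a.width * a.height + b.width * b.height - inter
    return union > 0 ? inter / union : 0
}

// Score the AX-reported frame against every visually detected region
// and keep the best match; the agent clicks only above some threshold.
func clickConfidence(axFrame: Rect, detected: [Rect]) -> Double {
    detected.map { iou(axFrame, $0) }.max() ?? 0
}
```

A score near 1.0 means both sources agree on where the button is; a score near 0.0 means the tree claims an element the screenshot cannot confirm, and the agent should re-check before acting.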
The MCP Architecture
The MCP server is written in Swift (because the macOS accessibility and screen capture APIs are native frameworks) and exposes tools like:
- `get_accessibility_tree(app: String)` - returns the full UI element hierarchy for an app
- `capture_screen(region: Rect?)` - captures a screenshot of the full screen or a specific region
- `click_element(app: String, element: String)` - clicks a specific element by its accessibility identifier
- `type_text(text: String)` - types text into the currently focused field
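To make the shape concrete, here is a sketch of how such tools could be advertised in an MCP `tools/list` response. The tool names mirror the list above, but the `Codable` shapes are simplified and omit the JSON Schema input definitions a real MCP server sends; this is not Fazm's actual serialization code.

```swift
import Foundation

// Simplified MCP tool descriptor: name plus human-readable description.
struct ToolDefinition: Codable {
    let name: String
    let description: String
}

let tools = [
    ToolDefinition(name: "get_accessibility_tree",
                   description: "Return the full UI element hierarchy for an app"),
    ToolDefinition(name: "capture_screen",
                   description: "Capture the full screen or a specific region"),
    ToolDefinition(name: "click_element",
                   description: "Click an element by its accessibility identifier"),
    ToolDefinition(name: "type_text",
                   description: "Type text into the currently focused field"),
]

// Encode the tool list as the JSON payload a client would receive.
let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let json = String(data: try! encoder.encode(tools), encoding: .utf8)!
```

Keeping the Swift side as plain `Codable` structs means the native framework calls (AXUIElement, ScreenCaptureKit) stay isolated behind a small, JSON-friendly tool surface.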
The Key Insight
Neither API alone is sufficient for reliable desktop automation. Accessibility APIs miss visual state. Screenshots miss semantic structure. The combination is what makes the agent actually reliable.
Related Posts
- Building a macOS AI Agent With Swift and ScreenCaptureKit
- MCP Server Debugging - Initialize Handshake
- How AI Agents See Your Screen - DOM vs Screenshots
Fazm is an open source macOS AI agent, available on GitHub.