Building an MCP Server That Combines macOS Accessibility APIs With Screen Capture
The biggest unlock in building Fazm was creating an MCP server that wraps the macOS accessibility APIs together with screen capture. The AI can see what is on screen and click things.
This is the bridge between "AI that talks about your computer" and "AI that uses your computer."
Why Both APIs Together
Accessibility APIs alone give you structured data: element names, types, positions, and hierarchy. But they miss visual context. The accessibility tree can tell you there is a button labeled "Submit", but not whether it is grayed out, partially hidden behind a modal, or surrounded by a red error border.
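That structured view can be modeled roughly as a tree of typed nodes. A minimal sketch, assuming illustrative names (in the real server the values would come from `AXUIElement` queries; `AXNode` and the role set here are hypothetical, not Fazm's actual types):

```swift
// A minimal model of what the accessibility tree provides: role, label,
// position, hierarchy. Visual state (grayed out, occluded) is NOT here.
struct AXNode {
    let role: String          // e.g. "AXButton"
    let label: String         // e.g. "Submit"
    let x: Double, y: Double  // on-screen position reported by the AX API
    var children: [AXNode] = []
}

// Flatten the hierarchy to every element the agent could act on.
func interactiveElements(in node: AXNode) -> [AXNode] {
    let actionable: Set<String> = ["AXButton", "AXTextField", "AXCheckBox", "AXLink"]
    var found: [AXNode] = []
    if actionable.contains(node.role) { found.append(node) }
    for child in node.children {
        found += interactiveElements(in: child)
    }
    return found
}
```

This is exactly the data a screenshot cannot give you directly: every actionable element, with a label and a position, no vision model required.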
Screen capture alone gives you visual context but no structure. A screenshot shows you exactly what the screen looks like, but the AI has to use vision models to figure out what is clickable and where.
Combining them gives the agent the best of both:
- Structure from accessibility. The agent knows every interactive element, its type, label, and position.
- Visual context from screen capture. The agent can verify what it sees matches what the accessibility tree reports.
- Confidence scoring. When the accessibility tree says a button is at position (200, 300) and the screenshot confirms a button is visually present at that location, the agent can act with high confidence.
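The confidence-scoring idea can be sketched as a simple overlap check between the frame the accessibility tree reports and the regions a screenshot-based detector confirms. This is a sketch under assumptions: the detector itself is not shown, and the plain `Rect` type stands in for `CGRect`; none of these names are from Fazm's code.

```swift
// Stand-in for CGRect so the sketch is self-contained.
struct Rect {
    var x: Double, y: Double, width: Double, height: Double
}

// Intersection-over-union of two rectangles: 1.0 = identical, 0.0 = disjoint.
func iou(_ a: Rect, _ b: Rect) -> Double {
    let w = max(0, min(a.x + a.width,  b.x + b.width)  - max(a.x, b.x))
    let h = max(0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    let inter = w * h
    let union = a.width * a.height + b.width * b.height - inter
    return union > 0 ? inter / union : 0
}

// Score the AX-reported frame against every visually detected region
// and keep the best match; the agent clicks only above some threshold.
func clickConfidence(axFrame: Rect, detected: [Rect]) -> Double {
    detected.map { iou(axFrame, $0) }.max() ?? 0
}
```

A score near 1.0 means both sources agree on where the button is; a score near 0.0 means the tree claims an element the screenshot cannot confirm, and the agent should re-check before acting.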
The MCP Architecture
The MCP server is written in Swift (because the macOS accessibility and screen capture APIs are native frameworks) and exposes tools like:
- `get_accessibility_tree(app: String)` - returns the full UI element hierarchy for an app
- `capture_screen(region: Rect?)` - captures a screenshot of the full screen or a specific region
- `click_element(app: String, element: String)` - clicks a specific element by its accessibility identifier
- `type_text(text: String)` - types text into the currently focused field
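To make the shape concrete, here is a sketch of how such tools could be advertised in an MCP `tools/list` response. The tool names mirror the list above, but the `Codable` shapes are simplified and omit the JSON Schema input definitions a real MCP server sends; this is not Fazm's actual serialization code.

```swift
import Foundation

// Simplified MCP tool descriptor: name plus human-readable description.
struct ToolDefinition: Codable {
    let name: String
    let description: String
}

let tools = [
    ToolDefinition(name: "get_accessibility_tree",
                   description: "Return the full UI element hierarchy for an app"),
    ToolDefinition(name: "capture_screen",
                   description: "Capture the full screen or a specific region"),
    ToolDefinition(name: "click_element",
                   description: "Click an element by its accessibility identifier"),
    ToolDefinition(name: "type_text",
                   description: "Type text into the currently focused field"),
]

// Encode the tool list as the JSON payload a client would receive.
let encoder = JSONEncoder()
encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
let json = String(data: try! encoder.encode(tools), encoding: .utf8)!
```

Keeping the Swift side as plain `Codable` structs means the native framework calls (AXUIElement, ScreenCaptureKit) stay isolated behind a small, JSON-friendly tool surface.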
The Key Insight
Neither API alone is sufficient for reliable desktop automation. Accessibility APIs miss visual state. Screenshots miss semantic structure. The combination is what makes the agent actually reliable.
Related Posts
- Building a macOS AI Agent With Swift and ScreenCaptureKit
- MCP Server Debugging - Initialize Handshake
- How AI Agents See Your Screen - DOM vs Screenshots
Fazm is an open source macOS AI agent, available on GitHub.