
Building an MCP Server That Combines macOS Accessibility APIs With Screen Capture

Fazm Team · 2 min read
mcp · accessibility-api · screen-capture · macos · swift


The biggest unlock for building Fazm was creating an MCP server that wraps macOS accessibility APIs together with screen capture. The AI can literally see what is on screen and click things.

This is the bridge between "AI that talks about your computer" and "AI that uses your computer."

Why Both APIs Together

Accessibility APIs alone give you structured data - element names, types, positions, hierarchy. But they miss visual context. The accessibility tree tells you there is a button labeled "Submit"; it does not tell you the button is grayed out, partially hidden behind a modal, or surrounded by a red error border.

Screen capture alone gives you visual context but no structure. A screenshot shows you exactly what the screen looks like, but the AI has to use vision models to figure out what is clickable and where.

Combining them gives the agent the best of both:

  • Structure from accessibility. The agent knows every interactive element, its type, label, and position.
  • Visual context from screen capture. The agent can verify what it sees matches what the accessibility tree reports.
  • Confidence scoring. When the accessibility tree says a button is at position (200, 300) and the screenshot confirms a button is visually present at that location, the agent can act with high confidence.
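The confidence-scoring idea can be sketched as a small pure function: compare the position the accessibility tree reports with the position a vision check detects in the screenshot, and decay confidence as they diverge. The names (`positionConfidence`, `Point`, the 24-point tolerance) are illustrative assumptions, not Fazm's actual implementation:

```swift
import Foundation

// Hypothetical sketch of confidence scoring: agreement between an
// accessibility-reported element position and a visually detected one.
struct Point {
    let x: Double
    let y: Double
}

/// Returns a confidence in [0, 1]: 1.0 when the positions coincide,
/// decaying linearly to 0 once they are `tolerance` points apart.
func positionConfidence(axPosition: Point,
                        visualPosition: Point,
                        tolerance: Double = 24.0) -> Double {
    let dx = axPosition.x - visualPosition.x
    let dy = axPosition.y - visualPosition.y
    let distance = (dx * dx + dy * dy).squareRoot()
    return max(0.0, 1.0 - distance / tolerance)
}

// Tree says the button is at (200, 300); vision confirms one at (203, 304).
// Distance is 5 points, so confidence is 1 - 5/24 ≈ 0.79: act with confidence.
let score = positionConfidence(axPosition: Point(x: 200, y: 300),
                               visualPosition: Point(x: 203, y: 304))
print(score)
```

When the two sources disagree badly (the tree reports an element the screenshot cannot confirm), a low score is a signal to re-capture or fall back to vision rather than click blindly.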

The MCP Architecture

The MCP server is written in Swift (because the macOS accessibility and screen capture APIs are native frameworks) and exposes tools like:

  • get_accessibility_tree(app: String) - returns the full UI element hierarchy for an app
  • capture_screen(region: Rect?) - captures a screenshot of the full screen or a specific region
  • click_element(app: String, element: String) - clicks a specific element by its accessibility identifier
  • type_text(text: String) - types text into the currently focused field
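In MCP terms, a server advertises tools like these to clients via a `tools/list` response, where each tool carries a name, description, and a JSON Schema for its input. The schemas below are an illustrative sketch of how the tools above might be described, not the exact definitions Fazm ships:

```json
{
  "tools": [
    {
      "name": "get_accessibility_tree",
      "description": "Return the full UI element hierarchy for an app",
      "inputSchema": {
        "type": "object",
        "properties": { "app": { "type": "string" } },
        "required": ["app"]
      }
    },
    {
      "name": "click_element",
      "description": "Click an element by its accessibility identifier",
      "inputSchema": {
        "type": "object",
        "properties": {
          "app": { "type": "string" },
          "element": { "type": "string" }
        },
        "required": ["app", "element"]
      }
    }
  ]
}
```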

The Key Insight

Neither API alone is sufficient for reliable desktop automation. Accessibility APIs miss visual state. Screenshots miss semantic structure. The combination is what makes the agent actually reliable.

More on This Topic

Fazm is an open source macOS AI agent; the code is on GitHub.
