Building Local AI Agents on macOS: Accessibility APIs, Security, and Practical Setup
Experimenting with OpenClaw AI agents revealed a clear pattern: agents that run locally and use macOS accessibility APIs are faster, cheaper, and more trustworthy than cloud-dependent, screenshot-based alternatives. Here is what that looks like in practice.
1. Why Local-First Matters for Agent Security
When an AI agent operates on your desktop, it has access to things that are deeply sensitive: your calendar, your email drafts, open documents, browser sessions, chat messages. The agent sees your actual working state - not a sanitized API response, but the real data visible on your screen right now.
Cloud-based desktop agents route this data through remote servers. Every screenshot the agent captures, every text field it reads, every action it plans - all of it is serialized and sent offsite for processing. For personal productivity use cases, that is an uncomfortable tradeoff. For enterprise environments with compliance requirements, it may be an outright blocker.
The local-first advantage: When inference runs on device - via a local LLM or an on-premise model - the agent's reasoning never leaves your machine. What the agent sees stays local. What it decides stays local. The only network activity is whatever actions it takes on your behalf (sending an email, submitting a form).
Local execution also removes the latency of a round-trip to a remote API. For agents that need to read and interact with UI state quickly, that difference compounds across every step of a multi-action task.
2. Accessibility APIs vs Screenshot-Based Approaches
This is the most important architectural decision in macOS agent development. There are two fundamentally different ways an agent can understand what is on screen:
- Screenshot + vision model: Capture the screen as a PNG, send it to a multimodal LLM, ask the model to identify where to click. The agent is essentially doing OCR and visual reasoning on every step.
- Accessibility (AX) APIs: Query macOS directly for the UI tree of the frontmost app. Get back a structured hierarchy of every button, text field, label, and list item - with exact coordinates, current values, and semantic roles.
| Factor | Accessibility APIs | Screenshot + Vision |
|---|---|---|
| Click accuracy | ~99% (exact coordinates) | ~80-90% (estimated from pixels) |
| Latency per step | 50-200ms | 2,000-5,000ms |
| Token cost | Low (text tree) | High (image tokens) |
| Works without display | Yes | No |
| Dynamic content | Real-time UI state | May miss updates between captures |
| Overlapping elements | Handled structurally | Frequently fails |
The screenshot approach has one real advantage: it works across platforms without per-platform integration work. But on macOS, where the AX API is mature and well-documented, there is rarely a good reason to choose pixels over structure.
3. Getting Structured Data from Native Apps
The macOS Accessibility API exposes every app's UI as a tree of AXUIElement objects. Each element has a role (button, text field, list, etc.), a label, a value, a position, and a size. For an AI agent, this is a dramatically richer signal than a flattened screenshot.
Here is what structured UI data looks like for a hypothetical email compose window:
```
AXWindow "Compose New Message"
  AXGroup
    AXTextField role=AXTextField label="To" value=""
    AXTextField role=AXTextField label="Subject" value=""
    AXTextArea role=AXTextArea label="Body" value=""
  AXGroup
    AXButton label="Send" x:820 y:640 w:80 h:32
    AXButton label="Discard" x:720 y:640 w:80 h:32
```
The agent does not need to ask "where is the To field?" - it is labeled and positioned in the tree. It does not need to estimate coordinates from a screenshot - it reads exact pixel values. It does not need to OCR button text - the labels are already strings.
Beyond UI elements, macOS offers additional structured data sources that agents can tap without any screen parsing: AppleScript and JXA for scriptable apps (Mail, Calendar, Contacts, Finder), the NSWorkspace API for running processes and open documents, and the Scripting Bridge for a wide range of third-party apps that expose scripting dictionaries.
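As a minimal sketch of that scriptable-app path, the snippet below runs an AppleScript query from Swift via NSAppleScript. The target calendar name ("Home") is illustrative, and the call assumes Automation permission for Calendar has been granted:

```swift
import Foundation

// Sketch: pull structured data from a scriptable app without touching its UI.
// Assumes Automation permission for Calendar; the calendar name is illustrative.
let source = """
tell application "Calendar"
    get summary of every event of calendar "Home"
end tell
"""
var error: NSDictionary?
if let result = NSAppleScript(source: source)?.executeAndReturnError(&error), error == nil {
    print(result) // an NSAppleEventDescriptor list of event titles
} else {
    print("AppleScript error: \(error ?? [:])")
}
```

Because this returns typed Apple Event descriptors rather than pixels, the agent gets event titles as strings with no OCR step at all.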
4. Performance: The 50x Speed Difference
The performance gap between accessibility-based and screenshot-based agents is not a minor optimization - it changes what kinds of tasks agents can realistically do.
| Task | Accessibility API | Screenshot + Vision | Speedup |
|---|---|---|---|
| Read current app state | ~50ms | ~2,500ms | 50x |
| Click a specific button | ~100ms | ~3,000ms | 30x |
| Fill a form (5 fields) | ~2s | ~20-30s | 10-15x |
| Navigate a multi-step workflow | ~5s | ~60-120s | 12-24x |
The 50x figure comes from eliminating the inference step for reading state. An accessibility query is a local API call - no network request, no model inference, no image encoding. You get the UI tree back in tens of milliseconds.
At the action level, the gap narrows but remains significant. Clicking a button still requires the agent to reason about what to click, but the model is reasoning over a compact text tree instead of a high-resolution image. The context is smaller, inference is faster, and the output is more reliable.
5. Privacy Benefits of Local Execution
Local agent execution means the agent's full context - what it sees, what it reasons about, what it decides - never leaves the device. This has concrete implications:
- No incidental data collection: Cloud agents often log screenshots, input text, and action traces for model improvement. A local agent running a local model has no telemetry pipeline unless you explicitly build one.
- Compliance-friendly for sensitive workflows: If your agent is touching documents under NDA, legal hold, or HIPAA scope, local execution keeps that data within your control boundary.
- No third-party terms of service on your data: When you send a screenshot to a cloud API, you are subject to that provider's data handling policies. Local execution removes that dependency entirely.
- Air-gap compatible: Agents that run locally can operate in environments with restricted outbound internet access - useful for developer environments inside secure networks.
- Reproducibility: Local models do not change silently. If your agent behaves a certain way today with a pinned model version, it will behave the same way next week without a silent upstream update breaking your workflow.
Practical note: "Local" does not necessarily mean fully offline. Many production setups run the UI interaction layer locally (AX API reads, mouse/keyboard synthesis) while routing the actual LLM inference to a hosted API. This hybrid approach keeps sensitive screen state local while still using cloud model quality. The key is that the raw screenshot or UI tree never leaves the machine - only a structured, intentional prompt does.
6. Connecting AI Agents to Native Apps
Wiring an AI agent to macOS native apps involves several layers. Here is how each piece fits together:
Layer 1 - Observation (reading state)
Use the AX API to query the frontmost app's UI tree. In Swift, this means calling AXUIElementCreateApplication(pid) and recursively walking the element hierarchy. The output is a nested structure of roles, labels, values, and coordinates.
Serializing this to a compact text format (indented role/label/value lines) gives the LLM a scannable representation of the current screen state.
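Putting those two steps together, a minimal observation sketch might look like the following. It assumes Accessibility permission is already granted, and the indented role/title output format is illustrative, not Fazm's actual serialization:

```swift
import AppKit
import ApplicationServices

// Sketch: walk the frontmost app's AX tree and print it as indented
// role/title lines for an LLM prompt. Assumes Accessibility permission.
func copyAttribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    AXUIElementCopyAttributeValue(element, name as CFString, &value)
    return value
}

func dumpTree(_ element: AXUIElement, depth: Int = 0) {
    let role = copyAttribute(element, kAXRoleAttribute) as? String ?? "?"
    let title = copyAttribute(element, kAXTitleAttribute) as? String ?? ""
    print(String(repeating: "  ", count: depth) + "\(role) \"\(title)\"")
    if let children = copyAttribute(element, kAXChildrenAttribute) as? [AXUIElement] {
        children.forEach { dumpTree($0, depth: depth + 1) }
    }
}

if let app = NSWorkspace.shared.frontmostApplication {
    dumpTree(AXUIElementCreateApplication(app.processIdentifier))
}
```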
Layer 2 - Action (writing state)
Synthetic input via CGEvent lets the agent move the mouse, click, type text, scroll, and trigger keyboard shortcuts. These events are injected at the system level and are indistinguishable from physical input to the target app.
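A click helper built on CGEvent can be sketched as below. The coordinates are the illustrative "Send" button values from the earlier tree, and injection requires Accessibility permission:

```swift
import CoreGraphics

// Sketch: synthesize a left click at global screen coordinates, as an
// agent would after reading an element's frame from the AX tree.
func click(at point: CGPoint) {
    for type in [CGEventType.leftMouseDown, .leftMouseUp] {
        CGEvent(mouseEventSource: nil, mouseType: type,
                mouseCursorPosition: point, mouseButton: .left)?
            .post(tap: .cghidEventTap)
    }
}

// Center of the illustrative "Send" button (x:820 y:640 w:80 h:32).
click(at: CGPoint(x: 860, y: 656))
```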
For higher-level actions on scriptable apps, AppleScript or JXA can bypass the UI entirely - creating calendar events, reading mail messages, or manipulating files without any visual interaction.
Layer 3 - Reasoning (the agent loop)
The agent loop is: read state, construct a prompt with the UI tree and task description, call the LLM, parse the action from the response, execute it, then loop. Each cycle is one "step" toward the goal.
The LLM receives the UI tree as text, reasons about which element to interact with, and emits a structured action (click at coordinates, type text into field, press key combination). No vision model required.
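The loop itself can be sketched as follows. `readUITree`, `callModel`, and `execute` are hypothetical placeholders for the three layers, injected as closures so the loop stays model-agnostic:

```swift
// Hypothetical sketch of the read-reason-act loop; the closure names are
// placeholders, not APIs from any particular project.
enum AgentAction { case click(x: Int, y: Int), typeText(String), done }

func runTask(goal: String,
             readUITree: () -> String,
             callModel: (String) -> AgentAction,
             execute: (AgentAction) -> Void,
             maxSteps: Int = 20) -> Int {
    var steps = 0
    for _ in 0..<maxSteps {
        let tree = readUITree()                               // Layer 1: observe
        let action = callModel("Task: \(goal)\nUI:\n\(tree)") // Layer 3: reason
        steps += 1
        if case .done = action { break }
        execute(action)                                       // Layer 2: act
    }
    return steps
}
```

The `maxSteps` cap is the first guardrail: the loop cannot run away even if the model never emits a terminal action.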
Open-source projects like Fazm implement exactly this stack for macOS. Fazm uses Swift for the native layer (AX queries, CGEvent synthesis, ScreenCaptureKit for fallback screenshots) and connects to a configurable LLM backend for the reasoning step. The architecture is intentionally modular - you can swap the model backend without changing any of the OS interaction code.
7. Guardrails and Permissions
An agent that can click anything and type anywhere is a powerful capability and a significant risk surface. The right approach is not to avoid building agents - it is to build in explicit guardrails from day one.
macOS itself provides the first layer of protection through mandatory permission prompts:
- Accessibility permission - required before any process can read AX data or inject input events. The user must grant it manually in System Settings under Privacy and Security; an app can trigger the system prompt, but it cannot grant the permission to itself.
- Screen recording permission - required for any screen capture beyond basic window metadata. Also requires explicit user approval.
- Automation permission - required to send Apple Events (AppleScript/JXA) to another app. macOS prompts per-target-app the first time.
Beyond OS-level permissions, well-designed agents add their own guardrail layer:
| Guardrail | What It Prevents | Implementation |
|---|---|---|
| Confirmation for destructive actions | Accidental deletes, sends, or submissions | Pause and show the planned action before executing |
| App allowlist | Agent touching apps outside its intended scope | Only inject events into pre-approved app bundle IDs |
| Action rate limits | Runaway loops that spam input | Max N actions per second, max M steps per task |
| Step logging | Inability to debug or audit what the agent did | Structured log of every observation, decision, and action |
| Kill switch | No way to stop a running agent task immediately | Global hotkey or menu bar control to halt execution |
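Two of the guardrails above, the app allowlist and the rate limit, compose naturally into a single gate the agent consults before every action. A minimal sketch, with illustrative bundle IDs and limits:

```swift
import Foundation

// Sketch: combine an app allowlist with a rolling one-second rate limit.
// Bundle IDs and limits are illustrative, not from any particular agent.
struct Guardrails {
    let allowedBundleIDs: Set<String>
    let maxActionsPerSecond: Int
    var recentActions: [Date] = []

    // Returns true only if the target app is allowlisted and the
    // one-second rolling action count is under the limit.
    mutating func permit(bundleID: String, now: Date = Date()) -> Bool {
        guard allowedBundleIDs.contains(bundleID) else { return false }
        recentActions = recentActions.filter { now.timeIntervalSince($0) < 1.0 }
        guard recentActions.count < maxActionsPerSecond else { return false }
        recentActions.append(now)
        return true
    }
}
```

Checking this gate before every injected event means a misbehaving model can refuse to terminate, but it cannot spray input into an unapproved app or exceed the action budget.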
The temptation in early agent development is to defer guardrails until "later." In practice, the right time to add them is before you first run the agent on a real machine with real apps. A runaway agent with no kill switch and no action log is genuinely unpleasant to deal with - it can generate dozens of phantom clicks before you can force-quit the process.
Try a local macOS agent built on accessibility APIs
Fazm is an open-source macOS AI agent that uses the AX API for reliable native app interaction - no screenshots, no pixel guessing, no cloud dependency required. Free to use and MIT licensed.
Get Fazm for macOS