Claude Code in Action - Desktop Automation
MCP Servers Changed How I Build with Claude Code. Here's How Desktop Automation Actually Works.
The Model Context Protocol turns Claude Code from a coding assistant into a system that can control your entire desktop. This guide covers how MCP servers work, how to wire them into Claude Code for desktop automation, what the different approaches look like in practice, and where hooks fit in to keep everything reliable. Whether you are automating browser workflows, testing native apps, or orchestrating multi-step desktop tasks, the MCP layer is what makes it all possible.
1. What MCP Servers Are and Why They Matter
The Model Context Protocol (MCP) is an open standard that lets AI models communicate with external tools through a unified interface. Instead of each tool needing its own custom integration, MCP defines a standard interface and transport: a server exposes tools with typed parameters and descriptions, and the model calls them like functions.
For Claude Code specifically, MCP servers show up as additional tools in the model's context. When you configure an MCP server - in a project-level .mcp.json or your user-level ~/.claude.json - every tool that server exposes becomes available to Claude during your session. This is what transforms Claude Code from a text-in-text-out coding tool into something that can interact with browsers, desktop applications, databases, APIs, and operating system primitives.
The key architectural insight is that desktop-automation MCP servers run locally on your machine. They are not cloud services: Claude Code spawns stdio servers as child processes and talks to them directly (the protocol also supports HTTP for remote servers, but desktop control needs local access). This means they have full access to your local environment: your filesystem, your running applications, your screen, and your accessibility tree.
- Tool exposure: Each MCP server registers tools with names, descriptions, and JSON Schema parameters. Claude sees these as callable functions.
- Local execution: MCP servers run as child processes on your machine, so they can access local resources - files, apps, system APIs - that cloud-based tools cannot.
- Composability: You can run multiple MCP servers simultaneously. A browser automation server, a desktop control server, and a database server can all be active in the same session.
- Standard protocol: Because MCP is standardized, servers built by different teams interoperate without custom glue code.
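Concretely, a server advertises its tools in response to a `tools/list` request. Here is a simplified sketch of that JSON-RPC payload - the `screenshot` tool and its schema are hypothetical, but the envelope follows the MCP specification:

```python
# Simplified shape of an MCP "tools/list" JSON-RPC response. The server
# describes each tool with a name, a description, and JSON Schema parameters;
# the "screenshot" tool here is a made-up example.
tools_list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "screenshot",
                "description": "Capture the current screen as a PNG.",
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "display": {"type": "integer"},
                    },
                },
            }
        ]
    },
}

# Claude sees each entry as a callable function with typed parameters.
tool = tools_list_response["result"]["tools"][0]
print(tool["name"], "->", sorted(tool["inputSchema"]["properties"]))  # screenshot -> ['display']
```

Everything the model knows about a tool comes from this description, which is why well-written tool names and descriptions matter so much for reliable automation.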
2. How MCP Servers Extend Claude Code
Out of the box, Claude Code can read and write files, run shell commands, and search codebases. MCP servers extend this with entirely new capabilities. The configuration lives in a project-level .mcp.json (or your user-level ~/.claude.json):

```json
// .mcp.json (project scope) - or ~/.claude.json for user scope
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "desktop-control": {
      "command": "python",
      "args": ["-m", "macos_use.server"]
    }
  }
}
```

Once configured, Claude Code launches these servers at session start. Every tool they expose appears in Claude's available tool list. For desktop automation, the most relevant categories of MCP servers include:
- Browser automation: Playwright MCP gives Claude the ability to navigate pages, click elements, fill forms, take screenshots, and read page content - all through the browser's automation protocol.
- Desktop control: Servers that expose macOS accessibility APIs or screen coordinate-based interactions, letting Claude operate native applications like Finder, System Settings, or any third-party app.
- Application-specific: Dedicated servers for tools like Slack, WhatsApp, or terminal multiplexers that provide structured APIs for specific applications rather than generic screen interaction.
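If you prefer not to edit JSON by hand, Claude Code's `claude mcp` CLI can register servers for you. A sketch - flags and package versions may differ across releases:

```shell
# Register a Playwright MCP server; everything after "--" is the command
# Claude Code will spawn at session start.
claude mcp add playwright -- npx @playwright/mcp@latest

# Confirm what is registered for the current scope.
claude mcp list
```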
The important distinction is between MCP servers, which add new tools to Claude's repertoire, and hooks, which add automatic behaviors around Claude's existing actions. Both are critical for desktop automation, but they serve different purposes.
3. Desktop Automation via MCP: Two Approaches
When it comes to controlling desktop applications through MCP, there are two fundamentally different approaches. Understanding the tradeoffs is critical for building reliable workflows.
Screenshot-based (vision-driven)
This approach captures screenshots of the screen, sends them to the vision model, and uses the model's understanding of the image to determine where to click. Tools like Anthropic's Computer Use API work this way - the model literally looks at pixels and decides coordinates.
- Works with any application, including those without accessibility support
- Can understand visual layout, colors, and spatial relationships
- Slower per interaction - each action requires a screenshot capture, model inference, and coordinate calculation
- Coordinates can be fragile across different screen resolutions and window positions
- Requires significant token usage for each screenshot processed
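The resolution fragility is easy to see in code. A minimal sketch (the resolutions are assumed example values) of mapping a coordinate the model chose on a screenshot back onto the logical screen:

```python
def scale_click(x: int, y: int, shot_size: tuple, screen_size: tuple) -> tuple:
    """Map a coordinate chosen on a screenshot onto the real screen.

    Screenshots are often captured at a different resolution than the
    logical screen (e.g. a 2x Retina capture), so a raw pixel coordinate
    would land in the wrong place without rescaling.
    """
    sx = screen_size[0] / shot_size[0]
    sy = screen_size[1] / shot_size[1]
    return round(x * sx), round(y * sy)

# A 2x Retina capture: the model picks (800, 600) on a 2880x1800 screenshot,
# but the logical screen is 1440x900, so the real click point is (400, 300).
print(scale_click(800, 600, (2880, 1800), (1440, 900)))  # (400, 300)
```

Get this scaling wrong - or let a window move between the screenshot and the click - and the action lands on the wrong element, which is exactly the failure mode the accuracy numbers below reflect.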
Accessibility API-driven
This approach reads the operating system's accessibility tree - a structured representation of every UI element, its role, label, and position. Instead of pixel matching, the model works with semantic data: “Button labeled Submit at position (340, 520)”.
- Much faster per interaction - no screenshot processing needed for element discovery
- Coordinates come directly from the OS, so they are precise regardless of resolution
- Element identification is semantic - find by role and label rather than visual appearance
- Depends on applications properly implementing accessibility (most native macOS apps do, some Electron apps do not)
- Cannot assess visual appearance, colors, or layout aesthetics
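What "semantic element identification" means in practice: the server hands the model a tree of roles and labels, and finding a target is a simple traversal. A toy sketch - the tree below mirrors the shape of a macOS accessibility dump, but real trees come from the AX APIs and are far larger:

```python
def find_element(node: dict, role: str, label: str):
    """Depth-first search of a (simplified) accessibility tree for an
    element matching a role and label; returns the node or None."""
    if node.get("role") == role and node.get("label") == label:
        return node
    for child in node.get("children", []):
        hit = find_element(child, role, label)
        if hit is not None:
            return hit
    return None

# Toy tree: a window containing a group containing a button.
tree = {
    "role": "AXWindow", "label": "Settings", "children": [
        {"role": "AXGroup", "label": "", "children": [
            {"role": "AXButton", "label": "Submit", "position": (340, 520)},
        ]},
    ],
}

button = find_element(tree, "AXButton", "Submit")
print(button["position"])  # (340, 520)
```

No pixels are involved: the position comes straight from the OS, which is why this path is both faster and more precise than coordinate estimation from an image.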
4. Screenshot-Based vs Accessibility API - Detailed Comparison
Here is a practical comparison based on real-world desktop automation workflows. Treat the numbers as representative ranges from actual usage; exact figures vary with model, hardware, and application.
| Factor | Screenshot-Based | Accessibility API |
|---|---|---|
| Latency per action | 2-5 seconds (screenshot + inference) | 200-500ms (tree traversal) |
| Token cost per action | 1,500-3,000 tokens (image encoding) | 200-500 tokens (text only) |
| Click accuracy | 85-92% (coordinate estimation) | 98-99% (OS-reported positions) |
| Cross-resolution support | Requires calibration | Works natively |
| Visual verification | Built-in | Requires separate screenshot |
| App compatibility | Universal | Depends on app accessibility |
| 10-step workflow time | 30-50 seconds | 5-10 seconds |
In practice, the best approach combines both. Use accessibility APIs for element discovery and clicking, then take targeted screenshots for visual verification at key checkpoints. This hybrid strategy gives you the speed and precision of accessibility with the visual confirmation of screenshot-based approaches.
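A sketch of one hybrid step. The `client` wrapper standing in for the MCP desktop server is hypothetical (a fake in-memory version is included so the sketch runs); the pattern - accessibility for discovery and clicking, screenshots only at checkpoints - is the point:

```python
def run_step(client, role, label, verify_visually=False):
    """Hybrid step: locate and click via the accessibility tree (fast,
    precise), then optionally grab a targeted screenshot to confirm."""
    element = client.find(role, label)
    if element is None:
        raise RuntimeError(f"{role} '{label}' not in accessibility tree")
    client.click(element["position"])
    return client.screenshot(region=element["position"]) if verify_visually else None


class FakeClient:
    """In-memory stand-in so the sketch runs without a real MCP server."""

    def __init__(self):
        self.clicks = []

    def find(self, role, label):
        return {"role": role, "label": label, "position": (340, 520)}

    def click(self, position):
        self.clicks.append(position)

    def screenshot(self, region):
        return b"png-bytes"  # a real client returns image data here


client = FakeClient()
run_step(client, "AXButton", "Submit")                       # fast path: no image
shot = run_step(client, "AXButton", "Submit", verify_visually=True)
print(client.clicks, bool(shot))  # [(340, 520), (340, 520)] True
```

Most steps take the fast path; you pay the screenshot cost only where visual confirmation actually matters.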
5. Hooks for Auto-Validation in Desktop Workflows
Claude Code hooks are scripts that run automatically before or after specific events - tool calls, model responses, or session lifecycle events. For desktop automation, hooks serve as a critical safety and validation layer.
The basic idea: after Claude performs a desktop action (clicking a button, filling a form, opening an application), a hook can automatically verify the result before Claude proceeds. This prevents cascading failures where one misclick derails an entire workflow.
Hooks are configured in your settings file. A post-tool-use hook that verifies desktop state after every macos-use tool call looks like this:

```json
// ~/.claude/settings.json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "mcp__macos-use__.*",
        "hooks": [
          {
            "type": "command",
            "command": "python scripts/verify_screen_state.py"
          }
        ]
      }
    ]
  }
}
```

Common hook patterns for desktop automation include:
- State verification: After any click or type action, verify the expected window or element is now active. This catches cases where a dialog popped up unexpectedly or the app switched to a different state.
- Error detection: Scan the accessibility tree or take a screenshot after each action to check for error dialogs, crash reporters, or unexpected alerts.
- Checkpoint logging: Record each desktop action and its result to a log file. When a multi-step workflow fails, you have a complete audit trail to debug from.
- Timeout enforcement: Kill a desktop automation sequence if a single step takes longer than a threshold, preventing the agent from getting stuck in infinite retry loops.
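What might the verification script itself look like? A minimal sketch of a post-tool-use hook: Claude Code pipes a JSON payload describing the tool call to the hook's stdin, and exit code 2 reports the failure back to the model. The error-detection logic here is a stand-in; a real hook would query the accessibility tree or a screenshot:

```python
import json
import sys


def looks_like_error(tool_response) -> bool:
    """Stand-in check: a real hook would inspect the accessibility tree
    or a screenshot for error dialogs and crash reporters."""
    text = json.dumps(tool_response).lower()
    return "error dialog" in text or "crash reporter" in text


def hook_exit_code(payload: dict) -> int:
    """Return 0 to let Claude proceed; 2 to block and report back."""
    if looks_like_error(payload.get("tool_response", {})):
        print(f"Unexpected error state after {payload.get('tool_name')}",
              file=sys.stderr)
        return 2
    return 0


# When wired up as a hook, Claude Code pipes the tool-call payload to stdin:
#   sys.exit(hook_exit_code(json.load(sys.stdin)))
print(hook_exit_code({"tool_name": "mcp__macos-use__click",
                      "tool_response": {"text": "clicked OK"}}))  # 0
```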
Hooks run synchronously - Claude waits for the hook to complete before proceeding. This is what makes them effective for validation. If a hook exits with a blocking status (exit code 2 in Claude Code), its stderr is fed back to the model, so Claude sees the failure and can adjust its approach rather than blindly continuing.
6. Building Real Desktop Automation Workflows
Let's walk through what a real desktop automation workflow looks like with Claude Code and MCP servers. Consider a common scenario: automated testing of a native macOS application after each code change.
The workflow involves three phases: build, launch, and verify. Claude writes code, triggers a build via shell commands, then uses MCP desktop tools to launch the app and verify the UI matches expectations.
- Code change: Claude edits source files using its built-in file tools.
- Build: Claude runs the build command via shell. A hook verifies the build succeeded before proceeding.
- Launch: Claude uses the desktop MCP server to open the application. The accessibility tree confirms the main window appeared.
- Navigation: Claude traverses the accessibility tree to find and click the relevant UI elements, navigating to the screen that was changed.
- Verification: Claude takes a screenshot of the relevant area and visually confirms the change looks correct. A hook logs the result.
- Cleanup: Claude closes the application and reports the result.
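The checkpoint-logging hook pattern applied to this workflow might look like the following sketch (the phase names match the steps above; the log path and record fields are illustrative):

```python
import json
import tempfile
import time


def log_checkpoint(path, phase, ok, detail=""):
    """Append one audit-trail record per workflow phase. JSON Lines keeps
    the log both greppable and machine-readable when a run fails."""
    record = {"ts": time.time(), "phase": phase, "ok": ok, "detail": detail}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


# One record per phase of the build/launch/verify workflow.
path = tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False).name
for phase in ["build", "launch", "navigate", "verify", "cleanup"]:
    log_checkpoint(path, phase, ok=True)
```

When step four of six fails at 2 a.m., this file tells you exactly which phases succeeded and what the last known-good state was.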
This entire sequence happens within a single Claude Code session. The MCP servers handle the desktop interaction, hooks handle validation, and Claude orchestrates the whole thing based on its understanding of what needs to happen. No scripts to maintain, no test framework to configure - just natural language instructions backed by real desktop control.
For browser-based workflows, the pattern is similar but uses Playwright MCP instead. Claude navigates to a URL, interacts with page elements, fills forms, and verifies outcomes. The same hook-based validation applies - after each browser action, verify the page state matches expectations before proceeding.
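The browser workflow can be written out as the sequence of MCP tool calls Claude would make. The `browser_*` names below follow Playwright MCP's naming convention but are illustrative - check your server's actual tool list:

```python
# The browser workflow as an ordered list of (tool, arguments) calls.
steps = [
    ("browser_navigate", {"url": "https://example.com/form"}),
    ("browser_type", {"element": "Name field", "text": "Ada"}),
    ("browser_click", {"element": "Submit button"}),
    ("browser_snapshot", {}),  # verification checkpoint for the hook
]


def ends_with_verification(steps):
    """Hook-style check: refuse workflows that do not end by capturing state."""
    return bool(steps) and steps[-1][0] in {"browser_snapshot",
                                            "browser_take_screenshot"}


print(ends_with_verification(steps))  # True
```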
7. The Tools Landscape for Desktop Automation
The ecosystem of MCP-compatible desktop automation tools is growing quickly. Here are the main options available today, each with different strengths.
| Tool | Approach | Platform | Best For |
|---|---|---|---|
| Playwright MCP | Browser automation protocol | Cross-platform | Web app testing, form filling, scraping |
| Computer Use (Anthropic) | Screenshot + coordinate | Cross-platform | Universal desktop control, any app |
| Fazm | Accessibility API | macOS | Fast native app control, precise element targeting |
| macos-use | Accessibility API | macOS | Accessibility tree traversal, open source |
| Application-specific servers | Native API wrappers | Varies | Deep integration with specific apps (Slack, WhatsApp) |
The choice depends on your specific workflow. For web-only automation, Playwright MCP is the clear winner - it is fast, reliable, and cross-platform. For native macOS apps, accessibility-based tools like Fazm or macos-use give you precise, fast control without the token overhead of screenshot processing. For maximum compatibility across any application on any platform, Computer Use trades speed for universality.
Many teams run multiple MCP servers simultaneously. A common setup pairs Playwright for browser work with an accessibility-based server for native app interaction. Claude Code handles the orchestration naturally - it picks the right tool for each step based on context.
What matters most is not which individual tool you choose but how you wire them together. MCP provides the protocol layer. Hooks provide the validation layer. Claude provides the reasoning layer. The combination is what makes desktop automation actually reliable enough to trust with real work, rather than just impressive demos that break on the second run.
Try Accessibility-Based Desktop Automation
Fazm gives Claude Code fast, precise macOS desktop control through the accessibility API. Set it up as an MCP server in under two minutes.
Try Fazm Free