MCP Server Composability: How Stacking Tools Makes AI Agents Actually Useful
The real power of the Model Context Protocol is not any single server - it is what happens when you run several together. A Perplexity server for search, a Playwright server for web automation, an accessibility API server for desktop control. Each one is simple on its own. Combined, they give an AI agent capabilities that no monolithic tool can match. This guide covers the architecture, real examples, and honest limitations of MCP composability.
1. What MCP Composability Actually Means
MCP composability is not a marketing term. It is a specific architectural property: you can add, remove, or swap MCP servers in your agent's config without changing any other server or the agent itself. Each server exposes tools through the same protocol. The agent sees all tools from all connected servers in a flat list and decides which ones to call based on the task.
This is different from traditional plugin systems where integrations need to know about each other. An MCP server for web search does not know an MCP server for file operations exists. They do not share state or coordinate. The LLM sitting on top is the orchestration layer - it reads the available tools, understands the task, and chains calls across servers as needed.
The practical result: you build capabilities by addition, not by rewriting. Want your coding agent to also search the web? Add a Perplexity MCP server. Want it to control a browser? Add Playwright. Want desktop automation? Add an accessibility API server. Each addition takes a config change, not a code change.
2. The Architecture Pattern: How Servers Stack
The MCP architecture has three layers. Understanding them explains why composability works and where it has limits.
The three layers:
1. MCP Servers - Each server wraps a specific capability (search, browser control, file access, database queries) and exposes it as a set of tools with typed inputs and outputs. Servers run as separate processes, typically launched via stdio or SSE transport.
2. MCP Client / Host - The client connects to multiple servers simultaneously and aggregates their tool definitions. Claude Desktop, Cursor, VS Code with Copilot, and CLI tools like Claude Code all act as MCP hosts. The host presents all tools to the LLM in a single context.
3. LLM Orchestration - The language model sees the full tool list and decides which tools to call, in what order, and how to combine their outputs. This is where composability happens - the model chains tools from different servers without any server-to-server communication.
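To make the aggregation step concrete, here is a minimal sketch of how a host might flatten tool definitions from several servers into the single list the LLM sees. The server and tool names are illustrative stand-ins, not real MCP wire-format data:

```python
# Sketch: an MCP host flattening tools from several servers into one list.
# Server and tool names are illustrative, not a real protocol exchange.

def aggregate_tools(servers: dict[str, list[str]]) -> list[str]:
    """Namespace each tool by its server and return one flat list."""
    flat = []
    for server_name, tools in servers.items():
        for tool in tools:
            flat.append(f"{server_name}.{tool}")
    return flat

servers = {
    "perplexity": ["perplexity_search", "perplexity_ask"],
    "playwright": ["browser_navigate", "browser_click"],
    "desktop": ["click_element", "type_text"],
}

tool_list = aggregate_tools(servers)
# The LLM receives all six tools in one context, regardless of origin.
```

The flat list is the whole trick: nothing in the protocol distinguishes a search tool from a desktop tool, so the model is free to chain them.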
A typical config looks like a JSON object mapping server names to their launch commands. Adding a new server is literally adding a new key-value pair. The agent restarts, picks up the new tools, and can immediately use them alongside everything else.
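For instance, a minimal config in the style Claude Desktop uses (`claude_desktop_config.json`) might look like the following. The package names here are the ones commonly published for these servers, but treat them as examples and check each server's README for the current install command:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/project"]
    },
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```

Adding desktop automation, search, or anything else means adding one more entry under `mcpServers` - nothing above changes.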
3. Real Examples: Playwright, Accessibility APIs, Perplexity
Here is what composability looks like in practice with three servers that cover different domains:
Perplexity MCP Server (Web Search)
Exposes a search tool that returns sourced answers from the web. Some implementations scrape Perplexity directly, requiring no API key. The agent can ask factual questions mid-task - checking documentation, verifying library versions, looking up error codes - without leaving the workflow.
Typical tools: perplexity_search, perplexity_ask
Playwright MCP Server (Web Automation)
Gives the agent a real browser it can control - navigate pages, click elements, fill forms, take screenshots, read DOM content. The agent uses accessibility snapshots of the page rather than raw HTML, which makes interactions more reliable across different sites.
Typical tools: browser_navigate, browser_click, browser_snapshot, browser_type
Accessibility API MCP Server (Desktop Automation)
Uses the operating system's native accessibility APIs (macOS Accessibility, Windows UI Automation) to read and interact with desktop applications. The agent can see every UI element - buttons, text fields, menus - with their labels, roles, and positions. This is fundamentally more reliable than screenshot-based approaches because it reads the actual UI tree, not pixels.
Typical tools: click_element, type_text, read_screen, list_windows
Now combine them. An agent with all three servers can: search the web for information, open a browser to fill out a web form with that information, then switch to a desktop app to enter the same data there. No server knows about the others. The LLM figures out the workflow.
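That cross-server workflow can be sketched as a trace of tool calls. The dispatch function below is a stand-in for the host routing each call to the right server; the tool names match the servers above, and the arguments are invented for illustration:

```python
# Illustrative trace of a cross-server workflow. fake_dispatch stands in
# for the host routing each tool call; the argument values are made up.

def fake_dispatch(tool: str, **kwargs) -> str:
    """Stand-in for the MCP host routing a call to whichever server owns the tool."""
    return f"{tool} handled with {kwargs}"

# The LLM, not any server, decides this ordering:
plan = [
    ("perplexity_search", {"query": "current EUR to USD rate"}),
    ("browser_navigate",  {"url": "https://example.com/form"}),
    ("browser_type",      {"field": "rate", "text": "1.08"}),   # value from search
    ("click_element",     {"name": "Save"}),   # desktop app, via accessibility API
]

transcript = [fake_dispatch(tool, **args) for tool, args in plan]
```

Note that the "plan" lives entirely in the model's reasoning; the servers only ever see their own individual calls.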
4. Monolithic Agent Tools vs Composable MCP
Before MCP, giving an agent new capabilities meant building custom tool integrations inside the agent codebase. Here is how the two approaches compare in practice:
| Factor | Monolithic Tools | Composable MCP |
|---|---|---|
| Setup time for new capability | Hours to days (write integration code, test, deploy) | Minutes (add server to config, restart) |
| Maintenance burden | You own all integration code and updates | Server maintainers handle updates independently |
| Adding capability #5 | Same effort as #1 - linear scaling | Same config change - constant time |
| Cross-tool workflows | Requires explicit wiring between tools | LLM chains tools automatically |
| Flexibility to swap providers | Rewrite integration code | Swap server config, same tool interface |
| Context window cost | Only tools you built | All server tools consume tokens (can grow large) |
| Reliability | Deterministic - you control the logic | Depends on LLM tool selection accuracy |
The tradeoff is clear: MCP composability trades some determinism for dramatically faster capability expansion. For most agent workflows - especially during development and prototyping - the speed advantage dominates. For production systems where a specific tool chain must execute reliably every time, you may want to constrain which servers are available.
5. Desktop Automation via MCP
Desktop automation is one of the most compelling use cases for MCP composability, and it is also where the architecture choice matters most. There are two fundamental approaches to letting an AI control desktop applications:
Screenshot-based (vision models)
The agent takes a screenshot, sends it to a vision model, gets back coordinates to click. This works for demos but has real problems in production: resolution sensitivity, scaling issues across displays, slow inference for each screenshot, and no semantic understanding of what UI elements actually are. If a button moves 10 pixels, the model might miss it.
Accessibility API-based (structured UI tree)
The agent reads the OS accessibility tree - the same structured data that screen readers use. Every button, text field, menu item, and label is represented with its role, name, value, and position. The agent can click "Save" by name rather than by pixel coordinates. This is faster (no vision model inference), more reliable (elements are identified semantically), and works across display configurations.
Fazm takes the accessibility API approach as an MCP server for desktop automation. It connects to macOS and Windows accessibility APIs, exposes desktop interaction tools through MCP, and works alongside any other MCP servers you are already running. Because it is just another MCP server, adding desktop automation to an existing agent setup is a config change - you do not need to rebuild anything. Learn more at fazm.ai.
The composability angle matters here because desktop automation rarely exists in isolation. A typical workflow might involve searching the web for data (Perplexity server), then entering it into a desktop app (accessibility API server), then verifying the result in a browser (Playwright server). With MCP, this is three servers working together through one agent - no custom glue code required.
6. When Composability Breaks Down
MCP composability is not universally better. Here are the real limitations you will hit:
- Context window bloat. Every server adds its tool definitions to the LLM context. Ten servers with 5 tools each means 50 tool definitions consuming tokens on every request. This gets expensive and can degrade tool selection accuracy as the model has more options to choose from.
- No shared state between servers. If your Playwright server logs into a web app and your desktop server needs that same session, there is no built-in way to share cookies or auth tokens. Each server is isolated by design.
- LLM tool selection is not deterministic. The model might choose the wrong tool, call tools in a suboptimal order, or miss that a tool exists entirely. The more servers you add, the more likely this becomes. Prompt engineering helps but does not eliminate the problem.
- Error handling is limited. If one server crashes, the agent gets an error response but has no recovery mechanism beyond retrying. There is no circuit breaker or fallback pattern built into the protocol.
- Startup overhead. Each server is a separate process. Ten servers means ten processes consuming memory and requiring individual health monitoring. For local development this is fine. For production deployments, you need process management.
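The context-bloat point is worth quantifying, even roughly. The per-tool token figure below is an assumption (tool schemas vary widely in size); the takeaway is that the cost is paid on every single request and scales linearly with your server count:

```python
# Back-of-the-envelope context cost of tool definitions per request.
# TOKENS_PER_TOOL is an assumed average (name + description + JSON schema);
# real values depend on how verbose each server's schemas are.

TOKENS_PER_TOOL = 150  # assumption, not a measured constant

def tool_context_cost(num_servers: int, tools_per_server: int) -> int:
    """Tokens spent on tool definitions alone, every request."""
    return num_servers * tools_per_server * TOKENS_PER_TOOL

small_stack = tool_context_cost(3, 5)    # a focused 3-server stack
large_stack = tool_context_cost(10, 5)   # the "install everything" stack
```

Under this assumption the ten-server stack spends over three times the tokens of the three-server stack before the model has read a single word of your actual task.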
The pragmatic approach: start with 2-3 servers that cover your core workflow. Add more only when you have a concrete use case. Resist the urge to install every available server just because it is easy to do.
7. Getting Started: Picking Your First MCP Servers
If you are new to MCP, here is a practical starting stack based on what most developers actually need:
Recommended first servers:
- Filesystem server - Read and write files. This is the foundation. Most MCP hosts (Claude Desktop, Cursor) include this or something equivalent by default.
- Web search server - Perplexity, Brave Search, or Tavily. Lets your agent look things up without you copy-pasting from a browser. Pick whichever has the auth model you prefer (some require API keys, some do not).
- Browser automation server - Playwright MCP is the most mature option. Gives your agent the ability to interact with web applications, which covers a huge range of tasks from form filling to data extraction.
Once you have those three running and useful, consider adding specialized servers based on your workflow: database access (Postgres MCP), desktop automation (accessibility API-based), git operations, or communication tools (Slack, email).
The key insight is that each server you add does not just give you new tools - it gives you new combinations of tools. Three servers with five tools each is not 15 capabilities; it is potentially hundreds of workflows the LLM can construct by chaining those tools together. That multiplicative effect is why composability matters more than any individual server.
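The multiplicative claim is easy to check with a quick count. Not every ordered chain of tools is a meaningful workflow, of course, but the search space the LLM can draw on grows combinatorially rather than linearly:

```python
# Ordered 3-step chains over 15 tools versus 15 standalone capabilities.
# Most chains are nonsense, but the usable subset still dwarfs the tool count.

from math import perm

tools = 3 * 5                      # three servers, five tools each
chains_of_three = perm(tools, 3)   # ordered chains, no repeats: 15 * 14 * 13
```

Even at three steps there are thousands of possible orderings, which is why the model's ability to pick a sensible chain matters more than the raw size of any one server's tool list.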
Add Desktop Automation to Your MCP Stack
Fazm is an MCP server for desktop automation using accessibility APIs. Drop it into your existing setup alongside your other MCP servers.
Try Fazm Free