The Complete AI Agent Tooling Stack: MCP, Accessibility APIs, Memory, and Orchestration

A LinkedIn discussion made the case that AI agent improvements come from tooling, not models. After building and deploying production agents, we agree completely. The model is the easy part. The tooling stack - how the agent perceives, acts, remembers, and coordinates - is where the real engineering challenge lives.

1. The Tooling Thesis: Why Models Are Not the Bottleneck

When an AI agent fails at a task, the instinct is to blame the model. "It is not smart enough." But in practice, most agent failures come from tooling gaps: the agent could not see the right information, could not execute the right action, forgot critical context from a previous step, or got stuck because no tool existed for what it needed to do.

Consider a model upgrade from Claude 3.5 Sonnet to Opus. That upgrade improves reasoning quality by maybe 10-20% on benchmarks. Now compare that to adding a proper file system tool that lets the agent search, read, and edit files reliably. That single tool addition can take task success rates from 30% to 90% - a 3x improvement that no model upgrade can match.

The leverage equation: A mediocre model with excellent tools outperforms an excellent model with mediocre tools. This is unintuitive but consistently observed in production agent deployments. Invest in tooling first, model selection second.

2. Perception Layer: How Agents See the World

An agent can only act on what it can perceive. The perception layer determines what information is available to the model at decision time. Different agent types use radically different perception approaches:

| Perception Method | How It Works | Strengths | Limitations |
|---|---|---|---|
| Screenshot/vision | Capture screen pixels, send to vision model | Works with any application | Expensive, slow, imprecise coordinates |
| Accessibility tree | Read OS accessibility APIs for structured UI data | Precise elements, fast, low-cost | Not all apps expose full trees |
| DOM/browser | Parse page DOM via browser automation | Rich structured data for web | Only works in browsers |
| File system | Read files, search codebases, parse configs | Deep code understanding | Limited to file content |
| API responses | Call APIs and parse structured responses | Clean data, well-defined schemas | Requires API integration per service |

The most capable agents combine multiple perception methods. A desktop agent might use accessibility trees for most interactions, fall back to screenshots for visual verification, and use file system access for code tasks. Each layer compensates for the others' limitations.
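This layering can be sketched in a few lines. The sketch below is a simplified illustration, not any particular agent's implementation: `query_accessibility_tree` and `capture_screenshot` are hypothetical stand-ins for platform-specific APIs.

```python
# Sketch of a multi-method perception layer: prefer the cheap, precise
# accessibility tree; fall back to screen pixels when no tree is exposed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    source: str                       # which perception method produced this
    elements: list                    # structured UI elements, if available
    screenshot: Optional[bytes] = None

def query_accessibility_tree(app: str) -> list:
    # Placeholder: real code would call the OS accessibility API.
    return []

def capture_screenshot(app: str) -> bytes:
    # Placeholder: real code would grab screen pixels.
    return b"\x89PNG..."

def perceive(app: str) -> Observation:
    """Try the structured, low-cost method first; pay for pixels only on fallback."""
    elements = query_accessibility_tree(app)
    if elements:
        return Observation(source="accessibility", elements=elements)
    # The app exposes no usable tree, so fall back to a screenshot.
    return Observation(source="screenshot", elements=[],
                       screenshot=capture_screenshot(app))

obs = perceive("SomeApp")
```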

3. Action Layer: MCP Servers and Tool Interfaces

The Model Context Protocol (MCP) has become the standard interface between AI agents and external tools. Instead of every agent building custom integrations with every service, MCP provides a uniform protocol: the agent discovers available tools, calls them with structured parameters, and receives structured responses.

The practical impact is significant. Before MCP, adding a new tool to an agent meant writing custom code. With MCP, you install a server and the agent immediately gains access to its capabilities. This has created an ecosystem of reusable tool servers:

  • Browser automation servers - Playwright-based MCP servers that give agents full browser control (navigation, clicking, form filling, screenshots)
  • File system servers - Structured file operations with search, read, write, and edit capabilities
  • Git servers - Version control operations with safety guardrails built in
  • Database servers - SQL queries, schema inspection, data manipulation
  • Communication servers - Email (Gmail), messaging (Slack, WhatsApp), social media
  • OS control servers - macOS accessibility, keyboard/mouse automation, application management
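The discover-then-call shape is the heart of the protocol. The sketch below illustrates it with a fake in-process transport; a real client would speak JSON-RPC to a server over stdio or HTTP (ideally via an official MCP SDK), and the exact response fields may differ by protocol version.

```python
# Minimal sketch of MCP-style tool discovery and invocation.
# send_jsonrpc() is a hypothetical transport that fakes server responses.

def send_jsonrpc(method: str, params: dict) -> dict:
    # Placeholder transport: a real client writes JSON-RPC requests to the
    # server's stdin (or an HTTP endpoint) and reads structured responses.
    if method == "tools/list":
        return {"tools": [{
            "name": "read_file",
            "description": "Read a file from disk",
            "inputSchema": {"type": "object",
                            "properties": {"path": {"type": "string"}}}}]}
    if method == "tools/call":
        return {"content": [{"type": "text", "text": "file contents here"}]}
    raise ValueError(f"unknown method: {method}")

# 1. Discover what the server offers (names, descriptions, input schemas).
tools = send_jsonrpc("tools/list", {})["tools"]

# 2. Call a tool with structured parameters and receive a structured result.
result = send_jsonrpc("tools/call",
                      {"name": "read_file", "arguments": {"path": "/etc/hosts"}})
```

Because the agent learns the available tools and their schemas at runtime, installing a new server extends the agent without any custom integration code.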

The quality of MCP servers varies enormously. A well-built server handles errors gracefully, provides clear output, and includes safety checks. A poorly built one crashes silently and returns ambiguous responses that confuse the model. Server quality is often the difference between a reliable agent and an unreliable one.
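The difference is easy to show concretely. Below is a toy contrast between a brittle and a robust file-reading tool handler; the exact result fields are illustrative, but the principle (return a structured error the model can act on, never crash opaquely) carries over to real servers.

```python
# A brittle handler vs. a robust one. The robust version always returns a
# structured payload, so the model can recover instead of getting confused.

def read_file_brittle(path: str) -> str:
    # Raises on a missing file; the model sees an opaque stack trace, or nothing.
    with open(path) as f:
        return f.read()

def read_file_robust(path: str) -> dict:
    # Returns a structured result either way, with a hint for recovery.
    try:
        with open(path) as f:
            return {"ok": True, "text": f.read()}
    except FileNotFoundError:
        return {"ok": False,
                "error": f"No such file: {path}. "
                         "Check the path or list the directory first."}
```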

4. Memory: Context Persistence Across Sessions

Without memory, every agent session starts from zero. The agent has no knowledge of what it did yesterday, what mistakes it made, or what the user prefers. Memory systems solve this by persisting relevant context across sessions.

Memory comes in several forms, each serving a different purpose:

  • Instruction memory (CLAUDE.md) - Static rules and preferences that apply to every session. This is the simplest and most reliable form of memory because it is human-curated and version controlled.
  • Episodic memory - Records of past interactions, decisions, and outcomes. Systems like Hindsight store these experiences and surface them when relevant context matches occur in future sessions.
  • Semantic memory - Indexed knowledge about the codebase, project structure, and domain. This includes things like "the auth system uses JWT tokens stored in httpOnly cookies" - facts the agent should know without re-discovering them.
  • Procedural memory (skills) - Encoded workflows that tell the agent how to perform specific tasks. Skills are essentially procedural memory stored as prompt instructions.
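To make episodic memory concrete, here is a toy sketch of the store-and-recall loop. Production systems like Hindsight use embeddings and relevance ranking; simple keyword overlap is enough to show the shape, and all names here are illustrative.

```python
# Toy episodic memory: record past task outcomes, surface the most similar
# ones when a new task arrives. Real systems use embedding similarity.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []  # (task_description, outcome) pairs

    def record(self, task: str, outcome: str) -> None:
        self.episodes.append((task, outcome))

    def recall(self, task: str, top_k: int = 3) -> list:
        """Return past episodes sharing the most words with the new task."""
        words = set(task.lower().split())
        scored = [(len(words & set(t.lower().split())), t, o)
                  for t, o in self.episodes]
        scored.sort(reverse=True)
        return [(t, o) for score, t, o in scored[:top_k] if score > 0]

memory = EpisodicMemory()
memory.record("deploy the api service", "forgot to run migrations first")
memory.record("write unit tests", "used pytest fixtures")

# A new, similar task surfaces the relevant past experience.
hits = memory.recall("deploy the billing service")
```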

The practical state of memory systems is still early. Most production agents rely primarily on instruction memory (CLAUDE.md) and procedural memory (skills), with episodic and semantic memory being supplementary. This will change as memory infrastructure matures, but today the ROI on writing a good CLAUDE.md far exceeds the ROI on setting up a vector database for agent memories.

5. Orchestration: Managing Complex Agent Workflows

Orchestration is the layer that manages how agents plan, execute, and recover from failures. For simple tasks, the model's built-in planning is sufficient. For complex multi-step workflows, external orchestration becomes essential.

Orchestration patterns from simplest to most complex:

  • Single-shot execution - Agent receives task, executes to completion. No external orchestration needed. Works for tasks under 10 minutes that require no human checkpoints.
  • Skill-based workflows - Skills provide the step sequence, and the agent follows them. The skill is the orchestrator. Works for repeatable processes like deploy, review, or audit.
  • Human-in-the-loop - Agent executes until it hits a decision point, pauses for human input, then continues. Common in production workflows where certain actions need approval.
  • Sub-agent delegation - A parent agent spawns focused sub-agents for specific parts of a larger task. The parent manages task assignment and result aggregation. Useful for parallelizable work.
  • Event-driven orchestration - Agents respond to external triggers (cron schedules, webhooks, file changes) rather than being invoked by a user. This enables background automation and continuous monitoring.
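The human-in-the-loop pattern in particular is simple to express. The sketch below is a minimal illustration, assuming the host supplies an `approve` callback (a CLI prompt, Slack message, or web UI in practice).

```python
# Sketch of human-in-the-loop orchestration: run steps until one requires
# approval; halt if the human declines.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    needs_approval: bool = False

def run_workflow(steps: list, execute: Callable[[Step], str],
                 approve: Callable[[Step], bool]) -> list:
    log = []
    for step in steps:
        if step.needs_approval and not approve(step):
            log.append(f"halted before: {step.name}")
            break
        log.append(execute(step))
    return log

steps = [Step("run tests"), Step("build image"),
         Step("deploy to production", needs_approval=True)]
log = run_workflow(steps,
                   execute=lambda s: f"done: {s.name}",
                   approve=lambda s: False)  # the human declined
```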

The orchestration layer is where most over-engineering happens. Teams build elaborate agent management systems when a simple skill-based workflow would suffice. Start with the simplest pattern that works and add complexity only when you hit real limitations.

6. Comparing Tooling Stacks Across Agent Types

Different types of AI agents emphasize different parts of the tooling stack. Understanding these trade-offs helps you choose the right tools for your use case:

| Agent Type | Perception | Action | Memory | Example |
|---|---|---|---|---|
| Code agent | File system, LSP | File edit, terminal, git | CLAUDE.md, skills | Claude Code, Cursor, Aider |
| Browser agent | DOM, screenshots | Click, type, navigate | Session cookies, profiles | Playwright MCP, Browser Use |
| Desktop agent | Accessibility tree, vision | OS-level control, all apps | System state, user prefs | Fazm, macOS agents |
| API agent | API responses, webhooks | HTTP calls, SDK methods | API state, auth tokens | Custom integrations |

The most versatile agents combine capabilities from multiple categories. A desktop agent that can also read code and call APIs is significantly more capable than one limited to screen-level interaction. This is why the MCP ecosystem matters - it lets you compose tools from different domains into a single agent.

The trend is toward convergence. Code agents are gaining browser control (via MCP). Desktop agents are getting better at code tasks. The distinction between agent types is blurring as tooling ecosystems grow.

7. Building Your Agent Tooling Stack

If you are building or deploying AI agents, here is a practical framework for assembling your tooling stack:

  • Start with perception - What does your agent need to see? If it is code, you need file system tools. If it is web pages, you need browser automation. If it is desktop apps, you need accessibility APIs. Get perception right first because everything else depends on it.
  • Add essential actions - Give the agent the minimum set of tools to complete its core task. Resist adding tools speculatively. Each tool adds complexity to the model's decision space and can degrade performance if the model has to choose between too many options.
  • Implement basic memory - A CLAUDE.md with project rules and a few key skills covers 80% of memory needs. Add episodic memory only if your agent performs recurring tasks where learning from history measurably improves outcomes.
  • Keep orchestration simple - Use single-shot execution for most tasks. Add skill-based workflows for repeatable processes. Only build custom orchestration if you have proven that simpler approaches are insufficient.
  • Measure tool reliability - Track success rates per tool. A tool that fails 20% of the time will make the agent look unreliable even if the model performs perfectly. Fix flaky tools before upgrading models.
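Measuring tool reliability needs very little infrastructure. A minimal sketch of the idea, with illustrative tool names:

```python
# Per-tool reliability tracking: count calls and failures so flaky tools
# can be identified and fixed before blaming the model.
from collections import defaultdict

class ToolStats:
    def __init__(self):
        self.calls = defaultdict(int)
        self.failures = defaultdict(int)

    def record(self, tool: str, ok: bool) -> None:
        self.calls[tool] += 1
        if not ok:
            self.failures[tool] += 1

    def success_rate(self, tool: str) -> float:
        if self.calls[tool] == 0:
            return 0.0
        return 1 - self.failures[tool] / self.calls[tool]

    def flaky_tools(self, threshold: float = 0.8) -> list:
        return [t for t in self.calls if self.success_rate(t) < threshold]

stats = ToolStats()
for ok in [True, True, True, True, True]:
    stats.record("read_file", ok)
for ok in [True, True, False, False, True]:
    stats.record("click_element", ok)
```

Wiring this into the agent's tool-call loop gives you the per-tool success rates the checklist calls for, at the cost of a few lines of code.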

The most common mistake is building a complex orchestration layer before the perception and action layers are solid. If the agent cannot reliably read a file or click a button, no amount of orchestration sophistication will help.

Think of it as a pyramid: perception at the base, actions next, memory above that, and orchestration at the top. Each layer must be solid before the next one adds value. Skip a layer and the whole stack is unreliable.

See a full agent tooling stack in action

Fazm is an open-source macOS AI agent with accessibility-based perception, MCP tool integration, memory via CLAUDE.md and skills, and skill-based orchestration. Explore the architecture.

View on GitHub

fazm.ai - Open-source desktop AI agent for macOS