The Octopus Model: Why the Best AI Agents Split Brain from Arms

Matthew Diakonov

An octopus has approximately 500 million neurons. Two-thirds of them are not in its brain - they are distributed through its eight arms. Each arm can taste, touch, and react independently. The brain sets direction. The arms handle local perception and execution.

Researchers studying octopus cognition have confirmed that this is not just an anatomical curiosity. The distributed architecture provides real advantages: the octopus can perform multiple tasks simultaneously, each arm can respond to local stimuli faster than a central brain could process the input, and the system degrades gracefully when part of it is compromised.

This is exactly how the best desktop AI agents work.

The Centralized Trap

Most agent frameworks are centralized by default. The LLM receives a screenshot or HTML dump of the current UI state. It reasons about what is on screen. It outputs precise coordinates or element selectors. It sends those to an execution layer. The execution layer acts. The result comes back to the LLM for assessment. Then the cycle repeats.

Every step of that loop goes through the central brain. The LLM is the bottleneck. Every action requires an inference call to answer "what do I see now, what should I do, and did it work?"

This is expensive in tokens, slow in latency, and fragile in reliability. Pixel coordinates change when windows resize. HTML structure changes when apps update. The LLM has to reinterpret the environment from scratch on every cycle.
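
Written out as a loop, the pattern looks something like this - a minimal Python sketch in which the screenshot capture, the inference call, and the executor are placeholder callables rather than any particular framework's API:

from typing import Callable

# A centralized agent loop: every step routes through the LLM just to
# re-perceive the screen. The callables are placeholders for a vision model
# and a "dumb" executor that reports nothing back.
def run_centralized_agent(
    goal: str,
    capture_screenshot: Callable[[], bytes],   # raw pixels, roughly 800-2,000 tokens each
    llm_decide: Callable[[str, bytes], dict],  # one inference call per step
    execute: Callable[[dict], None],           # acts, but returns no state
    max_steps: int = 20,
) -> None:
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        decision = llm_decide(goal, screenshot)  # "what do I see, what next, did it work?"
        if decision.get("done"):
            return
        execute(decision["action"])
        # No feedback from execute(): confirming the action worked means
        # taking another screenshot and paying for another inference call.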

Arms That See

The insight most agent frameworks miss: the execution layer should handle its own perception.

In macOS accessibility-based agents, every time the MCP server performs an action - a click, a keystroke, a scroll, a window activation - it automatically traverses the accessibility tree and reports the resulting UI state. The arm does not wait for the brain to ask "what do you see now?" It reports back what changed as a side effect of acting.

The LLM receives structured data: element roles, labels, values, positions. Not raw pixels. Not HTML. A clean tree of named elements that it can reason about directly.

// What the arm reports after clicking "New Document"
{
  "action_completed": "click",
  "target": "Button: New Document",
  "result_state": {
    "active_window": "Untitled - TextEdit",
    "focused_element": "Text Area: [empty]",
    "available_actions": ["type", "open_file", "save", "close"]
  }
}

The LLM does not need to run another inference cycle to understand "did clicking New Document work?" The arm tells it. The LLM only needs to reason about the structured result.
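
On the arm side, the act-then-report pattern is a single function: perform the action, read the accessibility tree, return both facts in one payload. A minimal sketch, with perform_click and read_ui_tree standing in for the platform's accessibility layer rather than naming a real API:

import json
from typing import Any, Callable

def click_and_report(
    target: str,
    perform_click: Callable[[str], None],        # performs the click via the OS accessibility API
    read_ui_tree: Callable[[], dict[str, Any]],  # reads the UI tree after the action
) -> str:
    perform_click(target)
    tree = read_ui_tree()  # local perception: no LLM involved
    report = {
        "action_completed": "click",
        "target": target,
        "result_state": {
            "active_window": tree.get("active_window"),
            "focused_element": tree.get("focused_element"),
            "available_actions": tree.get("available_actions", []),
        },
    }
    # The brain receives this structured result instead of a screenshot,
    # so it does not need a second inference pass to confirm the click worked.
    return json.dumps(report)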

The Architecture: Brain and Arms

The brain is an LLM - Claude, GPT-4, or a smaller model. It understands the goal, plans the sequence of actions, interprets complex results, and handles cases the arms cannot resolve locally.

The arms are MCP servers that implement specific execution domains:

  • macOS accessibility server - reads the UI tree, clicks elements, types text, navigates menus
  • filesystem server - reads, writes, and organizes files
  • process server - launches apps, runs scripts, manages windows
  • network server - makes HTTP requests, downloads files

Each server handles its own perception within its domain. The accessibility server knows what is on screen. The filesystem server knows what files exist. The network server knows what responses look like.
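
As a sketch of what one arm could look like, here is a minimal filesystem server built on the MCP Python SDK's FastMCP helper. The tool name, the ~/Documents default, and the ten-result cap are illustrative choices, not Fazm's actual implementation:

from pathlib import Path

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

# One arm: a filesystem MCP server. It handles its own perception (walking
# the directory tree) and returns structured results; no LLM runs inside it.
mcp = FastMCP("filesystem-arm")

@mcp.tool()
def find_recent_files(pattern: str, root: str = "~/Documents") -> list[dict]:
    """Return files whose names contain the pattern, newest first."""
    base = Path(root).expanduser()
    matches = [
        p for p in base.rglob("*")
        if p.is_file() and pattern.lower() in p.name.lower()
    ]
    matches.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [
        {"path": str(p), "extension": p.suffix, "modified_epoch": p.stat().st_mtime}
        for p in matches[:10]
    ]

if __name__ == "__main__":
    mcp.run()  # serve over stdio; the brain connects as an MCP client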

User: "Find the latest quarterly report and open it in Excel"

Brain (LLM):
  1. Ask filesystem arm to find recent files matching "quarterly report"
  2. Ask filesystem arm to confirm it is an xlsx or csv file
  3. Ask process arm to open the file with Excel

Filesystem arm (no LLM):
  - Searches ~/Documents recursively
  - Returns: [{path: "~/Documents/Q4-2025-Report.xlsx", modified: "2025-12-31"}]
  - Confirms: file extension is .xlsx

Process arm (no LLM):
  - Runs: open -a "Microsoft Excel" "~/Documents/Q4-2025-Report.xlsx"
  - Waits for Excel window to appear
  - Reports: "Excel opened, window title: Q4-2025-Report.xlsx"

The brain made two decisions (find the file, open it) and delegated the execution details to arms that handled them without LLM inference. The arms did not just execute - they also reported state that confirmed success.
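
The process arm's side of that exchange fits in a few lines of standard-library Python. A sketch, with the function name and report shape chosen for illustration; in a fuller implementation, confirming the resulting window title would go through the accessibility arm:

import subprocess
from pathlib import Path

def open_in_excel(path: str) -> dict:
    """Launch a file in Excel via macOS `open` and report a structured result."""
    file_path = Path(path).expanduser()
    if not file_path.exists():
        return {"ok": False, "error": f"file not found: {file_path}"}
    result = subprocess.run(
        ["open", "-a", "Microsoft Excel", str(file_path)],
        capture_output=True,
        text=True,
    )
    return {
        "ok": result.returncode == 0,
        "file": str(file_path),
        "stderr": result.stderr.strip(),
    }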

Why Separation Matters for Cost

The numbers are concrete. Processing a screenshot through a vision model to understand the current UI state costs 800-2,000 tokens per image. Processing the structured output of an accessibility tree traversal costs 100-400 tokens for the same information, delivered in a more reliable format.

Over a typical 20-step desktop automation task:

  • Screenshot-based approach: 20 x 1,400 tokens average = 28,000 tokens for perception alone
  • Accessibility tree approach: 20 x 250 tokens average = 5,000 tokens for perception

That is roughly a 6x reduction in token costs for the perception layer alone, plus lower latency, because the arms do not need to wait for a vision inference call before reporting state.

Swapping the Brain

The other benefit of clean separation: you can change the brain without changing the arms.

Use Claude Opus for complex reasoning tasks that require careful multi-step planning. Use Claude Haiku for simple tasks that just need "click this button, type that text." Use a fine-tuned smaller model for specific workflows you have seen hundreds of times.

The arms do not care which LLM is driving them. The MCP protocol abstracts the connection. You optimize the brain choice for each task type without touching the execution infrastructure.
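
In practice that can be as small as a routing table. The model identifiers below are placeholders, not exact API model names:

# Pick a brain per task type; the arms stay untouched either way.
MODEL_BY_TASK = {
    "multi_step_planning": "claude-opus",        # careful, long-horizon reasoning
    "simple_ui_action": "claude-haiku",          # "click this button, type that text"
    "known_workflow": "local-finetuned-model",   # a workflow seen hundreds of times
}

def pick_brain(task_type: str) -> str:
    return MODEL_BY_TASK.get(task_type, "claude-sonnet")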

What the Octopus Gets Right

The octopus figured out distributed cognition over hundreds of millions of years of evolution. The architectural advantages it demonstrates - local perception, parallel execution, graceful degradation, central coordination without a central bottleneck - turn out to apply directly to AI agent design.

Research on octopus-inspired AI has grown substantially, from 760 papers in 2021 to over 1,170 in 2024, as researchers apply octopus architectural principles to sensor design, processor architecture, and distributed agent systems. The common thread: intelligence does not require centralization. In many cases, it works better without it.

For desktop AI agents specifically, the lesson is practical and immediate. Do not route perception through the central model when execution arms can handle their own sensing. Let the arms see, act, and report. Let the brain plan and decide.

Fazm is an open-source macOS AI agent, available on GitHub.
