How AI Agents Work: Architecture, Loops, and Tool Use Explained

Matthew Diakonov · 14 min read


An AI agent is a program that takes a goal in natural language, breaks it into steps, executes those steps using tools, and adjusts its plan when something goes wrong. That one sentence hides a lot of engineering. This post unpacks the actual architecture: the core loop, how tool calls work, what memory and planning look like in practice, and where the whole thing falls apart.

If you already know what AI agents are, this goes one level deeper into the machinery.

The Core Loop: Perceive, Reason, Act

Every AI agent runs some version of the same loop. The agent observes the current state of its environment, sends that observation (plus the goal and history) to a language model, receives a decision about what to do next, executes that action, and repeats.

Perceive (screen, DOM, API) → Reason (LLM decides next step) → Act (tool call or output) → Observe (check result) → loop until goal met or failure

Here is a simplified version in pseudocode:

def run_agent(goal: str, tools: list[Tool]):
    history = [{"role": "user", "content": goal}]
    while True:
        # 1. Perceive: capture the current environment state
        observation = perceive_environment()
        history.append({"role": "system", "content": observation})

        # 2. Reason: one LLM call returns one decision
        response = llm.chat(history, tools=tools)

        # 3. Act: execute the tool call and feed the result back
        if response.type == "tool_call":
            result = execute_tool(response.tool_name, response.args)
            history.append({"role": "tool", "content": result})
        elif response.type == "final_answer":
            return response.content
        else:
            # Plain reasoning text: keep it in history and loop again
            history.append({"role": "assistant", "content": response.content})

The key insight: the LLM is not running the whole time. It gets called once per loop iteration, receives the full context (goal + history + current observation), and returns a single decision. Between LLM calls, the agent is executing tools, waiting for results, and assembling the next context window.

How Tool Calling Works

Tools are what separate an agent from a chatbot. A chatbot can only produce text. An agent can click buttons, read files, query databases, send emails, and control applications.

The mechanism is straightforward. When you define tools for an LLM, you describe each tool's name, parameters, and purpose in a JSON schema. The LLM sees these tool descriptions as part of its context. Instead of producing plain text, it can produce a structured tool call:

{
  "tool": "click_element",
  "arguments": {
    "selector": "#submit-button",
    "wait_after": 2000
  }
}

The agent runtime catches this structured output, executes the tool, captures the result, and feeds it back as the next message in the conversation. The LLM never touches the actual system directly. It only produces instructions, and the runtime executes them.

| Component | Role | Example |
|---|---|---|
| LLM | Decides which tool to call and with what arguments | "Call read_file with path /etc/hosts" |
| Tool registry | Maps tool names to executable functions | {"read_file": read_file_fn, "click": click_fn} |
| Runtime / harness | Executes tools, enforces permissions, returns results | Runs the function, captures stdout, checks allowlist |
| Context assembler | Packs goal + history + observation into the next LLM call | Concatenates messages, trims to fit context window |

This architecture means the LLM itself never has direct system access. Every action goes through the runtime, which can enforce permissions, rate limits, and approval flows. When you see an agent ask "Can I delete this file?", that is the runtime's permission layer, not the LLM being polite.
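A minimal sketch of that runtime layer makes the separation concrete: the LLM's structured output is validated against the registry and checked against a permission allowlist before anything executes. The tool names and allowlist here are illustrative, not any particular framework's API:

```python
# Illustrative runtime dispatch: validate the tool call, enforce a
# permission allowlist, then execute. Tool names are hypothetical.

ALLOWED_TOOLS = {"read_file", "click_element"}  # permission layer

def make_registry():
    # Maps tool names to executable functions (stubs here).
    return {
        "read_file": lambda args: f"contents of {args['path']}",
        "click_element": lambda args: f"clicked {args['selector']}",
    }

def dispatch(tool_call: dict, registry: dict) -> str:
    name = tool_call["tool"]
    if name not in registry:
        return f"Error: unknown tool '{name}'"        # hallucinated tool name
    if name not in ALLOWED_TOOLS:
        return f"Error: tool '{name}' not permitted"  # blocked by permissions
    return registry[name](tool_call["arguments"])
```

Whatever `dispatch` returns, including the error strings, goes back into the conversation as the tool result, so the LLM can see and correct its own mistakes.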

Perception: How Agents See the World

How an agent perceives its environment depends entirely on what kind of agent it is.

Screenshot-based perception

The agent takes a screenshot of the screen, sends the image to a vision-capable LLM, and the model interprets what it sees. This approach works with any application but is slow (each screenshot analysis takes 1 to 3 seconds) and error-prone. The model might misread text, misidentify UI elements, or fail to notice small details.

Accessibility tree perception

On macOS, the accessibility API exposes every UI element as a structured tree: buttons, text fields, labels, menus, each with a role, title, position, and size. An agent reads this tree and gets a precise, machine-readable description of the entire screen in about 50ms. No vision model needed for basic navigation.
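To show what the agent actually receives, here is a toy stand-in for an accessibility tree, flattened into the compact text an LLM can reason over. This is illustrative only; the real macOS API (AXUIElement) has a different shape:

```python
from dataclasses import dataclass, field

# Toy model of an accessibility tree: each element has a role, title, and
# frame (x, y, width, height), flattened into indented text for the LLM.

@dataclass
class UIElement:
    role: str                 # e.g. "button", "textfield"
    title: str
    frame: tuple
    children: list = field(default_factory=list)

def flatten(el: UIElement, depth: int = 0) -> list[str]:
    lines = [f"{'  ' * depth}{el.role} '{el.title}' at {el.frame}"]
    for child in el.children:
        lines.extend(flatten(child, depth + 1))
    return lines

window = UIElement("window", "Login", (0, 0, 400, 300), [
    UIElement("textfield", "Email", (20, 40, 360, 30)),
    UIElement("button", "Submit", (20, 90, 120, 30)),
])
```

The flattened output gives the model exact roles, labels, and coordinates, which is precisely why no vision model is needed to decide where to click.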

Fazm uses this approach, which is why it feels noticeably faster than screenshot-based agents. The accessibility tree gives the agent exact coordinates and element types, so there is no guessing about what a button says or where a text field is.

DOM-based perception

Web agents read the Document Object Model directly. They can see every element, its attributes, its computed styles, and its position. This is the most precise perception method for browser-based tasks, but it only works inside a browser.
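As a sketch of the idea, the snippet below compacts raw HTML into a short list of interactive elements, which is roughly what a web agent hands the LLM. It uses Python's standard-library html.parser on static markup; a real web agent reads the live DOM through the browser:

```python
from html.parser import HTMLParser

# Compress HTML into a list of interactive elements an LLM can reason about.
INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class ElementCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            attr_map = dict(attrs)
            label = attr_map.get("id") or attr_map.get("name")
            self.elements.append(f"<{tag} {label}>" if label else f"<{tag}>")

def compact_dom(html: str) -> list[str]:
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements

page = '<form><input id="email"><button id="submit">Go</button></form>'
```

Here `compact_dom(page)` yields `['<input email>', '<button submit>']`: two lines instead of the full markup, with stable identifiers the agent can target.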

Hybrid approaches

Most production agents combine methods. Fazm uses the accessibility tree as its primary perception layer, falls back to screenshots for apps with poor accessibility support, and reads the DOM directly when controlling a browser. The agent picks the best perception method for each situation.

The Planning Layer

Simple agents react step by step: see the screen, decide the next click, execute, repeat. This works for straightforward tasks but breaks down on anything that requires coordination across multiple steps.

A planning layer sits between the goal and the execution loop. Before the agent starts clicking, it sketches a plan:

Goal: "Book a flight to Tokyo under $800 for next Thursday"

Plan:
1. Open browser
2. Navigate to flight search site
3. Enter departure city, destination (Tokyo), date (next Thursday)
4. Search flights
5. Filter results under $800
6. Select cheapest option
7. Fill in passenger details
8. Complete booking

The plan is not rigid. If step 4 returns no results, the agent re-plans: maybe try a different airport, shift dates by a day, or check a different airline. This ability to re-plan on failure is what makes agents useful for real-world tasks where conditions are unpredictable.
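That re-plan-on-failure behavior can be sketched in a few lines. Both `execute_step` and `replan` are stubs standing in for tool execution and an LLM call; the flight steps are the example from above:

```python
# Sketch of re-planning: execute steps in order, and when one fails, splice
# in an alternative from the planner instead of giving up immediately.

def execute_step(step: str) -> bool:
    # Stub: pretend the exact-date search returns no results.
    return step != "search flights for Thursday"

def replan(failed_step: str) -> str:
    # Stub for an LLM call that proposes an alternative to the failed step.
    return "search flights for Wednesday"

def run_plan(plan: list[str], max_replans: int = 3) -> list[str]:
    completed, queue, replans = [], list(plan), 0
    while queue:
        step = queue.pop(0)
        if execute_step(step):
            completed.append(step)
        elif replans < max_replans:
            replans += 1
            queue.insert(0, replan(step))  # try the alternative next
        else:
            raise RuntimeError(f"could not recover from: {step}")
    return completed

plan = ["open browser", "search flights for Thursday", "book cheapest option"]
```

The `max_replans` cap matters: without it, a planner that keeps proposing broken alternatives loops forever.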

Planning quality varies dramatically across different LLMs. Claude and GPT-4 class models produce reasonable multi-step plans for most tasks. Smaller models tend to skip steps or produce plans that are technically correct but practically infeasible (like "click the button" when the button requires scrolling first).

Memory: Short-term and Long-term

Without memory, every conversation starts from zero. The agent does not know your name, your preferences, or what it did for you yesterday. Memory fixes this, and it comes in two forms.

Short-term memory (context window)

The conversation history itself is the agent's short-term memory. Every observation, tool result, and decision gets appended to the message list and sent to the LLM on the next call. This is effective but limited by the LLM's context window (typically 128K to 200K tokens for current models).

For long tasks, the history grows past the context limit. Agents handle this with summarization (compress older messages into summaries), sliding windows (drop the oldest messages), or selective retrieval (only include messages relevant to the current step).
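A combined version of those strategies might look like this: keep the goal and the newest messages in full, and collapse everything in between into one summary message. Token counting is approximated by word count, and `summarize` is a stub for an LLM call:

```python
# Sketch of trimming history to a token budget: keep the goal plus the most
# recent messages that fit, summarize the rest into one system message.

def count_tokens(msg: dict) -> int:
    return len(msg["content"].split())  # crude stand-in for a real tokenizer

def summarize(msgs: list[dict]) -> str:
    # Stub for an LLM summarization call.
    return f"[summary of {len(msgs)} earlier messages]"

def trim_history(history: list[dict], budget: int) -> list[dict]:
    if sum(count_tokens(m) for m in history) <= budget:
        return history
    goal, rest = history[0], history[1:]
    kept, used = [], count_tokens(goal)
    # Walk backwards, keeping the newest messages that still fit.
    for msg in reversed(rest):
        if used + count_tokens(msg) > budget:
            break
        kept.insert(0, msg)
        used += count_tokens(msg)
    dropped = rest[: len(rest) - len(kept)]
    summary = {"role": "system", "content": summarize(dropped)}
    return [goal, summary] + kept
```

Keeping the goal pinned at position zero is the important design choice: it is the one message the agent must never lose, no matter how long the task runs.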

Long-term memory (persistent storage)

Long-term memory persists between sessions. When you tell an agent "I prefer window seats" or "my assistant's name is Sarah," it writes that fact to a persistent store (usually a file, a database, or a vector store) and retrieves it in future sessions.

| Memory type | Stored where | Survives restart | Typical use |
|---|---|---|---|
| Context window | In-flight LLM messages | No | Current task state, recent tool results |
| Session summary | File or database | Yes | What happened in previous sessions |
| User preferences | Structured file (JSON, YAML) | Yes | Name, email, timezone, preferred tools |
| Semantic memory | Vector database | Yes | Recalled by similarity search when relevant |

The hard part is not storage; it is retrieval. An agent that remembers everything but cannot find the right memory at the right time is just as broken as one with no memory at all. Good memory systems tag memories with types, timestamps, and relevance scores so the retrieval query can filter efficiently.
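As a sketch of that retrieval side, the snippet below filters typed memories and ranks them by a blend of keyword overlap and recency. A production system would use embedding similarity instead of word overlap; the scoring weights here are arbitrary illustrations:

```python
import time

# Sketch of retrieval over typed memories: filter by type, then rank by
# keyword overlap plus a recency bonus.

def score(memory: dict, query: str, now: float) -> float:
    query_words = set(query.lower().split())
    text_words = set(memory["text"].lower().split())
    overlap = len(query_words & text_words)
    age_days = (now - memory["timestamp"]) / 86400
    recency = 1.0 / (1.0 + age_days)   # newer memories score higher
    return overlap + 0.5 * recency

def retrieve(memories, query, mem_type, now=None, top_k=2):
    now = now if now is not None else time.time()
    candidates = [m for m in memories if m["type"] == mem_type]
    return sorted(candidates, key=lambda m: score(m, query, now), reverse=True)[:top_k]

memories = [
    {"type": "preference", "text": "prefers window seats", "timestamp": 0},
    {"type": "preference", "text": "assistant is named Sarah", "timestamp": 0},
    {"type": "session", "text": "booked flight to Tokyo", "timestamp": 0},
]
```

The type filter runs first because it is cheap and precise; the fuzzy scoring only has to rank what survives it.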

How Agents Handle Errors

Real-world tasks fail constantly. A website returns a 500 error. A button does not exist because the page layout changed. A file is locked by another process. An API rate-limits the request.

Competent agents handle errors at three levels:

Retry with backoff. For transient errors (network timeouts, rate limits), the agent waits and retries. Most agents implement exponential backoff: wait 1 second, then 2, then 4, up to a maximum.
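That backoff schedule is simple to implement. In this sketch, `TransientError` is a hypothetical exception class standing in for timeouts and rate limits, and the sleep function is injectable so the behavior can be tested without waiting:

```python
import time

class TransientError(Exception):
    """Network timeout, rate limit, and similar retryable failures."""

# Retry with exponential backoff: wait 1s, then 2s, then 4s, capped at
# max_delay, and re-raise after the final attempt so the caller can escalate.

def retry_with_backoff(action, max_attempts=4, base_delay=1.0, max_delay=8.0,
                       sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries: escalate to the next recovery level
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

Non-transient errors deliberately pass through unretried; retrying a "file not found" four times just wastes time.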

Alternative path. If clicking a button fails, try keyboard navigation. If one API endpoint is down, try a different one. If a file cannot be written to the target directory, check permissions and suggest an alternative location.

Escalate to the user. When the agent genuinely cannot proceed (wrong password, ambiguous instruction, destructive action requiring confirmation), it stops and asks. An agent that silently fails or guesses wrong on a destructive action is worse than no agent at all.

The error handling hierarchy matters. An agent that escalates every minor issue is annoying. An agent that never escalates is dangerous. The best agents calibrate: retry automatically for transient errors, try alternatives for recoverable errors, and escalate for ambiguous or destructive situations.

Multi-Agent Architectures

Complex tasks sometimes benefit from multiple agents working together. Instead of one agent doing everything, you split the work across specialists.

Orchestrator Agent (breaks the goal into subtasks) → Research Agent (web search, data gathering) · Code Agent (write, test, deploy) · Browser Agent (navigate, fill forms) → results flow back to the orchestrator for assembly

The orchestrator pattern is the most common: one "manager" agent receives the goal, breaks it into subtasks, delegates each to a specialist agent, collects results, and assembles the final output. This works well when subtasks are independent (research while coding while browsing).

The tricky part is coordination. If the research agent discovers that the user's requirements changed, how does it communicate that to the code agent mid-task? Most frameworks handle this with shared state files or message queues, but it adds complexity.
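A minimal version of the shared-state approach looks like this. The specialist agents are stubbed as plain functions, and the shared `state` dict stands in for the state file or message queue a real framework would use:

```python
# Sketch of the orchestrator pattern with shared state: the manager walks a
# list of (agent, task) pairs, and agents pass discoveries to each other
# through a shared dict. Agent behaviors are stubs.

def research_agent(task, state):
    state["budget"] = 800  # discovered constraint, now visible to others
    return f"researched: {task}"

def browser_agent(task, state):
    limit = state.get("budget", "unknown")
    return f"booked under {limit}: {task}"

SPECIALISTS = {"research": research_agent, "browser": browser_agent}

def orchestrate(subtasks, state=None):
    state = state if state is not None else {}
    results = []
    for agent_name, task in subtasks:
        results.append(SPECIALISTS[agent_name](task, state))
    return results

plan = [("research", "flight prices to Tokyo"), ("browser", "book flight")]
```

This sequential version sidesteps the hard concurrency problems; running specialists in parallel is exactly where the coordination complexity mentioned above comes from.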

Common Pitfalls

  • Context window overflow. Long tasks generate so much history that the agent loses track of its original goal. Symptoms: the agent starts repeating actions, forgets what it already tried, or produces contradictory outputs. Fix: implement aggressive summarization and only keep the most recent tool results in full detail.

  • Tool call hallucination. The LLM invents tool names or parameters that do not exist. This happens more with smaller models and when the tool descriptions are vague. Fix: validate every tool call against the registry before executing, and return a clear error when the tool does not exist.

  • Infinite retry loops. The agent encounters an error, retries the same action, gets the same error, and loops forever. Fix: track retry counts per action and escalate after 3 attempts. Also check whether the environment changed between retries.

  • Perception mismatch. The agent thinks a button is at coordinates (200, 300) but it is actually at (200, 450) because a notification shifted the layout. Fix: re-perceive after every action, never cache UI state across steps.

  • Over-planning. The agent spends 30 seconds generating a detailed 20-step plan for a task that only needs 3 steps. Fix: plan incrementally. Start with a high-level sketch and elaborate only the next 2 to 3 steps.

A Minimal Agent in 40 Lines

Here is a working agent loop using the Anthropic SDK. It can read and write files:

import anthropic

client = anthropic.Anthropic()
tools = [
    {
        "name": "read_file",
        "description": "Read contents of a file",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]
        }
    },
    {
        "name": "write_file",
        "description": "Write contents to a file",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    }
]

messages = [{"role": "user", "content": "Read config.json and add a 'debug': true field"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason == "end_turn":
        # No more tool calls: print the model's final text and stop
        print(next(b.text for b in response.content if b.type == "text"))
        break

    # Collect all tool results from this turn into a single user message;
    # the API expects every tool_result for a turn in one message.
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            if block.name == "read_file":
                with open(block.input["path"]) as f:
                    result = f.read()
            elif block.name == "write_file":
                with open(block.input["path"], "w") as f:
                    f.write(block.input["content"])
                result = "Written successfully"
            tool_results.append(
                {"type": "tool_result", "tool_use_id": block.id, "content": result}
            )
    messages.append({"role": "user", "content": tool_results})

This is the complete pattern. Every production agent (including desktop agents like Fazm, code agents like Claude Code, and browser agents) is a more sophisticated version of this same loop with more tools, better error handling, and memory layers on top.

What Determines Agent Quality

Not all agents are equal, even when they use the same underlying LLM. The quality differences come from engineering decisions outside the model:

| Factor | Impact | Example |
|---|---|---|
| Perception speed | Directly affects task completion time | Accessibility API (50ms) vs screenshot analysis (2s) |
| Tool design | Poorly designed tools cause LLM confusion | click(x, y) vs click_element(name="Submit") |
| Context management | Determines how long tasks can run | Naive truncation vs intelligent summarization |
| Error recovery | Determines success rate on real tasks | Single retry vs multi-strategy fallback |
| Permission model | Determines user trust and safety | Auto-approve reads, confirm writes, block deletes |
| Memory architecture | Determines improvement over time | No memory vs typed persistent memory with retrieval |

The LLM provides the reasoning capability. Everything else determines whether that reasoning translates into reliable task completion.

Wrapping Up

AI agents work by running a loop: perceive the environment, send context to an LLM, execute the returned tool call, observe the result, and repeat. The engineering that makes agents useful lives in the layers around that loop: fast perception through accessibility APIs, well-designed tools, persistent memory, intelligent error recovery, and permission systems that keep users in control. If you want to try this in practice, start with the minimal example above or grab a production agent and watch it work.

Fazm is an open source macOS AI agent that uses accessibility APIs for fast, reliable desktop automation. Open source on GitHub.
