Error Propagation in Multi-Agent AI Systems

Matthew Diakonov

When one agent in a multi-agent system makes a bad decision, every downstream agent inherits that error and builds on top of it. This is the single biggest reliability problem in agent networks, and most teams discover it only after they have already committed to a multi-agent architecture.

Why Multi-Agent Errors Are Different

A single agent making a mistake is straightforward to debug. You check its input, its reasoning, and its output. The error is self-contained.

In a multi-agent system, errors compound. Agent B trusts Agent A's output. Agent C trusts Agent B's output, which is already built on Agent A's error. By the time you notice something is wrong, the root cause is buried three layers deep and wrapped in confident, well-formatted prose that makes it look correct.

This is not a theoretical concern. In our own testing with Fazm running multi-step desktop automation workflows, we watched a single misidentified UI element cascade through four downstream actions before the workflow visibly failed. The root cause took 20 minutes to trace because each agent's log entry looked individually reasonable.

Diagram: Agent A misreads a UI element → Agent B trusts the bad data → Agent C compounds the error → the output is visibly broken. The error amplifies at each handoff, while each agent's log looks individually reasonable.

The Three Failure Modes

We have seen three distinct patterns of error propagation in production agent systems. Each one is harder to detect than the last.

Silent Data Corruption

An agent reformats data slightly wrong. Maybe it truncates a field, changes a date format, or drops a decimal point. Downstream agents do not flag the change because the data still parses correctly. The error only surfaces when a human reviews the final output and notices a wrong number.

This is the most common failure mode. It is also the most insidious because automated checks often pass. The data is structurally valid; it is just semantically wrong.
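
As a minimal illustration (the values are hypothetical), both payloads below pass a structural check, yet the second has silently changed the date format and dropped a decimal point:

import json

# Hypothetical payloads: both are valid JSON with the expected fields,
# but the second has a reformatted date and a dropped decimal point.
original = '{"invoice_date": "2024-03-07", "amount": 1049.50}'
corrupted = '{"invoice_date": "07/03/2024", "amount": 104950}'

for payload in (original, corrupted):
    record = json.loads(payload)                       # structural check passes
    assert isinstance(record["amount"], (int, float))  # type check passes
    # Nothing here flags that the values no longer mean the same thing.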

Confidence Amplification

Each agent in the chain adds certainty language to its output. Agent A says "this might be the login button." Agent B reports "the login button was identified." Agent C states "after confirming the login button." A guess became a stated fact through three handoffs.

This pattern is especially dangerous when agents use chain-of-thought reasoning, because each step's reasoning builds on the previous agent's confident framing rather than the original uncertain observation.
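
One crude detection heuristic, sketched below on hardcoded strings, is to compare hedging language in the first agent's output with the last agent's output; if the qualifiers disappeared, something in the chain upgraded a guess to a fact.

import re

# Words that signal uncertainty. The list is illustrative, not exhaustive.
HEDGES = re.compile(r"\b(might|maybe|possibly|appears?|likely|unclear)\b", re.IGNORECASE)

def hedge_count(text: str) -> int:
    return len(HEDGES.findall(text))

first = "This might be the login button."
last = "After confirming the login button, the form was submitted."

if hedge_count(first) > 0 and hedge_count(last) == 0:
    print("Uncertainty was dropped somewhere between the first and last agent")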

Error Laundering

The original mistake becomes untraceable because each agent rephrased and restructured the output. By the time you are looking at Agent D's output, there is no direct link back to Agent A's original misidentification. The error has been laundered through three layers of summarization.

Failure Mode | How It Starts | How It Spreads | How You Detect It
Silent data corruption | Agent changes a value during reformatting | Downstream agents process wrong values as valid | Manual review of final output against source
Confidence amplification | Agent hedges with "might" or "possibly" | Each handoff removes qualifiers | Compare first agent's uncertainty with last agent's certainty
Error laundering | Agent makes a wrong inference | Each agent rephrases, losing traceability | Full input/output trace at every boundary

Why Agents Do Not Catch Each Other's Mistakes

The intuition is that more agents means more verification. In practice, agents are bad at questioning each other's work. They tend to treat upstream output as ground truth, especially when it is well-formatted and plausible-looking.

This is the same problem as AI-generated code that "looks right." The output is polished enough that review is perfunctory. When Agent B receives a neatly structured JSON response from Agent A, it does not run adversarial checks. It processes the data and moves on.

There is a deeper problem: agents do not have the context to know what "wrong" looks like. Agent B knows what Agent A told it, but it does not know what Agent A actually observed. Without access to the original ground truth (the raw screenshot, the actual API response, the real file contents), Agent B has no basis for skepticism.

Warning

Adding a "reviewer agent" that checks other agents' work rarely helps. The reviewer suffers from the same problem: it sees the polished output, not the raw input. Unless the reviewer independently re-derives the answer from source data, it is just another node in the error propagation chain.

Patterns That Actually Contain Errors

After building and debugging multi-agent workflows in Fazm, we have landed on a few patterns that meaningfully reduce error propagation.

1. Validate at Boundaries, Not Endpoints

Every time output crosses from one agent to another, validate it against the original source of truth. Not the intermediate agent's summary. The actual source.

For desktop automation, this means re-reading the screen state at each handoff rather than trusting the previous agent's description of what is on screen. For code generation pipelines, this means running the code after each agent's contribution rather than waiting until the end.

# Bad: trust the chain
result_a = agent_a.run(task)
result_b = agent_b.run(result_a)  # trusts A's output blindly
result_c = agent_c.run(result_b)  # trusts B's output blindly

# Better: validate at each boundary
result_a = agent_a.run(task)
validated_a = validate_against_source(result_a, original_input)
result_b = agent_b.run(validated_a)
validated_b = validate_against_source(result_b, original_input)
result_c = agent_c.run(validated_b)
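
What validate_against_source does depends on the domain. For the desktop-automation case it might mean re-reading the screen and confirming the claimed element is actually there; the sketch below assumes hypothetical take_screenshot and find_element helpers rather than any real API.

def validate_against_source(agent_output, original_input):
    # Re-check the claim against the live screen, not the previous agent's summary.
    # take_screenshot() and find_element() are hypothetical helpers; substitute
    # whatever your automation stack provides.
    screenshot = take_screenshot()
    element = find_element(screenshot, agent_output["label"])
    if element is None:
        raise RuntimeError(f"Claimed element '{agent_output['label']}' not found on screen")
    return agent_output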

2. Keep Chains Short

The fewer handoffs, the less opportunity for error propagation. Two agents with clear responsibilities beat five agents in a pipeline. Every additional agent in the chain multiplies the probability of undetected error.

If you have a five-agent pipeline and each agent has a 90% accuracy rate on passing data correctly, the chain's end-to-end accuracy is 0.9^5 = 59%. With two agents at the same per-agent accuracy, you get 81%. The math is unforgiving.

Chain Length | 95% Per-Agent Accuracy | 90% Per-Agent Accuracy | 85% Per-Agent Accuracy
2 agents | 90.2% | 81.0% | 72.2%
3 agents | 85.7% | 72.9% | 61.4%
5 agents | 77.4% | 59.0% | 44.4%
8 agents | 66.3% | 43.0% | 27.2%
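
The compounding is just per-agent accuracy raised to the number of handoffs, so the table above is easy to sanity-check:

# End-to-end accuracy of a chain is per-agent accuracy raised to the chain length.
for agents in (2, 3, 5, 8):
    row = ", ".join(f"{p:.0%}: {p ** agents:.1%}" for p in (0.95, 0.90, 0.85))
    print(f"{agents} agents -> {row}")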

3. Pass Raw Context Forward

Instead of having each agent summarize and pass only its conclusions, attach the original source data to every message in the chain. This lets downstream agents cross-check against the source rather than relying on summaries.

The tradeoff is token cost. Passing raw screenshots, full file contents, or complete API responses through the chain uses more context window. But compared to the cost of debugging a cascading error, it is cheap.
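
One lightweight way to do this (a sketch; the envelope fields are illustrative) is to pass an explicit message object that carries the original source data alongside each agent's conclusion:

from dataclasses import dataclass
from typing import Any

@dataclass
class AgentMessage:
    conclusion: Any    # what this agent decided or produced
    source_data: Any   # the untouched screenshot, file contents, or API response
    produced_by: str   # which agent generated the conclusion

def handoff(upstream: AgentMessage, new_conclusion: Any, agent_name: str) -> AgentMessage:
    # The conclusion is replaced at every step, but the raw source rides along
    # unchanged so downstream agents can cross-check against it.
    return AgentMessage(new_conclusion, upstream.source_data, agent_name)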

4. Log Inputs and Outputs at Every Boundary

When errors do propagate, you need the full trace to find the root cause. This means logging the exact input and exact output at every agent boundary, not just the final result.

import hashlib
import json
from datetime import datetime

def stable_hash(data):
    # Built-in hash() is salted per process, so its values cannot be compared
    # across runs; a content digest keeps traces diffable later.
    return hashlib.sha256(json.dumps(data, sort_keys=True, default=str).encode()).hexdigest()

def agent_boundary_log(agent_name, input_data, output_data):
    entry = {
        "timestamp": datetime.now().isoformat(),
        "agent": agent_name,
        "input_hash": stable_hash(input_data),
        "output_hash": stable_hash(output_data),
        "input_preview": str(input_data)[:500],
        "output_preview": str(output_data)[:500]
    }
    # Write to structured log for later diffing
    with open("/tmp/agent_trace.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

With this trace, you can diff adjacent entries to find exactly where the data diverged from the expected path.
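
A minimal pass over that trace file, matching the log format above, can flag the first boundary where one agent's output no longer matches the next agent's input:

import json

def find_divergence(trace_path="/tmp/agent_trace.jsonl"):
    with open(trace_path) as f:
        entries = [json.loads(line) for line in f]
    for prev, curr in zip(entries, entries[1:]):
        # If agent N's logged output hash differs from agent N+1's logged input
        # hash, the data changed at that handoff.
        if prev["output_hash"] != curr["input_hash"]:
            print(f"Data changed between {prev['agent']} and {curr['agent']}")
            print(f"  {prev['agent']} output: {prev['output_preview']}")
            print(f"  {curr['agent']} input:  {curr['input_preview']}")
            return
    print("No divergence found between logged boundaries")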

5. Use Typed Contracts Between Agents

Define explicit schemas for what each agent accepts and produces. When Agent A's output does not match Agent B's expected input schema, fail loudly rather than letting Agent B attempt to parse malformed data.

from pydantic import BaseModel, ValidationError

class UIElementReport(BaseModel):
    element_type: str
    label: str
    coordinates: tuple[int, int]
    confidence: float  # 0.0 to 1.0

def agent_b_receive(raw_output: dict):
    try:
        report = UIElementReport(**raw_output)
        if report.confidence < 0.8:
            raise ValueError(
                f"Low confidence ({report.confidence}) on element '{report.label}'"
            )
        return report
    except ValidationError as e:
        # Fail loudly, do not attempt to proceed
        raise RuntimeError(f"Agent A output failed schema validation: {e}")

Common Pitfalls

  • Adding more agents to fix reliability problems. If your two-agent system has error propagation issues, a three-agent system will be worse. Fix the boundary validation first.
  • Trusting "consensus" from multiple agents. Three agents that all read the same wrong input will all agree on the wrong answer. Consensus without independent verification is just error amplification with extra steps.
  • Skipping traces in production. Boundary logging feels like overhead until you spend four hours debugging a cascading error that a trace would have localized in seconds.
  • Over-relying on retry logic. Retrying a failed agent does not help if the agent is consistently wrong about a specific type of input. You need to fix the agent's handling of that input class, not retry the same broken logic.

Checklist for Multi-Agent Error Containment

  • Validate agent outputs against original source data, not intermediate summaries
  • Keep agent chains to 2-3 agents maximum for any single workflow
  • Log inputs and outputs at every agent boundary with structured traces
  • Use typed schemas (Pydantic, Zod, etc.) for inter-agent data contracts
  • Pass raw context (screenshots, file contents) alongside summaries
  • Fail loudly on schema mismatches; never silently adapt to bad data

Wrapping Up

Error propagation is the tax you pay for every agent you add to a system. The math compounds against you: each handoff multiplies error probability. The most reliable multi-agent systems we have built keep chains short, validate at every boundary against the original source, and log everything. Start with a single agent and only add more when you have proven the single-agent version works reliably.

Fazm is an open source macOS AI agent, available on GitHub.