Forgiveness in Error Handling - Why Agent Recovery Matters More Than Prevention

Matthew Diakonov

The instinct in agent development is to prevent every possible failure. Guard clauses everywhere, validation on every input, pre-checks before every action. But the reality is that desktop environments are messy. Windows move, elements change labels, apps update their layouts overnight. System notifications pop up at exactly the wrong moment.

Prevention has limits. Recovery is the real engineering challenge.

Why Prevention Alone Fails

You cannot predict every state a macOS desktop will be in when your agent runs. A notification appears and covers the button you need to click. An app crashes and restarts with a different window position. The user has a different number of monitors today than yesterday. A system update changed the accessibility label on a button you were targeting by name.

Trying to guard against all of these leads to brittle, over-engineered code that still fails when something truly unexpected happens - and it always does. The guard clauses themselves become a maintenance burden.

More importantly, over-prevention creates a different failure mode: the agent that refuses to proceed because a precondition is not exactly met, even when a reasonable recovery would work fine. Paranoid agents are not safe agents. They are agents that block on trivial issues and require human intervention for problems they could have solved themselves.

Error Classification First

Not all errors are equal. The first step toward good recovery is classifying what actually went wrong:

Transient errors - Temporary states that resolve on their own after a brief wait: network timeouts, rate limits, apps that are mid-launch. The right response is to retry with backoff.

State errors - The system is in a different state than expected, but a known, recoverable one: an expected element is missing because a modal appeared. The right response is to detect the actual state and navigate out of it.

Environmental errors - Something changed in the environment outside the agent's control: an app updated and changed its UI. The right response is to log with full context and escalate for human review.

Fatal errors - Errors where proceeding would cause damage (writing corrupted data, sending duplicate messages). The right response is to stop and report.

The classification determines the recovery strategy. Applying the wrong strategy wastes tokens and time - retrying a fatal error, or escalating a transient one.
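
One way to make that mapping explicit is a small lookup from exception type to category. This is an illustrative sketch - the category names and the exception types registered here are stand-ins for whatever your stack actually raises:

from enum import Enum, auto

class ErrorCategory(Enum):
    TRANSIENT = auto()      # retry with backoff
    STATE = auto()          # diagnose actual state, recover, retry
    ENVIRONMENTAL = auto()  # log with full context, escalate
    FATAL = auto()          # stop and report

# Illustrative mapping only - a real agent would register its
# framework's exception types here
CLASSIFICATION = {
    TimeoutError: ErrorCategory.TRANSIENT,
    ConnectionError: ErrorCategory.TRANSIENT,
    PermissionError: ErrorCategory.FATAL,
}

def classify(error: Exception) -> ErrorCategory:
    """Map an exception to a recovery category, escalating unknowns."""
    for error_type, category in CLASSIFICATION.items():
        if isinstance(error, error_type):
            return category
    return ErrorCategory.ENVIRONMENTAL  # unknown: log and escalate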

Exponential Backoff with Jitter

For transient errors, exponential backoff is the standard pattern. AWS research on distributed systems found that exponential backoff with jitter reduces retry storms by 60 to 80% compared to fixed-interval retries.

import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar('T')

async def retry_with_backoff(
    action: Callable[[], Awaitable[T]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_errors: tuple[type[Exception], ...] = (TimeoutError, ConnectionError)
) -> T:
    """Retry with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await action()
        except retryable_errors as e:
            if attempt == max_retries:
                raise

            # Exponential backoff: 1s, 2s, 4s...
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Add jitter to prevent retry storms
            jitter = random.uniform(0, delay * 0.3)
            wait_time = delay + jitter

            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time:.1f}s")
            await asyncio.sleep(wait_time)

For desktop automation specifically, common retryable errors include element-not-found (the app is still loading), click-failed (a notification momentarily covered the target), and screenshot-empty (a brief display glitch). Non-retryable errors include permission denied, invalid credentials, and file system full.
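
As a usage sketch, wrapping a click in the helper above looks like this; `agent` and `ElementNotFoundError` are placeholders for whatever your automation framework actually provides:

async def click_save(agent):
    # ElementNotFoundError stands in for your framework's
    # element-not-found exception
    await retry_with_backoff(
        lambda: agent.click("Save button"),
        max_retries=3,
        retryable_errors=(TimeoutError, ElementNotFoundError),
    )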

State-Aware Recovery

For state errors in desktop automation, the agent needs to understand where it actually is, not just where it expected to be.

A practical pattern is the "state detective" - before retrying any action, re-scan the environment and identify the current state explicitly:

async def navigate_to_settings(agent):
    """Navigate to Settings, handling unexpected UI states."""
    try:
        await agent.click("Settings button")
    except ElementNotFoundError:
        # Diagnose actual state before retrying
        current_state = await agent.identify_current_state()

        if current_state == "modal_dialog_open":
            await agent.dismiss_modal()
            await agent.click("Settings button")  # Retry after recovery

        elif current_state == "wrong_window_focused":
            await agent.focus_main_window()
            await agent.click("Settings button")

        elif current_state == "app_loading":
            await agent.wait_for_element("main_content", timeout=5.0)
            await agent.click("Settings button")

        else:
            # Unknown state - log with full context and escalate
            await agent.capture_diagnostic_screenshot()
            raise UnrecoverableStateError(
                f"Unknown state: {current_state}. Manual intervention required."
            )

The key is that the agent does not blindly retry. It identifies the problem first, applies the specific recovery for that problem, then retries. This makes recovery deterministic and debuggable.

Checkpoint-Based Recovery for Long Tasks

For workflows that take minutes or hours, checkpoints prevent having to restart from scratch when something fails midway:

import json
import os
import time

class CheckpointedWorkflow:
    def __init__(self, workflow_id: str):
        self.workflow_id = workflow_id
        self.checkpoint_file = f"/tmp/workflow_{workflow_id}_state.json"

    def save_checkpoint(self, step: str, state: dict):
        data = {
            "completed_step": step,
            "state": state,
            "timestamp": time.time()
        }
        with open(self.checkpoint_file, 'w') as f:
            json.dump(data, f)

    def load_checkpoint(self) -> dict | None:
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file) as f:
                return json.load(f)
        return None

    async def run(self, steps: list):
        checkpoint = self.load_checkpoint()
        start_from = 0

        if checkpoint:
            completed = checkpoint["completed_step"]
            start_from = next(
                (i + 1 for i, s in enumerate(steps) if s.name == completed),
                0
            )
            if start_from >= len(steps):
                print("Workflow already complete, nothing to resume")
                return
            print(f"Resuming from step {start_from}: {steps[start_from].name}")

        for i, step in enumerate(steps[start_from:], start=start_from):
            result = await step.execute()
            self.save_checkpoint(step.name, result)

A 200-step screenshot automation that fails at step 147 resumes from step 147, not step 1.
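
Driving the class needs only a step type with a name and an awaitable execute(). The `Step` shape below is an assumption that matches how `run()` uses its arguments; `agent` and its methods are placeholders for a real automation API:

from dataclasses import dataclass
from typing import Awaitable, Callable

@dataclass
class Step:
    name: str
    # Must return the state dict that save_checkpoint() persists
    execute: Callable[[], Awaitable[dict]]

async def run_export(agent):
    workflow = CheckpointedWorkflow("screenshot_export")
    await workflow.run([
        Step("open_app", lambda: agent.open_app("Preview")),
        Step("export_all", lambda: agent.export_screenshots()),
    ])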

What Good Logging Actually Looks Like

When a failure does escalate to human review, the logs need to tell the full story. Not just "element not found" but: what was the agent trying to do, what did the accessibility tree look like at that moment, what had just succeeded before this failure, and what recovery strategies were already attempted.

import logging
import time

logger = logging.getLogger("agent")

async def log_failure(error, context):
    diagnostic = {
        "error": str(error),
        "error_type": type(error).__name__,
        "timestamp": time.time(),
        "task": context.current_task,
        "last_successful_step": context.last_success,
        "recovery_attempts": context.recovery_attempts,
        "accessibility_snapshot": await capture_ax_tree(),
        "screenshot_path": await capture_screenshot()
    }
    logger.error("Agent failure", extra={"diagnostic": diagnostic})

Structured logs in JSON format that include an accessibility tree snapshot and screenshot path mean you can reproduce and understand any failure without having to sit at the machine when it happens.
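
The stdlib gets you there without extra dependencies. A minimal sketch of a formatter that emits each record, including the diagnostic dict passed via `extra`, as one JSON line:

import json
import logging

class JSONFormatter(logging.Formatter):
    """Emit each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "time": record.created,
        }
        # extra={"diagnostic": ...} attaches the dict as a record attribute
        if hasattr(record, "diagnostic"):
            entry["diagnostic"] = record.diagnostic
        return json.dumps(entry, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger("agent").addHandler(handler)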

The Practical Balance

Good error handling is a spectrum. At one end: every error is fatal, the agent stops and demands human intervention. At the other end: the agent retries everything indefinitely and hides all failures.

Neither extreme is useful. The right calibration, sketched in code after this list, is:

  • Retry transient errors automatically, with backoff and a limit
  • Recover from known state errors with specific recovery logic
  • Checkpoint long workflows so partial progress is preserved
  • Log enough context that human review of escalated failures is fast
  • Never silently discard errors - if you are not retrying, you are reporting
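
Composed into a top-level step executor, that calibration looks roughly like this; it reuses the illustrative classify() from earlier, and recover_known_state is a hypothetical hook standing in for the state-aware logic shown above:

async def execute_with_recovery(step, context):
    try:
        return await step.execute()
    except Exception as e:
        category = classify(e)
        if category is ErrorCategory.TRANSIENT:
            return await retry_with_backoff(step.execute)
        if category is ErrorCategory.STATE:
            await recover_known_state(context)  # hypothetical recovery hook
            return await step.execute()
        # Environmental and fatal errors are never swallowed:
        # log with full context, then surface to the caller
        await log_failure(e, context)
        raise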

The agents that survive real-world use are not the ones that never fail. They are the ones that fail gracefully, understand what happened, and get back on track without human intervention wherever possible.

More on This Topic

Fazm is an open source macOS AI agent; the source is available on GitHub.
