How to Build Resilient AI Agent Pipelines That Survive API Outages
Your AI agent pipeline works perfectly until an API goes down. Then everything stops. The LLM provider has an outage, the embedding service is unreachable, or a third-party integration returns 503s. If your pipeline has no resilience built in, one failure takes down the entire workflow.
Circuit Breakers for Agent Pipelines
The circuit breaker pattern prevents your agent from hammering a failing service. After a threshold of failures (say, five consecutive errors), the circuit opens and the agent stops calling that service for a cooldown period. This keeps your agent from wasting time and spares the failing service additional load.
Implement three states: closed (normal operation), open (service assumed down, calls skipped), and half-open (after the cooldown, let a test call through to check whether the service has recovered). For LLM calls, this means the agent detects when the model provider is down and switches behavior instead of retrying endlessly.
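The three states can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not a production implementation: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the default threshold and cooldown values are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    # Hypothetical defaults: five consecutive failures, 30 second cooldown.
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the breaker can be tested without sleeping
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit has not been opened

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            return "half-open"      # cooldown elapsed: allow a probe call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("service assumed down, skipping call")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed half-open probe, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(send_prompt, prompt)` where `send_prompt` is whatever function talks to the provider, and treat `CircuitOpenError` as the signal to switch to a fallback.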
Fallback Chains
Every critical capability in your pipeline should have at least one fallback. If your primary LLM is unavailable, fall back to a secondary provider. If cloud embeddings are down, use local embeddings through Ollama. If your vector database is unreachable, fall back to keyword search.
The fallback does not need to match the primary's quality. A local model producing decent results is far more useful than a cloud model producing nothing. Design your fallbacks for availability, not parity.
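A fallback chain can be as simple as an ordered list of providers tried in turn. The sketch below assumes each provider is a plain Python callable; the function name `call_with_fallbacks` and the provider names in the usage note are hypothetical and stand in for whatever clients your pipeline actually uses.

```python
def call_with_fallbacks(providers, *args, **kwargs):
    """Try each (name, callable) pair in order.

    Returns (name, result) from the first provider that succeeds, so the
    caller knows whether it is running on the primary or a fallback.
    """
    errors = []
    for name, func in providers:
        try:
            return name, func(*args, **kwargs)
        except Exception as exc:
            # Record the failure and move on to the next provider.
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```

For example, `call_with_fallbacks([("cloud-llm", ask_cloud), ("local-llm", ask_local)], prompt)` would only reach the local model if the cloud call raised. Returning the provider name alongside the result makes it easy to log when the pipeline is running in degraded mode.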
For desktop agents on macOS, the most important fallback is the ability to queue actions for later execution. If the agent cannot complete a step because a service is down, save the pending action with full context and retry when the service recovers.
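One way to queue actions is an append-only file of JSON records that is replayed once the service is back. This is a sketch under stated assumptions: the class `PendingActionQueue`, the JSON Lines file layout, and the `handler(action, context)` signature are all invented for illustration, and the context must be JSON-serializable.

```python
import json
from pathlib import Path


class PendingActionQueue:
    """Persist actions that could not run, then replay them later."""

    def __init__(self, path):
        self.path = Path(path)

    def enqueue(self, action, context):
        # One JSON object per line, so a crash mid-write loses at most one record.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"action": action, "context": context}) + "\n")

    def drain(self, handler):
        """Run handler(action, context) for each queued item.

        Items whose handler raises stay in the queue for the next drain.
        Returns the number of items completed.
        """
        if not self.path.exists():
            return 0
        remaining, done = [], 0
        for line in self.path.read_text(encoding="utf-8").splitlines():
            item = json.loads(line)
            try:
                handler(item["action"], item["context"])
                done += 1
            except Exception:
                remaining.append(line)
        self.path.write_text("".join(l + "\n" for l in remaining), encoding="utf-8")
        return done
```

Calling `drain` when a circuit breaker closes again is one natural trigger. Because replayed actions may already have partly run, they should also be safe to repeat.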
Retry Logic That Does Not Make Things Worse
Naive retry logic (try again immediately, forever) makes outages worse, because every retry adds load to a service that is already struggling. Use exponential backoff with jitter. Start with a short delay, double it on each retry, and add randomness so multiple agents are not retrying in sync.
Set a maximum retry count. After three to five retries with backoff, stop and either fall back or queue the action. Never let an agent sit in an infinite retry loop: it wastes tokens, fills logs with noise, and blocks other work.
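The backoff rules above fit in a short helper. The sketch below uses the "full jitter" variant, where each delay is a random fraction of the exponential cap; the function name `retry_with_backoff` and the default delays are assumptions chosen for the example, not recommended values.

```python
import random
import time


def retry_with_backoff(func, max_retries=4, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(), retrying on any exception with exponential backoff and jitter.

    Makes at most max_retries + 1 attempts, then re-raises the last error
    so the caller can fall back or queue the action.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount between 0 and the cap.
            sleep(cap * rng())
```

The `sleep` and `rng` parameters exist only so the helper can be tested without waiting. In real use you would likely retry only on errors that are worth retrying, such as timeouts and 503s, rather than on every exception.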
Idempotent Actions
Design every agent action to be safely retryable. If the agent sent an email but is not sure it went through, it should check before sending again rather than risk duplicating the message. Idempotency keys, deduplication checks, and pre-action state verification prevent the worst failure mode: an agent whose retry succeeds but creates duplicates.
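A deduplication check keyed on the action's content might look like the following. It is a minimal sketch: `idempotency_key` and `IdempotentExecutor` are hypothetical names, the completed-action store is an in-memory dict that a real agent would persist, and it assumes the payload is JSON-serializable.

```python
import hashlib
import json


def idempotency_key(action, payload):
    """Derive a stable key from the action name and its payload."""
    body = json.dumps({"action": action, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class IdempotentExecutor:
    def __init__(self):
        # Assumption: in-memory store for illustration; persist this in practice.
        self.completed = {}

    def run(self, action, payload, func):
        """Run func(payload) at most once per distinct (action, payload)."""
        key = idempotency_key(action, payload)
        if key in self.completed:
            # Already done: return the recorded result instead of repeating the side effect.
            return self.completed[key]
        result = func(payload)
        self.completed[key] = result
        return result
```

This only covers the case where the first attempt is known to have finished. When the outcome is unknown, for instance after a timeout, the agent still needs to verify state with the remote service, or pass the key along if that service supports idempotency keys.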
Monitoring and Alerting
You cannot fix what you cannot see. Log every circuit breaker state change, every fallback activation, and every retry attempt. Set alerts for when fallbacks are active so you know your pipeline is running in degraded mode before your users notice.
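The three events worth logging can be funneled through one small recorder built on Python's standard `logging` module. The logger name `agent.resilience`, the class `ResilienceMonitor`, and its method names are assumptions for this sketch; the `degraded` flag is what an alerting check would poll.

```python
import logging

logger = logging.getLogger("agent.resilience")  # hypothetical logger name


class ResilienceMonitor:
    """Record resilience events and track whether any fallback is active."""

    def __init__(self):
        self.active_fallbacks = set()

    def breaker_changed(self, service, old_state, new_state):
        logger.warning("circuit breaker for %s: %s -> %s", service, old_state, new_state)

    def fallback_activated(self, capability, provider):
        self.active_fallbacks.add(capability)
        logger.warning("fallback active for %s, now using %s", capability, provider)

    def fallback_cleared(self, capability):
        self.active_fallbacks.discard(capability)
        logger.info("fallback cleared for %s, primary restored", capability)

    def retry_attempted(self, service, attempt, delay_seconds):
        logger.info("retry %d for %s after %.2fs", attempt, service, delay_seconds)

    @property
    def degraded(self):
        # True while at least one capability is running on a fallback.
        return bool(self.active_fallbacks)
```

An alert rule can then be as simple as "notify if `degraded` has been true for more than a few minutes", with the threshold tuned to how noisy your providers are.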
Fazm is an open source macOS AI agent, available on GitHub.