Uptime Lies - Co-Failure Patterns in AI Infrastructure
You have five AI agent services. Each one reports 99.9 percent uptime, and your dashboard looks green. But all five services connect to the same Postgres instance. When that database goes down for 30 minutes, all five services fail simultaneously, and none of them can cover for the others. The system as a whole is far less resilient than five independent 99.9 percent services would suggest.
What Co-Failure Means
Co-failure is when multiple components fail at the same time because they share a dependency. Individual uptime metrics hide this because they measure each service in isolation. The system as a whole is only as reliable as its most fragile shared dependency.
Common shared dependencies in AI agent infrastructure:
- Database - multiple agents reading from and writing to the same instance
- API provider - all agents calling the same LLM API, hitting the same rate limits
- Network - agents running in the same region or on the same network path
- Credentials - a single API key shared by multiple agents; revoke it once and everything breaks
- DNS - a single DNS provider or resolver that all services depend on
Why Individual Uptime Is Misleading
Suppose Service A was down for 10 minutes on Monday, Service B for 10 minutes on Wednesday, and Service C for 10 minutes on Friday. Each reports roughly 99.98 percent monthly uptime, and users only ever lost one service at a time. Now suppose instead that all three went down for the same 10 minutes on Monday because of a shared database. Each service still reports the same 99.98 percent, but users experienced 10 minutes of complete outage, not the graceful degradation the metrics suggest.
Multiplying individual availabilities to estimate how the whole system behaves only works when failures are independent. With shared dependencies, they are not.
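The arithmetic can be made concrete with a small sketch. It assumes a 30-day month and three services that are each down for 10 minutes; the function names are illustrative, not part of any monitoring tool. It compares the total-outage time an independence model would predict with what happens when all three outages share one cause.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month (assumption)


def uptime_percent(downtime_minutes):
    """Monthly uptime for one service, as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_MONTH)


def expected_total_outage_minutes(downtimes):
    """Expected minutes per month with every service down at once,
    IF failures were independent (multiply the unavailabilities)."""
    p_all_down = 1.0
    for downtime in downtimes:
        p_all_down *= downtime / MINUTES_PER_MONTH
    return p_all_down * MINUTES_PER_MONTH


downtimes = [10.0, 10.0, 10.0]  # three services, 10 minutes each

per_service = uptime_percent(10.0)
independent_estimate = expected_total_outage_minutes(downtimes)
# Shared database: the three outages overlap completely.
shared_dependency_actual = min(downtimes)

print(f"per-service uptime:              {per_service:.3f}%")
print(f"total outage if independent:     {independent_estimate:.7f} min")
print(f"total outage with shared cause:  {shared_dependency_actual:.1f} min")
```

The per-service number is identical in both cases. Only the overlap changes, and the overlap is exactly what per-service uptime does not record.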
Finding Co-Failure Risks
Map your dependency graph explicitly:
- List every external service, database, API, and shared resource
- Draw which agents depend on each one
- Identify single points of failure - any resource where failure takes down multiple agents
- Check whether your monitoring would detect a shared dependency failure as one incident or as separate incidents
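The mapping steps above can be sketched in a few lines. The agent and dependency names below are hypothetical placeholders; in practice the map would come from your own configuration or service inventory. The sketch inverts an agent-to-dependency map and reports every resource that more than one agent relies on.

```python
from collections import defaultdict

# Hypothetical inventory: which shared resources each agent relies on.
AGENT_DEPENDENCIES = {
    "research-agent": {"postgres-main", "llm-api", "dns-provider"},
    "email-agent": {"postgres-main", "llm-api", "smtp-relay"},
    "billing-agent": {"postgres-main", "payments-api"},
}


def shared_dependencies(agent_deps, min_agents=2):
    """Invert the agent -> dependencies map and keep only the
    dependencies used by at least `min_agents` agents."""
    dependents = defaultdict(set)
    for agent, deps in agent_deps.items():
        for dep in deps:
            dependents[dep].add(agent)
    return {
        dep: sorted(agents)
        for dep, agents in dependents.items()
        if len(agents) >= min_agents
    }


risks = shared_dependencies(AGENT_DEPENDENCIES)

# Print the widest blast radius first; break ties by name.
for dep, agents in sorted(risks.items(), key=lambda kv: (-len(kv[1]), kv[0])):
    print(f"{dep}: shared by {len(agents)} agents ({', '.join(agents)})")
```

Anything this reports is a candidate single point of failure. Whether it needs isolation or just a fallback plan depends on how critical the affected agents are.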
Reducing Co-Failure
You do not need to eliminate all shared dependencies. You need to know where they are and have plans for when they fail:
- Circuit breakers - when a shared dependency fails, agents degrade gracefully instead of crashing
- Fallback paths - if the primary database is down, can agents use a read replica or cached data?
- Isolation where it matters - critical agents get their own database connections or API keys
- Correlated alerting - if three services alert at the same time, treat it as one incident with a shared root cause
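As one possible illustration of correlated alerting, the sketch below groups alerts that fire within a short window and looks for dependencies common to every affected service. The `Alert` type, the service names, the dependency map, and the 60-second window are all assumptions made for the example, not the behavior of any particular alerting product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since epoch


# Hypothetical map of services to the shared resources they use.
SERVICE_DEPENDENCIES = {
    "agent-a": {"postgres-main", "llm-api"},
    "agent-b": {"postgres-main"},
    "agent-c": {"postgres-main", "llm-api"},
    "agent-d": {"redis-cache"},
}


def correlate(alerts, deps, window_seconds=60.0):
    """Group alerts that fire within `window_seconds` of the first alert
    in a group, then list dependencies shared by every service in it."""
    groups, current = [], []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if current and alert.timestamp - current[0].timestamp > window_seconds:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)

    incidents = []
    for group in groups:
        services = sorted({a.service for a in group})
        common = set.intersection(*(set(deps.get(s, ())) for s in services))
        # A lone alert has no co-failure to explain, so it gets no suspects.
        suspects = sorted(common) if len(services) > 1 else []
        incidents.append({"services": services, "suspects": suspects})
    return incidents


alerts = [
    Alert("agent-a", 1000.0),
    Alert("agent-c", 1005.0),
    Alert("agent-b", 1012.0),
    Alert("agent-d", 5000.0),
]
incidents = correlate(alerts, SERVICE_DEPENDENCIES)
for incident in incidents:
    print(incident)
```

Here the three alerts that arrive within seconds of each other collapse into one incident with the shared database as the only common suspect, while the isolated alert stays separate. A fixed window anchored at the first alert is the simplest choice; a real system might prefer a sliding window or explicit health checks on the suspected dependency.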
Fazm is an open source macOS AI agent, available on GitHub.