API Endpoints That Stay Alive - Health Checks, Heartbeats, and Warm Connections
A door with a pulse is an API endpoint that is alive - not just responding with 200 OK, but genuinely ready to handle real work. The distinction matters enormously for AI agents that depend on external services to function.
API reliability has actually gotten worse recently. The Uptrends State of API Reliability 2025 report found that average API uptime fell from 99.66% to 99.46% between Q1 2024 and Q1 2025, resulting in 60% more downtime year-over-year. As AI agents orchestrate more API calls in series, each small reliability gap compounds.
The Difference Between Alive and Responsive
A health check that returns {"status": "ok"} tells you almost nothing. The endpoint is reachable. The web server is running. But can it actually process a request? Is the database connection pool healthy? Are downstream services available?
For AI agents, this is not an academic concern. An agent that calls an LLM API, gets back a 200 response with an empty completion because the model is overloaded, and then tries to parse that empty response as instructions - that agent is about to do something unpredictable and possibly destructive.
Real health checks include dependency probes:
```python
import json
from datetime import datetime, timezone

import asyncpg
import httpx
from fastapi import FastAPI, Response

app = FastAPI()

# DATABASE_URL and memory_store are assumed to be defined elsewhere in the app


@app.get("/health")
async def health_check():
    checks = {}
    overall_healthy = True

    # Check database connection
    try:
        conn = await asyncpg.connect(DATABASE_URL, timeout=2.0)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = {"status": "ok"}
    except Exception as e:
        checks["database"] = {"status": "error", "detail": str(e)}
        overall_healthy = False

    # Check LLM API availability
    try:
        async with httpx.AsyncClient(timeout=3.0) as client:
            resp = await client.get("https://api.anthropic.com/health")
            checks["llm_api"] = {"status": "ok" if resp.status_code == 200 else "degraded"}
    except Exception as e:
        checks["llm_api"] = {"status": "unreachable", "detail": str(e)}
        overall_healthy = False

    # Check memory store with a simple round trip
    try:
        await memory_store.ping()
        checks["memory_store"] = {"status": "ok"}
    except Exception as e:
        checks["memory_store"] = {"status": "error", "detail": str(e)}
        overall_healthy = False

    status_code = 200 if overall_healthy else 503
    return Response(
        content=json.dumps({
            "status": "healthy" if overall_healthy else "degraded",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }),
        status_code=status_code,
        media_type="application/json"
    )
```
This health check fails loudly when any dependency is unhealthy, returning 503 instead of a misleading 200. An upstream load balancer or monitoring system can act on that 503. An agent checking this endpoint before starting a task knows whether to proceed.
The industry recommendation: run health checks every 30 seconds to 1 minute. More frequent than that and the checks themselves become load. Less frequent and you detect outages too slowly. Health check endpoints should respond in under 100ms - if your health check is slow, it is probing too deeply.
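One way to enforce that latency budget (a sketch; `probe_with_budget` and the timings are illustrative names and numbers, not part of the health check above) is to cap every dependency probe with `asyncio.wait_for`, so a slow dependency degrades the status rather than stalling the endpoint:

```python
import asyncio

PROBE_BUDGET = 0.1  # 100 ms total budget for any single probe


async def probe_with_budget(probe, budget=PROBE_BUDGET):
    """Run a dependency probe, but never let it blow the latency budget."""
    try:
        await asyncio.wait_for(probe(), timeout=budget)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"  # the probe is probing too deeply
    except Exception:
        return "error"


async def main():
    async def fast():   # completes well inside the budget
        await asyncio.sleep(0.01)

    async def slow():   # would take a full second
        await asyncio.sleep(1)

    print(await probe_with_budget(fast))  # ok
    print(await probe_with_budget(slow))  # timeout


asyncio.run(main())
```

A probe that times out is reported as degraded, but the health endpoint itself stays fast, so the load balancer polling it is never blocked.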
Heartbeats for Long-Running Agent Sessions
Desktop agents often maintain long-running connections to multiple services - LLM providers, memory stores, MCP servers, local databases. These connections go stale. TCP keepalives help but are not sufficient for application-level state.
Application-level heartbeats solve this. The pattern:
```python
import asyncio
import logging
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict


@dataclass
class ServiceStatus:
    name: str
    last_ping: datetime | None = None
    consecutive_failures: int = 0
    is_healthy: bool = True


class AgentHeartbeatMonitor:
    def __init__(self, interval_seconds: int = 30):
        self.interval = interval_seconds
        self.services: Dict[str, ServiceStatus] = {}
        self.ping_funcs: Dict[str, Callable] = {}
        self._running = False

    def register(self, name: str, ping_func: Callable):
        """Register a service and its async ping function."""
        self.services[name] = ServiceStatus(name=name)
        self.ping_funcs[name] = ping_func

    async def _ping_service(self, name: str):
        status = self.services[name]
        try:
            await self.ping_funcs[name]()
            status.last_ping = datetime.now()
            status.consecutive_failures = 0
            if not status.is_healthy:
                logging.info(f"Service {name} recovered")
            status.is_healthy = True
        except Exception as e:
            status.consecutive_failures += 1
            if status.consecutive_failures >= 3:
                if status.is_healthy:
                    logging.warning(f"Service {name} marked unhealthy: {e}")
                status.is_healthy = False

    async def run(self):
        self._running = True
        while self._running:
            await asyncio.gather(*[
                self._ping_service(name)
                for name in self.services
            ])
            await asyncio.sleep(self.interval)

    def is_ready(self, service_name: str) -> bool:
        """Check if a service is healthy before using it."""
        status = self.services.get(service_name)
        return status.is_healthy if status else False  # unknown services are not ready

    def stop(self):
        self._running = False


# Usage in an agent
monitor = AgentHeartbeatMonitor(interval_seconds=30)

# Register each dependency (assumes an async Anthropic client,
# so the ping call returns an awaitable)
monitor.register("llm_api", lambda: anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1,
    messages=[{"role": "user", "content": "ping"}]
))
monitor.register("memory_store", lambda: memory_store.ping())
monitor.register("mcp_filesystem", lambda: mcp_client.call_tool("list_directory", {"path": "."}))


# Before any task, check readiness
async def run_task(task: str):
    if not monitor.is_ready("llm_api"):
        raise RuntimeError("LLM API is unhealthy - cannot proceed")
    if not monitor.is_ready("memory_store"):
        logging.warning("Memory store degraded - proceeding with reduced capability")
    # ... proceed with task
```
The key design decision: mark a service unhealthy after 3 consecutive failures, not the first. Single-ping failures are common and transient. Three in a row means something is actually wrong.
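That counter logic can be isolated into a tiny state machine for testing (a sketch; `FailureGate` is a hypothetical name, with the same threshold-of-3 and reset-on-success behavior as the monitor above):

```python
class FailureGate:
    """Mark a dependency unhealthy only after N consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.healthy = True

    def record(self, ok: bool) -> bool:
        """Record one ping result; return current health."""
        if ok:
            self.failures = 0      # any success resets the streak
            self.healthy = True
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.healthy = False
        return self.healthy


gate = FailureGate()
print(gate.record(False))  # True  - one blip is tolerated
print(gate.record(False))  # True  - still tolerated
print(gate.record(False))  # False - three in a row: unhealthy
print(gate.record(True))   # True  - a success restores health
```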
Connection Warmth Matters for Latency
Cold API connections add latency that compounds across multi-step agent workflows. An agent making 15 API calls to complete a task - hitting the accessibility API, querying a knowledge graph, calling an LLM, updating a database - cannot afford connection setup overhead on every call.
The numbers: the first request on a cold connection might take 200ms to establish. Subsequent requests on the same HTTP/2 connection take 20ms. Over a 15-call workflow, that difference is (200 - 20) ms x 15 = 2,700 ms - 2.7 seconds of pure overhead eliminated just by keeping connections warm.
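Spelled out (the 200ms and 20ms figures are the illustrative estimates above, not measurements):

```python
cold_connect_ms = 200   # first request: TCP + TLS + HTTP/2 setup
warm_request_ms = 20    # subsequent request on the same connection
calls_per_task = 15

overhead_saved_ms = (cold_connect_ms - warm_request_ms) * calls_per_task
print(overhead_saved_ms / 1000)  # 2.7 seconds per task
```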
Connection pooling implementation:
```python
import httpx


class WarmConnectionPool:
    def __init__(self, base_url: str, max_connections: int = 10):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            http2=True,  # HTTP/2 multiplexing (requires the httpx[http2] extra)
            limits=httpx.Limits(
                max_connections=max_connections,
                max_keepalive_connections=max_connections,
                keepalive_expiry=30.0  # keep idle connections alive 30 seconds
            ),
            timeout=httpx.Timeout(
                connect=5.0,
                read=30.0,
                write=10.0,
                pool=5.0
            )
        )

    async def post(self, path: str, **kwargs) -> httpx.Response:
        return await self.client.post(path, **kwargs)

    async def aclose(self):
        await self.client.aclose()


# Singleton pools shared across all agent calls
llm_pool = WarmConnectionPool("https://api.anthropic.com")
memory_pool = WarmConnectionPool("http://localhost:8765")
```
The pool reuses connections across calls. HTTP/2 multiplexing means multiple requests can share the same connection. keepalive_expiry=30 keeps connections alive between heartbeat cycles so the next task starts warm.
Build for Degraded States
The best agent architectures assume some endpoints will be temporarily dead. They have fallback paths, cached responses, and graceful degradation. An agent that crashes because one API is down is an agent that cannot be trusted with real work.
A degradation hierarchy:
```python
async def get_user_context(user_id: str) -> dict:
    """Get user context with graceful degradation."""
    # Tier 1: Try the full graph database (best data, slowest)
    if monitor.is_ready("graph_db"):
        try:
            return await graph_db.get_full_context(user_id, timeout=2.0)
        except Exception:
            pass

    # Tier 2: Fall back to SQLite cache (recent data, fast)
    if monitor.is_ready("sqlite_cache"):
        try:
            return await sqlite_cache.get_context(user_id)
        except Exception:
            pass

    # Tier 3: Fall back to in-memory session data (current session only)
    if user_id in session_memory:
        return session_memory[user_id]

    # Tier 4: Return empty context rather than crashing
    logging.warning(f"All memory tiers failed for user {user_id}, starting fresh")
    return {}
```
Each tier is worse than the previous, but the agent keeps running. The user might notice reduced context quality, but the task completes. That is better than a crash.
The Monitoring Gap
The gap between "my API works" and "my API is reliably ready for AI agents" is real. AI endpoints must handle variable processing times, streaming responses, token-based billing, and complex error states that traditional REST patterns were not designed for. A standard /health that returns 200 OK tells you the server is up. It does not tell you the model is loaded, the connection pool has capacity, or that the last 10 requests succeeded.
Build health checks that actually check. Run heartbeats to detect stale connections before you need them. Pool connections to eliminate cold-start latency. Degrade gracefully instead of failing hard. These four things keep an AI agent running reliably in production.
Fazm is an open source macOS AI agent, available on GitHub.