API Endpoints That Stay Alive - Health Checks, Heartbeats, and Warm Connections

Matthew Diakonov

A door with a pulse is an API endpoint that is alive - not just responding with 200 OK, but genuinely ready to handle real work. The distinction matters enormously for AI agents that depend on external services to function.

API reliability has actually gotten worse recently. The Uptrends State of API Reliability 2025 report found that average API uptime fell from 99.66% to 99.46% between Q1 2024 and Q1 2025, resulting in 60% more downtime year-over-year. As AI agents orchestrate more API calls in series, each small reliability gap compounds.

The Difference Between Alive and Responsive

A health check that returns {"status": "ok"} tells you almost nothing. The endpoint is reachable. The web server is running. But can it actually process a request? Is the database connection pool healthy? Are downstream services available?

For AI agents, this is not an academic concern. An agent that calls an LLM API, gets back a 200 response with an empty completion because the model is overloaded, and then tries to parse that empty response as instructions - that agent is about to do something unpredictable and possibly destructive.
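
A minimal guard makes that failure explicit instead of silent. A sketch, assuming the completion text has already been extracted from the response:

def validate_completion(text: str | None) -> str:
    """Refuse to act on an empty or whitespace-only completion."""
    if not text or not text.strip():
        raise RuntimeError("LLM returned an empty completion - refusing to treat it as instructions")
    return text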

Real health checks include dependency probes:

import json

from fastapi import FastAPI, Response
import asyncpg
import httpx
from datetime import datetime, timezone

app = FastAPI()

DATABASE_URL = "postgresql://localhost/agent"  # placeholder - use your real DSN

@app.get("/health")
async def health_check():
    checks = {}
    overall_healthy = True

    # Check database connection
    try:
        conn = await asyncpg.connect(DATABASE_URL, timeout=2.0)
        await conn.fetchval("SELECT 1")
        await conn.close()
        checks["database"] = {"status": "ok"}
    except Exception as e:
        checks["database"] = {"status": "error", "detail": str(e)}
        overall_healthy = False

    # Check LLM API availability (illustrative URL - point this at your
    # provider's real status endpoint)
    try:
        async with httpx.AsyncClient(timeout=3.0) as client:
            resp = await client.get("https://api.anthropic.com/health")
            # Non-200 means degraded but still reachable, so overall
            # health is not flipped here
            checks["llm_api"] = {"status": "ok" if resp.status_code == 200 else "degraded"}
    except Exception as e:
        checks["llm_api"] = {"status": "unreachable", "detail": str(e)}
        overall_healthy = False

    # Check memory store (assumes an initialized client with an async ping())
    try:
        await memory_store.ping()
        checks["memory_store"] = {"status": "ok"}
    except Exception as e:
        checks["memory_store"] = {"status": "error", "detail": str(e)}
        overall_healthy = False

    status_code = 200 if overall_healthy else 503
    return Response(
        content=json.dumps({
            "status": "healthy" if overall_healthy else "degraded",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat()
        }),
        status_code=status_code,
        media_type="application/json"
    )

This health check fails loudly when any dependency is unhealthy, returning 503 instead of a misleading 200. An upstream load balancer or monitoring system can act on that 503. An agent checking this endpoint before starting a task knows whether to proceed.
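
The agent side of that contract is a preflight check. A sketch - the base URL and two-second timeout are illustrative:

import httpx

async def preflight(base_url: str) -> bool:
    """Return True only when the service's /health reports 200."""
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            resp = await client.get(f"{base_url}/health")
            return resp.status_code == 200
    except httpx.HTTPError:
        return False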

The industry recommendation: run health checks every 30 seconds to 1 minute. More frequent than that and the checks themselves become load. Less frequent and you detect outages too slowly. Health check endpoints should respond in under 100ms - if your health check is slow, it is probing too deeply.
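
One way to hold that latency budget is to run dependency probes concurrently under a hard per-probe time limit, so the whole check takes roughly as long as the slowest probe rather than the sum of all of them. A sketch - the budget value is illustrative, and check_db / check_llm stand in for your real probe coroutines:

import asyncio

async def bounded_probe(name: str, coro, budget: float = 0.1):
    """Run one dependency probe under a hard time budget."""
    try:
        await asyncio.wait_for(coro, timeout=budget)
        return name, {"status": "ok"}
    except asyncio.TimeoutError:
        return name, {"status": "timeout"}
    except Exception as e:
        return name, {"status": "error", "detail": str(e)}

# Inside the health check:
# results = await asyncio.gather(
#     bounded_probe("database", check_db()),
#     bounded_probe("llm_api", check_llm()),
# )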

Heartbeats for Long-Running Agent Sessions

Desktop agents often maintain long-running connections to multiple services - LLM providers, memory stores, MCP servers, local databases. These connections go stale. TCP keepalives help but are not sufficient for application-level state.

Application-level heartbeats solve this. The pattern:

import asyncio
import logging
from typing import Callable, Dict
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ServiceStatus:
    name: str
    last_ping: datetime | None = None
    consecutive_failures: int = 0
    is_healthy: bool = True

class AgentHeartbeatMonitor:
    def __init__(self, interval_seconds: int = 30):
        self.interval = interval_seconds
        self.services: Dict[str, ServiceStatus] = {}
        self.ping_funcs: Dict[str, Callable] = {}
        self._running = False

    def register(self, name: str, ping_func: Callable):
        """Register a service and its ping function."""
        self.services[name] = ServiceStatus(name=name)
        self.ping_funcs[name] = ping_func

    async def _ping_service(self, name: str):
        status = self.services[name]
        try:
            await self.ping_funcs[name]()
            status.last_ping = datetime.now()
            status.consecutive_failures = 0
            if not status.is_healthy:
                logging.info(f"Service {name} recovered")
                status.is_healthy = True
        except Exception as e:
            status.consecutive_failures += 1
            if status.consecutive_failures >= 3:
                if status.is_healthy:
                    logging.warning(f"Service {name} marked unhealthy: {e}")
                    status.is_healthy = False

    async def run(self):
        self._running = True
        while self._running:
            await asyncio.gather(*[
                self._ping_service(name)
                for name in self.services
            ])
            await asyncio.sleep(self.interval)

    def is_ready(self, service_name: str) -> bool:
        """Check if a service is healthy before using it."""
        status = self.services.get(service_name)
        # Treat unregistered services as not ready instead of silently
        # reporting them healthy
        return status is not None and status.is_healthy

    def stop(self):
        self._running = False


# Usage in an agent
monitor = AgentHeartbeatMonitor(interval_seconds=30)

# Register each dependency (pings assume async clients - e.g. an
# AsyncAnthropic instance - so each lambda returns an awaitable)
monitor.register("llm_api", lambda: anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1,
    messages=[{"role": "user", "content": "ping"}]
))
monitor.register("memory_store", lambda: memory_store.ping())
monitor.register("mcp_filesystem", lambda: mcp_client.call_tool("list_directory", {"path": "."}))

# Before any task, check readiness
async def run_task(task: str):
    if not monitor.is_ready("llm_api"):
        raise RuntimeError("LLM API is unhealthy - cannot proceed")
    if not monitor.is_ready("memory_store"):
        logging.warning("Memory store degraded - proceeding with reduced capability")
    # ... proceed with task
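
The snippet above assumes the heartbeat loop is already running; it has to be started as a background task alongside the agent. A sketch, with the task string purely illustrative:

async def main():
    heartbeat = asyncio.create_task(monitor.run())
    try:
        await run_task("summarize today's inbox")
    finally:
        monitor.stop()
        heartbeat.cancel()

asyncio.run(main())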

The key design decision: mark a service unhealthy after 3 consecutive failures, not the first. Single-ping failures are common and transient. Three in a row means something is actually wrong.

Connection Warmth Matters for Latency

Cold API connections add latency that compounds across multi-step agent workflows. An agent making 15 API calls to complete a task - hitting the accessibility API, querying a knowledge graph, calling an LLM, updating a database - cannot afford connection setup overhead on every call.

The numbers: the first request on a cold connection might take 200 ms to establish, most of it TCP and TLS handshake overhead. Subsequent requests on the same HTTP/2 connection take around 20 ms. Over a 15-call workflow, the difference is (200 - 20) ms * 15 = 2,700 ms, or 2.7 seconds of pure overhead eliminated just by keeping connections warm.
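
Those numbers vary with network conditions and TLS configuration, so it is worth measuring against your own endpoints. A quick sketch - example.com stands in for a real service:

import asyncio
import time
import httpx

async def measure(url: str = "https://example.com"):
    async with httpx.AsyncClient() as client:
        t0 = time.perf_counter()
        await client.get(url)   # cold: DNS + TCP + TLS handshake
        t1 = time.perf_counter()
        await client.get(url)   # warm: reuses the open connection
        t2 = time.perf_counter()
        print(f"cold: {(t1 - t0) * 1000:.0f} ms, warm: {(t2 - t1) * 1000:.0f} ms")

asyncio.run(measure())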

Connection pooling implementation:

import httpx

class WarmConnectionPool:
    def __init__(self, base_url: str, max_connections: int = 10):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            http2=True,                    # HTTP/2 multiplexing (requires the httpx[http2] extra)
            limits=httpx.Limits(
                max_connections=max_connections,
                max_keepalive_connections=max_connections,
                keepalive_expiry=30.0      # Keep connections alive 30 seconds
            ),
            timeout=httpx.Timeout(
                connect=5.0,
                read=30.0,
                write=10.0,
                pool=5.0
            )
        )

    async def post(self, path: str, **kwargs) -> httpx.Response:
        return await self.client.post(path, **kwargs)

    async def aclose(self):
        await self.client.aclose()

# Singleton pool shared across all agent calls
llm_pool = WarmConnectionPool("https://api.anthropic.com")
memory_pool = WarmConnectionPool("http://localhost:8765")

The pool reuses connections across calls. HTTP/2 multiplexing means multiple requests can share the same connection. keepalive_expiry=30 keeps connections alive between heartbeat cycles so the next task starts warm.
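
Usage, as a sketch: the path and headers follow Anthropic's documented Messages API, and API_KEY is assumed to be loaded from configuration elsewhere:

async def call_llm(prompt: str) -> dict:
    resp = await llm_pool.post(
        "/v1/messages",
        headers={"x-api-key": API_KEY, "anthropic-version": "2023-06-01"},
        json={
            "model": "claude-3-haiku-20240307",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        },
    )
    resp.raise_for_status()
    return resp.json()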

Build for Degraded States

The best agent architectures assume some endpoints will be temporarily dead. They have fallback paths, cached responses, and graceful degradation. An agent that crashes because one API is down is an agent that cannot be trusted with real work.

A degradation hierarchy:

async def get_user_context(user_id: str) -> dict:
    """Get user context with graceful degradation."""

    # Tier 1: Try the full graph database (best data, slowest)
    if monitor.is_ready("graph_db"):
        try:
            return await graph_db.get_full_context(user_id, timeout=2.0)
        except Exception:
            pass  # graph DB failed - fall through to the cache

    # Tier 2: Fall back to SQLite cache (recent data, fast)
    if monitor.is_ready("sqlite_cache"):
        try:
            return await sqlite_cache.get_context(user_id)
        except Exception:
            pass  # cache failed - fall through to session memory

    # Tier 3: Fall back to in-memory session data (current session only)
    if user_id in session_memory:
        return session_memory[user_id]

    # Tier 4: Return empty context rather than crashing
    logging.warning(f"All memory tiers failed for user {user_id}, starting fresh")
    return {}

Each tier is worse than the previous, but the agent keeps running. The user might notice reduced context quality, but the task completes. That is better than a crash.

The Monitoring Gap

The gap between "my API works" and "my API is reliably ready for AI agents" is real. AI endpoints must handle variable processing times, streaming responses, token-based billing, and complex error states that traditional REST patterns were not designed for. A standard /health that returns 200 OK tells you the server is up. It does not tell you the model is loaded, the connection pool has capacity, or that the last 10 requests succeeded.
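
Tracking recent request outcomes closes part of that gap. A sketch of a rolling window - the window size and the wiring into /health are illustrative:

from collections import deque

class RollingSuccessTracker:
    """Remember the outcomes of the last N requests."""

    def __init__(self, window: int = 10):
        self.outcomes: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def success_rate(self) -> float:
        if not self.outcomes:
            return 1.0  # no traffic yet - report healthy
        return sum(self.outcomes) / len(self.outcomes)

tracker = RollingSuccessTracker()
# In each request handler: tracker.record(response.status_code < 500)
# In /health: checks["recent_success_rate"] = tracker.success_rate()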

Build health checks that actually check. Run heartbeats to detect stale connections before you need them. Pool connections to eliminate cold-start latency. Degrade gracefully instead of failing hard. These four things keep an AI agent running reliably in production.

Fazm is an open source macOS AI agent, available on GitHub.
