How to Limit the Blast Radius of a Compromised AI Agent

Matthew Diakonov · 15 min read


Your AI agent will eventually process something malicious. A prompt injection hidden in a webpage, a poisoned tool result, a crafted document. The question is not whether the agent gets compromised, but how much damage it can do when it does. This guide covers the concrete techniques that keep a single compromised agent from becoming a full system breach.

Why Blast Radius Matters More Than Prevention

Prevention is necessary but insufficient. You can filter inputs, validate tool calls, and monitor outputs, but no defense catches everything. The security model that actually works treats compromise as inevitable and focuses on containment.

Think of it like fire safety in a building. Sprinklers and alarms help, but the thing that saves lives is compartmentalization: fire doors, independent HVAC zones, and structural barriers that keep a kitchen fire from burning down the whole building.

The same principle applies to AI agents. Every technique in this guide creates a barrier that limits how far a compromise can spread.

[Diagram: without containment, a compromised agent reaches files, network, credentials, and the shell; with containment, each of those paths is blocked.]

Layer 1: Process Isolation

The most effective containment runs the agent in a separate process with its own permissions. If the agent process cannot access a resource, no amount of prompt injection can reach it.

Containers are the simplest approach. Run the agent inside a Docker container with a read-only filesystem, no network access to internal services, and only the specific files it needs mounted as volumes:

docker run --rm \
  --read-only \
  --network=none \
  --tmpfs /tmp:size=100m \
  -v /data/agent-workspace:/workspace:rw \
  -v /data/agent-config:/config:ro \
  --memory=512m \
  --cpus=1 \
  agent-image:latest

This gives the agent a writable workspace but nothing else. Even if it runs arbitrary code, it cannot reach the network, read your home directory, or consume unlimited resources.

macOS sandboxing works at the app level. The sandbox-exec command (deprecated but still functional through macOS 15) and App Sandbox entitlements restrict file access, network connections, and hardware access per-process. For desktop agents, this means the agent helper process can be sandboxed even when the parent app has broader access.
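
A minimal sandbox-exec profile might look like the following sketch. The workspace path and the exact set of allowed reads are placeholders; Apple's profile language (SBPL) is only partially documented, so real profiles usually need iteration against the sandbox violation logs:

```
(version 1)
(deny default)                                          ; deny everything not explicitly allowed
(allow file-read* file-write*
       (subpath "/Users/me/agent-workspace"))           ; placeholder workspace path
(allow file-read* (subpath "/usr/lib"))                 ; shared libraries the process needs
(deny network*)                                         ; no network at all
```

You would then launch the helper with something like `sandbox-exec -f agent.sb ./agent-helper`.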

| Isolation method | Setup effort | Containment strength | Performance overhead |
|---|---|---|---|
| Docker container | Low | High (filesystem, network, resources) | ~5% CPU, memory limit enforced |
| VM (Firecracker, QEMU) | Medium | Very high (full hardware isolation) | 10-15% CPU, fixed memory |
| macOS App Sandbox | Medium | Medium (filesystem, network per-app) | Negligible |
| Linux namespaces (manual) | High | High (fine-grained control) | Negligible |
| chroot / unshare | Low | Low (easy to escape without namespaces) | Negligible |

Layer 2: Least-Privilege Tool Design

Every tool your agent can call defines the upper bound of what a compromised agent can do. The principle is simple: each tool should do one thing and have the minimum permissions needed to do it.

Warning

A tool that can "run any shell command" makes every other security measure irrelevant. If one of your agent's tools is an unrestricted shell, that is your blast radius: everything the user can do.

Instead of giving the agent a general-purpose shell tool like run_shell(command: string), which can execute anything, create narrow tools that do exactly what the workflow needs, with the constraints fixed server-side rather than passed as parameters the agent controls:

read_file(path: string) - reads files only from allowed directories (e.g. /data/reports)
query_db(sql: string) - runs read-only queries against specific tables (e.g. analytics)
send_slack(message: string) - posts only to a fixed channel (e.g. #alerts)
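
A directory-allowlisted read_file tool takes only a few lines. This is a minimal sketch with a placeholder allowlist; the important detail is resolving symlinks and ".." before checking the path:

```python
import os

ALLOWED_DIRS = ["/data/reports"]  # fixed server-side, not a parameter the agent controls

def read_file(path: str) -> str:
    # Resolve symlinks and ".." first, so "/data/reports/../../etc/passwd" is rejected
    real = os.path.realpath(path)
    if not any(real == d or real.startswith(d + os.sep) for d in ALLOWED_DIRS):
        raise PermissionError(f"path outside allowed directories: {path}")
    with open(real, "r") as f:
        return f.read()
```

Checking the raw string instead of the resolved path is the classic mistake here: a prefix check on the unresolved path passes for traversal payloads and symlinks.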

The MCP (Model Context Protocol) specification supports this pattern natively. Each MCP server exposes a defined set of tools with typed parameters. Instead of one server with broad access, run multiple servers with narrow scopes. If one server gets compromised through a poisoned tool result, the others remain unaffected because they run in separate processes. We wrote more about this trust model in our post on MCP server trust surfaces.

Layer 3: Network Segmentation

A compromised agent that can reach the internet can exfiltrate data. One that can reach internal services can pivot to more valuable targets. Network controls are the second most effective containment layer after process isolation.

# Allow only specific outbound destinations for the agent's Unix user.
# Note: iptables resolves api.openai.com to IPs once, at rule-insert time,
# so refresh the rule if the provider's IPs change (or route through a proxy).
iptables -A OUTPUT -m owner --uid-owner agent-user -d api.openai.com -p tcp --dport 443 -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner agent-user -j DROP

For Docker-based agents, use custom networks with no internet access and explicit service links:

# docker-compose.yml
services:
  agent:
    image: agent-image:latest
    networks:
      - agent-net
    # No connection to default bridge = no internet
  
  allowed-api:
    image: api-proxy:latest
    networks:
      - agent-net
      - external
    # Proxy validates and rate-limits agent requests

networks:
  agent-net:
    internal: true  # No external connectivity
  external:
    driver: bridge

The proxy pattern is especially useful: the agent talks to your proxy, which validates the request shape, enforces rate limits, and forwards only to allowed destinations. The agent never sees real API keys or network addresses.

Layer 4: Credential Isolation

Never give the agent direct access to credentials. This is the single highest-leverage change you can make, because credentials amplify every other exploit.

| Pattern | Risk | Better alternative |
|---|---|---|
| API keys in environment variables | Agent can read and exfiltrate them | Proxy service that adds auth headers |
| SSH keys mounted in agent container | Agent can access any server the key unlocks | Short-lived certificates from a CA (e.g., Vault) |
| Database connection string with full access | Agent can DROP tables | Read-only database replica with row-level security |
| OAuth tokens with broad scopes | Agent can act as the user everywhere | Scoped tokens that expire in minutes |
| AWS credentials with AdministratorAccess | Game over | IAM role with only the specific actions needed |

The proxy pattern from the network section solves credential isolation too. The agent sends { "action": "send_email", "to": "user@example.com", "body": "..." } to your proxy. The proxy validates the request, adds the API key from its own secure storage, and makes the actual API call. The agent never touches the credential.
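
The proxy-side step can be sketched in a few lines of Python. The action table, endpoint URL, and size limit below are hypothetical placeholders; the point is that the credential is attached after validation, on the proxy side:

```python
import json
import urllib.request

# Hypothetical mapping of agent-visible actions to real endpoints
ACTIONS = {
    "send_email": {"url": "https://mail.internal/send", "max_body": 10_000},
}

def handle_agent_request(raw: str, api_key: str) -> urllib.request.Request:
    """Validate an agent request and build the real, authenticated API call."""
    req = json.loads(raw)
    spec = ACTIONS.get(req.get("action"))
    if spec is None:
        raise ValueError(f"unknown action: {req.get('action')}")
    if len(req.get("body", "")) > spec["max_body"]:
        raise ValueError("body too large")
    # The credential is injected here; the agent process never sees api_key
    return urllib.request.Request(
        spec["url"],
        data=json.dumps(req).encode(),
        headers={"Authorization": f"Bearer {api_key}"},
        method="POST",
    )
```

In production the key would come from the proxy's own secrets store (a mounted secret, Vault, or a cloud secrets manager), never from an environment variable the agent can read.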

Layer 5: Human-in-the-Loop Checkpoints

Some actions are too consequential to run without approval, regardless of how good your other containment is. Build approval gates into the tool layer:

class ApprovalRequiredTool:
    """Wraps any tool with a human approval step."""
    
    ALWAYS_REQUIRE = {
        "delete_file", "send_email", "deploy", 
        "modify_permissions", "create_user"
    }
    
    THRESHOLD_REQUIRE = {
        "write_file": lambda args: len(args["content"]) > 10000,
        "query_db": lambda args: "DELETE" in args["sql"].upper(),
        "transfer": lambda args: float(args["amount"]) > 100,
    }
    
    def execute(self, tool_name, args):
        if tool_name in self.ALWAYS_REQUIRE:
            return self.request_approval(tool_name, args)
        
        check = self.THRESHOLD_REQUIRE.get(tool_name)
        if check and check(args):
            return self.request_approval(tool_name, args)
        
        return self.run(tool_name, args)

This is what Claude Code does with its permission modes. In the default mode, destructive operations like file deletion, shell commands, and git pushes require explicit user approval. The agent can still read, search, and analyze freely, but it cannot cause irreversible damage without a human confirming the action.

Layer 6: Audit Logging and Anomaly Detection

Containment without visibility is incomplete. Every tool call, every API request, and every file access should be logged with enough detail to reconstruct what happened.

import json
import time

def log_tool_call(agent_id, tool_name, args, result, duration_ms):
    entry = {
        "timestamp": time.time(),
        "agent_id": agent_id,
        "tool": tool_name,
        "args": sanitize(args),  # strip sensitive values
        "result_size": len(json.dumps(result)),
        "duration_ms": duration_ms,
        "success": result.get("error") is None,
    }
    # Append to immutable log (agent cannot modify)
    append_to_audit_log(entry)
    
    # Check for anomalies
    if tool_name == "read_file" and calls_in_last_minute(agent_id, "read_file") > 50:
        kill_agent(agent_id, reason="excessive file reads")
    if entry["result_size"] > 1_000_000:  # 1MB result
        flag_for_review(agent_id, "large data access")

The audit log must live outside the agent's reach. If the agent can delete or modify its own logs, a compromised agent will cover its tracks. Write logs to a separate service, a different user's filesystem, or an append-only storage backend.
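
One building block for that is writing each entry as a JSON line through an append-only file descriptor. A minimal sketch of append_to_audit_log, with a placeholder path; in production the file would live where the agent user has no write or delete permission (and on Linux can additionally be marked append-only with chattr +a):

```python
import json
import os

AUDIT_LOG = "/var/log/agent/audit.jsonl"  # placeholder path

def append_to_audit_log(entry: dict, path: str = AUDIT_LOG) -> None:
    # O_APPEND makes every write land at the current end of the file,
    # even with multiple concurrent writer processes
    line = (json.dumps(entry, sort_keys=True) + "\n").encode()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o640)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)
```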

Layer 7: Ephemeral Sessions

Long-running agent sessions accumulate context, credentials, and cached data that increase blast radius over time. Ephemeral sessions reset the agent's state regularly, limiting how much a compromised agent can access from previous interactions.

# Each task gets a fresh container
docker run --rm \
  --read-only \
  --network=agent-net \
  -e TASK_ID="$TASK_ID" \
  agent-image:latest \
  python3 run_task.py

# Container is destroyed after the task completes
# No persistent state, no accumulated credentials

This is analogous to how browsers use incognito mode or how serverless functions get a fresh execution environment per invocation. The agent starts clean, does its work, and disappears. Any compromise is contained to that single task execution.

Putting It All Together

No single technique is sufficient. Real containment comes from layering multiple independent controls so that each layer catches what the others miss.

Defense in Depth: 7 Layers

  • Layer 1: Process Isolation (containers, VMs, sandboxes)
  • Layer 2: Least-Privilege Tools (narrow, typed, scoped)
  • Layer 3: Network Segmentation (no internet, proxy only)
  • Layer 4: Credential Isolation (proxy, short-lived tokens)
  • Layer 5: Human Approval Gates
  • Layer 6: Audit Logging + Anomaly Kill
  • Layer 7: Ephemeral Sessions

Here is a practical checklist you can work through for any agent deployment:

  • Run the agent in a container or VM, not directly on your host
  • Replace general-purpose shell tools with narrow, typed alternatives
  • Block outbound network by default, allowlist specific destinations
  • Never pass credentials to the agent; use a proxy that injects auth
  • Require human approval for destructive or high-value actions
  • Log every tool call to an immutable store the agent cannot access
  • Use ephemeral sessions so compromises do not persist across tasks

Common Pitfalls

  • "We have approval gates, so we are safe." Approval fatigue is real. After clicking "approve" 50 times in a row, humans stop reading the details. Combine approval gates with anomaly detection so that the system flags unusual requests rather than relying on humans to catch everything.

  • Overscoping "for development convenience." Running the agent with --privileged or --network=host during development is fine. Shipping it that way is not. The blast radius in production is your actual blast radius.

  • Logging tool names but not arguments. Knowing the agent called read_file is not useful. Knowing it called read_file("/etc/shadow") is. Log the full call with arguments, but sanitize any values that are themselves secrets.

  • Trusting the agent's self-reported actions. A compromised agent will lie about what it is doing. Your monitoring must observe the actual system calls, network traffic, and file access, not the agent's description of its own behavior.

Wrapping Up

Limiting blast radius comes down to one principle: never give the agent more access than the current task requires, and assume it will try to use all the access it has. Layer process isolation, least-privilege tools, network controls, credential separation, human checkpoints, audit logging, and ephemeral sessions. No single layer is enough, but together they turn a potential full-system compromise into a contained, recoverable incident.

Fazm is an open source macOS AI agent that applies these containment principles by default. Open source on GitHub.
