Safety Problems at the Execution Layer - Not in the Prompt

Matthew Diakonov

Everyone focuses on prompt safety. System prompts full of "never do X" instructions. Content filters on inputs and outputs. Guardrails that intercept the LLM's planned response before it executes.

That work matters. But it is not where most real-world AI agent safety failures happen.

The dangerous part is the execution layer - the code that actually clicks buttons, runs shell commands, edits files, and sends messages. This is where intentions become actions, and where small implementation bugs create large, hard-to-reverse problems.

What the CVE Data Shows

A security audit of 2,614 MCP (Model Context Protocol) server implementations published in early 2026 found:

  • 82% have file operation vulnerabilities including path traversal attacks
  • 43% of documented CVEs fall into the exec/shell injection category
  • Over a third of surveyed implementations are susceptible to command injection

These are not theoretical vulnerabilities in obscure tools. Anthropic's own Filesystem MCP Server was found vulnerable to a path traversal bypass - attackers bypassed directory restrictions using ../../ sequences to read and write files anywhere on the system.

CVE-2025-59536 is a particularly concrete example: it exploits Claude Code's Hooks feature, which runs predefined shell commands at lifecycle events. An attacker who can write to .claude/settings.json in a repository can inject a malicious Hook that executes arbitrary code the moment a developer opens the project. The prompt layer has no visibility into this attack at all.

Why Execution Is Harder to Secure

A prompt-level safety check can catch an agent that plans to "delete all files in the home directory." It cannot catch execution bugs that produce the same outcome accidentally.

Path traversal. The agent constructs a file path from user input to write a report. A crafted input containing ../../etc/passwd sends the write somewhere unintended. The agent intended to write ~/Documents/report.txt. The execution code trusted that the path was valid.

import os

# Vulnerable - trusts path from user input
def write_report(filename, content):
    path = os.path.join(REPORTS_DIR, filename)
    with open(path, "w") as f:
        f.write(content)

# Safe - resolves the path and verifies it stays inside the allowed directory
def write_report(filename, content):
    base = os.path.realpath(REPORTS_DIR)
    path = os.path.realpath(os.path.join(base, filename))
    # Compare against base + separator so a sibling directory like
    # "reports_evil" cannot slip past a plain prefix check
    if not path.startswith(base + os.sep):
        raise ValueError(f"Path escape attempt: {filename}")
    with open(path, "w") as f:
        f.write(content)

Shell injection. The agent calls a shell command with user-provided arguments that get interpreted as flags. rm report -rf is very different from rm "report -rf". Many agentic systems do not validate argument flags, treating any string from the model as safe input to the shell.

# Vulnerable - shell=True with model-provided input
import subprocess
def run_command(user_input):
    subprocess.run(f"process_file {user_input}", shell=True)

# Safe - list-form prevents injection, no shell=True
def run_command(filename):
    if not filename.replace("-", "").replace("_", "").isalnum():
        raise ValueError("Invalid filename")
    subprocess.run(["process_file", filename], shell=False)
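
If a tool genuinely needs shell features such as pipes, a common mitigation (not from the original post, and the pipe target below is a placeholder) is shlex.quote, which turns an untrusted string into a single literal argument before it reaches the shell:

import shlex
import subprocess

def run_in_shell(filename):
    # shlex.quote makes the model-provided value a single literal token for the shell
    safe = shlex.quote(filename)
    subprocess.run(f"process_file {safe} | tee /tmp/process.log", shell=True, check=True)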

Scope creep in loops. An agent iterating over files to rename them hits a permission error, retries with escalated permissions using a broader glob, and ends up modifying files outside its intended scope. Each step was a reasonable local decision. The cumulative effect was unauthorized modification of system files.
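
One way to contain this, sketched below with hypothetical names, is to freeze the set of files an operation may touch before the loop starts; retries then fail loudly instead of silently widening the glob or escalating permissions.

import os

class FrozenScope:
    """Snapshot of the files a batch operation is allowed to touch."""
    def __init__(self, paths):
        self.allowed = {os.path.realpath(p) for p in paths}

    def check(self, path):
        real = os.path.realpath(path)
        if real not in self.allowed:
            # A retry may not widen the scope the operation was originally given
            raise PermissionError(f"Outside original scope: {path}")
        return real

# Usage: scope = FrozenScope(files_to_rename); call scope.check(target) before each rename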

Race conditions. Two parallel agent actions both try to write to the same configuration file. Each reads the current state, makes its change, and writes back. The second write silently overwrites the first. Neither agent knows its changes were partially overwritten.
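
A minimal mitigation, assuming a POSIX system and a JSON config file, is to hold an advisory lock across the whole read-modify-write cycle so the second writer sees the first writer's change:

import fcntl
import json

def update_config(path, key, value):
    # Exclusive advisory lock held for the entire read-modify-write cycle
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            config = json.load(f)
            config[key] = value
            f.seek(0)
            f.truncate()
            json.dump(config, f, indent=2)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)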

Building Execution-Layer Guardrails

Prompt safety is necessary but not sufficient. The execution layer needs its own independent security controls.

Allowlists for paths and commands. Do not rely on the LLM to avoid dangerous operations. Enforce it in code with explicit allowlists.

import os

class SecurityError(Exception):
    """Raised when an agent action falls outside the allowlist."""

ALLOWED_PATHS = [
    os.path.expanduser("~/Documents"),
    os.path.expanduser("~/Downloads"),
    "/tmp/agent-workspace"
]

ALLOWED_COMMANDS = {
    "file_ops": ["read", "write", "list"],
    "git": ["status", "diff", "log", "add", "commit"],
    "shell": []  # No shell commands allowed
}

def validate_path(path):
    real_path = os.path.realpath(path)
    bases = [os.path.realpath(allowed) for allowed in ALLOWED_PATHS]
    # Match the directory itself or paths below it; a bare prefix check would
    # let "/tmp/agent-workspace-evil" pass as "/tmp/agent-workspace"
    if not any(real_path == base or real_path.startswith(base + os.sep) for base in bases):
        raise SecurityError(f"Access denied: {path}")
    return real_path

def validate_command(tool, operation):
    allowed_ops = ALLOWED_COMMANDS.get(tool, [])
    if operation not in allowed_ops:
        raise SecurityError(f"Operation not permitted: {tool}.{operation}")
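
Wiring the two checks together might look like the following sketch, where TOOL_REGISTRY is a hypothetical mapping from tool and operation names to handler functions:

def execute_tool_call(tool, operation, path=None, **kwargs):
    # Both checks run in code, regardless of what the model asked for
    validate_command(tool, operation)
    if path is not None:
        kwargs["path"] = validate_path(path)
    return TOOL_REGISTRY[tool][operation](**kwargs)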

Sandboxed execution. Run agent actions in containers or restricted OS contexts. On macOS, the agent process can be restricted to specific directories and prevented from making network calls using sandbox profiles. On Linux, seccomp filters and user namespaces limit what system calls the agent process can make.
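
As one concrete sketch, assuming Docker is available and using a placeholder image and resource limits, a tool command can run in a throwaway container with no network access and a single writable workspace mount:

import subprocess

def run_sandboxed(command, workspace):
    # Throwaway container: no network, read-only root filesystem, one writable mount
    return subprocess.run([
        "docker", "run", "--rm",
        "--network", "none",
        "--read-only",
        "--memory", "512m",
        "--pids-limit", "128",
        "-v", f"{workspace}:/workspace:rw",
        "-w", "/workspace",
        "python:3.12-slim",  # placeholder image
        "sh", "-c", command,
    ], capture_output=True, text=True, timeout=300)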

Action logging with rollback. Record every side effect with enough information to undo it. File writes log the previous content. Sent messages log the recipient and content. Database changes log the previous state. This is not just for debugging - it is the mechanism that lets you recover from an agent that made a series of individually-plausible mistakes.

import hashlib
import json
import os
from datetime import datetime, timezone

class AuditedFileWriter:
    def __init__(self, audit_log_path):
        self.audit = open(audit_log_path, "a")

    def write(self, path, content):
        validated_path = validate_path(path)

        # Log previous state for rollback
        previous = None
        if os.path.exists(validated_path):
            with open(validated_path, "r") as f:
                previous = f.read()

        self.audit.write(json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": "write",
            "path": validated_path,
            "previous": previous,
            "new_hash": hashlib.sha256(content.encode()).hexdigest()
        }) + "\n")
        self.audit.flush()  # Persist the audit entry before the side effect happens

        with open(validated_path, "w") as f:
            f.write(content)
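
A matching rollback helper, sketched here on the assumption that the audit log uses exactly the format written above, replays the log in reverse and restores each file's previous contents:

import json
import os

def rollback(audit_log_path):
    with open(audit_log_path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    # Undo the most recent side effects first
    for entry in reversed(entries):
        if entry["action"] != "write":
            continue
        if entry["previous"] is None:
            os.remove(entry["path"])  # The file did not exist before the agent's write
        else:
            with open(entry["path"], "w") as f:
                f.write(entry["previous"])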

Rate limiting on destructive operations. Slow down deletes, sends, and multi-file writes. A 2-second delay on file deletion gives humans a chance to notice and intervene. A queue with a 5-minute hold on outbound emails allows review before messages go out. The cost in speed is trivial compared to the cost of a runaway agent sending emails to hundreds of contacts.
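
A small sketch of the delay idea, with hypothetical operation names and an optional threading.Event that a supervising process can set to cancel during the hold window:

import time

DESTRUCTIVE_DELAYS = {
    "delete_file": 2.0,    # seconds a human has to notice and intervene
    "send_email": 300.0,   # five-minute hold before outbound messages go out
}

def guarded_execute(operation, action, cancel_event=None):
    """Delay destructive operations so a human can cancel before they run."""
    delay = DESTRUCTIVE_DELAYS.get(operation, 0.0)
    if delay:
        print(f"[guard] {operation} runs in {delay:.0f}s unless cancelled")
        if cancel_event is None:
            time.sleep(delay)
        elif cancel_event.wait(timeout=delay):
            return None  # A human cancelled during the hold window
    return action()

# Usage: guarded_execute("delete_file", lambda: os.remove(path), cancel_event)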

The Right Mental Model

Think of AI agent security as two separate problems that both need to be solved independently:

  1. What the agent decides to do - handled by prompt design, system instructions, and output filtering
  2. What the agent can actually do - handled by execution-layer code, allowlists, sandboxing, and audit logging

An agent with perfect prompt safety but no execution guardrails is a well-intentioned program with a bug in its action code; the bug will eventually manifest. An agent with execution guardrails but no prompt safety is much harder to exploit for serious harm, but it can still make well-intentioned mistakes within the scope it is allowed.

Both layers need to be secure. But if you are building an agent system today and have limited engineering time, invest in the execution layer first. Prompt safety is easier to add incrementally. Execution-layer bugs are discovered in production.

This post was inspired by a discussion on r/artificial.

Fazm is an open-source macOS AI agent, available on GitHub.
