AI Safety
11 articles about AI safety.
Why Guardian Models Fail Against Anticipated Attacks on AI Agents
Guardian models and safety wrappers fail precisely when you need them most. Prompt injection is OWASP's #1 LLM vulnerability. Here's what actually works for AI agent security.
Machine-Enforceable Policy
Most AI agent policies rely on the honor system, and OS-level sandboxing has gaps. Until policy enforcement is machine-verifiable, agent safety depends on trust.
Responsible AI Agent Development - Building Agents That Do No Harm
How to build AI agents with safety guardrails, output validation, and scope limiting to prevent unintended actions and ensure responsible automation.
What It Means to Have a Human in the Loop
The human in the loop catches mistakes the agent does not know it is making. This is not supervision - it is a fundamentally different kind of error detection.
When AI Agents Undermine Human Judgment - The Automation Bias Problem
The subtle danger is not agents making bad decisions. It is agents making decisions that look good enough that humans stop thinking. Research on automation bias and how to design against it.
The Smart Knife Problem - Why AI Agents Should Be Tools, Not Autonomous Weapons
AI agents work best as tools with clear boundaries, not autonomous systems making decisions without oversight. The smart knife problem explained.
AI Agent Failure Rates and the Desktop Permissions Problem
AI agents fail more often than people think. When desktop agents can click anything and type anywhere, one hallucinated action can send emails or delete files.
AI Agent Security Is Backwards - Why Input Validation Matters More Than Output Verification
Most AI agent security focuses on verifying outputs - did the click land correctly? But unsigned, unvalidated inputs are the real attack surface.
Designing a Tiered Permission System for AI Desktop Agents
Full YOLO mode is dangerous and full approval mode is unusable. Tiered permissions with allowlists per action type hit the sweet spot.
How to Build AI Agents You Can Actually Trust - Bounded Tools and Approval UX
Giving AI agents broad system access is a recipe for disaster. How bounded tool interfaces and smart approval flows make desktop agents safe to use.
Prompt Injection and AI Agents - Why Browser-Based Agents Have a Bigger Attack Surface
AI agents that run inside the browser inherit whatever the page feeds them, including injection payloads. Native agents that interact from outside have a smaller attack surface.