Sandboxing AI Agents: Why Container Isolation Isn't Enough
The most common answer to "how do we safely run AI agents?" is "put them in containers." It's not wrong. Container isolation limits the process surface, restricts filesystem access, and makes it harder for a misbehaving agent to affect the host. These are real benefits.
But containers sandbox the process, not the decisions. And with AI agents, the decisions are where the risk lives.
An agent running in a locked-down container can still drop your production database if it has credentials and a network path to it. It can still exfiltrate your codebase if it can reach an outbound HTTP endpoint. It can still delete six months of customer records if you gave it write access to the right bucket. The container didn't stop any of that — and it wasn't designed to.
This is the gap that most sandboxing strategies miss. Here's what actually needs to be isolated when you're running AI agents in production.
What Containers Actually Isolate
To be clear about what you're getting from container isolation:
- Process namespace: The agent can't see or signal other processes on the host
- Filesystem: Controlled mount points — the agent only sees what you explicitly give it
- System calls: Seccomp profiles can block dangerous syscalls (ptrace, mount, etc.)
- Resource limits: CPU, memory, and disk quotas prevent resource exhaustion
- Network namespace: With the right configuration, outbound traffic can be restricted
These are genuinely useful controls. If an agent tries to install a kernel module or fork-bomb the host, containers help. The problem is that AI agents don't usually attack the host directly. They use the access they've been legitimately given to do things you didn't intend.
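The container-level baseline above can be assembled into a hardened launch. Here's a sketch that builds the `docker run` arguments; the flags are standard Docker options, while the image name, seccomp profile path, network name, and limits are placeholder assumptions for illustration:

```python
# Sketch: assembling a hardened `docker run` invocation for an agent
# container. Flags are standard Docker options; the image name, seccomp
# profile path, network name, and limits are placeholders.

def hardened_run_args(image: str) -> list[str]:
    return [
        "docker", "run", "--rm",
        "--read-only",                          # read-only root filesystem
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--security-opt", "seccomp=/etc/docker/agent-seccomp.json",  # placeholder path
        "--pids-limit", "256",                  # cap process count (fork bombs)
        "--memory", "1g",                       # resource quotas
        "--cpus", "1.0",
        "--network", "agent-egress",            # custom network, not default bridge
        image,
    ]

print(" ".join(hardened_run_args("agent-runtime:latest")))
```

Every flag here maps to one of the bullets above; note that nothing in this list constrains what the agent does with credentials it holds inside the container.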
The Decision Surface: What Containers Don't Touch
Consider an agent that has been given:
- A database connection string (to read and write application data)
- An S3 bucket credential (to store generated files)
- SSH access to a staging environment (to test deployments)
- A GitHub token (to open PRs and read code)
The container boundary doesn't protect any of those. The agent operates with those credentials from inside the container, and if it decides to run DELETE FROM users WHERE created_at < '2024-01-01' or aws s3 rm s3://production-backups --recursive, the container is silent. It let the agent execute. The damage happened in the external systems the agent reached through its granted credentials.
The decision surface — the set of actions the agent can choose to take — is entirely outside what container isolation controls.
The Three Layers That Actually Need Sandboxing
1. Command authorization
Every command the agent wants to execute should pass through an authorization layer before it runs. Not a capability check ("can this process execute bash?"), but a semantic check ("should this specific command run in this context, against this target, at this time?").
This is what a command approval gateway does. The agent submits a command. The gateway evaluates it against an allowlist, applies risk scoring, and either auto-approves, routes to a human reviewer, or denies. The command doesn't execute until that decision is made.
Container isolation can't do this. It operates at the process/syscall level, not at the semantic level of "what is this command actually trying to do?"
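A minimal sketch of such a gateway, assuming illustrative allowlist patterns, risk weights, and thresholds (a real policy would be far richer):

```python
# Sketch of a command authorization gateway: allowlist check first,
# then additive risk scoring with three possible outcomes. Patterns,
# weights, and thresholds are illustrative assumptions.
import re
from enum import Enum

class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    HUMAN_REVIEW = "human_review"
    DENY = "deny"

ALLOWLIST = [r"^git (status|diff|log)\b", r"^ls\b", r"^cat \S+$"]

RISK_PATTERNS = [
    (r"\bDELETE FROM\b", 40),
    (r"\bDROP\b", 50),
    (r"^rm\b.*-r", 40),
    (r"\baws s3 rm\b", 45),
    (r"--recursive", 20),
]

def authorize(command: str) -> tuple[Decision, int]:
    if any(re.search(p, command) for p in ALLOWLIST):
        return Decision.AUTO_APPROVE, 0
    score = sum(w for p, w in RISK_PATTERNS
                if re.search(p, command, re.IGNORECASE))
    if score >= 90:
        return Decision.DENY, score          # or escalate to multi-party review
    # Anything not explicitly allowlisted gets at least a human look.
    return Decision.HUMAN_REVIEW, score

print(authorize("aws s3 rm s3://production-backups --recursive"))
```

The key design point is the default: unknown commands route to review rather than executing, which is the inverse of what a container gives you.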
2. Credential scope and lifetime
The credentials you give an agent define its blast radius. A database credential with full read-write access to all tables creates a much larger blast radius than one scoped to a specific schema, exposing only the tables the agent legitimately needs.
Effective sandboxing at the credential layer means:
- Scoped credentials: read-only where write isn't needed, schema-scoped for databases, prefix-limited for object storage
- Short-lived credentials: tokens that expire in minutes or hours, not permanent API keys
- Session-scoped credentials: a fresh, limited credential issued per agent session, revocable when the session ends
This is runtime credential sandboxing. It doesn't prevent the agent from attempting a destructive action, but it limits the damage if the attempt succeeds.
3. Network egress control
Containers can restrict network egress, but most deployments don't configure this carefully. A common mistake is putting the agent on the same network segment as production databases "for convenience" and then relying on the agent to make good decisions about which queries to run.
Network sandboxing for agents should be explicit and minimal:
- Only the specific endpoints the agent legitimately needs should be reachable
- Production databases should be behind an additional approval gate, not just network-accessible from the agent container
- Outbound internet access should be explicitly allowed per-destination, not broadly permitted
This limits the exfiltration surface. An agent that's been manipulated through prompt injection can't easily send data to an attacker-controlled endpoint if the only outbound connections allowed go to your own services.
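A default-deny egress check might look like this sketch; the hostnames are example values, and real enforcement belongs in the network layer (a proxy or firewall) rather than application code:

```python
# Sketch of a default-deny egress check: only hostnames on an explicit
# allowlist are reachable. Hostnames here are example values.
from urllib.parse import urlparse

EGRESS_ALLOWLIST = {
    "api.internal.example.com",
    "artifacts.internal.example.com",
}

def egress_allowed(url: str) -> bool:
    host = urlparse(url).hostname
    # Exact-match against the allowlist; anything else is denied,
    # including lookalike domains with an allowed name as a prefix.
    return host in EGRESS_ALLOWLIST

print(egress_allowed("https://attacker.example.net/exfil"))
```

Note the exact-match comparison: suffix or prefix matching would let `api.internal.example.com.evil.net` through.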
Sandboxing Levels by Risk
| What the agent can reach | Container isolation | Command authorization | Credential scope | Network egress control |
|---|---|---|---|---|
| Read-only data | ✓ | Optional | Read-only creds | Internal only |
| Write to non-critical data | ✓ | Recommended | Scoped write creds | Internal only |
| Write to customer data | Required | Required | Row/table-scoped creds | Allowlist only |
| Production infrastructure | Required | Required + human review | Session-scoped, short-lived | Strict allowlist |
| Financial systems / billing | Required | Required + multi-party | Minimal, per-operation | Strict allowlist |
Notice that container isolation appears in every row — it's a baseline, not a solution. The controls that vary by risk are command authorization, credential scope, and egress. Those are the layers that actually track the threat model for AI agents.
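One way to make the matrix above enforceable is to encode it as policy data; the tier names and field values here are illustrative:

```python
# The risk matrix as policy data. Container isolation is assumed as the
# baseline for every tier, so only the varying controls are encoded.
# Tier and field names are illustrative assumptions.
CONTROLS_BY_TIER = {
    "read_only":        {"authorization": "optional",
                         "credentials": "read-only",
                         "egress": "internal-only"},
    "write_noncritical": {"authorization": "recommended",
                          "credentials": "scoped-write",
                          "egress": "internal-only"},
    "customer_data":    {"authorization": "required",
                         "credentials": "table-scoped",
                         "egress": "allowlist"},
    "production_infra": {"authorization": "required+human-review",
                         "credentials": "session-scoped",
                         "egress": "strict-allowlist"},
    "financial":        {"authorization": "required+multi-party",
                         "credentials": "per-operation",
                         "egress": "strict-allowlist"},
}

def required_controls(tier: str) -> dict:
    # Unknown tiers fail closed: treat them as the highest-risk tier.
    return CONTROLS_BY_TIER.get(tier, CONTROLS_BY_TIER["financial"])
```

Treating the matrix as data keeps policy reviewable and makes "fail closed" the default for anything not yet classified.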
The Depth Problem: Defense in Depth Requires Multiple Layers
The reason this matters practically is that every sandboxing layer has failure modes:
- Container escape vulnerabilities exist and get patched — relying solely on container isolation means a CVE can change your risk posture overnight
- Command authorization bypasses can happen through prompt injection, where an attacker embeds instructions in data the agent processes — your allowlist may not catch creative variants
- Credential scope errors are common — someone grants broader access than needed "temporarily" and it becomes permanent
- Egress misconfigurations are common — broad "outbound allowed" rules that weren't tightened after initial setup
Defense in depth means any single layer failing doesn't compromise the whole system. An agent that escapes its container but hits command authorization before any command executes is still contained. An agent that bypasses command authorization but has only read credentials can't cause write damage. Each layer catches what the previous one misses.
Containers are one layer of a four-layer defense. Treating them as the primary sandbox is like relying solely on TLS and not implementing application-layer auth — it's not wrong, it's just not enough.
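The layering can be sketched as a chain of independent predicates where any single layer can veto execution; the predicates here are toy stand-ins for the real controls described above:

```python
# Sketch: defense in depth as a chain of independent checks. A command
# executes only if every layer allows it, so one surviving layer is
# enough to stop a bad action. The lambdas are toy stand-ins.
from typing import Callable

Layer = Callable[[str], bool]   # True = this layer allows the command

def defense_in_depth(command: str, layers: list[Layer]) -> bool:
    return all(layer(command) for layer in layers)

layers: list[Layer] = [
    lambda cmd: True,                       # container: isolation holds
    lambda cmd: "DROP" not in cmd.upper(),  # authorization gateway
    lambda cmd: not cmd.lstrip().upper().startswith(("DELETE", "UPDATE")),
                                            # read-only credentials
    lambda cmd: "http" not in cmd,          # egress control
]

print(defense_in_depth("SELECT * FROM users LIMIT 10", layers))
```

Because the layers are independent, a bypass of any one of them (including the container) still leaves three chances to stop the action.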
A Practical Baseline for Agent Sandboxing
If you're deploying AI agents in production today, here's the minimum viable sandbox:
- Container isolation — with seccomp profile, no privileged mode, read-only root filesystem where possible, explicit volume mounts only
- Command authorization gateway — every shell/script execution, every database write operation, every destructive API call routes through an approval layer before executing
- Scoped credentials — no agent gets credentials broader than what it demonstrably needs; review and tighten on a regular cadence
- Egress allowlist — explicit permit-list for outbound connections; default-deny everything else
- Session audit trail — full record of what the agent attempted, what was approved or denied, and what executed; queryable for incident review
That's four controls beyond container isolation. Each one addresses a failure mode that containers don't cover.
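The audit-trail control can be sketched minimally like this; field names are illustrative, and a production system would persist events to append-only storage:

```python
# Sketch of a session audit trail: an append-only log of every attempted
# command with its outcome, queryable after the fact for incident review.
# Field names are illustrative assumptions.
import json
import time

class AuditTrail:
    def __init__(self) -> None:
        self._events: list[dict] = []   # append-only; entries never mutate

    def record(self, session_id: str, command: str,
               decision: str, executed: bool) -> None:
        self._events.append({
            "ts": time.time(),
            "session_id": session_id,
            "command": command,
            "decision": decision,       # approved / denied / auto-approved
            "executed": executed,
        })

    def query(self, session_id: str) -> list[dict]:
        return [e for e in self._events if e["session_id"] == session_id]

trail = AuditTrail()
trail.record("agent-42",
             "DELETE FROM inactive_users WHERE last_login < '2025-01-01'",
             decision="denied", executed=False)
print(json.dumps(trail.query("agent-42"), indent=2))
```

Recording attempts as well as executions is the point: the denied commands are often the most informative part of an incident review.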
What This Looks Like in Practice
An agent that needs to manage customer data might run like this with proper sandboxing:
Agent submits:
    DELETE FROM inactive_users WHERE last_login < '2025-01-01'

Command authorization gateway:
    risk_score: 87 (bulk deletion, customer data, irreversible)
    allowlist match: none → routed to human reviewer

Reviewer sees:
    command: DELETE FROM inactive_users WHERE last_login < '2025-01-01'
    estimated rows: 12,400
    reversible: no
    risk: CRITICAL

Reviewer: Deny (wrong table; should be inactive_test_users)

Agent receives: denied; command rejected by reviewer. No rows deleted.
The container didn't stop this. The credential scope didn't stop it (the agent had legitimate write access). The egress allowlist didn't stop it (the database was on the internal network). The command authorization gateway stopped it — because a human saw what was about to happen and said no.
That's the layer containers can't replace.
Summary
Container isolation is a necessary starting point for running AI agents safely. It's not sufficient. The threat model for AI agents isn't about process escapes or syscall exploits — it's about autonomous decisions made with legitimate credentials against external systems. The sandboxing layers that address that threat model are command authorization, credential scoping, and network egress control. Build all four layers. Treat them as defense in depth. No single layer is enough on its own.