Every Layer Locked Down: Security Beyond the Sandbox

Sandboxed execution was the start. We added JWT-proxied API calls, file write restrictions, mandatory skill approval, and agent behavior guardrails — here's how each layer works.

We caught a sandbox task leaking an API key through an LLM prompt. The agent wasn’t doing anything malicious. It was summarizing a project’s configuration files and included the contents of .env in its context window. The LLM provider logged the prompt. Our API key was sitting in someone else’s database.

The sandbox did its job. The agent couldn’t touch the host system, couldn’t read other users’ files, couldn’t phone home to some attacker’s server. But it still managed to exfiltrate a secret through the one channel we’d left wide open: the LLM call itself.

That’s when we stopped thinking about security as a single wall and started building layers.

The sandbox is the floor, not the ceiling

We’ve talked about why we bet everything on E2B sandboxed execution. That decision still holds. Every agent task runs in an isolated container that gets destroyed after the task completes. No persistent access, no shared state between users, no way to compromise the host.

But a sandbox only protects against one category of threat: unauthorized system access. Inside the sandbox, an agent can still do real damage. It can leak secrets through API calls. It can overwrite workspace files that took hours to create. It can execute skill code that does something very different from what its description promises. We needed more layers.

Layer 2: JWT-proxied API calls

The most obvious attack surface was the LLM API key. If an agent has the key, it can send it anywhere — embedded in a prompt, appended to a URL, stuffed into a file that gets uploaded later.

So we removed the key entirely. Since February 18, sandboxes never see real API credentials. When an agent needs to call an LLM, it gets a short-lived JWT from our proxy server. The token is valid for 10 minutes and scoped to that specific sandbox session. The proxy server holds the actual API key and forwards the request.

If a JWT leaks, the attacker has at most 10 minutes (whatever remains of the token's lifetime) and can only use it for the same type of request the sandbox was already making. Not great, but dramatically better than a permanent API key with full account access.
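A minimal sketch of how a short-lived, session-scoped token could be minted and checked, using only Python's standard library. The function names (mint_token, verify_token), the claim names, and the HS256 signing choice are assumptions made for illustration; they are not taken from the actual proxy.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

TOKEN_TTL_SECONDS = 10 * 60  # the 10-minute lifetime described above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def mint_token(secret: bytes, session_id: str, scope: str,
               now: Optional[float] = None) -> str:
    """Proxy side: sign a token bound to one sandbox session and one scope."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": session_id, "scope": scope,
              "iat": issued, "exp": issued + TOKEN_TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: bytes, token: str, session_id: str, scope: str,
                 now: Optional[float] = None) -> bool:
    """Proxy side: accept only an unexpired token for this session and scope."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
        expected = _sign(secret, header_b64 + "." + claims_b64)
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return False
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError):
        return False  # malformed token
    if not isinstance(claims, dict):
        return False
    expires = claims.get("exp")
    if not isinstance(expires, (int, float)):
        return False
    current = time.time() if now is None else now
    return (claims.get("sub") == session_id
            and claims.get("scope") == scope
            and current < expires)
```

In this sketch the signing secret and the real provider key both stay on the proxy. The sandbox only ever holds the signed token, which stops verifying once the exp claim has passed.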

We extended this pattern to image generation too. As of March 15, Replicate API calls for cover images and other media go through the same proxy. One pattern, multiple providers, zero raw keys in sandboxes.

Layer 3: File write restrictions

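The .env leak that started this whole effort pointed to an obvious fix: don’t let agents touch sensitive files. Since February 6, .env and .ssh files are read-only to agents. They can’t create them, modify them, or delete them. Read-only access does not stop an agent from reading a secret into its context, so this rule protects those files from tampering; keeping real API keys out of the sandbox in the first place is the job of the proxy in Layer 2.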

But file protection goes beyond just blocking sensitive paths. We enforce designated directory restrictions — agents can only write to specific workspace locations. And as of March 26, we added stricter file creation discipline across all agent layers. Background agents follow explicit rules about where and when they can create files.
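A sketch of what such a write check could look like, in Python. The function name is_write_allowed and the WRITABLE_DIRS list are hypothetical; the post does not list the real directories or describe how the check is implemented.

```python
from pathlib import Path

# From the post: .env and .ssh are read-only to agents.
PROTECTED_NAMES = {".env", ".ssh"}
# Hypothetical designated directories; the real list is not given in the post.
WRITABLE_DIRS = ("notes", "output")


def is_write_allowed(workspace: Path, target: str) -> bool:
    """Allow a write only inside a designated directory of the workspace,
    and never to a path that contains a protected name."""
    root = workspace.resolve()
    resolved = (root / target).resolve()  # collapses ".." and follows symlinks
    try:
        relative = resolved.relative_to(root)
    except ValueError:
        return False  # the path escapes the workspace
    if any(part in PROTECTED_NAMES for part in relative.parts):
        return False
    return bool(relative.parts) and relative.parts[0] in WRITABLE_DIRS
```

Resolving the path before checking it matters here: a check on the raw string would let something like notes/../../etc/passwd slip through.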

This sounds simple. It was not. Getting file restrictions right across every agent type, every execution path, and every edge case took weeks of iteration. And when restrictions aren’t enough, file versioning acts as the safety net — every edit is recoverable.

Layer 4: Mandatory skill approval

When we scanned every skill on ClawHub, we found 341 skills with at least one red flag: obfuscated code, outbound network calls to unknown servers, or file operations outside the workspace.

The fix wasn’t just flagging bad skills. It was making approval mandatory for all skills — especially important now that the v3.0 skills-first architecture routes every task through a skill. Since February 14, every skill requires explicit user approval before it can execute. No exceptions. No “trusted publisher” bypass. No auto-approve for popular skills.

On February 15, we shipped the source code review modal. Before you approve a skill, you can read every line of code it will run. This is the opposite of how most marketplaces work — they hide the code and ask you to trust a rating system. We show the code and let you decide.
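One way to picture an approval gate with no bypass is the Python sketch below. The class and method names are hypothetical, and keying each approval to a hash of the reviewed source (so that changed code needs fresh approval) is an assumption of this sketch, not something the post states.

```python
import hashlib


class SkillNotApproved(Exception):
    """Raised when a skill is run without this user's explicit approval."""


class ApprovalGate:
    def __init__(self) -> None:
        # Set of (user_id, sha256 of skill source) pairs.
        self._approved = set()

    @staticmethod
    def _digest(source: str) -> str:
        return hashlib.sha256(source.encode("utf-8")).hexdigest()

    def approve(self, user_id: str, source: str) -> None:
        """Record that this user reviewed and approved exactly this source."""
        self._approved.add((user_id, self._digest(source)))

    def run(self, user_id: str, source: str, runner):
        """Call runner(source) only if this user approved this exact source.

        There is deliberately no publisher allowlist and no popularity check:
        the only way past this line is a recorded approval.
        """
        if (user_id, self._digest(source)) not in self._approved:
            raise SkillNotApproved("explicit user approval required")
        return runner(source)
```

Because approvals are stored per user, one user's approval never carries over to another, which matches the "individual user approval" rule described above.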

Layer 5: Agent behavior guardrails

The subtlest problems weren’t about malice. They were about carelessness. Background agents would create duplicate files instead of updating existing ones. Agents would hallucinate dates for files they created. Workspaces would slowly fill with redundant content that nobody asked for.

Since March 19, background agents follow a REUSE-FIRST rule: before creating any new file, check if a relevant file already exists. Read first, then decide whether to create or update. This cut duplicate file creation significantly.
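The rule can be sketched as a small planning step that runs before any write. The function name plan_write is hypothetical, and treating "relevant" as a case-insensitive match on the file stem is an assumption of this sketch; the post does not say how relevance is decided.

```python
from pathlib import Path
from typing import Tuple


def plan_write(directory: Path, name: str) -> Tuple[str, Path]:
    """Decide between updating an existing file and creating a new one.

    Returns ("update", existing_path) when a file with the same stem already
    exists (ignoring case and extension), otherwise ("create", new_path).
    Nothing is written here; the caller reads the existing file first and
    then decides what to do with it.
    """
    wanted = Path(name).stem.lower()
    if directory.is_dir():
        for existing in sorted(directory.iterdir()):
            if existing.is_file() and existing.stem.lower() == wanted:
                return "update", existing
    return "create", directory / name
```

Keeping the decision separate from the write means the agent has a chance to read the existing file before touching it, which is the "read first" half of the rule.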

On March 24, we added date awareness enforcement. Agents now pull the actual current date instead of guessing. No more files timestamped three months in the future.
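As an illustration, date enforcement can be as simple as reading the clock and refusing dates that lie ahead of it. The function names are hypothetical, and rejecting future dates is an assumption of this sketch; the post only says agents pull the actual current date.

```python
from datetime import date, datetime, timezone
from typing import Optional


def current_date(now: Optional[datetime] = None) -> date:
    """Take today's date (UTC) from the system clock, not from model output."""
    return (now or datetime.now(timezone.utc)).date()


def check_claimed_date(claimed: str, now: Optional[datetime] = None) -> date:
    """Parse an ISO date proposed by an agent; reject it if it is in the future."""
    parsed = date.fromisoformat(claimed)  # raises ValueError on malformed input
    today = current_date(now)
    if parsed > today:
        raise ValueError(f"{claimed} is after today ({today.isoformat()})")
    return parsed
```

The now parameter exists only so the check can be tested against a fixed clock.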

These aren’t flashy security features. They’re the kind of unglamorous discipline that keeps a system reliable at the pace we ship.

How the layers compose

Each layer catches what the layer below misses:

The sandbox stops system-level access. But agents can still leak keys through API calls.
JWT proxying stops key leakage. But agents can still damage workspace files.
File restrictions stop sensitive file damage. But agents can still run malicious skill code.
Skill approval stops malicious code. But agents can still behave carelessly.
Behavior guardrails stop carelessness.

No single layer is sufficient. A sandbox without JWT proxying leaks keys. JWT proxying without file restrictions still lets agents clobber your workspace. Skill approval without behavior guardrails still lets agents create a mess.

Defense in depth is not a buzzword here. It’s literally five layers, each addressing a different failure mode, each implemented because we saw the failure mode happen in production.

What’s next

We’re working on more granular permission scoping — letting users control exactly which directories an agent can access, which network endpoints it can reach, and which operations it can perform within a skill. We’re also building audit logging so you can see exactly what every agent did, in what order, and why.

Security isn’t a feature you ship once. It’s a practice you maintain. We’ve got five layers now. We’ll probably need more.

Questions about security

Why not just trust the sandbox?

Sandboxes prevent system-level damage, but they don't prevent all harm. An agent inside a sandbox can still leak API keys through LLM calls, overwrite important workspace files, or execute malicious skill code. Defense in depth means each layer catches what the layer below misses.

How do JWT-proxied API calls work?

When a sandbox needs to call an LLM, it requests a short-lived JWT from the proxy server. The JWT is valid for 10 minutes and scoped to that specific sandbox session. The proxy forwards the request to the real API using the actual key, which the sandbox never sees. Even if the JWT leaks, it's useless after expiration.

What happens if a skill fails the approval check?

It doesn't run. Period. There's no bypass, no 'trust this publisher' shortcut. Every skill, including recommended ones, requires individual user approval. You can read the full source code before deciding.

How does the REUSE-FIRST rule work for background agents?

Before creating any new file, background agents must check if a relevant file already exists in the workspace. This prevents duplicate files, accidental overwrites, and the slow workspace bloat that happens when agents keep creating new files instead of updating existing ones.