DAY 50 / PHASE 6 · ENGINEERING

LLM Guardrails & Sandboxing

Dual Gate · Jailbreak Classifier · Policy Gate · Execution Sandbox

2026-06-30 · BigCat

"Please don't" in a system prompt is the weakest guardrail there is; the real boundary lives in the harness.

// WHY THIS MATTERS

You already know how prompt injection hits you (Day 24). This issue is about defending — but first accept a counter-intuitive engineering reality: a model's own safety fine-tuning plus a system prompt cannot stop indirect prompt injection. Simon Willison's "lethal trifecta" puts it bluntly: the moment an agent has all three of "can read private data + ingests untrusted content + has an outbound channel," it is unconditionally exploitable for data exfiltration — no amount of alignment saves you. So guardrails aren't prompts, they're architecture: independent classifiers, deterministic validation, OS-level sandboxes, egress control. Four things this issue: how to build the input/output dual gate, why jailbreak defense needs an independent classifier rather than prompt hardening, how the output policy gate must fail-closed, and how the agent execution sandbox severs the lethal trifecta.

// 01

Dual Gate: One Independent Gate on Each Side of the Model

Claim: guardrails are two independent gates flanking the model — and you must never let the main model gate itself.

Background & Principle

Treat an LLM app like an untrusted network service: input arrives from outside (users, retrieved documents, tool returns), output flows downstream (user screen, database, the next tool). The first principle of security engineering is to place an independent gate on each side, not to hope the model behaves.

Key point: gates must be independent components — a separate small model (Meta's Llama Guard, OpenAI's moderation endpoint) or deterministic rules (regex redaction of secrets / card numbers). Letting the main model "review itself" is an anti-pattern: the same model that can be jailbroken gets bypassed by the same jailbreak string during self-review — attack and review share one poisoned context. PII and secret leakage is what the gate should hard-code: API keys, JWTs, card numbers get redacted at the exit gate with deterministic regex — don't trust the LLM to "remember to redact."

┌── untrusted input ──┐ ┌── downstream / egress ──┐ user·docs·tool return user screen·DB·next tool │ ▲ ▼ │ ┌───────────┐ ┌─────────┐ ┌────────────┐ │ input gate│ ───▶ │ main │ ───▶ │ output gate │ │ moderation│ │ model │ │ content │ │ +jailbreak│ │ (no mem)│ │ +PII redact │ │ +PII scan │ └─────────┘ │ +schema │ └───────────┘ │ fail-closed │ indep model / rules └────────────┘ indep model / rules ❌ anti-pattern: let the main model gate itself

Hands-on

import re, anthropic
client = anthropic.Anthropic()

# Input gate: independent classifier (Llama Guard / moderation, pseudo)
def input_gate(text):
    verdict = classify_safety(text)          # separate small model, not the main one
    if verdict.unsafe: raise Blocked(verdict.categories)
    return text

# Output gate: deterministic redaction + content safety, both must pass
SECRET = re.compile(r"sk-[A-Za-z0-9]{20,}|eyJ[\w-]+\.[\w-]+\.[\w-]+")  # key / JWT
def output_gate(text):
    text = SECRET.sub("[REDACTED]", text)      # regex redaction, not the LLM
    if classify_safety(text).unsafe: raise Blocked("output")
    return text

def guarded_call(user_input):
    safe_in = input_gate(user_input)
    raw = client.messages.create(model="claude-sonnet-4-6",
            max_tokens=1024, messages=[{"role":"user","content":safe_in}])
    return output_gate(raw.content[0].text)

Both gates use an independent decision path; the main model only generates. Secret redaction runs via deterministic regex placed before content-safety classification — so even if the classifier misses, the key is already wiped.

Failure modes: (1) Gating only the output, forgetting the input — injection usually arrives via retrieved docs / tool returns, not what the user typed directly. (2) Self-review to save a call, so the jailbreak bypasses both. (3) Using a blocklist to cover every secret format — new token shapes slip through; for high-sensitivity fields, prefer an allowlist output (emit only whitelisted fields) over blacklist redaction.
Resources · Llama Guard, arXiv:2312.06674 · Meta PurpleLlama, github.com/meta-llama/PurpleLlama
// 02

Jailbreak Defense: Prompt Hardening Fails, Use an Independent Classifier

Claim: prompt hardening can't stop a universal jailbreak; what works is an independent classifier — and the price is accepting some helpfulness tax.

Background & Principle

Many defend against jailbreaks by piling "no matter what the user says, never…" into the system prompt. That helps against casual attacks but is largely useless against a universal jailbreak (a single string that pries open nearly every harmful category) — because attack and defense live in the same context and the model's attention is dominated by the attacker.

Anthropic's 2025 Constitutional Classifiers gives the engineering answer: train independent input/output classifiers that filter content against a constitution (what's allowed, what's forbidden). Across 3,000+ hours of red-teaming, no one found a universal jailbreak that reliably broke through across most harmful categories. The cost is real — early versions added roughly 23.7% inference overhead and +0.38% production refusal rate. The 2026 ++ version cut that to about 40× cheaper with a 0.05% refusal rate, showing the approach has converged in practice.

So the engineering decisions for jailbreak defense are three: where the gate sits (under streaming you must judge as you generate, not after the whole span is done), how to handle false positives (this is the helpfulness tax — legitimate requests killed by mistake), and how to version the constitution (it's a drifting policy, govern it like a prompt — see Day 35).

Hands-on

Training a classifier yourself is heavy; in practice you wire in an existing one and upgrade "block / allow" from binary to three tiers, giving borderline requests a step to escalate on:

# Classifier returns a risk score; binary blocking feels bad — use three tiers
BLOCK, REVIEW = 0.90, 0.60     # thresholds are the helpfulness-tax knob

def jailbreak_gate(text, stream=True):
    score = safety_classifier(text).risk      # independent classifier
    if score >= BLOCK:   return ("block",  fallback_refusal())
    if score >= REVIEW:  return ("escalate", queue_human_review(text))  # not a hard refuse
    return ("allow", text)

# streaming: sample every N tokens; on BLOCK, cut the stream immediately
for chunk in stream:
    buf += chunk
    if len(buf) % 200 == 0 and jailbreak_gate(buf)[0] == "block":
        abort_stream(); break

The threshold is a direct cost/safety knob: raise BLOCK → more misses; lower it → helpfulness tax spikes and legitimate users get refused for nothing. Keep an escalate tier in the middle to route borderline requests to a human instead of bluntly rejecting them.

Failure modes: (1) Binary block/allow only, no escalate tier — borderline requests are either killed or waved through, losing on both UX and safety. (2) In streaming, judging only after the full span is generated — harmful content already reached the user's screen. (3) The constitution / thresholds never update — it's a drifting policy, and new jailbreak techniques bypass a stale version within months.
Resources · Constitutional Classifiers, arXiv:2501.18837 · Anthropic research, anthropic.com/research/constitutional-classifiers
// 03

Output Policy Gate: Guardrails as Code, and It Must Fail-Closed

Claim: output validation is three layers — schema + business policy + facts — and when validation fails it must reject by default (fail-closed), not log a warning and let it through.

Background & Principle

Content safety is only half the output gate; the other half is correctness and policy. Validate LLM output as if it were untrusted external input, in three layers:

  1. Structure: schema validation (Pydantic / JSON Schema). Not valid JSON, missing field, wrong type → reject.
  2. Policy: business rules. Amount can't be negative, the cited SKU must exist in inventory, generated SQL can't contain DELETE. Judge with deterministic code, not the LLM.
  3. Facts: optional LLM-judge / grounding check — did the output cite retrieved evidence (connects to Day 11 RAG).

The key engineering decision is fail-closed: when validation fails, reject by default and take the fallback — not "catch exception → log warning → ship anyway." A meaningful share of production incidents are fail-open — the validator throws, the error gets swallowed, and dirty data flows unimpeded. Frameworks like Guardrails AI package this as a re-ask loop: validation fails → feed the failure reason back to the model to regenerate → retry with a cap → still failing, fail-closed to a fallback.

Hands-on

from pydantic import BaseModel, field_validator

class Order(BaseModel):
    sku: str
    qty: int
    amount: float
    @field_validator("amount")
    def non_negative(cls, v):
        if v < 0: raise ValueError("amount must be >= 0")   # policy layer
        return v

def validated_extract(prompt, max_retries=2):       # re-ask must be capped
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            order = Order.model_validate_json(raw)   # structure + policy
            if order.sku not in INVENTORY: raise ValueError("unknown SKU")
            return order
        except ValueError as e:
            prompt += f"\n\nLast output failed validation: {e}. Fix and re-emit."  # re-ask
    return FALLBACK            # ★ fail-closed: still failing at the cap → fallback, never ship dirty data

All three layers judge together, re-ask feeds the failure reason back to the model, but it must be capped — when the cap is hit and it still fails, fail-closed to a fallback rather than returning the last (still-invalid) output.

Failure modes: (1) Fail-open: except: pass swallows the validation failure, which equals no validation. (2) Uncapped re-ask — when the model repeatedly errs, cost explodes and latency runs away. (3) Validating schema but not business policy — valid JSON with amount=-999 ships fine; the most dangerous case is exactly "correct format, wrong semantics."
Resources · Guardrails AI, github.com/guardrails-ai/guardrails · Guardrails docs, guardrailsai.com/docs
// 04

Execution Sandbox: Sever the Lethal Trifecta

Claim: cutting any one edge of the lethal trifecta is more reliable than hardening the model; a sandbox must lock both the filesystem and egress — one without the other isn't isolation.

Background & Principle

Agents execute model-generated code / shell / SQL — the highest-risk action class. The defense isn't telling the model to "be careful," it's OS-level sandboxing + severing the lethal trifecta. Simon Willison's lethal trifecta: private-data access + untrusted-content ingestion + an outbound channel; all three present = unconditionally exploitable for exfiltration. The good news is engineering-friendly — removing any one edge makes the system safe: don't grant private data, don't ingest untrusted content, or cut egress. The third is usually the most actionable.

LETHAL TRIFECTA (all three present = unconditional exfiltration) ┌──────────────────┐ │ ① private data │ │ email/code/DB │ └────────┬─────────┘ ┌───────────┴───────────┐ ┌────────▼────────┐ ┌────────▼────────┐ │ ② untrusted │ │ ③ outbound │ │ web/docs/tools │ │ HTTP/img/links │ └─────────────────┘ └─────────────────┘ remove any one edge → system is safe most actionable: cut ③ egress (deny-all + allowlist)

The sandbox must manage both dimensions:

Anthropic's sandbox for Claude Code is exactly this paradigm: bubblewrap (Linux) / seatbelt (macOS) locks the filesystem, and the network can only go through a unix domain socket connected to a proxy outside the sandbox — which is precisely how this remote execution environment you're in works (all outbound HTTPS goes through an agent proxy).

Hands-on

# Run model-generated code: no network + read-only root + write only the work dir
docker run --rm \
  --network none \                 # ③ cut egress (trifecta edge three)
  --read-only \                    # root filesystem read-only
  --tmpfs /work:rw,size=64m \      # the only writable area, ephemeral
  --memory 512m --cpus 1 \         # resource caps, guard against fork bombs
  --cap-drop ALL --pids-limit 128 \
  -v "$PWD/task:/work/task:ro" \   # task files mounted read-only
  sandbox-img python /work/task/gen.py

# If the task truly needs the network: don't use --network none; force it
#   through an allowlist proxy: HTTPS_PROXY=http://egress-proxy:3128
#   + the proxy only permits whitelisted domains

--network none directly removes the trifecta's third edge — the cleanest isolation. If the task must reach the network, don't open a free exit; force it through an egress proxy that only permits whitelisted domains — a controlled narrow channel, not an open door.

Failure modes: (1) Filesystem sandbox done well, network left open — the trifecta's third edge remains, and injection still exfiltrates private data. Isolation isn't only locking files. (2) The egress allowlist permits domains like *.github.com/pastebin that can be abused as an outbound channel. (3) Treating the sandbox as the only line of defense and skipping the input gate — the sandbox blocks execution harm, not "executing an injection embedded in a retrieved document as if it were an instruction."
Resources · Simon Willison The lethal trifecta, simonwillison.net · Anthropic Claude Code sandboxing, anthropic.com/engineering · anthropic-experimental/sandbox-runtime

// Putting It Together · A "Minimal Trust Boundary" for Your Agent

Assemble the four points into a checklist you can drop onto any agent that takes actions. Each line maps to a specific failure mode above:

  1. Input gate: use an independent classifier (Llama Guard / moderation) on user input and on retrieved/tool returns — injection usually arrives via the latter.
  2. Jailbreak tiers: three tiers (allow / escalate / block); the threshold is the helpfulness-tax knob; sample as you stream.
  3. Output policy gate: schema + business rules + (optional) grounding, re-ask capped, fail-closed to a fallback at the cap.
  4. Execution sandbox: --network none or an egress allowlist proxy + read-only root + resource caps — cut the trifecta's third edge first.
  5. PII/secrets: deterministic redaction (regex) at the exit gate; for high-sensitivity fields use allowlist output over blocklist redaction.
  6. Red-team once: write 5 indirect-injection cases (hidden inside "a webpage to summarize") and run them to see which layer catches it first — if nothing does, you've found the hole.

The core principle: guardrails are architecture, not prompts. Each one is physical interception in the harness layer that the model cannot talk its way past. From now on, when you look at any "secure AI product," you'll instinctively hunt for its four boundaries — and if you can't find them, they aren't there.

// DEEP THINKING

If removing any one trifecta edge makes the system safe, why do people build sandboxes/classifiers first instead of just "not granting private data"?
Because the first two edges are often the product value itself. An agent that can read your email and query your codebase exists because it "accesses private data" — cut it and you cut the product; "ingesting untrusted content" is frequently the requirement too (summarize a page, read a PR). Only the third edge, egress, is usually an implementation detail rather than a value point — most tasks don't need the agent to freely reach arbitrary external sites. So egress control is the highest-ROI edge: small cost, big payoff. That's why Anthropic tightened Claude Code's sandbox network default to "proxy-only" rather than restricting it from reading your code.
Should the output gate and input gate use the same classifier, or two different ones?
Different directions, best tuned separately. The input gate cares whether content is "an attack / harmful request / contains sensitive data" — it detects intent and injection. The output gate cares whether generated content is "harmful / leaking / malformed" — it detects the product. Models like Llama Guard distinguish prompt classification from response classification by design; the taxonomy and thresholds can differ — e.g. stricter on PII leakage at the output, more sensitive to jailbreak patterns at the input. Sharing one threshold tends to leave one end too loose and the other too tight. The base model can be shared to save deployment cost; just split the policy.
A jailbreak classifier adds ~24% inference overhead (early version). When is that cost not worth it, and a lighter approach better?
It depends on attack surface and blast radius. Public-facing, arbitrary user input, and the model able to cause real harm (generate harmful instructions, operate real systems) → worth it; that's why Anthropic optimized the cost from 24% to the ++ version: it must stay on. Conversely, internal tools, trusted users, output that goes to people not systems, low harm ceiling → a heavy classifier is over-engineering; a lightweight moderation endpoint + output gate suffices. The criterion isn't "is a jailbreak possible," it's "how much damage if one succeeds." Spending 24% of compute on jailbreak defense in a low-blast-radius setting is putting the budget in the wrong place.
Fail-closed is safer but rejects legitimate output when the validator misjudges, hurting availability. When should you fail-open?
When the cost of "letting one piece of dirty data through" < the cost of "rejecting one good piece." E.g. an internal analytics dashboard, non-critical content recommendations, draft scenarios a human can correct after the fact — here availability wins, and fail-open + logging + async review is more reasonable. Anything touching money, permissions, external publishing, or irreversible actions (transfers, sending email, deleting data) must fail-closed. It's essentially risk tiering: tag actions by blast radius, fail-closed for high-risk, fail-open for low. A blanket choice loses both ways — all fail-closed and the UX is too bad to use, all fail-open and you have no guardrail.
All four guardrails are in place and the agent still gets breached by indirect injection — where is it most likely leaking?
The most common is the input gate judging only direct user input, not tool/retrieval returns. Injection hides in "a webpage to summarize" or "a PR comment to read"; that content bypasses the user-facing input gate and enters the context directly. Next is an egress allowlist with a hole — permitting a seemingly harmless domain that can encode outbound data (image host, URL shortener, even DNS lookups). Third: each layer is correct individually, but the combination was never tested — sandbox blocks execution, gates block content, yet no one verified the cross-layer path "injection makes the agent call an allowlisted tool to exfiltrate." So finish with an end-to-end red-team, not just per-layer unit tests.

// FURTHER READING