"Please don't" in a system prompt is the weakest guardrail there is; the real boundary lives in the harness.
You already know how prompt injection hits you (Day 24). This issue is about defending — but first accept a counter-intuitive engineering reality: a model's own safety fine-tuning plus a system prompt cannot stop indirect prompt injection. Simon Willison's "lethal trifecta" puts it bluntly: the moment an agent has all three of "can read private data + ingests untrusted content + has an outbound channel," it is unconditionally exploitable for data exfiltration — no amount of alignment saves you. So guardrails aren't prompts, they're architecture: independent classifiers, deterministic validation, OS-level sandboxes, egress control. Four things this issue: how to build the input/output dual gate, why jailbreak defense needs an independent classifier rather than prompt hardening, how the output policy gate must fail-closed, and how the agent execution sandbox severs the lethal trifecta.
Treat an LLM app like an untrusted network service: input arrives from outside (users, retrieved documents, tool returns), output flows downstream (user screen, database, the next tool). The first principle of security engineering is to place an independent gate on each side, not to hope the model behaves.
Key point: each gate must be an independent component, either a separate small model (Meta's Llama Guard, OpenAI's moderation endpoint) or deterministic rules (regex redaction of secrets / card numbers). Letting the main model "review itself" is an anti-pattern: a model that can be jailbroken gets bypassed by the same jailbreak string during self-review, because attack and review share one poisoned context. PII and secret leakage are what the exit gate should hard-code: API keys, JWTs, and card numbers get redacted with deterministic regex. Don't trust the LLM to "remember to redact."
import re
import anthropic

client = anthropic.Anthropic()

class Blocked(Exception):
    """Raised when either gate rejects the text."""

# Input gate: independent classifier (Llama Guard / moderation, pseudo).
# classify_safety stands in for that separate small model; it is not defined here.
def input_gate(text):
    verdict = classify_safety(text)  # separate small model, not the main one
    if verdict.unsafe:
        raise Blocked(verdict.categories)
    return text

# Output gate: deterministic redaction + content safety, both must pass
SECRET = re.compile(r"sk-[A-Za-z0-9]{20,}|eyJ[\w-]+\.[\w-]+\.[\w-]+")  # key / JWT

def output_gate(text):
    text = SECRET.sub("[REDACTED]", text)  # regex redaction, not the LLM
    if classify_safety(text).unsafe:
        raise Blocked("output")
    return text

def guarded_call(user_input):
    safe_in = input_gate(user_input)
    raw = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": safe_in}],
    )
    return output_gate(raw.content[0].text)
Both gates use an independent decision path; the main model only generates. Secret redaction runs via deterministic regex placed before content-safety classification — so even if the classifier misses, the key is already wiped.
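The SECRET pattern above covers keys and JWTs but not the card numbers mentioned earlier. A minimal sketch of how that could be added, assuming a Luhn checksum is an acceptable filter to avoid redacting every long digit run; CARD_CANDIDATE, luhn_ok, and redact_cards are illustrative names, not part of any library:

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text: str) -> str:
    """Replace Luhn-valid card-like numbers with a placeholder; leave other digit runs alone."""
    def _sub(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        return "[REDACTED_CARD]" if luhn_ok(digits) else match.group()
    return CARD_CANDIDATE.sub(_sub, text)
```

In an exit gate this would run next to the SECRET substitution, before the classifier call, so the same "already wiped" guarantee applies to card numbers.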
Many defend against jailbreaks by piling "no matter what the user says, never…" into the system prompt. That helps against casual attacks but is largely useless against a universal jailbreak (a single string that pries open nearly every harmful category) — because attack and defense live in the same context and the model's attention is dominated by the attacker.
Anthropic's 2025 Constitutional Classifiers work gives the engineering answer: train independent input/output classifiers that filter content against a constitution (what's allowed, what's forbidden). Across 3,000+ hours of red-teaming, no one found a universal jailbreak that reliably broke through across most harmful categories. The cost is real: early versions added roughly 23.7% inference overhead and a 0.38-percentage-point increase in production refusal rate. The 2026 "++" version made it about 40× cheaper with a 0.05% refusal rate, a sign that the approach has become practical for production.
So the engineering decisions for jailbreak defense are three: where the gate sits (under streaming you must judge as you generate, not after the whole span is done), how to handle false positives (this is the helpfulness tax — legitimate requests killed by mistake), and how to version the constitution (it's a drifting policy, govern it like a prompt — see Day 35).
Training a classifier yourself is heavy; in practice you wire in an existing one and upgrade "block / allow" from binary to three tiers, giving borderline requests a step to escalate on:
# Classifier returns a risk score; binary blocking feels bad, so use three tiers
BLOCK, REVIEW = 0.90, 0.60  # thresholds are the helpfulness-tax knob

def jailbreak_gate(text):
    score = safety_classifier(text).risk  # independent classifier
    if score >= BLOCK:
        return ("block", fallback_refusal())
    if score >= REVIEW:
        return ("escalate", queue_human_review(text))  # not a hard refuse
    return ("allow", text)

# Streaming: re-check roughly every 200 characters; on BLOCK, cut the stream immediately
buf, checked = "", 0
for chunk in stream:
    buf += chunk
    # Track growth since the last check; len(buf) % 200 == 0 would almost never fire
    if len(buf) - checked >= 200:
        checked = len(buf)
        if jailbreak_gate(buf)[0] == "block":
            abort_stream()
            break
The threshold is a direct cost/safety knob: raise BLOCK → more misses; lower it → helpfulness tax spikes and legitimate users get refused for nothing. Keep an escalate tier in the middle to route borderline requests to a human instead of bluntly rejecting them.
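One way to pick those two numbers is to replay a labeled sample of past traffic and see what each setting would have done. The sketch below assumes you have such a sample; Sample and tier_rates are hypothetical names, and the scores in the usage note are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    risk: float    # score from the independent classifier
    harmful: bool  # ground-truth label from human review

def tier_rates(samples, block, review):
    """Return the miss rate, helpfulness tax, and review load for one threshold pair."""
    harmful = [s for s in samples if s.harmful]
    benign = [s for s in samples if not s.harmful]
    # Miss: a harmful request that is not hard-blocked (it is escalated or allowed).
    not_blocked = sum(s.risk < block for s in harmful)
    # Helpfulness tax: a benign request that is hard-blocked.
    blocked = sum(s.risk >= block for s in benign)
    # Review load: anything that lands in the escalate tier.
    escalated = sum(review <= s.risk < block for s in samples)
    return {
        "harmful_not_blocked": not_blocked / len(harmful) if harmful else 0.0,
        "benign_blocked": blocked / len(benign) if benign else 0.0,
        "escalated": escalated / len(samples) if samples else 0.0,
    }
```

Calling tier_rates(samples, 0.90, 0.60) and then tier_rates(samples, 0.96, 0.60) on the same sample shows the trade directly: the second call should report fewer benign blocks and more harmful requests that escape the hard block.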
Content safety is only half the output gate; the other half is correctness and policy. Validate LLM output as if it were untrusted external input, in three layers, and judge each layer with deterministic code, not the LLM.
The key engineering decision is fail-closed: when validation fails, reject by default and take the fallback, not "catch exception → log warning → ship anyway." A meaningful share of production incidents are fail-open: the validator throws, the error gets swallowed, and dirty data flows unimpeded. Frameworks like Guardrails AI package this as a re-ask loop: validation fails → feed the failure reason back to the model to regenerate → retry with a cap → still failing, fail-closed to a fallback.
from pydantic import BaseModel, field_validator

# call_llm, INVENTORY and FALLBACK are placeholders supplied by the surrounding app.

class Order(BaseModel):
    sku: str
    qty: int
    amount: float

    @field_validator("amount")
    @classmethod
    def non_negative(cls, v):
        if v < 0:
            raise ValueError("amount must be >= 0")  # policy layer
        return v

def validated_extract(prompt, max_retries=2):  # re-ask must be capped
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            order = Order.model_validate_json(raw)  # structure + policy
            if order.sku not in INVENTORY:
                raise ValueError("unknown SKU")
            return order
        except ValueError as e:  # pydantic's ValidationError is a ValueError subclass
            prompt += f"\n\nLast output failed validation: {e}. Fix and re-emit."  # re-ask
    return FALLBACK  # ★ fail-closed: still failing at the cap → fallback, never ship dirty data
All three layers judge together, re-ask feeds the failure reason back to the model, but it must be capped — when the cap is hit and it still fails, fail-closed to a fallback rather than returning the last (still-invalid) output.
Common pitfalls: (1) Fail-open, where except: pass swallows the validation failure, which equals no validation. (2) Uncapped re-ask: when the model repeatedly errs, cost explodes and latency runs away. (3) Validating schema but not business policy: valid JSON with amount=-999 ships fine; the most dangerous case is exactly "correct format, wrong semantics."
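The fail-open pitfall can be closed off structurally instead of relying on every call site to handle exceptions correctly. A minimal sketch, assuming a single fallback value is acceptable for the validator in question; fail_closed, REJECTED, and check_amount are hypothetical names used only for illustration:

```python
import functools
import logging

log = logging.getLogger("guardrail")

def fail_closed(fallback):
    """Decorator: any exception inside the wrapped validator yields `fallback`."""
    def decorate(validate):
        @functools.wraps(validate)
        def wrapper(*args, **kwargs):
            try:
                return validate(*args, **kwargs)
            except Exception as exc:
                # Log for the incident trail, then reject; the input is never passed through.
                log.warning("validator %s failed closed: %r", validate.__name__, exc)
                return fallback
        return wrapper
    return decorate

REJECTED = {"status": "rejected"}  # hypothetical fallback value

@fail_closed(fallback=REJECTED)
def check_amount(payload: dict) -> dict:
    if payload["amount"] < 0:
        raise ValueError("amount must be >= 0")
    return {"status": "ok", "amount": payload["amount"]}
```

Note that a crash inside the validator itself (here, a missing "amount" key) lands on the same rejected path as a policy violation, which is the point: an error in the check is treated as a failed check.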
Agents execute model-generated code / shell / SQL, the highest-risk action class. The defense isn't telling the model to "be careful"; it's OS-level sandboxing plus severing the lethal trifecta. Simon Willison's lethal trifecta is private-data access + untrusted-content ingestion + an outbound channel; all three present = unconditionally exploitable for exfiltration. The good news is engineering-friendly: removing any one edge breaks the exfiltration path. Don't grant private data, don't ingest untrusted content, or cut egress. The third is usually the most actionable.
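Because the trifecta is a property of an agent's configuration, it can be checked before the agent ever starts. The sketch below assumes each agent declares its capabilities as three booleans; AgentCapabilities, trifecta_edges, and assert_trifecta_broken are hypothetical names, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool
    ingests_untrusted_content: bool
    has_outbound_channel: bool

def trifecta_edges(caps: AgentCapabilities) -> list[str]:
    """Return the names of the trifecta edges this agent currently holds."""
    edges = {
        "private data": caps.reads_private_data,
        "untrusted content": caps.ingests_untrusted_content,
        "outbound channel": caps.has_outbound_channel,
    }
    return [name for name, present in edges.items() if present]

def assert_trifecta_broken(caps: AgentCapabilities) -> None:
    """Refuse to start an agent that holds all three edges at once."""
    edges = trifecta_edges(caps)
    if len(edges) == 3:
        raise RuntimeError("lethal trifecta: agent holds " + ", ".join(edges))
```

A check like this only helps if the declared capabilities match reality, so it complements the sandbox rather than replacing it.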
The sandbox must manage both dimensions: filesystem isolation and network isolation.
Anthropic's sandbox for Claude Code follows exactly this paradigm: bubblewrap (Linux) / seatbelt (macOS) locks the filesystem, and the network can only go through a unix domain socket connected to a proxy outside the sandbox, so all outbound traffic passes through that proxy.
# Run model-generated code: no network + read-only root + write only the work dir.
# Flag notes sit up here because a comment after a trailing backslash ends the command early.
#   --network none             ③ cut egress (trifecta edge three)
#   --read-only                root filesystem read-only
#   --tmpfs /work              the only writable area, ephemeral
#   --memory / --cpus          resource caps
#   --cap-drop / --pids-limit  drop capabilities; the PID cap guards against fork bombs
#   -v ...:ro                  task files mounted read-only
docker run --rm \
  --network none \
  --read-only \
  --tmpfs /work:rw,size=64m \
  --memory 512m --cpus 1 \
  --cap-drop ALL --pids-limit 128 \
  -v "$PWD/task:/work/task:ro" \
  sandbox-img python /work/task/gen.py
# If the task truly needs the network: don't use --network none; force it
# through an allowlist proxy: HTTPS_PROXY=http://egress-proxy:3128
# + the proxy only permits whitelisted domains
--network none directly removes the trifecta's third edge — the cleanest isolation. If the task must reach the network, don't open a free exit; force it through an egress proxy that only permits whitelisted domains — a controlled narrow channel, not an open door.
(2) An egress allowlist broad enough to include hosts like *.github.com or pastebin, which can be abused as an outbound channel. (3) Treating the sandbox as the only line of defense and skipping the input gate: the sandbox blocks execution harm, not "executing an injection embedded in a retrieved document as if it were an instruction."
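The allowlist decision itself is small enough to keep deterministic. A minimal sketch of the check an egress proxy might apply, assuming exact hostname matching and HTTPS only; ALLOWED_HOSTS and egress_allowed are hypothetical, and the two hosts listed are placeholders for whatever the task needs:

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: exact hostnames only, no wildcards.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})

def egress_allowed(url: str, allowed=ALLOWED_HOSTS) -> bool:
    """Permit only HTTPS requests whose hostname is exactly on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # urlsplit lowercases the hostname; also drop a trailing dot before comparing.
    host = (parts.hostname or "").rstrip(".")
    return host in allowed
```

Exact matching is deliberate here: a suffix or wildcard rule is what lets a user-writable subdomain slip through as an outbound channel.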
Assemble the four points into a checklist you can drop onto any agent that takes actions. Each line maps to a specific failure mode above:
--network none or an egress allowlist proxy + read-only root + resource caps: cut the trifecta's third edge first.
The core principle: guardrails are architecture, not prompts. Each one is physical interception in the harness layer that the model cannot talk its way past. From now on, when you look at any "secure AI product," you'll instinctively hunt for its four boundaries, and if you can't find them, they aren't there.