Day 24 Hard Security OAuth2/OIDC JWT vs Session Secrets

Security: Who You Are, What You Can Do, Where Secrets Live
AuthN vs AuthZ, OAuth2/OIDC, JWT vs Session, Secret Management

Scenario + Requirements

Design auth for a multi-tenant SaaS: 5M registered users, 1M DAU, login peaks at 5,000 QPS. It serves four client types simultaneously — Web, mobile apps, a public API, and third-party integrations. Enterprise customers demand SSO with their own IdP (OIDC/SAML); compliance requires that keys be rotatable, actions auditable, and leaked credentials revocable instantly.

Security design has a counterintuitive core: it is almost entirely trade-offs, not "more strict is better". Make tokens never-revocable stateless JWTs and performance is great but you can't stop the bleeding after a leak; put keys in env vars and deploys are simple but rotation is a nightmare. This issue covers four layered concerns — the AuthN/AuthZ boundary, OAuth/OIDC delegation, the revocation cost of JWT vs Session, and the lifecycle of secrets. They map to security's three eternal questions: who are you, what can you do, where do secrets live.

Trust boundary: where is "untrusted outside" vs "authenticated inside"? Tokens are minted at the boundary and verified at every internal entry point.
AuthN frequency vs AuthZ frequency: login happens once a day; authorization checks happen on every request (5,000+ QPS) — AuthZ's performance budget is far tighter.
Blast-radius control (RTO): after a leak, can you invalidate a credential in minutes, or must you wait for natural expiry? This number drives token design.
Compliance: audit logs, rotation cadence, least privilege — these are design constraints, not afterthoughts.

High-Level Architecture

All external requests hit the API gateway first, which does coarse-grained token validation (signature, expiry, scope) — this is the authentication boundary. Authentication itself is delegated to the IdP / authorization server (self-hosted or Auth0/Okta), which issues access + refresh tokens. Once inside, each service runs a Policy Decision Point (PDP) for fine-grained authorization (can this user modify this tenant's record). No service stores keys in config — they fetch dynamic credentials at runtime from secret management (Vault / cloud KMS).

graph LR
    C["Client<br/>Web/mobile/API"]
    GW["API gateway<br/>AuthN: verify token"]
    IDP["Auth server/IdP<br/>mint tokens"]
    SVC["Business svc<br/>AuthZ: PDP decision"]
    SS["Secret mgmt<br/>Vault/KMS"]
    DB[("User/perm store")]

    C -->|1 login| IDP
    IDP -->|2 access+refresh| C
    C -->|3 Bearer token| GW
    GW -->|4 verified| SVC
    GW -.->|public key/JWKS| IDP
    SVC -->|5 allowed?| SVC
    SVC -.->|dynamic cred| SS
    SVC --> DB

    classDef ext fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef gate fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef core fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef sec fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class C ext
    class GW gate
    class IDP,SVC core
    class SS,DB sec

AuthN happens once at the boundary; AuthZ at every internal entry; keys never land in config files

Key Technical Points

1. AuthN vs AuthZ — First Separate "Who You Are" from "What You Can Do"

One-line trade-off: Authentication can be done centrally once; authorization must be done at every resource access point — conflating them is the root of the Web's #1 vulnerability class.

Principle: Authentication answers "who are you," producing a trusted identity credential once at login. Authorization answers "may you do this operation on this resource," and must be evaluated on every resource access. They differ in frequency, location, and failure semantics: AuthN failure returns 401 Unauthorized (you haven't proven identity); AuthZ failure returns 403 Forbidden (identity is fine but you lack permission). The classic IDOR (Insecure Direct Object Reference) bug is treating AuthN as AuthZ — "the user is logged in," so we trust their request for /orders/456 without checking whether that order belongs to them. OWASP ranks this "Broken Access Control" as the top risk.

Trade-off: where authorization decisions live

Embedded (each service decides): ✅ zero network latency, no SPOF; ❌ policy logic scattered everywhere, policy changes require redeploying all services, hard to audit.
Centralized PDP (e.g. OPA / a central authz service): ✅ unified, auditable, hot-updatable policy; ❌ an extra hop per request (mitigate with sidecar local policy cache), PDP is on the critical path.
Relationship-based (Google Zanzibar model): ✅ great for fine-grained ReBAC like "share this doc with someone," globally consistent; ❌ complex, needs a dedicated permission-graph store.

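# Distinguish 401 from 403, and check ownership at every access point.
# authenticate(), db and HTTP are placeholders for the framework's own helpers;
# owner_id and user.id are assumed field names.
def get_order(req, order_id):
    user = authenticate(req)            # no valid identity -> None
    if user is None:
        raise HTTP(401)                 # tell client to re-authenticate
    order = db.get_order(order_id)
    if order is None:
        raise HTTP(404)                 # nothing to authorize against
    if order.tenant_id != user.tenant_id:   # AuthZ: tenant isolation
        raise HTTP(403)                 # identity ok but unauthorized -> 403
    if order.owner_id != user.id:           # AuthZ: the order must belong to this user
        raise HTTP(403)                 # relax only for roles allowed to read others' orders
    return order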
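
The relationship-based option in the trade-off list above can be illustrated with a much-simplified sketch. It assumes permissions are stored as (object, relation, subject) tuples, loosely in the spirit of Zanzibar; the tuple layout, the `IMPLIED_BY` table, and the `check` function are all hypothetical names for illustration, not the API of SpiceDB or OpenFGA.

```python
from typing import Dict, List, Set, Tuple

RelationTuple = Tuple[str, str, str]  # (object, relation, subject)

# Illustrative permission graph: alice owns the doc, the eng group can view it.
TUPLES: Set[RelationTuple] = {
    ("doc:readme", "owner", "user:alice"),
    ("doc:readme", "viewer", "group:eng#member"),
    ("group:eng", "member", "user:bob"),
}

# For each relation, the relations that are sufficient to grant it.
IMPLIED_BY: Dict[str, List[str]] = {
    "viewer": ["viewer", "editor", "owner"],
    "editor": ["editor", "owner"],
    "owner": ["owner"],
}


def check(obj: str, relation: str, user: str, depth: int = 0) -> bool:
    """Return True if `user` holds `relation` on `obj`, directly or via a group."""
    if depth > 5:  # guard against cyclic group membership
        return False
    for rel in IMPLIED_BY.get(relation, [relation]):
        for o, r, subject in TUPLES:
            if o != obj or r != rel:
                continue
            if subject == user:
                return True
            if "#" in subject:  # userset such as "group:eng#member"
                group, group_rel = subject.split("#", 1)
                if check(group, group_rel, user, depth + 1):
                    return True
    return False
```

A real permission-graph store indexes these tuples and handles consistency; the linear scan here only shows why the model suits "share this doc with someone" style rules.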

Real-world cases:

Google Zanzibar: a unified authz system giving Drive/YouTube/Calendar globally consistent "who can access what" decisions; inspired SpiceDB and OpenFGA (Zanzibar paper).
OWASP: Broken Access Control sits atop the Top 10, with IDOR as its archetype.
AWS IAM: fully separates AuthN (credentials) from AuthZ (policy evaluation), running every API call through a policy engine.

2. OAuth 2.0 / OIDC — Don't Store Third-Party Passwords; Delegate Authorization

One-line trade-off: Trade "the complexity of an authorization server and redirects" for "never touching the user's password elsewhere, plus fine-grained authorization and revocation".

Principle: OAuth 2.0 is an authorization-delegation framework: it lets a user authorize a third-party app to access resources on their behalf without handing over a password, yielding an access token (scoped, representing "what you may do"). But OAuth is not an authentication protocol: an access token says "the bearer is authorized," not reliably "who the user is." OIDC (OpenID Connect) adds an identity layer on top, issuing an id_token (a JWT carrying identity claims); that is the correct basis for "Sign in with Google." For any flow with a user present, the recommended grant is Authorization Code + PKCE: get a one-time code first, then exchange the code (with PKCE proof) for tokens at the token endpoint, so tokens never appear in a browser URL. The old Implicit flow (token returned directly in the URL) is deprecated by the OAuth 2.0 Security Best Current Practice and omitted from the OAuth 2.1 draft, because tokens leak into history/referer/logs.

Trade-off: which grant flow

Authorization Code + PKCE: ✅ most secure, fits any user-present scenario (Web/mobile/SPA); ❌ multiple redirects, must handle code exchange.
Client Credentials: ✅ for service-to-service (no user), simple; ❌ trusted backends only, a leaked credential means full access.
Resource Owner Password (ROPC): ⚠️ collects the user's password directly to exchange for a token, no longer recommended, only a transitional crutch for first-party apps — it violates OAuth's whole point.

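# Authorization Code + PKCE (client-side sketch; redirect() and POST() are placeholder helpers,
# AUTH, CID and REDIRECT_URI are app configuration; URL-encode parameters in real code)
import base64, hashlib, secrets

verifier  = secrets.token_urlsafe(64)           # high-entropy secret, 86 chars (spec allows 43-128)
digest    = hashlib.sha256(verifier.encode("ascii")).digest()
challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()  # derived, public, unpadded
state     = secrets.token_urlsafe(16)           # ties the callback to this browser session
# 1. redirect to auth server with challenge (NOT the verifier)
redirect(f"{AUTH}/authorize?client_id={CID}&response_type=code&redirect_uri={REDIRECT_URI}"
         f"&code_challenge={challenge}&code_challenge_method=S256&scope=openid&state={state}")
# 2. after login+consent, auth server redirects back with code and state; reject a mismatched state
# 3. exchange code + original verifier for tokens (a stolen code is useless w/o verifier)
tok = POST(f"{AUTH}/token", grant_type="authorization_code", code=code,
           code_verifier=verifier, client_id=CID, redirect_uri=REDIRECT_URI)
#   -> { access_token, refresh_token, id_token }  use id_token to confirm "who you are"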

Real-world cases:

oauth.net / specs: Aaron Parecki's OAuth 2.0 site and OAuth 2.0 Simplified are the authoritative primers; PKCE is now the recommended default for all client types.
"Sign in with Google/Apple": both are OIDC — the third party gets an id_token to confirm identity and never touches your Google password.
Okta / Auth0: turn OAuth/OIDC into a managed IdP; enterprises use it for SSO, avoiding self-built auth servers.

3. JWT vs Session — The Cost of Stateless Is "Hard to Revoke"

One-line trade-off: JWT trades "can't be revoked instantly" for "verify every request without touching a store"; Session is the inverse.

Principle: A Session keeps state server-side (Redis/DB); the client holds only an opaque session_id and the server looks it up per request. A JWT signs the state (user ID, scope, expiry) into a self-contained token; the server only verifies the signature — that's stateless, verifiable independently on any service in any region. The cost: once issued, a JWT can't be recalled before it expires. A stolen JWT works unhindered until expiry; if a user's permissions change or they're banned, the old token still carries old rights. The industry compromise is short-lived access tokens (JWT, 5–15 min) + long-lived refresh tokens (stateful, revocable server-side): to revoke, invalidate the refresh token and the access token dies within minutes.

	Session (stateful)	JWT (stateless)
Per-request cost	one store lookup	local verify, zero lookup
Instant revocation	✅ delete session	❌ can't recall before expiry
Horizontal scaling	needs shared session store	✅ any node verifies alone
Permission change effect	immediate	takes effect on next refresh
Best for	monolith/single-DC, strong revocation	microservices/multi-region, short tokens

# Three fatal JWT verification pitfalls, plus the revocation caveat (assuming PyJWT)
claims = jwt.decode(
    token, key=PUBLIC_KEY,
    algorithms=["RS256"],     # 1. pin the algorithm! else attacker sets alg:none to bypass,
                              #    or RS256->HS256 to forge using "public key as HMAC secret"
    issuer=TRUSTED_ISSUER,    # 2. must check issuer
    audience=MY_API,          # 3. must check audience, else a token minted
                              #    for another service also works here
    options={"require": ["exp", "iss", "aud"]})
# Let the library enforce iss/aud: assert statements are stripped under `python -O`,
# and a plain == comparison breaks when aud is a list.
# 4. revocation: keep a jti blacklist or short TTL; pure stateless JWT can't "log out"
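
The revocation caveat in the last comment above can be sketched as a small jti denylist. This is an in-memory illustration with hypothetical names (`JtiDenylist`, `revoke`, `is_revoked`); a shared deployment would more likely use a store such as Redis with per-key expiry. The key point is that an entry only needs to live until the token's own `exp`, so the list stays small when access tokens are short-lived.

```python
import time
from typing import Dict, Optional


class JtiDenylist:
    """Tracks revoked token IDs until the tokens would have expired anyway."""

    def __init__(self) -> None:
        self._revoked: Dict[str, float] = {}  # jti -> token expiry (epoch seconds)

    def revoke(self, jti: str, exp: float) -> None:
        self._revoked[jti] = exp

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Drop entries whose tokens have expired; exp validation rejects those already.
        self._revoked = {j: e for j, e in self._revoked.items() if e > now}
        return jti in self._revoked
```

Consulting this list on every request is exactly the store lookup that stateless JWTs were meant to avoid, which is the trade-off this section describes.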

Real-world cases:

Auth0 / Tim McLean: the classic Critical vulnerabilities in JSON Web Token libraries exposed the alg:none and RS256→HS256 algorithm-confusion attacks (post), directly motivating JWT best practices.
IETF RFC 8725: JWT Best Current Practices — mandates pinning algorithms and validating aud/iss; essential reading.
Most large SaaS: use "short access token + revocable refresh token" to balance stateless performance with the ability to stop the bleeding.

4. Secret Management — Keys Aren't Config; They're Living Things That Expire and Leak

One-line trade-off: Trade "the operational and availability complexity of Vault/KMS" for "central audit, automatic rotation, and second-scale invalidation after a leak".

Principle: When DB passwords, API keys, and signing private keys are scattered across env vars, config files, even git, the result is secret sprawl: nobody knows how many copies exist, who can read them, or when they were last rotated. Mature solutions progress in three layers: ① central storage (Vault / cloud KMS) as a single source of truth plus audit; ② dynamic short-lived credentials, where the app requests "a temporary DB account that lives one hour" from Vault at startup and the account is auto-destroyed on expiry, leaving a tiny leak window; ③ workload identity, where no static keys are issued at all and a platform-issued instance identity (AWS IAM Role, SPIFFE) is used to obtain access directly, so there's no long-lived secret to leak. Rotation goes from "company-wide password coordination" to "automatic background rolling."

Trade-off:

Env var / config file: ✅ zero dependency, simple; ❌ easily leaks into git/logs, rotation requires redeploy, no audit, no way to scope blast radius after a leak.
Central secret management (Vault/KMS): ✅ audit + rotation + fine-grained access; ❌ one more critical dependency (if Vault is down apps can't get keys — needs HA + caching), operational cost.
Dynamic credentials / workload identity: ✅ short-lived, minimal leak impact, rotation-free; ❌ high migration cost, needs platform-level identity infrastructure.

# Dynamic short-lived credentials: request a temp account at runtime, not a static password
# (pseudo-code: vault, connect and schedule_renew stand in for a real Vault client and DB driver)
lease = vault.read("database/creds/app-readonly")   # request short-lived DB credentials
db = connect(user=lease.username, pw=lease.password) # TTL=1h, Vault auto-revokes on expiry
schedule_renew(lease, before_expiry="10m")           # renew or re-request before the lease ends
# Contrast: static DB_PASSWORD in env -> valid forever; a leak forces an all-hands reset+restart
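
The renew-before-expiry idea, together with the "needs HA + caching" caveat from the trade-off list, might look roughly like the following. This is a sketch under stated assumptions: `Lease`, `CredentialCache`, and the injected `fetch` callable are hypothetical and stand in for whatever secret-manager client is in use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lease:
    username: str
    password: str
    expires_at: float  # epoch seconds


class CredentialCache:
    """Serves a cached lease and re-requests it shortly before expiry."""

    def __init__(self, fetch: Callable[[], Lease], renew_margin_s: float = 600.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._margin = renew_margin_s
        self._clock = clock
        self._lease: Optional[Lease] = None

    def get(self) -> Lease:
        now = self._clock()
        if self._lease is None or now >= self._lease.expires_at - self._margin:
            try:
                self._lease = self._fetch()
            except Exception:
                # Secret manager unreachable: keep using the current lease while
                # it is still valid, otherwise surface the failure.
                if self._lease is None or now >= self._lease.expires_at:
                    raise
        return self._lease
```

The renewal margin buys time to ride out a brief secret-manager outage without ever holding a credential past its TTL.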

Real-world cases:

HashiCorp Vault: dynamic secrets (on-demand short-lived DB/cloud credentials) + central audit + auto rotation — the de facto standard.
Netflix Metatron / Lemur: Metatron provisions workload identity certificates at instance boot; Lemur orchestrates and auto-rotates X.509 certificates.
Fly.io: Thomas Ptacek's "API Tokens: A Tedious Survey" systematically surveys token design from random strings to Macaroons — essential for token selection.
AWS IAM Roles: EC2/containers hold no long-lived keys; they exchange an instance role for auto-rotated temporary credentials.

Scaling & Optimization

Credential stuffing / brute force: rate-limit the login endpoint specifically (see Day 10) + failure-count lockout + MFA; store passwords with bcrypt/argon2 (slow hash + per-user salt), never bare SHA.
Token theft: put access tokens in httpOnly + Secure cookies to resist XSS theft, with SameSite / CSRF tokens against cross-site requests; go further with token binding / DPoP to bind a token to a client key so a stolen token is useless.
Service-to-service auth: mTLS for internal calls (a service mesh like Istio auto-issues/rotates certs); under zero trust, "the internal network" is no longer inherently trusted.
Passwordless trend: Passkeys / WebAuthn use device biometrics + public keys, eliminating phishable passwords entirely.
Bottleneck identification: authorization checks are on every request's hot path — give the PDP a local policy cache + short-lived decision cache; cache JWKS public keys so you don't fetch them per verification.
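
As a rough illustration of the JWKS caching mentioned in the last item, the sketch below caches verification keys by `kid`, refreshes them after a TTL, and refetches early (but rate-limited) when an unknown `kid` appears, for example right after a key rotation. `JwksCache` and the injected `fetch_jwks` callable are hypothetical; the callable is assumed to return a mapping of kid to key material.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Caches verification keys by kid so verification does not fetch per request."""

    def __init__(self, fetch_jwks: Callable[[], Dict[str, str]], ttl_s: float = 300.0,
                 min_refetch_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_jwks
        self._ttl = ttl_s
        self._min_refetch = min_refetch_s
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def get_key(self, kid: str) -> Optional[str]:
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._ttl
        # An unknown kid may mean a rotation happened; refetch, but not more often
        # than min_refetch_s so bogus kids cannot trigger a fetch storm.
        unknown = (not stale and kid not in self._keys
                   and now - self._fetched_at >= self._min_refetch)
        if stale or unknown:
            self._keys = self._fetch()
            self._fetched_at = now
        return self._keys.get(kid)
```

The TTL also bounds how long a key removed from the JWKS keeps being trusted, so it should be chosen with the revocation target in mind.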

Common Pitfalls + Interview Questions

1. JWT in localStorage or a cookie? localStorage is readable by any XSS-injected script; an httpOnly cookie blocks XSS but needs CSRF defense (SameSite + CSRF token). No silver bullet — weigh by attack surface. Interviewers often probe whether you know each one's failure mode.

2. How does a pure stateless JWT implement "logout"? Strictly, it can't do instant logout. Either keep a jti blacklist (which makes it stateful again) or set access-token TTL to minutes and rely on refresh-token revocation. Acknowledging this trade-off is more professional than pretending you can revoke.

3. 401 or 403? Missing/bad credential → 401 (go re-authenticate); valid credential but no permission → 403. To prevent resource enumeration, some endpoints deliberately return 404 for unauthorized resources to hide their existence — that's an intentional security choice, not a missed check.

4. Forgetting to pin the algorithm during JWT verification: without an explicit algorithms list, the library trusts the token's own alg — an attacker sets none or switches RS256 to HS256 to forge tokens. Always pin algorithms and validate aud/iss.

5. Committing a secret to git: even if deleted later, it stays in history and may already be scraped. Once committed, consider it leaked — rotate immediately. This is exactly why dynamic short-lived credentials are far safer than static keys.

Deep-Dive Resources

oauth.net + "OAuth 2.0 Simplified" (Aaron Parecki): the authoritative entry point and accessible read for OAuth/OIDC.
IETF RFC 8725 — JWT Best Current Practices: the security checklist for JWT in production (pin algorithms, validate aud/iss).
Auth0 Blog — Critical vulnerabilities in JSON Web Token libraries (Tim McLean): the classic breakdown of alg:none and algorithm-confusion attacks.
Google Zanzibar paper + Fly.io "API Tokens: A Tedious Survey": deep references on fine-grained authorization and token design, respectively.
OWASP Top 10 + Cheat Sheet Series: practical defense checklists for Broken Access Control, Authentication, and Secrets.

Going Deeper

1. Your access-token TTL is 15 minutes. How long can a stolen token be used? Does cutting TTL to 1 minute eliminate the risk? At what cost?

Up to 15 minutes (a pure stateless JWT can't be recalled early). Cutting to 1 minute shrinks the leak window but doesn't eliminate risk: ① an attacker can do damage within that minute; ② the real danger is the refresh token — it's long-lived and continually mints new access tokens, so stealing it is a long-term pass.

Cost: access tokens must refresh every minute, spiking refresh QPS 15×, and refresh usually hits a stateful store (to check whether the refresh token is revoked), largely canceling the "stateless saves lookups" benefit.

The right answer: short access token + refresh-token rotation — issue a new refresh token on each refresh and void the old one; if a used (old) refresh token reappears (indicating concurrent stolen use), revoke the entire token chain immediately. That balances performance with blast-radius control.
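
A minimal in-memory sketch of refresh-token rotation with reuse detection follows. `RefreshStore`, the "family" concept used to group a token chain, and the method names are illustrative assumptions; a production store would persist hashed tokens with expiry rather than raw values in a dict.

```python
import secrets
from typing import Dict, Optional, Set


class RefreshStore:
    """Rotates refresh tokens and revokes the whole chain if an old one reappears."""

    def __init__(self) -> None:
        self._tokens: Dict[str, dict] = {}       # token -> {"user", "family", "used"}
        self._revoked_families: Set[str] = set()

    def issue(self, user: str, family: Optional[str] = None) -> str:
        token = secrets.token_urlsafe(32)
        family = family or secrets.token_urlsafe(8)  # new login starts a new chain
        self._tokens[token] = {"user": user, "family": family, "used": False}
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a refresh token for a new one; None means re-login is required."""
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            return None
        if record["used"]:
            # An already-rotated token came back: either the client or a thief
            # is replaying it, so the whole chain is no longer trustworthy.
            self._revoked_families.add(record["family"])
            return None
        record["used"] = True
        return self.issue(record["user"], record["family"])
```

Because the store cannot tell which party holds the legitimate token, revoking the entire family forces both the user and the attacker back to a full login.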

2. Why is OAuth 2.0 "not an authentication protocol," and what goes wrong if you use an access token to decide "who the user is"?

An access token expresses authorization: "the bearer is allowed to access a resource." It doesn't guarantee "the bearer is a specific user." The classic bug is confused deputy / token substitution: app A receives an access token valid for it and assumes the corresponding user has logged in here — but that token might have been issued for a different app or audience, and an attacker can inject their own token to impersonate.

OIDC exists to plug this hole: it additionally issues an id_token (with aud=your client_id, iss, sub, nonce), explicitly stating "this identity was issued to your app." So "login" should use the id_token and validate aud/nonce, not treat an access token as an identity credential.
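
The claim checks described above could be sketched as below. The function assumes the id_token's signature has already been verified by a JWT library and only inspects the decoded claims; `validate_id_token_claims` and its parameters are illustrative names, and the sketch omits further checks (such as `azp` for multi-audience tokens) that a full OIDC client performs.

```python
import hmac
from typing import Any, Dict


def validate_id_token_claims(claims: Dict[str, Any], *, issuer: str, client_id: str,
                             expected_nonce: str, now: float) -> str:
    """Return the subject if the signature-verified claims were issued to this client."""
    if claims.get("iss") != issuer:
        raise ValueError("unexpected issuer")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if client_id not in audiences:
        raise ValueError("token was not issued to this client")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        raise ValueError("token expired or exp missing")
    # The nonce ties the token to the login request this client started.
    nonce = str(claims.get("nonce", ""))
    if not hmac.compare_digest(nonce.encode(), expected_nonce.encode()):
        raise ValueError("nonce mismatch")
    sub = claims.get("sub")
    if not sub:
        raise ValueError("missing subject")
    return sub
```

An access token minted for a different app fails the audience check here, which is the substitution attack the id_token is designed to stop.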

3. A frontend dev says "I store the JWT in localStorage and add a CSRF token, so it's safe." What's backwards here?

The attack surfaces are swapped. localStorage's threat is XSS (any injected script can localStorage.getItem the token), whereas a CSRF token defends against CSRF (cross-site forged requests that exploit the browser auto-sending cookies). A token in localStorage is not auto-attached by the browser, so it's barely exposed to CSRF; adding a CSRF token does nothing for its real XSS risk.

The correct combination is one of two targeted approaches: either httpOnly cookie (blocks XSS reads) + SameSite/CSRF token (blocks CSRF); or store the token in memory (non-persistent, gone on refresh, shrinking the XSS window). The real cure is eliminating XSS itself (CSP, output escaping). Layering on an unrelated defense is just "security theater."
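
For the cookie-based option, the relevant attributes can be set with Python's standard `http.cookies` module. The sketch below only builds the `Set-Cookie` header value; the cookie name `access_token`, the helper name, and the 15-minute lifetime are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str, max_age_s: int = 900) -> str:
    """Build a Set-Cookie value that scripts cannot read and other sites cannot send."""
    cookie = SimpleCookie()
    cookie["access_token"] = token
    morsel = cookie["access_token"]
    morsel["httponly"] = True        # not readable from JavaScript, blunts XSS token theft
    morsel["secure"] = True          # only sent over HTTPS
    morsel["samesite"] = "Strict"    # not attached to cross-site requests, blunts CSRF
    morsel["path"] = "/"
    morsel["max-age"] = max_age_s    # align with the access-token TTL
    return morsel.OutputString()
```

`SameSite=Strict` can break flows that arrive via links from other sites; some applications choose `Lax` plus a CSRF token instead.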

4. A centralized PDP adds a hop per request. On a 5,000-QPS hot path, how do you keep it from becoming a latency and availability bottleneck?

Push policy down + evaluate locally: don't ask a central PDP per request; push policy to a sidecar next to each service (the OPA model) and evaluate against in-memory data — microsecond decisions. The center only distributes policy and collects audit.
Separate data from policy: policy changes slowly and can be cached; permission data (who's in which group) changes fast — sync it incrementally via bundles or cache with short TTL, tolerating second-scale eventual consistency.
Cache decision results: cache the same (user, action, resource) briefly, but weigh that revocation latency = cache TTL.
Explicit failure mode: when the PDP is unreachable, do you fail open (allow, availability first) or fail closed (deny, security first)? Security contexts should almost always fail closed, though low-risk core read paths may justify a separate, more lenient tier.

This is isomorphic to Day 2 caching and Day 23 reliability: turn a strongly-consistent central decision into "local cache + async sync" to amortize hot-path cost.
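
The decision-cache and failure-mode points can be combined in one small sketch. `CachedPdp` and the injected `decide` callable are hypothetical names; `decide` stands in for a call to a sidecar or central policy engine and is assumed to raise when that engine is unreachable.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (user, action, resource)


class CachedPdp:
    """Short-lived decision cache in front of a policy engine; denies on failure."""

    def __init__(self, decide: Callable[[str, str, str], bool], ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._decide = decide
        self._ttl = ttl_s
        self._clock = clock
        self._cache: Dict[Key, Tuple[bool, float]] = {}

    def allowed(self, user: str, action: str, resource: str) -> bool:
        key = (user, action, resource)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # may be up to ttl_s out of date
        try:
            decision = bool(self._decide(user, action, resource))
        except Exception:
            return False  # fail closed; the failure itself is not cached
        self._cache[key] = (decision, now)
        return decision
```

The TTL is the revocation latency: a permission removed at the policy engine can still be honored from cache for up to `ttl_s` seconds.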

5. Your signing private key leaked. Under stateless JWT, an attacker can forge any user's token. What are the full containment and recovery steps? What fundamental weakness of stateless does this expose?

Containment & recovery: ① rotate the signing key pair immediately — sign with the new private key, publish the new public key (JWKS); ② but if the old public key is still in JWKS/service caches, old forged tokens are still accepted — remove the old kid from JWKS and force verifiers to refresh; ③ reject all tokens signed with the old kid, effectively forcing everyone to re-login; ④ audit tokens issued during the window for signs of forged abuse.

The fundamental weakness: stateless security rests entirely on the secrecy of the signing key, so the key is a single point of failure. Once leaked, the blast radius is "every user since the last rotation," and because the system doesn't consult a store, it can't distinguish a "real token" from one "forged with the leaked key." This is exactly why you need periodic key rotation + multiple coexisting kids + private keys held in an HSM/KMS that they never leave: together these turn a single point into a quickly replaceable, isolatable component. By contrast, leaking a Session store is also bad, but you can invalidate everyone by clearing it, and there's no "offline forgery" problem.
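
The role of coexisting kids in containment can be shown with a toy keyring. Note the loud assumption: HMAC-SHA256 is used here purely as a standard-library stand-in for the asymmetric signatures (RS256/ES256) a real JWT deployment would use, and `VerifierKeyring`, `sign`, and the kid names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for an RSA/ECDSA signature; real verifiers hold public keys only.
    return hmac.new(key, message, hashlib.sha256).digest()


class VerifierKeyring:
    """Verification keys indexed by kid; retiring a kid rejects everything it signed."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def add(self, kid: str, key: bytes) -> None:
        self._keys[kid] = key

    def retire(self, kid: str) -> None:
        self._keys.pop(kid, None)

    def verify(self, kid: str, message: bytes, signature: bytes) -> bool:
        key = self._keys.get(kid)
        if key is None:
            return False  # unknown or retired kid: reject, forcing re-login
        return hmac.compare_digest(sign(key, message), signature)
```

During normal rotation both kids coexist so in-flight tokens keep working; after a leak the compromised kid is retired at once, which invalidates genuine and forged tokens alike.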