Day 24 Hard Security OAuth2/OIDC JWT vs Session Secrets

Security: Who You Are, What You Can Do, Where Secrets Live

AuthN vs AuthZ, OAuth2/OIDC, JWT vs Session, Secret Management

Scenario + Requirements

Design auth for a multi-tenant SaaS: 5M registered users, 1M DAU, login peaks at 5,000 QPS. It serves four client types simultaneously — Web, mobile apps, a public API, and third-party integrations. Enterprise customers demand SSO with their own IdP (OIDC/SAML); compliance requires that keys be rotatable, actions auditable, and leaked credentials revocable instantly.

Security design has a counterintuitive core: it is almost entirely trade-offs, not "more strict is better". Make tokens never-revocable stateless JWTs and performance is great but you can't stop the bleeding after a leak; put keys in env vars and deploys are simple but rotation is a nightmare. This issue covers four layered concerns — the AuthN/AuthZ boundary, OAuth/OIDC delegation, the revocation cost of JWT vs Session, and the lifecycle of secrets. They map to security's three eternal questions: who are you, what can you do, where do secrets live.

High-Level Architecture

All external requests hit the API gateway first, which does coarse-grained token validation (signature, expiry, scope) — this is the authentication boundary. Authentication itself is delegated to the IdP / authorization server (self-hosted or Auth0/Okta), which issues access + refresh tokens. Once inside, each service runs a Policy Decision Point (PDP) for fine-grained authorization (can this user modify this tenant's record). No service stores keys in config — they fetch dynamic credentials at runtime from secret management (Vault / cloud KMS).

graph LR
    C["Client<br/>Web/mobile/API"]
    GW["API gateway<br/>AuthN: verify token"]
    IDP["Auth server/IdP<br/>mint tokens"]
    SVC["Business svc<br/>AuthZ: PDP decision"]
    SS["Secret mgmt<br/>Vault/KMS"]
    DB[("User/perm store")]
    C -->|1 login| IDP
    IDP -->|2 access+refresh| C
    C -->|3 Bearer token| GW
    GW -->|4 verified| SVC
    GW -.->|public key/JWKS| IDP
    SVC -->|5 allowed?| SVC
    SVC -.->|dynamic cred| SS
    SVC --> DB
    classDef ext fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef gate fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef core fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef sec fill:#0e2030,stroke:#5eead4,color:#e8eef5
    class C ext
    class GW gate
    class IDP,SVC core
    class SS,DB sec

AuthN happens once at the boundary; AuthZ at every internal entry; keys never land in config files

Key Technical Points

1. AuthN vs AuthZ — First Separate "Who You Are" from "What You Can Do"

One-line trade-off: Authentication can be done centrally once; authorization must be done at every resource access point — conflating them is the root of the Web's #1 vulnerability class.

Principle: Authentication answers "who are you," producing a trusted identity credential once at login. Authorization answers "may you do this operation on this resource," and must be evaluated on every resource access. They differ in frequency, location, and failure semantics: AuthN failure returns 401 Unauthorized (you haven't proven identity); AuthZ failure returns 403 Forbidden (identity is fine but you lack permission). The classic IDOR (Insecure Direct Object Reference) bug treats AuthN as AuthZ: "the user is logged in," so the server trusts their request for /orders/456 without checking whether that order belongs to them. IDOR falls under Broken Access Control, which the OWASP Top 10 (2021) ranks as the #1 risk.

Trade-off: where authorization decisions live
# Distinguish 401 from 403, and check ownership at every access point
# (authenticate, db, and HTTP are placeholders for your framework's equivalents)
def get_order(req, order_id):
    user = authenticate(req)            # no identity -> 401
    if user is None:
        raise HTTP(401)                 # tell client to re-authenticate
    order = db.get_order(order_id)
    if order is None:
        raise HTTP(404)                 # no such order; avoids AttributeError below
    if order.tenant_id != user.tenant_id:   # AuthZ: verify resource ownership
        raise HTTP(403)                 # identity ok but unauthorized -> 403
    return order
Real-world cases:

2. OAuth 2.0 / OIDC — Don't Store Third-Party Passwords; Delegate Authorization

One-line trade-off: Trade "the complexity of an authorization server and redirects" for "never touching the user's password elsewhere, plus fine-grained authorization and revocation".

Principle: OAuth 2.0 is an authorization-delegation framework: it lets a user authorize a third-party app to access resources on their behalf without handing over a password, yielding an access token (scoped, representing "what you may do"). But OAuth is not an authentication protocol: an access token says "the bearer is authorized," not reliably "who the user is." OIDC (OpenID Connect) adds an identity layer on top, issuing an id_token (a JWT carrying identity claims); that is the correct basis for "Sign in with Google." For flows where a user signs in through a browser or app, the recommended flow is Authorization Code + PKCE: get a one-time code first, then exchange the code (with PKCE proof) for tokens, so tokens never appear in a browser URL. (Machine-to-machine calls with no user involved use the Client Credentials grant instead.) The old Implicit flow (token returned directly in the URL) is discouraged by current OAuth security best practice and omitted from the OAuth 2.1 draft because tokens leak into history/referer/logs.

Trade-off: which grant flow
# Authorization Code + PKCE (client-side pseudo-code; redirect/POST are placeholders)
verifier  = random_urlsafe(64)                  # high-entropy secret, 43-128 chars
challenge = base64url(sha256(verifier))         # derived, public; no "=" padding
state     = random_urlsafe(32)                  # CSRF guard: must match on the way back
# 1. redirect to auth server with challenge (NOT the verifier)
redirect(f"{AUTH}/authorize?client_id={CID}&response_type=code&redirect_uri={REDIRECT_URI}"
         f"&code_challenge={challenge}&code_challenge_method=S256&scope=openid&state={state}")
# 2. after login+consent, auth server redirects back with code (and the same state)
# 3. exchange code + original verifier for tokens (a stolen code is useless w/o verifier)
tok = POST(f"{AUTH}/token", grant_type="authorization_code", code=code,
           redirect_uri=REDIRECT_URI, code_verifier=verifier, client_id=CID)
#   -> { access_token, refresh_token, id_token }  use id_token to confirm "who you are"
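The verifier/challenge derivation in the pseudo-code can be made concrete with the Python standard library. This is a minimal sketch, not a full OAuth client: the helper names `make_verifier`, `make_challenge`, and `server_accepts` are illustrative and do not come from any OAuth library. The one detail that is easy to get wrong is that the S256 challenge is base64url-encoded without `=` padding.

```python
import base64
import hashlib
import secrets


def make_verifier(n_bytes: int = 48) -> str:
    """Return a high-entropy, URL-safe code_verifier (RFC 7636 allows 43-128 chars)."""
    # 48 random bytes encode to 64 base64url characters with no padding.
    return secrets.token_urlsafe(n_bytes)


def make_challenge(verifier: str) -> str:
    """Derive the S256 code_challenge: base64url(sha256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_accepts(stored_challenge: str, presented_verifier: str) -> bool:
    """The check an authorization server performs at the token endpoint."""
    return secrets.compare_digest(make_challenge(presented_verifier), stored_challenge)
```

The client sends only `make_challenge(verifier)` in the redirect and keeps the verifier in memory; an attacker who intercepts the code cannot produce a verifier that hashes to the stored challenge.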
Real-world cases:

3. JWT vs Session — The Cost of Stateless Is "Hard to Revoke"

One-line trade-off: JWT trades "can't be revoked instantly" for "verify every request without touching a store"; Session is the inverse.

Principle: A Session keeps state server-side (Redis/DB); the client holds only an opaque session_id and the server looks it up per request. A JWT signs the state (user ID, scope, expiry) into a self-contained token; the server only verifies the signature — that's stateless, verifiable independently on any service in any region. The cost: once issued, a JWT can't be recalled before it expires. A stolen JWT works unhindered until expiry; if a user's permissions change or they're banned, the old token still carries old rights. The industry compromise is short-lived access tokens (JWT, 5–15 min) + long-lived refresh tokens (stateful, revocable server-side): to revoke, invalidate the refresh token and the access token dies within minutes.

| | Session (stateful) | JWT (stateless) |
|---|---|---|
| Per-request cost | one store lookup | local verify, zero lookup |
| Instant revocation | ✅ delete session | ❌ can't recall before expiry |
| Horizontal scaling | needs shared session store | ✅ any node verifies alone |
| Permission change effect | immediate | takes effect on next refresh |
| Best for | monolith/single-DC, strong revocation | microservices/multi-region, short tokens |
# Three fatal JWT verification pitfalls (PyJWT)
import jwt

claims = jwt.decode(
    token, key=PUBLIC_KEY,
    algorithms=["RS256"],     # 1. pin the algorithm! else attacker sets alg:none to bypass,
                              #    or RS256->HS256 to forge using "public key as HMAC secret"
    issuer=TRUSTED_ISSUER,    # 2. must check issuer
    audience=MY_API,          # 3. must check audience, else a token minted
                              #    for another service also works here
    options={"require": ["exp", "iss", "aud"]})
# Let the library enforce iss/aud rather than `assert`: asserts vanish under `python -O`,
# and aud may be a list, so a plain == comparison can wrongly reject or accept.
# Separately, revocation: keep a jti blacklist or short TTL; pure stateless JWT can't "log out"
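The jti blacklist mentioned in the last comment can be sketched as a denylist whose entries live only as long as the token they revoke, which keeps the list bounded by the access-token TTL. This is a minimal in-memory sketch: the class name `JtiDenylist` is hypothetical, and a real deployment would presumably use a shared store such as Redis with per-key expiry so that every verifier sees the same revocations.

```python
import time


class JtiDenylist:
    """Revoked token IDs, each kept only until the token itself would expire."""

    def __init__(self, clock=time.time):
        self._revoked = {}    # jti -> exp (epoch seconds)
        self._clock = clock   # injectable for testing

    def revoke(self, jti: str, exp: float) -> None:
        """Record a revoked token ID together with that token's own expiry."""
        self._revoked[jti] = exp

    def is_revoked(self, jti: str) -> bool:
        exp = self._revoked.get(jti)
        if exp is None:
            return False
        if exp <= self._clock():
            # The token has expired, so normal exp validation already rejects it;
            # the entry is no longer needed.
            del self._revoked[jti]
            return False
        return True
```

The verifier calls `is_revoked(claims["jti"])` after signature and expiry checks pass. That lookup is exactly the per-request state the table attributes to sessions, which is the trade-off being made.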
Real-world cases:

4. Secret Management — Keys Aren't Config; They're Living Things That Expire and Leak

One-line trade-off: Trade "the operational and availability complexity of Vault/KMS" for "central audit, automatic rotation, and second-scale invalidation after a leak".

Principle: DB passwords, API keys, signing private keys — scattered across env vars, config files, even git — is secret sprawl: nobody knows how many copies exist, who can read them, or when they were last rotated. Mature solutions progress in three layers: ① central storage (Vault / cloud KMS) as a single source of truth plus audit; ② dynamic short-lived credentials — the app requests "a temporary DB account that lives one hour" from Vault at startup, auto-destroyed on expiry, leaving a tiny leak window; ③ workload identity — issue no static keys at all; use a platform-issued instance identity (AWS IAM Role, SPIFFE) to obtain access directly, so there's no long-lived secret to leak. Rotation goes from "company-wide password coordination" to "automatic background rolling."

Trade-off:
# Dynamic short-lived credentials: request a temp account at runtime, not a static password
# (pseudo-code: vault, connect, and schedule_renew stand in for your client library)
lease = vault.read("database/creds/app-readonly")   # request short-lived DB credentials
db = connect(user=lease.username, pw=lease.password) # TTL=1h, Vault auto-revokes on expiry
schedule_renew(lease, before_expiry="10m")           # renew or re-request
# Contrast: static DB_PASSWORD in env -> valid forever; a leak forces an all-hands reset+restart
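The "renew or re-request" step can be sketched as a small wrapper that hands out a cached lease and fetches a fresh one shortly before expiry. This is an illustration under assumptions: `fetch` is a hypothetical callable standing in for the secret-manager request and is assumed to return `(username, password, ttl_seconds)`; `Lease` and `RotatingCredentials` are made-up names, not part of any Vault client.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Lease:
    username: str
    password: str
    expires_at: float


class RotatingCredentials:
    """Hands out a cached lease and re-requests one shortly before it expires."""

    def __init__(self, fetch: Callable[[], Tuple[str, str, float]],
                 refresh_margin: float = 600.0, clock=time.time):
        self._fetch = fetch            # hypothetical: asks the secret manager for a lease
        self._margin = refresh_margin  # seconds before expiry at which to re-request
        self._clock = clock
        self._lease: Optional[Lease] = None

    def current(self) -> Lease:
        now = self._clock()
        if self._lease is None or self._lease.expires_at - now <= self._margin:
            user, password, ttl = self._fetch()
            self._lease = Lease(user, password, now + ttl)
        return self._lease
```

Callers ask for `current()` whenever they open a connection instead of reading a value once at startup, so a rotated credential is picked up without a restart.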
Real-world cases:

Scaling & Optimization

Common Pitfalls + Interview Questions

1. JWT in localStorage or a cookie? localStorage is readable by any XSS-injected script; an httpOnly cookie keeps scripts from reading the token but needs CSRF defense (SameSite + CSRF token), because the browser attaches it automatically. No silver bullet: weigh by attack surface. Interviewers often probe whether you know each one's failure mode.
2. How does a pure stateless JWT implement "logout"? Strictly, it can't do instant logout. Either keep a jti blacklist (which makes it stateful again) or set access-token TTL to minutes and rely on refresh-token revocation. Acknowledging this trade-off is more professional than pretending you can revoke.
3. 401 or 403? Missing/bad credential → 401 (go re-authenticate); valid credential but no permission → 403. To prevent resource enumeration, some endpoints deliberately return 404 for unauthorized resources to hide their existence — that's an intentional security choice, not a missed check.
4. Forgetting to pin the algorithm during JWT verification: without an explicit algorithms list, the library trusts the token's own alg — an attacker sets none or switches RS256 to HS256 to forge tokens. Always pin algorithms and validate aud/iss.
5. Committing a secret to git: even if deleted later, it stays in history and may already be scraped. Once committed, consider it leaked — rotate immediately. This is exactly why dynamic short-lived credentials are far safer than static keys.

Deep-Dive Resources

Going Deeper

1. Your access-token TTL is 15 minutes. How long can a stolen token be used? Does cutting TTL to 1 minute eliminate the risk? At what cost?

Up to 15 minutes (a pure stateless JWT can't be recalled early). Cutting to 1 minute shrinks the leak window but doesn't eliminate risk: ① an attacker can do damage within that minute; ② the real danger is the refresh token — it's long-lived and continually mints new access tokens, so stealing it is a long-term pass.

Cost: access tokens must refresh every minute, spiking refresh QPS 15×, and refresh usually hits a stateful store (to check whether the refresh token is revoked), largely canceling the "stateless saves lookups" benefit.
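The 15× figure follows from simple arithmetic: if every active session refreshes once per TTL, the refresh rate is sessions divided by TTL. A small sketch, where the 100,000 concurrently active sessions is an assumed number chosen for illustration and not a figure from the scenario:

```python
def refresh_qps(active_sessions: int, ttl_seconds: float) -> float:
    """Steady-state refresh rate if every active session refreshes once per TTL."""
    return active_sessions / ttl_seconds


sessions = 100_000                            # assumption, not from the scenario
at_15_min = refresh_qps(sessions, 15 * 60)    # about 111 refreshes per second
at_1_min = refresh_qps(sessions, 60)          # about 1,667 refreshes per second
ratio = at_1_min / at_15_min                  # 15x, independent of the session count
```

The ratio depends only on the two TTLs, so the multiplier holds whatever the real session count is; only the absolute load on the refresh-token store changes.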

The right answer: short access token + refresh-token rotation — issue a new refresh token on each refresh and void the old one; if a used (old) refresh token reappears (indicating concurrent stolen use), revoke the entire token chain immediately. That balances performance with blast-radius control.
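Rotation with reuse detection can be sketched as follows. This is a minimal in-memory illustration, not a production design: `RefreshTokenStore` and the notion of a `family_id` (one per login) are names introduced here, and a real store would also persist tokens, hash them at rest, and expire them.

```python
import secrets


class RefreshTokenStore:
    """Refresh-token rotation with reuse detection, tracked per token family."""

    def __init__(self):
        self._tokens = {}                 # token -> (family_id, is_current)
        self._revoked_families = set()

    def issue(self, family_id: str) -> str:
        """Mint a new current refresh token for a login (family)."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (family_id, True)
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a refresh token for a new one; raise PermissionError on misuse."""
        entry = self._tokens.get(presented)
        if entry is None:
            raise PermissionError("unknown refresh token")
        family_id, is_current = entry
        if family_id in self._revoked_families:
            raise PermissionError("token family revoked")
        if not is_current:
            # An already-rotated token reappeared: assume theft and kill the whole chain.
            self._revoked_families.add(family_id)
            raise PermissionError("refresh token reuse detected")
        self._tokens[presented] = (family_id, False)   # void the old token
        return self.issue(family_id)
```

After a replay, both the attacker and the legitimate user are locked out of that family and the user must log in again. That is the intended cost: the server cannot tell which of the two parties is the thief.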

2. Why is OAuth 2.0 "not an authentication protocol," and what goes wrong if you use an access token to decide "who the user is"?

An access token expresses authorization: "the bearer is allowed to access a resource." It doesn't guarantee "the bearer is a specific user." The classic bug is confused deputy / token substitution: app A receives an access token valid for it and assumes the corresponding user has logged in here — but that token might have been issued for a different app or audience, and an attacker can inject their own token to impersonate.

OIDC exists to plug this hole: it additionally issues an id_token (with aud=your client_id, iss, sub, nonce), explicitly stating "this identity was issued to your app." So "login" should use the id_token and validate aud/nonce, not treat an access token as an identity credential.
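The claim checks described here can be sketched as a plain function over the decoded claims. It assumes the id_token's signature has already been verified by a JWT library; the function name and parameters are illustrative, not part of any OIDC SDK.

```python
import time
from typing import Optional


def validate_id_token_claims(claims: dict, *, client_id: str, issuer: str,
                             expected_nonce: str, now: Optional[float] = None) -> str:
    """Check signature-verified id_token claims; return the subject on success."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise ValueError("unexpected issuer")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]   # aud may be a string or a list
    if client_id not in audiences:
        raise ValueError("token was not issued to this client")
    if claims.get("nonce") != expected_nonce:
        raise ValueError("nonce mismatch (possible replay or injection)")
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    if "sub" not in claims:
        raise ValueError("missing subject")
    return claims["sub"]
```

The audience check is what defeats token substitution: a token minted for another app carries that app's client_id in `aud` and is rejected here, however valid its signature.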

3. A frontend dev says "I store the JWT in localStorage and add a CSRF token, so it's safe." What's backwards here?

The attack surfaces are swapped. localStorage's threat is XSS (any injected script can localStorage.getItem the token), whereas a CSRF token defends against CSRF (cross-site forged requests that exploit the browser auto-sending cookies). A token in localStorage is not auto-attached by the browser, so it's barely exposed to CSRF; adding a CSRF token does nothing for its real XSS risk.

The correct combination is one of two targeted approaches: either httpOnly cookie (blocks XSS reads) + SameSite/CSRF token (blocks CSRF); or store the token in memory (non-persistent, gone on refresh, shrinking the XSS window). The real cure is eliminating XSS itself (CSP, output escaping). Layering on an unrelated defense is just "security theater."

4. A centralized PDP adds a hop per request. On a 5,000-QPS hot path, how do you keep it from becoming a latency and availability bottleneck?
  • Push policy down + evaluate locally: don't ask a central PDP per request; push policy to a sidecar next to each service (the OPA model) and evaluate against in-memory data — microsecond decisions. The center only distributes policy and collects audit.
  • Separate data from policy: policy changes slowly and can be cached; permission data (who's in which group) changes fast — sync it incrementally via bundles or cache with short TTL, tolerating second-scale eventual consistency.
  • Cache decision results: cache the same (user, action, resource) briefly, but weigh that revocation latency = cache TTL.
  • Explicit failure mode: when the PDP is unreachable, fail-open (allow, availability first) or fail-close (deny, security first)? Security contexts should almost always fail-close, though core read paths may need tiering.

This is isomorphic to Day 2 caching and Day 23 reliability: turn a strongly-consistent central decision into "local cache + async sync" to amortize hot-path cost.
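Two of the bullets above, decision caching and an explicit failure mode, can be sketched together. This is an illustration under assumptions: `decide` is a hypothetical callable standing in for the actual policy evaluation (a sidecar call, for instance), and `CachedPDP` is a name made up for this example.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]   # (user, action, resource)


class CachedPDP:
    """Caches allow/deny decisions briefly and denies when the PDP cannot be reached."""

    def __init__(self, decide: Callable[[str, str, str], bool],
                 ttl: float = 5.0, clock=time.monotonic):
        self._decide = decide
        self._ttl = ttl          # revocation latency is bounded by this value
        self._clock = clock
        self._cache: Dict[Key, Tuple[bool, float]] = {}

    def allowed(self, user: str, action: str, resource: str) -> bool:
        key = (user, action, resource)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                    # may be stale for up to ttl seconds
        try:
            decision = self._decide(user, action, resource)
        except Exception:
            return False                     # fail closed; the outage is not cached
        self._cache[key] = (decision, now + self._ttl)
        return decision
```

Choosing `ttl` is the revocation-latency trade-off in one parameter: a longer TTL means fewer PDP calls but a longer window in which a revoked permission still passes.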

5. Your signing private key leaked. Under stateless JWT, an attacker can forge any user's token. What are the full containment and recovery steps? What fundamental weakness of stateless does this expose?

Containment & recovery: ① rotate the signing key pair immediately — sign with the new private key, publish the new public key (JWKS); ② but if the old public key is still in JWKS/service caches, old forged tokens are still accepted — remove the old kid from JWKS and force verifiers to refresh; ③ reject all tokens signed with the old kid, effectively forcing everyone to re-login; ④ audit tokens issued during the window for signs of forged abuse.

The fundamental weakness: stateless security rests entirely on the secrecy of the signing key, so the key is a single point of failure. Once leaked, the blast radius is "every user since the last rotation," and because the system doesn't consult a store, it can't distinguish a "real token" from one "forged with the leaked key." This is exactly why you need periodic key rotation + multiple coexisting kids + private keys that never leave an HSM/KMS: together they turn a single point into a quickly-replaceable, isolatable component. By contrast, leaking a Session store is also bad, but you can invalidate everyone by clearing it, and there's no "offline forgery" problem.
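Why removing the old kid contains the leak can be shown with a toy key set. One loud simplification: this sketch uses HMAC from the standard library so that it runs without dependencies, whereas the scenario uses asymmetric keys (verifiers hold only public keys from JWKS). `KeySet` is a hypothetical name, and the kid strings are made up.

```python
import hashlib
import hmac


class KeySet:
    """Verification keys indexed by kid, standing in for a JWKS document."""

    def __init__(self):
        self._keys = {}   # kid -> key bytes

    def add(self, kid: str, key: bytes) -> None:
        self._keys[kid] = key

    def remove(self, kid: str) -> None:
        self._keys.pop(kid, None)

    def sign(self, kid: str, payload: bytes) -> bytes:
        return hmac.new(self._keys[kid], payload, hashlib.sha256).digest()

    def verify(self, kid: str, payload: bytes, signature: bytes) -> bool:
        key = self._keys.get(kid)
        if key is None:
            return False          # unknown or retired kid: reject outright
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)
```

While both kids coexist, tokens signed by either key verify, which is what makes routine rotation seamless. Once the leaked kid is removed, every token signed with it fails, genuine or forged, which is why step ③ amounts to a forced re-login.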