Design auth for a multi-tenant SaaS: 5M registered users, 1M DAU, login peaks at 5,000 QPS. It serves four client types simultaneously — Web, mobile apps, a public API, and third-party integrations. Enterprise customers demand SSO with their own IdP (OIDC/SAML); compliance requires that keys be rotatable, actions auditable, and leaked credentials revocable instantly.
Security design has a counterintuitive core: it is almost entirely trade-offs, not "more strict is better". Make tokens never-revocable stateless JWTs and performance is great but you can't stop the bleeding after a leak; put keys in env vars and deploys are simple but rotation is a nightmare. This issue covers four layered concerns — the AuthN/AuthZ boundary, OAuth/OIDC delegation, the revocation cost of JWT vs Session, and the lifecycle of secrets. They map to security's three eternal questions: who are you, what can you do, where do secrets live.
All external requests hit the API gateway first, which does coarse-grained token validation (signature, expiry, scope) — this is the authentication boundary. Authentication itself is delegated to the IdP / authorization server (self-hosted or Auth0/Okta), which issues access + refresh tokens. Once inside, each service runs a Policy Decision Point (PDP) for fine-grained authorization (can this user modify this tenant's record). No service stores keys in config — they fetch dynamic credentials at runtime from secret management (Vault / cloud KMS).
graph LR
C["Client<br/>Web/mobile/API"]
GW["API gateway<br/>AuthN: verify token"]
IDP["Auth server/IdP<br/>mint tokens"]
SVC["Business svc<br/>AuthZ: PDP decision"]
SS["Secret mgmt<br/>Vault/KMS"]
DB[("User/perm store")]
C -->|1 login| IDP
IDP -->|2 access+refresh| C
C -->|3 Bearer token| GW
GW -->|4 verified| SVC
GW -.->|public key/JWKS| IDP
SVC -->|5 allowed?| SVC
SVC -.->|dynamic cred| SS
SVC --> DB
classDef ext fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef gate fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef core fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef sec fill:#0e2030,stroke:#5eead4,color:#e8eef5
class C ext
class GW gate
class IDP,SVC core
class SS,DB sec
AuthN happens once at the boundary; AuthZ at every internal entry; keys never land in config files
One-line trade-off: Authentication can be done centrally once; authorization must be done at every resource access point — conflating them is the root of the Web's #1 vulnerability class.
Principle: Authentication answers "who are you," producing a trusted identity credential once at login. Authorization answers "may you do this operation on this resource," and must be evaluated on every resource access. They differ in frequency, location, and failure semantics: AuthN failure returns 401 Unauthorized (you haven't proven identity); AuthZ failure returns 403 Forbidden (identity is fine but you lack permission). The classic IDOR (Insecure Direct Object Reference) bug is treating AuthN as AuthZ — "the user is logged in," so we trust their request for /orders/456 without checking whether that order belongs to them. OWASP ranks this "Broken Access Control" as the top risk.
# Distinguish 401 from 403, and check ownership at every access point
def get_order(req, order_id):
    user = authenticate(req)               # no identity -> 401
    if user is None:
        raise HTTP(401)                    # tell client to re-authenticate
    order = db.get_order(order_id)
    if order is None:
        raise HTTP(404)                    # no such order; avoid dereferencing None below
    if order.tenant_id != user.tenant_id:  # AuthZ: verify resource ownership
        raise HTTP(403)                    # identity ok but unauthorized -> 403
    return order
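The per-service Policy Decision Point mentioned earlier can be sketched as a pure function that answers "may this subject perform this action on this resource". This is a minimal illustration, not a production policy engine: the `Subject` and `Resource` types, the `ROLE_ACTIONS` map, and the role names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    user_id: str
    tenant_id: str
    roles: frozenset


@dataclass(frozen=True)
class Resource:
    tenant_id: str


# Hypothetical role -> allowed-actions map; a real system would load this from a policy store.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "editor": {"read", "update"},
    "admin": {"read", "update", "delete"},
}


def decide(subject: Subject, action: str, resource: Resource) -> bool:
    """Allow only if tenant isolation holds AND at least one role grants the action."""
    if subject.tenant_id != resource.tenant_id:
        return False  # cross-tenant access is denied regardless of role
    return any(action in ROLE_ACTIONS.get(role, set()) for role in subject.roles)


alice = Subject("u1", "t1", frozenset({"editor"}))
own_tenant_order = Resource(tenant_id="t1")
other_tenant_order = Resource(tenant_id="t2")

print(decide(alice, "update", own_tenant_order))   # True: editor may update
print(decide(alice, "delete", own_tenant_order))   # False: editor may not delete
print(decide(alice, "read", other_tenant_order))   # False: wrong tenant
```

A caller would map a `False` decision to 403, keeping 401 for the case where no identity was established at all.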
One-line trade-off: Trade "the complexity of an authorization server and redirects" for "never touching the user's password elsewhere, plus fine-grained authorization and revocation".
Principle: OAuth 2.0 is an authorization-delegation framework: it lets a user authorize a third-party app to access resources on their behalf without handing over a password, yielding an access token (scoped, representing "what you may do"). But OAuth is not an authentication protocol: an access token says "the bearer is authorized," not reliably "who the user is." OIDC (OpenID Connect) adds an identity layer on top, issuing an id_token (a JWT carrying identity claims), which is the correct basis for "Sign in with Google." For user-facing clients, the recommended flow today is Authorization Code + PKCE: get a one-time code first, then exchange the code (with PKCE proof) for tokens, so tokens never appear in a browser URL. The old Implicit flow (token returned directly in the URL) is discouraged by current OAuth security guidance and dropped from the OAuth 2.1 draft, because tokens leak into history/referer/logs.
# Authorization Code + PKCE (client-side pseudo-code)
verifier = random_urlsafe(64)                # high-entropy secret, kept by the client
challenge = base64url(sha256(verifier))      # derived, safe to send publicly
state = random_urlsafe(16)                   # anti-CSRF value, checked on the redirect back
# 1. redirect to auth server with challenge (NOT the verifier)
redirect(f"{AUTH}/authorize?client_id={CID}&response_type=code"
         f"&redirect_uri={REDIRECT_URI}&state={state}"
         f"&code_challenge={challenge}&code_challenge_method=S256&scope=openid")
# 2. after login+consent, auth server redirects back with code (and the same state)
# 3. exchange code + original verifier for tokens (a stolen code is useless w/o verifier)
tok = POST(f"{AUTH}/token", grant_type="authorization_code", code=code,
           redirect_uri=REDIRECT_URI, code_verifier=verifier, client_id=CID)
# -> { access_token, refresh_token, id_token }   use id_token to confirm "who you are"
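The verifier/challenge derivation in the pseudo-code above can be made concrete with the standard library alone. The sketch below shows both halves of the S256 check; the function names `make_pkce_pair` and `server_verifies` are illustrative, not part of any OAuth library.

```python
import base64
import hashlib
import hmac
import secrets
from typing import Tuple


def _s256(verifier: str) -> str:
    """base64url(SHA-256(verifier)) without '=' padding, as the S256 method requires."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> Tuple[str, str]:
    """Client side: return (code_verifier, code_challenge)."""
    verifier = secrets.token_urlsafe(64)  # 86 URL-safe characters, inside the 43-128 limit
    return verifier, _s256(verifier)


def server_verifies(stored_challenge: str, presented_verifier: str) -> bool:
    """Auth-server side: recompute the challenge from the verifier sent to /token."""
    return hmac.compare_digest(_s256(presented_verifier), stored_challenge)


verifier, challenge = make_pkce_pair()
print(server_verifies(challenge, verifier))            # True: legitimate client
print(server_verifies(challenge, "attacker-guess"))    # False: stolen code, no verifier
```

Because only the hash travels through the browser redirect, an attacker who intercepts the authorization code still cannot produce a verifier that hashes to the stored challenge.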
One-line trade-off: JWT trades "can't be revoked instantly" for "verify every request without touching a store"; Session is the inverse.
Principle: A Session keeps state server-side (Redis/DB); the client holds only an opaque session_id and the server looks it up per request. A JWT signs the state (user ID, scope, expiry) into a self-contained token; the server only verifies the signature — that's stateless, verifiable independently on any service in any region. The cost: once issued, a JWT can't be recalled before it expires. A stolen JWT works unhindered until expiry; if a user's permissions change or they're banned, the old token still carries old rights. The industry compromise is short-lived access tokens (JWT, 5–15 min) + long-lived refresh tokens (stateful, revocable server-side): to revoke, invalidate the refresh token and the access token dies within minutes.
| | Session (stateful) | JWT (stateless) |
|---|---|---|
| Per-request cost | one store lookup | local verify, zero lookup |
| Instant revocation | ✅ delete session | ❌ can't recall before expiry |
| Horizontal scaling | needs shared session store | ✅ any node verifies alone |
| Permission change effect | immediate | takes effect on next refresh |
| Best for | monolith/single-DC, strong revocation | microservices/multi-region, short tokens |
# Three fatal JWT verification pitfalls (PyJWT-style API)
claims = jwt.decode(
    token, key=PUBLIC_KEY,
    algorithms=["RS256"],      # 1. pin the algorithm! else attacker sets alg:none to bypass,
                               #    or RS256->HS256 to forge using "public key as HMAC secret"
    issuer=TRUSTED_ISSUER,     # 2. must check issuer (let the library enforce it; a bare
                               #    assert is stripped when Python runs with -O)
    audience=MY_API,           # 3. must check audience, else a token minted
                               #    for another service also works here
    options={"require": ["exp", "iss", "aud"]})
# Beyond verification, revocation needs a jti blacklist or a short TTL;
# a pure stateless JWT can't "log out"
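To make the algorithm-pinning pitfall concrete, here is a standard-library sketch of a verifier that decides the algorithm itself instead of trusting the token header. It uses HS256 only because `hmac` ships with Python; the same rule applies to RS256. The helper names (`sign_hs256`, `verify_pinned`) are illustrative and this is not a substitute for a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def verify_pinned(token: str, secret: bytes) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":  # the verifier, not the token, picks the algorithm
        raise ValueError("unexpected alg")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))


# A forged token that declares alg:none and carries an empty signature.
forged = (b64url(json.dumps({"alg": "none"}).encode()) + "."
          + b64url(json.dumps({"sub": "admin"}).encode()) + ".")
try:
    verify_pinned(forged, b"server-secret")
except ValueError as err:
    print("rejected:", err)  # rejected: unexpected alg
```

A verifier that instead dispatched on `header["alg"]` would accept the forged token, since "none" means there is nothing to check.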
The alg:none and RS256→HS256 algorithm-confusion attacks directly motivated today's JWT best practices.
One-line trade-off: Trade "the operational and availability complexity of Vault/KMS" for "central audit, automatic rotation, and second-scale invalidation after a leak".
Principle: DB passwords, API keys, signing private keys — scattered across env vars, config files, even git — is secret sprawl: nobody knows how many copies exist, who can read them, or when they were last rotated. Mature solutions progress in three layers: ① central storage (Vault / cloud KMS) as a single source of truth plus audit; ② dynamic short-lived credentials — the app requests "a temporary DB account that lives one hour" from Vault at startup, auto-destroyed on expiry, leaving a tiny leak window; ③ workload identity — issue no static keys at all; use a platform-issued instance identity (AWS IAM Role, SPIFFE) to obtain access directly, so there's no long-lived secret to leak. Rotation goes from "company-wide password coordination" to "automatic background rolling."
# Dynamic short-lived credentials: request a temp account at runtime, not a static password
lease = vault.read("database/creds/app-readonly") # request unique, short-lived DB credentials
db = connect(user=lease.username, pw=lease.password) # TTL=1h, Vault auto-revokes on expiry
schedule_renew(lease, before_expiry="10m") # renew or re-request
# Contrast: static DB_PASSWORD in env -> valid forever; a leak forces an all-hands reset+restart
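The "renew or re-request" step boils down to a timing rule: act once the lease is within some margin of expiry. The sketch below models only that rule; the `Lease` class and its fields are hypothetical stand-ins and do not mirror any Vault client API.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    username: str
    password: str
    issued_at: float  # seconds since epoch
    ttl: float        # lifetime in seconds

    def expires_at(self) -> float:
        return self.issued_at + self.ttl

    def needs_renewal(self, now: float, margin: float = 600.0) -> bool:
        """True once we are within `margin` seconds of expiry (default: 10 minutes)."""
        return now >= self.expires_at() - margin

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at()


lease = Lease("v-app-readonly-1", "example-only", issued_at=0.0, ttl=3600.0)
print(lease.needs_renewal(now=2999.0))  # False: more than 10 minutes left
print(lease.needs_renewal(now=3000.0))  # True: inside the renewal margin
print(lease.is_expired(now=3600.0))     # True: credentials are gone, re-request
```

Renewing with a margin rather than at expiry leaves room for a failed renewal to be retried before open connections start failing.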
Store tokens in httpOnly + Secure cookies to resist XSS theft, with SameSite / CSRF tokens against cross-site requests; go further with token binding / DPoP to bind a token to a client key so a stolen token is useless.
An httpOnly cookie blocks XSS reads but needs CSRF defense (SameSite + CSRF token). There is no silver bullet; weigh each option by attack surface. Interviewers often probe whether you know each one's failure mode.
A pure stateless JWT can't be revoked instantly: either keep a jti blacklist (which makes it stateful again) or set access-token TTL to minutes and rely on refresh-token revocation. Acknowledging this trade-off is more professional than pretending you can revoke.
Without a pinned algorithms list, a library may trust the token's own alg, so an attacker sets none or switches RS256 to HS256 to forge tokens. Always pin algorithms and validate aud/iss.
With a 15-minute access-token TTL, a stolen token stays usable for up to 15 minutes (a pure stateless JWT can't be recalled early). Cutting the TTL to 1 minute shrinks the leak window but doesn't eliminate risk: ① an attacker can do damage within that minute; ② the real danger is the refresh token, which is long-lived and continually mints new access tokens, so stealing it is a long-term pass.
Cost: access tokens must refresh every minute, spiking refresh QPS 15×, and refresh usually hits a stateful store (to check whether the refresh token is revoked), largely canceling the "stateless saves lookups" benefit.
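The 15× figure follows from simple arithmetic: each active session refreshes once per TTL, so refresh QPS is roughly active sessions divided by TTL in seconds. The session count below is an assumed number chosen for illustration, not a figure from the scenario.

```python
def refresh_qps(active_sessions: int, access_ttl_seconds: int) -> float:
    """Steady-state refresh rate if every active session refreshes once per TTL."""
    return active_sessions / access_ttl_seconds


ACTIVE_SESSIONS = 300_000  # assumption: concurrently active sessions at peak

at_15_min = refresh_qps(ACTIVE_SESSIONS, 15 * 60)
at_1_min = refresh_qps(ACTIVE_SESSIONS, 60)

print(round(at_15_min))  # 333
print(round(at_1_min))   # 5000
```

Under that assumption the refresh endpoint alone would match the 5,000 QPS login peak, and every one of those requests hits the stateful store.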
The right answer: short access token + refresh-token rotation — issue a new refresh token on each refresh and void the old one; if a used (old) refresh token reappears (indicating concurrent stolen use), revoke the entire token chain immediately. That balances performance with blast-radius control.
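Refresh-token rotation with reuse detection can be sketched with an in-memory store. This is a minimal illustration under the assumption that tokens are grouped into a "family" descended from one login; the `RefreshStore` class is hypothetical, and a real deployment would back it with Redis or a database and store token hashes rather than raw tokens.

```python
import secrets
from typing import Optional


class RefreshStore:
    """In-memory stand-in for the stateful refresh-token store."""

    def __init__(self) -> None:
        self._tokens = {}               # token -> {"family": str, "used": bool}
        self._revoked_families = set()  # families killed after reuse was detected

    def issue(self, family: Optional[str] = None) -> str:
        family = family or secrets.token_hex(8)  # new login starts a new family
        token = secrets.token_urlsafe(32)
        self._tokens[token] = {"family": family, "used": False}
        return token

    def rotate(self, token: str) -> str:
        record = self._tokens.get(token)
        if record is None or record["family"] in self._revoked_families:
            raise PermissionError("unknown or revoked refresh token")
        if record["used"]:
            # An already-rotated token came back: assume theft and kill the whole chain.
            self._revoked_families.add(record["family"])
            raise PermissionError("refresh token reuse detected; family revoked")
        record["used"] = True
        return self.issue(record["family"])


store = RefreshStore()
first = store.issue()
second = store.rotate(first)   # normal refresh: old token is voided, new one issued
try:
    store.rotate(first)        # replay of the old token (attacker or victim)
except PermissionError as err:
    print(err)                 # refresh token reuse detected; family revoked
```

After the replay, `second` is rejected as well, so whichever party holds the surviving token is forced back through a full login.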
An access token expresses authorization: "the bearer is allowed to access a resource." It doesn't guarantee "the bearer is a specific user." The classic bug is confused deputy / token substitution: app A receives an access token valid for it and assumes the corresponding user has logged in here — but that token might have been issued for a different app or audience, and an attacker can inject their own token to impersonate.
OIDC exists to plug this hole: it additionally issues an id_token (with aud=your client_id, iss, sub, nonce), explicitly stating "this identity was issued to your app." So "login" should use the id_token and validate aud/nonce, not treat an access token as an identity credential.
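The id_token checks listed above can be sketched as a function over already-decoded claims. It assumes the signature has been verified beforehand by a JWT library; the function name and the example issuer and client values are hypothetical.

```python
import time
from typing import Optional


def validate_id_token_claims(claims: dict, *, issuer: str, client_id: str,
                             expected_nonce: str, now: Optional[float] = None) -> str:
    """Check iss/aud/nonce/exp on signature-verified claims and return the subject."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise ValueError("wrong issuer")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be a string or a list
    if client_id not in audiences:
        raise ValueError("token was not issued to this client")
    if claims.get("nonce") != expected_nonce:
        raise ValueError("nonce mismatch")  # not bound to the login request we started
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    if "sub" not in claims:
        raise ValueError("missing subject")
    return claims["sub"]


claims = {"iss": "https://idp.example", "aud": "my-client", "sub": "user-42",
          "nonce": "n-123", "exp": 2000}
print(validate_id_token_claims(claims, issuer="https://idp.example",
                               client_id="my-client", expected_nonce="n-123",
                               now=1000))  # user-42
```

An access token minted for a different app fails the `aud` check here, which is exactly the substitution the paragraph above describes.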
The attack surfaces are swapped. localStorage's threat is XSS (any injected script can localStorage.getItem the token), whereas a CSRF token defends against CSRF (cross-site forged requests that exploit the browser auto-sending cookies). A token in localStorage is not auto-attached by the browser, so it's barely exposed to CSRF; adding a CSRF token does nothing for its real XSS risk.
The correct combination is one of two targeted approaches: either httpOnly cookie (blocks XSS reads) + SameSite/CSRF token (blocks CSRF); or store the token in memory (non-persistent, gone on refresh, shrinking the XSS window). The real cure is eliminating XSS itself (CSP, output escaping). Layering on an unrelated defense is just "security theater."
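The cookie-based option can be sketched with Python's standard `http.cookies` module (the `samesite` attribute requires Python 3.8 or later). The cookie name, the 15-minute lifetime, and the helper name are illustrative choices.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str) -> str:
    """Build a Set-Cookie value that scripts cannot read and that is not sent cross-site."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    morsel = cookie["session"]
    morsel["httponly"] = True    # invisible to document.cookie, so XSS cannot read it
    morsel["secure"] = True      # only sent over HTTPS
    morsel["samesite"] = "Lax"   # withheld on most cross-site requests (CSRF mitigation)
    morsel["path"] = "/"
    morsel["max-age"] = 900      # 15 minutes, illustrative
    return morsel.OutputString()


print(session_cookie_header("abc123"))
```

SameSite=Lax still allows top-level GET navigations, so state-changing endpoints should not respond to GET, and a CSRF token remains a sensible second layer.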
This is isomorphic to Day 2 caching and Day 23 reliability: turn a strongly-consistent central decision into "local cache + async sync" to amortize hot-path cost.
Containment & recovery: ① rotate the signing key pair immediately — sign with the new private key, publish the new public key (JWKS); ② but if the old public key is still in JWKS/service caches, old forged tokens are still accepted — remove the old kid from JWKS and force verifiers to refresh; ③ reject all tokens signed with the old kid, effectively forcing everyone to re-login; ④ audit tokens issued during the window for signs of forged abuse.
The fundamental weakness: stateless security rests entirely on the secrecy of the signing key, so the key is a single point of failure. Once leaked, the blast radius is "every user since the last rotation," and because the system doesn't consult a store, it can't distinguish a "real token" from one "forged with the leaked key." This is exactly why you need periodic key rotation + multiple coexisting kids + private keys held in an HSM/KMS that they never leave: together these turn a single point into a quickly replaceable, isolatable component. By contrast, leaking a Session store is also bad, but you can invalidate everyone by clearing it, and there's no "offline forgery" problem.
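The kid-based rotation described above can be sketched as a small key ring: several keys coexist, new tokens are signed with the active one, and retiring a kid makes every signature under it fail. HMAC is used here purely as a standard-library stand-in for an asymmetric key pair, and the `KeyRing` class is hypothetical.

```python
import hashlib
import hmac


class KeyRing:
    """kid -> key map playing the role of a JWKS document (HMAC stand-in for RSA/EC)."""

    def __init__(self) -> None:
        self._keys = {}
        self.active_kid = None

    def add(self, kid: str, key: bytes, make_active: bool = True) -> None:
        self._keys[kid] = key
        if make_active:
            self.active_kid = kid

    def retire(self, kid: str) -> None:
        self._keys.pop(kid, None)  # verifiers must also drop any cached copy

    def sign(self, message: bytes):
        key = self._keys[self.active_kid]
        return self.active_kid, hmac.new(key, message, hashlib.sha256).hexdigest()

    def verify(self, kid: str, message: bytes, signature: str) -> bool:
        key = self._keys.get(kid)
        if key is None:
            return False  # unknown or retired kid: reject without further checks
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)


ring = KeyRing()
ring.add("2024-01", b"old-key")
old_kid, old_sig = ring.sign(b"token-body")

ring.add("2024-02", b"new-key")                       # rotate: both kids coexist
print(ring.verify(old_kid, b"token-body", old_sig))   # True during the overlap

ring.retire("2024-01")                                # leak response: drop the old kid
print(ring.verify(old_kid, b"token-body", old_sig))   # False: holders must re-login
```

Routine rotation keeps the overlap window so sessions survive; a leak skips the overlap and retires the kid immediately, accepting the forced re-login.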