Design the backend for a short-video / e-commerce discovery feed with 200M DAU (TikTok's "For You", Instagram Explore, Taobao's "Guess You Like"): on every pull-to-refresh, pick the 20 items most likely to keep the user engaged out of a corpus of ~1B items, end-to-end < 100ms. The hard part isn't "train a CTR model" — it's the funnel of compute and latency: you cannot run a heavy model over all 1B candidates.
graph LR
U["User request
user features"] --> RT["Realtime features
recent behavior"]
RT --> REC["Retrieval
two-tower+ANN / CF / rules
1B → thousands"]
REC --> PRE["Pre-rank
light distilled model
thousands → hundreds"]
PRE --> RANK["Ranking
heavy · multi-objective
hundreds → tens"]
RANK --> RR["Re-rank
diversity · spread · biz rules"]
RR --> OUT["Top-20 feed"]
FS[("Feature Store
embeddings / stats")] -.-> REC
FS -.-> PRE
FS -.-> RANK
classDef u fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef stage fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class U,RT,OUT u
class REC,PRE,RANK,RR stage
class FS store
Each stage cuts the candidate set by an order of magnitude; later models are heavier with finer features
Component roles: Retrieval uses extremely cheap methods (vector dot-product + ANN, collaborative filtering, operational rules) to coarse-filter a few thousand from a billion — optimizing for high recall, not precision. Pre-rank uses a lightweight model to cut thousands to hundreds, a compute buffer between retrieval and ranking. Ranking runs the most expensive multi-objective deep model, scoring each candidate precisely. Re-rank handles global constraints ranking can't — diversity, same-author spread, ad insertion. The Feature Store serves embeddings and statistical features uniformly so training and serving use the same source.
One-line trade-off: trade "ability to generalize to never-co-occurred / cold items" for "ability to precompute offline + accelerate with ANN".
Principle: retrieval must pull a few thousand from a billion in milliseconds; the core idea is mapping both users and items into the same vector space where relevant = close, then retrieving with ANN (HNSW, see Day 12). Three approaches: collaborative filtering (CF / matrix factorization) relies on co-occurrence ("people who watched A also watched B"), pure ID, no content features; content retrieval uses item text/image/category features, naturally recalls cold items; two-tower is the industry mainstream — a user tower and item tower each ingest arbitrary features to produce vectors, trained to pull positive pairs close and push negatives apart; at serving, item vectors are precomputed offline into an ANN index and the user vector is computed once online for a dot-product.
| Method | Generalize/cold | Features | Serving | Typical |
|---|---|---|---|---|
| item-CF / MF | Poor (needs co-occur) | Pure ID | Precomputed similarity | Amazon "bought also bought" |
| Content retrieval | Good (cold-friendly) | Content features | Vector ANN | New-item fallback |
| Two-tower | Good | Arbitrary features | Offline item vectors + ANN | YouTube / Instagram |
# Two-tower in-batch softmax retrieval (PyTorch-style pseudo-code)
u = user_tower(user_feats) # [B, d]
v = item_tower(item_feats) # [B, d] B positive items in batch
logits = u @ v.T / temperature # [B, B] diagonal = positive pairs
logits -= log_item_freq # sampling-bias correction: discount popular items as negatives
loss = cross_entropy(logits, labels=arange(B)) # treat the rest of the batch as negatives
# serving: item_tower precomputes all item vectors -> load into HNSW;
# online compute user_tower once, ANN top-k
One-line trade-off: every stage picks a point between "model accuracy" and "candidate volume it can handle", narrowing progressively.
Principle: a ranking model easily has millions of parameters, ingests hundreds of features, and does user×item target attention — scoring one candidate takes milliseconds, so scoring 1B candidates is a 10⁶× budget, impossible. Hence the staging: retrieval uses a dot-product (O(1) approximate search) from a billion to thousands; ranking uses a heavy model from hundreds to tens; a pre-rank buffer sits between — usually a distilled small model of the ranker, accuracy between retrieval and ranking, cutting thousands to hundreds so retrieval doesn't flood ranking. Each stage's objective differs: retrieval optimizes for not missing (high recall), ranking for ordering precisely (high AUC / calibrated pCTR).
One-line trade-off: trade "short-term experience loss from exploring new content" for "long-term gain from accumulating feedback data".
Principle: CF methods are helpless for items/users with no interactions — a structural defect. Two directions: item cold start relies on content features (the two-tower item tower ingests text/image/category, so new items still get a vector into retrieval), but an embedding alone isn't enough — you also need impressions to collect feedback; user cold start relies on onboarding interest selection, demographics, device/geo side features, plus fast probing. The core is explore-exploit: pure exploit (only show known high-scorers) means new content never gets data and new users get blockbusters. Use a multi-armed bandit (Thompson sampling / UCB) to allocate exploration budget by "uncertainty" — give items with high estimate variance more impressions.
# Thompson sampling to allocate exploration to new items (pseudo-code)
# each item keeps Beta(α, β): α=clicks+1, β=non-clicks+1
def pick(items):
return max(items, key=lambda it: beta_sample(it.alpha, it.beta))
# new item α=β=1 (uniform prior) -> high sampling variance -> chance to be explored;
# after feedback, update α/β, estimate converges, naturally shifts from explore to exploit
One-line trade-off: trade "a new generative paradigm with cold-start generalization" for "the engineering certainty and low latency of mature two-tower+ANN".
Principle: traditional retrieval is "learn embeddings → ANN nearest neighbors". Generative retrieval flips it: give each item a Semantic ID — quantize content embeddings via RQ-VAE into a sequence of semantic tokens (semantically close items share prefix tokens) — then train a Transformer to autoregressively "generate" the Semantic ID of the next item to recommend, turning retrieval into sequence generation where the Transformer itself is the index. Upside: semantically close items share tokens, naturally friendly to cold and long-tail items (a new item lands in a nearby token space as long as its content is similar). Another branch uses an LLM as ranker/feature extractor, leveraging world knowledge to understand item semantics and produce explainable recommendations.
The root is serving's precomputation need. Two-tower can retrieve from a billion items in milliseconds because item vectors can all be computed offline and loaded into an ANN index; online you compute the user vector once and run nearest-neighbor. The moment you introduce a user×item cross feature before scoring (e.g. "this user's historical CTR for this item's category"), the item vector depends on the specific user and can't be computed independently offline — you'd have to recompute it per user for a billion items, blowing the budget, and ANN can't be used (ANN requires fixed item vectors).
Ranking can cross because it faces only a few hundred candidates — it can afford to assemble features per (user, item) pair in real time and run target attention. This is exactly the point of funnel staging: defer the expensive crossing to the stage where candidates are few enough. So "two-tower for retrieval, cross model for ranking" isn't habit — it's a structure derived from compute constraints.
The "impressed-but-not-clicked" trap: retrieval training aims to distinguish relevant vs irrelevant out of a billion, but impressed-but-not-clicked items have already been filtered by the full retrieval→ranking funnel — they're "decent but unclicked" hard negatives, a distribution worlds apart from "a random item from the full corpus". Train only on them and the model learns to "pick among good items", and its discrimination collapses online when facing a billion truly random items (sample selection bias).
The fix: use full-corpus random / in-batch negatives as the bulk (to mimic the real retrieval distribution), with a few hard negatives for precision. The in-batch catch: others' positives in the batch become my negatives, and popular items appear in batches more often, getting repeatedly suppressed as negatives and thus underestimated. The fix is Yi et al 2019's sampling-bias correction — subtract log(item sampling frequency) from logits, discounting by popularity to recover an unbiased estimate.
Mechanism: a user clicks position 1 partly just because it's at position 1 (more visible), not because it's most relevant. If you treat clicks directly as a "relevant" label, the model learns the tautology "what's ranked high gets clicked". After deployment it ranks those items even higher → gets more clicks → the next training round is more confident → positive feedback locks it in, and relevant items once ranked low never recover; diversity and long-term satisfaction keep declining.
Correction: ① position as feature — feed display position as a training input, set it to a fixed value (e.g. 0) at serving, so the model outputs "position-independent relevance"; ② IPS (inverse propensity score) — weight samples by the examination probability of the position, so clicks further down get higher weight; ③ randomized traffic to shuffle positions and collect unbiased data. The core mindset: a click is a mix of relevance × examination probability, and you must strip out the examination part.
Order-of-magnitude estimate (don't memorize exact values): 10⁹ items × 64 dims × 4 bytes (float32) ≈ 256GB for raw vectors alone. HNSW also stores the graph adjacency, typically another 1.5–3× → on the order of 0.5–1TB. A single machine can't hold it, meaning the ANN index must be sharded (hash items across retrieval nodes, scatter-gather, echoing Day 12's search).
Architectural implications: ① higher dims aren't always better — 128 dims double memory over 64 and slow ANN, so trade off; ② to save memory use quantization (PQ, Day 12's IVF-PQ), trading precision for memory; ③ the index must support incremental inserts + periodic rebuilds; ④ this also explains the appeal of generative retrieval (TIGER) — Semantic IDs represent items with discrete tokens, potentially bypassing the "billion dense vectors all in memory" cost wall.
How it happens: the model can only learn feedback from content it has recommended — content never shown gets no click data. So high-scoring items get more impressions, more positive feedback, higher scores next round; cold/new content gets no impressions, stays "unknown", treated as low score. The data distribution narrows, users get locked into ever more homogeneous content, and on the creator side (a two-sided market) only the head survives, withering supply.
Why offline metrics worsen it: offline evaluation uses historical logs, which are themselves the product of the old model, carrying its preferences. A model that "exploits historical popularity more aggressively" often scores higher on offline AUC — because it aligns better with the logic that generated the logs — yet that's exactly the bubble accelerant. Good offline metrics ≠ healthy long-term online ecosystem.
How to break it: ① active exploration (bandits) gives low-confidence content an impression budget, continually replenishing diverse data; ② explicit diversity constraints at re-rank (MMR, DPP, same-author spread); ③ backstop with online A/B long-term retention, diversity, creator Gini coefficient and other ecosystem metrics, not just single-shot CTR; ④ off-policy evaluation to estimate "what if we used a different policy", escaping the old logs' self-justifying loop.