Day 13 Hard Recommendation Two-Tower Multi-stage Ranking Cold Start

Recommendation Systems: A Funnel From a Billion Candidates to Twenty

Retrieval, Multi-stage Ranking, Cold Start, Generative Recommendation

Problem & Constraints

Design the backend for a short-video / e-commerce discovery feed with 200M DAU (TikTok's "For You", Instagram Explore, Taobao's "Guess You Like"): on every pull-to-refresh, pick the 20 items most likely to keep the user engaged out of a corpus of ~1B items, end-to-end < 100ms. The hard part isn't "train a CTR model" — it's the funnel of compute and latency: you cannot run a heavy model over all 1B candidates.

Scale: 1B-item corpus, 200M DAU, hundreds of thousands of QPS at peak.
Latency budget: < 100ms end-to-end, meaning the ranking model can only score a few hundred candidates; retrieval must pull a few thousand out of a billion in milliseconds.
Multi-objective: not just CTR, but completion rate, dwell time, conversion, and diversity — single-objective optimization drives people into filter bubbles.
Cold start: millions of new items daily, new users with no history — how to keep them from sinking.
Unreliable feedback: an impression without a click ≠ dislike (maybe unseen); conversion feedback may arrive days later (delayed feedback).

High-level Architecture (Multi-stage Funnel)

graph LR
    U["User request
user features"] --> RT["Realtime features
recent behavior"]
    RT --> REC["Retrieval
two-tower+ANN / CF / rules
1B → thousands"]
    REC --> PRE["Pre-rank
light distilled model
thousands → hundreds"]
    PRE --> RANK["Ranking
heavy · multi-objective
hundreds → tens"]
    RANK --> RR["Re-rank
diversity · spread · biz rules"]
    RR --> OUT["Top-20 feed"]
    FS[("Feature Store
embeddings / stats")] -.-> REC
    FS -.-> PRE
    FS -.-> RANK

    classDef u fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef stage fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class U,RT,OUT u
    class REC,PRE,RANK,RR stage
    class FS store

Each stage cuts the candidate set by an order of magnitude; later models are heavier with finer features

Component roles: Retrieval uses extremely cheap methods (vector dot-product + ANN, collaborative filtering, operational rules) to coarse-filter a few thousand from a billion — optimizing for high recall, not precision. Pre-rank uses a lightweight model to cut thousands to hundreds, a compute buffer between retrieval and ranking. Ranking runs the most expensive multi-objective deep model, scoring each candidate precisely. Re-rank handles global constraints ranking can't — diversity, same-author spread, ad insertion. The Feature Store serves embeddings and statistical features uniformly so training and serving use the same source.
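
The funnel can be sketched as a chain of progressively more expensive scorers applied to progressively fewer candidates. This is a minimal illustration, not a production design: the three scoring functions, the stage sizes, and the 10,000-item toy corpus are all assumptions invented for the example, and the re-rank stage is left out.

```python
import heapq
import random

def top_k(candidates, score, k):
    """Keep the k highest-scoring candidates (one funnel stage)."""
    return heapq.nlargest(k, candidates, key=score)

# Hypothetical scorers: each later stage is assumed to be costlier and more accurate.
def retrieval_score(item):   # stands in for a two-tower dot product
    return item["coarse"]

def prerank_score(item):     # stands in for a small distilled model
    return 0.5 * item["coarse"] + 0.5 * item["fine"]

def rank_score(item):        # stands in for the heavy multi-objective ranker
    return item["fine"]

def funnel(corpus, sizes=(2000, 300, 20)):
    n_retrieve, n_prerank, n_rank = sizes
    stage = top_k(corpus, retrieval_score, n_retrieve)   # corpus -> thousands
    stage = top_k(stage, prerank_score, n_prerank)       # thousands -> hundreds
    return top_k(stage, rank_score, n_rank)              # hundreds -> tens

rng = random.Random(0)
corpus = [{"id": i, "coarse": rng.random(), "fine": rng.random()} for i in range(10_000)]
feed = funnel(corpus)
```

The expensive `rank_score` only ever sees 300 items here, which is the whole point of the staging.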

Key Technical Points

1. Retrieval: Collaborative Filtering vs Content vs Two-Tower

One-line trade-off: trade "ability to generalize to never-co-occurred / cold items" for "ability to precompute offline + accelerate with ANN".

Principle: retrieval must pull a few thousand from a billion in milliseconds; the core idea is mapping both users and items into the same vector space where relevant = close, then retrieving with ANN (HNSW, see Day 12). Three approaches: collaborative filtering (CF / matrix factorization) relies on co-occurrence ("people who watched A also watched B"), pure ID, no content features; content retrieval uses item text/image/category features, naturally recalls cold items; two-tower is the industry mainstream — a user tower and item tower each ingest arbitrary features to produce vectors, trained to pull positive pairs close and push negatives apart; at serving, item vectors are precomputed offline into an ANN index and the user vector is computed once online for a dot-product.

Method	Generalize/cold	Features	Serving	Typical
item-CF / MF	Poor (needs co-occur)	Pure ID	Precomputed similarity	Amazon "bought also bought"
Content retrieval	Good (cold-friendly)	Content features	Vector ANN	New-item fallback
Two-tower	Good	Arbitrary features	Offline item vectors + ANN	YouTube / Instagram

Trade-off:

Two-tower's hard constraint: the user and item towers cannot cross features before scoring (otherwise item vectors can't be precomputed offline). This loses the "this specific user × this specific item" interaction signal — so two-tower is only fit for retrieval; ranking does the cross features.
Negatives are the crux: using only "impressed-but-not-clicked" as negatives introduces sample selection bias (those are hard negatives that already survived the retrieval and ranking funnel, not a sample of the full corpus). Mainstream practice uses in-batch negatives (other users' positives in the same batch serve as my negatives) plus full-corpus random negatives, with sampling-bias correction (popular items show up as in-batch negatives more often, so each logit is corrected by the item's log sampling frequency).
Multi-channel retrieval: no single channel covers all intents; production is a union of dozens of channels (two-tower + CF + popular + followed + geo…), each serving one retrieval intent.
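
Merging channels mostly comes down to a deduplicated union with per-channel quotas. The sketch below shows one possible way to do it; the channel names, quotas, and item IDs are hypothetical, and real systems often interleave channels or weight them instead of taking them in a fixed order.

```python
def merge_channels(channel_results, quotas, limit):
    """Union per-channel candidate lists, deduplicated, honoring a per-channel quota.

    channel_results: {channel_name: [item_id, ...]} ordered best-first.
    quotas: {channel_name: max items taken from that channel}.
    """
    merged, seen = [], set()
    for name, items in channel_results.items():
        taken = 0
        for item_id in items:
            if taken >= quotas.get(name, 0) or len(merged) >= limit:
                break
            if item_id not in seen:      # an item recalled by two channels counts once
                seen.add(item_id)
                merged.append((item_id, name))
                taken += 1
    return merged

results = {
    "two_tower": [101, 102, 103, 104],
    "item_cf":   [102, 201, 202],
    "popular":   [101, 301],
}
candidates = merge_channels(results, {"two_tower": 3, "item_cf": 2, "popular": 1}, limit=10)
```

Keeping the source channel next to each item ID makes it possible to measure later which channels actually contribute clicks.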

# Two-tower in-batch softmax retrieval (PyTorch sketch; the towers, features, temperature
# and log_item_freq are assumed to be defined by the surrounding training loop)
import torch
import torch.nn.functional as F

u = user_tower(user_feats)              # [B, d]
v = item_tower(item_feats)              # [B, d]  B positive items in batch
logits = u @ v.T / temperature          # [B, B]  diagonal = positive pairs
logits = logits - log_item_freq         # [B] broadcast over columns: sampling-bias correction,
                                        # discounts popular items as negatives
labels = torch.arange(u.size(0), device=u.device)
loss = F.cross_entropy(logits, labels)  # treat the rest of the batch as negatives
# serving: item_tower precomputes all item vectors -> load into HNSW;
#          online compute user_tower once, ANN top-k
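
The serving half can be illustrated without any ML library. In this sketch an exact brute-force top-k stands in for the HNSW lookup, which is only viable for a toy index; the item names and 3-dim vectors are made up for the example.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(user_vec, item_index, k):
    """Exact top-k by dot product; a real system would query an ANN index instead."""
    return heapq.nlargest(k, item_index, key=lambda item_id: dot(user_vec, item_index[item_id]))

# Offline: item vectors are computed once and stored (hypothetical toy values).
item_index = {
    "cat_video":  [0.9, 0.1, 0.0],
    "dog_video":  [0.8, 0.3, 0.1],
    "phone_case": [0.0, 0.2, 0.9],
}
# Online: one user vector per request, then a single nearest-neighbor lookup.
user_vec = [1.0, 0.2, 0.0]
top2 = retrieve(user_vec, item_index, k=2)
```

Nothing in `item_index` depends on the user, which is what allows it to be built offline and shared across all requests.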

Real-world cases:

YouTube: "Deep Neural Networks for YouTube Recommendations" (RecSys 2016) splits recommendation into candidate generation + ranking; retrieval averages watch history into an embedding fed to a feed-forward net, then matches the resulting user vector against video embeddings by nearest-neighbor search, a direct precursor of the two-tower retrieval paradigm.
Google / YouTube: "Sampling-Bias-Corrected Neural Modeling" (Yi et al, RecSys 2019) systematically covers sampling-bias correction for two-tower in-batch negatives, a classic reference.
Instagram Explore: uses two-tower for candidate generation (account embeddings + IGQL); cacheable user/item vectors make inference extremely cheap (Meta Engineering blog 2023).
Airbnb: "Real-time Personalization using Embeddings" (KDD 2018) learns listing embeddings from sessions, compressing listings into 32-dim vectors for real-time personalization.

2. The Multi-stage Funnel: Why Not One Model

One-line trade-off: every stage picks a point between "model accuracy" and "candidate volume it can handle", narrowing progressively.

Principle: a ranking model easily has millions of parameters, ingests hundreds of features, and does user×item target attention. Scoring one candidate costs on the order of milliseconds of compute, so scoring all 1B candidates would be roughly 10⁶× the work of scoring the few hundred candidates the latency budget allows, which is impossible. Hence the staging: retrieval uses a dot product plus ANN (sublinear approximate search, roughly O(log N) for HNSW) to go from a billion to thousands; ranking uses a heavy model to go from hundreds to tens; a pre-rank buffer sits in between, usually a small model distilled from the ranker with accuracy between retrieval and ranking, cutting thousands to hundreds so retrieval doesn't flood ranking. Each stage's objective differs: retrieval optimizes for not missing (high recall), ranking for ordering precisely (high AUC / calibrated pCTR).
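
The arithmetic behind "impossible" is worth writing out once. The per-item cost and the number of ranked candidates below are assumptions chosen for illustration, not measurements.

```python
CORPUS = 1_000_000_000      # items, from the problem statement
RANKED_PER_REQUEST = 500    # assumed: "a few hundred" candidates reach the heavy ranker
COST_PER_ITEM_MS = 1        # assumed: ~1 ms of heavy-ranker compute per candidate, unbatched

# Full scan versus the work the ranker actually does per request.
work_ratio = CORPUS // RANKED_PER_REQUEST
# Wall-clock time of a serial full scan, converted from milliseconds to days.
full_scan_days = CORPUS * COST_PER_ITEM_MS / 1000 / 86_400

print(f"a full scan is {work_ratio:,}x the ranker's per-request work")
print(f"serial full scan at 1 ms/item: about {full_scan_days:.1f} days per request")
```

Batching and parallel hardware shift these numbers by a few orders of magnitude, but not by the six or seven needed to skip retrieval.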

Trade-off:

One fewer stage: retrieval directly to ranking → ranking is flooded with thousands of candidates, timing out or forced to simplify the model, dropping accuracy.
One more stage: each added stage adds an RPC and maintenance cost, and misaligned objectives across stages "fight" — what retrieval likes, ranking dislikes, hurting funnel efficiency. You must align pre-rank to ranking (distillation).
Calibration: ranking outputs used as true probabilities (multiplied by bids, mixing objectives) require pCTR calibration; multi-objective is fused via MMoE multi-task, and the weights themselves are a product decision.
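
One common way to mix calibrated per-objective predictions into a single ranking score is a weighted product. The sketch below assumes that form; the objective names, weights, and probabilities are hypothetical, and the source does not prescribe a specific fusion formula.

```python
import math

def fuse(preds, weights):
    """Weighted geometric fusion: exp(sum_k w_k * log p_k). Assumes calibrated p_k in (0, 1]."""
    return math.exp(sum(w * math.log(preds[name]) for name, w in weights.items()))

# Hypothetical weights; in practice these encode a product decision.
weights = {"p_click": 1.0, "p_finish": 2.0, "p_convert": 0.5}

clickbait = {"p_click": 0.30, "p_finish": 0.10, "p_convert": 0.01}
solid     = {"p_click": 0.15, "p_finish": 0.60, "p_convert": 0.02}

ranked = sorted([("clickbait", clickbait), ("solid", solid)],
                key=lambda pair: fuse(pair[1], weights), reverse=True)
```

With completion weighted twice as heavily as clicks, the item with half the click probability but six times the completion probability ranks first. The product only means something if each head is calibrated, which is why calibration matters here.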

Real-world cases:

YouTube: candidate generation → ranking two-stage is the industry template; the ranking stage directly optimizes expected watch time rather than CTR, avoiding clickbait.
Pinterest: PinSage (KDD 2018) uses graph convolution on a 3B-node graph to generate pin embeddings for retrieval, then ranking — the GNN-retrieval benchmark.
Instagram Explore: explicitly splits candidate generation + ranking, and uses distillation to let lightweight retrieval approximate the heavy ranker (Meta Engineering blog).

3. Cold Start: New Users and New Items, Both Directions

One-line trade-off: trade "short-term experience loss from exploring new content" for "long-term gain from accumulating feedback data".

Principle: CF methods are helpless for items/users with no interactions, a structural defect. Two directions: item cold start relies on content features (the two-tower item tower ingests text/image/category, so new items still get a vector and enter retrieval), but an embedding alone isn't enough: you also need impressions to collect feedback; user cold start relies on onboarding interest selection, demographics, device/geo side features, plus fast probing. The core is explore-exploit: pure exploit (only show known high-scorers) means new content never gets data and new users only ever see generic blockbusters. Use a multi-armed bandit (Thompson sampling / UCB) to allocate exploration budget by "uncertainty": give items with high estimate variance more impressions.

# Thompson sampling to allocate exploration to new items (Python sketch)
# each item keeps Beta(α, β): α = clicks + 1, β = non-clicks + 1
import random

def pick(items):
    # draw one plausible CTR per item from its posterior and show the best draw
    return max(items, key=lambda it: random.betavariate(it.alpha, it.beta))
# new item: α = β = 1 (uniform prior) -> high sampling variance -> chance to be explored;
# after feedback, update α/β; the estimate converges and the policy shifts from explore to exploit

Trade-off:

Too much exploration: users get bothered by irrelevant content, short-term retention drops.
Too conservative: new content/creators get no traffic, the supply side withers (death spiral of a two-sided market); recommendation degrades to "the hotter, the more shown".
ε-greedy vs bandit: ε-greedy is simple but explores blindly; a contextual bandit explores directionally by context — more efficient but engineering-complex.
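
The difference between blind and uncertainty-directed exploration shows up even in a two-item toy. Note that UCB1 is used here as a simple non-contextual stand-in; a contextual bandit would also condition on user features. The item names and counts are invented for the example.

```python
import math
import random

def epsilon_greedy(stats, epsilon, rng):
    """With probability epsilon pick a uniformly random item, else the best observed CTR."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))
    return max(stats, key=lambda i: stats[i]["clicks"] / max(stats[i]["shows"], 1))

def ucb1(stats):
    """Pick the item with the highest observed mean plus an uncertainty bonus (UCB1)."""
    total = sum(s["shows"] for s in stats.values())
    def bound(i):
        s = stats[i]
        if s["shows"] == 0:
            return float("inf")            # never-shown items are tried first
        return s["clicks"] / s["shows"] + math.sqrt(2 * math.log(total) / s["shows"])
    return max(stats, key=bound)

stats = {
    "veteran": {"clicks": 500, "shows": 10_000},   # CTR 5%, well measured
    "newbie":  {"clicks": 0,   "shows": 5},        # no clicks yet, barely measured
}
greedy_choice = epsilon_greedy(stats, epsilon=0.0, rng=random.Random(0))
ucb_choice = ucb1(stats)
```

Pure exploitation keeps showing the veteran. UCB1 picks the newbie because five impressions are too few to rule it out. ε-greedy would reach the newbie only by chance, and spends just as much of its exploration on items that are already well measured.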

Real-world cases:

Spotify: uses a contextual bandit framework (its explore-exploit method is often cited as BaRT, "Bandits for Recommendations as Treatments") to balance familiarity vs discovery and give cold-start tracks/podcasts exposure.
Airbnb: new listings have no interaction history, so content-feature embeddings bring them into retrieval, then iterate with real booking feedback (KDD 2018).
YouTube / TikTok-like: new videos are probed with small traffic first, then scaled up once CTR/completion hits thresholds — essentially tiered exploration.

4. Generative Recommendation: Semantic ID and LLMs

One-line trade-off: gain "a new generative paradigm with cold-start generalization" at the cost of "the engineering certainty and low latency of mature two-tower+ANN".

Principle: traditional retrieval is "learn embeddings → ANN nearest neighbors". Generative retrieval flips it: give each item a Semantic ID — quantize content embeddings via RQ-VAE into a sequence of semantic tokens (semantically close items share prefix tokens) — then train a Transformer to autoregressively "generate" the Semantic ID of the next item to recommend, turning retrieval into sequence generation where the Transformer itself is the index. Upside: semantically close items share tokens, naturally friendly to cold and long-tail items (a new item lands in a nearby token space as long as its content is similar). Another branch uses an LLM as ranker/feature extractor, leveraging world knowledge to understand item semantics and produce explainable recommendations.
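
The quantization step that yields a Semantic ID can be shown in isolation. This sketch covers only residual quantization against fixed codebooks; RQ-VAE would learn the codebooks jointly with an encoder and decoder, and the 2-dim embeddings and codebook values here are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vec)))

def semantic_id(embedding, codebooks):
    """Residual quantization: each level encodes what the previous levels missed."""
    residual, tokens = list(embedding), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens)

# Hypothetical 2-level codebooks over 2-dim embeddings (RQ-VAE would learn these).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],                 # level 1: coarse topic
    [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1]],    # level 2: fine detail
]
cooking_a = semantic_id([1.1, 0.0], codebooks)
cooking_b = semantic_id([0.9, 0.0], codebooks)
sports    = semantic_id([0.0, 1.1], codebooks)
```

The two nearby embeddings share their first token and differ only in the second, while the distant one gets a different prefix. A brand-new item with similar content would land under the same prefix, which is where the cold-start friendliness comes from.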

Trade-off:

Generative vs two-tower+ANN: ✅ end-to-end, good cold-start generalization, unifies retrieval and ranking; ❌ autoregressive decoding latency is higher than one ANN query, billion-item industrial deployment is still early, controllability/diversity need tuning.
LLM as ranker: ✅ strong semantic understanding, explainable; ❌ inference cost and latency are orders of magnitude higher than a conventional ranker's, making it hard to fit a 100ms online path; mostly used for offline feature generation or small-candidate re-ranking.
Pragmatic path: use a large model offline to produce better item embeddings fed back into two-tower retrieval — get the semantic gain without paying online latency.

Real-world cases:

Google TIGER: "Recommender Systems with Generative Retrieval" (NeurIPS 2023) proposes Semantic ID + autoregressive generative retrieval; the paper shows better recall for cold and long-tail items.
YouTube / Google: explore Semantic ID in production recommendation, injecting content semantics into large-scale recommendation models.

Scaling & Optimization

Realtime features: a user's last N interactions are critical to CTR; use Kafka (Day 8) + Flink to write behavior into the Feature Store within seconds for retrieval/ranking to read live.
Training-serving parity: training and online must use the same feature logic, otherwise training-serving skew makes online far worse than offline — the Feature Store's core value is unifying this.
Embedding updates: the item corpus changes daily, so item-tower vectors must be recomputed periodically + incrementally loaded into ANN; user embeddings can update near-real-time online.
Multi-task (MMoE): CTR/completion/conversion share a base with per-task heads, avoiding one model per objective and easing objective conflict.
Online learning: time-sensitive scenarios (news, short video) update ranking near-real-time to track distribution drift.

Common Pitfalls + Interview Questions

Pitfall: using only "impressed-but-not-clicked" as negatives. These are hard negatives already filtered by retrieval — biased distribution. Retrieval training must mix in full-corpus random negatives with sampling-bias correction, or online recall quality collapses.
Pitfall: ignoring position bias. A user clicks position 1 partly just because it's at position 1, not because it's more relevant. Using clicks as labels directly self-reinforces; you must de-bias position (position as a feature in training, set to zero at serving).
Pitfall: single-objective CTR optimization. It breeds clickbait and inflated dwell with falling satisfaction. YouTube's early shift from CTR to watch time was exactly for this.
Pitfall: feedback loops (echo chamber). The model only recommends what it has recommended, data narrows, diversity collapses. Break it with exploration + re-rank diversity.
Interview Q: why can't two-tower do user×item cross features before scoring? (Answer: item vectors must be precomputed offline into ANN; crossing couples the two towers.)
Interview Q: what are the objectives and metrics of retrieval, pre-rank, and ranking? Why can't they merge?
Interview Q: how to cold-start a new item? How does explore-exploit quantify exploration budget?
Interview Q: how to handle delayed feedback (conversion arriving days later) for labels? (Answer: waiting window, negative correction, importance weighting.)

Deep Resources

YouTube: "Deep Neural Networks for YouTube Recommendations" (Covington et al, RecSys 2016) — the foundational two-stage retrieval+ranking paper.
Google: "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations" (Yi et al, RecSys 2019) — two-tower retrieval and negative-sampling bias correction.
Pinterest: "Graph Convolutional Neural Networks for Web-Scale Recommender Systems" (PinSage, KDD 2018) — industrial-scale GNN retrieval.
Meta Engineering blog: Scaling the Instagram Explore recommendations system (2023) — the full engineering picture of multi-stage retrieval+ranking+distillation.
Google: "Recommender Systems with Generative Retrieval" (TIGER, NeurIPS 2023) — Semantic ID and generative retrieval.

Deeper Reflection

1. Why can't two-tower retrieval do user×item cross features before scoring? What's the root cause? Why can ranking?

The root is serving's precomputation need. Two-tower can retrieve from a billion items in milliseconds because item vectors can all be computed offline and loaded into an ANN index; online you compute the user vector once and run nearest-neighbor. The moment you introduce a user×item cross feature before scoring (e.g. "this user's historical CTR for this item's category"), the item vector depends on the specific user and can't be computed independently offline — you'd have to recompute it per user for a billion items, blowing the budget, and ANN can't be used (ANN requires fixed item vectors).

Ranking can cross because it faces only a few hundred candidates — it can afford to assemble features per (user, item) pair in real time and run target attention. This is exactly the point of funnel staging: defer the expensive crossing to the stage where candidates are few enough. So "two-tower for retrieval, cross model for ranking" isn't habit — it's a structure derived from compute constraints.

2. How to choose retrieval negatives? Why is "impressed-but-not-clicked" a trap? What's the catch with in-batch negatives?

The "impressed-but-not-clicked" trap: retrieval training aims to distinguish relevant vs irrelevant out of a billion, but impressed-but-not-clicked items have already been filtered by the full retrieval→ranking funnel — they're "decent but unclicked" hard negatives, a distribution worlds apart from "a random item from the full corpus". Train only on them and the model learns to "pick among good items", and its discrimination collapses online when facing a billion truly random items (sample selection bias).

The fix: use full-corpus random / in-batch negatives as the bulk (to mimic the real retrieval distribution), with a few hard negatives for precision. The in-batch catch: others' positives in the batch become my negatives, and popular items appear in batches more often, getting repeatedly suppressed as negatives and thus underestimated. The fix is Yi et al 2019's sampling-bias correction — subtract log(item sampling frequency) from logits, discounting by popularity to recover an unbiased estimate.
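
A small numeric example makes the effect of the correction visible. The scores and sampling probabilities below are made up; in practice the probabilities would be estimated from the training stream.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def corrected_row(scores, sampling_prob):
    """logQ correction: s_c(u, j) = s(u, j) - log p_j for each in-batch item j."""
    return [s - math.log(p) for s, p in zip(scores, sampling_prob)]

# One user's raw scores against 3 in-batch items; index 0 is the true positive.
scores = [2.0, 1.0, 1.0]
# Assumed in-batch sampling probabilities: item 1 is popular, item 2 is rare.
sampling_prob = [0.01, 0.10, 0.001]

raw = softmax(scores)
fixed = softmax(corrected_row(scores, sampling_prob))
```

For a softmax cross-entropy loss, the gradient on a negative's logit equals its softmax probability. Without the correction the two negatives are pushed down equally; with it, the popular item receives a much smaller push than the rare one, compensating for how often it appears as a negative.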

3. Using clicks as labels for ranking self-reinforces position bias. How to correct it? Why does ignoring it get worse over time?

Mechanism: a user clicks position 1 partly just because it's at position 1 (more visible), not because it's most relevant. If you treat clicks directly as a "relevant" label, the model learns the tautology "what's ranked high gets clicked". After deployment it ranks those items even higher → gets more clicks → the next training round is more confident → positive feedback locks it in, and relevant items once ranked low never recover; diversity and long-term satisfaction keep declining.

Correction: ① position as feature: feed display position as a training input, set it to a fixed value (e.g. 0) at serving, so the model outputs "position-independent relevance"; ② IPS (inverse propensity score): weight each sample by the inverse of the examination probability of its position, so clicks further down get higher weight; ③ randomized traffic to shuffle positions and collect unbiased data. The core mindset: a click is a mix of relevance × examination probability, and you must strip out the examination part.
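
The IPS idea can be checked on synthetic logs. Everything in this sketch is assumed for illustration: the examination probabilities per position, and two items constructed to be equally relevant but shown at different positions.

```python
def ips_weighted_ctr(logs, propensity):
    """Estimate position-free relevance: sum(click / P(examined | pos)) / impressions."""
    total = sum(click / propensity[pos] for pos, click in logs)
    return total / len(logs)

# Assumed examination probabilities per display position (these would be estimated,
# e.g. from randomized traffic); position 1 is always seen, position 3 only 25% of the time.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# (position, clicked) logs for two equally relevant items shown at different positions.
item_top    = [(1, 1)] * 40 + [(1, 0)] * 60     # 40% raw CTR at position 1
item_buried = [(3, 1)] * 10 + [(3, 0)] * 90     # 10% raw CTR at position 3

naive_top = sum(c for _, c in item_top) / len(item_top)
naive_buried = sum(c for _, c in item_buried) / len(item_buried)
ips_top = ips_weighted_ctr(item_top, propensity)
ips_buried = ips_weighted_ctr(item_buried, propensity)
```

Raw CTR says the top item is four times better. After dividing each click by its examination probability the two estimates coincide, which is the "relevance × examination" decomposition at work.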

4. Estimate: 1B items, two-tower 64-dim embedding — how much memory for the item index? What does it imply architecturally?

Order-of-magnitude estimate (don't memorize exact values): 10⁹ items × 64 dims × 4 bytes (float32) ≈ 256GB for raw vectors alone. HNSW also stores the graph adjacency, which typically brings the total to 1.5–3× the raw vectors, i.e. roughly 0.4–0.8TB. A single machine can't comfortably hold that in memory, meaning the ANN index must be sharded (hash items across retrieval nodes, scatter-gather, echoing Day 12's search).

Architectural implications: ① higher dims aren't always better — 128 dims double memory over 64 and slow ANN, so trade off; ② to save memory use quantization (PQ, Day 12's IVF-PQ), trading precision for memory; ③ the index must support incremental inserts + periodic rebuilds; ④ this also explains the appeal of generative retrieval (TIGER) — Semantic IDs represent items with discrete tokens, potentially bypassing the "billion dense vectors all in memory" cost wall.
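
The estimate above is quick to reproduce. The HNSW overhead range follows the text; the per-node memory and the PQ code size are assumptions picked only to make the sharding and compression arithmetic concrete.

```python
import math

N_ITEMS = 1_000_000_000
DIM = 64
BYTES_PER_FLOAT = 4                      # float32

raw_gb = N_ITEMS * DIM * BYTES_PER_FLOAT / 1e9      # raw vectors only
hnsw_gb = (1.5 * raw_gb, 3.0 * raw_gb)              # total including graph links

# Assumptions for illustration: 64 GB of index per node, PQ with 16 one-byte codes per vector.
NODE_GB = 64
shards = math.ceil(hnsw_gb[1] / NODE_GB)
PQ_BYTES = 16
pq_gb = N_ITEMS * PQ_BYTES / 1e9

print(f"raw: {raw_gb:.0f} GB, with HNSW: {hnsw_gb[0]:.0f}-{hnsw_gb[1]:.0f} GB")
print(f"shards at {NODE_GB} GB each: {shards}, PQ codes only: {pq_gb:.0f} GB")
```

Under these assumptions PQ shrinks the vector payload by 16×, which is why quantization is usually the first lever pulled before adding shards.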

5. Recommendation feedback loops create filter bubbles. How does this second-order effect happen? Why does purely optimizing offline metrics worsen it? How to break it?

How it happens: the model can only learn feedback from content it has recommended — content never shown gets no click data. So high-scoring items get more impressions, more positive feedback, higher scores next round; cold/new content gets no impressions, stays "unknown", treated as low score. The data distribution narrows, users get locked into ever more homogeneous content, and on the creator side (a two-sided market) only the head survives, withering supply.

Why offline metrics worsen it: offline evaluation uses historical logs, which are themselves the product of the old model, carrying its preferences. A model that "exploits historical popularity more aggressively" often scores higher on offline AUC — because it aligns better with the logic that generated the logs — yet that's exactly the bubble accelerant. Good offline metrics ≠ healthy long-term online ecosystem.

How to break it: ① active exploration (bandits) gives low-confidence content an impression budget, continually replenishing diverse data; ② explicit diversity constraints at re-rank (MMR, DPP, same-author spread); ③ backstop with online A/B long-term retention, diversity, creator Gini coefficient and other ecosystem metrics, not just single-shot CTR; ④ off-policy evaluation to estimate "what if we used a different policy", escaping the old logs' self-justifying loop.
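
Of the diversity constraints listed, MMR is the simplest to sketch. The version below is a minimal greedy implementation; the item IDs, relevance scores, the same-author similarity function, and `lam=0.7` are all assumptions for the example.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedy MMR: pick argmax of lam * relevance - (1 - lam) * max similarity to chosen."""
    chosen, pool = [], list(candidates)
    while pool and len(chosen) < k:
        def mmr(c):
            redundancy = max((similarity(c, s) for s in chosen), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy data: similarity is 1.0 for the same author and 0.0 otherwise (an assumption).
author = {"a1": "alice", "a2": "alice", "a3": "alice", "b1": "bob"}
relevance = {"a1": 0.90, "a2": 0.88, "a3": 0.86, "b1": 0.70}

def same_author(x, y):
    return 1.0 if author[x] == author[y] else 0.0

feed = mmr_rerank(list(relevance), relevance, same_author, k=3)
```

Sorting by relevance alone would fill all three slots with the same author. MMR moves the second author up to slot two even though that item has the lowest raw score.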
