Day 12 Hard Search Inverted Index Elasticsearch Vector Search

Search Systems — From Inverted Index to Semantic RetrievalInverted Index, Lucene Segments, Sharding, Vector ANN

Problem & Constraints

Design the backend for a 1-billion-document IM / e-commerce search (Discord message search, Taobao product search): a user types keywords and you must return the most relevant 20 results within 200ms. The hard part isn't "install Elasticsearch" — it's that writes and queries must both be fast. Documents pour in at 50K/s, queries hit at 50K QPS, and you must support prefix completion, typo tolerance, and "find similar meaning" semantic retrieval.

Scale: 1B docs × ~1KB body; the inverted index inflates to 30%–150% of raw text (depending on whether you store positions / term vectors).
Writes: 50K docs/s with near-real-time visibility (write-to-searchable < 1s), but not at the cost of query latency.
Queries: 50K QPS, p99 < 200ms; both BM25 keyword matching and embedding semantic recall.
Relevance: top-20 ranking quality is the product's lifeline — inaccurate search is no search.
Constraint: a single Lucene index caps at ~2.1 billion docs (MAX_DOC = 2^31), forcing sharding from day one.

High-Level Architecture

graph TD
    subgraph Write Path
      SRC[("Source DB / CDC")] --> Q["Index Queue
Kafka"]
      Q --> IW["Index Workers
bulk"]
      IW --> SEG["Immutable Segment
Lucene"]
    end
    subgraph Query Path
      U["User query"] --> CO["Coordinator
query router"]
      CO -->|scatter| S1["Shard 1
inverted+HNSW"]
      CO -->|scatter| S2["Shard 2"]
      CO -->|scatter| S3["Shard N"]
      S1 -->|top-k| CO
      S2 -->|top-k| CO
      S3 -->|top-k| CO
      CO -->|gather + rerank| U
    end
    SEG -.refresh 1s.-> S1

    classDef src fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    classDef pipe fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef shard fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef coord fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    class SRC src
    class Q,IW,SEG pipe
    class S1,S2,S3 shard
    class CO,U coord

Writes go async through a queue in batches → produce immutable segments; queries scatter-gather across shards, the coordinator merges and reranks

Component roles: writes don't hit the index directly — they go through Kafka first to decouple, smooth bursts, and guarantee at-least-once; Index Workers batch them into bulk writes (single-doc writes to Lucene are slow). Each shard is an independent Lucene index holding both an inverted index (keywords) and an HNSW graph (vectors). The Coordinator broadcasts the query to all shards, merges each shard's top-k, and does final reranking.

Key Technical Points

1. The Inverted Index: Heart of a Search Engine

One-line trade-off: spend space and write cost to buy O(query terms) instead of O(documents) retrieval.

Principle: a forward index maps "doc → terms"; search needs the inverse. An inverted index maps "term → list of docs containing it (the posting list)". Searching "red dress" means taking the two posting lists and intersecting them. Each posting also stores term frequency (tf) and positions (for phrase queries). Scoring uses BM25: higher tf, shorter doc, rarer term (IDF) → higher score. The term dictionary itself is compressed with an FST (finite state transducer): millions of terms fit in a few MB while still supporting prefix matching.

Trade-off:

Storing positions / term vectors: ✅ enables phrase search, highlighting, more-like-this; ❌ doubles index size. Turn it off if you don't need phrase search.
Posting-list compression: doc ids are ascending → delta encoding + VByte/PFor; dense high-frequency terms use Roaring Bitmaps for fast intersection. The cost is CPU for decompression.
Prefix inversion (instant search): build posting lists for every prefix "e/en/eng…" of "engine" so search-as-you-type works; the cost is several-fold index bloat — Algolia's approach.

# posting-list intersection (with skip-list acceleration, galloping)
def intersect(p1, p2):           # p1 short, p2 long
    out = []
    for doc in p1:
        # use skip pointer on p2 to jump to position >= doc
        if p2.advance_to(doc) == doc:
            out.append(doc)      # in both lists → match
    return out
# key: intersect the shortest list first to maximize pruning

Real-world cases:

Lucene / Elasticsearch: FST dictionary + Roaring Bitmap postings, the foundation of most search.
Algolia: builds inverted lists for all prefixes for millisecond instant search (see their Inside the Algolia Engine series).
Postgres GIN: full-text search's inverted index is literally called a Generalized Inverted Index.

2. Immutable Segments & "Online vs Offline Indexing"

One-line trade-off: visibility latency vs write throughput vs query fragmentation — a three-way tug-of-war.

Principle: a Lucene segment is immutable once flushed. New docs don't modify old segments; they accumulate in an in-memory buffer and are periodically refreshed into a new small segment to become searchable (ES default 1s — the origin of "near-real-time / NRT"). Deletes/updates aren't in-place; they write a tombstone marker, filtered at query time and purged only during merge. Small segments pile up and slow queries, so a background merge combines them into big ones.

Trade-off (refresh interval):

Short refresh (1s): ✅ writes visible fast; ❌ many segment fragments, heavy merge pressure, severe write amplification.
Long refresh (30s) or offline rebuild: ✅ high throughput, big segments query faster; ❌ data visible later. When bulk-loading historical data, turn refresh off entirely (set -1), then re-enable.
Online incremental vs offline full rebuild: changing the analyzer / mapping often can't be done online — you rebuild a new index offline and atomically swap the alias (zero-downtime reindex).

Real-world cases:

Discord: messages are lazily indexed — not written in real time; a queue lets workers pull batches and bulk-write to ES to exploit batch throughput (see How Discord Indexes Trillions of Messages).
Elasticsearch: refresh_interval and force-merge are the first knobs for tuning write throughput.
LinkedIn (Galene): online incremental indexing + offline Hadoop full builds running in parallel.

3. Sharding, Scatter-Gather & the Deep-Pagination Trap

One-line trade-off: sharding gives parallelism and horizontal scale, but merging and tail latency are the price.

Principle: 1B docs don't fit one machine, so split into N shards by doc-id hash or business dimension (Discord shards by guild/DM). A query runs scatter-gather: the coordinator broadcasts to every shard, each shard computes its local top-k and returns it, and the coordinator merges into a global top-k. Replicas provide HA and absorb read QPS.

Deep-pagination trap: to get results 10000–10020 (from=10000,size=20), every shard must return its first 10020 results to the coordinator for sorting — memory and network blow up linearly with page number. The fix is search_after (a cursor on the last result's sort value) or scroll (snapshot iteration). Never let users jump to page N.

Trade-off (number of shards):

More shards: ✅ higher parallelism, smaller shards; ❌ wider fan-out per query, the slowest shard sets p99 (tail-latency amplification), bigger merge cost.
Routing key: route a tenant's data to one shard so queries hit a single shard, avoiding broadcast — but risks a hot shard.
Inconsistent IDF: BM25's IDF is computed per-shard, so uneven doc distribution makes cross-shard scores incomparable; DFS query-then-fetch corrects this with global stats.

Real-world cases:

Discord: routes shards by guild/DM; hit the single-Lucene-index MAX_DOC ≈ 2B hard cap and went multi-cluster (now ~40 ES clusters, p50 < 100ms).
Elasticsearch: query-then-fetch in two phases — shards first return id+score, then after merging only the winning docs' bodies are fetched, avoiding shipping all bodies.

4. Vector Retrieval: From Keywords to Meaning

One-line trade-off: recall vs memory vs build/update cost — ANN algorithms pick a point in this triangle.

Principle: keyword search can't match "notebook computer" ≈ "laptop". Map both documents and queries to embeddings via a model so that semantically close = vectors close. Brute-forcing distances over 1B vectors is infeasible, so use ANN (approximate nearest neighbor). HNSW builds a multi-layer navigation graph: sparse long edges on top layers approach quickly, dense short edges on lower layers refine; a query greedily descends from the top, with near-logarithmic complexity.

Algorithm	Recall	Memory	Updates	Best for
HNSW	High	Large (all in RAM)	Hard (deletes messy)	Low latency, medium scale
IVF-PQ	Medium	Small (quantized)	Easier	Billion-scale, RAM-frugal
Brute force (flat)	100%	Large	Easy	Under a few million, exact

# HNSW query: greedy descent layer by layer
def search(q, entry, top_layer):
    cur = entry
    for layer in range(top_layer, 0, -1):     # coarse navigation up top
        cur = greedy_nearest(q, cur, layer)   # hop to a closer neighbor
    # layer 0: beam search with a candidate heap, return ef candidates
    return beam_search(q, cur, layer=0, ef=128)

Hybrid is the production answer: pure vectors lose exact matches (model numbers like "A1706", names); pure keywords lose meaning. In production you recall via BM25 and vectors in parallel, merge with RRF (reciprocal rank fusion), then rerank the top-100 with a cross-encoder. The next post (recommendation systems) digs into reranking.

Real-world cases:

Lucene / Elasticsearch: built-in HNSW for kNN queries, mixable with BM25 in the same engine.
pgvector: a Postgres extension supporting HNSW/IVFFlat, sparing small teams a separate vector DB.
Faiss (Meta): the industrial benchmark for IVF-PQ, the memory-compression scheme for billion-scale vectors.

Scaling & Optimization

Read/write split: physically isolate the indexing cluster (writes) from the query cluster (reads) so merge storms don't drag down queries; replicas scale query QPS.
Hot/cold tiering: keep recent data on SSD hot tier, force-merge history into big segments and move to cheap warm/cold tiers (standard for Discord/log systems).
Query caching: cache result sets and filter bitmaps for frequent queries (echoing Day 2).
Multi-stage retrieval: recall (BM25+vector, fast and broad) → coarse rank → fine rank (expensive model only on top-100), trading compute for relevance.
CDC into the index: use Debezium to stream DB binlog changes into Kafka then the index, keeping search eventually consistent with the primary (echoing Day 7 Outbox).

Pitfalls & Interview Questions

Pitfall: treating search like the DB's LIKE. ES isn't a transactional store, refresh lags, and it isn't strongly consistent; it's derived data. The source of truth is always the primary DB, and you must be able to rebuild from it anytime.
Pitfall: picking shard count by gut and never changing it. ES primary shard count is fixed at index creation; changing it requires reindex. Do capacity estimation first (10–50GB per shard recommended).
Pitfall: ignoring tokenization and relevance tuning. If CJK tokenization, synonyms, and stemming are wrong, fast recall is still wrong recall. Relevance is the product core, not an infra detail.
Interview Q: why are Lucene segments immutable? A: lock-free concurrent reads, friendly to OS page cache and compression, simple crash recovery; the cost is tombstone-based deletes and background merges.
Interview Q: why can cross-shard BM25 scores be inaccurate, and how do you fix it? (per-shard IDF → DFS global stats)
Interview Q: why does from+size deep pagination explode, and what replaces it? (each shard fetches from+size → search_after cursor)
Interview Q: why is deleting a doc from an HNSW vector index tricky? (in-place deletion breaks graph connectivity; usually soft-delete + periodic rebuild)

Further Resources

DDIA — Designing Data-Intensive Applications (Kleppmann), Ch. 3 "Storage and Retrieval" covers full-text indexing and LSM ideas.
Discord Engineering: How Discord Indexes Trillions of Messages (multi-cluster ES, Rust, the MAX_DOC limit).
Algolia Blog: Inside the Algolia Engine series (inverted index, prefix instant search, indexing vs query separation).
HNSW paper: Malkov & Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (arXiv 1603.09320, IEEE TPAMI 2020).
Elasticsearch: The Definitive Guide (official) on segments, refresh, scatter-gather, query-then-fetch.

Going Deeper (click to expand)

1. Setting refresh_interval to 1s vs 30s — what are the second-order effects on write throughput, query latency, and disk? Why set it to -1 during bulk import?

Each refresh produces a new small segment. 1s: segments spawn fast, segment count explodes → queries must scan more segments (a query merges per-segment results), and merge threads frantically combine small segments, raising write amplification (the same data read/written repeatedly) and CPU/disk IO. 30s: accumulate longer and flush a bigger segment — fewer segments, less merge, higher write throughput, at the cost of 30s visibility.

Bulk import sets -1 (disable auto refresh): when backfilling history nobody searches it in real time, so flushing segments every second is pure waste — turn it off so all docs accumulate in the buffer, then manually refresh once + force-merge into big segments. This can boost import throughput several-fold. It's the extreme choice on the "visibility latency vs write throughput" trade-off: during import, sacrifice all visibility for throughput.

2. A user searches a super-hot term (e.g. "sale" matching 50M docs) — what happens to scatter-gather, and how do you save it?

Every shard must traverse this enormous posting list to score BM25 and take its local top-k; CPU spikes. The coordinator merges per-shard results. The problem isn't merging (only top-k is shipped) but each shard scanning a massive posting list and scoring, plus the slowest shard setting overall latency.

Fixes: ① WAND / Block-Max WAND dynamic pruning — maintain the current top-k score threshold and skip docs that can't make the top-k, scoring far fewer candidates; ② cache hot-term results; ③ add filters to shrink the candidate set (time range, category); ④ use two-phase: cheap recall then fine rank. Lucene's BMW is the default query accelerator, designed exactly for these long-tail hot terms.

3. Now that semantic vector search exists, why not drop the BM25 inverted index entirely and go all-vector?

Vectors excel at "similar meaning" but lose exact matching: searching a product model "A1706", an order number, a name, or a code symbol, the embedding blends it with a pile of "roughly similar" things, while the user wants that exact one. Vectors also: ① recall approximately (ANN misses); ② can't explain why something matched (inverted indexes can point to which term hit); ③ require running a model for both index and query, raising cost; ④ filters and facets are dirt-cheap on an inverted index but awkward on a pure vector store.

So production is hybrid: BM25 for precision and explainability, vectors for semantic recall, fused with RRF. Google, Taobao, and Elasticsearch all run both in parallel rather than picking one. "All-vector" is a common interview over-simplification trap.

4. Estimate: 1B docs, 1KB each — roughly how big is the inverted index? How many shards and how much memory?

Order-of-magnitude (just the magnitude, don't memorize exact values): raw text is 1B × 1KB ≈ 1TB. The inverted index is typically 30%–150% of raw text — storing positions/term vectors approaches or exceeds raw, while plain doc-id postings are much smaller. Take a middle value of about 0.5–1TB index.

Sharding: 10–50GB per shard recommended (too big = slow recovery, long merges); take 30GB → about 20–40 primary shards. Add one replica and docs/storage double.

Memory: search lives on OS page cache (keep hot segments resident) and JVM heap (FST dictionary, caches). Rule of thumb: leave enough filesystem cache to hold the hot segments. If you add HNSW vectors: 1B × 768 dims × 4 bytes ≈ 3TB, so keeping all of HNSW in RAM is unrealistic → you must quantize with IVF-PQ, or only build vector indexes for hot data. This step often exposes a candidate's lack of intuition that "vectors are an order of magnitude more expensive than text".

5. How do the search index and primary DB stay consistent? Why is dual-write an anti-pattern, and what's correct? (ties to Day 7)

Dual-write (the app writes DB and ES simultaneously) is an anti-pattern: the two writes aren't atomic, so DB succeeds while ES fails (or vice versa) → permanent inconsistency, with no retry boundary and out-of-order races under concurrency. This is exactly the cross-system dual-write Day 7's distributed transactions warn against.

Correct approach: ① Outbox + CDC — the app writes only the primary (with an outbox table, or reading the binlog directly), and Debezium reliably streams changes into Kafka; index workers consume and write ES. The primary is the single source of truth and ES is derived data, rebuildable in full anytime. ② Message processing must be idempotent (upsert by doc id, version-numbered to drop stale out-of-order updates). ③ Accept eventual consistency: search results briefly lagging the primary is fine (echoing Day 6). The key mindset: a search index is always a "rebuildable cache", never the source of truth.

← Back to index