Day 12 Hard Search Inverted Index Elasticsearch Vector Search

Search Systems — From Inverted Index to Semantic RetrievalInverted Index, Lucene Segments, Sharding, Vector ANN

Problem & Constraints

Design the backend for a 1-billion-document IM / e-commerce search (Discord message search, Taobao product search): a user types keywords and you must return the most relevant 20 results within 200ms. The hard part isn't "install Elasticsearch" — it's that writes and queries must both be fast. Documents pour in at 50K/s, queries hit at 50K QPS, and you must support prefix completion, typo tolerance, and "find similar meaning" semantic retrieval.

High-Level Architecture

graph TD
    subgraph Write Path
      SRC[("Source DB / CDC")] --> Q["Index Queue
Kafka"] Q --> IW["Index Workers
bulk"] IW --> SEG["Immutable Segment
Lucene"] end subgraph Query Path U["User query"] --> CO["Coordinator
query router"] CO -->|scatter| S1["Shard 1
inverted+HNSW"] CO -->|scatter| S2["Shard 2"] CO -->|scatter| S3["Shard N"] S1 -->|top-k| CO S2 -->|top-k| CO S3 -->|top-k| CO CO -->|gather + rerank| U end SEG -.refresh 1s.-> S1 classDef src fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef pipe fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef shard fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef coord fill:#1a2530,stroke:#64c8ff,color:#e8eef5 class SRC src class Q,IW,SEG pipe class S1,S2,S3 shard class CO,U coord

Writes go async through a queue in batches → produce immutable segments; queries scatter-gather across shards, the coordinator merges and reranks

Component roles: writes don't hit the index directly — they go through Kafka first to decouple, smooth bursts, and guarantee at-least-once; Index Workers batch them into bulk writes (single-doc writes to Lucene are slow). Each shard is an independent Lucene index holding both an inverted index (keywords) and an HNSW graph (vectors). The Coordinator broadcasts the query to all shards, merges each shard's top-k, and does final reranking.

Key Technical Points

1. The Inverted Index: Heart of a Search Engine

One-line trade-off: spend space and write cost to buy O(query terms) instead of O(documents) retrieval.

Principle: a forward index maps "doc → terms"; search needs the inverse. An inverted index maps "term → list of docs containing it (the posting list)". Searching "red dress" means taking the two posting lists and intersecting them. Each posting also stores term frequency (tf) and positions (for phrase queries). Scoring uses BM25: higher tf, shorter doc, rarer term (IDF) → higher score. The term dictionary itself is compressed with an FST (finite state transducer): millions of terms fit in a few MB while still supporting prefix matching.

Trade-off:
# posting-list intersection (with skip-list acceleration, galloping)
def intersect(p1, p2):           # p1 short, p2 long
    out = []
    for doc in p1:
        # use skip pointer on p2 to jump to position >= doc
        if p2.advance_to(doc) == doc:
            out.append(doc)      # in both lists → match
    return out
# key: intersect the shortest list first to maximize pruning
Real-world cases:

2. Immutable Segments & "Online vs Offline Indexing"

One-line trade-off: visibility latency vs write throughput vs query fragmentation — a three-way tug-of-war.

Principle: a Lucene segment is immutable once flushed. New docs don't modify old segments; they accumulate in an in-memory buffer and are periodically refreshed into a new small segment to become searchable (ES default 1s — the origin of "near-real-time / NRT"). Deletes/updates aren't in-place; they write a tombstone marker, filtered at query time and purged only during merge. Small segments pile up and slow queries, so a background merge combines them into big ones.

Trade-off (refresh interval):
Real-world cases:

3. Sharding, Scatter-Gather & the Deep-Pagination Trap

One-line trade-off: sharding gives parallelism and horizontal scale, but merging and tail latency are the price.

Principle: 1B docs don't fit one machine, so split into N shards by doc-id hash or business dimension (Discord shards by guild/DM). A query runs scatter-gather: the coordinator broadcasts to every shard, each shard computes its local top-k and returns it, and the coordinator merges into a global top-k. Replicas provide HA and absorb read QPS.

Deep-pagination trap: to get results 10000–10020 (from=10000,size=20), every shard must return its first 10020 results to the coordinator for sorting — memory and network blow up linearly with page number. The fix is search_after (a cursor on the last result's sort value) or scroll (snapshot iteration). Never let users jump to page N.
Trade-off (number of shards):
Real-world cases:

4. Vector Retrieval: From Keywords to Meaning

One-line trade-off: recall vs memory vs build/update cost — ANN algorithms pick a point in this triangle.

Principle: keyword search can't match "notebook computer" ≈ "laptop". Map both documents and queries to embeddings via a model so that semantically close = vectors close. Brute-forcing distances over 1B vectors is infeasible, so use ANN (approximate nearest neighbor). HNSW builds a multi-layer navigation graph: sparse long edges on top layers approach quickly, dense short edges on lower layers refine; a query greedily descends from the top, with near-logarithmic complexity.

AlgorithmRecallMemoryUpdatesBest for
HNSWHighLarge (all in RAM)Hard (deletes messy)Low latency, medium scale
IVF-PQMediumSmall (quantized)EasierBillion-scale, RAM-frugal
Brute force (flat)100%LargeEasyUnder a few million, exact
# HNSW query: greedy descent layer by layer
def search(q, entry, top_layer):
    cur = entry
    for layer in range(top_layer, 0, -1):     # coarse navigation up top
        cur = greedy_nearest(q, cur, layer)   # hop to a closer neighbor
    # layer 0: beam search with a candidate heap, return ef candidates
    return beam_search(q, cur, layer=0, ef=128)
Hybrid is the production answer: pure vectors lose exact matches (model numbers like "A1706", names); pure keywords lose meaning. In production you recall via BM25 and vectors in parallel, merge with RRF (reciprocal rank fusion), then rerank the top-100 with a cross-encoder. The next post (recommendation systems) digs into reranking.

Real-world cases:

Scaling & Optimization

Pitfalls & Interview Questions

Further Resources

Going Deeper (click to expand)

1. Setting refresh_interval to 1s vs 30s — what are the second-order effects on write throughput, query latency, and disk? Why set it to -1 during bulk import?

Each refresh produces a new small segment. 1s: segments spawn fast, segment count explodes → queries must scan more segments (a query merges per-segment results), and merge threads frantically combine small segments, raising write amplification (the same data read/written repeatedly) and CPU/disk IO. 30s: accumulate longer and flush a bigger segment — fewer segments, less merge, higher write throughput, at the cost of 30s visibility.

Bulk import sets -1 (disable auto refresh): when backfilling history nobody searches it in real time, so flushing segments every second is pure waste — turn it off so all docs accumulate in the buffer, then manually refresh once + force-merge into big segments. This can boost import throughput several-fold. It's the extreme choice on the "visibility latency vs write throughput" trade-off: during import, sacrifice all visibility for throughput.

2. A user searches a super-hot term (e.g. "sale" matching 50M docs) — what happens to scatter-gather, and how do you save it?

Every shard must traverse this enormous posting list to score BM25 and take its local top-k; CPU spikes. The coordinator merges per-shard results. The problem isn't merging (only top-k is shipped) but each shard scanning a massive posting list and scoring, plus the slowest shard setting overall latency.

Fixes: ① WAND / Block-Max WAND dynamic pruning — maintain the current top-k score threshold and skip docs that can't make the top-k, scoring far fewer candidates; ② cache hot-term results; ③ add filters to shrink the candidate set (time range, category); ④ use two-phase: cheap recall then fine rank. Lucene's BMW is the default query accelerator, designed exactly for these long-tail hot terms.

3. Now that semantic vector search exists, why not drop the BM25 inverted index entirely and go all-vector?

Vectors excel at "similar meaning" but lose exact matching: searching a product model "A1706", an order number, a name, or a code symbol, the embedding blends it with a pile of "roughly similar" things, while the user wants that exact one. Vectors also: ① recall approximately (ANN misses); ② can't explain why something matched (inverted indexes can point to which term hit); ③ require running a model for both index and query, raising cost; ④ filters and facets are dirt-cheap on an inverted index but awkward on a pure vector store.

So production is hybrid: BM25 for precision and explainability, vectors for semantic recall, fused with RRF. Google, Taobao, and Elasticsearch all run both in parallel rather than picking one. "All-vector" is a common interview over-simplification trap.

4. Estimate: 1B docs, 1KB each — roughly how big is the inverted index? How many shards and how much memory?

Order-of-magnitude (just the magnitude, don't memorize exact values): raw text is 1B × 1KB ≈ 1TB. The inverted index is typically 30%–150% of raw text — storing positions/term vectors approaches or exceeds raw, while plain doc-id postings are much smaller. Take a middle value of about 0.5–1TB index.

Sharding: 10–50GB per shard recommended (too big = slow recovery, long merges); take 30GB → about 20–40 primary shards. Add one replica and docs/storage double.

Memory: search lives on OS page cache (keep hot segments resident) and JVM heap (FST dictionary, caches). Rule of thumb: leave enough filesystem cache to hold the hot segments. If you add HNSW vectors: 1B × 768 dims × 4 bytes ≈ 3TB, so keeping all of HNSW in RAM is unrealistic → you must quantize with IVF-PQ, or only build vector indexes for hot data. This step often exposes a candidate's lack of intuition that "vectors are an order of magnitude more expensive than text".

5. How do the search index and primary DB stay consistent? Why is dual-write an anti-pattern, and what's correct? (ties to Day 7)

Dual-write (the app writes DB and ES simultaneously) is an anti-pattern: the two writes aren't atomic, so DB succeeds while ES fails (or vice versa) → permanent inconsistency, with no retry boundary and out-of-order races under concurrency. This is exactly the cross-system dual-write Day 7's distributed transactions warn against.

Correct approach: ① Outbox + CDC — the app writes only the primary (with an outbox table, or reading the binlog directly), and Debezium reliably streams changes into Kafka; index workers consume and write ES. The primary is the single source of truth and ES is derived data, rebuildable in full anytime. ② Message processing must be idempotent (upsert by doc id, version-numbered to drop stale out-of-order updates). ③ Accept eventual consistency: search results briefly lagging the primary is fine (echoing Day 6). The key mindset: a search index is always a "rebuildable cache", never the source of truth.