DAY 40 / PHASE 4 · ENGINEERING

Data Pipeline for AI

Embedding ETL · Incremental Index · Index Versioning · Feature Store

2026-06-19 · BigCat

Every model upgrade makes things better — but the vector store you feed it may still live in last quarter's coordinate system. The data pipeline is the real infrastructure that keeps RAG from decaying over time.

// WHY THIS MATTERS

Most RAG demos die at "it works once": embed all docs in one shot, load into a vector store, ship. The real pain starts on day two. A doc changes one line — do you re-embed the whole store or just that chunk? Will a deleted page still get retrieved? You found a better embedding model — what happens to the old vectors? The "clicks in the last 7 days" feature you feed the reranker — is training computing the same value as serving? None of these are model problems; they are data-pipeline engineering problems, and every one of them silently corrodes production quality without throwing an error. This issue dissects the AI data pipeline into its four most collapse-prone load-bearing points: the cost structure of embedding ETL, incremental indexing's diff and delete, index migration on embedding-model upgrade, and offline/online consistency in the feature store. Each comes with a copy-pasteable skeleton and its failure mode.

// 01

Embedding ETL: Content-Hash Dedup Is the Pipeline's Choke Point

Claim: the embedding call is the most expensive, slowest step in the pipeline. Cache it as a pure function instead of recomputing every time.

Background & Principle

An AI data pipeline is essentially extract → chunk → embed → load. Of the first three steps, extract and chunk are nearly free (local CPU, milliseconds); embed is the bottleneck: it crosses the network, bills per token, and hits rate limits. So the first engineering principle is to treat embed(chunk) as a pure function: for a fixed model, identical text yields the same vector for all practical purposes, which means you can cache on hash(model_id + chunk_text). When the same chunk appears in multiple docs, or one doc gets re-ingested repeatedly, every cache hit skips the API call entirely, so cost and latency fall in proportion to the hit rate. The second principle is batching: embedding APIs accept a batch per call, and one call with 100 chunks has far higher throughput than 100 single-chunk calls. The third is to decouple chunk from embed: chunking strategy gets retuned constantly, but any chunk whose text comes out unchanged should not be paid for again.

┌───────────── Embedding ETL: hash cache is the core valve ─────────────┐
│                                                                       │
│  raw docs ─▶ extract ─▶ chunk ─▶ ┌──────────────┐                     │
│     (ms-level, ~free)            │ hash(model+  │                     │
│                                  │  chunk_text) │                     │
│                                  └──────┬───────┘                     │
│                                  hit? ──┤                             │
│                   ┌── yes ──────────────┘── no ──┐                    │
│                   ▼                              ▼                    │
│             cached vector                   ┌──────────┐              │
│           (0 cost/latency)                  │ embed()  │ ◀ only paid/ │
│                   │                         │ batch 100│ rate-limited │
│                   │                         └────┬─────┘              │
│                   └──────────────┬───────────────┘                    │
│                                  ▼                                    │
│                         upsert → vector DB                            │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

Hands-on

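# Embedding as pure function: content hash as cache key + batching
import hashlib

MODEL = "voyage-3"  # bake model_id into the key; swapping models auto-misses
BATCH = 100         # chunks per API call; tune to your provider's limit
# `client` is an embedding client created elsewhere, e.g. voyageai.Client()

def key(text):
    return hashlib.sha256(f"{MODEL}:{text}".encode()).hexdigest()

def embed_batch(chunks, cache):
    # dedupe misses (order kept) so a repeated chunk is billed once
    todo = list(dict.fromkeys(c for c in chunks if key(c) not in cache))
    for i in range(0, len(todo), BATCH):      # pay only for misses
        part = todo[i:i + BATCH]
        vecs = client.embed(part, model=MODEL).embeddings
        for c, v in zip(part, vecs):
            cache[key(c)] = v
    return [cache[key(c)] for c in chunks]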

Baking model_id into the cache key is the setup for the next section on swapping models — a model change auto-invalidates every key, forcing a re-embed instead of silently mixing vectors from two coordinate systems.

Failure mode: using "doc ID + chunk index" as the cache key. Edit the doc content and the chunk's text changes but its ID/index doesn't — the cache hits and returns the stale vector. A silently returned stale embedding is harder to catch than an error. The cache key must include a hash of the content itself, not just position.
Going deeper · Anthropic Contextual Retrieval (prepend per-chunk context before embedding; Anthropic reports about 35% fewer top-20 retrieval failures from contextual embeddings alone), anthropic.com/news/contextual-retrieval
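
The stale-vector failure mode above can be reproduced in a few lines. This is a minimal sketch, not production code: `fake_embed` is a hypothetical stand-in for a real embedding call, and `position_key` / `content_key` are illustrative names for the two keying strategies being compared.

```python
import hashlib

MODEL = "voyage-3"  # same model id the section uses

def fake_embed(text):
    # Hypothetical stand-in for a real embedding call: deterministic per text.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

def position_key(doc_id, idx, text):
    # Anti-pattern: the key ignores the text entirely.
    return f"{doc_id}:{idx}"

def content_key(doc_id, idx, text):
    # Key derived from the model id plus the text that is actually embedded.
    return hashlib.sha256(f"{MODEL}:{text}".encode()).hexdigest()

def cached_embed(cache, key_fn, doc_id, idx, text):
    k = key_fn(doc_id, idx, text)
    if k not in cache:
        cache[k] = fake_embed(text)
    return cache[k]

old_text, new_text = "Refunds take 30 days.", "Refunds take 14 days."

pos_cache, content_cache = {}, {}
cached_embed(pos_cache, position_key, "faq", 0, old_text)
cached_embed(content_cache, content_key, "faq", 0, old_text)

# The doc is edited in place: same doc id, same chunk index, new text.
stale = cached_embed(pos_cache, position_key, "faq", 0, new_text)
fresh = cached_embed(content_cache, content_key, "faq", 0, new_text)

print(stale == fake_embed(old_text))  # position key served the old vector
print(fresh == fake_embed(new_text))  # content key re-embedded the new text
```

Both lines print True: the position-keyed cache never notices the edit, while the content-keyed cache misses and stores a second entry for the new text.
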
// 02

Incremental Indexing: Diff-Driven Upsert — Don't Forget the Delete Tombstone

Claim: full re-embedding every time is a beginner's tax. The right way is to diff by content hash and touch only changed chunks — and handle deletes explicitly.

Background & Principle

A document corpus is alive: adds, edits, deletes happen daily. Full rebuilds are tolerable while docs are few, but past ten thousand they become hours of daily compute plus a multiplied API bill. The core of incremental indexing is maintaining a "chunk fingerprint table" (chunk_id → content_hash). Each ingest is compared to the last to derive three operations: add (new hash), change (same id, different hash → re-embed + upsert), and delete (present last time, gone now → remove from the vector store). Most people do the first two; deletes are almost always missed — because they "don't error." The consequence of a missed delete is ghost recall: the doc is long gone but its vector still sits in the store getting retrieved, and the model generates from stale or wrong content. Deletes must go through a tombstone flow that guarantees the vector truly disappears from the store.

Hands-on

# Three diff states: add / change / delete (delete is the easy miss)
def reindex(new_chunks, fingerprint_store, vdb, cache):
    old = fingerprint_store.load()        # {chunk_id: hash}
    new = {c.id: c.hash for c in new_chunks}

    changed = [c for c in new_chunks
               if old.get(c.id) != c.hash]      # add + change
    deleted = old.keys() - new.keys()     # ← key: was there, now gone

    if changed:
        vecs = embed_batch([c.text for c in changed], cache)
        # pass ids with the vectors so each upsert overwrites the right entry
        vdb.upsert(ids=[c.id for c in changed], vectors=vecs)
    if deleted: vdb.delete(ids=list(deleted))  # tombstone, kills ghost recall
    fingerprint_store.save(new)           # save last: a crash just re-runs the diff

The fingerprint table is itself a cheap "data version snapshot" — when something breaks you can reconcile it against the vector store to find exactly which entries drifted between index and source.
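
The reconciliation described above can be sketched as a pure set comparison. This is an illustrative sketch under assumptions: `reconcile` is a hypothetical helper, and how you list the ids currently held in your vector store depends on the store, so here they are passed in as a plain list.

```python
def reconcile(fingerprints, store_ids):
    """Compare the fingerprint table with the ids actually in the vector store.

    fingerprints: {chunk_id: content_hash} from the last successful ingest.
    store_ids:    iterable of chunk ids currently present in the vector store.
    """
    expected, actual = set(fingerprints), set(store_ids)
    return {
        "ghosts": sorted(actual - expected),   # in the store, gone from source
        "missing": sorted(expected - actual),  # in source, never indexed
    }

fingerprints = {"doc1#0": "a1", "doc1#1": "b2", "doc2#0": "c3"}
store_ids = ["doc1#0", "doc1#1", "doc9#0"]     # doc9 was deleted upstream

report = reconcile(fingerprints, store_ids)
print(report)  # {'ghosts': ['doc9#0'], 'missing': ['doc2#0']}
```

Ghosts are candidates for deletion; missing ids are candidates for a re-embed. Running this check on a schedule turns a slow, untraceable quality decay into a concrete list of ids.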

Failure mode: incremental does upsert only, never delete. Six months later the store is full of residue from deleted docs, recall quality slowly degrades, and the cause is untraceable. Second trap: re-embedding the whole doc as one unit — fix one typo and you re-embed all of its dozens of chunks. Diff granularity must be chunk, not document.
Going deeper · DVC (git-for-data; version and reproduce datasets and pipeline stages), doc.dvc.org/user-guide/pipelines
// 03

Embedding Versioning: Swapping the Model = Swapping the Coordinate System — Blue-Green Rebuild

Claim: vectors from different embedding models do not live in the same space. Mixing them is computing distances across two coordinate systems — the result is noise.

Background & Principle

This is the most insidious and costliest class of incident. Treat every embedding-model upgrade, even a minor version of the same model, as a change to the geometry of the vector space: the cosine similarity between a sentence embedded by model A and the same sentence embedded by model B is meaningless. If, "to save money," you re-embed only new docs and keep old vectors from the old model, the index now mixes two incomparable coordinate systems, and recall degrades unpredictably from then on without ever raising an error. The right approach borrows blue-green from deployment: rebuild the full corpus with the new model into a separate index, dual-write to both during the transition, A/B the recall quality on an offline eval set, and only after confirming no regression atomically switch query traffic, keeping the old index for rollback. Key discipline: write the embedding_model version into each vector's metadata, so mixing can be rejected at the query layer.

┌──── Swapping embedding model: blue-green index migration ────┐
│                                                              │
│  write ──┬──▶ [Blue index]  model v1 ◀─ live queries         │
│          │    (old, full)                                    │
│          └──▶ [Green index] model v2                         │
│               (new, full re-embed in background)             │
│                     │                                        │
│             A/B recall quality                               │
│             (offline eval set)                               │
│                     │ no regression?                         │
│                     ▼ atomic swap                            │
│  write ──┬──▶ [Green index] model v2 ◀─ live queries         │
│          └──▶ [Blue index]  kept → rollback                  │
│                                                              │
│  ✗ anti-pattern: mixing v1+v2 vectors in one index → distort │
└──────────────────────────────────────────────────────────────┘
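
The flow in the diagram can be sketched as a small router that dual-writes and flips a single "live" pointer. This is a toy in-memory illustration, not a real vector-database API: `IndexRouter`, its methods, the lambda embedders, and the recall numbers are all hypothetical, and the background backfill of existing docs into the green index is left out.

```python
class IndexRouter:
    """Toy in-memory blue-green router; a real vector DB would sit behind it."""

    def __init__(self, live_name, live_model):
        self.indexes = {live_name: {"model": live_model, "vectors": {}}}
        self.live = live_name          # alias that queries resolve through

    def add_index(self, name, model):
        self.indexes[name] = {"model": model, "vectors": {}}

    def write(self, chunk_id, text, embedders):
        # Dual-write: every index gets the chunk, embedded by its own model.
        for index in self.indexes.values():
            index["vectors"][chunk_id] = embedders[index["model"]](text)

    def promote(self, name, recall_new, recall_old):
        # Gate the swap on the offline eval; the old index stays for rollback.
        if recall_new < recall_old:
            return False
        self.live = name               # single assignment = the atomic swap
        return True

# Stand-in embedders with different output shapes, one per model version.
embedders = {"v1": lambda t: [float(len(t))],
             "v2": lambda t: [float(len(t)), 1.0]}

router = IndexRouter("blue", "v1")
router.add_index("green", "v2")
router.write("c1", "hello", embedders)

rejected = router.promote("green", recall_new=0.80, recall_old=0.88)
accepted = router.promote("green", recall_new=0.91, recall_old=0.88)
print(rejected, accepted, router.live)  # False True green
```

Each index only ever holds vectors from its own model, and rollback is just pointing `live` back at "blue".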

Hands-on

# Bind model version into vector metadata; check same source at query time
vdb.upsert(id=cid, vector=v, metadata={
    "embedding_model": "voyage-3",   # ← version fingerprint
    "indexed_at": ts, "doc_id": doc})

def search(q, vdb, model):
    # explicit raise, not assert: asserts are stripped under `python -O`
    if vdb.meta("embedding_model") != model:
        raise ValueError(
            "query model != index model: coordinate systems differ!")
    return vdb.query(embed(q, model=model), top_k=10)

Failure mode: assuming equal dimensions mean compatibility. Two unrelated models can both emit 1024-dim vectors (for example, text-embedding-3-large requested with dimensions=1024 and an open model whose native output is 1024-dim). Shapes line up and the code doesn't error, but the semantic spaces are entirely different and recall is effectively random. Dimension compatibility ≠ space compatibility. There is no "incremental upgrade" for swapping models, only a full rebuild.
Going deeper · Chip Huyen, Designing Machine Learning Systems — the chapters on data/model versioning and iteration are the standard reference for data-pipeline version governance.
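
The "same dimensions, different space" trap above is easy to see with toy vectors. In this sketch the two "models" are hand-written 4-dim lookups (purely illustrative, not real embedding models); model B has exactly the same geometry as model A, only with its axes rotated.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "model A": hand-written 4-dim vectors, similar topics are close.
MODEL_A = {
    "cat":   [1.0, 0.0, 0.0, 0.0],
    "dog":   [0.9, 0.1, 0.0, 0.0],
    "taxes": [0.0, 0.0, 1.0, 0.0],
}

def model_b(word):
    # Toy "model B": same geometry, but the axes are rotated by two positions.
    v = MODEL_A[word]
    return v[2:] + v[:2]

def nearest(query_vec, index):
    return max(index, key=lambda w: cosine(query_vec, index[w]))

# The index was built with model B and holds "dog" and "taxes".
index_b = {w: model_b(w) for w in MODEL_A if w != "cat"}

same_space = nearest(model_b("cat"), index_b)    # query embedded with B
mixed_space = nearest(MODEL_A["cat"], index_b)   # query embedded with A

print(same_space, mixed_space)  # dog taxes
```

Nothing errors in the mixed case: the shapes match, a result comes back, and it is the wrong one.
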
// 04

Feature Store: Offline/Online Consistency to Prevent Training-Serving Skew

Claim: the features you feed a reranker / personalization must be the same value, computed as of the same point in time, in training and serving — otherwise offline metrics look great while production collapses.

Background & Principle

RAG is not only vectors. Reranker ordering, personalized recall, and filter rules often need structured features: a user's clicks in the last 7 days, doc popularity, author authority. Two classic traps live here. First, training-serving skew: training computes features with offline batch SQL while serving uses a different real-time code path — the two logics inevitably drift, so offline eval rises while production drops. Second, data leakage: when building training samples, feature values must be taken as of the prediction's point in time, not "now" — otherwise the model peeks at the future during training, offline scores inflate, and reality bites on launch. A feature store (e.g., Feast) exists for exactly these two: an offline store for training and an online store for low-latency serving, with the same feature definition written once and point-in-time-correct joins implemented on the offline side.
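
One way to picture "defined once" is a single function that both paths call, with the evaluation time passed in explicitly. This is a minimal sketch without a feature store; `clicks_7d` and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta

def clicks_7d(click_times, as_of):
    """Single definition of the feature: clicks in the 7 days up to `as_of`."""
    start = as_of - timedelta(days=7)
    return sum(1 for t in click_times if start < t <= as_of)

clicks = [datetime(2026, 3, 1), datetime(2026, 3, 5), datetime(2026, 6, 18)]

# Training: evaluate as of the moment the historical prediction was made.
train_value = clicks_7d(clicks, as_of=datetime(2026, 3, 6))

# Serving: same function, evaluated as of the request time.
serve_value = clicks_7d(clicks, as_of=datetime(2026, 6, 19))

print(train_value, serve_value)  # 2 1
```

Because there is only one implementation, the window boundaries and filters cannot drift between training and serving; only `as_of` differs.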

Hands-on

# point-in-time join: take features as of the prediction time, no future leak
from feast import FeatureStore

store = FeatureStore(repo_path=".")    # path to your feature repo (assumed)

# training sample = (entity, event_timestamp of prediction, label)
training_df = store.get_historical_features(
    entity_df=labels,                  # has an event_timestamp column
    features=["user:clicks_7d", "doc:popularity"],
).to_df()   # Feast takes values ≤ event_timestamp, not "now"

# Online serving: same feature definition, low-latency online store
feats = store.get_online_features(
    features=["user:clicks_7d", "doc:popularity"],
    entity_rows=[{"user_id": u, "doc_id": d}],
).to_dict()   # training/serving reuse one definition → no skew

Failure mode: backfilling features for historical samples with a single SELECT ... evaluated as of now(). That stamps three-month-old samples with today's click counts: textbook point-in-time leakage, an implausibly high offline AUC, and a model that falls apart in production. For any time-dependent feature, when backfilling always ask: did this value actually exist at the moment the prediction happened?
Going deeper · Feast docs (open-source feature store; offline/online dual store + point-in-time), docs.feast.dev
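
To make the leak concrete, here is a small as-of lookup written with only the standard library, contrasted with a "backfill with now" lookup. It is a conceptual sketch of what a point-in-time join does, not Feast's implementation; the snapshot values and day numbers are made up for illustration.

```python
from bisect import bisect_right

# Snapshots of one user's clicks_7d feature: (day, value), sorted by day.
snapshots = [(1, 2), (30, 5), (60, 9), (90, 40)]
days = [d for d, _ in snapshots]

def as_of(day):
    """Latest snapshot at or before `day`; None if the feature did not exist yet."""
    i = bisect_right(days, day)
    return snapshots[i - 1][1] if i else None

def latest():
    """What a backfill evaluated 'now' would stamp on every sample."""
    return snapshots[-1][1]

label_days = [10, 45, 75]            # when each historical prediction happened

point_in_time = [as_of(d) for d in label_days]
leaky = [latest() for _ in label_days]

print(point_in_time)  # [2, 5, 9]
print(leaky)          # [40, 40, 40]
```

The leaky column hands every historical sample a value from its own future, which is exactly the information the model will not have at serving time.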

// Combined Drill · Give Your RAG a "Maintainable" Data Pipeline

String the four points into one refactor: evolve your "one-off script" RAG into a pipeline that can update daily, swap models, and not decay.

  1. Hash cache layer (§1): add a hash(model_id + chunk_text) cache before embed. Run today's full ingest twice; confirm the second pass hits near 100% at near-zero cost.
  2. Fingerprint table + diff (§2): build a chunk_id → content_hash table and turn ingest into a three-state diff. Test delete specifically: remove a doc and confirm its vector truly leaves the store and is no longer recalled.
  3. Version metadata (§3): write embedding_model on every vector and add a query-layer assertion. Simulate a model swap: build a green index with a full re-embed, A/B blue/green recall on a fixed eval set, and switch only after confirming no regression.
  4. Point-in-time features (§4): if you have reranker features, change backfill SQL to take values by event_timestamp. Deliberately write a "backfill with now()" control and see how much it inflates offline metrics — feel firsthand how tempting the leak is.
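
Step 1's check ("run the ingest twice, confirm the second pass hits near 100%") can be rehearsed offline before touching a paid API. This sketch assumes a hypothetical `CountingEmbedder` that stands in for the real client and simply counts how many texts would have been billed.

```python
import hashlib

MODEL = "voyage-3"

class CountingEmbedder:
    """Hypothetical stand-in for a paid embedding client; counts billed texts."""
    def __init__(self):
        self.billed = 0

    def embed(self, texts):
        self.billed += len(texts)
        return [[float(len(t))] for t in texts]

def key(text):
    return hashlib.sha256(f"{MODEL}:{text}".encode()).hexdigest()

def ingest(chunks, cache, embedder):
    # Unique cache misses, in order; duplicates within a run are billed once.
    misses = list(dict.fromkeys(c for c in chunks if key(c) not in cache))
    if misses:
        for c, v in zip(misses, embedder.embed(misses)):
            cache[key(c)] = v
    return 1 - len(misses) / len(chunks)   # hit rate for this run

chunks = ["intro", "pricing", "refund policy", "intro"]   # one duplicate
cache, embedder = {}, CountingEmbedder()

first = ingest(chunks, cache, embedder)
second = ingest(chunks, cache, embedder)
print(first, second, embedder.billed)  # 0.25 1.0 3
```

The second pass bills nothing, and the duplicate chunk in the first pass is also free, which is the behaviour to confirm on your real pipeline.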

After these four steps, your RAG graduates from "the demo runs" to "production can sustain it." From then on, looking at any vector-retrieval system, you'll instinctively ask four questions: is embedding cached, are incremental deletes clean, how do model swaps migrate, can features leak — instead of just tuning top_k.

// GLOSSARY

Embedding ETL
The pipeline that extracts, chunks, embeds, and loads raw content into a vector store. Embed is its only paid/rate-limited step.
Content Hash
A hash of the content itself (not position/ID), used as cache key and diff basis. Content changes, hash changes.
Incremental Indexing
Re-embedding only changed chunks instead of a full rebuild, driven by fingerprint-table diff.
Tombstone
An explicit record that an item was deleted, used to make sure its vector is actually removed from the store. Prevents ghost recall.
Ghost Recall
Vectors of deleted/stale docs still getting retrieved, polluting generation.
Blue-Green Index
A migration strategy: old and new indexes coexist, A/B compared, then atomically swapped. Essential when swapping embedding models.
Embedding Space
The vector geometry defined by an embedding model. Spaces from different models are incomparable.
Feature Store
A system that manages ML features uniformly — offline for training, online for serving, defined once (e.g., Feast).
Training-Serving Skew
Drift between training and serving feature logic, making offline look good while production breaks.
Point-in-Time Correctness
Taking feature values as of the prediction time when building training samples, preventing future-data leakage.

// DEEP THINKING

Contextual Retrieval prepends context to each chunk before embedding. Does this break §1's content-hash cache?
It does — and that's exactly the design point. The "context + original chunk" after augmentation is the text actually embedded, so the hash must be computed over the augmented string. The cost: the source chunk is unchanged, but if the context-generating LLM or prompt changes, the cache should invalidate. The practical move is to treat "context generation" as a versioned pipeline stage too and bake its version number into the hash key — the same idea as baking model_id into the key: any input that changes the final vector must enter the cache key.
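
A sketch of such a key, assuming hypothetical names for the context-generating model ("ctx-llm-v1") and prompt version ("p1"); the point is only that every input that can change the final vector is hashed together.

```python
import hashlib
import json

def cache_key(chunk_text, context_text, *, embed_model, context_model, prompt_version):
    """Hash every input that can change the final vector."""
    payload = json.dumps({
        "embed_model": embed_model,
        "context_model": context_model,     # LLM that wrote the context
        "prompt_version": prompt_version,   # version of the context prompt
        "text": f"{context_text}\n\n{chunk_text}",  # the string actually embedded
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

base = dict(embed_model="voyage-3", context_model="ctx-llm-v1", prompt_version="p1")
k1 = cache_key("Revenue grew 3%.", "From ACME's Q2 report.", **base)
k2 = cache_key("Revenue grew 3%.", "From ACME's Q2 report.", **base)
k3 = cache_key("Revenue grew 3%.", "From ACME's Q2 report.",
               **{**base, "prompt_version": "p2"})

print(k1 == k2, k1 == k3)  # True False
```

Keeping the versions in the key, rather than relying on the generated context text alone, lets you detect staleness without first paying the LLM to regenerate the context.
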
Are §2's fingerprint table, §3's blue-green index, and §1's hash cache three independent mechanisms or one?
They are the same "content-addressing" idea unfolded at three scales. The hash cache decides "does this chunk need to be re-embedded for pay" (chunk-level, reused across time); the fingerprint table decides "which chunks changed this ingest" (corpus-level, comparing adjacent runs); blue-green decides "the whole space changed, how to migrate safely" (index-level, model upgrade). All three share one invariant: use the fingerprint of content/version, not position or time, to decide identity and invalidation. Grasp this thread and you won't memorize them as three isolated tricks.
"Swapping the embedding model requires a full rebuild" is very expensive. Is there a shortcut to migrate without rebuilding?
Research has tried "vector space alignment" (learning a linear map or small network from the old space to the new), but it is hard to trust in practice: the mapping is lossy, errors amplify in recall, and you can't verify that alignment quality is uniform across the corpus. The reliable way to save money is not avoiding re-embedding but lowering its unit price. Because model_id is part of the key, §1's cache starts cold for a new model, but it still dedupes repeated chunks and lets an interrupted rebuild resume without paying twice; batching cuts per-unit cost, and off-peak batch APIs earn discounts. Conclusion: the rebuild can't be bypassed, but it can be made cheap and gradual. Spending on the safety of the blue-green swap beats betting on an alignment model.
Are §4's point-in-time leakage and §2's missed-delete ghost recall the same class of error?
Yes — both break temporal consistency. Ghost recall is "deleted in the past, the current index hasn't caught up" (index lags reality); point-in-time leakage is "a current value stamped onto a past sample" (the feature runs ahead of reality). One lags, one leads, but the root cause is the same: the valid-time of data wasn't taken seriously. Treat every datum as a timestamped event and make every read explicit about "as of which point," and both bug classes disappear together. This is why feature-store and CDC/event-sourcing thinking transfers to RAG.
For a solo developer doing RAG, are these four engineering layers necessary or over-design?
It's staged. Static docs, a few hundred, model never changing — skip all four; a full script suffices, and forcing the layers is over-engineering. But hit any single signal and add the matching layer: docs change daily → §2 incremental; want to try a better embedding → §3 version metadata (even one line of metadata saves a future incident); wired up reranker features → §4 point-in-time. The criterion isn't "project size" but "will data change, will the model change." For a system that changes, the earlier you bury fingerprints and versions, the later it bites back.

// FURTHER READING