Every model upgrade makes things better — but the vector store you feed it may still live in last quarter's coordinate system. The data pipeline is the real infrastructure that keeps RAG from decaying over time.
Most RAG demos die at "it works once": embed all docs in one shot, load into a vector store, ship. The real pain starts on day two. A doc changes one line — do you re-embed the whole store or just that chunk? Will a deleted page still get retrieved? You found a better embedding model — what happens to the old vectors? The "clicks in the last 7 days" feature you feed the reranker — is training computing the same value as serving? None of these are model problems; they are data-pipeline engineering problems, and every one of them silently corrodes production quality without throwing an error. This issue dissects the AI data pipeline into its four most collapse-prone load-bearing points: the cost structure of embedding ETL, incremental indexing's diff and delete, index migration on embedding-model upgrade, and offline/online consistency in the feature store. Each comes with a copy-pasteable skeleton and its failure mode.
An AI data pipeline is essentially extract → chunk → embed → load. Of the first three stages, extract and chunk are nearly free (CPU, milliseconds); embed is the bottleneck: it crosses the network, bills per token, and hits rate limits. So the first engineering principle is to treat embed(chunk) as a pure function: for a fixed model, identical text yields identical vectors, which means you can cache on hash(model_id + chunk_text). Whenever the same chunk appears in multiple docs, or one doc gets re-ingested repeatedly, cache hits can cut cost and latency by an order of magnitude. The second principle is batching: embedding APIs accept a batch per call, and one call with 100 chunks delivers far higher throughput than 100 single-chunk calls. The third is to decouple chunk from embed: chunking strategy gets retuned constantly, but as long as a chunk's text is unchanged you should not pay to re-embed it.
# Embedding as a pure function: content hash as cache key + batching
import hashlib

MODEL = "voyage-3"  # bake model_id into the key; swapping models auto-misses

def key(text):
    return hashlib.sha256(f"{MODEL}:{text}".encode()).hexdigest()

def embed_batch(chunks, cache):
    # `client` is assumed to be an already-constructed embedding client
    # dedupe misses so a chunk repeated within one batch is paid for once
    todo = list(dict.fromkeys(c for c in chunks if key(c) not in cache))
    if todo:  # pay only for misses
        vecs = client.embed(todo, model=MODEL).embeddings
        for c, v in zip(todo, vecs):
            cache[key(c)] = v
    return [cache[key(c)] for c in chunks]
Baking model_id into the cache key is the setup for the next section on swapping models — a model change auto-invalidates every key, forcing a re-embed instead of silently mixing vectors from two coordinate systems.
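To make the caching and batching principles concrete, here is a minimal self-contained sketch. It is an illustration under assumptions, not a reference implementation: `embed_with_cache`, `cache_key`, and `CountingEmbedder` are hypothetical names, the batch cap of 100 is an example value (real APIs publish their own limits), and the embedder is a local stub so the call counts can be observed without a network.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def cache_key(model_id: str, text: str) -> str:
    """Every input that changes the vector (model id, text) goes into the key."""
    return hashlib.sha256(f"{model_id}:{text}".encode("utf-8")).hexdigest()


def embed_with_cache(
    chunks: Sequence[str],
    cache: Dict[str, Vector],
    embed_fn: Callable[[List[str]], List[Vector]],
    model_id: str,
    batch_size: int = 100,  # assumed cap; check your provider's real limit
) -> List[Vector]:
    # Unique cache misses in first-seen order, so duplicates are paid for once.
    misses = list(
        dict.fromkeys(c for c in chunks if cache_key(model_id, c) not in cache)
    )
    # One API call per batch of misses instead of one call per chunk.
    for start in range(0, len(misses), batch_size):
        batch = misses[start:start + batch_size]
        for text, vec in zip(batch, embed_fn(batch)):
            cache[cache_key(model_id, text)] = vec
    return [cache[cache_key(model_id, c)] for c in chunks]


class CountingEmbedder:
    """Hypothetical stand-in for a real embedding client; counts calls and texts."""

    def __init__(self) -> None:
        self.calls = 0
        self.texts = 0

    def __call__(self, batch: List[str]) -> List[Vector]:
        self.calls += 1
        self.texts += len(batch)
        # Deterministic fake vectors: same text in, same vector out.
        return [[float(len(t)), float(sum(t.encode("utf-8")) % 97)] for t in batch]
```

Running the same ingest twice against one cache should leave the embedder's call count unchanged on the second pass, while passing a different `model_id` misses on every key and forces a fresh embed.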
A document corpus is alive: adds, edits, deletes happen daily. Full rebuilds are tolerable while docs are few, but past ten thousand they become hours of daily compute plus a multiplied API bill. The core of incremental indexing is maintaining a "chunk fingerprint table" (chunk_id → content_hash). Each ingest is compared to the last to derive three operations: add (new hash), change (same id, different hash → re-embed + upsert), and delete (present last time, gone now → remove from the vector store). Most people do the first two; deletes are almost always missed — because they "don't error." The consequence of a missed delete is ghost recall: the doc is long gone but its vector still sits in the store getting retrieved, and the model generates from stale or wrong content. Deletes must go through a tombstone flow that guarantees the vector truly disappears from the store.
# Three diff states: add / change / delete (delete is the easy miss)
def reindex(new_chunks, fingerprint_store, vdb, cache):
    old = fingerprint_store.load()                 # {chunk_id: hash}
    new = {c.id: c.hash for c in new_chunks}
    changed = [c for c in new_chunks
               if old.get(c.id) != c.hash]         # add + change
    deleted = old.keys() - new.keys()              # key step: was there, now gone
    if changed:
        vecs = embed_batch([c.text for c in changed], cache)
        # vdb.upsert / vdb.delete stand in for your vector store's own API
        vdb.upsert(ids=[c.id for c in changed], vectors=vecs)
    if deleted:
        vdb.delete(ids=list(deleted))              # tombstone, kills ghost recall
    fingerprint_store.save(new)
The fingerprint table is itself a cheap "data version snapshot" — when something breaks you can reconcile it against the vector store to find exactly which entries drifted between index and source.
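The reconciliation mentioned above can be sketched as a plain set comparison. This is a hedged illustration: `reconcile` and `Drift` are hypothetical names, and it assumes you can export a `{chunk_id: content_hash}` mapping from the vector store (for example by storing the hash in each vector's metadata), which not every store makes cheap.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Drift:
    ghosts: Set[str]   # in the vector store, absent from the fingerprint table
    missing: Set[str]  # in the fingerprint table, absent from the vector store
    stale: Set[str]    # present in both, but the content hashes disagree


def reconcile(fingerprints: Dict[str, str], indexed: Dict[str, str]) -> Drift:
    """Compare {chunk_id: content_hash} from the source side and the index side."""
    fp_ids, ix_ids = set(fingerprints), set(indexed)
    return Drift(
        ghosts=ix_ids - fp_ids,
        missing=fp_ids - ix_ids,
        stale={cid for cid in fp_ids & ix_ids if fingerprints[cid] != indexed[cid]},
    )
```

Ghosts are the entries a missed delete leaves behind; missing and stale entries point at failed or skipped upserts. All three sets being empty is a reasonable health check to run after each ingest.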
This is the most insidious and costliest class of incident. Every embedding-model upgrade (even a minor version of the same model) changes the geometry of the vector space: the cosine similarity between a sentence embedded by model A and the same sentence embedded by model B is meaningless. If, "to save money," you re-embed only new docs and keep old vectors from the old model, the whole index now mixes two incomparable coordinate systems and recall degrades unpredictably from then on, without raising an error. The right approach borrows blue-green from deployment: re-embed the full corpus with the new model into a separate new index, dual-write during the transition, A/B the recall quality, and only after confirming no regression atomically switch traffic, keeping the old index for rollback. Key discipline: write the embedding_model version into each vector's metadata, so mixing can be asserted away at the query layer.
# Bind model version into vector metadata; assert same source at query time
vdb.upsert(id=cid, vector=v, metadata={
    "embedding_model": "voyage-3",   # version fingerprint
    "indexed_at": ts, "doc_id": doc})

def search(q, vdb, model):
    assert vdb.meta("embedding_model") == model, \
        "query model != index model: coordinate systems differ!"
    return vdb.query(embed(q, model=model), top_k=10)
An OpenAI text-embedding-3 model (with its output shortened to 1024 dims) and some open model can both output 1024 dims: shapes line up and the code doesn't error, but the semantic spaces are entirely different and recall is effectively random. Dimension compatibility ≠ space compatibility. There is no "incremental upgrade" for swapping models, only a full rebuild.
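The "A/B the recall, then switch" part of a blue-green migration can be sketched as a recall gate in front of an alias flip. Everything here is an assumption for illustration: `IndexAlias`, `recall_at_k`, and `promote` are hypothetical names, each index is modeled as a plain search function, and dual-writing during the transition is left out to keep the sketch short.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

SearchFn = Callable[[str, int], List[str]]   # (query, top_k) -> ranked doc ids
EvalSet = Sequence[Tuple[str, Set[str]]]     # (query, ids judged relevant)


def recall_at_k(search: SearchFn, eval_set: EvalSet, k: int = 10) -> float:
    """Mean fraction of relevant ids found in the top k, over a fixed eval set."""
    scores = []
    for query, relevant in eval_set:
        if not relevant:
            continue  # skip queries with no labeled relevant docs
        hits = set(search(query, k)) & relevant
        scores.append(len(hits) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


class IndexAlias:
    """Hypothetical alias layer: queries go to whichever index is live."""

    def __init__(self, indexes: Dict[str, SearchFn], live: str) -> None:
        self.indexes = indexes
        self.live = live
        self.previous: Optional[str] = None

    def search(self, query: str, k: int) -> List[str]:
        return self.indexes[self.live](query, k)

    def promote(self, candidate: str, eval_set: EvalSet, k: int = 10,
                max_regression: float = 0.0) -> bool:
        """Switch traffic to `candidate` only if recall does not regress."""
        blue = recall_at_k(self.indexes[self.live], eval_set, k)
        green = recall_at_k(self.indexes[candidate], eval_set, k)
        if green + max_regression < blue:
            return False  # keep serving from the current index
        # Flip the alias; the old index stays registered for rollback.
        self.previous, self.live = self.live, candidate
        return True

    def rollback(self) -> None:
        if self.previous is not None:
            self.live, self.previous = self.previous, None
```

The design choice worth noting is that the old index is never mutated or deleted by `promote`, so a rollback is just another alias flip instead of a rebuild.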
RAG is not only vectors. Reranker ordering, personalized recall, and filter rules often need structured features: a user's clicks in the last 7 days, doc popularity, author authority. Two classic traps live here. First, training-serving skew: training computes features with offline batch SQL while serving uses a different real-time code path — the two logics inevitably drift, so offline eval rises while production drops. Second, data leakage: when building training samples, feature values must be taken as of the prediction's point in time, not "now" — otherwise the model peeks at the future during training, offline scores inflate, and reality bites on launch. A feature store (e.g., Feast) exists for exactly these two: an offline store for training and an online store for low-latency serving, with the same feature definition written once and point-in-time-correct joins implemented on the offline side.
# point-in-time join: take features as of the prediction time, no future leak
# training sample = (entity, event_timestamp of prediction, label)
from feast import FeatureStore
store = FeatureStore(repo_path=".")  # assumes a Feast repo in the current directory

training_df = store.get_historical_features(
    entity_df=labels,                # has an event_timestamp column
    features=["user:clicks_7d", "doc:popularity"],
).to_df()                            # Feast takes values ≤ event_timestamp, not "now"

# Online serving: same feature definition, low-latency online store
feats = store.get_online_features(
    features=["user:clicks_7d", "doc:popularity"],
    entity_rows=[{"user_id": u, "doc_id": d}],
).to_dict()                          # training/serving reuse one definition → no skew
A common way to leak: backfilling training features with SELECT ... WHERE now(). That stamps three-month-old samples with today's click counts: textbook point-in-time leakage, an absurdly beautiful offline AUC, and production results that fall apart. For any time-dependent feature, always ask when backfilling: did this value actually exist at the moment the prediction happened?
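To show what "as of the prediction time" and "one definition for both paths" mean mechanically, here is a small pure-Python sketch. It is not how Feast is implemented; `clicks_7d`, `build_training_rows`, and `serve_features` are hypothetical helpers, and click history is assumed to be an in-memory, ascending-sorted list of timestamps per user.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import Dict, List


def clicks_7d(click_times: List[datetime], as_of: datetime) -> int:
    """Clicks in the 7 days ending at `as_of`, i.e. the window (as_of - 7d, as_of].

    `click_times` must be sorted ascending. Events after `as_of` are never
    counted, which is what keeps future data out of training rows.
    """
    hi = bisect_right(click_times, as_of)
    lo = bisect_right(click_times, as_of - timedelta(days=7))
    return hi - lo


def build_training_rows(labels: List[dict],
                        clicks: Dict[str, List[datetime]]) -> List[dict]:
    """Offline path: evaluate the feature as of each label's event_timestamp."""
    return [
        {**row,
         "clicks_7d": clicks_7d(clicks.get(row["user_id"], []),
                                row["event_timestamp"])}
        for row in labels
    ]


def serve_features(user_id: str, clicks: Dict[str, List[datetime]],
                   now: datetime) -> dict:
    """Online path: the same function, with as_of set to the request time."""
    return {"clicks_7d": clicks_7d(clicks.get(user_id, []), now)}
```

Because both paths call the same `clicks_7d`, the only thing that differs between training and serving is the `as_of` argument. A backfill that passes the current time for every historical row would reproduce exactly the leak described above.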
String the four points into one refactor: evolve your "one-off script" RAG into a pipeline that can update daily, swap models, and not decay.
Add a hash(model_id + chunk_text) cache before embed. Run today's full ingest twice; confirm the second pass hits near 100% at near-zero cost.
Maintain a chunk_id → content_hash table and turn ingest into a three-state diff. Test delete specifically: remove a doc and confirm its vector truly leaves the store and is no longer recalled.
Record embedding_model on every vector and add a query-layer assertion. Simulate a model swap: build a green index with a full re-embed, A/B blue/green recall on a fixed eval set, and switch only after confirming no regression.
After these four steps, your RAG graduates from "the demo runs" to "production can sustain it." From then on, looking at any vector-retrieval system, you'll instinctively ask four questions (is embedding cached, are incremental deletes clean, how do model swaps migrate, can features leak) instead of just tuning top_k.
Bake model_id into the key: any input that changes the final vector must enter the cache key.