DAY 48 / PHASE 5 · ENGINEERING

Knowledge Bases & GraphRAG

Document Parsing · Table/Chart · Knowledge Graph · GraphRAG

2026-06-27 · BigCat

Most RAG doesn't lose on the retrieval algorithm — the documents were already shattered before they entered the store.

// WHY THIS MATTERS

Most RAG failures aren't in the retrieval algorithm — they're upstream: the documents were already shattered before ingestion. Two-column PDFs misordered, tables sliced in half by chunking, cross-document entity relationships never built. So vanilla RAG can answer "what does clause 3.2 say" but not "which duplicate obligations exist across this whole contract set" — that kind of global sensemaking question is a structural blind spot for top-k retrieval: the answer lives in no single chunk, only in the global distribution. GraphRAG is built for exactly this: first use an LLM to extract an entity knowledge graph from the corpus, pre-generate summaries for graph communities, then query via two paths — local (entity neighborhood) / global (community-summary rollup). The price: indexing a million-token corpus burns dozens of times the LLM cost of building a vanilla store. This issue covers the four most-overlooked layers of knowledge-base engineering — document parsing, tables/charts, graph construction, the GraphRAG trade-off — and when it's a weapon vs. over-engineering.

Knowledge-base Pipeline (GraphRAG end to end)

raw docs               INDEX time (expensive · one-off)
┌─────────┐  parse    ┌───────────────────────────┐
│PDF/table│ ───────▶  │ 1 entity/relation extract │
│ /chart  │  layout   │   (1 LLM call per chunk)  │
└─────────┘  aware    │ 2 entity resolution       │
                      │ 3 community detection     │
                      │   (Leiden)                │
                      │ 4 pre-gen community summ. │
                      └─────────────┬─────────────┘
                                    │ knowledge graph + summaries
QUERY time            ┌─────────────┴──────────────┐
                      ▼                            ▼
                 LOCAL search                GLOBAL search
          entity→neighbor→rel→comm     ask each summary → map-reduce
          "how do X and Y relate"      "what are the main themes"
               cheap · local           expensive · global sensemaking

// 01

Document Parsing: from "extract text" to "restore layout & reading order"

Claim: the ceiling on RAG quality is fixed at the parsing layer — if misordered text goes in before chunking, no reranker downstream can save it.

Background & Principle

Most RAG demos use PyPDF2 / pdfplumber to pull a raw text stream — fine for single-column born-digital PDFs, but on real documents (two-column papers, contracts with headers/footers, scans) it produces three silent injuries: (1) reading-order scramble — two columns interleaved by y-coordinate into one line; (2) headers/footers/watermarks bleeding into body text, polluting embeddings; (3) tables flattened into structureless token streams. Layout-aware parsers (Docling, unstructured, LlamaParse) first do layout detection (bounding boxes for title / paragraph / table / figure), then reassemble by reading order and route block types separately. Docling uses an RT-DETR-family layout model, emitting a structured document with types and reading order (IBM, arXiv 2408.09869). Engineering point: parsing isn't a one-off ETL — it's a first-class citizen alongside chunking. Wrong layout, every chunk boundary is wrong.
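Injury (1) is easy to reproduce on paper. The toy sketch below is not how Docling or any real layout model works: the blocks are hypothetical dicts with top-left coordinates, and the fixed page midpoint is an assumption that only holds for clean two-column pages.

```python
# Hypothetical text blocks from a two-column page: (x, y) is the top-left corner.
blocks = [
    {"text": "L1", "x": 50,  "y": 100},
    {"text": "R1", "x": 320, "y": 100},
    {"text": "L2", "x": 50,  "y": 130},
    {"text": "R2", "x": 320, "y": 130},
]

def naive_order(blocks):
    """Sort top-to-bottom, then left-to-right: interleaves the two columns."""
    return [b["text"] for b in sorted(blocks, key=lambda b: (b["y"], b["x"]))]

def column_aware_order(blocks, page_width=600):
    """Read the whole left column first, then the whole right column."""
    mid = page_width / 2  # assumption: columns split at the page midpoint
    left = sorted((b for b in blocks if b["x"] < mid), key=lambda b: b["y"])
    right = sorted((b for b in blocks if b["x"] >= mid), key=lambda b: b["y"])
    return [b["text"] for b in left + right]

print(naive_order(blocks))         # ['L1', 'R1', 'L2', 'R2']
print(column_aware_order(blocks))  # ['L1', 'L2', 'R1', 'R2']
```

The naive order is what a raw text stream tends to give you; every chunk cut from it mixes sentences from both columns.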

Hands-on

from docling.document_converter import DocumentConverter

conv = DocumentConverter()
doc  = conv.convert("contract.pdf").document

# Export by structure, not a raw text stream: keep heading levels & tables
md = doc.export_to_markdown()           # table→md table, heading→#

# Key: chunk by "reading order + block type", not by char count
def iter_chunks(doc, doc_title):
    section = None                      # no section header seen yet
    # iterate_items() yields (item, level) pairs in reading order
    for item, _level in doc.iterate_items():
        if item.label == "section_header":
            section = item.text         # attach a section path to each chunk
        elif item.label == "table":
            # tables as their own chunks; serialize_table is defined in §2,
            # adapt the table item to the shape it expects
            yield from serialize_table(item, doc_title, section)
        elif item.label == "text":
            yield {"text": item.text, "section": section}

chunks = list(iter_chunks(doc, doc_title="contract"))

Failure mode: scans / handwriting / complex multi-level merged tables — the layout model's confidence is low and it misorders silently without erroring. For mission-critical docs, add a parse-confidence threshold alert + sampled human spot-checks; don't trust blindly. Reverse trap: running Docling on pure single-column born-digital docs is overkill — pdfplumber is enough and ~10× faster.
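A minimal sketch of the threshold alert plus sampled spot-check, under stated assumptions: `ParsedBlock` and its `confidence` field are hypothetical (whether and how your parser exposes a per-block score varies), and the 0.8 threshold and 5% sample rate are placeholders to tune.

```python
import random
from dataclasses import dataclass

@dataclass
class ParsedBlock:
    page: int
    label: str         # e.g. "text", "table", "section_header"
    confidence: float  # assumed layout-detection score in [0, 1]

def pages_to_review(blocks, threshold=0.8, sample_rate=0.05, seed=0):
    """Return (flagged, sampled): pages containing any low-confidence block,
    plus a random sample of the remaining pages for human spot-checks."""
    all_pages = sorted({b.page for b in blocks})
    flagged = sorted({b.page for b in blocks if b.confidence < threshold})
    flagged_set = set(flagged)
    remaining = [p for p in all_pages if p not in flagged_set]
    # Always sample at least one page when any unflagged page exists.
    k = min(len(remaining), max(1, round(len(remaining) * sample_rate))) if remaining else 0
    sampled = sorted(random.Random(seed).sample(remaining, k))
    return flagged, sampled

blocks = [
    ParsedBlock(1, "text", 0.97),
    ParsedBlock(2, "table", 0.55),   # low score: likely a mis-detected table
    ParsedBlock(2, "text", 0.91),
    ParsedBlock(3, "text", 0.93),
]
flagged, sampled = pages_to_review(blocks)
print("review now:", flagged, "spot-check:", sampled)
```

Flagged pages catch what the model knows it is unsure about; the random sample is there for the silent cases where it is confidently wrong.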

Going deeper · Docling technical report (IBM, arXiv 2408.09869) + source docling-project/docling · unstructured.io

// 02

Tables / Charts: keeping structural semantics from dissolving in a chunk

Claim: tables are a structural blind spot for RAG — naive chunking drops "2023 revenue" and its number into different chunks, and what comes back is fragments stripped of their row/column relationships.

Background & Principle

A table's semantics live in its 2D structure: a cell's meaning is jointly determined by its row header + column header. Flattened to text, "152.3" loses the binding "it is 2023 Q3 North America revenue." Two engineering routes: (1) structural serialization — convert to markdown / HTML / one-record-per-row JSON so the LLM still sees row/column alignment at generation time; (2) visual route — for inherently text-free charts (bar/line), have a vision model produce a structured description or a multimodal embedding. In practice a hybrid is most robust: serialize small tables to markdown into the text index; split large tables into per-row records with header + caption injected; route charts through vision-caption and index alongside the caption. This shares its root with Anthropic's Contextual Retrieval — inject context (table caption / section) before embedding to lift recall substantially.
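The hybrid rule for text tables can be sketched as a size switch. This is an illustration, not a library API: `serialize`, `to_markdown`, and the `max_cells=40` cutoff are all assumptions, and the sample values are made up.

```python
def to_markdown(headers, rows):
    """Render a small table as markdown so row/column alignment survives."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def serialize(headers, rows, caption, max_cells=40):
    """Small table -> one markdown chunk; large table -> one record per row."""
    if len(headers) * len(rows) <= max_cells:
        return [f"Table: {caption}\n{to_markdown(headers, rows)}"]
    return [f"[table:{caption}] " + "; ".join(f"{h}={v}" for h, v in zip(headers, row))
            for row in rows]

headers = ["Quarter", "Region", "Rev(M)"]
small = [["2023Q2", "NA", 140.1], ["2023Q3", "NA", 152.3]]   # illustrative values
large = [[f"row{i}", "NA", i] for i in range(20)]

small_chunks = serialize(headers, small, "Quarterly Rev")
large_chunks = serialize(headers, large, "Quarterly Rev")
print(len(small_chunks), len(large_chunks))   # 1 20
print(large_chunks[0])
```

In practice the cutoff is better expressed in tokens than in cells, since one wide free-text column can blow the budget on its own.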

Hands-on

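# Table → one context-bearing record per row (prevents row/col loss)
# `table` is assumed to expose header_row, body_rows and caption; adapt to your parser's table type
def serialize_table(table, doc_title, section):
    headers = table.header_row          # ["Quarter","Region","Rev(M)"]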
    for row in table.body_rows:
        cells = dict(zip(headers, row))
        # Key: each row carries headers + caption + section → still readable on recall
        ctx = f"[doc:{doc_title}|sec:{section}|table:{table.caption}] "
        yield ctx + "; ".join(f"{h}={v}" for h,v in cells.items())
        # "...table:Quarterly Rev] Quarter=2023Q3; Region=NA; Rev(M)=152.3"

# Charts (no text) → vision model emits a structured description, then index
# vision_model and chart_png are placeholders for your multimodal client and the chart image
desc = vision_model("describe axes/trend/extrema, output JSON", image=chart_png)

Failure mode: a large table spanning pages (header on the prior page, data on the next) gets split into two blocks and rows lose their header — do table-merge at the parse layer. Serializing very wide tables to markdown blows the chunk token budget; here you must use row-as-record. Most dangerous: vision descriptions fabricate numbers that don't exist — verify key figures against the source table; never let a vision model be the data source.
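One cheap guard against fabricated figures, as a sketch: pull every number out of the vision description and check that it appears somewhere in the source table. `unverified_numbers` is a hypothetical helper; the regex ignores thousands separators and units, so treat it as a starting point.

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def unverified_numbers(description, table_rows, tolerance=1e-9):
    """Return numbers mentioned in a vision-model description that
    cannot be found in any cell of the source table."""
    source = [float(m) for row in table_rows for cell in row
              for m in NUM.findall(str(cell))]
    claimed = [float(m) for m in NUM.findall(description)]
    return [c for c in claimed
            if not any(abs(c - s) <= tolerance for s in source)]

rows = [["2023Q2", "NA", "140.1"], ["2023Q3", "NA", "152.3"]]   # illustrative values
desc = "Revenue rises from 140.1 to 158.3."                      # 158.3 is fabricated
suspicious = unverified_numbers(desc, rows)
print(suspicious)   # [158.3]
```

Anything in the returned list should block indexing of that description or send it to review; the structured table stays the source of truth.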

Going deeper · Anthropic Introducing Contextual Retrieval, anthropic.com/news/contextual-retrieval

// 03

Knowledge-graph construction: extraction cost, entity resolution, incremental update

Claim: LLM-based entity/relation extraction is GraphRAG's foundation — and its most expensive, most drift-prone step. Cost is O(corpus tokens), not O(queries).

Background & Principle

Graph construction is three steps: (1) entity/relation extraction: per chunk, have the LLM emit (entity, type, desc) and (source, relation, target); (2) entity resolution: "OpenAI", "OpenAI Inc.", "the company" must resolve to the same node, or the graph fragments into islands; (3) community detection: algorithms like Leiden cluster densely connected nodes into communities, and you pre-generate a summary per community for global search. Two engineering realities. First, extraction cost grows linearly with the corpus: indexing a million-token corpus can cost tens to hundreds of times the retrieval cost. Second, schema choice decides everything: open extraction recalls broadly but is noisy and inflates node count, while schema-guided extraction (a whitelist of types) is precise but misses out-of-domain entities. Incremental update is another pain point: LightRAG's (HKUDS, arXiv 2410.05779) incremental algorithm avoids full rebuilds, whereas in early Microsoft GraphRAG, changing a single document meant re-running community detection almost from scratch.

Hands-on

# schema-guided extraction (type whitelist; control noise & node blowup)
EXTRACT = """Extract entities and relations using only these types:
entities: [Person, Org, Product, Tech, Money]
relations: [acquires, invests, releases, partners, belongs_to]
output JSON: {"entities":[{"name","type","desc"}],
              "relations":[{"source","relation","target"}]}
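text: {chunk}"""
# Fill with EXTRACT.replace("{chunk}", chunk); str.format() would trip on the literal JSON braces above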

# Entity resolution: don't build the graph straight from extraction — resolve aliases first
def resolve(raw):
    # 1) normalize(case/suffix) 2) embedding nearest-neighbor cluster 3) LLM arbitrates edge cases
    clusters = cluster_by_embedding(normalize(raw), threshold=0.9)
    return [merge(c) for c in clusters]   # aliases merge into one node
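A self-contained version of steps 1 and 2, for experimenting without an embedding model. Plainly a swap: `difflib.SequenceMatcher` string similarity stands in for embedding nearest-neighbor search, the suffix list is an assumption, and clustering is single-link via union-find. Step 3 (LLM arbitration) is left out, so a pronoun-like alias such as "the company" will not merge here.

```python
import re
from difflib import SequenceMatcher

# Assumed list of corporate suffixes to strip; extend for your domain.
SUFFIXES = re.compile(r"\b(inc|corp|corporation|ltd|llc|co)\.?$", re.I)

def normalize(name):
    """Lowercase, drop a trailing corporate suffix, collapse whitespace."""
    name = SUFFIXES.sub("", name.strip().lower()).strip(" ,.")
    return re.sub(r"\s+", " ", name)

def similarity(a, b):
    # Stand-in for embedding cosine similarity (assumption).
    return SequenceMatcher(None, a, b).ratio()

def resolve(raw_names, threshold=0.9):
    """Single-link clustering of normalized names using union-find."""
    parent = list(range(len(raw_names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    norm = [normalize(n) for n in raw_names]
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if similarity(norm[i], norm[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(raw_names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())

clusters = resolve(["OpenAI", "OpenAI Inc.", "openai", "Anthropic"])
print(clusters)   # [['OpenAI', 'OpenAI Inc.', 'openai'], ['Anthropic']]
```

The pairwise loop is quadratic in the number of names; at real scale you would block candidates first (by type, by first token, or by approximate nearest-neighbor search) and only compare within blocks.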

Failure mode: skip entity resolution and the graph fragments into a pile of isolated nodes, and global search degrades into noise. Open extraction without a whitelist lets node count balloon as the corpus grows, and community detection yields a swarm of meaningless tiny clusters. The LLM will also fabricate edges that don't exist (hallucinated edges), especially by forcing cross-paragraph links in long chunks; mitigate with provenance validation: require every relation to point to a source span.
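Provenance validation can be as blunt as a substring check. The sketch below assumes the extraction prompt has been extended so each relation carries an `evidence` field with a verbatim quote from the chunk; that field is not part of the prompt shown above.

```python
def validate_relations(chunk, relations):
    """Split relations into (kept, dropped). A relation is kept only if its
    evidence quote occurs verbatim in the chunk and names both endpoints."""
    text = chunk.lower()
    kept, dropped = [], []
    for rel in relations:
        span = rel.get("evidence", "").lower().strip()
        ok = (bool(span)
              and span in text
              and rel["source"].lower() in span
              and rel["target"].lower() in span)
        (kept if ok else dropped).append(rel)
    return kept, dropped

chunk = "Acme acquired Beta Labs in 2021. Separately, Gamma released a new chip."
relations = [
    {"source": "Acme", "relation": "acquires", "target": "Beta Labs",
     "evidence": "Acme acquired Beta Labs in 2021"},
    {"source": "Acme", "relation": "partners", "target": "Gamma",
     "evidence": "Acme partnered with Gamma"},   # quote does not exist in the chunk
]
kept, dropped = validate_relations(chunk, relations)
print(len(kept), len(dropped))   # 1 1
```

Requiring both endpoints inside one quoted span is deliberately strict: it rejects true relations that are only stated across sentences, which is the trade you make to cut forced cross-paragraph links.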

Going deeper · LightRAG (HKUDS, arXiv 2410.05779) github.com/HKUDS/LightRAG · Microsoft GraphRAG docs

// 04

When GraphRAG is worth it, and when it's over-engineering

Claim: GraphRAG isn't an upgraded vanilla RAG — it's a different tool built for "global sensemaking questions." For 80% of factoid queries, using it is over-engineering.

Background & Principle

The core insight of GraphRAG (Edge et al., arXiv 2404.16130): RAG excels at "local retrieval" (the answer sits in a few chunks), but structurally fails on queries like "What are the main themes?" that require surveying the whole corpus: the answer is in no single chunk, only in the global distribution. Its solution is two retrieval paths: local search (fan out from query entities to neighbors + relations + their communities, good for "how do X and Y relate") and global search (ask the question against every community summary, then map-reduce the partial answers into one, good for "main themes / trends"). The cost is high: indexing runs per-chunk extraction over millions of tokens plus community summaries, and a global query scans many communities. Decision framework: small corpus / factoid queries / budget-sensitive → vanilla (or hybrid) RAG; large corpus and thematic, multi-hop, connect-the-dots queries → go graph. The middle ground is cheaper variants: LightRAG (incremental, dual-level retrieval), HippoRAG (Personalized PageRank single-step multi-hop; its paper reports single-step retrieval 6–13× faster and 10–30× cheaper than iterative retrieval, NeurIPS 2024 / arXiv 2405.14831), and Microsoft's DRIFT (fuses local + global).

Hands-on

# Routing: classify the query type first; don't brute-force GraphRAG everywhere
def route(query):
    qtype = classify(query)   # factoid / relational / global
    if qtype == "factoid":
        return vanilla_rag(query)             # top-k vector retrieval, cheapest
    if qtype == "relational":
        return graphrag.local_search(query)   # entity-neighborhood fan-out
    return graphrag.global_search(query)      # community-summary rollup

# global search is essentially map-reduce over community summaries
#   map:    each summary independently yields "partial answer + relevance score"
#   reduce: filter by score → merge into the final answer (this step makes it costly)
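The map-reduce shape in the comments above, as a runnable sketch. It is not the microsoft/graphrag implementation: `ask` and `merge` are injected callables standing in for LLM calls, and the 0-100 score scale, `min_score`, and `top_n` are assumptions.

```python
def global_search(query, summaries, ask, merge, min_score=50, top_n=3):
    """Map-reduce over community summaries.

    ask(query, summary)   -> (partial_answer, score)   # map: one call per community
    merge(query, answers) -> final answer              # reduce: one call overall
    """
    partials = []
    for community_id, summary in summaries.items():
        answer, score = ask(query, summary)
        partials.append((score, community_id, answer))
    # Keep only relevant partial answers, best first.
    relevant = sorted((p for p in partials if p[0] >= min_score), reverse=True)[:top_n]
    return merge(query, [answer for _, _, answer in relevant])

def toy_ask(query, summary):
    # Stand-in for an LLM call: score = % of query words found in the summary.
    words = query.lower().split()
    hits = sum(w in summary.lower() for w in words)
    return summary, round(100 * hits / len(words))

def toy_merge(query, partial_answers):
    # Stand-in for the LLM rollup.
    return " | ".join(partial_answers)

summaries = {
    "c0": "Supplier contracts and payment terms",
    "c1": "Employee travel policy",
    "c2": "Payment disputes with suppliers",
}
result = global_search("payment terms", summaries, toy_ask, toy_merge)
print(result)   # Supplier contracts and payment terms | Payment disputes with suppliers
```

The cost structure is visible in the loop: one map call per community per query, so query cost scales with the number of communities scanned, not with top-k.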

Failure mode: treating GraphRAG as the default RAG — for a factoid like "what does clause 3.2 say," it's several times slower, tens of times costlier, and not necessarily a better answer. Second trap: a frequently-updated corpus on early GraphRAG (full community rebuild) → runaway index cost; for high update rates pick LightRAG's incremental path. Third: global search's rollup "smooths over" detail and is unreliable when you need exact numbers — fall back to local / vanilla.

Going deeper · GraphRAG From Local to Global (Edge et al., arXiv 2404.16130) + microsoft/graphrag · HippoRAG (arXiv 2405.14831)

// Capstone · Build yourself a "dual-mode knowledge base"

query type → tool choice

factoid     "what does clause 3.2 say"    ─▶ Vanilla RAG (top-k)      cheapest
relational  "which firms did A acquire"   ─▶ GraphRAG local           mid
multi-hop   "who did A's portfolio fund"  ─▶ HippoRAG (PPR step)      mid·fast
global      "main themes in the corpus"   ─▶ GraphRAG global          priciest
high churn  "new docs every day"          ─▶ LightRAG (incremental)   skip rebuild

Parse layer: use Docling to convert 50 mixed docs (papers / contracts / reports) to structured markdown; serialize tables row-as-record with headers injected. Spot-check 5 against layout.
Build two indexes: A = vanilla vector store (chunk + embedding); B = GraphRAG (schema-guided extraction + community summaries). Record B's indexing token cost — it'll shock you.
Write 12 questions: 4 factoid / 4 relational / 4 global, with known ground truth.
Routing comparison: factoid → A, relational / global → B; record cost / latency / accuracy per class.
Three first-hand conclusions: (a) parse quality decides everything — a mis-laid-out doc is wrong in both indexes; (b) GraphRAG isn't worth it for factoids; (c) on global questions vanilla fails outright and only GraphRAG answers — the empirical proof of §4's claim.
Advanced: swap B for LightRAG, add 3 new docs, compare incremental update vs. full rebuild cost.
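A small harness for the routing comparison step, as a sketch. The two systems are stubs returning `(answer, cost)`; in the real exercise they would wrap index A and index B. Substring match against ground truth is a crude accuracy proxy and an assumption of this sketch.

```python
import time
from collections import defaultdict

def evaluate(questions, systems, routing):
    """Send each question to the system its type routes to; aggregate per type."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "cost": 0.0, "latency_s": 0.0})
    for q in questions:
        system = systems[routing[q["type"]]]
        start = time.perf_counter()
        answer, cost = system(q["text"])          # each system returns (answer, cost)
        elapsed = time.perf_counter() - start
        t = totals[q["type"]]
        t["n"] += 1
        t["correct"] += int(q["truth"].lower() in answer.lower())
        t["cost"] += cost
        t["latency_s"] += elapsed
    return {qtype: {"accuracy": t["correct"] / t["n"],
                    "avg_cost": t["cost"] / t["n"],
                    "avg_latency_s": t["latency_s"] / t["n"]}
            for qtype, t in totals.items()}

# Stub systems standing in for index A (vanilla) and index B (GraphRAG).
systems = {
    "A": lambda text: ("Clause 3.2 covers termination.", 0.25),
    "B": lambda text: ("The main themes are liability and termination.", 4.0),
}
routing = {"factoid": "A", "relational": "B", "global": "B"}
questions = [
    {"type": "factoid", "text": "What does clause 3.2 say?", "truth": "termination"},
    {"type": "global",  "text": "What are the main themes?", "truth": "liability"},
    {"type": "global",  "text": "Which risks recur?",        "truth": "indemnity"},
]
report = evaluate(questions, systems, routing)
for qtype, row in report.items():
    print(qtype, row)
```

To test conclusion (c), run the same questions with everything routed to A and compare the per-class accuracy side by side.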

Once you've done this, you'll instinctively dissect any "knowledge base / GraphRAG product" (how clean is its parse layer? does the graph do entity resolution? is the query local or global?) instead of being dazzled by "graph-enhanced" marketing.

// Deep Thinking

GraphRAG front-loads its LLM cost at index time, O(corpus tokens); vanilla RAG spends its LLM cost per query, O(queries). When does this prepaid cost pay off?

The break-even depends on whether "query volume × per-query delta" covers the one-off indexing delta. GraphRAG indexing is expensive in per-chunk extraction + community summaries, but amortized over many queries the per-query cost isn't necessarily higher. Pays off when: the corpus is relatively stable (no frequent rebuilds), query volume is high, and many queries are global questions vanilla can't answer — here the prepaid cost buys capability, not savings. Doesn't pay off: high-churn corpus (rebuilds can't be amortized), sparse queries, or all-factoid queries (the global structure goes unused). At root it's the classic batch-precompute vs. online-compute trade-off.
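The break-even arithmetic, spelled out. All numbers are hypothetical cost units; `per_query_delta` is whatever you count as the per-query gain from the graph index (saved cost, or the value of answers vanilla cannot produce), and `rebuilds` models how often a churning corpus forces the indexing delta to be paid again.

```python
import math

def break_even_queries(graph_index_cost, vanilla_index_cost, per_query_delta, rebuilds=1):
    """Number of queries needed before the prepaid indexing delta is recovered."""
    if per_query_delta <= 0:
        return math.inf   # no per-query gain: the index never pays off
    index_delta = (graph_index_cost - vanilla_index_cost) * rebuilds
    return max(0, math.ceil(index_delta / per_query_delta))

stable = break_even_queries(500, 10, 0.25)                  # index built once
churning = break_even_queries(500, 10, 0.25, rebuilds=12)   # e.g. monthly full rebuilds
print(stable, churning)   # 1960 23520
```

Same corpus, same queries: twelve rebuilds push the break-even from 1,960 to 23,520 queries, which is the "rebuilds can't be amortized" case in numbers.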

How do entity-resolution errors get amplified in global search?

In local search, one wrong entity affects only that query; in global search, one wrong entity pollutes the whole graph topology. If "OpenAI" doesn't merge with "OpenAI Inc.", each connects to only half its edges, and community detection splits nodes that should share a cluster into different communities — so the community summaries are themselves wrong. And global search does map-reduce over those summaries, so the error is amplified twice (pre-generation + rollup) and is invisible. That's why entity resolution is foundation, not optimization: it errs at index time, every downstream query inherits it, and nothing throws.

Document-parsing errors are silent (no error, but downstream pollution). How do you build observability for it engineering-wise?

The parse layer doesn't throw like code; when it's wrong the text just "looks a bit off." Instrument three places: (1) parse time — record the layout model's confidence per block, flag low-confidence pages for review; (2) chunk time — monitor anomalous chunks (ultra-short/long, all-numeric, full of header residue), the fingerprints of parse failure; (3) retrieval time — sample recalled chunks for LLM self-assessment "is this text coherent and readable." Layer a gold standard on top: a fixed set of probe queries with "known answer in doc X table Y" run periodically — if recall drops, parsing regressed.
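Instrumentation point (2) can start as a handful of heuristics. A sketch with assumed thresholds and assumed header patterns; both need tuning against your own corpus.

```python
import re

# Assumed examples of header/footer residue; replace with patterns from your documents.
HEADER_PATTERNS = (r"page \d+ of \d+", r"confidential")

def chunk_flags(text, min_chars=40, max_chars=4000, numeric_ratio=0.8):
    """Return a list of anomaly flags for one chunk (empty list = looks normal)."""
    flags = []
    stripped = text.strip()
    if len(stripped) < min_chars:
        flags.append("too_short")
    if len(stripped) > max_chars:
        flags.append("too_long")
    non_space = [c for c in stripped if not c.isspace()]
    if non_space:
        numeric = sum(c.isdigit() or c in ".,%-" for c in non_space)
        if numeric / len(non_space) > numeric_ratio:
            flags.append("mostly_numeric")   # likely a flattened table
    if any(re.search(p, stripped, re.I) for p in HEADER_PATTERNS):
        flags.append("header_residue")
    return flags

print(chunk_flags("12.5 13.1 99 100 45.2 1,204 88% 17 3.3 42 7"))
print(chunk_flags("Page 3 of 12 ACME Corp Confidential"))
print(chunk_flags("The supplier shall deliver the goods within thirty days of the purchase order date."))
```

Track the share of flagged chunks per document and per parser version; a jump after a parser upgrade is a regression signal even when nothing throws.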

A table's semantics are 2D, but the LLM context is a 1D sequence — what's the essence of this mismatch? Why is the vision route sometimes better?

The essence: linearizing a 2D structure forces a traversal order (row-major or column-major), and either one pushes the "other dimension's" adjacency far apart in token distance — two vertically adjacent numbers in a column may be separated by a full row of text after serialization. The model rebuilds relations via attention across distance, and the farther apart, the more error-prone. The vision route sidesteps linearization: row/column adjacency is preserved spatially in the image, and the vision encoder operates directly on 2D, making it more robust for "read row R column C" localization. The cost: vision fabricates numbers and is token-expensive — so exact figures still go back to the structured source.
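The distance argument in numbers, measured in cells rather than tokens (real token distance is larger by roughly the tokens per cell). The table size is an arbitrary example.

```python
def flat_index(cell, n_rows, n_cols, order="row"):
    """Position of (row, col) after flattening the table row-major or column-major."""
    row, col = cell
    return row * n_cols + col if order == "row" else col * n_rows + row

def distance(a, b, n_rows, n_cols, order="row"):
    """How many cells apart two table cells end up in the 1D sequence."""
    return abs(flat_index(a, n_rows, n_cols, order) - flat_index(b, n_rows, n_cols, order))

n_rows, n_cols = 50, 8
vertical = ((10, 3), (11, 3))     # same column, adjacent rows
horizontal = ((10, 3), (10, 4))   # same row, adjacent columns

print(distance(*vertical, n_rows, n_cols, "row"))     # 8
print(distance(*horizontal, n_rows, n_cols, "row"))   # 1
print(distance(*vertical, n_rows, n_cols, "col"))     # 1
print(distance(*horizontal, n_rows, n_cols, "col"))   # 50
```

Whichever order you pick, one kind of neighbor stays 1 apart and the other lands a full row or column away; row-as-record serialization accepts this and compensates by repeating the headers in every row.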

Global search is map-reduce over community summaries — is this isomorphic to multi-agent orchestrator-workers?

Structurally isomorphic, semantically different. Both are "split → process in parallel → aggregate": a community summary ≈ a worker handling a block, rollup ≈ the orchestrator aggregating. The key difference: GraphRAG's "split" is precomputed at index time (communities are fixed before the query), workers process static summaries rather than dynamic subtasks, and there's no inter-agent IPC or decision drift, which is exactly why it's more controllable than true multi-agent. Put differently, GraphRAG global search is an "orchestrator-workers workflow with a frozen topology," moving multi-agent's costliest, most error-prone step, the "dynamic split," offline. That also explains its cost structure: the heavy, one-off expense sits in offline graph-building, while the online query follows a fixed, deterministic fan-out over summaries.

// Further Reading

From Local to Global: A Graph RAG Approach (Edge et al.) — the foundational GraphRAG paper, local/global dual path
microsoft/graphrag — official implementation + DRIFT search (fuses local+global)
LightRAG: Simple and Fast RAG (HKUDS) — dual-level retrieval + incremental update, cheaper graph RAG
HippoRAG (NeurIPS 2024) — Personalized PageRank single-step multi-hop, 6–13× faster than iterative retrieval
Anthropic · Contextual Retrieval (inject context into each chunk before embedding, large recall gains)