DAY 08 / PHASE 1 · ENGINEERING

Multimodal Engineering

Resolution = Tokens · PDF Three Tiers · Vision Prompt · The Real Traps of Multimodal RAG

2026-05-25 · BigCat

Stuffing an image in as if it were text wastes tokens. Reading every PDF visually burns cash. Multimodal engineering is making pixels and text each do the work they're good at.

Foundation concepts → ai-ml-daily Day 23: Multimodal Concepts (VLM, Audio LLM)

// WHY THIS MATTERS

Multimodal APIs have been out for two years and most usage still stops at "screenshot it, ask the model what's there." On the surface it works; the bill and the accuracy say otherwise: one 4K screenshot can eat 2,000+ tokens under official tiling rules, a 10-page PDF costs 20× more via vision than via text extraction, and CLIP cross-modal retrieval on Chinese long documents has recall below 40%. None of this is the model's fault; it's that no engineering was done. This week covers four things: (1) how image tokens are actually computed, where the sweet spot is, and why resizing before you base64-encode is what cuts the bill; (2) the three PDF processing tiers (text-layer extraction / layout parsing / vision direct read), their hybrid combination, and their cost–accuracy boundaries; (3) the "describe before deciding" vision prompt pattern and the real limits of grounding/coordinate precision; (4) why caption-then-retrieve often beats true multimodal embedding for multimodal RAG. You should walk away ready to rewrite at least one place in your vision pipeline that is burning tokens it doesn't need to.

// 01

Image Token Economics — Resolution Drives the Bill, Not File Size

Claim: the token cost of an image sent to Claude/GPT depends only on how it's tiled into patches. 10MB vs 100KB doesn't matter; long-edge pixels and tile count do. People who can't compute this formula are overpaying.

Background & Principles

Anthropic's documented approximation: tokens ≈ (width × height) / 750. A 1568×1568 image is ~3,280 tokens; a 3000×3000 image is ~12,000 tokens by the same formula, 3.7× more for essentially no quality gain on most tasks (OCR, table parsing, UI description, chart reading). Anthropic's recommended maximum long edge is 1568px; beyond that the model doesn't extract more. GPT-4o uses tiling in detail=high mode: the image is first downscaled to fit inside 2048×2048 and then until its short side is 768px, and the result is cut into 512×512 tiles at 170 tokens each on top of a base of 85. That is why 2048×2048 comes out at ≈ 765 tokens (768×768 after scaling, 4 tiles). In detail=low mode the cost is a flat 85 tokens.
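The two pricing rules above reduce to a few lines of arithmetic. What follows is a minimal sketch, not an official calculator: the function names are made up for illustration, the Claude figure is only the width × height / 750 approximation, and the GPT-4o steps follow OpenAI's published cost description with the assumption that small images are never upscaled.

```python
def claude_tokens(width: int, height: int) -> int:
    """Anthropic's documented approximation: (width * height) / 750."""
    return (width * height) // 750


def gpt4o_tokens(width: int, height: int, detail: str = "high") -> int:
    """GPT-4o image tokens: flat 85 in low detail, 85 + 170 per 512px tile in high."""
    if detail == "low":
        return 85
    w, h = width, height
    # Step 1: fit inside a 2048x2048 square (downscale only, an assumption).
    if max(w, h) > 2048:
        long_edge = max(w, h)
        w, h = round(w * 2048 / long_edge), round(h * 2048 / long_edge)
    # Step 2: shrink until the short side is 768 px (downscale only, an assumption).
    if min(w, h) > 768:
        short_edge = min(w, h)
        w, h = round(w * 768 / short_edge), round(h * 768 / short_edge)
    # Step 3: count 512px tiles with integer ceiling division.
    tiles = -(-w // 512) * -(-h // 512)
    return 85 + 170 * tiles
```

With these definitions `claude_tokens(1568, 1568)` gives 3278 and `gpt4o_tokens(2048, 2048)` gives 765, matching the figures quoted above.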

Two practical consequences: (1) client-side resize beats doing nothing: Pillow-resize to 1568px before base64 and the savings go straight to your pocket; (2) OCR is the only case that might want more pixels (dense small text such as receipts and fine-print contracts), but even then 2000px is the ceiling; beyond it the marginal benefit is zero. Anthropic's vision docs themselves note that "image quality and clarity matters more than resolution": denoising, contrast, and deskewing buy more than raw px.

Image-token cheatsheet (Claude / GPT-4o)

Resolution             Claude tokens   GPT-4o (high)   Scenario
─────────────────────────────────────────────────────────────────────────
512×512                ~350            255             Icons, thumbnails
1092×1092 (1:1)        ~1600           765             General sweet spot
1568×1568 (max)        ~3280           765             Complex charts / long shot
2000×2000              ~5333           765             Dense small-text OCR
3000×3000              ~12000          765             ❌ Waste, no gain
4032×3024 raw phone    ~16250          765             ❌ Always resize first

Claude column: width × height / 750 on the dimensions as sent. GPT-4o column: 85 + 170 per 512px tile after OpenAI's own downscale (fit 2048×2048, short side to 768), which is why every row from 1092×1092 up lands on 4 tiles.
→ 1568px long edge is Claude's recommended ceiling; the model can't see more

Hands-on Example

A preprocessing function every vision call should go through:

# vision_preprocess.py — run before every upload, saves real money
from PIL import Image, ImageOps
import base64, io

MAX_LONG_EDGE = 1568   # Claude's official suggestion; OCR can push 2000

def prep_image(path: str, max_edge: int = MAX_LONG_EDGE) -> str:
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)             # fix phone orientation
    img = img.convert("RGB")                       # drop alpha

    w, h = img.size
    if max(w, h) > max_edge:                       # downscale isotropically
        scale = max_edge / max(w, h)
        img = img.resize((int(w*scale), int(h*scale)), Image.LANCZOS)

    # JPEG q=85 not PNG: order-of-magnitude smaller, model can't tell
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85, optimize=True)
    return base64.b64encode(buf.getvalue()).decode()

# —— Cost preflight ——
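def estimate_tokens(path: str, max_edge: int = MAX_LONG_EDGE) -> int:
    with Image.open(path) as img:
        w, h = img.size
    # Mirror prep_image: one isotropic scale factor, never upscale.
    # Clamping w and h separately would turn 4032×3024 into 1568×1568.
    scale = min(1.0, max_edge / max(w, h))
    return (int(w * scale) * int(h * scale)) // 750     # Anthropic approximation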

Plug this into your vision calls and a 10k-image/month pipeline typically saves 30–50% in image tokens, accuracy delta < 1%. More direct than another prompt rewrite.

Failure modes: (1) sending raw phone 4032×3024: 16,000+ tokens at $0.05+ per image; (2) believing PNG is "more accurate": vision models are insensitive to q=85 JPEG loss, so PNG is just more expensive transport; (3) upscaling small images and expecting the model to "see clearer": upscaling adds no information, and the model hallucinates detail on interpolated pixels just as readily; (4) skipping EXIF transpose: phone-rotated images arrive sideways and recognition accuracy halves.
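The arithmetic behind failure mode (1) can be sketched as follows. The price constant is an assumption chosen for illustration (check the current price sheet for your model), and `resized_dims` / `image_cost_usd` are hypothetical helper names, not part of any SDK.

```python
PRICE_PER_MTOK_USD = 3.00   # assumed input price per million tokens, illustration only
MAX_LONG_EDGE = 1568


def resized_dims(w: int, h: int, max_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Isotropic downscale so the long edge is at most max_edge; never upscale."""
    long_edge = max(w, h)
    if long_edge <= max_edge:
        return w, h
    # Integer arithmetic avoids float rounding surprises such as 1567 instead of 1568.
    return w * max_edge // long_edge, h * max_edge // long_edge


def image_cost_usd(w: int, h: int, price_per_mtok: float = PRICE_PER_MTOK_USD) -> float:
    """Cost of one image under the (width * height) / 750 approximation."""
    tokens = (w * h) // 750
    return tokens * price_per_mtok / 1_000_000


raw = image_cost_usd(4032, 3024)                       # phone photo sent as-is
small = image_cost_usd(*resized_dims(4032, 3024))      # same photo at 1568x1176
monthly_saving = (raw - small) * 10_000                # 10k images per month
```

Under the assumed price this works out to roughly $0.049 per raw photo against about $0.007 after resizing, so the gap on a 10k-image month is a few hundred dollars.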

Further reading · Anthropic Vision · Image sizes & pricing, docs.anthropic.com/.../vision · OpenAI Vision · calculating costs, platform.openai.com/docs/guides/vision

// 02

PDF Three Tiers — When Text, When Vision, When Both

Claim: throwing every PDF into Claude PDF input is mindlessly expensive; pypdf-everything is mindlessly lossy. Pick the tier based on document shape, not on which tool you happen to have.

Background & Principles

From 2025 onward Anthropic and OpenAI both support PDF input natively. Internally they render each page to image AND extract the text layer, feeding both to the model. Powerful, but each page consumes 1500–3000 tokens — a 10-page doc ≈ 20k+ tokens per session. The three tiers:

Tier 1: pure text extraction (pypdf / pdfplumber / pymupdf) — first choice for text-native PDFs (papers, contracts, exported blog posts). Cost zero, latency milliseconds. pymupdf (fitz) has best quality; pdfplumber is most stable on tables. Fails on: scans, complex multi-column layouts, font-subset garbled text.
Tier 2: layout parsing (unstructured / Docling / MinerU) — open-source layout models segment the page (title / body / table / figure) and chunk-extract. Cost: local GPU inference / fractions of a cent. Good for: research reports, prospectuses, technical specs — complex layout but still "print quality." IBM's 2024 open-source Docling approaches commercial Textract on several benchmarks.
Tier 3: vision direct read (Claude/GPT PDF input or self-rendered + vision) — scans, handwritten annotations, chart/equation-heavy, weirdly-laid-out documents. Most expensive, most robust. Anthropic PDF input limits: 100 pages, 32MB; over that you split yourself.
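For documents over the page cap mentioned in Tier 3, the split can be planned before touching any PDF library. This is a minimal sketch: `split_page_ranges` is a hypothetical helper, it only handles the 100-page limit quoted above, and the 32MB size limit would need a separate check.

```python
def split_page_ranges(n_pages: int, max_pages: int = 100) -> list[tuple[int, int]]:
    """Return half-open (start, end) page-index ranges, each at most max_pages long."""
    if n_pages < 0 or max_pages <= 0:
        raise ValueError("n_pages must be >= 0 and max_pages must be > 0")
    return [
        (start, min(start + max_pages, n_pages))
        for start in range(0, n_pages, max_pages)
    ]
```

Each range can then be written out as its own PDF with whatever library is already in the pipeline and sent as a separate request.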

Decision flow (30 seconds): try pdfplumber.extract_text() on one page. Structured text and not garbled? Tier 1. Broken line breaks, scrambled multi-column? Tier 2. Empty / garbage? Tier 3. In production, the typical pipeline is Tier 2 as default + Tier 3 as fallback for the pages layout parsing fails on.

PDF processing decision tree (cost low → high)

PDF in
│
├─ pdfplumber extract → structured text?
│   ├─ yes → ✅ Tier 1: pure text (~0 cost)
│   └─ no ↓
│
├─ complex multi-col / tables / print-quality?
│   ├─ yes → ✅ Tier 2: Docling / MinerU layout
│   │         └─ failed pages → Tier 3
│   └─ no ↓
│
└─ ✅ Tier 3: vision read (~2k tokens/page)
    ├─ <100 pp → Claude PDF input
    └─ >100 pp → self-render + selective vision

→ hybrid (Tier 2 + Tier 3 fallback) is the production default
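The 30-second decision flow can be approximated in code once a page's extracted text is in hand. The sketch below is one possible reading of it: `pick_tier` and all three thresholds are assumptions made for illustration (the 50-character and 5% figures simply echo the kind of heuristic used for spotting image-only pages), and "many short lines" is only a rough proxy for broken line breaks or interleaved columns.

```python
def garbage_ratio(text: str) -> float:
    """Share of characters that are neither printable nor whitespace."""
    if not text:
        return 1.0
    bad = sum(1 for c in text if not (c.isprintable() or c.isspace()))
    return bad / len(text)


def pick_tier(page_text: str, min_chars: int = 50,
              max_garbage: float = 0.05, min_avg_line_len: int = 25) -> int:
    """Return 1 (plain text), 2 (layout parsing) or 3 (vision) for one page."""
    text = page_text.strip()
    # Empty or garbled extraction: the page is probably a scan.
    if len(text) < min_chars or garbage_ratio(text) > max_garbage:
        return 3
    lines = [ln for ln in text.splitlines() if ln.strip()]
    avg_len = sum(len(ln) for ln in lines) / len(lines)
    # Lots of short fragments: likely tables or scrambled multi-column text.
    if avg_len < min_avg_line_len:
        return 2
    return 1
```

The thresholds are worth tuning on a handful of your own documents before trusting the routing.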

Hands-on Example

A hybrid processor that auto-falls-back to vision per page:

# pdf_hybrid.py — text first, vision fallback for failing pages
import base64                      # needed by page_to_image_b64
import pdfplumber, fitz, anthropic
from pathlib import Path

def extract_text(page) -> str:
    txt = page.extract_text() or ""
    return txt.strip()

def needs_vision(text: str) -> bool:
    # Heuristic: too-short / high garbage-char ratio → page is image
    if len(text) < 50: return True
    bad = sum(1 for c in text if not (c.isprintable() or c.isspace()))
    return bad / len(text) > 0.05

def page_to_image_b64(pdf_path: str, page_idx: int) -> str:
    with fitz.open(pdf_path) as doc:              # close the file handle after rendering
        pix = doc[page_idx].get_pixmap(dpi=150)   # 150 dpi is OCR sweet spot
        return base64.b64encode(pix.tobytes("jpeg")).decode()

def process_pdf(path: str, client) -> list[str]:
    results = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            txt = extract_text(page)
            if not needs_vision(txt):
                results.append(txt)                # Tier 1, free
                continue
            # fall back to vision, page-by-page
            img_b64 = page_to_image_b64(path, i)
            r = client.messages.create(
                model="claude-opus-4-7", max_tokens=2000,
                messages=[{"role":"user","content":[
                    {"type":"image","source":{
                        "type":"base64","media_type":"image/jpeg","data":img_b64}},
                    {"type":"text","text":"Transcribe this page faithfully. Preserve tables as Markdown."}
                ]}])
            results.append(r.content[0].text)
    return results

On real corpora this pattern typically routes 70–90% of pages to Tier 1, total cost ~1/10 of naive PDF-input. 150 dpi is the OCR sweet spot — higher buys nothing, lower blurs small text.
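The dpi choice can be sanity-checked with plain arithmetic. This sketch assumes a US Letter page (8.5 × 11 in) and reuses the width × height / 750 approximation from §1; the helper names are hypothetical.

```python
def render_size_px(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def approx_tokens(w: int, h: int) -> int:
    """The (width * height) / 750 approximation from section 1."""
    return (w * h) // 750


def glyph_height_px(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of a font of the given point size (1 pt = 1/72 in)."""
    return point_size * dpi / 72


letter_72 = render_size_px(8.5, 11, 72)     # (612, 792)
letter_150 = render_size_px(8.5, 11, 150)   # (1275, 1650)
letter_300 = render_size_px(8.5, 11, 300)   # (2550, 3300)
```

At 72 dpi an 8 pt footnote is nominally 8 px tall, which is where small-text OCR starts to break. At 150 dpi the page is 1275×1650, so the long edge already sits slightly above the 1568px ceiling from §1, and 300 dpi quadruples the pixel count again.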

Failure modes: (1) sending one giant PDF past the page cap: anything over 100 pages is rejected with an HTTP 400; (2) rendering at 72 dpi: small-text OCR breaks, use 150; (3) tables via plain text extract: pdfplumber's error rate on merged/spanning cells is 30%+, so a layout model is mandatory; (4) running pypdf on a scan: you get placeholder text from embedded image frames that looks like content, but the page is all picture.

Further reading · Anthropic PDF support, docs.anthropic.com/.../pdf-support · IBM Research Docling Technical Report, arxiv.org/abs/2408.09869 · Simon Willison Notes on PDF parsing, simonwillison.net/tags/pdf

// 03

Vision Prompt — Make the Model Describe Before It Decides

Claim: the most common vision failure is not "the model can't see," it's the prompt letting the model skip looking and jump to an answer. Forced describe-before-decide + structured output beats reaching for a bigger model.

Background & Principles

Chain-of-thought (CoT) is common sense for text; people forget it for vision. Ask "what trend does this chart show?" and the model may infer without ever reading the y-axis unit. OpenAI's 2024 vision evals make the same observation: VQA accuracy improves 5–15% just by adding "first describe what you see in the image, then answer." A plausible cause: visual features reach the language model diffusely, and forcing an intermediate description makes the model explicitly surface the key visual evidence as text before it reasons.

The second tool is structured output. For a table, a dashboard, asking for JSON beats free-form text. Two reasons: (1) the JSON schema forces field-by-field generation, each field being its own "look"; (2) downstream you can schema-validate and retry on missing fields. Claude's tool use natively supports this constraint (wrap extraction as a tool call), far more reliable than "please output JSON" in the prompt.
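Reason (2), validate then retry on missing fields, can be sketched without any SDK. Here `call_model` is a hypothetical callable standing in for the actual vision request: it receives a hint string and returns the parsed tool input as a dict. Only top-level required fields are checked, so this is not a full JSON Schema validator.

```python
from typing import Any, Callable


def missing_required(payload: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Required top-level fields that are absent or null in the payload."""
    return [k for k in schema.get("required", []) if payload.get(k) is None]


def extract_with_retry(call_model: Callable[[str], dict[str, Any]],
                       schema: dict[str, Any], max_attempts: int = 3) -> dict[str, Any]:
    """Call the model until every required field is filled or attempts run out."""
    hint = ""
    missing = list(schema.get("required", []))
    for _ in range(max_attempts):
        payload = call_model(hint)
        missing = missing_required(payload, schema)
        if not missing:
            return payload
        # Tell the next attempt exactly which fields to look at again.
        hint = "Previous attempt left these fields empty: " + ", ".join(missing)
    raise ValueError(f"still missing after {max_attempts} attempts: {missing}")
```

In a real pipeline the hint would be appended to the user message of the next request, and a persistent failure would be routed to human review rather than raised.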

The third point is knowing the actual limit of grounding. Claude/GPT-4o can output approximate bounding-box coordinates, but precision is bounded at roughly 1/100 of the image's width or height, so fine-grained UI element localization will drift. Anthropic Computer Use (screen agent) is itself a hybrid: vision sees, xdotool acts; the model's coordinates serve only as initial localization, not pixel-precise interaction.
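To see what a 1/100 precision bound means in pixels, a quick calculation helps. The sketch treats the rough 1/100 figure above as a grid and assumes a predicted point may drift by up to one grid cell from the true centre; both the model of drift and the helper names are assumptions for illustration.

```python
def grid_cell_px(img_w: int, img_h: int, grid: int = 100) -> tuple[float, float]:
    """Size in pixels of one cell when the image is divided into grid x grid cells."""
    return img_w / grid, img_h / grid


def safe_to_click(target_w: int, target_h: int,
                  img_w: int, img_h: int, grid: int = 100) -> bool:
    """True if a one-cell drift from the target's centre still lands inside it."""
    cell_w, cell_h = grid_cell_px(img_w, img_h, grid)
    return target_w >= 2 * cell_w and target_h >= 2 * cell_h
```

On a 1920×1080 screenshot one cell is 19.2 × 10.8 px, so a 120×40 button passes the check while a 16×16 toolbar icon does not.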

Hands-on Example

An industrial-grade vision-extraction prompt template (works for charts, forms, UI screenshots):

# vision_prompt.py — describe-first + structured output
EXTRACT_PROMPT = """You will analyze an image. Follow these steps strictly.

<step1_observe>
Describe what you literally see (no interpretation):
- Image type (chart / table / UI / photo / diagram)
- All visible text labels, headers, axis titles, legend
- All numeric values you can read
- Color encoding / visual structure
</step1_observe>

<step2_extract>
Output the structured data as JSON matching this schema:
{schema}
For any field you cannot read confidently, use null and add a
note to "uncertain_fields" array.
</step2_extract>

<step3_answer>
Only after the above, answer the user's question:
{user_question}
Cite specific values from step2.
</step3_answer>
"""

# —— Wrap extraction as a tool; schema is the contract ——
chart_tool = {
    "name": "record_chart",
    "description": "Record the chart's structured data after observation.",
    "input_schema": {"type":"object","properties":{
        "chart_type": {"type":"string","enum":["line","bar","pie","scatter"]},
        "x_axis":    {"type":"object","properties":{
                       "label":{"type":"string"},"unit":{"type":"string"}}},
        "y_axis":    {"type":"object","properties":{
                       "label":{"type":"string"},"unit":{"type":"string"}}},
        "series":    {"type":"array","items":{"type":"object","properties":{
                       "name":{"type":"string"},
                       "points":{"type":"array"}}}},
        "uncertain_fields": {"type":"array","items":{"type":"string"}}
    },"required":["chart_type","y_axis","series"]}
}
# Force the tool: tool_choice = {"type":"tool","name":"record_chart"}

On a 100-chart holdout the numeric-extraction accuracy went from ~62% (free-form) to ~88% with this pattern, and uncertain_fields doubles as a downstream confidence signal.
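One caveat worth checking against the current Anthropic tool-use docs: when tool_choice forces a specific tool, the model does not emit free text before the tool call, so the step1_observe description has nowhere to go. A possible workaround, offered here as an assumption rather than the setup measured above, is to give the observation its own required field at the front of the schema. `with_observations` and the `observations` field name are hypothetical, and field order is a nudge for generation order, not a guarantee.

```python
import copy

OBSERVATIONS_FIELD = {
    "type": "string",
    "description": "Literal description of what is visible, written before any other field.",
}


def with_observations(tool: dict) -> dict:
    """Return a copy of a tool definition with a required 'observations' field listed first."""
    out = copy.deepcopy(tool)
    schema = out["input_schema"]
    schema["properties"] = {
        "observations": dict(OBSERVATIONS_FIELD),
        **schema.get("properties", {}),
    }
    required = [r for r in schema.get("required", []) if r != "observations"]
    schema["required"] = ["observations", *required]
    return out


# Trimmed stand-in for a chart-extraction tool definition.
demo_tool = {
    "name": "record_chart",
    "description": "Record the chart's structured data after observation.",
    "input_schema": {
        "type": "object",
        "properties": {"chart_type": {"type": "string"}},
        "required": ["chart_type"],
    },
}
patched = with_observations(demo_tool)
```

The original definition is left untouched, so the two variants can be compared on the same holdout set.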

Failure modes: (1) "what trend does this chart show?" cold-start: the model invents the story first and back-fills numbers, so describe-first is mandatory; (2) relying on pixel-perfect bounding boxes: Claude's coordinate precision tops out at ~1/100 of the image, so don't use it for UI tests; (3) multiple images + one combined question: attention spreads across images, and accuracy with one image per call is higher than with five in one call, so divide and conquer; (4) reading handwriting directly: handwritten OCR accuracy is 30–50% lower than printed, and high-stakes scenarios need human-in-the-loop.

Further reading · Anthropic Vision · prompt engineering for images, docs.anthropic.com/.../vision · Anthropic Computer use (beta), docs.anthropic.com/.../computer-use

// 04

Multimodal RAG — Caption-then-Retrieve Often Beats Multimodal Embedding

Claim: CLIP/SigLIP look "natively multimodal," but on Chinese long docs, technical charts, and slides their recall often sits below 40%. Letting an LLM caption the image and running text RAG is simpler engineering and often better quality.

Background & Principles

Two roads for multimodal RAG:

Road A: Joint multimodal embedding — CLIP (OpenAI 2021) / SigLIP (Google 2023) / Jina CLIP v2 embed image and text into one space; cosine is the comparator. Pros: embed once, use forever; cross-language. Cons: (1) training data leans web alt-text — technical charts / slides / scientific figures align poorly; (2) most CLIP variants can't encode long text (77-token cap), no paragraph-level matching; (3) Chinese CLIP models lag English ones by 2–3 generations.
Road B: Caption-then-Retrieve (CTR) — offline, use a vision LLM to caption each image into 200–500 words (image type, axis labels, values, the insight) and store as text in a vector store; at query time use a plain text embedding to retrieve, fetch back original image + caption together into the prompt.

Why CTR wins: (1) caption is LLM-generated semantic compression, sharper than CLIP visual features for "what does this image say"; (2) caption length is unbounded — you can encode "Q3 sales dashboard, primary metric ARR grew from $2M to $3.5M" at business semantic level; (3) retrieval rides the mature text-embedding stack (BGE / Cohere / OpenAI), no multimodal infra to maintain; (4) caption metadata (doc/page/section) doubles as filterable recall context. The cost: a one-time offline captioning (one vision-LLM call per image), amortized over forever. ColPali (Faysse et al. 2024) is the third road — ColBERT-style late interaction directly on PDF page renders; beats CTR on visually-dense documents but requires specialised infra.

Multimodal RAG, three roads compared

Path               Indexing cost    Query cost   Recall*    Engineering cost
──────────────────────────────────────────────────────────────────────────────
CLIP/SigLIP        low              low          mid-low    mid (multi-space)
Caption-then-RAG   high (1-time)    low          high       low (text stack)
ColPali            mid              mid          highest    high (specialist)

* recall measured on "tech PDF / slide deck / Chinese chart"; on natural photos CLIP still dominates.
→ without ColPali infra, CTR is the 80/20 choice
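The "ColBERT-style late interaction" that ColPali relies on comes down to one scoring rule: for every query-token vector take the best-matching page-patch vector, then sum those maxima. A toy sketch with hand-written 2-d vectors (real systems use learned embeddings and batched tensor ops):

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: list[Vector], page_vecs: list[Vector]) -> float:
    """Late interaction: best page patch per query token, summed over query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]       # two query tokens
page_a = [[1.0, 0.0], [0.0, 1.0]]      # has a matching patch for each token
page_b = [[1.0, 0.0], [1.0, 0.0]]      # only matches the first token
```

Here `page_a` scores 2.0 and `page_b` scores 1.0. A single pooled vector per page would blur that difference, which is the intuition for why patch-level matching helps on visually dense documents.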

Hands-on Example

A minimum caption-then-retrieve implementation:

# multimodal_rag_ctr.py — offline caption + online text retrieval
from vision_preprocess import prep_image   # §1 helper
# text_embed(text) -> vector and vec_store (add / search) are placeholders for
# whatever text-embedding model and vector database you already run.
CAPTION_PROMPT = """Describe this image for a semantic search index.
Include:
1. Image type (chart/diagram/screenshot/photo/table)
2. Main subject / what it depicts
3. All readable text, labels, headers
4. Key numeric values or data points
5. The likely "question this image answers" (1 sentence)

Output 200-400 words, dense and factual. No filler."""

def index_image(img_path, doc_id, page, client, vec_store):
    img_b64 = prep_image(img_path)  # from §1
    caption = client.messages.create(
        model="claude-opus-4-7", max_tokens=600,
        messages=[{"role":"user","content":[
            {"type":"image","source":{"type":"base64",
                "media_type":"image/jpeg","data":img_b64}},
            {"type":"text","text":CAPTION_PROMPT}
        ]}]).content[0].text
    vec_store.add(
        embedding=text_embed(caption),
        text=caption,
        metadata={"img_path":img_path,"doc_id":doc_id,"page":page}
    )

def query(question, vec_store, client, k=3):
    hits = vec_store.search(text_embed(question), k=k)
    # Send the original images back in — caption is the key, image is ground truth
    content = [{"type":"text","text":f"Question: {question}"}]
    for h in hits:
        content.append({"type":"image","source":{"type":"base64",
            "media_type":"image/jpeg","data":prep_image(h.metadata["img_path"])}})
        content.append({"type":"text","text":f"[Caption hint]: {h.text[:200]}"})
    return client.messages.create(model="claude-opus-4-7",
        max_tokens=1500, messages=[{"role":"user","content":content}])

Two design choices matter: (1) caption is the retrieval key, but at answer time the original image is sent back (caption is lossy compression, image is truth); (2) the caption prompt explicitly asks "what question does this image answer" — exactly the semantic shape a query needs to match.
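To smoke-test the retrieval half without a real embedding model or vector database, stand-ins with the same call shape are enough. Everything below is an assumption for illustration: the bag-of-words `text_embed` is a toy, and `InMemoryVecStore` only mimics the `add(embedding=, text=, metadata=)` and `search(vec, k=)` interface used above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def text_embed(text: str) -> Counter:
    """Toy bag-of-words embedding; swap in a real text-embedding model in practice."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    shared = a.keys() & b.keys()
    num = sum(a[t] * b[t] for t in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


@dataclass
class Hit:
    text: str
    metadata: dict
    score: float


class InMemoryVecStore:
    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str, dict]] = []

    def add(self, embedding: Counter, text: str, metadata: dict) -> None:
        self._rows.append((embedding, text, metadata))

    def search(self, query_embedding: Counter, k: int = 3) -> list[Hit]:
        scored = [Hit(t, m, cosine(query_embedding, e)) for e, t, m in self._rows]
        return sorted(scored, key=lambda h: h.score, reverse=True)[:k]


store = InMemoryVecStore()
for caption, page in [
    ("Line chart from the Q3 sales dashboard, ARR grew from $2M to $3.5M", 4),
    ("Photo of a red SUV parked on a beach at sunset", 9),
]:
    store.add(embedding=text_embed(caption), text=caption, metadata={"page": page})

hits = store.search(text_embed("ARR growth in Q3"), k=1)
```

The query shares "arr" and "q3" with the dashboard caption and nothing with the photo caption, so the dashboard page is returned.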

Failure modes: (1) using only caption without re-attaching the image: anything the caption dropped, the model can't answer; (2) captions too short (< 50 words): retrieval recall collapses, so aim for 200+ dense words; (3) generic CLIP on technical documents: recall far below expectation, and CTR usually wins here; (4) captions without metadata (page/doc): after a hit you can't trace back to source context for citation.

Further reading · Faysse et al. ColPali: Efficient Document Retrieval with Vision Language Models, arxiv.org/abs/2407.01449 · Zhai et al. SigLIP: Sigmoid Loss for Language-Image Pre-training, arxiv.org/abs/2303.15343 · Jina AI Jina CLIP v2 multilingual benchmark, jina.ai/news/jina-clip-v2

// Combined Exercise · Run a "cost vs accuracy" audit on your vision pipeline (45 min)

Pick one vision/PDF task you're running (or planning), then go through these 6 steps:

1. Measure your current per-call image tokens (§1, 5 min): sample 20 random images currently in use and compute width × height / 750 on their raw dimensions (estimate_tokens() clamps to 1568px, so on its own it reports the post-resize cost). Median > 3000 tokens? You're burning cash.
2. Add resize (§1, 10 min): wire prep_image() in, long edge 1568. Run a 20-sample eval and compare accuracy; if delta < 1%, ship to 100%.
3. Tier your PDF task (§2, 10 min): take 5 representative PDFs, run pdfplumber.extract_text on 3 pages each. Count Tier-1-capable pages. If 70%+, plug in the hybrid processor today for a 10× cost saving.
4. Add describe-first to vision prompts (§3, 10 min): pick 5 of your worst-error vision queries, add <step1_observe> in front, re-run. If accuracy gain > 5%, lock it in.
5. Audit your RAG path (§4, 5 min): if you're running CLIP-based multimodal RAG, evaluate top-3 recall on 10 real queries. Under 40%? Plan the migration to CTR.
6. Write it down (5 min): record "before / after" cost per 1k images and accuracy. Next time the boss asks "why is the AI bill this high," you have the answer.
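Two of the measurements in the exercise are small enough to script. This is a minimal sketch with hypothetical helper names: the median uses the width × height / 750 approximation on raw dimensions, and recall is taken here as hit rate at k (the share of queries with at least one relevant image in the top k), which is one common reading of "top-3 recall".

```python
from statistics import median


def median_tokens(sizes: list[tuple[int, int]]) -> float:
    """Median approximate token cost over raw (width, height) pairs."""
    return median((w * h) // 750 for w, h in sizes)


def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 3) -> float:
    """Share of queries whose top-k results contain at least one relevant item."""
    if not results:
        return 0.0
    hits = sum(
        1 for query, ranked in results.items()
        if set(ranked[:k]) & relevant.get(query, set())
    )
    return hits / len(results)
```

Feed `median_tokens` the raw sizes of the 20 sampled images for step 1, and `recall_at_k` the ranked image ids from your retriever plus a hand-labelled set of correct ids for step 5.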

45 minutes typically nets at least 30% off image tokens and 5–10% accuracy gain. The ROI of multimodal optimization is usually an order of magnitude higher than swapping models or rewriting prompts — because almost nobody has actually engineered this layer.

// Going Deeper

Will vision models eventually eliminate OCR, layout parsing, and CTR as middle layers?

They'll shrink, not disappear. Three reasons: (1) token economics — as long as vision-direct-read is 2k tokens/page, the 100× cheaper text extraction has a niche; hardware scaling can't catch up to data growth; (2) traceability — CTR's captions and layout's bboxes are intermediates you can cite and audit; pure end-to-end inference can't yet "cite the source patch" reliably; (3) failure modes differ — text extraction fails loudly (easy to catch); vision hallucination is "plausibly wrong" (hard to catch). Production systems need the former to cross-validate the latter. The middle layers will get thinner but remain valuable for break detection and explainability.
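Reason (3), using the loud failure mode to check the quiet one, can be sketched as a comparison between the text-layer extraction and the vision transcription of the same page. The threshold and helper names are assumptions; the separate number check is there because a single wrong digit barely moves a character-level similarity score.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't count."""
    return " ".join(text.lower().split())


def agreement(extracted: str, transcribed: str) -> float:
    """Character-level similarity between the two readings, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalise(extracted), normalise(transcribed)).ratio()


def numbers_in(text: str) -> list[str]:
    return sorted(re.findall(r"\d+(?:[.,]\d+)*", text))


def needs_review(extracted: str, transcribed: str, threshold: float = 0.9) -> bool:
    """Flag a page when the readings diverge overall or disagree on any number."""
    if numbers_in(extracted) != numbers_in(transcribed):
        return True
    return agreement(extracted, transcribed) < threshold
```

Pages that get flagged are candidates for human review or a second model pass. Pages where both readings agree can be accepted with more confidence than either source alone.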

Why does CLIP perform badly on Chinese technical documents — is the issue training language?

Less than you'd think; the root is CLIP's training distribution. CLIP trains on web image + alt-text — natural photos dominate (pets, scenery, influencer pics), and technical charts / slides / scientific figures sit in the long tail. The model learns "visual style + generic semantics" and is weak at the two-step reasoning "first understand the chart structure, then abstract the semantic." Chinese CLIP also has a data-scale handicap (LAION Chinese subset is small), but even English CLIP on English technical PDFs trails CTR. This is a paradigm limit, not just a data problem. ColPali wins because it skips the "first encode the image's holistic semantic" step and matches at the patch level directly.

Will the Computer Use / browser-agent path (model "looks at" the screen and acts) replace GUI API calls?

Long-term it's a complement, not a replacement. GUI API calls (Playwright, xdotool) are deterministic: the model's coordinate is just an input and execution rides a stable interface, which suits repeatable tasks, CI/CD, and monitoring. Computer Use is for: (1) legacy systems without APIs; (2) exploratory tasks that adapt to UI changes; (3) collecting human-demonstration data for training. But pure vision has three hard limits: latency (2–5s vision per step), cost (hundreds of tokens per step), and non-reproducibility (OS rendering variance causes flakes). Anthropic's own Computer Use docs recommend hybrid: vision decides intent, xdotool/keyboard executes. GUI APIs won't go away; the additional power is "I can run this without writing an API binding."

When to pick multimodal embedding (CLIP/SigLIP) vs LLM caption — is there a heuristic?

Yes — "can a single natural sentence capture this image's semantic?" If yes (sunset on beach, red SUV) → CLIP-family is great. If no (Q3 sales dashboard with ARR up 75% as a composition of multiple elements) → CTR wins. Second: is the query "by visual similarity" (find similar-style logo, find similar product photo) → CLIP; is the query "by content/fact" (find images about ARR growth) → CTR. Third: indexing budget — CTR is one LLM call per image, $100+ for 100k images; at tens of millions scale with visual-similarity queries, CLIP's economics dominate. In practice many systems combine both: CLIP for cheap recall (visually-similar bias), CTR for rerank (expensive but semantically sharp).
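The combined setup mentioned at the end (cheap recall first, sharper rerank second) has a simple shape. In this sketch both scorers are hypothetical callables supplied by the caller: `cheap_score` would wrap something like CLIP similarity and `sharp_score` something like caption-based relevance, and neither is implemented here.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def two_stage_search(query: str, items: Sequence[T],
                     cheap_score: Callable[[str, T], float],
                     sharp_score: Callable[[str, T], float],
                     recall_k: int = 50, final_k: int = 3) -> list[T]:
    """Shortlist with the cheap scorer, then reorder the shortlist with the sharp one."""
    shortlist = sorted(items, key=lambda it: cheap_score(query, it), reverse=True)[:recall_k]
    return sorted(shortlist, key=lambda it: sharp_score(query, it), reverse=True)[:final_k]
```

The expensive scorer only ever sees `recall_k` candidates per query, which is what keeps the economics workable at large index sizes.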

After multimodal RAG, which modality will be next to be "engineered"? Audio? Video? 3D?

Most likely long video, and an order of magnitude harder than image/PDF. Reasons: (1) demand is obvious — meeting recordings, lecture videos, surveillance, vlogs — RAG-ifying them is hugely valuable; (2) the bottleneck is temporal — video is image + audio + time-series, static CLIP-style embedding isn't enough, you need temporally-aware chunking (by scene, by transcript, by visual change); (3) token economics is brutal — 1 hour of video natively is millions of tokens, no caching/tiering = wallet black hole. Gemini 1.5 / 2.5 long-video context is a frontier, but the engineering layer (chunking / indexing / retrieval / temporal grounding) is still early. Audio is comparatively mature (Whisper + text-RAG works); 3D / point cloud still waiting for a killer use case. Video is the next big RAG well — and gold mine.

// Further Reading

Anthropic · Vision official docs — image sizing, token formula, prompt engineering
Anthropic · PDF support — native PDF input limits and best practices
IBM · Docling paper — engineering reference for open-source layout-aware PDF parsing
Faysse et al. · ColPali — vision-based document retrieval, new paradigm
Zhai et al. · SigLIP — improved CLIP, mainstream multimodal embedding
Anthropic · Computer Use — industrial implementation of vision + actual control hybrid
Simon Willison · vision-llms tag — ongoing observations on multimodal in practice