Stuffing an image in as if it were text wastes tokens. Reading every PDF visually burns cash. Multimodal engineering is making pixels and text each do the work they're good at.
Multimodal APIs have been out for two years, and most usage still stops at "screenshot it, ask the model what's there." On the surface it works; the bill and the accuracy say otherwise. One 4K screenshot can eat 2,000+ tokens under official tiling rules, a 10-page PDF costs 20× more via vision than via text extraction, and CLIP cross-modal retrieval on long Chinese documents has recall below 40%. None of this is the model's fault; the engineering simply wasn't done. This week covers four things: (1) how image tokens are actually computed, where the sweet spot is, and why resizing before base64 encoding cuts the bill; (2) the three PDF processing tiers (text-layer extraction / vision direct read / hybrid) and their cost–accuracy boundaries; (3) the "describe before deciding" vision prompt pattern and the real limits of grounding/coordinate precision; (4) why caption-then-retrieve often beats true multimodal embedding for multimodal RAG. You should walk away ready to rewrite at least one place in your vision pipeline that was burning tokens it didn't need to.
Anthropic's documented approximation: tokens ≈ (width × height) / 750. A 1568×1568 image is ~3,280 tokens; by the same formula a 3000×3000 image is ~12,000 tokens, 3.7× more for essentially no quality gain on most tasks (OCR, table parsing, UI description, chart reading). Anthropic's officially recommended "max useful edge" is 1568px; beyond that the model doesn't extract more. GPT-4o uses tiling, and only in detail=high mode (low is a flat 85 tokens): the image is first fit within 2048×2048, its short side is scaled down to 768px, and the result is counted in 512×512 tiles at 85 base tokens + 170 per tile. That is why 2048×2048 costs only ≈ 765 tokens: it shrinks to 768×768, which is 4 tiles.
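To make the two formulas concrete, here is a small calculator. It is a sketch rather than either vendor's billing code: the function names `anthropic_tokens` and `gpt4o_tokens` are invented for this example, and the GPT-4o branch assumes that images whose short side is already below 768px are not upscaled.

```python
import math


def anthropic_tokens(width: int, height: int) -> int:
    """Approximate image tokens using the (width * height) / 750 rule."""
    return (width * height) // 750


def gpt4o_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o image tokens under the tiling rules described above."""
    if detail == "low":
        return 85  # flat cost, no tiling
    # Step 1: fit the image inside a 2048x2048 box, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the short side is 768px (assumption: never upscale).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count 512x512 tiles, 170 tokens each, plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


if __name__ == "__main__":
    print(anthropic_tokens(1568, 1568))  # 3278
    print(anthropic_tokens(3000, 3000))  # 12000
    print(gpt4o_tokens(2048, 2048))      # 765
```

Running both functions over a sample of your own image sizes is a quick way to see which provider's pricing curve your workload sits on.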
Two practical consequences. (1) Client-side resizing is the cheapest win available: Pillow-resize to 1568px on the long edge before base64-encoding, and the savings go straight to your pocket. (2) OCR is the only case that might want more pixels, for dense small text (receipts, fine-print contracts), but even then 2000px is the ceiling; beyond it the marginal benefit is zero. Anthropic's vision docs themselves note that "image quality and clarity matters more than resolution": denoising, contrast, and deskewing buy more than raw px.
A preprocessing function every vision call should go through:
# vision_preprocess.py: run before every upload, saves real money
import base64
import io

from PIL import Image, ImageOps

MAX_LONG_EDGE = 1568  # Claude's official suggestion; OCR can push 2000


def prep_image(path: str, max_edge: int = MAX_LONG_EDGE) -> str:
    """Return a base64 JPEG whose long edge is at most max_edge pixels."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)  # fix phone orientation
    img = img.convert("RGB")            # drop alpha (JPEG has no alpha channel)
    w, h = img.size
    if max(w, h) > max_edge:            # downscale, preserving aspect ratio
        scale = max_edge / max(w, h)
        new_size = (max(1, round(w * scale)), max(1, round(h * scale)))
        img = img.resize(new_size, Image.LANCZOS)
    # JPEG q=85 not PNG: order-of-magnitude smaller, model can't tell
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85, optimize=True)
    return base64.b64encode(buf.getvalue()).decode()


# Cost preflight
def estimate_tokens(path: str, max_edge: int = MAX_LONG_EDGE) -> int:
    """Estimate image tokens after prep_image()'s resize."""
    with Image.open(path) as img:
        w, h = img.size
    # Scale both edges by one factor, as prep_image() does. Clamping each edge
    # separately would overestimate non-square images.
    scale = min(1.0, max_edge / max(w, h))
    return round(w * scale) * round(h * scale) // 750  # Anthropic approximation
Plug this into your vision calls and a 10k-image/month pipeline typically saves 30–50% in image tokens, accuracy delta < 1%. More direct than another prompt rewrite.
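Where a savings figure like this comes from is easy to check before touching the pipeline. The sketch below compares formula-based token counts before and after a 1568px long-edge resize; the helper names and the sample size list are made up for illustration, and it assumes un-resized images would be billed at their raw pixel size.

```python
def tokens(width: int, height: int) -> int:
    """Anthropic approximation: (width * height) / 750."""
    return (width * height) // 750


def resized(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Image size after an aspect-preserving downscale to max_edge."""
    scale = min(1.0, max_edge / max(width, height))
    return round(width * scale), round(height * scale)


def savings(sizes: list[tuple[int, int]], max_edge: int = 1568) -> tuple[int, int, float]:
    """Return (tokens before, tokens after, fraction saved) for a batch of sizes."""
    before = sum(tokens(w, h) for w, h in sizes)
    after = sum(tokens(*resized(w, h, max_edge)) for w, h in sizes)
    return before, after, 1 - after / before


if __name__ == "__main__":
    # Hypothetical mix: two large screenshots, two images already under the cap.
    sample = [(1920, 1080), (1280, 720), (800, 600), (2560, 1440)]
    before, after, frac = savings(sample)
    print(before, after, f"{frac:.0%}")  # 9547 5554 42%
```

The result depends entirely on the size mix: a corpus of images already under 1568px saves nothing, while one dominated by phone photos saves far more.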
From 2025 onward Anthropic and OpenAI both support PDF input natively. Internally they render each page to image AND extract the text layer, feeding both to the model. Powerful, but each page consumes 1500–3000 tokens — a 10-page doc ≈ 20k+ tokens per session. The three tiers:
Decision flow (30 seconds): try pdfplumber.extract_text() on one page. Structured text and not garbled? Tier 1. Broken line breaks, scrambled multi-column? Tier 2. Empty / garbage? Tier 3. In production, the typical pipeline is Tier 2 as default + Tier 3 as fallback for the pages layout parsing fails on.
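The 30-second check can be roughed out as a function over the extracted text of one page. This is only a sketch: `pick_tier` is a hypothetical name, and every threshold in it (50 characters, 5% unprintable characters, more than half of the lines being three words or fewer) is an assumption to tune on your own documents.

```python
def pick_tier(text: str) -> int:
    """Map one page's extracted text to tier 1, 2, or 3 using rough heuristics."""
    text = text.strip()
    # Empty or near-empty text layer: treat the page as an image (Tier 3).
    if len(text) < 50:
        return 3
    # High share of unprintable characters: extraction produced garbage (Tier 3).
    bad = sum(1 for c in text if not (c.isprintable() or c.isspace()))
    if bad / len(text) > 0.05:
        return 3
    # Mostly tiny line fragments: a hint that columns or tables were scrambled
    # and line breaks are unreliable (Tier 2).
    lines = [ln for ln in text.splitlines() if ln.strip()]
    short = sum(1 for ln in lines if len(ln.split()) <= 3)
    if short / len(lines) > 0.5:
        return 2
    # Otherwise the text layer looks usable as-is (Tier 1).
    return 1


if __name__ == "__main__":
    prose = ("The quarterly report shows revenue growth across all regions.\n"
             "Operating costs fell slightly compared with the previous year.")
    print(pick_tier(prose))  # 1
    print(pick_tier(""))     # 3
```

Feed it the string returned by `pdfplumber`'s `extract_text()` for a sample page; the fragment heuristic is deliberately crude and will misfire on pages that are legitimately lists of short items.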
A hybrid processor that auto-falls-back to vision per page:
# pdf_hybrid.py: text first, vision fallback for failing pages
import base64

import anthropic
import fitz  # PyMuPDF
import pdfplumber


def extract_text(page) -> str:
    """Return the page's text layer, or "" if it has none."""
    txt = page.extract_text() or ""
    return txt.strip()


def needs_vision(text: str) -> bool:
    # Heuristic: too-short / high garbage-char ratio → page is image
    if len(text) < 50:
        return True
    bad = sum(1 for c in text if not (c.isprintable() or c.isspace()))
    return bad / len(text) > 0.05


def page_to_image_b64(pdf_path: str, page_idx: int) -> str:
    """Render one page to JPEG and return it base64-encoded."""
    with fitz.open(pdf_path) as doc:  # close the file handle after rendering
        pix = doc[page_idx].get_pixmap(dpi=150)  # 150 dpi is OCR sweet spot
        return base64.b64encode(pix.tobytes("jpeg")).decode()


def process_pdf(path: str, client: anthropic.Anthropic) -> list[str]:
    results = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages):
            txt = extract_text(page)
            if not needs_vision(txt):
                results.append(txt)  # Tier 1, free
                continue
            # fall back to vision, page-by-page
            img_b64 = page_to_image_b64(path, i)
            r = client.messages.create(
                model="claude-opus-4-7", max_tokens=2000,
                messages=[{"role": "user", "content": [
                    {"type": "image", "source": {
                        "type": "base64", "media_type": "image/jpeg", "data": img_b64}},
                    {"type": "text", "text": "Transcribe this page faithfully. Preserve tables as Markdown."},
                ]}])
            results.append(r.content[0].text)
    return results
On real corpora this pattern typically routes 70–90% of pages to Tier 1, so only the remaining 10–30% incur vision tokens; at the high end, total cost is ~1/10 of naive PDF input. 150 dpi is the OCR sweet spot: higher buys nothing, lower blurs small text.
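A quick back-of-envelope check on both numbers, pixel size at 150 dpi and the cost left after routing, can be written as follows. The helper names are illustrative, the token count uses the (width × height) / 750 approximation on raw pixel size, and text-extracted pages are treated as costing no transcription tokens, as in the processor above.

```python
def page_pixels(width_in: float, height_in: float, dpi: int = 150) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def vision_tokens(width_px: int, height_px: int) -> int:
    """Approximate image tokens for one rendered page."""
    return (width_px * height_px) // 750


def hybrid_tokens(n_pages: int, tier1_fraction: float, tokens_per_page: int) -> int:
    """Vision tokens spent when only the non-Tier-1 pages go to the model."""
    vision_pages = round(n_pages * (1 - tier1_fraction))
    return vision_pages * tokens_per_page


if __name__ == "__main__":
    w, h = page_pixels(8.5, 11)      # US Letter at 150 dpi
    per_page = vision_tokens(w, h)
    print((w, h), per_page)          # (1275, 1650) 2805
    print(hybrid_tokens(100, 0.9, per_page))  # 28050, a tenth of 280500
    print(hybrid_tokens(100, 0.7, per_page))  # 84150, three tenths
```

The 10× figure only holds when about 90% of pages have a usable text layer; at 70% the same document costs roughly three tenths of the all-vision baseline.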
Chain-of-thought (CoT) is common sense for text; people forget it for vision. Ask "what trend does this chart show?" and the model may infer an answer without ever reading the y-axis unit. OpenAI's 2024 vision evals make the same observation: VQA accuracy improves 5–15% just by adding "first describe what you see in the image, then answer." A plausible cause: visual features are attended to diffusely, and forcing an intermediate description makes the model explicitly surface the key visual evidence before it reasons.
The second tool is structured output. For a table or a dashboard, asking for JSON beats free-form text. Two reasons: (1) the JSON schema forces field-by-field generation, each field being its own "look"; (2) downstream you can schema-validate and retry on missing fields. Claude's tool use natively supports this constraint (wrap extraction as a tool call), which is far more reliable than "please output JSON" in the prompt.
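The validate-and-retry loop from reason (2) might look like the sketch below. It checks only top-level `required` keys instead of doing full JSON Schema validation, and `call_model` is a hypothetical callable standing in for whatever function sends the prompt and returns the parsed JSON as a dict.

```python
from typing import Any, Callable


def missing_required(payload: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Top-level required keys that are absent or null in the payload."""
    return [k for k in schema.get("required", []) if payload.get(k) is None]


def extract_with_retry(
    call_model: Callable[[str], dict[str, Any]],
    prompt: str,
    schema: dict[str, Any],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call the model until every required field is present, or give up."""
    hint = ""
    missing: list[str] = []
    for _ in range(max_attempts):
        payload = call_model(prompt + hint)
        missing = missing_required(payload, schema)
        if not missing:
            return payload
        # Tell the model exactly which fields to look at again.
        hint = "\nYour previous answer was missing: " + ", ".join(missing)
    raise ValueError(f"still missing {missing} after {max_attempts} attempts")
```

Naming the missing fields in the retry prompt keeps the second attempt focused on a specific part of the image instead of repeating the whole extraction blind.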
The third tool is knowing the actual limit of grounding. Claude/GPT-4o can output approximate bounding-box coordinates, but precision is bounded by roughly 1/100 of the image grid, so fine-grained UI element localization will drift. Anthropic Computer Use (screen agent) is itself a hybrid: vision sees, xdotool acts; the model's coordinates serve only as initial localization, not pixel-precise interaction.
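To see what that precision bound means in pixels, the sketch below reads "1/100 of the image grid" as roughly 1% of each image dimension. That reading is an interpretation, and the function names and the "element must be at least one cell wide" rule are illustrative assumptions.

```python
def cell_size_px(width: int, height: int, grid: int = 100) -> tuple[float, float]:
    """Size of one grid cell in pixels when coordinates resolve to 1/grid."""
    return width / grid, height / grid


def can_target(element_w: int, element_h: int, width: int, height: int,
               grid: int = 100) -> bool:
    """True if the element is at least one grid cell in both directions."""
    cell_w, cell_h = cell_size_px(width, height, grid)
    return element_w >= cell_w and element_h >= cell_h


if __name__ == "__main__":
    print(cell_size_px(1920, 1080))          # (19.2, 10.8)
    print(can_target(16, 16, 1920, 1080))    # False: a 16px icon is under one cell
    print(can_target(120, 40, 1920, 1080))   # True: a normal button is fine
```

On a 1920×1080 screenshot a single cell is about 19×11 pixels, which is why small icons and checkboxes are the elements that tend to get missed.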
An industrial-grade vision-extraction prompt template (works for charts, forms, UI screenshots):
# vision_prompt.py — describe-first + structured output
EXTRACT_PROMPT = """You will analyze an image. Follow these steps strictly.
<step1_observe>
Describe what you literally see (no interpretation):
- Image type (chart / table / UI / photo / diagram)
- All visible text labels, headers, axis titles, legend
- All numeric values you can read
- Color encoding / visual structure
</step1_observe>
<step2_extract>
Output the structured data as JSON matching this schema:
{schema}
For any field you cannot read confidently, use null and add a
note to "uncertain_fields" array.
</step2_extract>
<step3_answer>
Only after the above, answer the user's question:
{user_question}
Cite specific values from step2.
</step3_answer>
"""
# Wrap extraction as a tool; schema is the contract
chart_tool = {
    "name": "record_chart",
    "description": "Record the chart's structured data after observation.",
    "input_schema": {
        "type": "object",
        "properties": {
            "chart_type": {"type": "string", "enum": ["line", "bar", "pie", "scatter"]},
            "x_axis": {"type": "object", "properties": {
                "label": {"type": "string"}, "unit": {"type": "string"}}},
            "y_axis": {"type": "object", "properties": {
                "label": {"type": "string"}, "unit": {"type": "string"}}},
            "series": {"type": "array", "items": {"type": "object", "properties": {
                "name": {"type": "string"},
                "points": {"type": "array"}}}},
            "uncertain_fields": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["chart_type", "y_axis", "series"],
    },
}
# Force the tool: tool_choice = {"type":"tool","name":"record_chart"}
# Caveat: with a forced tool, Claude emits no free text before the tool call,
# so the step1 description is skipped. To keep describe-first, add a leading
# "observations" string field to the schema or leave tool_choice on auto.
On a 100-chart holdout the numeric-extraction accuracy went from ~62% (free-form) to ~88% with this pattern, and uncertain_fields doubles as a downstream confidence signal.
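Using `uncertain_fields` as a confidence signal can be as simple as routing records to human review when a field you care about is flagged. The sketch below assumes the model writes field paths that start with the top-level property name (for example "y_axis.unit"); the function names and the choice of critical fields are illustrative.

```python
def needs_review(record: dict, critical: tuple[str, ...] = ("y_axis", "series")) -> bool:
    """True if any flagged field path starts with a critical top-level field."""
    uncertain = record.get("uncertain_fields") or []
    return any(path.startswith(name) for path in uncertain for name in critical)


def split_by_confidence(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition extracted records into (auto-accept, send-to-review)."""
    auto, review = [], []
    for record in records:
        (review if needs_review(record) else auto).append(record)
    return auto, review


if __name__ == "__main__":
    records = [
        {"chart_type": "bar", "uncertain_fields": []},
        {"chart_type": "line", "uncertain_fields": ["y_axis.unit"]},
        {"chart_type": "pie", "uncertain_fields": ["x_axis.label"]},
    ]
    auto, review = split_by_confidence(records)
    print(len(auto), len(review))  # 2 1
```

Only the record with a flagged y-axis goes to review; an uncertain x-axis label is accepted because `x_axis` is not in the critical list.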
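Two roads for multimodal RAG: true multimodal embedding, where a CLIP-style model embeds images and queries into one vector space, and caption-then-retrieve, where a vision LLM captions each image offline and the caption is indexed as text.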
Why caption-then-retrieve (CTR) wins: (1) a caption is LLM-generated semantic compression, sharper than CLIP visual features for "what does this image say"; (2) caption length is unbounded, so you can encode "Q3 sales dashboard, primary metric ARR grew from $2M to $3.5M" at business semantic level; (3) retrieval rides the mature text-embedding stack (BGE / Cohere / OpenAI), with no multimodal infra to maintain; (4) caption metadata (doc/page/section) doubles as filterable recall context. The cost: one-time offline captioning (one vision-LLM call per image), amortized over every later query. ColPali (Faysse et al. 2024) is the third road: ColBERT-style late interaction directly on PDF page renders. It beats CTR on visually dense documents but requires specialised infra.
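For readers who haven't met late interaction, the scoring rule is short enough to show. This is the generic ColBERT-style MaxSim score on toy 2-d vectors, not ColPali's code: in a real system the query vectors would be per-token embeddings and the page vectors per-patch embeddings of the rendered page.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: list[Vector], page_vecs: list[Vector]) -> float:
    """Late interaction: each query vector takes its best-matching page vector,
    and the per-query maxima are summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]      # two query-token embeddings
    page_a = [[1.0, 0.0], [0.5, 0.5]]     # matches the first token well
    page_b = [[0.0, 1.0], [1.0, 0.0]]     # matches both tokens exactly
    print(maxsim_score(query, page_a))    # 1.5
    print(maxsim_score(query, page_b))    # 2.0
```

Because every page keeps many vectors instead of one, the index is much larger than a single-vector store, which is where the specialised infra requirement comes from.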
A minimum caption-then-retrieve implementation:
# multimodal_rag_ctr.py: offline caption + online text retrieval
# Assumes prep_image() from §1, plus a text_embed() function and a vec_store
# object with add()/search() supplied by your own embedding stack.
CAPTION_PROMPT = """Describe this image for a semantic search index.
Include:
1. Image type (chart/diagram/screenshot/photo/table)
2. Main subject / what it depicts
3. All readable text, labels, headers
4. Key numeric values or data points
5. The likely "question this image answers" (1 sentence)
Output 200-400 words, dense and factual. No filler."""


def index_image(img_path, doc_id, page, client, vec_store):
    img_b64 = prep_image(img_path)  # from §1
    caption = client.messages.create(
        model="claude-opus-4-7", max_tokens=600,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                "media_type": "image/jpeg", "data": img_b64}},
            {"type": "text", "text": CAPTION_PROMPT},
        ]}]).content[0].text  # note the closing "]" for the messages list
    vec_store.add(
        embedding=text_embed(caption),
        text=caption,
        metadata={"img_path": img_path, "doc_id": doc_id, "page": page},
    )


def query(question, vec_store, client, k=3):
    hits = vec_store.search(text_embed(question), k=k)
    # Send the original images back in: caption is the key, image is ground truth
    content = [{"type": "text", "text": f"Question: {question}"}]
    for h in hits:
        content.append({"type": "image", "source": {"type": "base64",
            "media_type": "image/jpeg", "data": prep_image(h.metadata["img_path"])}})
        content.append({"type": "text", "text": f"[Caption hint]: {h.text[:200]}"})
    return client.messages.create(model="claude-opus-4-7",
        max_tokens=1500, messages=[{"role": "user", "content": content}])
Two design choices matter: (1) caption is the retrieval key, but at answer time the original image is sent back (caption is lossy compression, image is truth); (2) the caption prompt explicitly asks "what question does this image answer" — exactly the semantic shape a query needs to match.
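The listing above leaves `text_embed` and `vec_store` to your own stack. For local experiments, a stand-in like the following is enough to exercise the retrieval path. It is a toy under stated assumptions: the "embedding" is a normalized bag of words rather than a real embedding model, and `InMemoryVecStore` and `Hit` are hypothetical names whose only contract is the `add()`/`search()` shape and the `.text`/`.metadata` attributes used above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


def text_embed(text: str) -> dict[str, float]:
    """Toy embedding: L2-normalized word counts (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


@dataclass
class Hit:
    text: str
    metadata: dict
    score: float


class InMemoryVecStore:
    """Minimal store exposing the add()/search() shape used by the CTR code."""

    def __init__(self) -> None:
        self._rows: list[tuple[dict[str, float], str, dict]] = []

    def add(self, embedding: dict[str, float], text: str, metadata: dict) -> None:
        self._rows.append((embedding, text, metadata))

    def search(self, query_embedding: dict[str, float], k: int = 3) -> list[Hit]:
        # Cosine similarity reduces to a dot product on normalized vectors.
        hits = [
            Hit(text, meta, sum(w * emb.get(tok, 0.0)
                                for tok, w in query_embedding.items()))
            for emb, text, meta in self._rows
        ]
        return sorted(hits, key=lambda h: h.score, reverse=True)[:k]
```

Swapping in a real embedding model later only means replacing `text_embed`; the captions, metadata, and query flow stay the same.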
Pick one vision/PDF task you're running (or planning), then work through this checklist:
1. Run your current images through estimate_tokens(). Median > 3000 tokens? You're burning cash.
2. Plug prep_image() in, long edge 1568. Run a 20-sample eval and compare accuracy; if delta < 1%, ship to 100%.
3. Run pdfplumber.extract_text on 3 pages of each PDF. Count Tier-1-capable pages. If 70%+, plug in the hybrid processor today for up to a 10× cost saving.
4. Put <step1_observe> in front of your vision prompt and re-run. If accuracy gain > 5%, lock it in.

45 minutes typically nets at least 30% off image tokens and 5–10% accuracy gain. The ROI of multimodal optimization is usually an order of magnitude higher than swapping models or rewriting prompts, because almost nobody has actually engineered this layer.