AI/ML Explained: Multimodality

Day 23 · 2026-06-09 · Level: Frontier
For: engineers with coding experience, outside the AI field

Vision-Language Model (VLM)

cross-modal · alignment
One-line analogy

CLIP's contrastive-learning route (contrastive learning = pull "matching" samples together, push "non-matching" ones apart) is essentially the two-tower retrieval model you already know from recommendation/search: a user tower and an item tower each encode their input into a vector, and you take inner products in a shared space. CLIP trains an "image tower" and a "text tower" into one shared vector space, so matching image-text pairs have a large inner product and mismatched ones small. Modern generative VLMs (the LLaVA route) take a different path: they bolt an "image→token" encoder onto the front of an LLM — like adding an ETL adapter to a text-only query engine, translating images into tokens the LLM can consume directly.

Problem it solves + how it works

Pain point: an LLM only takes text tokens as input — it literally cannot "see" pixels. To make a model understand images, the field split into two paradigms:

  • Contrastive (CLIP, 2021): trained on 400M web image-text pairs; the contrastive loss pulls matching image/text vectors together and pushes mismatches apart. The output is a shared embedding space that naturally supports zero-shot classification and image-text retrieval. But it can only "match", not generate sentences;
  • Generative (Flamingo / LLaVA): a vision encoder (usually a CLIP-trained ViT) slices the image into patches and encodes them into vectors → a projection layer maps those visual vectors into the LLM's token-embedding space → they're prepended as "visual tokens" to the text tokens, and the LLM reasons/generates over them like any other tokens. That is the LLaVA recipe; Flamingo instead feeds visual features in through cross-attention layers inserted into a frozen LLM. Training typically keeps the vision encoder frozen and first trains only the projection, then fine-tunes the LLM on a comparatively small instruction dataset, which is far cheaper than pretraining a multimodal model from scratch.
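To make the contrastive bullet concrete, here is a minimal sketch of a symmetric contrastive loss over a tiny batch, in dependency-free Python. It is an illustration under assumptions, not CLIP's training code: the function names are made up for this example, the temperature of 0.07 is just an illustrative value, and the 2-D vectors stand in for real encoder outputs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i matches text i."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, sharpened by the temperature
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]
    # Each image must pick its own text out of the batch, and vice versa.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

aligned = clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # matching pairs give the lower loss
```

The "pull together, push apart" behaviour falls out of the cross-entropy: the diagonal entry is the positive, and every other entry in the same row or column is a negative supplied for free by the batch.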
Generative VLM architecture (LLaVA-style)

image → ViT vision encoder → patch vectors → projection MLP
↓ (projected into LLM token space)
visual tokens + text tokens "What's in this image?" → LLM → text answer
↑ Key: the image is "translated" into tokens that sit as equals beside text tokens in one LLM
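The diagram above can be sketched in a few lines of plain Python. Everything here is a toy under stated assumptions: the dimensions are tiny (real models use roughly 1024-dimensional patch vectors and 4096-dimensional LLM embeddings), the weights are random rather than trained, and `make_linear`, `linear`, and `project` are hypothetical helpers written for this example.

```python
import math
import random

VISION_DIM, LLM_DIM = 8, 16  # toy sizes, chosen only to keep the example fast

def make_linear(in_dim, out_dim, rng):
    """Random (untrained) weights and zero bias for one linear layer."""
    weight = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(x, layer):
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def project(patch_vec, fc1, fc2):
    """Two-layer MLP: vision space -> LLM embedding space."""
    hidden = [gelu(h) for h in linear(patch_vec, fc1)]
    return linear(hidden, fc2)

rng = random.Random(0)
fc1 = make_linear(VISION_DIM, LLM_DIM, rng)
fc2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for the ViT output (4 patches) and the embedded text prompt (6 tokens).
patch_vectors = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(6)]

visual_tokens = [project(p, fc1, fc2) for p in patch_vectors]
llm_input = visual_tokens + text_embeddings  # visual tokens sit in front of the text
print(len(llm_input), len(llm_input[0]))  # 10 16
```

From the LLM's point of view `llm_input` is just a sequence of 10 vectors of the right width; nothing marks the first four as having come from pixels.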
Code example
import base64
from anthropic import Anthropic
client = Anthropic()  # needs ANTHROPIC_API_KEY

with open("chart.png", "rb") as f:  # read the image and base64-encode it
    img = base64.b64encode(f.read()).decode()
resp = client.messages.create(
    model="claude-opus-4-8", max_tokens=512,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64",
         "media_type": "image/png", "data": img}},
        {"type": "text", "text": "What's the key takeaway of this chart?"}
    ]}]
)  # image and text in the same message — both end up as tokens
print(resp.content[0].text)
Common misconception + practical scenario
"A VLM really 'reads' every pixel" — wrong. The image is compressed into a finite number of patch tokens (often 256–576), so tiny text, exact counting, and dense tables are error-prone. It's good at semantic understanding (what kind of scene this is), not pixel-level close reading (what's written in row 3, column 5).
📌 Scenario: when reading papers, screenshotting an architecture diagram or results table into a VLM for a quick gist is efficient — but always re-check the key numbers against the original; don't let the model "read out" a decimal point it never actually saw.
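The 256–576 range is plain arithmetic on the patch grid. A small sketch, assuming a ViT with 14×14-pixel patches at 224 px and 336 px input resolution (a common configuration, but treat the specific sizes as assumptions; `patch_tokens` is a helper defined only for this example):

```python
def patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

for size in (224, 336):
    tokens = patch_tokens(size, 14)
    print(f"{size}px -> {tokens} tokens, {14 * 14} pixels summarized per token")
# 224px -> 256 tokens, 196 pixels summarized per token
# 336px -> 576 tokens, 196 pixels summarized per token
```

Each token has to summarize 196 pixels, and a large screenshot is first downscaled to the encoder's input size, so a decimal point a few pixels wide can vanish before the LLM ever sees it.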
Takeaway + reflection
💡 A VLM is not an "LLM with eyes" — it's "an LLM fed images translated into tokens". Grasp that and you grasp the boundary of its abilities.
🤔 If an image ultimately becomes tokens too, does "a picture is worth a thousand words" hold on a token budget? How many tokens should one image cost — and whose engineering decision is that?

Audio LLM (Whisper)

seq modeling · speech
One-line analogy

Audio is a one-dimensional time-series stream — like a database binlog or the event stream in a message queue, continuous and unbroken. Whisper's approach: slice the waveform into 30-second windows, turn each into a log-Mel spectrogram (a 2D "time × frequency" image), then use an encoder-decoder Transformer to "translate" it into text tokens. It's isomorphic to seq2seq machine translation: the source sequence is sound, the target sequence is text, with an encoder and a decoder in between.
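A quick sketch of the shapes involved in that windowing step. The constants follow the original Whisper release as I understand it (16 kHz audio, a 10 ms hop between spectrogram frames, 80 Mel bins); treat them as assumptions if you use a different checkpoint. The helper names are made up for this example, and the actual FFT and Mel filtering are left out.

```python
SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30     # fixed window length
HOP_LENGTH = 160       # samples between frames = 10 ms
N_MELS = 80            # frequency bins after Mel filtering

def pad_or_trim(samples, length=SAMPLE_RATE * CHUNK_SECONDS):
    """Force a clip to exactly one 30-second window (zero-pad or cut)."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))

def spectrogram_shape(num_samples):
    """(frequency bins, time frames) of the log-Mel 'image' for one window."""
    return (N_MELS, num_samples // HOP_LENGTH)

clip = [0.0] * (SAMPLE_RATE * 12)   # a 12-second clip of silence
window = pad_or_trim(clip)
print(len(window), spectrogram_shape(len(window)))  # 480000 (80, 3000)
```

Note that a 12-second clip still becomes a full 30-second window, with 18 seconds of padding. That fixed-size input is convenient for batching, and it is also one reason silence handling matters.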

Problem it solves + how it works

Pain point: traditional speech recognition stitches together several modules — voice activity detection, an acoustic model, a language model, forced alignment — each trained separately with brittle interfaces. Whisper (2022) replaces the whole pipeline with one end-to-end seq2seq model, trained on 680,000 hours of "weak supervision" (weak supervision = web-scraped captions not guaranteed to be perfectly accurate), achieving zero-shot robustness across languages and accents. The mechanism, in four steps: waveform → log-Mel spectrogram → encoder encodes → decoder autoregressively emits text tokens, with special control tokens switching tasks (transcribe / translate / language ID / timestamps).
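The "control tokens switch tasks" idea is easiest to see as the prefix the decoder is conditioned on before it emits any text. A minimal sketch: the token strings follow Whisper's published format, but `decoder_prompt` is a hypothetical helper for this example and not part of any library.

```python
def decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Build the control-token prefix that tells the decoder what to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(decoder_prompt())                    # English transcription, no timestamps
print(decoder_prompt("fr", "translate"))   # French speech in, English text out
```

Same weights, same audio encoder; only the prefix changes. It is the speech equivalent of a prompt, which is why one model can cover several tasks that used to be separate modules.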

The new generation of audio LLMs (e.g. GPT-4o, Gemini's native speech) goes further: they put audio tokens and text tokens into the same LLM, doing spoken dialogue and emotion/tone understanding directly, not just transcription. Which loops back to the unifying theme — audio also becomes tokens.

Code example
from openai import OpenAI
client = OpenAI()  # needs OPENAI_API_KEY

# Whisper transcribes audio straight to text (hosted API)
with open("meeting.mp3", "rb") as f:
    tr = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="text",
        # language="en" can be set explicitly to avoid mis-ID on short clips
    )
print(tr)  # once you have text, feed an LLM to summarize / extract action items
Common misconception + practical scenario
"Whisper is always more accurate than a small model" — not always. On long audio it suffers "repetition hallucination" (looping the same phrase), and on silent / non-speech segments it invents text from nothing. It's a generative model — it "guesses" even in silence. Accuracy-sensitive use cases need silence splitting and post-processing.
📌 Scenario: transcribing meeting recordings or voice memos to text, then having an LLM summarize decisions and to-dos, is a high-leverage workflow. But proofread names, numbers, and jargon — exactly where transcription most often fails.
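One cheap form of the "silence splitting" mentioned above is an energy gate: only send spans whose loudness crosses a threshold. The sketch below is a deliberately naive version in plain Python, with an assumed threshold and frame size; production systems usually use a trained voice-activity detector instead, and `speech_segments` is a name invented for this example.

```python
import math

def speech_segments(samples, sample_rate=16_000, frame_ms=30,
                    threshold=0.01, max_gap_frames=10):
    """Return (start_s, end_s) spans whose RMS energy exceeds `threshold`.

    Up to `max_gap_frames` quiet frames are tolerated inside one span, so
    short pauses between words do not split a sentence.
    """
    frame = sample_rate * frame_ms // 1000          # samples per frame
    n_frames = len(samples) // frame
    segments, start, gap = [], None, 0
    for k in range(n_frames):
        chunk = samples[k * frame:(k + 1) * frame]
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        if rms > threshold:
            if start is None:
                start = k
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap_frames:                # pause is long enough: close the span
                segments.append((start * frame / sample_rate,
                                 (k - gap + 1) * frame / sample_rate))
                start, gap = None, 0
    if start is not None:                           # audio ended while still in a span
        segments.append((start * frame / sample_rate,
                         (n_frames - gap) * frame / sample_rate))
    return segments

# Demo: 1 s silence, 1 s of a 440 Hz tone, 1 s silence.
SR = 16_000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SR) for t in range(SR)]
audio = [0.0] * SR + tone + [0.0] * SR
print(speech_segments(audio))  # one span, roughly from second 1 to second 2
```

Transcribing only the returned spans means the model never gets a silent window to "fill in", and the span offsets give you timestamps to stitch the pieces back together.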
Takeaway + reflection
💡 Speech recognition = seq2seq-translating sound into text; modern audio LLMs turn sound into tokens too, reasoning over them inside the LLM directly.
🤔 When a model "hears" tone and pauses rather than just literal words, which scenarios get redefined — customer service, companionship, or lie detection?

Video Understanding

temporal · sampling
One-line analogy

Video = images sharded along the time dimension. One minute at 30fps = 1,800 frames, and each frame is hundreds of patch tokens — the token count explodes, equivalent to a full table scan on a huge table. Engineering-wise you must do sampling + partition pruning like a database: frame sampling + temporal pooling, keeping only the high-information-gain frames and dropping the redundant "incremental log".

Problem it solves + how it works

The core difficulty of video understanding is one word: too much — token explosion + temporal redundancy. Adjacent frames are often 99% identical (like an incremental log with almost no change); stuffing them all in is both expensive and dilutes attention. Two classes of mechanism handle it:

  • Temporal modeling: each frame goes through the vision encoder for spatial features, then a temporal-attention or 3D-convolution layer on top captures "motion" (you can't tell "the cup is tipping over" from a single frame);
  • Token compression: sparse frame sampling (e.g. 1fps instead of 30fps), token merging (merging similar patches), or a Q-Former that compresses each frame into a few query tokens. The goal: keep the most information in the fewest tokens.
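As one concrete instance of the compression ideas in the list above, here is a sketch of temporal pooling: average each patch position across a window of consecutive frames, so `window` frames cost the tokens of one. It uses nested lists in plain Python to stay dependency-free; the layout (frames × tokens × dimensions) and the function name are assumptions for this example.

```python
def temporal_pool(frame_tokens, window):
    """Average each token position across `window` consecutive frames.

    frame_tokens[f][t] is the feature vector of token t in frame f.
    The last group may be shorter than `window`.
    """
    pooled = []
    for start in range(0, len(frame_tokens), window):
        group = frame_tokens[start:start + window]
        n_tokens, dim = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(frame[t][d] for frame in group) / len(group) for d in range(dim)]
            for t in range(n_tokens)
        ])
    return pooled

# 4 frames, 2 tokens per frame, 1-dimensional features
frames = [[[0.0], [10.0]], [[1.0], [11.0]], [[2.0], [12.0]], [[3.0], [13.0]]]
print(temporal_pool(frames, window=2))  # [[[0.5], [10.5]], [[2.5], [12.5]]]
```

Averaging is the bluntest option: it halves the token count here but also blurs motion inside each window, which is exactly the trade-off the temporal-modeling bullet is trying to avoid.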
Token count for a 1-minute video (illustrative)
30 fps full   ~1800 frames → token explosion 💥
1 fps sparse ~60 frames → manageable ✅
↑ Core trade-off: too sparse misses fast motion, too dense blows the window
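The illustrative numbers above turn into a token budget with one multiplication. A small sketch, assuming 576 tokens per frame (the upper end of the per-image range mentioned earlier; your encoder may differ):

```python
TOKENS_PER_FRAME = 576  # assumption: one frame costs as much as one still image

def video_tokens(duration_s, fps, tokens_per_frame=TOKENS_PER_FRAME):
    """Total visual tokens if every sampled frame is encoded independently."""
    return int(duration_s * fps) * tokens_per_frame

full = video_tokens(60, 30)    # every frame of a 1-minute, 30 fps video
sparse = video_tokens(60, 1)   # one frame per second
print(full, sparse)            # 1036800 34560
print(full // sparse, (full // sparse) ** 2)  # 30 900
```

Sampling at 1 fps cuts tokens by 30×, and because attention cost grows with the square of sequence length, the pairwise-attention work drops by roughly 900×.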
Code example
import base64
import cv2

# Sparse-sample with OpenCV at ~1 fps, instead of sending every frame
cap = cv2.VideoCapture("lesson.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
step = max(1, round(fps))  # frames to skip per sample; guards against fps == 0 or fps < 1
frames = []  # frames[k] is the base64 JPEG taken at roughly second k
i = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:  # take 1 frame per second
        _, buf = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buf.tobytes()).decode())
    i += 1
cap.release()
# send these frames + timestamps to a VLM, ask "what happened at second N"
print(len(frames), "frames sent (sparsified)")
Common misconception + practical scenario
"More frames = better understanding" — wrong. Past the window budget, attention gets diluted and key frames drown in redundancy, so quality drops instead of rising. The art of video understanding is "sampling the right frames", not "more frames" — the same principle as "put the right things in" for long context.
📌 Scenario: having AI watch a lecture video or a recording of your kid's activity and asking "at what minute was concept X covered / what is the child doing" — first sample at 1fps and attach timestamps, so the model can map content to time instead of mushing it together.
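"Sampling the right frames" can start with something as simple as a change detector: keep a frame only when it differs enough from the last frame you kept. The sketch below works on frames given as flat lists of grayscale pixel values; the threshold and the function name are assumptions for illustration, and real pipelines often use scene-cut detection or embedding distance instead.

```python
def select_keyframes(frames, threshold):
    """Return indices of frames whose mean absolute pixel difference
    from the previously kept frame exceeds `threshold`."""
    kept = []
    last = None
    for index, frame in enumerate(frames):
        if last is None:
            kept.append(index)      # always keep the first frame
            last = frame
            continue
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff > threshold:
            kept.append(index)
            last = frame
    return kept

# Six tiny 4-pixel "frames": static, static, sensor noise, scene change, static, change back
clip = [[0] * 4, [0] * 4, [1, 0, 0, 0], [100] * 4, [100] * 4, [0] * 4]
print(select_keyframes(clip, threshold=10))  # [0, 3, 5]
```

Comparing against the last kept frame rather than the immediately previous one matters: a slow pan changes little frame to frame, but the drift accumulates and eventually triggers a new keyframe.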
Takeaway + reflection
💡 The bottleneck of video understanding isn't "can't understand a frame" — it's "too many frames". It's fundamentally a sampling-and-compression engineering problem.
🤔 "Dropping frames loses information" and a database "losing precision on sampled queries" are the same trade-off. What signal would you use to decide whether a frame is worth keeping?

Embodied AI · Vision-Language-Action (VLA)

end-to-end · closed-loop
One-line analogy

VLA turns a robot's "perceive → decide → act" into one end-to-end service. Traditional robots are a microservices architecture: perception, planning, and control modules each independent and glued by interfaces — one failure cascades through the chain. VLA (RT-2 / OpenVLA) is a monolithic large model: image + natural-language instruction in, robot action directly out — and the action is encoded as tokens (like serializing an RPC call into tokens), so robot control directly reuses the LLM's "predict the next token" machinery.

Problem it solves + how it works

Pain point: robots are extremely hard to generalize — swap in an unseen object or an unseen kitchen and they fail, because real robot data is scarce (collecting one trajectory requires physical operation — slow and expensive). VLA's key insight: discretize actions into tokens, and robot control becomes "sequence generation", letting it reuse the common sense a VLM learned from internet-scale image-text data.

RT-2 (2023) discretizes 7-DoF actions (xyz translation + roll/pitch/yaw pose + gripper open/close) into tokens sharing one vocabulary with text, and co-fine-tunes on robot trajectories and web image-text. The result: the robot exhibits emergent "common-sense reasoning" — tell it to "put the strawberry into the correct bowl" and it infers the semantic relation between strawberry and bowl from pretraining knowledge, something pure robot data could never teach.
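Discretizing an action is just uniform binning per dimension. A minimal sketch, assuming 256 bins per dimension (the bin count commonly reported for RT-2 and OpenVLA) and an action range normalized to [-1, 1]; the bounds and both function names are assumptions for this example.

```python
N_BINS = 256

def action_to_tokens(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin id."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)
        bin_id = int((clipped - lo) / (hi - lo) * n_bins)
        tokens.append(min(bin_id, n_bins - 1))   # value == hi falls in the top bin
    return tokens

def tokens_to_action(tokens, low, high, n_bins=N_BINS):
    """Decode bin ids back to continuous values (the center of each bin)."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins for t, lo, hi in zip(tokens, low, high)]

# 7-DoF: x, y, z, roll, pitch, yaw, gripper
low, high = [-1.0] * 7, [1.0] * 7
action = [0.02, -0.5, 1.0, 0.0, 0.0, 0.0, 1.0]
tokens = action_to_tokens(action, low, high)
decoded = tokens_to_action(tokens, low, high)
print(tokens)
print(max(abs(a - d) for a, d in zip(action, decoded)))  # at most half a bin width
```

The round-trip error is bounded by half a bin width, here 2 / 256 / 2 ≈ 0.004 in normalized units. That bound is the "action precision is limited by vocabulary granularity" cost in concrete form.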

VLA closed loop (perception-action loop)

camera image + instruction "put strawberry in bowl" → VLA model
↓ emits action tokens (same vocab as text)
action tokens → robot executes 7-DoF action → environment changes
└──────────── next-frame image feeds back, loop continues ────────────┘
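The loop in the diagram has a very small skeleton: observe, predict one action, execute, observe again. Below is a toy version in which a 1-D position stands in for the robot and a hand-written rule stands in for the model. `ToyEnv`, `toy_policy`, and `run_episode` are all hypothetical names; a real VLA would take a camera image and emit action tokens.

```python
class ToyEnv:
    """1-D stand-in for a robot: the state is a gripper position, the goal a target."""

    def __init__(self, position=0, target=5):
        self.position, self.target = position, target

    def observe(self):
        return {"position": self.position, "target": self.target}

    def step(self, action):
        self.position += action        # the environment changes as a result of the action

    def done(self):
        return self.position == self.target

def toy_policy(observation, instruction):
    """Placeholder for the VLA model: returns -1, 0, or +1."""
    delta = observation["target"] - observation["position"]
    return (delta > 0) - (delta < 0)

def run_episode(env, policy, instruction, max_steps=20):
    """Closed loop: every action is chosen from a fresh observation."""
    steps = 0
    while not env.done() and steps < max_steps:
        observation = env.observe()
        env.step(policy(observation, instruction))
        steps += 1
    return steps

env = ToyEnv(position=0, target=5)
print(run_episode(env, toy_policy, "move to the target"), env.done())  # 5 True
```

The important property is that the policy never plans the whole trajectory up front. It re-reads the world each step, so a bumped object or a missed grasp shows up in the next observation and gets corrected.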
Code example
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
# OpenVLA: open-source 7B vision-language-action model (needs GPU)
proc = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16,
    trust_remote_code=True).to("cuda:0")

image = get_from_camera()  # placeholder, not a library call: return a PIL image from the robot camera
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = proc(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# returns a 7-DoF action (xyz + pose + gripper); unnorm_key picks the dataset statistics used to un-normalize it
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # continuous values decoded from the predicted action tokens
Common misconception + practical scenario
"Actions can be tokenized, so home robots are right around the corner" — not so fast. Scarce real-world data plus hard constraints on safety / latency / physical contact (you can't "retry once" after breaking something) mean VLA still lives mostly in controlled labs. It proves the "language is a universal interface" path works, but it's a long way from your kitchen.
📌 Cross-disciplinary scenario: VLA gives a deep validation that the "language is a universal interface" hypothesis holds. Encode any task (seeing, hearing, acting) into a token sequence and you can reuse the same autoregressive prediction engine. This rhymes with "everything is a message" as a unifying abstraction in distributed systems.
Takeaway + reflection
💡 Tying today's four pieces together: text, images, audio, video, actions — all encoded into tokens, fed into one autoregressive engine. The essence of multimodality is "everything is a token".
🤔 If even robot actions can be predicted as tokens, is there still a fundamental computational difference between "thinking" and "acting"? What does that mean for agent design?

Deep Questions

1. Today's four concepts share a hidden thread: "everything is a token". Why does this paradigm work? What's the cost?
The root reason it works is that the Transformer's attention is modality-agnostic over "sequences" — it only cares about the pairwise correlations among a series of vectors, not whether those vectors were originally text, image patches, audio spectrograms, or robot actions. As long as you can encode a modality into a vector sequence, you can feed it into the same engine and reuse its prediction machinery and the statistical regularities it learned over massive data. That's why "language is a universal interface" — more precisely, "token sequences are a universal interface". Three costs: (a) information loss — discretizing continuous pixels, sound waves, and actions into finite tokens always loses precision (can't read tiny text; action precision is bounded by vocabulary granularity); (b) token explosion — high-bandwidth modalities (video) explode in token count with duration × resolution, and O(n²) attention can't take it, hence the need for sampling and compression; (c) the alignment problem — tokens from different modalities must land in "the same semantic space" to reason across each other, and alignment quality caps cross-modal ability. You'll notice this is isomorphic to "everything is a message" as a unifying abstraction in distributed systems: a universal interface buys composability at the cost of serialization/deserialization loss and bandwidth pressure.
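The "attention only sees vectors" claim in the answer above can be checked on a toy. This sketch is single-head self-attention in plain Python with identity query/key/value projections and no positional encoding, both simplifications made for the example (real models learn the projections and add position information). The labels in the comments are the only place modality appears.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Each output vector is a similarity-weighted average of all input vectors."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out

mixed = [
    [1.0, 0.0],   # pretend: an image-patch token
    [0.0, 1.0],   # pretend: an audio token
    [1.0, 1.0],   # pretend: a text token
]
for row in self_attention(mixed):
    print([round(x, 3) for x in row])
```

Nothing in `self_attention` knows or cares where a vector came from. The cost is also visible: the nested loop over `seq` is the O(n²) term that makes long video sequences expensive.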
2. CLIP's contrastive route vs LLaVA's generative route — when should you use which?
It depends on whether you want to "match" or "generate". Contrastive (CLIP) yields a shared embedding space; its strengths are: image-text retrieval (text-to-image, image-to-image search), zero-shot classification, the recall layer of multimodal RAG, image tagging — anything that "computes similarity". It's light, fast, and indexable offline — essentially the two-tower retrieval you know. Weakness: it can't talk; it can't give a natural-language description of "what's happening in this image". Generative (LLaVA / Flamingo) shines at: visual QA, captioning, OCR+reasoning, visual dialogue — anything that needs "expressing image understanding in language". The cost is expensive, slow inference, and hallucination. In practice the two are often stacked: use CLIP embeddings for retrieval recall (cheaply pulling the relevant few from a million images), then a generative VLM for fine reasoning over the recalled results (expensive but accurate). Again, the "recall-then-rerank" two-stage architecture, replicated in multimodality.
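The recall-then-rerank stacking described above looks like this in miniature. The embeddings are hand-written 2-D vectors and the "expensive" scorer is a lookup table standing in for a generative VLM call; `recall`, `rerank`, the file names, and the scores are all invented for this example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recall(query_vec, index, k):
    """Stage 1 (cheap): top-k items by embedding similarity."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

def rerank(candidates, expensive_score):
    """Stage 2 (expensive): re-order only the recalled candidates."""
    return sorted(candidates, key=expensive_score, reverse=True)

# Precomputed image embeddings (stand-ins for CLIP image-tower outputs)
index = {"cat.jpg": [0.9, 0.1], "dog.jpg": [0.8, 0.3], "car.jpg": [0.0, 1.0]}
query = [1.0, 0.0]   # stand-in for the text-tower embedding of the query

candidates = recall(query, index, k=2)
vlm_scores = {"cat.jpg": 0.2, "dog.jpg": 0.9}   # pretend output of a VLM judging each image
print(candidates)                              # ['cat.jpg', 'dog.jpg']
print(rerank(candidates, vlm_scores.__getitem__))  # ['dog.jpg', 'cat.jpg']
```

The expensive scorer only ever runs on `k` items, no matter how large the index grows, which is the whole economic argument for the two-stage design.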
3. Video token explosion — which classic distributed-systems problem is it isomorphic to? How do you transfer your existing engineering intuition?
It's isomorphic to "storing and querying massive time-series data" — time-series DBs, log systems, and stream processing all solve the same problem. Transferable intuitions: (a) downsampling — monitoring systems don't store raw per-millisecond data but aggregate per second/minute; frame sampling is time-dimension downsampling, 1fps being "one aggregate point per second"; (b) delta encoding / dedup — adjacent frames are 99% identical, like an incremental log storing only the delta; token merging and keyframe extraction essentially "store only the delta"; (c) tiered storage (hot/cold) — sample "action-dense segments" at high frame rate and "static segments" at low rate, analogous to high-precision hot data and archived cold data; (d) predicate pushdown — rather than feeding the whole video to the model then asking "when did the cup tip over", first use a lightweight detector to localize candidate windows, then read closely — push the filter upstream. Your experience handling massive logs/monitoring maps almost one-to-one onto the engineering trade-offs of video understanding.
4. VLA validates "language is a universal interface". What does this mean philosophically — for cognitive science, for the boundary between "thinking" and "acting"?
The most striking thing about VLA isn't that a robot can move — it's the implication that perception, language, and action may share one underlying representation: all sequences processable by the same prediction engine. This echoes several threads in cognitive science: embodied cognition argues "thought is inseparable from the body's interaction with the environment", and VLA precisely compresses "understanding" and "acting" into one model, with no clean "deliberate first, then act" boundary — action tokens and thought tokens are predicted in the same autoregressive stream. Deeper still, this touches an old question: if "deciding the next word" and "deciding the next action" are computationally the same thing, is the "deliberation" we pride ourselves on just a longer token prediction? That needn't be a demotion — it may mean the unity of intelligence is stronger than we assumed. But beware over-analogizing: predicting tokens isn't "understanding meaning"; VLA is easily fooled by unseen physical situations and lacks the human world-model grown from bodily experience. This tension of "unity vs emergence" is exactly the boundary worth turning over repeatedly when crossing Buddhism, neuroscience, and complexity science.
5. Are tokens from different modalities really "in the same vector space"? What does cross-modal alignment actually mean?
"Same space" is a useful but careful phrasing. For CLIP, image vectors and text vectors really are trained into one metric space — matching image-text have a large inner product, so you can compute cross-modal similarity directly; that's a genuine shared space. But for LLaVA-style generative VLMs, the more accurate description is: visual tokens are projected into the input space the LLM can "consume", letting the LLM treat them like "foreign words" — they don't necessarily occupy positions symmetric to text tokens, only aligned enough for the LLM to "understand" them. The essence of alignment is making "the same semantics land in nearby positions across modalities": an image of a cat and the word "cat" should point to similar internal representations. Alignment quality determines all downstream ability — align it poorly and the model "talks nonsense while looking at the image". The hard part is that different modalities have wildly different information density (one word vs hundreds of thousands of pixels), so forcing them into one space inevitably creates an information bottleneck. That's why multimodal models still stumble on tasks needing "precise correspondence" (reading text in an image, counting objects) — semantics are aligned, but fine-grained pixel-level alignment is far from solved.