CLIP's contrastive-learning route (contrastive learning = pull "matching" samples together, push "non-matching" ones apart) is essentially the two-tower retrieval model you already know from recommendation/search: a user tower and an item tower each encode their input into a vector, and you take inner products in a shared space. CLIP trains an "image tower" and a "text tower" into one shared vector space, so matching image-text pairs have a large inner product and mismatched ones small. Modern generative VLMs (the LLaVA route) take a different path: they bolt an "image→token" encoder onto the front of an LLM — like adding an ETL adapter to a text-only query engine, translating images into tokens the LLM can consume directly.
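To make the "large inner product for matching pairs" idea concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of image and text vectors. The random vectors stand in for real tower outputs, and the function names and the temperature of 0.07 are illustrative assumptions, not CLIP's actual training code.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so inner products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_vecs is assumed to match row i of text_vecs.
    The temperature value is an assumption chosen for illustration.
    """
    img = l2_normalize(image_vecs)
    txt = l2_normalize(text_vecs)
    logits = img @ txt.T / temperature   # (batch, batch); diagonal = matching pairs
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Stand-in embeddings: "aligned" images sit next to their captions in vector space
rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(4, 8))
aligned_images = text_vecs + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]  # every image paired with the wrong caption

loss_aligned = contrastive_loss(aligned_images, text_vecs)
loss_shuffled = contrastive_loss(shuffled_images, text_vecs)
print(loss_aligned, loss_shuffled)
```

Training pushes the loss toward the "aligned" case: the diagonal of the similarity matrix grows while off-diagonal entries shrink, which is the same objective a two-tower retrieval model optimizes with in-batch negatives.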
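Pain point: an LLM only takes text tokens as input; it literally cannot "see" pixels. To make a model understand images, the field split into two paradigms: contrastive alignment of a separate image encoder and text encoder (the CLIP route), and generative VLMs that feed image tokens straight into an LLM (the LLaVA route). The example below sends an image and a question to a hosted generative VLM in a single message: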
import base64
from anthropic import Anthropic

client = Anthropic()  # needs ANTHROPIC_API_KEY

# Read the image and base64-encode it for the request body
with open("chart.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-opus-4-8",  # substitute a model ID available to your account
    max_tokens=512,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img}},
        {"type": "text", "text": "What's the key takeaway of this chart?"},
    ]}],
)
# image and text in the same message; both end up as tokens
print(resp.content[0].text)
Audio is a one-dimensional time-series stream — like a database binlog or the event stream in a message queue, continuous and unbroken. Whisper's approach: slice the waveform into 30-second windows, turn each into a log-Mel spectrogram (a 2D "time × frequency" image), then use an encoder-decoder Transformer to "translate" it into text tokens. It's isomorphic to seq2seq machine translation: the source sequence is sound, the target sequence is text, with an encoder and a decoder in between.
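The waveform-to-spectrogram step can be sketched in plain NumPy. This is a simplified illustration, not Whisper's actual front end: the constants (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins) are assumed to match the published configuration, and the padding and normalization the real model applies are left out.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz (assumed)
N_FFT = 400           # 25 ms analysis window at 16 kHz (assumed)
HOP = 160             # 10 ms stride between windows (assumed)
N_MELS = 80           # number of mel frequency bands (assumed)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)  # centre frequency of each FFT bin
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    lower, centre, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs[None, :] - lower) / (centre - lower)
    falling = (upper - freqs[None, :]) / (upper - centre)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def log_mel(wave):
    """Turn a 1-D waveform into a 2-D (mel band x time frame) log spectrogram."""
    n_frames = 1 + (len(wave) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(N_FFT)             # overlapping windowed slices
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)
    mel = power @ mel_filterbank().T                   # (n_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T          # (n_mels, n_frames)

# One second of a 440 Hz tone as a stand-in for real audio
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
spec = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

With these settings a 30-second window yields roughly 3,000 time frames by 80 bands, which is the "image" the encoder consumes.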
Pain point: traditional speech recognition stitches together several modules — voice activity detection, an acoustic model, a language model, forced alignment — each trained separately with brittle interfaces. Whisper (2022) replaces the whole pipeline with one end-to-end seq2seq model, trained on 680,000 hours of "weak supervision" (weak supervision = web-scraped captions not guaranteed to be perfectly accurate), achieving zero-shot robustness across languages and accents. The mechanism, in four steps: waveform → log-Mel spectrogram → encoder encodes → decoder autoregressively emits text tokens, with special control tokens switching tasks (transcribe / translate / language ID / timestamps).
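Task switching via control tokens can be pictured as building a short prefix that the decoder is conditioned on before it emits any text. The helper below is a hypothetical illustration; the token spellings follow the open-source Whisper tokenizer as far as known here and should be treated as an assumption.

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Return the control-token prefix that selects what the decoder should do.

    Hypothetical helper for illustration; the token spellings are assumed.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append("<|notimestamps|>")
    return prompt

# Same audio, same weights: only the prefix changes
transcribe_prompt = build_decoder_prompt(language="en", task="transcribe")
translate_prompt = build_decoder_prompt(language="fr", task="translate", timestamps=True)
print(transcribe_prompt)
print(translate_prompt)
```

The design choice worth noticing is that one model serves every task: switching from transcription to translation is a change of prompt, not a change of module.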
The new generation of audio LLMs (e.g. GPT-4o, Gemini's native speech) goes further: they put audio tokens and text tokens into the same LLM, doing spoken dialogue and emotion/tone understanding directly, not just transcription. Which loops back to the unifying theme — audio also becomes tokens.
from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY

# Whisper transcribes audio straight to text (hosted API)
with open("meeting.mp3", "rb") as f:
    tr = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="text",
        # language="en" can be set explicitly to avoid mis-ID on short clips
    )
print(tr)  # once you have text, feed an LLM to summarize / extract action items
Video = images sharded along the time dimension. One minute at 30fps = 1,800 frames, and each frame is hundreds of patch tokens — the token count explodes, equivalent to a full table scan on a huge table. Engineering-wise you must do sampling + partition pruning like a database: frame sampling + temporal pooling, keeping only the high-information-gain frames and dropping the redundant "incremental log".
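The arithmetic behind the token explosion is worth running once. The figure of 256 tokens per frame is an assumption picked for illustration ("hundreds of patch tokens"); the real number depends on the vision encoder and input resolution.

```python
def video_token_count(duration_s, fps, tokens_per_frame):
    """Total visual tokens if every sampled frame is encoded independently."""
    return int(duration_s * fps) * tokens_per_frame

TOKENS_PER_FRAME = 256  # assumed value; varies by model and resolution

dense = video_token_count(60, 30, TOKENS_PER_FRAME)   # every frame of a 1-minute clip
sparse = video_token_count(60, 1, TOKENS_PER_FRAME)   # 1 frame per second
print(dense, sparse, dense // sparse)
```

Under this assumption, one minute of dense video costs 460,800 tokens, while sampling at 1 fps brings it down to 15,360, a 30x reduction before any pooling is applied.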
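The core difficulty of video understanding comes down to two words: too much. Token explosion combines with temporal redundancy: adjacent frames are often 99% identical (like an incremental log with almost no change), so stuffing them all in is expensive and dilutes attention. Two classes of mechanism handle it: frame sampling, which sends fewer frames, and temporal pooling, which merges tokens from adjacent frames. A minimal sampling example: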
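import base64
import cv2

# Sparse-sample with OpenCV at roughly 1 fps, instead of sending every frame
cap = cv2.VideoCapture("lesson.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
step = max(1, round(fps))  # guard: fps can be 0 (unreadable) or fractional (29.97)
frames = []  # list of (timestamp in seconds or None, base64-encoded JPEG)
i = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:  # take about 1 frame per second
        encoded, buf = cv2.imencode(".jpg", frame)
        if encoded:
            timestamp = i / fps if fps > 0 else None
            frames.append((timestamp, base64.b64encode(buf.tobytes()).decode()))
    i += 1
cap.release()
# send these frames + timestamps to a VLM, ask "what happened at second N"
print(len(frames), "frames sent (sparsified)")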
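Fixed-rate sampling ignores content. A rough sketch of the two redundancy-aware ideas follows: dropping frames that barely differ from the last kept frame, and averaging token embeddings across neighboring frames. Both functions, the threshold, and the synthetic inputs are illustrative assumptions rather than any particular model's method.

```python
import numpy as np

def drop_redundant_frames(frames, threshold=0.02):
    """Return indices of frames that differ enough from the last kept frame.

    Difference = mean absolute pixel change, scaled to [0, 1]; threshold is assumed.
    """
    kept = [0]
    for i in range(1, len(frames)):
        current = frames[i].astype(np.float32)
        reference = frames[kept[-1]].astype(np.float32)
        if np.abs(current - reference).mean() / 255.0 > threshold:
            kept.append(i)
    return kept

def temporal_pool(frame_tokens, window=2):
    """Average token embeddings over non-overlapping windows of frames.

    frame_tokens has shape (frames, tokens_per_frame, dim); trailing frames that
    do not fill a whole window are dropped.
    """
    t, n, d = frame_tokens.shape
    usable = t - t % window
    return frame_tokens[:usable].reshape(usable // window, window, n, d).mean(axis=1)

# Synthetic clip: five identical dark frames, then five identical bright frames
dark = np.zeros((4, 4, 3), dtype=np.uint8)
bright = np.full((4, 4, 3), 255, dtype=np.uint8)
kept = drop_redundant_frames([dark] * 5 + [bright] * 5)

# Synthetic embeddings: 6 frames x 4 tokens x 8 dims, pooled in pairs
pooled = temporal_pool(np.ones((6, 4, 8), dtype=np.float32), window=2)
print(kept, pooled.shape)
```

Frame dropping reduces how many frames are encoded at all, while pooling reduces how many tokens survive after encoding; the two can be combined.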
VLA turns a robot's "perceive → decide → act" into one end-to-end service. Traditional robots are a microservices architecture: perception, planning, and control modules each independent and glued by interfaces — one failure cascades through the chain. VLA (RT-2 / OpenVLA) is a monolithic large model: image + natural-language instruction in, robot action directly out — and the action is encoded as tokens (like serializing an RPC call into tokens), so robot control directly reuses the LLM's "predict the next token" machinery.
Pain point: robots are extremely hard to generalize — swap in an unseen object or an unseen kitchen and they fail, because real robot data is scarce (collecting one trajectory requires physical operation — slow and expensive). VLA's key insight: discretize actions into tokens, and robot control becomes "sequence generation", letting it reuse the common sense a VLM learned from internet-scale image-text data.
RT-2 (2023) discretizes 7-DoF actions (xyz translation + roll/pitch/yaw orientation + gripper open/close) into tokens that share one vocabulary with text, and co-fine-tunes on robot trajectories and web image-text data. The result: the robot exhibits emergent "common-sense reasoning". Tell it to "put the strawberry into the correct bowl" and it infers the semantic relation between strawberry and bowl from pretraining knowledge, something the scarce robot data alone is unlikely to teach.
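Discretizing a continuous action into tokens can be sketched as uniform binning per dimension. The 256-bin count and the [-1, 1] action range below are assumptions for illustration, and the function names are hypothetical; a real system maps the bin indices onto reserved vocabulary entries.

```python
import numpy as np

N_BINS = 256  # assumed bins per action dimension

def actions_to_tokens(action, low, high, n_bins=N_BINS):
    """Map each continuous action dimension to an integer bin index in [0, n_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def tokens_to_actions(tokens, low, high, n_bins=N_BINS):
    """Invert the mapping by returning the centre of each bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# 7-DoF action: dx, dy, dz, roll, pitch, yaw, gripper (assumed range [-1, 1] each)
low, high = -np.ones(7), np.ones(7)
action = np.array([0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 1.0])

tokens = actions_to_tokens(action, low, high)
recovered = tokens_to_actions(tokens, low, high)
print(tokens.tolist())
print(np.abs(recovered - action).max())
```

The round trip loses at most half a bin width per dimension, which is the precision traded away so that control can be expressed as next-token prediction.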
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

# OpenVLA: open-source 7B vision-language-action model (needs GPU)
proc = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = get_from_camera()  # placeholder: supply a PIL image from the robot camera
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
inputs = proc(prompt, image).to("cuda:0", dtype=torch.bfloat16)

# Returns a 7-DoF action (xyz + pose + gripper); unnorm_key selects the dataset
# statistics used to un-normalize the prediction before it goes to the real robot
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)  # the action is literally "decoded" out as tokens