DAY 47 / PHASE 5 · NON-LLM MODALITIES

Image Generation Engineering

Control Stack · Counter-intuitive Prompting · Generation Agent · Provenance

2026-06-26 · BigCat

The text prompt is the weakest control knob; what turns image generation into engineering is output that is controllable, reproducible, and compliant.

// WHY THIS MATTERS

Most people's mental model of image generation stops at "type an incantation into Midjourney and gacha." But the moment you put generated images into a product pipeline—brand-consistent characters, fixed-composition e-commerce shots, reproducible batch assets—gacha collapses. Image models and LLMs are two different disciplines of engineering: a diffusion model doesn't "follow instructions," it samples from a conditional distribution; the text prompt is only the coarsest of many control signals. This issue isn't "what is diffusion" (that's a concept class). For people who already use the tools, it's about turning the uncontrollable into the controllable: layering spatial and style control with ControlNet/LoRA/IP-Adapter, prompting by the diffusion model's own rules, upgrading single-shot gacha into a generate–critique–repaint agentic loop, and the watermark + copyright provenance layer you can't skip in production.

// 01

The Control Stack: ControlNet / LoRA / IP-Adapter

Claim: text prompts can't control composition or identity. Hand spatial structure to ControlNet, style and subject to LoRA, reference images to IP-Adapter—stacked orthogonally.

Background & principle

These three tools act on different facets of the diffusion process—not substitutes, but layers:

ControlNet (Zhang 2023): clones the UNet encoder and injects spatial conditions (canny edges, depth, openpose skeleton, segmentation) via zero-conv. Pixel-level composition/pose lock, at ~30%+ extra inference cost.
LoRA (Hu 2021): low-rank weight deltas (rank ~4–128); fine-tunes a style or character from dozens of images and ships as a small file (a few MB up to a few hundred MB, depending on rank and base model), usually activated by a trigger word. The workhorse for subject consistency.
IP-Adapter: encodes a reference image into an image prompt injected through its own cross-attention layers, so a style or face transfers with no per-subject training. Faster to set up than LoRA, but weaker control.
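
The "low-rank weight delta" behind LoRA can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the diffusers implementation: the layer size (640), the rank (16), and the alpha/rank scaling convention are all chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; not taken from any real checkpoint.
d_out, d_in, rank, alpha = 640, 640, 16, 16

W = rng.standard_normal((d_out, d_in))         # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialised


def lora_forward(x, scale=1.0):
    """Apply the base layer plus the scaled low-rank update B @ A."""
    delta = (alpha / rank) * (B @ A)
    return x @ (W + scale * delta).T


x = rng.standard_normal((1, d_in))

# With B at zero the adapter is a no-op, so training starts from the base model.
assert np.allclose(lora_forward(x), x @ W.T)

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params}, lora: {lora_params}, ratio: {lora_params / full_params:.3f}")
```

For this toy layer the adapter stores 5% of the parameters of the full matrix, which is why a LoRA file stays small compared with the base checkpoint.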

The engineering mantra: fix structure first (ControlNet), then identity (LoRA), then tune mood last (prompt + IP-Adapter). Route "what you want" to the right control channel instead of cramming it all into the text prompt.

Denoising loop ──receives multiple conditions per step──▶
┌──────────────────────────────────────────────┐
│ text prompt ───┐                             │
│ (CLIP encoded) │ coarse: subject/mood        │
│                ▼                             │
│ IP-Adapter ──▶ cross-attention ─ style/face  │
│ (reference)    │                             │
│ LoRA ────────▶ UNet weight delta ─ char/style│
│                │                             │
│ ControlNet ─▶ UNet encoder bypass ─ pose/comp│
│ (edge/depth/pose) ▲ strongest spatial lock   │
└──────────────────────────────────────────────┘
control strength: ControlNet > LoRA > IP-Adapter > prompt

Hands-on

# diffusers: ControlNet(pose) + LoRA(character) stacked
import torch
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

cn = ControlNetModel.from_pretrained("...controlnet-openpose-sdxl")
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=cn)
pipe.load_lora_weights("./my_character.safetensors")  # consistency
pose_map = load_image("./pose.png")  # openpose skeleton image (example path)

img = pipe(
    prompt="mychar, cinematic lighting, rooftop at dusk",  # "mychar" = LoRA trigger word
    image=pose_map,                 # openpose skeleton locks pose
    controlnet_conditioning_scale=0.8,  # don't use 1.0; leave slack
    guidance_scale=6.5, num_inference_steps=30,
    generator=torch.Generator().manual_seed(42),  # reproducible
).images[0]

Failure modes: ① Stacking 3+ ControlNets → conditions fight, producing deformities; get one condition working first. ② controlnet_conditioning_scale=1.0 welds structure shut and the prompt stops mattering. ③ LoRA rank too high + too little data → overfit; it parrots training images, won't generalize. ④ Forgetting to fix the seed, so you can't reproduce "that good one from last time."

Resources: ControlNet paper (arXiv 2302.05543) · diffusers ControlNet docs

// 02

Image Prompts: Completely Different Rules from LLMs

Claim: a diffusion prompt is not an "instruction" but "coordinates in a conditional distribution"; negation goes through a separate channel, CFG sets obedience, CLIP only reads the first 77 tokens.

Background & principle

Bringing LLM prompting habits to diffusion is almost all wrong. Four core differences:

Negation fails: writing "don't add a hat" makes a hat more likely because you mentioned it. Put exclusions into a separate negative_prompt channel.
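CFG scale = obedience: classifier-free guidance (Ho & Salimans 2022) extrapolates from the unconditional prediction toward, and past, the conditional one. Too low (<4) drifts off-prompt; too high (>14) oversaturates and adds artifacts. SDXL typically uses 5–8.
Order is weight: earlier tokens tend to carry more influence; most UIs (for example AUTOMATIC1111 and ComfyUI) also support explicit (word:1.3) weighting. Plain diffusers pipelines do not parse this syntax by default.
77-token hard cut: the CLIP text encoder takes only 77 tokens, two of which are the start and end markers, so about 75 are usable. A plain pipeline silently drops the overflow (some UIs chunk long prompts instead), so don't write essays.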
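
To make "CFG scale = obedience" concrete, here is the guidance step in the form most implementations use: start at the unconditional prediction and move guidance_scale times along the direction toward the conditional one. The two-element arrays are toy stand-ins for real noise predictions, and the function name is made up for this sketch.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction along the direction of the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


eps_uncond = np.array([0.0, 0.0])   # toy prediction without the prompt
eps_cond = np.array([1.0, -0.5])    # toy prediction with the prompt

for scale in (0.0, 1.0, 7.0, 15.0):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

At scale 0 the prompt is ignored, at 1 the result is the plain conditional prediction, and anything above 1 overshoots it. The overshoot is what produces the burnt, oversaturated look at very high values.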

Hands-on

# ❌ LLM-style (almost always backfires on diffusion)
bad_prompt = ("Please generate a portrait. Make sure there is no blur, "
              "and don't include any text or watermark in the image.")
# → mentioning blur/text/watermark summons them; instruction filler eats 77 tokens

# ✅ Diffusion-style: positive holds only what you WANT, exclude via negative
prompt = ("portrait of a woman, soft window light, 85mm, "
          "shallow depth of field, (sharp focus:1.2)")
negative_prompt = ("blurry, text, watermark, extra fingers, "
                   "lowres, deformed hands")
# guidance_scale=7  order: subject→light→lens→quality

Failure modes: ① Writing negatives into the positive prompt (the most common faceplant). ② Piling hundreds of "quality spells" past 77 tokens: the tail is dropped while you think it works. ③ Cranking CFG to 15+ for "more obedience," getting burnt colors and oil-painting artifacts. ④ Copying prompts across models: tags and weights tuned for SD1.5 behave differently on SDXL/Flux and need retuning.

Resources: Classifier-Free Guidance (arXiv 2207.12598) · SDXL technical report (arXiv 2307.01952)

// 03

Image Generation Agent: Turning Gacha into a Generate–Critique–Repaint Loop

Claim: single-shot generation is gambling; reliable output comes from an agentic loop—a VLM as reviewer, local inpaint as fix, fixed seed as version control.

Background & principle

Production won't accept "render ten, pick one." Wire image generation into an agent loop, reusing the harness thinking you already know (Day 3): generate → a VLM evaluates against the brief → on failure, mask the problem region and inpaint only that region → upscale to finalize. The key is a multimodal model (Claude / GPT-4V) acting as an eval gate, turning "a human picks the image" into a structured score. Repainting fixes only the bad local region (hands, text, stray objects) instead of re-rolling the whole image, which saves cost and preserves the parts that are already correct.

┌─────────────────────────────────────────────┐
│ brief ──▶ ① text-to-image (seed=fixed)      │
│             │                               │
│             ▼                               │
│           ② VLM review ──meets brief?──┐    │
│           (Claude/GPT-4V)              │ no │
│             │ yes                      ▼    │
│             │           ③ locate bad region │
│             │           ④ inpaint repaint   │
│             │                   │ (loop≤N)  │
│             ▼ ◀─────────────────┘           │
│           ⑤ upscale → final                 │
│                                             │
│ human gate: fails N rounds → escalate,      │
│ no infinite repaint                         │
└─────────────────────────────────────────────┘

Hands-on

# VLM as reviewer: structured score drives whether to repaint
import anthropic
client = anthropic.Anthropic()

def critique(image_b64, brief):
    msg = client.messages.create(
        model="claude-opus-4-8", max_tokens=400,
        messages=[{"role":"user","content":[
            {"type":"image","source":{"type":"base64",
             "media_type":"image/png","data":image_b64}},
            {"type":"text","text":
             f"Brief: {brief}\nScore strictly, return JSON: "
             '{"pass":bool,"issues":[...],"worst_region":"hands|text|..."}'}
        ]}])
    return msg.content[0].text   # → parse, decide which region to inpaint

# loop: max_iters=3, escalate to human beyond that, avoid burning $$ on infinite repaints
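
The bounded loop described in the comment above can be sketched as follows. This is a minimal skeleton under stated assumptions, not a definitive implementation: generate, review, and inpaint are hypothetical callables you would back with your own pipeline and reviewer, the Verdict fields mirror the JSON shape requested from the reviewer, and the fake_* functions exist only so the sketch runs without a GPU or API key.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Verdict:
    passed: bool
    issues: list = field(default_factory=list)
    worst_region: str = ""


def parse_verdict(raw):
    """Parse the reviewer's JSON reply; treat anything unparseable as a failure."""
    try:
        data = json.loads(raw)
        return Verdict(data["pass"] is True,
                       list(data.get("issues", [])),
                       str(data.get("worst_region", "")))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return Verdict(False, ["unparseable reviewer output"], "")


def run_loop(generate, review, inpaint, brief, seed=42, max_iters=3):
    """Generate once, then allow at most max_iters local repaints.

    Returns (image, status), where status is "accepted" or "escalate".
    """
    image = generate(brief, seed)
    for attempt in range(max_iters + 1):
        verdict = parse_verdict(review(image, brief))
        if verdict.passed:
            return image, "accepted"
        if attempt == max_iters:
            break  # repaint budget spent: hand over to a human
        image = inpaint(image, verdict.worst_region, seed)  # same seed every round
    return image, "escalate"


# Stand-in callables (hypothetical) so the control flow can be exercised.
def fake_generate(brief, seed):
    return {"brief": brief, "seed": seed, "fixed": []}


def fake_review(image, brief):
    ok = "hands" in image["fixed"]
    return json.dumps({"pass": ok,
                       "issues": [] if ok else ["six fingers"],
                       "worst_region": "" if ok else "hands"})


def fake_inpaint(image, region, seed):
    return {**image, "fixed": image["fixed"] + [region]}


image, status = run_loop(fake_generate, fake_review, fake_inpaint, "portrait")
print(status, image["fixed"])
```

Passing the three stages in as callables keeps the loop testable offline, and treating malformed reviewer output as a failure means a confused reviewer can only cost repaints, never approve an image by accident.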

Failure modes: ① The VLM hallucinates too—inventing details not in the generated image; pair scoring with consistency checks or multi-vote. ② No max_iters and no human gate → one image repaints endlessly, burning dollars. ③ Changing seed each round → fix one spot, break another, never converges; lock unmasked regions during inpaint. ④ Treating a full re-roll as a "fix," discarding the parts already correct.
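
The advice to lock unmasked regions can also be enforced after the fact by compositing: whatever the inpainting step returns, only the masked pixels are taken from it. A minimal NumPy sketch, assuming float image arrays of equal shape and a binary mask; the array sizes and the mask location are made up for illustration.

```python
import numpy as np


def composite(original, repainted, mask):
    """Take repainted pixels where mask == 1 and original pixels where mask == 0."""
    mask3 = mask[..., None].astype(original.dtype)  # broadcast the mask over RGB
    return mask3 * repainted + (1 - mask3) * original


original = np.zeros((4, 4, 3), dtype=np.float32)   # stand-in for the accepted image
repainted = np.ones((4, 4, 3), dtype=np.float32)   # stand-in for the inpaint output
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0                               # hypothetical "bad hands" region

result = composite(original, repainted, mask)
print(result[..., 0])
```

Everything outside the 2×2 masked block is identical to the original, so a repaint round cannot break a region that already passed review.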

Resources: diffusers Inpainting docs · Anthropic · Building Effective Agents (loop design)

// 04

Watermarking & Copyright: The Provenance Layer You Can't Skip

Claim: a visible watermark is decoration; real proof of origin is SynthID (robust invisible watermark) + C2PA (cryptographically signed metadata), and copyright risk lives mostly on the training-data side.

Background & principle

Ship a generated image into a product and compliance surfaces immediately. Engineering splits into two layers:

SynthID (Google DeepMind): embeds an invisible watermark distributed across the whole image's pixels, surviving screenshots, resizing, JPEG recompression, and minor crops—it persists even after metadata is stripped. By 2026 SynthID + C2PA has become a cross-vendor de facto standard.
C2PA Content Credentials: attaches a cryptographically signed chain of origin (who, which model, how edited). Verifiable and traceable, but a single screenshot or re-save can strip it—so pair it with SynthID.
Copyright gray zone: the real legal risk is not "does the output carry a watermark" but the training data and style replication. Using LoRA to clone a living artist's style, or fine-tuning on someone's work, is a high-risk zone.

Hands-on

# Sign output with C2PA Content Credentials (c2patool)
# manifest.json declares: AI-generated + model + time
c2patool input.png \
  --manifest manifest.json \
  --output signed.png

# key fields in manifest.json
{
  "claim_generator": "my-pipeline/1.0",
  "assertions": [
    {"label": "c2pa.actions",
     "data": {"actions": [{"action": "c2pa.created",
              "digitalSourceType": "http://cv.iptc.org/newscodes/digitalsourcetype/trainedAlgorithmicMedia"}]}}
  ]
}
# two layers: C2PA signature (strippable but verifiable) + platform SynthID (robust, invisible)

Failure modes: ① Relying only on metadata/EXIF watermarks—one screenshot and it's gone. ② Assuming "added a watermark, so I'm off the hook": copyright risk is on training data and style replication, which watermarks don't solve. ③ A commercial LoRA training set mixing in copyrighted/privacy-protected faces or works, planting downstream legal landmines. ④ Mistaking C2PA for tamper-proofing—it proves origin, doesn't prevent stripping; it must be paired with a robust watermark.

Resources: Google DeepMind · SynthID · C2PA · Content Credentials standard

// PUTTING IT TOGETHER · A "Reproducible Brand Asset" Pipeline

String the four points into a factory that batch-produces consistent, traceable assets—this is the watershed between gacha and engineering:

Identity layer: train a LoRA (rank 16–32) on 20–40 brand assets to lock style/character, with a fixed trigger word.
Structure layer: prepare openpose / depth maps per placement as ControlNet conditions, conditioning_scale≈0.7, so composition is reusable and tunable.
Prompt layer: positive holds only "what you want," quality and exclusions go into negative; lock CFG to 6–7; keep a fixed seed table throughout for reproducibility.
Agent layer: a VLM reviews each image against the brand brief, inpaints bad regions locally, escalates to human after max_iters=3.
Provenance layer: sign finals with C2PA + platform SynthID; keep an audit trail of the LoRA training set's copyright review.
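
One way to make the five layers "parameterized and reproducible" is to freeze them into a job record that is logged next to every output. The sketch below is an assumption about how such a record might look: the class, the field names, the trigger word, and the placements in the seed table are all hypothetical, while the numeric ranges come from the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AssetJob:
    placement: str
    seed: int
    trigger_word: str = "mybrand"      # hypothetical LoRA trigger word
    lora_rank: int = 16                # identity layer: rank 16-32
    control_type: str = "openpose"     # structure layer: openpose or depth
    conditioning_scale: float = 0.7
    guidance_scale: float = 6.5        # prompt layer: CFG locked to 6-7
    max_iters: int = 3                 # agent layer: escalate after 3 repaints
    sign_c2pa: bool = True             # provenance layer

    def __post_init__(self):
        if not 16 <= self.lora_rank <= 32:
            raise ValueError("lora_rank outside the 16-32 range used here")
        if not 6.0 <= self.guidance_scale <= 7.0:
            raise ValueError("guidance_scale outside the locked 6-7 range")


# Fixed seed table: one seed per placement (placements are made up).
SEED_TABLE = {"hero_banner": 1001, "product_card": 1002}

jobs = [AssetJob(placement=p, seed=s) for p, s in SEED_TABLE.items()]
records = [json.dumps(asdict(job), sort_keys=True) for job in jobs]
print(records[0])
```

Storing the serialized record with each final image means any asset can be regenerated later from its own parameters instead of from memory.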

Get this running and you no longer deliver "a few nice images" but a parameterized, reproducible, auditable generation system—exactly what separates image engineering from image dabbling.

// DEEPER THINKING

An LLM prompt is an "instruction"; a diffusion prompt is "distribution coordinates." Is the root of this difference the training objective?

Yes. LLMs are instruction-tuned and RLHF-aligned into instruction-following assistants, so they parse the prompt as a command and handle negation, conditions, and meta-instructions. Diffusion models only learned the conditional distribution of images given text; CLIP encodes the prompt into a sequence of token embeddings that steer the denoiser through cross-attention, with no mechanism to "execute instructions." It just pushes you toward a region of latent space. So "don't" fails (the hat token still pulls you toward hat-having regions), and order acts as weight (earlier tokens tend to have more influence on the conditioning). Grasp this and you stop cramming natural-language commands into diffusion prompts.

Using a VLM to review generated images—does this fall into "judge and contestant from the same source"? Both may hallucinate at the same spot.

The risk is real but mitigable. Same-source means generation and evaluation share similar visual priors and may share blind spots (e.g., both insensitive to finger count). Mitigations: ① review with a model from a different family than the generator; ② use dedicated detectors (finger count, OCR, object counting) for hard structural constraints rather than a generic VLM; ③ multi-vote, or require the VLM to locate coordinates before scoring to reduce confabulation. Fundamentally this is the same debiasing problem as LLM-as-judge—a VLM reviewer blocks obvious junk but can't replace human review in critical scenarios.
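
The multi-vote mitigation can be sketched as a small aggregation step over several independent reviews. This is one possible policy, assumed for illustration: an image passes only on a strict majority, and otherwise the most frequently named region is sent to repaint. The vote dictionaries reuse the pass/worst_region shape from the reviewer prompt; the function name is hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Combine reviewer verdicts: strict majority to pass, else the most cited bad region."""
    if not votes:
        raise ValueError("need at least one vote")
    passes = sum(1 for v in votes if v["pass"])
    if passes * 2 > len(votes):
        return {"pass": True, "worst_region": None}
    regions = Counter(v["worst_region"] for v in votes if not v["pass"])
    return {"pass": False, "worst_region": regions.most_common(1)[0][0]}


votes = [
    {"pass": False, "worst_region": "hands"},
    {"pass": True, "worst_region": ""},
    {"pass": False, "worst_region": "hands"},
]
print(aggregate_votes(votes))
```

A tie counts as a failure here, which biases the gate toward an extra repaint rather than a bad image slipping through.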

ControlNet gives pixel-level control, but stronger control means less creative space for the model. Where's the optimum of this trade-off?

It depends on the task's "determinism need." E-commerce hero shots, UI assets, layout-aligned scenes: high conditioning_scale (0.8–1.0), trading diversity for control. Concept exploration, moodboards: low scale (0.3–0.5) or no ControlNet, letting the model riot. In practice 1.0 is rarely used—it welds structure shut, kills the prompt, and degrades the model into a "fill tool." Smarter is staged: low control to explore directions, then high control at finalization to lock structure. Control isn't better in larger doses; it's the right strength at the right stage.

If robust watermarks like SynthID get widely deployed, will they spark a "de-watermarking" arms race? How should engineering view their effectiveness boundary?

Yes, and it's already happening. Any watermark is probabilistic, not a cryptographic guarantee: aggressive regeneration (img2img through another model), heavy degradation, or adversarial perturbation can all weaken it. So be honest about its role—SynthID addresses "unintentional spread" and "platform-side bulk detection," not stopping a determined attacker. The right mindset: treat it as one layer of defense in depth (paired with C2PA signatures + platform policy + law), not a silver bullet. Its value is in scale—DeepMind reports over 100 billion items watermarked, making "detectable by default" the ecosystem norm and raising the cost of misuse, rather than achieving 100% irremovability.

Training-data copyright risk can't be solved by watermarks. For an individual creator, how do you draw the line between capability and risk?

Tier it. Low risk: train LoRAs on clearly licensed/CC0 data, generate generic assets, replicate your own work's style. High risk: cloning a living artist's recognizable style, fine-tuning on others' copyrighted work, generating real faces that may infringe likeness rights. The middle ground (learning a generic aesthetic) is still legally evolving. Pragmatic advice: ① audit and log training-set provenance before commercial use (provenance is also your own legal evidence); ② avoid style replication that can "name a living artist"; ③ make the C2PA "AI-generated" declaration your default—be transparent rather than hiding it. The stronger the capability, the more you must treat compliance as a first-class engineering concern, not an afterthought patch.
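
Auditing and logging training-set provenance can start as something very small: one line per training file with a content hash, a license, and a source. The sketch below uses only the standard library; the record fields, file names, and license string are assumptions for illustration, not a legal or industry-standard format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def audit_record(path, license_id, source):
    """Describe one training file: its name, SHA-256 digest, license, and origin."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"file": path.name, "sha256": digest,
            "license": license_id, "source": source}


def write_audit_log(records, log_path):
    """Write one JSON object per line so the log is easy to append to and diff."""
    with log_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, sort_keys=True) + "\n")


# Demo with a placeholder file standing in for a real training image.
with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "brand_001.png"
    img.write_bytes(b"placeholder bytes standing in for an image")
    log = Path(tmp) / "training_audit.jsonl"
    write_audit_log([audit_record(img, "CC0-1.0", "in-house shoot")], log)
    logged = [json.loads(line)
              for line in log.read_text(encoding="utf-8").splitlines()]

print(logged[0]["file"], logged[0]["license"], len(logged[0]["sha256"]))
```

The hash ties each log line to the exact bytes that went into training, so the record still holds if files are later renamed or moved.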

// FURTHER READING

ControlNet · Adding Conditional Control to Text-to-Image Diffusion Models — foundational paper on spatial conditioning
DreamBooth · Subject-Driven Generation — the classic route to subject/character consistency
SDXL technical report — the engineering baseline for modern open t2i
Hugging Face diffusers docs — runnable implementations of ControlNet/LoRA/inpaint
C2PA · Content Credentials — origin-signing standard; pairs with DeepMind SynthID for two-layer provenance