DAY 22 / PHASE 2 · APPLICATIONS

Writing Engineering

Voice Spec · Anti-Slop · Hierarchical Prompting · AI as Editor

2026-06-05 · BigCat

Put AI on the verification side (critique), not the generation side (ghostwrite).

// WHY THIS MATTERS

You use AI to write PR descriptions, docs, posts, and weekly reports every day, yet the output often carries that unmistakable "fake" AI smell: over-hedging, formulaic parallelism, boilerplate openings and closings. The problem is not the model's capability but the engineering method. Most people cram "writing" into a single prompt ("write me an article about X, professional and concise") and get back the mean of the distribution: the "safe mediocrity" that RLHF tuned in. This issue engineers writing in four moves: compile your voice into a reusable spec, detect and strip the LLM's fingerprints, drive long-form text with an outline-first hierarchical prompt, and pin AI to the role of editor/critic rather than author. The core counter-intuition: the ceiling on AI-assisted writing quality is set not by how well the model writes, but by whether you place it on the verification side or the generation side. Judging whether a sentence is weak (critique) is easy; writing a great sentence from scratch (generation) is hard. The former is the model's strength, the latter is where it goes off the rails.

// 01

Voice Spec: Compile "Your Style" Into a Reusable Spec

Claim: voice isn't described with adjectives—it's "compiled" from samples plus a negative list.

Background & Principle

When people describe the style they want, they write "professional, concise, punchy." Such adjectives carry almost no information for the model: they describe nearly every writer's target, so they point straight at the model's prior mean, and the output regresses to its learned "generic professional tone." The Examples > Rules lesson from Day 9 hits its most extreme case with voice: abstract adjectives never beat concrete samples. Anthropic's multishot docs recommend 3-5 examples to anchor output form, and voice works the same way. Feed 3-5 representative passages you wrote, have the model reverse-extract operable features (sentence-length distribution, active/passive ratio, paragraph rhythm, punctuation and connective habits, use of lists/emoji), and crystallize them into a voice.md spec that you inject on every write. The point is to turn "style" from a vague feeling into a declarable, reusable, diffable artifact.

Hands-On

# Run once: reverse-extract a voice spec from samples (save as voice.md)
Below are 5 representative passages I wrote (same genre), wrapped
in <sample>:
<sample>...paste your real paragraphs...</sample>

Reverse-analyze my writing style and output an operable style spec.
Write only executable rules, no adjectives like "professional/vivid". Cover:
- Sentence length: average words, long/short rhythm (give numbers)
- Syntax: active vs passive, use of rhetorical questions/dashes/semicolons
- Structure: paragraph length, conclusion-first or not
- Diction: preferred connectives, avoided words, jargon density
- Forbidden zones: expressions I never use (list what you observe)

Hand-check the output (delete what the model invented, add what it missed) and save it as voice.md. This spec is more precise than any adjective, and you can version-control it—keep a separate one per genre (blog / report / technical doc).
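When hand-checking the extracted spec, it helps to measure a few of the numbers yourself instead of trusting the model's counts. A minimal sketch, assuming plain English prose and a naive sentence splitter; the helper name `sentence_stats` and the 8-word / 25-word thresholds are illustrative choices, not part of any library:

```python
import re
from statistics import mean

def sentence_stats(text: str, short_max: int = 8, long_min: int = 25) -> dict:
    """Rough sentence-length profile of a writing sample.

    Naive splitter: breaks after '.', '!' or '?' followed by whitespace,
    so abbreviations such as "e.g." will be miscounted.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {"sentences": 0, "avg_words": 0.0, "short_share": 0.0, "long_share": 0.0}
    n = len(lengths)
    return {
        "sentences": n,
        "avg_words": round(mean(lengths), 1),
        # Share of sentences at or below short_max words
        "short_share": round(sum(length <= short_max for length in lengths) / n, 2),
        # Share of sentences at or above long_min words
        "long_share": round(sum(length >= long_min for length in lengths) / n, 2),
    }

sample = ("I shipped it. Nobody noticed for a week, which tells you how much "
          "the old dashboard was used. Fine.")
print(sentence_stats(sample))
```

If the spec claims "average 12 words, frequent one-word sentences" and your samples measure otherwise, fix the spec before saving it.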

Failure modes: (1) Mixing genres in the samples (report + blog + comment) averages voice into a featureless blur—one spec serves one genre. (2) Too few samples (1-2 paragraphs) and the model can't catch a stable pattern, so it guesses. (3) Treating the generated spec as gospel—it's a starting point, not the end; it must be hand-edited.

Going deeper · Anthropic Multishot prompting docs, docs.anthropic.com/.../multishot-prompting · Paul Graham Write Like You Talk, paulgraham.com/talk.html

// 02

Anti-Slop: Detect and Strip the LLM's "Smell"

Claim: default LLM output carries a detectable fingerprint—a distribution-level bias that prompts can suppress but not eradicate.

Background & Principle

Kobak et al. (2024) analyzed 14 million PubMed abstracts and found that after ChatGPT appeared, the frequency of a batch of "flowery words" jumped abruptly: "delve" rose ~28×, with "intricate" and "meticulous" surging in step. This is measurable evidence of the AI smell. The usual explanation is that these words are byproducts of RLHF pushing the model toward "safe mediocrity," alongside over-hedging ("it's worth noting," "to some extent"), excessive transitions, three-beat parallelism, boilerplate openings ("in today's fast-paced…") and closings ("in conclusion…"). Simon Willison popularized the term slop for this mass-produced, low-quality, unwanted AI output (a 2025 word of the year for several dictionaries). The key realization: this is a distribution-level bias, not something a single prompt fully clears. You can lower the frequency, but to cure it you must use §1's real samples to drag the whole output distribution away.

Hands-On

# A slop filter to append at the end of any writing prompt
Hard constraints:
1. Banned words/phrases: delve, it's worth noting, in today's/in this era,
   in conclusion, not only...but also, unlock (potential), empower, embrace (change)
2. No preamble ("Sure, here's...") and no postamble (summary closing)
3. Every paragraph has at least one concrete anchor: a number,
   an example, or a proper noun
4. Cut every transition sentence that adds no information; prefer short over long
5. Don't stack parallelism for momentum—one concrete example beats three abstract adjectives

Rule 3 is the most effective because it forces concreteness. Slop is fundamentally emptiness: with no information to fill the space, the model pads with ornate diction and parallelism. Force it to supply numbers and examples and slop has no ground to stand on.
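The filter is a request, not a guarantee, so it is worth checking the output mechanically as well. A rough sketch of such a check, assuming paragraphs are separated by blank lines and straight apostrophes; `lint_slop` and `has_anchor` are hypothetical helpers, and the proper-noun test is a crude heuristic (a capitalized word in mid-sentence), not real entity detection:

```python
import re

# Illustrative list based on the filter above; keep your own under version control.
BANNED = ("delve", "it's worth noting", "in today's", "in this era",
          "in conclusion", "unlock", "empower", "embrace")

def has_anchor(paragraph: str) -> bool:
    """Crude stand-in for rule 3: a digit, or a capitalized word mid-sentence."""
    if re.search(r"\d", paragraph):
        return True
    return bool(re.search(r"[a-z,;:]\s+[A-Z][a-z]+", paragraph))

def lint_slop(text: str, banned=BANNED) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs, start=1):
        lowered = para.lower()
        for phrase in banned:
            # Prefix match at a word boundary, so "delve" also flags "delves"
            if re.search(r"\b" + re.escape(phrase.lower()), lowered):
                warnings.append(f"paragraph {i}: banned phrase '{phrase}'")
        if not has_anchor(para):
            warnings.append(f"paragraph {i}: no concrete anchor (number or proper noun)")
    return warnings

draft = ("In conclusion, we should embrace change.\n\n"
         "Latency fell from 900 ms to 120 ms after we moved the cache to Redis.")
for warning in lint_slop(draft):
    print(warning)
```

Treat the warnings as prompts for a human look: "unlock the door" is a legitimate sentence, and a paragraph can contain a number and still say nothing.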

Failure modes: the more negations you stack ("don't… don't…"), the stiffer the output. Day 9 noted that negative instructions are inherently weak: naming a banned phrase can prime it (the "pink elephant" effect), so the model uses it anyway. A banned-list treats symptoms; truly dragging the distribution away requires few-shot injection of your real samples (§1). Also update the banned-list periodically, since fingerprint words shift as models iterate.

Going deeper · Kobak et al. Delving into ChatGPT usage in academic writing through excess vocabulary, arXiv:2406.07016 · Simon Willison on slop, simonwillison.net/tags/slop

// 03

Hierarchical Prompting: Chapter → Paragraph → Sentence Control

Claim: long-form isn't one prompt—it's an outline-first prompt chain. This is a workflow, not an agent.

Background & Principle

Asking the model to write 3,000 words in one shot almost inevitably hits three traps: lost-in-the-middle (diluted mid-section points), structure collapsing midway, and terminology/style drift. The right approach splits it into three layers, each an independent prompt with independent verification (echoing Day 5: use a workflow when you can, not an agent—it's controllable, debuggable, evaluable). Anthropic's chain-prompts docs are explicit: splitting complex tasks into chained subtasks is far steadier than one mega-prompt. The three layers each have a job:

L1 · Outline: produce only the structure + one-sentence thesis per section, human-reviewed—the cheapest verification in the whole pipeline; if the skeleton is wrong, everything after is wasted.
L2 · Per-section expansion: write one section at a time, but inject the global outline + the end of the previous section as anchors to prevent repetition or drift.
L3 · Sentence-level style pass: apply §1's voice spec + §2's slop filter for the final polish.

┌──────────── writing = outline-first prompt chain ────────────┐
│                                                              │
│   your bullet skeleton                                       │
│        │                                                     │
│        ▼                                                     │
│   ┌──────────┐  human ✋ bad structure→redo, don't enter L2  │
│   │ L1 outln │ ─────────▶ [structure + per-section thesis]   │
│   └────┬─────┘                                               │
│        │  inject: global outline + prev section end          │
│        ▼                                                     │
│   ┌──────────┐  per-sec loop   anti-drift / anti-repeat      │
│   │ L2 write │ ─────────▶ §1..§N paragraph drafts            │
│   └────┬─────┘                                               │
│        │  inject: voice.md + slop banned-list                │
│        ▼                                                     │
│   ┌──────────┐                                               │
│   │ L3 polish│ ─────────▶ sentence-level style pass → final  │
│   └──────────┘                                               │
└──────────────────────────────────────────────────────────────┘

Hands-On

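from typing import Callable

def write_long(topic: str, voice_md: str, banned: str,
               llm: Callable[[str], str], review: Callable[[str], str]) -> str:
    # Assumed helpers, supplied by the caller: llm(prompt) returns text,
    # review(outline) returns the human-approved outline (or raises to abort).
    outline = llm(f"Outline for '{topic}': one line per section, formatted as "
                  "'title: one-sentence thesis'. No body text.")
    outline = review(outline)             # L1: human-review the skeleton, the cheapest verification
    sections, prev = [], ""
    for sec in filter(None, map(str.strip, outline.splitlines())):
        # L2: one section per call, anchored by the global outline + previous ending
        draft = llm(f"Global outline:\n{outline}\nPrev section end: {prev[-200:]}\n"
                    f"Write only this section: {sec}. Continue from above, no repeats.")
        sections.append(draft)
        prev = draft
    full = "\n\n".join(sections)
    # L3: sentence-level style pass
    return llm(f"Polish per this voice spec, run each sentence through the slop list:\n"
               f"{voice_md}\nBanned: {banned}\n\nText: {full}")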

Failure modes: (1) L2 without the global outline → sections repeat each other, terminology drifts. (2) Skipping L1 and asking for the whole text directly → you get a draft that "looks complete but has no logical skeleton," harder to fix than to rewrite. (3) No human checkpoint between layers → errors amplify all the way to L3 before you notice.
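Failure mode (3) is the easiest to prevent in code: make the checkpoint a blocking call, so section expansion cannot start until a human has answered. One possible shape for the `review(outline)` step, as a sketch only; `review_outline`, `OutlineRejected`, and the y/e/n protocol are illustrative assumptions, and `ask` defaults to the built-in `input` so it can be swapped out in scripts:

```python
from typing import Callable

class OutlineRejected(Exception):
    """Raised when the reviewer rejects the outline, so L2 never starts."""

def review_outline(outline: str, ask: Callable[[str], str] = input) -> str:
    """Show the outline and block until a human approves, edits, or rejects it."""
    print(outline)
    answer = ask("Approve outline? [y = yes / e = edit / n = reject] ").strip().lower()
    if answer == "y":
        return outline
    if answer == "e":
        # Single-line entry keeps the sketch simple: sections separated by ' | '
        edited = ask("Corrected outline, sections separated by ' | ': ")
        return "\n".join(part.strip() for part in edited.split("|") if part.strip())
    raise OutlineRejected("outline rejected; fix the skeleton before expanding sections")

# Non-interactive usage: a canned reviewer that approves everything
approved = review_outline("Problem: why reports read as slop\nFix: outline-first chain",
                          ask=lambda prompt: "y")
```

Raising on rejection is deliberate: a rejected skeleton should stop the pipeline rather than fall through to a default.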

Going deeper · Anthropic Chain complex prompts docs, docs.anthropic.com/.../chain-prompts · Anthropic Building Effective Agents (prompt chaining), anthropic.com/engineering/building-effective-agents

// 04

AI as Editor: Pin AI to Critic, Not Author

Claim: the highest-quality AI writing is "human writes the skeleton, AI does critique / expand / polish," not "AI writes, human edits."

Background & Principle

This applies the evaluator-optimizer pattern (Anthropic Building Effective Agents) to writing. Key insight: critique is a verification problem, generation is an open-ended one. Judging "which sentence here is weakest, and why" is easy and reliable for the model; generating a great full piece from scratch easily goes off the rails—with sycophancy as a bonus: ask it to evaluate its own draft and it tends to say "looks good." So using it as a critic (point out weaknesses, propose fixes) is far more reliable than as an author (write for you). The crucial side effect: a human-written skeleton naturally preserves voice—voice loss is rooted precisely in letting AI generate from scratch. You supply the ideas and structure, it amplifies and polishes; only then is the division of labor right.

Hands-On

# self-critique: AI as editor, not ghostwriter; touch only what needs touching
You are a strict editor, not a ghostwriter. For the passage below:
1. Point out the 3 weakest sentences, with why each is weak
   (empty / redundant / logical leap)
2. Give diff-style edits: <del>original</del> → <ins>revised</ins>
3. Change only those 3 sentences, keep the rest verbatim—don't rewrite
4. Don't praise my writing—give criticism directly

<draft>...your paragraph...</draft>

Rules 3 and 4 are the linchpin: without "change only the weak sentences," the model will rewrite everything (erasing your voice along the way); without banning praise, it dilutes criticism with sycophancy. You want a surgical edit, not a ghostwrite.
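Rule 3 can also be enforced after the fact instead of trusted. The sketch below assumes the model followed the `<del>…</del> → <ins>…</ins>` format from the prompt; `apply_edits` is a hypothetical helper that refuses any edit whose `<del>` text is not found verbatim, exactly once, in your draft, and refuses more edits than you asked for:

```python
import re

# One edit = <del>original</del> → <ins>revised</ins>, as requested in the prompt
EDIT = re.compile(r"<del>(.*?)</del>\s*→\s*<ins>(.*?)</ins>", re.DOTALL)

def apply_edits(draft: str, critique: str, max_edits: int = 3) -> str:
    """Apply the editor's edits to the draft, rejecting anything out of scope."""
    edits = EDIT.findall(critique)
    if len(edits) > max_edits:
        raise ValueError(f"{len(edits)} edits proposed, limit is {max_edits}")
    revised = draft
    for original, replacement in edits:
        # The quoted sentence must exist verbatim and unambiguously
        if revised.count(original) != 1:
            raise ValueError(f"<del> text not found exactly once: {original!r}")
        revised = revised.replace(original, replacement, 1)
    return revised

draft = "We shipped the cache. It is worth noting that it was fast. Users liked it."
critique = ("1. Empty hedge, no number.\n"
            "<del>It is worth noting that it was fast.</del> → "
            "<ins>P95 latency fell to 120 ms.</ins>")
print(apply_edits(draft, critique))
```

Everything outside the matched sentences is untouched by construction, which is the "keep the rest verbatim" guarantee the prompt alone cannot give you.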

Failure modes: (1) Letting AI rewrite a whole paragraph → voice lost + it tacks on "you wrote this really well" (Day 9's sycophancy). (2) Not requiring a diff and not bounding the edit scope → you can't quickly review what changed and must reread the whole block. (3) Putting critique and rewrite in the same turn → its critique self-fulfillingly bends toward the version it wants to write.

Going deeper · Anthropic Building Effective Agents (evaluator-optimizer), anthropic.com/engineering/building-effective-agents · Anthropic Claude's Character, anthropic.com/research/claude-character

// INTEGRATED BUILD · Make Yourself a "Writing Harness"

Chain the four points into a pipeline you can reuse for weekly reports and blog posts, aiming for output that passes a "Turing test"—others can't guess which piece was AI-assisted:

Build the voice spec (§1): feed 5 real passages, reverse-extract rules, hand-edit, and save as voice.md; one per genre.
You write only the bullet skeleton: ideas and logic come from you (preserving voice + structure)—this is the part you can't outsource.
L2 expansion script (§3): call section by section, injecting voice.md + global outline + slop banned-list.
L3 auto self-critique (§4): output a diff; you only review the changed sentences, not the whole piece.
Eval: mix the final draft with your historical hand-written pieces and have another model or a friend blind-guess which is AI-assisted. Can't tell = the voice spec passes; spotted at a glance = go back, add §1 samples and update §2's banned words.
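With only a handful of pieces, the blind guess in the Eval step is easy to over-read: a judge who gets 6 of 10 right may just be lucky. A minimal scoring sketch, assuming each piece gets a binary label (AI-assisted or hand-written) and the mix is roughly half and half, so coin-flipping scores 0.5; `blind_guess_report` and the 0.05 threshold are illustrative assumptions:

```python
from math import comb

def blind_guess_report(guesses: list[bool], alpha: float = 0.05) -> dict:
    """guesses[i] is True when the judge labeled piece i correctly.

    Reports accuracy and the one-sided binomial probability of doing at
    least this well by coin-flipping (chance = 0.5 per piece).
    """
    n = len(guesses)
    if n == 0:
        raise ValueError("need at least one guess")
    k = sum(guesses)
    p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {
        "accuracy": k / n,
        "p_value": round(p_value, 4),
        "spotted": p_value < alpha,  # True = judge beats chance, voice spec fails
    }

# Judge got 9 of 10 right: unlikely to be luck
print(blind_guess_report([True] * 9 + [False]))
```

The test is one-directional: "spotted" is meaningful evidence, while "not spotted" on a small sample mostly means you need more pieces before concluding the voice spec passes.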

The essence of this flow isn't "let AI write for you"—it's always keeping AI on the verification side: you generate the idea skeleton, it does the amplify/critique/polish work where "judging right from wrong is easier than creating from scratch." That way the output has both your voice and AI's throughput.

// DEEP THINKING

If the voice spec is precise enough that AI can replicate your style 100%, is "your voice" still yours? Will over-reliance atrophy your own writing?

This is the classic cognitive-offloading trade-off. A voice spec replicates your past style, while real voice evolves with your thinking. If you only use AI to replicate the old voice, your style gets "frozen" at the sampling moment and stops growing. Healthy use: let the voice spec handle low-value output (reports, template emails) to save energy, and reserve high-value writing (what you truly want to express) for your own hand—that's where voice keeps evolving. Treat AI as a "style snapshot tool," not a "style proxy."

Slop fingerprint words (delve / intricate) shift as models iterate. If future models are trained to remove them, does the AI smell vanish—or just change form?

The fingerprint changes form, it doesn't vanish. Surface vocabulary can be tuned out, but the deep source of the AI smell is distribution collapse—RLHF pushes output toward the "safe, broadly-liked" center, yielding low individual specificity, low risk appetite, and predictable syntax. This is a byproduct of the alignment objective, not a bug. As long as training optimizes "satisfy the majority," output regresses to the mean; genuine personal voice lives precisely in the tail. That's why §1's few-shot injection is the cure—it drags the sampling point from the center to your tail.

This issue argues "AI as critic beats AI as author." But critique can be wrong—the "weak sentence" AI flags may be the gem of your voice. How do you stop AI from flattening you into mediocrity?

This is exactly why §4 stresses "change only flagged weak sentences + output a diff + human has final say." Critique's value is drawing your attention, not deciding for you. AI may flag your deliberate transgressions (short sentences, colloquialism, white space) as "weak," because its yardstick is also the mean. Countermeasures: (1) always present edits as a diff so you accept or reject them line by line; (2) state your voice spec in the critique prompt so it judges by your standard, not the generic one; (3) default-reject "stylistic" edits, accept only "fact/logic/redundancy" criticisms.

What does hierarchical prompting (L1→L2→L3) sacrifice for controllability? When is a single long prompt actually better?

It sacrifices global coherence and inspired leaps. When expanding section by section, the model can't see the not-yet-written later text, making it hard to set up long-range foreshadowing, callbacks, and overall rhythm—the very soul of great long-form. A single long prompt is better when the model's context is ample and you want a "one-breath narrative flow" (short stories, tightly-argued essays). Hierarchy suits structure > prose genres (technical docs, reports, tutorials), where value lies in information organization, not cadence. The test: do you fear "structure collapse" or "reads like it was stitched together" more?

"Have a friend blind-guess which is AI-assisted" sounds handy—but what does it actually measure? Does passing it mean the writing is high quality?

It measures voice consistency (indistinguishability), not quality. Passing only means "the AI-assisted draft is indistinguishable from your hand-written one"—but if your hand-written level is mediocre, passing just means "mediocre in a way that resembles you." This is the classic eval trap: the metric aligns with the proxy goal, not the real one. A better eval has two layers: (1) voice consistency (blind guess); (2) absolute quality (information density, argument strength, value to the reader), which needs a rubric or expert judgment, not a blind guess. Only passing both counts as truly qualified.

// FURTHER READING

Anthropic · Multishot Prompting — anchor output form with 3-5 examples, the basis of voice extraction
Anthropic · Chain Complex Prompts — the official basis for hierarchical prompting
Anthropic · Building Effective Agents — prompt chaining and evaluator-optimizer patterns
Kobak et al. · Delving into ChatGPT usage in academic writing — quantifying LLM writing traces via "fingerprint words"
Simon Willison · on slop — the origin and boundaries of the term "slop"