AI/ML Deep Dive: Prompt Engineering

Day 3 · 2026-05-20

For: engineers with coding experience but no AI background

Engineering counterpart → super-individual D1: Prompt Engineering (system prompt architecture, four-layer structure)

Zero-shot / Few-shot PromptingZero-shot / Few-shot Prompting

PromptBasics

One-line intuition

It's like writing unit tests: drop a few "input → output" examples into the prompt and the model generalizes to new inputs by analogy — no retraining, just an ad-hoc lesson delivered through examples.

What problem it solves

Traditional ML for a new task ("is this comment sarcastic?") meant labeling thousands of examples and training a model. Few-shot prompting collapses that pipeline into "write 3-5 examples in the prompt." Zero-shot goes further still — give zero examples, just describe the task. Both leverage the pattern-completion capacity from pre-training, shifting "training" from a separate step to runtime.

How it works (intuition)

The model does one thing: continue text. Zero-shot phrases the task as a natural-language instruction; few-shot constructs a "Q-and-A" pattern that the model then extends. The order, format, and wording of examples noticeably shape the output — that's the core idea that "prompt form is model behavior."

# Zero-shot: only the task description
"Classify sentiment (positive/negative): 'The service was terrible' →"

# Few-shot: show 3 examples first, the model continues the pattern
"Input: Weather is lovely today → positive
Input: Battery dies in a day → negative
Input: Support patiently solved my problem → positive
Input: The service was terrible →"

Code example

from anthropic import Anthropic
client = Anthropic()

# Few-shot: teach the model "extract company name" by example
prompt = """Extract company names from sentences. Output the company name only.

Sentence: Apple launched a new iPhone today.
Company: Apple

Sentence: Tesla's stock surged last night.
Company: Tesla

Sentence: NVIDIA is partnering with TSMC to build a new fab.
Company:"""

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=50,
    messages=[{"role": "user", "content": prompt}]
)
print(resp.content[0].text)  # Expected: NVIDIA, TSMC

Common misconception

"More examples is better" — wrong. Three to five high-quality examples that cover edge cases usually beat twenty look-alike ones, and excess examples eat your context window and bill. Another myth: thinking few-shot is "training." It's not — the model's parameters don't change, and the model "forgets" your examples on the next call.

Key resources

Prompt Engineering Guide — systematic, example-heavy, bilingual
Anthropic's official Prompt Engineering docs

Where you see it

Canonical: few-shot to coerce a fixed JSON output (instead of writing regex parsers).
Closer to home: photograph your kid's homework, OCR the text, then few-shot the model to organize it into a three-column "problem / topic / difficulty" review sheet you can skim in minutes.

English Summary

Zero-shot prompting describes the task in natural language; few-shot prompting includes a handful of input-output examples in the prompt to demonstrate the desired pattern. No weights are updated — the model simply continues the pattern it sees, leveraging in-context learning from pre-training.

Think it through

1. Why does the order of examples in few-shot affect output? How does it tie back to Day 1's attention?

The model is fundamentally "predict the next token given everything seen." Attention lets later positions see all prior tokens, but examples close to the end usually dominate attention weight and "recent pattern" — that's recency bias. Studies show the last few-shot example acts as the strongest signal. So put the example "most similar to the target input" last, put simple/typical cases first, and edge cases at the end to "leave a deeper impression."

2. If a task already works reliably zero-shot, is few-shot still useful?

Yes, but the value shifts from "lifting capability" to "constraining output format." When zero-shot capability suffices, the model often pads with explanations ("Sure, let me analyze…"). Few-shot examples lock the format more reliably than a wall of "output X only, no explanation" rules, because models imitate examples better than they obey prohibitions. Cost: small bump in tokens and latency. For high-QPS production traffic, consider zero-shot + structured output APIs.

3. Is LLM few-shot the same thing as ML's "K-shot learning"?

Same name, different beast. Classic K-shot learning updates parameters from K labeled samples (meta-learning paradigm). LLM few-shot updates nothing — examples are pure context. The more accurate name is "in-context learning." Consequence: LLM few-shot has near-zero training cost but every inference pays per-token; classic K-shot trains once and inference is cheap, but the training pipeline is complex.

4. When does few-shot make the model dumber?

Three classic traps: (a) errors or biases in examples — the model imitates them faithfully ("garbage in, garbage out," amplified); (b) examples from a domain too far from the target — the model force-fits the wrong pattern; (c) example format too complex (nested JSON + mixed languages) — confuses the model into format drift. Diagnostic: run a zero-shot baseline first, then add few-shot, and confirm it's a real lift rather than added noise.

5. From a product perspective, how do you choose between few-shot and "fine-tuning a model"?

Four axes: traffic volume, stability needs, privacy, iteration speed. Low traffic (under 10K/day), frequently changing requirements → few-shot wins outright (ship by editing a prompt). Extremely high traffic where long prompts blow up cost and latency → fine-tuning "bakes the examples into weights." Strict stability requirements (finance, medical) → fine-tuning + an eval set is more controllable. Sensitive data that can't leave the network → fine-tune a small model on-prem. In practice: ship few-shot to find PMF, fine-tune only after volume justifies it.

Chain-of-ThoughtChain-of-Thought (CoT)

PromptReasoning

One-line intuition

Like enforcing "write the comment explaining your approach before the code" in code review — make the model "think out loud" before giving the final answer, and accuracy jumps a tier.

What problem it solves

For word problems, multi-step reasoning, and logic tasks, asking the model for the answer directly often fails. The reason: per-token compute is constant, and "compute the final answer in one step" is too tight a budget for hard problems. CoT's insight: letting the model emit more intermediate tokens effectively gives it more "thinking time." The model breaks reasoning into small steps, each built on the previous result, and error rates drop sharply.

How it works (intuition)

Triggering it is absurdly simple: add "Let's think step by step." at the end of the prompt, or include reasoning in your few-shot examples. The model writes out reasoning first, then the answer.

# Without CoT
"Mike has 5 apples. He eats 2, then buys 3 times as many as he ate. How many does he have now?"
→ "9"  (wrong!)

# With CoT
"... Let's think step by step."
→ "Mike started with 5, ate 2 leaving 3;
   bought 3 × 2 = 6 more;
   total = 3 + 6 = 9.
   Wait — recheck: 3 × (apples eaten = 2) = 6, plus remaining 3 = 9. Answer: 9."  (correct reasoning made explicit)

Analogy: your brain does 23 × 47 by mentally splitting into 20 × 47 + 3 × 47 first. CoT forces the model to do that explicitly.

Code example

from anthropic import Anthropic
client = Anthropic()

question = "A car drives 2.5 hours at 60 km/h, then 1.5 hours at 80 km/h. What's the average speed?"

# Trigger CoT: write "reason first, answer second" into the system prompt
resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=500,
    system="Write your reasoning steps inside <thinking> tags, then give the final number inside <answer> tags.",
    messages=[{"role": "user", "content": question}]
)
print(resp.content[0].text)
# Model will compute total distance = 60×2.5 + 80×1.5 = 270, then the average

Common misconception

The "reasoning" CoT writes out isn't necessarily the model's actual internal computation. It's just more generated tokens, which do raise accuracy, but intermediate steps may be confabulated (plausible-looking but disconnected from the final answer). Don't treat CoT output as the model's honest thought process — interpretability research has shown the two are not equivalent.

Key resources

Original paper (Wei et al., 2022): Chain-of-Thought Prompting — clear in 4 pages
Prompting Guide: CoT chapter with many examples

Where you see it

Canonical: math, logic puzzles, SQL derivation.
Closer to home: for investment analysis, prompt the model to "first analyze the moat, then break down revenue, then value the company, then give a buy/sell recommendation" — encode the research framework into the prompt to avoid gut-call conclusions.

English Summary

Chain-of-Thought prompting elicits intermediate reasoning steps before the final answer, dramatically improving performance on multi-step problems. It works because extra tokens give the model more "compute budget" per query. Note that the verbalized reasoning is not guaranteed to reflect the model's internal mechanics — it is a behavioral trick, not introspection.

Think it through

1. Why does CoT barely help small models and only "emerge" in large ones?

Researchers observed that under ~10B parameters, CoT can hurt — small models "think but miscalculate," and a single error in the chain amplifies through later steps, leaving the final answer worse than direct guessing. Big models have low per-step error rates, so the chain holds together. This is why the GPT-3 paper barely showed CoT effects, but GPT-3.5/4 surfaced them dramatically — capability needs to clear a threshold before CoT "unlocks." Engineering implication: if you must use CoT on small models, pair it with Self-Consistency or tool calls for error correction.

2. CoT multiplies token count and latency. When is that cost not worth it?

Three scenarios to skip CoT: (a) tasks that are classification, extraction, or translation — pattern matching, where CoT may make the model overthink; (b) high-QPS real-time services (search suggestions, chat completion) — a 200 ms budget won't fit CoT; (c) the model is natively a reasoning model (e.g. o1, Claude with extended thinking) — it already runs implicit CoT, so adding explicit CoT just burns tokens. Principle: CoT is for reasoning tasks, not a default toggle.

3. CoT can confabulate. What does this share with human "post-hoc rationalization"?

A lot. Psychology has shown humans often decide first and rationalize afterward (self-justification). CoT models do similar: researchers have edited the model's final answer while keeping the chain, and the chain still "argues coherently" toward the swapped answer. A warning for interpretability — "the model explained" is not "the model understood." In high-stakes settings (medical advice, credit decisions), use tool calls or external verifiers to cross-check, not the model's self-reported chain.

4. Designing a CoT prompt — compare "Let's think step by step" vs "Please analyze using these steps: 1. ... 2. ... 3. ..."

The first is zero-shot CoT — generalizes well but the reasoning is unconstrained and quality fluctuates. The second is structured CoT — domain knowledge (SWOT, the five-step valuation framework) is encoded into the prompt, giving more stable output, easier audit, and cleaner downstream parsing. In practice: use zero-shot CoT during exploration to validate whether the LLM can do the task at all; switch to structured CoT when productionizing, with an eval set. Or combine: first "Let's plan the steps for this problem" then "Execute step by step according to the plan."

5. Tie this back to Day 2's RLHF — is CoT a pre-training capability or an RLHF artifact?

Pre-training already exposed the model to mountains of human text in "reason first, conclude after" form (papers, tutorials, Stack Overflow answers), so the raw capability comes from pre-training. But RLHF/SFT reinforces the tendency to deploy reasoning at the right time: when annotators prefer explained answers, the reward model encourages CoT-style output by default. That's why base models (e.g. GPT-3 base) need explicit CoT triggers, while chat-tuned models (ChatGPT, Claude) often step through automatically — "can but doesn't volunteer" vs "trained to prefer it."

ReAct (Reason + Act)ReAct (Reason + Act)

PromptAgent

One-line intuition

Like REPL debugging: instead of emitting the full answer at once, the model "thinks one step → calls a tool → reads the result → thinks another step," looping until done. CoT is pure mental reasoning; ReAct is reasoning plus actually going out to fetch data or execute commands.

What problem it solves

Pure CoT reasons only over what the model already knows. Ask "what's today's stock price?" or "how many stars does this GitHub repo have?" and it will hallucinate. ReAct weaves reasoning with action: the model first thinks about what information it lacks, then acts by calling a tool, observes the result, and resumes thinking — looping toward the answer. This is the conceptual backbone of nearly every modern agent framework.

How it works (intuition)

It's a Thought → Action → Observation → Thought ... loop. Each round the model emits a short segment; an outer runtime parses the Action portion, executes the real tool, then splices the result back into the prompt as Observation so the model can continue.

# Typical ReAct conversation format
User: How many stars does BigCat's ai-ml-daily GitHub repo have?

Thought: I don't have real-time data; I need the GitHub API.
Action:  http_get("https://api.github.com/repos/cissy0802/ai-ml-daily")
Observation: {"stargazers_count": 42, ...}
Thought: I have the data now, I can answer.
Answer:  42 stars.

Essentially: ReAct = CoT + tool use + loop. Day 6 covers Function Calling, which is ReAct's industrialized form.

Code example

from langchain.agents import create_react_agent, AgentExecutor
from langchain_anthropic import ChatAnthropic
from langchain.tools import Tool

# 1) Define tools — the things the model can actually do
tools = [
    Tool(name="calculator",
         func=lambda x: str(eval(x)),
         description="Evaluate a math expression"),
    Tool(name="search",
         func=my_search_api,           # your search endpoint
         description="Search real-time web information"),
]

# 2) Let LangChain assemble the ReAct prompt + parsing loop
agent = create_react_agent(ChatAnthropic(model="claude-opus-4-7"),
                           tools, prompt=react_prompt_template)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

executor.invoke({"input": "Today's Shanghai Composite move × my 500k position = ?"})
# You'll see the full Thought → Action → Observation trace

Common misconception

ReAct doesn't give the model "thinking ability" — it gives the model an outer loop. Each round is still one forward pass; "it's thinking" is really the runtime parsing output, executing tools, splicing results back, and asking again. This matters for debugging: ReAct bugs usually live in "Action parsing" or "fuzzy tool schemas," not in the model itself.

Key resources

Original paper ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)
LangChain Agent tutorial — run it to get the intuition

Where you see it

Canonical: agents that browse the web + compute + write code to complete end-to-end tasks (booking flights, drafting market reports).
Closer to home: a morning-standup helper — the agent checks today's calendar, searches your inbox for action items, queries Notion for OKR progress, then outputs a five-minute briefing.

English Summary

ReAct interleaves reasoning traces (Thought) and tool actions (Act) in a loop, using tool observations to ground further reasoning. It transforms a static LLM into an interactive agent that can fetch real-world data. ReAct is the conceptual backbone of most modern agent frameworks (LangChain, OpenAI Assistants, Anthropic tool use).

Think it through

1. A ReAct loop can run away — the model keeps "thinking" and never answers. What guardrails would you add?

The standard set: (a) a max-iteration cap (e.g. max_iterations=10); on hit, force a "based on what I have" answer or return an error; (b) duplicate detection: if two consecutive Actions are identical, declare a loop and abort; (c) budget control: cumulative tokens, time, or cost over threshold triggers a circuit breaker. Production also needs observability (log every Thought/Action), schema validation on the final answer, and human-in-the-loop confirmation for "high-risk actions" (delete, payment).

2. Compared to plain Function Calling, is the explicit "Thought" step in ReAct necessary?

Not always. Modern APIs (OpenAI tool calling, Anthropic tool use) implicitize Thought — the model emits a tool_use block directly, the runtime executes, then feeds the result back. This implicit ReAct has lower latency and more robust parsing. Explicit ReAct still has two upsides: (a) observability — you see what the model was reasoning when it chose a tool; (b) training scaffolding for weak (small) models — forcing "think before act" lifts accuracy. Rule of thumb: use native tool use when you can; switch to explicit ReAct when you need audit or debugging.

3. How is ReAct structurally similar to an "event loop" (Node.js / Python asyncio)?

Strikingly similar: an outer runtime schedules; a "task" yields an intention (Promise / Action); the runtime performs I/O; the result wakes the task. Two differences: (a) the "task" is a text-generation model with all state encoded in the prompt (no memory pointer), so each wake-up must re-paste the full history — that's why ReAct context grows and token cost rises with each step; (b) the next move is non-deterministic — the model may change its mind, so debugging looks more like LLM debugging than state-machine debugging.

4. What are the typical failure modes of ReAct?

(a) Wrong tool: vague description leads the model to use the calculator for date parsing; (b) wrong arguments: it passes "2026-05-20" as a string to a tool expecting Unix timestamp; (c) tool addiction: searches before every question even when it already knows the answer ("tool abuse"); (d) tool distrust: the tool returned correct data but the model "feels off" and confabulates ("tool bias"); (e) context explosion: each observation is huge (a full webpage), filling the context in a few rounds. Each maps to an engineering fix: clearer descriptions, strict argument schemas, "only call when needed" in the prompt, observation length limits, and summarization.

5. If you could only teach one non-AI colleague one prompting trick, would it be CoT or ReAct? Why?

CoT. It's zero-cost and immediately effective — add "please think step by step" and they can use it today; no tools, no framework, no loop management. It covers 95% of daily uses (email drafting, data analysis, document writing). ReAct is more powerful but has higher entry cost: tool definition, parsing, error handling. Teaching order should be: start with CoT to win them over, introduce ReAct/Function Calling once they hit the "model hallucinates real-time info" wall. Lessons land better with a felt pain point. That's exactly why Day 3 puts CoT before ReAct.

Self-ConsistencySelf-Consistency

PromptReasoning

One-line intuition

Like asking several engineers to solve the same problem independently and then voting: sample the model multiple times at non-zero temperature, run CoT each time, take the most frequent answer. The majority is usually right. It's a statistical "wisdom of crowds" that trades compute for accuracy.

What problem it solves

A single CoT chain can still slip on one step, and one mistake cascades. Self-Consistency's insight: correct reasoning paths tend to converge to the same answer, while errors scatter. Sample N times (say N=10), majority-vote the answers, and robustness climbs steeply. On math and logic tasks this often adds 10-20 accuracy points.

How it works (intuition)

Three steps: (1) sample N times at temperature > 0 (e.g. 0.7) so the same prompt produces different CoT chains; (2) extract the final answer from each chain; (3) majority-vote.

# Pseudocode: Self-Consistency
question = "If 3x + 7 = 22, x = ?"
answers = []
for _ in range(10):
    cot = llm(question, temperature=0.7)   # different chain each time
    answers.append(extract_final(cot))      # pull out the final number
final = mode(answers)                       # vote: [5,5,5,5,4,5,5,6,5,5] → 5

Analogy: a single CoT is one student doing the problem; Self-Consistency hands the problem to ten students independently and tallies their answers. The price is ~10x tokens.

Code example

from anthropic import Anthropic
from collections import Counter
import re

client = Anthropic()
question = "Anna has 3 apples. Mom gives her twice as many more, then she shares half with her brother. How many does she have left?"

answers = []
for _ in range(10):                          # sample 10 independent chains
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=300,
        temperature=0.7,                       # key: temperature for diversity
        messages=[{"role": "user",
                   "content": question + "\nReason step by step, then give the number."}]
    )
    # Take the last number as the answer
    nums = re.findall(r"\d+", resp.content[0].text)
    if nums: answers.append(nums[-1])

print(Counter(answers).most_common(1))      # the most frequent answer wins

Common misconception

Self-Consistency only works when there's a clear, single answer (math, multiple choice, extraction). For open-ended tasks (essay writing, planning) you can't vote — "the most common creative idea" isn't "the best one." Another myth: thinking Self-Consistency works at temperature 0. It doesn't — you must use non-zero temperature for sampling diversity, otherwise 10 runs are identical and the vote is meaningless.

Key resources

Original paper Self-Consistency Improves CoT (Wang et al., 2022)
Prompting Guide: Self-Consistency section

Where you see it

Canonical: high-accuracy math and logic, medical diagnostics, contract clause extraction.
Closer to home: estimating a company's intrinsic value — run the model N times with different assumption combinations (discount rate, growth rate, terminal value), then look at the valuation distribution. Tight distribution = robust conclusion; wide distribution = highly assumption-sensitive, so don't overweight.

English Summary

Self-Consistency samples multiple CoT reasoning paths at non-zero temperature and takes the majority-vote answer. The intuition: correct reasoning paths tend to converge, while errors scatter. It trades compute for accuracy and works best on tasks with a unique, extractable final answer.

Think it through

1. How does Self-Consistency relate to classical ML ensemble (bagging) ideas?

Common ground: both use "several slightly-different predictors voting" to reduce single-point errors. Differences: bagging draws diversity from bootstrapped training data with different model weights; Self-Consistency draws diversity from one model's stochastic sampling at temperature > 0 — weights identical, only generation paths differ. Cost shape differs too: bagging trains expensive but infers cheap; Self-Consistency doesn't train but pays N× at inference. It's "bagging at inference time" — a uniquely LLM-era pattern.

2. With only 3× budget (sample 3, not 10), is Self-Consistency still worth it?

Probably yes, but with sharper diminishing returns. Research shows the biggest jump is N=1→5, marginal returns from 5→20, near saturation from 20→40. Real N=3 gain depends on problem difficulty: (a) easy problems already correct random slips by N=3; (b) hard problems need at least N=10 for a stable majority. Cheaper variants exist: use a small model to generate N chains then a large model as "judge" to pick the best — that's LLM-as-Judge (Day 10), often cheaper than N runs of the big model.

3. Self-Consistency assumes "the correct answer is the mode." When does that assumption fail?

Three classic failures: (a) systematic bias — when the model holds a shared bias on a problem (e.g. mass misinformation in training data), all 10 chains agree wrongly, and the majority reinforces the error; (b) long-tail correctness — some problems need a rare insight the model finds only 1 in 10 times, smothered by 9 plausible-but-wrong runs; (c) too-large answer space — an open numeric problem ("how much is this company worth?") rarely returns the same number twice, and voting degrades to random. Mitigations: (a) switch model or add a verifier; (b) move to Best-of-N + Reward Model; (c) bucket the answers before voting.

4. What happens if you combine Self-Consistency with ReAct?

You get "N independent agents each running the full tool-use loop." Upside: corrects accidental wrong tools / wrong arguments by individual agents — the final conclusion is more robust. Cost: each agent makes 5-10 tool calls; N=10 means 100 tool calls + 100× LLM tokens. Latency and cost are brutal. The common compromise: only apply Self-Consistency at "critical decision points" (the final conclusion), keep intermediate tool calls single-path. This is the seed of Day 7's multi-agent systems — promote "independent sampling" into "role-divided multi-agent collaboration."

5. From a product perspective, how do you explain to users "this answer was slower but more accurate"?

This is a UX design problem worth solving deliberately. Three angles: (a) progress bar: "verifying using 5 independent approaches…" — visualize multi-sampling as "rigor," making the cost feel justified; (b) comparison view: collapse 5 chains so users can see them converge — builds trust; (c) tiered options: default fast mode (single sample); add a "deep analysis" button for high-value tasks (investment decisions, medical) that triggers Self-Consistency, letting the user make the trade. The essence: translate "compute traded for accuracy" into user-perceptible value. Day 29's AI product design covers "latency budget" and "user trust" in depth.

← Back to home