AI/ML Explained: Reinforcement Learning

Day 21 · 2026-06-07 · Level: Intermediate-Advanced
For: engineers with coding experience but not from an AI background

Supervised learning has "ground-truth answers"; reinforcement learning only has after-the-fact feedback—you make a sequence of decisions and the environment only tells you whether the outcome was good, not what to do at each step. This is the underlying paradigm for training agents, aligning LLMs (RLHF), playing games, and robotic control. Today's four cores build on each other: Q-Learning (learn what each action is worth) → Policy Gradient (learn what to do directly) → Actor-Critic (fuse the two) → PPO (make training stable—also the engine behind ChatGPT-style alignment). They are layered, not four parallel tools.

The basic RL loop (Agent ↔ Environment)

Agent ── action a ──→ Environment
Agent ←── state s' + reward r (feedback) ── Environment

Goal: find a policy π(a|s) that maximizes long-term cumulative reward (not single-step): G = r₀ + γr₁ + γ²r₂ + …
γ (discount factor, between 0 and 1) is the "discount rate" on future rewards: the smaller it is, the more myopic the agent. Same math as discounting cash flows.
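As a quick illustration of how γ shapes the objective, here is a minimal Python sketch. The function name discounted_return and the toy reward sequence are assumptions made for illustration; it evaluates G for an episode whose only reward arrives at the last step.

```python
def discounted_return(rewards, gamma):
    """G = r0 + gamma*r1 + gamma^2*r2 + ... for one finished episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: nothing for 9 steps, then a reward of 1 at step 9.
rewards = [0.0] * 9 + [1.0]

for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 4))
```

With a small γ the delayed reward is nearly invisible from the start of the episode; with γ close to 1 it keeps most of its value.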

Q-Learning

value-based · off-policy
One-line analogy

A Q-table is a memoization cache with a TTL: the key is the (state, action) pair, the value is "how many points this ultimately earns me." Just like caching a slow query—the first time you compute it via exploration, afterward you just look it up. The twist: this table self-corrects—each visit overwrites the old value with a better estimate, eventually converging.

What problem it solves + how it works

Pain point: taking a step in a maze doesn't immediately tell you whether the step was good—rewards are delayed (you only score at the goal). How do you "propagate" the goal's reward back to every earlier step? Q-Learning's answer is the Bellman update (bootstrapping):

Q(s,a) ← Q(s,a) + α · [ r + γ · max_{a'} Q(s',a') − Q(s,a) ]

Symbol by symbol: Q(s,a) is the current value estimate for "doing action a in state s"; r is the immediate reward; max Q(s',a') is "once at the new state s', how much can the best action earn"—using the next step's estimate to improve this step, which is bootstrapping. The whole bracket is the TD error (temporal-difference error): the gap between reality (r+γ·future) and the old estimate; α (learning rate) controls how much you correct each time. The intuition is simply "nudge the current estimate toward a slightly more accurate estimate"—like cache eventual consistency, more accurate the more it's visited.
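To make the update concrete, here is one worked step in plain Python. The helper q_update and all the numbers are illustrative assumptions; it applies the formula above to a two-state, two-action table stored as nested lists.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-Learning update in place and return the TD error."""
    td_target = r + gamma * max(Q[s_next])   # r + γ · max over a' of Q(s', a')
    td_error = td_target - Q[s][a]           # the bracketed term in the formula
    Q[s][a] += alpha * td_error              # nudge the estimate toward the target
    return td_error

# Hypothetical table, indexed as Q[state][action]
Q = [[0.0, 0.0],
     [1.0, 3.0]]

err = q_update(Q, s=0, a=1, r=0.5, s_next=1, alpha=0.5, gamma=0.9)
print(err, Q[0][1])
```

Here the target is 0.5 + 0.9 · 3.0, so the old estimate of 0 is off by about 3.2, and with α = 0.5 the entry moves halfway toward the target.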

Why off-policy? The update uses max (the greedy optimal action), but during exploration you act with ε-greedy (mostly take the current best, occasionally try a random one)—the policy you learn and the policy you act with can differ. Like A/B testing production traffic: 99% goes to the stable route, 1% explores new routes, but what you learn is "which route is optimal." In 2013 DQN (Mnih et al.) replaced this table with a neural network to approximate Q, learning to play Atari straight from pixels—the dawn of deep RL.

Code example
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))  # Q table
α, γ, ε = 0.8, 0.95, 0.1

for ep in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: small chance to explore randomly, else look up the best.
        # Ties are broken at random: with an all-zero table, argmax would always
        # return action 0 and the agent would almost never reach the goal.
        if np.random.rand() < ε:
            a = env.action_space.sample()
        else:
            a = int(np.random.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, term, trunc, _ = env.step(a)
        done = term or trunc
        # Bellman update: correct current estimate with r + γ·best next
        Q[s, a] += α * (r + γ * Q[s2].max() - Q[s, a])
        s = s2
print(Q.argmax(axis=1))  # the greedy action learned per state
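The table in the example above has only 16 × 4 entries. To get a feel for how quickly a dense table outgrows memory, the sketch below estimates its size in plain Python. The helper name q_table_bytes, the 8-byte entry size, and the image-as-state scenario are assumptions chosen for illustration.

```python
def q_table_bytes(n_states, n_actions, bytes_per_entry=8):
    """Memory needed for a dense Q-table with one float64 per (state, action)."""
    return n_states * n_actions * bytes_per_entry

# FrozenLake 4x4: 16 states, 4 actions
print(q_table_bytes(16, 4))  # 512

# Hypothetical state = an 84x84 black-and-white image: 2**(84*84) distinct states
n_states = 2 ** (84 * 84)
print(q_table_bytes(n_states, 4) > 10 ** 170)  # True
```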
Common misconception + practical scenario
"Q-tables are universal"—wrong. The moment the state-action space grows, the table explodes: Go has ~10¹⁷⁰ states, impossible to store; continuous actions (a robot's joint angles) are infinite. This is exactly why you either move to DQN (a neural net approximating Q) or to the policy-based methods below. Q-Learning is directly useful only in discrete, small-scale state spaces.
📌 BigCat scenario: modeling "personal decisions" as a Q-table is dangerous overfitting, but the thinking framework transfers—distinguishing "immediate reward r" from "long-term value γ·future" is a powerful antidote to short-sightedness. When weighing a learning investment, ask not "is today fun" but "what's the discounted long-term Q-value."
Takeaway + question
💡 Q-Learning = build a self-correcting cache from (state, action) → long-term value, using Bellman bootstrapping to propagate delayed rewards back to every step.
🤔 Bootstrapping means "improving an estimate with an estimate"—if the initial estimates are all wrong, why does it still converge to the correct value? How is this like iterative methods for solving equations?

Policy Gradient

policy-based · on-policy
One-line analogy

Q-Learning builds a value table first, then picks actions from it (indirect); Policy Gradient tunes the parameters of a "policy function" directly—like directly adjusting a load balancer's routing weights: observe one route has low latency (high reward), bump up its weight, and next time you're more likely to take it. No detour through a value table—optimize the decision itself.

What problem it solves + how it works

Pain point: when actions are continuous (turn the wheel 17.3°) or the state space is huge, Q-tables break down. Fix: use a neural network πθ(a|s) to directly output "the probability of each action in state s," with θ the network's parameters. The objective is to maximize expected return J(θ). The key policy gradient theorem (Sutton et al. 2000) tells us how to differentiate it:

∇θ J(θ) = E[ ∇θ log πθ(a|s) · G ]

Intuition: ∇log π(a|s) is the direction that "makes this action more likely"; G is the trajectory's actual return (total score). Their product = "weighted by return, push good actions' probabilities up and bad actions' down." High-return actions get a strong push; negative-return actions get pushed down. This is the classic REINFORCE algorithm—essentially "reward-weighted maximum likelihood."
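For a softmax policy over discrete actions, the gradient of log π(a|s) with respect to the logits has a closed form: a one-hot vector for the chosen action minus the action probabilities. The plain-Python sketch below uses that fact to take one REINFORCE step on a hypothetical three-action policy; the function names, learning rate, and return values are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, G, lr=0.1):
    """One gradient-ascent step on log π(action) · G with respect to the logits."""
    probs = softmax(logits)
    # d log π(a) / d logit_i = 1[i == a] - π(i)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [z + lr * G * g for z, g in zip(logits, grad_log_pi)]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]
after_good = softmax(reinforce_step(logits, action=1, G=+5.0))[1]
after_bad = softmax(reinforce_step(logits, action=1, G=-5.0))[1]
print(before, after_good, after_bad)
```

The same action gets more probable after a positive return and less probable after a negative one, which is the "push up / push down" behaviour described above.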

Fatal weakness: extremely high variance. G is a Monte Carlo sample of an entire trajectory, heavily luck-dependent—the same good policy might score 100 this run and 10 the next due to bad luck, so the gradient signal swings wildly, and training is like climbing a hill through noise: slow and unstable. This "high variance" problem is exactly what the next card, Actor-Critic, solves.

Code example
import torch, torch.nn as nn
policy = nn.Sequential(nn.Linear(4,128), nn.ReLU(), nn.Linear(128,2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# after running a full episode, with (log_probs, per-step returns):
def update(log_probs, returns):
    returns = torch.tensor(returns, dtype=torch.float32)  # float dtype: mean()/std() reject integer tensors
    # normalizing returns = a simple variance-reduction trick (proto-baseline)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # loss = -Σ log π(a|s)·G ; maximizing return = minimizing its negative
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad(); loss.backward(); opt.step()  # push good actions up
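The update function above expects one return per step. A minimal sketch of how those per-step returns (the discounted reward-to-go from each step onward) might be computed from an episode's rewards follows; returns_to_go is a hypothetical helper name and γ = 0.99 an assumed default.

```python
def returns_to_go(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backward from the last step."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    out.reverse()
    return out

print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```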
Common misconception + practical scenario
"Policy gradient is more advanced than Q-Learning, so it's better"—wrong. It's on-policy: each update can only use data freshly sampled from the current policy; old data is discarded, so it's sample-inefficient (Q-Learning can reuse the same experience repeatedly; this uses it once and throws it away). They're a trade-off, not a replacement: use value-based for discrete small spaces, policy-based for continuous/large ones.
📌 BigCat scenario: this helps you grasp what an LLM really is. Generating the next token is exactly a policy network πθ(token|context). During RLHF, "answers humans like" are the high reward G, and policy gradient pushes good answers' probabilities up. The aligned models you use daily are built on the skeleton of this very formula.
Takeaway + question
💡 Policy Gradient = skip the value table and push "good actions'" probabilities up directly, weighted by return; the cost is high variance + low sample efficiency.
🤔 If G's variance is the culprit, what happens if we knew "each state's average score" in advance and used (G − average) instead of G? (This is the key to the next card.)

Actor-Critic

hybrid · advantage
One-line analogy

Pure policy gradient is like "only knowing if you fixed it once production metrics come in"—slow and noisy. Actor-Critic adds an in-house code reviewer: the Actor = the policy network, which writes the code (picks actions); the Critic = a value network that instantly scores "how good relative to a baseline" at each step. No waiting for the finale—immediate feedback, far faster iteration.

What problem it solves + how it works

It directly cures policy gradient's high-variance disease. The core move: replace the noisy raw return G with the Advantage function:

A(s,a) = Q(s,a) − V(s) ≈ [ r + γ·V(s') ] − V(s)

Term by term: V(s) (the state value the Critic learns) = "the average score obtainable in state s," a baseline; A = "how much better this action is than average." Swap G for A in the policy gradient:

∇θ J ≈ E[ ∇θ log πθ(a|s) · A(s,a) ]

Why does variance plummet? Raw G might be "+100" (everything looks great, you can't tell which action is truly strong); after subtracting the baseline it becomes a centered signal like "+5 / −3", with good and bad in sharp contrast and a clear gradient direction. Mathematically, subtracting a baseline that depends only on the state (not the action) leaves the gradient's expectation unchanged (unbiased) yet sharply reduces variance. One of the most elegant free lunches in RL. One caveat: the one-step estimate r + γ·V(s') − V(s) bootstraps from the learned V, so it does carry some bias while the Critic is still inaccurate; the no-bias guarantee applies to the baseline subtraction itself. How does the Critic learn V? With the same TD-error bootstrapping as Q-Learning. A2C / A3C (Mnih et al. 2016) are the classic implementations.
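The "same expectation, lower variance" claim can be checked numerically. The plain-Python sketch below uses a made-up two-action problem (the policy probabilities, the returns of roughly 100 and 90, and the baseline of 95 are all assumptions) and compares the policy-gradient term for one logit with and without a baseline.

```python
import random
import statistics

random.seed(0)
PI_0 = 0.5  # hypothetical two-action softmax policy: π(a=0) = π(a=1) = 0.5

def sample_gradient_term(baseline):
    """One sample of d log π(a)/d logit_0 · (G - baseline)."""
    a = 0 if random.random() < PI_0 else 1
    # Assumed toy returns: action 0 earns about 100, action 1 about 90.
    G = (100.0 if a == 0 else 90.0) + random.gauss(0.0, 1.0)
    grad_log_pi = (1.0 if a == 0 else 0.0) - PI_0
    return grad_log_pi * (G - baseline)

raw = [sample_gradient_term(baseline=0.0) for _ in range(50_000)]
centered = [sample_gradient_term(baseline=95.0) for _ in range(50_000)]

print("mean:    ", statistics.mean(raw), statistics.mean(centered))
print("variance:", statistics.variance(raw), statistics.variance(centered))
```

Both sample means estimate the same gradient, but the centered samples cluster tightly around it while the raw samples swing between large positive and large negative values.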

Actor-Critic: two networks collaborating

state s ─→ Actor π(a|s) ─→ action a ─→ env
state s ─→ Critic V(s) ─→ baseline estimate

env returns r, s' ─→ compute advantage A = r + γV(s') − V(s)
├─→ A to the Actor: raise prob of good actions (A>0), lower bad ones
└─→ TD error to the Critic: make V more accurate
Code example
# Actor and Critic usually share a body, each with its own head
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body   = nn.Sequential(nn.Linear(4,128), nn.ReLU())
        self.actor  = nn.Linear(128, 2)  # outputs action logits
        self.critic = nn.Linear(128, 1)  # outputs V(s), one scalar
    def forward(self, s):
        h = self.body(s)
        return self.actor(h), self.critic(h)

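# key step: use the Critic's V to compute advantage, replacing high-variance G
# (s, a, r, s2 come from one environment transition; γ is the discount factor)
model = ActorCritic()                     # created once, outside the training loop
logits, v = model(s); _, v2 = model(s2)
dist = torch.distributions.Categorical(logits=logits)  # policy distribution over actions
advantage = r + γ * v2.detach() - v   # detach: don't backprop through the target r+γV(s')
actor_loss  = -(dist.log_prob(a) * advantage.detach())
critic_loss = advantage.pow(2)            # make V approach r+γV(s')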
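The snippet above bootstraps from V(s') on every transition. When s' ends the episode there is no future value, so implementations typically zero out the bootstrap term. A minimal plain-Python sketch, assuming a hypothetical done flag and helper name td_advantage:

```python
def td_advantage(r, v_s, v_next, done, gamma=0.99):
    """One-step advantage estimate A = r + gamma * V(s') - V(s).

    If the episode terminated at s', V(s') is treated as 0.
    """
    bootstrap = 0.0 if done else gamma * v_next
    return r + bootstrap - v_s

print(td_advantage(r=1.0, v_s=0.5, v_next=2.0, done=False, gamma=0.9))
print(td_advantage(r=1.0, v_s=0.5, v_next=2.0, done=True, gamma=0.9))   # 0.5
```

Here done is meant as true termination; episodes that are merely cut off by a time limit usually still bootstrap from V(s').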
Common misconception + practical scenario
"Adding a Critic always makes it more stable"—not necessarily. Actor and Critic depend on each other and learn simultaneously: while the Critic isn't accurate yet, it gives the Actor wrong advantage signals, the Actor learns crookedly and feeds the Critic even worse data, and both can diverge together. This is the root reason RL training is far more fragile than supervised learning—there are no fixed ground-truth labels; the target itself is moving.
📌 BigCat scenario: the advantage idea transfers powerfully—when evaluating any choice, don't look at the absolute value, look at the increment over a baseline. "This project earns $1M" is meaningless; "how much more than your next-best option" is the real advantage. It's opportunity cost stated in RL terms.
Takeaway + question
💡 Actor-Critic = the actor acts + the critic instantly scores via "advantage over a baseline"; subtracting the baseline cuts variance without introducing bias.
🤔 The Critic is a "referee that makes mistakes." When the referee systematically overrates certain states, which way will the Actor be skewed? Doesn't this resemble how a biased evaluation standard distorts behavior in an organization?

PPO (Proximal Policy Optimization)

on-policy · trust region · RLHF engine
One-line analogy

PPO adds canary deployment / rate limiting to policy updates: only dare to change a little at a time, and the new policy must not stray too far from the old one. Because in RL, stepping too far can cause policy collapse (like a full rollout that takes production down—and you can't roll back, since all subsequent data is sampled by the broken policy). PPO = hard-code "change at most ±20% per step" into the loss function.

What problem it solves + how it works

Pain point: Actor-Critic can still update too aggressively in one step and collapse. The earlier TRPO solved this with a complex second-order constraint, but it's cumbersome to implement. PPO (Schulman et al. 2017) does it with a stunningly plain clip objective. First define the probability ratio r_t(θ) = πθ(a|s) / πθ_old(a|s), the ratio of new to old policy probabilities for the same action (r_t = 1 means no change). The core objective:

L = E[ min( r_t·A, clip(r_t, 1−ε, 1+ε)·A ) ]

Unpacking the min and clip (ε usually 0.2): clip forces the ratio into [0.8, 1.2]—want to raise an action's probability to 1.5× the old? Sorry, the gain is capped at 1.2×, so no matter how aggressive, there's no extra benefit, hence no incentive to take a big step. The outer min takes the smaller of the clipped and unclipped values, a conservative "pessimistic lower bound": when advantage A>0 (good action) it limits the increase, when A<0 (bad action) it limits the decrease—neither direction is allowed to stray too far. In a phrase: "small, fast steps within the old policy's trust region."
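The interplay of min and clip is easiest to see on single numbers. The plain-Python sketch below evaluates the per-sample objective for a few hand-picked ratios and advantages; ppo_clip_term is a hypothetical helper name, and ε = 0.2 follows the usual value mentioned above.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r·A, clip(r, 1-eps, 1+eps)·A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

for ratio in (0.5, 1.0, 1.5):
    for adv in (+1.0, -1.0):
        print(f"ratio={ratio}, A={adv:+.0f} -> {ppo_clip_term(ratio, adv):+.2f}")
```

With A > 0, a ratio of 1.5 earns only 1.2, so pushing further brings no extra gain. With A < 0, a ratio of 0.5 is scored as −0.8, so pushing the probability down further is not rewarded either. Moves in the harmful direction (for example ratio 1.5 with A < 0) are left unclipped, which is what makes the objective a pessimistic lower bound.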

This simple change gives PPO both stability and ease of implementation, making it the de facto standard. Its most important application: RLHF—train a reward model from human preferences, then use PPO to optimize the LLM policy, while adding a KL penalty to keep the model from drifting too far from its original language ability just to please the reward (same spirit as clip: "don't run too far"). The alignment stage of ChatGPT and Claude is built on this.
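One common way to apply the KL penalty is to subtract a scaled log-probability gap from the reward-model score. The plain-Python sketch below shows that shaping on made-up numbers; kl_shaped_reward, β = 0.1, and the per-token log-probabilities are assumptions for illustration, not values from any particular system.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate.

    logp_policy / logp_ref: per-token log-probs of the sampled response under
    the policy being trained and under the frozen reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

# Hypothetical 3-token response
logp_policy = [-0.2, -0.1, -0.3]   # the trained policy is confident in this response
logp_ref = [-1.0, -0.9, -1.1]      # the reference model finds it less likely

print(kl_shaped_reward(2.0, logp_policy, logp_ref))  # score reduced by the drift penalty
print(kl_shaped_reward(2.0, logp_ref, logp_ref))     # no drift, so no penalty
```

The further the policy's probabilities drift from the reference model on the text it generates, the more of the reward-model score is taken back.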

PPO clipping: the probability ratio is pinned

ratio r_t:  0.5 ── [ 0.8 ←trust region→ 1.2 ] ── 2.0
            ↑ clipped                        clipped ↑

A>0 good action: prob can rise, but capped at 1.2× → no overshoot
A<0 bad action: prob can fall, but floored at 0.8× → no overreaction
effect = trust-region optimization, but first-order only, dead simple to implement
Code example
# in production, just use stable-baselines3—a PPO in a few lines
from stable_baselines3 import PPO
model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=50_000)

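# the heart of the clip objective is just these lines (to grasp it)
# new_logp, old_logp, advantage: tensors collected from a rollout
ε = 0.2                                     # clip range, giving [0.8, 1.2]
ratio = torch.exp(new_logp - old_logp)      # ratio = exp(log diff)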
unclipped = ratio * advantage
clipped   = torch.clamp(ratio, 1-ε, 1+ε) * advantage  # pin to [0.8,1.2]
policy_loss = -torch.min(unclipped, clipped).mean()  # pessimistic lower bound
Common misconception + practical scenario
"PPO's clip guarantees monotonic improvement, so it must converge" is wrong. Clip only heuristically constrains step size; unlike TRPO, whose theory provides a monotonic-improvement bound, PPO carries no such guarantee. It works well in practice because it's simple + robust, not because it's mathematically optimal. PPO is quite sensitive to hyperparameters (learning rate, ε, number of parallel envs, advantage normalization), and there's a big gap between "it runs" and "it's tuned", but those engineering details belong to the hands-on realm.
📌 BigCat scenario: the philosophy of clip is worth borrowing—when facing uncertain, irreversible decisions, limit the magnitude of each step rather than chasing one-shot optimality. Whether changing architecture, restructuring a team, or making a major life choice, "small fast steps + don't stray too far from a validated baseline" usually beats an aggressive gamble. It's the life version of trust-region thinking.
Takeaway + question
💡 PPO = use a clip to confine policy updates to a trust region, trading a dead-simple implementation for stability—becoming the industrial standard for RLHF / LLM alignment.
🤔 RLHF uses PPO to optimize a "human-preference reward," but the reward model itself is learned and imperfect. When the model learns to game the reward model (reward hacking), are clip and the KL penalty enough?
Engineering counterpart → super-individual D20: Alignment & Prompt Injection (the offense/defense of gaming the system)

Further Reading

Deep Questions

1. String today's four algorithms into one evolutionary line: from Q-Learning to PPO, what flaw of the previous step does each one fix?
It's a chain of plugging holes one by one, not four parallel tools. Q-Learning (value-based) is elegant in discrete small spaces, but the moment state-action space explodes you can't build the table, and it only handles discrete actions. Policy Gradient switches to directly parameterizing the policy πθ, conquering continuous actions and huge state spaces—at the cost of high variance (return G is a Monte Carlo sample of a whole trajectory, luck-heavy) and low sample efficiency (on-policy, data used once and discarded). Actor-Critic introduces the Critic-estimated baseline V(s), replacing G with advantage A = G − V, sharply cutting variance without introducing bias—but Actor and Critic depend on each other with a moving target, making training fragile, and one over-aggressive update can collapse it. PPO finally uses a clip to confine each step to a trust region, solving "collapse from too-large steps," trading for stability and ease of implementation. So the memory anchor: Q→PG fixes "space/continuous actions," PG→AC fixes "variance," AC→PPO fixes "step-size stability." Each generation builds on, rather than discards, the prior one's idea.
2. Why is RL training far more fragile than supervised learning? As a distributed-systems engineer, what does this "instability" remind you of?
The root cause is that RL has no fixed ground-truth labels; the target itself is moving (non-stationarity). In supervised learning the labels are nailed down and the objective stays fixed, so the loss measures progress against a stable target; in RL: (a) the Critic's target r+γV(s') again contains V, so it uses its own estimate as its own target (bootstrapping), and estimate drift gets amplified; (b) the data distribution changes with the policy: update the policy and the sampled state distribution shifts, so the training set is rewritten during training; (c) Actor↔Critic is a coupled dual optimization, like two agents in a game, which can oscillate or diverge together. The familiar distributed-systems pain it maps to: this is like doing feedback control in an eventually-consistent system with no global clock, where you need convergence yet have delayed feedback and positive-feedback loops, highly prone to oscillation. The engineering remedies rhyme too: limit step size (PPO clip ≈ rate limiting), freeze a target network for a while (≈ read/write splitting / snapshot isolation), use experience replay to decorrelate (≈ buffering/decoupling), parallelize many environments to reduce variance (≈ multi-replica sampling and averaging). RL's stability tricks are essentially old friends from control theory and distributed systems.
3. RLHF uses PPO to align LLMs, but the reward model is learned and imperfect. Why is "reward hacking" all but inevitable?
Because the reward model is only a lossy proxy for true human preferences, and the optimizer (PPO) is duty-bound to squeeze the proxy metric to the extreme, which is precisely Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Concretely, the model may learn to write answers that look confident, thorough, and pleasing but are hollow or sycophantic, because the reward model favors these surface features. Why it's hard to cure: (a) the reward model's training data is limited and can't cover every weird strategy the LLM explores, leaving out-of-distribution holes; (b) the stronger the optimization, the more it finds these holes, so capability and risk grow together. Mitigations and their limits: the KL penalty (constrain the aligned model from drifting too far from the base, akin to PPO clip's spirit) prevents flying off, but it is a trade-off: set it too strong and the model can't learn, too weak and it gets hacked again; periodically retraining the reward model with fresh human feedback patches holes, but it's an arms race; plus new paradigms like Constitutional AI and process supervision. Conclusion: clip and KL are necessary guardrails but not a sufficient solution. The fundamental difficulty of alignment is "how to specify what we truly want," not "how to optimize what's specified." This is the same problem as super-individual D20's prompt-injection offense/defense.
4. Is γ (the discount factor) just a trick to make the math converge, or does it carry deeper meaning? Is it the same thing as human delayed gratification?
Both layers exist. Mathematically, γ<1 guarantees the cumulative-reward series for infinitely long tasks converges (geometric series), a technical necessity. But semantically it encodes a profound value judgment: how much should we discount future rewards? γ near 0 = extremely myopic, only the immediate step matters; near 1 = very patient, distant rewards nearly equal to present ones. This is the same math as financial discounted cash flow (DCF)—γ is the "per-step discount rate," corresponding to interest rate/uncertainty. Related to but not identical with human delayed gratification: human discounting is hyperbolic (drops fast near-term, slow long-term, causing procrastination and preference reversal), whereas RL uses exponential discounting (multiply by a fixed γ each step, time-consistent, no regret)—which is exactly why humans say "I'll diet tomorrow" but a rational agent won't. An interesting corollary: raising γ to make an agent "more far-sighted" sounds good, but it amplifies variance and makes credit assignment harder—the further you look, the harder to judge which early step deserves the credit. Patience has a cost, true for both agents and people.
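The difference between the two discounting schemes can be seen in a few lines of plain Python. The sketch below compares "10 points soon" against "15 points five steps later" at two different waiting times; the reward sizes, γ = 0.9, and the hyperbolic rate k = 1 are assumptions picked for illustration.

```python
def exponential_value(reward, delay, gamma=0.9):
    return reward * gamma ** delay

def hyperbolic_value(reward, delay, k=1.0):
    return reward / (1.0 + k * delay)

def prefers_small_soon(value_fn, wait):
    """Compare 10 points after `wait` steps against 15 points five steps later."""
    return value_fn(10.0, wait) > value_fn(15.0, wait + 5)

for wait in (0, 20):
    print(wait,
          prefers_small_soon(exponential_value, wait),
          prefers_small_soon(hyperbolic_value, wait))
```

Under exponential discounting the preference is the same at both waiting times, because shifting both options by the same delay multiplies both values by the same factor. Under hyperbolic discounting the preference flips as the choice moves into the future, which is the preference reversal described above.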
5. Since LLM generation is essentially a policy network and RLHF is RL, will "reinforcement learning" become the main path to stronger intelligence? Where are its fundamental bottlenecks?
RL's unique value is handling sequential decision-making + delayed feedback + self-improvement—which supervised learning can't touch, and which is the shared bedrock of agents, reasoning (o1-style RL on reasoning), and alignment. The 2024–2026 trend is indeed "RL elicits/aligns the abilities a pretrained model already has" (e.g., RL with verifiable rewards on math/code). But there are three fundamental bottlenecks: (a) reward specification is hard—"good" in real tasks is hard to formalize, and a wrong reward gets faithfully optimized into disaster (see Q3's reward hacking); (b) sample efficiency and exploration—RL usually needs vast interaction, real-world trial-and-error (robotics, medicine) is hugely costly, and effective exploration in high-dimensional spaces remains unsolved; (c) stability and reproducibility—RL is sensitive to hyperparameters and random seeds, "works in the lab, collapses in a new environment." So the likelier picture is not "RL alone saves the day" but a combination of pretraining (acquiring world knowledge) + RL (alignment and decision-making) + retrieval/tools (external capability). RL is an indispensable link, but unless we solve "where the reward comes from and how to ensure it points at what we truly want," a stronger RL is, if anything, more dangerous. The bottleneck of intelligence is shifting from "how to optimize" to "what to optimize."