Supervised learning has "ground-truth answers"; reinforcement learning only has after-the-fact feedback—you make a sequence of decisions and the environment only tells you whether the outcome was good, not what to do at each step. This is the underlying paradigm for training agents, aligning LLMs (RLHF), playing games, and robotic control. Today's four cores build on each other: Q-Learning (learn what each action is worth) → Policy Gradient (learn what to do directly) → Actor-Critic (fuse the two) → PPO (make training stable—also the engine behind ChatGPT-style alignment). They are layered, not four parallel tools.
A Q-table is a memoization cache with a TTL: the key is the (state, action) pair, the value is "how many points this ultimately earns me." Just like caching a slow query—the first time you compute it via exploration, afterward you just look it up. The twist: this table self-corrects—each visit overwrites the old value with a better estimate, eventually converging.
Pain point: taking a step in a maze doesn't immediately tell you whether the step was good—rewards are delayed (you only score at the goal). How do you "propagate" the goal's reward back to every earlier step? Q-Learning's answer is the Bellman update (bootstrapping):
Symbol by symbol: Q(s,a) is the current value estimate for "doing action a in state s"; r is the immediate reward; max Q(s',a') is "once at the new state s', how much can the best action earn"—using the next step's estimate to improve this step, which is bootstrapping. The whole bracket is the TD error (temporal-difference error): the gap between reality (r+γ·future) and the old estimate; α (learning rate) controls how much you correct each time. The intuition is simply "nudge the current estimate toward a slightly more accurate estimate"—like cache eventual consistency, more accurate the more it's visited.
Why off-policy? The update uses max (the greedy optimal action), but during exploration you act with ε-greedy (mostly take the current best, occasionally try a random one)—the policy you learn and the policy you act with can differ. Like A/B testing production traffic: 99% goes to the stable route, 1% explores new routes, but what you learn is "which route is optimal." In 2013 DQN (Mnih et al.) replaced this table with a neural network to approximate Q, learning to play Atari straight from pixels—the dawn of deep RL.
import gymnasium as gym, numpy as np env = gym.make("FrozenLake-v1", is_slippery=False) Q = np.zeros((env.observation_space.n, env.action_space.n)) # Q table α, γ, ε = 0.8, 0.95, 0.1 for ep in range(2000): s, _ = env.reset(); done = False while not done: # ε-greedy: small chance to explore randomly, else look up the best a = env.action_space.sample() if np.random.rand()<ε else Q[s].argmax() s2, r, term, trunc, _ = env.step(a); done = term or trunc # Bellman update: correct current estimate with r + γ·best next Q[s, a] += α * (r + γ * Q[s2].max() - Q[s, a]) s = s2 print(Q.argmax(axis=1)) # the optimal action learned per state
Q-Learning builds a value table first, then picks actions from it (indirect); Policy Gradient tunes the parameters of a "policy function" directly—like directly adjusting a load balancer's routing weights: observe one route has low latency (high reward), bump up its weight, and next time you're more likely to take it. No detour through a value table—optimize the decision itself.
Pain point: when actions are continuous (turn the wheel 17.3°) or the state space is huge, Q-tables break down. Fix: use a neural network πθ(a|s) to directly output "the probability of each action in state s," with θ the network's parameters. The objective is to maximize expected return J(θ). The key policy gradient theorem (Sutton et al. 2000) tells us how to differentiate it:
Intuition: ∇log π(a|s) is the direction that "makes this action more likely"; G is the trajectory's actual return (total score). Their product = "weighted by return, push good actions' probabilities up and bad actions' down." High-return actions get a strong push; negative-return actions get pushed down. This is the classic REINFORCE algorithm—essentially "reward-weighted maximum likelihood."
Fatal weakness: extremely high variance. G is a Monte Carlo sample of an entire trajectory, heavily luck-dependent—the same good policy might score 100 this run and 10 the next due to bad luck, so the gradient signal swings wildly, and training is like climbing a hill through noise: slow and unstable. This "high variance" problem is exactly what the next card, Actor-Critic, solves.
import torch, torch.nn as nn policy = nn.Sequential(nn.Linear(4,128), nn.ReLU(), nn.Linear(128,2)) opt = torch.optim.Adam(policy.parameters(), lr=1e-2) # after running a full episode, with (log_probs, per-step returns): def update(log_probs, returns): returns = torch.tensor(returns) # normalizing returns = a simple variance-reduction trick (proto-baseline) returns = (returns - returns.mean()) / (returns.std() + 1e-8) # loss = -Σ log π(a|s)·G ; maximizing return = minimizing its negative loss = -(torch.stack(log_probs) * returns).sum() opt.zero_grad(); loss.backward(); opt.step() # push good actions up
Pure policy gradient is like "only knowing if you fixed it once production metrics come in"—slow and noisy. Actor-Critic adds an in-house code reviewer: the Actor = the policy network, which writes the code (picks actions); the Critic = a value network that instantly scores "how good relative to a baseline" at each step. No waiting for the finale—immediate feedback, far faster iteration.
It directly cures policy gradient's high-variance disease. The core move: replace the noisy raw return G with the Advantage function:
Term by term: V(s) (the state value the Critic learns) = "the average score obtainable in state s," a baseline; A = "how much better this action is than average." Swap G for A in the policy gradient:
Why does variance plummet? Raw G might be "+100" (everything looks great, you can't tell which action is truly strong); after subtracting the baseline it becomes a centered signal like "+5 / −3"—good vs. bad in sharp contrast, gradient direction clear. Mathematically, subtracting a baseline that depends only on the state (not the action) leaves the gradient's expectation unchanged (unbiased) yet sharply reduces variance. One of the most elegant free lunches in RL. How does the Critic learn V? With the same TD-error bootstrapping as Q-Learning. A2C / A3C (Mnih et al. 2016) are the classic implementations.
# Actor and Critic usually share a body, each with its own head class ActorCritic(nn.Module): def __init__(self): super().__init__() self.body = nn.Sequential(nn.Linear(4,128), nn.ReLU()) self.actor = nn.Linear(128, 2) # outputs action logits self.critic = nn.Linear(128, 1) # outputs V(s), one scalar def forward(self, s): h = self.body(s) return self.actor(h), self.critic(h) # key step: use the Critic's V to compute advantage, replacing high-variance G logits, v = model(s); _, v2 = model(s2) advantage = r + γ * v2.detach() - v # detach: don't backprop target into Critic actor_loss = -(dist.log_prob(a) * advantage.detach()) critic_loss = advantage.pow(2) # make V approach r+γV(s')
PPO adds canary deployment / rate limiting to policy updates: only dare to change a little at a time, and the new policy must not stray too far from the old one. Because in RL, stepping too far can cause policy collapse (like a full rollout that takes production down—and you can't roll back, since all subsequent data is sampled by the broken policy). PPO = hard-code "change at most ±20% per step" into the loss function.
Pain point: Actor-Critic can still update too aggressively in one step and collapse. The earlier TRPO solved this with a complex second-order constraint, but it's cumbersome to implement. PPO (Schulman et al. 2017) does it with a stunningly plain clip objective. First define the probability ratio rt(θ) = πθ(a|s) / πθold(a|s)—the ratio of new to old policy probabilities for the same action (=1 means no change). The core objective:
Unpacking the min and clip (ε usually 0.2): clip forces the ratio into [0.8, 1.2]—want to raise an action's probability to 1.5× the old? Sorry, the gain is capped at 1.2×, so no matter how aggressive, there's no extra benefit, hence no incentive to take a big step. The outer min takes the smaller of the clipped and unclipped values, a conservative "pessimistic lower bound": when advantage A>0 (good action) it limits the increase, when A<0 (bad action) it limits the decrease—neither direction is allowed to stray too far. In a phrase: "small, fast steps within the old policy's trust region."
This simple change gives PPO both stability and ease of implementation, making it the de facto standard. Its most important application: RLHF—train a reward model from human preferences, then use PPO to optimize the LLM policy, while adding a KL penalty to keep the model from drifting too far from its original language ability just to please the reward (same spirit as clip: "don't run too far"). The alignment stage of ChatGPT and Claude is built on this.
# in production, just use stable-baselines3—a PPO in a few lines from stable_baselines3 import PPO model = PPO("MlpPolicy", "CartPole-v1", verbose=1) model.learn(total_timesteps=50_000) # —— the heart of the clip objective is just these lines (to grasp it) —— ratio = torch.exp(new_logp - old_logp) # ratio = exp(log diff) unclipped = ratio * advantage clipped = torch.clamp(ratio, 1-ε, 1+ε) * advantage # pin to [0.8,1.2] policy_loss = -torch.min(unclipped, clipped).mean() # pessimistic lower bound