AI/ML Deep Dive: Classic ML Algorithms

Day 16 · 2026-06-02 · Difficulty ★★☆
For: experienced engineers outside the AI/ML field

For 15 days we've been talking about LLMs. But the LLM is the tip of the pyramid; its base is classic ML, most of it built up between 1950 and 2000 on statistical ideas that go back much further. Today we go back to the foundation: four algorithm families still running at scale in production systems. What they share: they are interpretable, cheap, and often beat deep learning on tabular data. Understand these and you finally grasp what "machine learning" actually learns.

Linear Regression

Supervised · Regression
One-line analogy

Linear regression is a scoring formula with learnable weights. Doing capacity planning you've hand-written things like cost = 0.3·CPU + 0.5·MEM + 0.2·IO — those coefficients you picked by gut. Linear regression does exactly the same thing, except it lets the data compute the optimal coefficients, aiming to minimize prediction error.

What it solves + how it works

Pain point: you have historical "input features → numeric outcome" samples (house area→price, request count→latency) and want a rule to predict new ones. Linear regression assumes the outcome is a weighted sum of inputs:

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b

Here xᵢ are features (area, room count), wᵢ are weights (how much each feature matters), b is the bias (baseline). How do we find the best w? Define a measure of "how badly we're wrong" — the Mean Squared Error (MSE): take each sample's prediction minus truth, square it, average.

Loss = (1/N) · Σ (yᵢ − ŷᵢ)²

Why square instead of absolute value? Three reasons: ① squares are differentiable everywhere, enabling gradient descent; ② big errors are amplified (an error of 10 is penalized 100× more than 1), forcing the model to avoid large mistakes; ③ mathematically there's a closed-form solution — one matrix formula w = (XᵀX)⁻¹Xᵀy gives the answer, no iteration. This is "least squares," used by Legendre back in 1805 — 150 years before neural nets.
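To make the closed-form claim concrete, here is a minimal sketch that solves the normal equation with NumPy and cross-checks the result against scikit-learn. The toy data and the "true" coefficients (3, −2, 5) are made up for illustration, and the code solves the linear system instead of forming the inverse explicitly, which is a common numerical-safety choice rather than anything the formula requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the bias b is learned as one more weight
Xb = np.hstack([X, np.ones((len(X), 1))])

# Normal equation w = (X^T X)^-1 X^T y, solved as a linear system
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Cross-check against scikit-learn's solver
model = LinearRegression().fit(X, y)
w_sklearn = np.append(model.coef_, model.intercept_)

mse = np.mean((y - Xb @ w_closed) ** 2)
print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
print("MSE:", mse)
```

Both routes should land on essentially the same weights, close to the coefficients used to generate the data.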

Code
from sklearn.linear_model import LinearRegression
import numpy as np

# features: [area, rooms]   label: price (10k)
X = np.array([[50, 1], [80, 2], [120, 3], [200, 4]])
y = np.array([300, 480, 700, 1100])

model = LinearRegression()
model.fit(X, y)              # one line solves for optimal weights (closed form)

print(model.coef_)          # weight w per feature
print(model.intercept_)     # bias b
print(model.predict([[100, 2]]))  # predict price of 100㎡/2-room
Pitfall + practical scenario
"Linear regression only fits straight lines, too weak" — wrong. It's linear in the parameters, but you can feed it nonlinear features: add an x² column, log(x), x₁·x₂, and it fits curves and interactions. The real limit isn't "straight lines" — it's that you must design the features by hand. That's exactly the part deep learning replaces.
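📌 Super-individual scenario: model your own time investment, for example "hours per week spent on writing, outreach, and editing → subscriber growth." The coefficients suggest which activity is associated with the most growth per hour, a better starting point than deciding by feel. Treat them as correlations rather than proof of causation: linear regression is a cheap way to build intuition, not a causal experiment.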
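The pitfall above says a linear model can fit curves once you hand it nonlinear columns. A small sketch of that, on made-up noise-free data following y = 2x² − x + 1: the same LinearRegression is fit once on x alone and once on the columns [x, x²].

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical curved data: y = 2*x^2 - x + 1 (no noise, to keep the check exact)
x = np.linspace(-3, 3, 50)
y = 2 * x**2 - x + 1

# Plain linear fit on x alone: a straight line through a parabola
X_line = x.reshape(-1, 1)
r2_line = LinearRegression().fit(X_line, y).score(X_line, y)

# Same model, one extra hand-made column x^2: now it can follow the curve
X_curve = np.column_stack([x, x**2])
curve = LinearRegression().fit(X_curve, y)
r2_curve = curve.score(X_curve, y)

print("R^2 with x only:    ", r2_line)
print("R^2 with x and x^2: ", r2_curve)
print("learned weights:", curve.coef_, "bias:", curve.intercept_)
```

The model is still linear in its weights; only the feature table changed. That hand-made x² column is the feature engineering that deep learning automates.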
Takeaway + question
💡 Linear regression is the "Hello World" of all ML: a weighted sum + an error measure + an optimizer. A neural net just stacks these three things a million times over.
🤔 That decision formula where you last "picked weights by gut" — if you had 50 rows of history, could regression compute the coefficients for you?

Logistic Regression

Supervised · Classification · Probability
One-line analogy

Logistic regression = linear regression + a valve that squashes any number into a "probability." It's like an admission controller with calibrated probabilities: instead of just "allow/deny," it says "this request is 87% likely to be normal traffic." It has "regression" in its name but does classification.

What it solves + how it works

Pain point: many tasks want a yes/no, not a number — is this email spam? will this user churn? Linear regression outputs (−∞, +∞), unusable as a probability. Logistic regression wraps the weighted sum in a Sigmoid function that squeezes it into (0, 1):

p = σ(wᵀx + b),  σ(z) = 1 / (1 + e^(−z))

Intuition: large z → p near 1 (almost certainly positive); very negative z → p near 0; z=0 → p=0.5 (maximally uncertain). This S-curve smoothly maps a "linear score" into a "probability." Its mathematical identity is the log-odds: wᵀx+b = log(p/(1−p)) — the model is linearly predicting the "log of the odds of winning."
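A quick numeric check of the two identities above, as a sketch: the sigmoid maps scores into (0, 1), and the log-odds function undoes it. The helper names `sigmoid` and `log_odds` are just illustrative.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # linear scores w^T x + b
p = sigmoid(z)

print("scores:       ", z)
print("probabilities:", p)
print("log-odds:     ", log_odds(p))          # recovers z up to rounding
```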

Training uses not MSE but cross-entropy / log loss: when truth is positive, penalize by −log(p); when negative, by −log(1−p). Why switch losses? Sigmoid + MSE makes the loss non-convex (bumpy, gradient descent gets stuck in local minima); cross-entropy makes it convex again, and is equivalent to maximum likelihood estimation — the most natural choice mathematically.
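As a rough sketch of what "training with log loss" means mechanically, the loop below fits a one-feature logistic regression by plain gradient descent. The data, learning rate, and iteration count are arbitrary choices for illustration; scikit-learn uses more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    """Mean cross-entropy: -log(p) for positives, -log(1-p) for negatives."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical 1-D data: the label is 1 more often when x is large
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

w, b, lr = 0.0, 0.0, 0.5
losses = []
for _ in range(200):
    p = sigmoid(w * x + b)
    losses.append(log_loss(y, p))
    # For sigmoid + log loss, the gradient w.r.t. the score is simply (p - y)
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

print("initial loss:", losses[0])    # log(2): every prediction starts at 0.5
print("final loss:  ", losses[-1])
print("learned w, b:", w, b)
```

The tidy (p − y) gradient is part of why this pairing of activation and loss is considered the natural one.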

Code
from sklearn.linear_model import LogisticRegression

# features: [email length, count of 'free']   label: 1=spam 0=ham
X = [[20, 0], [15, 3], [200, 0], [30, 5]]
y = [0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba gives probabilities, not hard labels — that's the value
print(clf.predict_proba([[18, 4]]))  # [[P(ham), P(spam)]]
print(clf.predict([[18, 4]]))        # hard label after default 0.5 threshold
Pitfall + practical scenario
"Output 0.9 means 90% accuracy" — this conflates probability with accuracy. 0.9 is the model's confidence on this one sample; whether it's "well calibrated" (do 90% of its 0.9-samples truly turn out positive?) must be verified separately. Also: the default 0.5 threshold isn't sacred — for fraud detection you might drop it to 0.2, preferring false alarms over misses.
📌 Super-individual scenario: build a lightweight "importance scorer" for your inbox/task stream. Logistic regression's weights are directly readable — you can see how much "sender is the boss" contributes. Far better than a black box for personal decision systems that need explaining.
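The threshold point from the pitfall above can be sketched as follows. The dataset is a synthetic, imbalanced stand-in for something like fraud data (about 10% positives), and for brevity the scores are computed on the training data itself; a real evaluation would use a held-out set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, imbalanced data: class 1 plays the role of "fraud"
X, y = make_classification(n_samples=2000, n_features=5, weights=[0.9, 0.1],
                           random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
p_pos = clf.predict_proba(X)[:, 1]            # P(class 1) for each sample

results = {}
for threshold in (0.5, 0.2):
    pred = (p_pos >= threshold).astype(int)   # apply our own cutoff
    results[threshold] = (int(pred.sum()), recall_score(y, pred))
    print(f"threshold={threshold}: flagged={results[threshold][0]}, "
          f"recall={results[threshold][1]:.2f}")
```

Lowering the cutoff can only flag the same samples or more, so recall never drops; the price is more false alarms, which is the trade the text describes.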
Takeaway + question
💡 Logistic regression is one of the most widely used classifiers in industry, not because it's the strongest, but because it's fast, interpretable, and outputs probabilities (usually reasonably calibrated, though that still needs checking). A neural net's final layer (Softmax) is essentially multi-class logistic regression.
🤔 Which of your "binary judgments" (worth doing? trustworthy?) could actually be decomposed into a few quantifiable features and handed to a scorer?

Decision Tree & Random Forest

Supervised · Ensemble · Trees
One-line analogy

A decision tree is an auto-generated nested if-else routing table — exactly like the rule engines you've written, except the rules are learned from data. A random forest then replaces a single-point decision with quorum voting: grow hundreds of mutually different trees, let them vote, and majority rule cancels out any single tree's random bias — the same idea as using replica redundancy to defend against single-point failure in distributed systems.

What it solves + how it works

Pain point: linear models make you design features by hand and assume the relationship is linear. A decision tree assumes no shape — it greedily splits the data over and over: at each step asking "which feature, which threshold makes the two sides purest?" Purity is measured by Gini impurity:

Gini = 1 − Σ pₖ²

pₖ is the fraction of class k in a node. All one class → Gini=0 (purest); a 50/50 mix of two classes → Gini=0.5 (the most mixed a two-class node can be). The tree keeps picking the split that lowers the weighted Gini of the two children the most, until leaves are pure enough or a stopping rule such as a depth limit kicks in. This gives you a fully readable model: you can draw it for a non-technical audience.
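A minimal sketch of the split search for a single feature, assuming candidate thresholds are the midpoints between neighboring values (a common convention, not the only one). The helper names `gini` and `best_threshold` are illustrative, not a library API.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try every midpoint between sorted unique values of one feature and
    return (threshold, weighted Gini) for the purest split found."""
    values = np.unique(x)
    best = (None, gini(y))
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if weighted < best[1]:
            best = (t, weighted)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

print("node impurity:", gini(y))             # 0.5: a 50/50 two-class node
print("best split:", best_threshold(x, y))   # lands between 3 and 10, both sides pure
```

A full tree repeats this search over every feature, then recurses into each side.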

But a single tree has a fatal flaw: extremely high variance. Swap a few samples and it grows into a totally different tree — it overfits easily. Random forest (Breiman 2001) uses two layers of randomness to decorrelate the trees:

  • Bagging (bootstrap sampling): each tree trains only on a random sample drawn with replacement — like sampling data shards;
  • Random feature subsets: each split considers only a random subset of features, forcing different trees to view from different angles.

Random Forest = many decorrelated trees voting

random samples + random features
  Tree 1: spam    Tree 2: ham    Tree 3: spam    …    Tree 500
    └──────────────── majority vote (quorum) ────────────────┘
final output: spam (312 votes vs 188)

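Key math intuition: averaging N independent noisy predictions cuts variance to 1/N of a single prediction's (the same statistics as taking the mean of repeated measurements to reduce noise). Trees trained on overlapping data are never fully independent, so the real reduction is smaller, but bagging and random feature subsets lower the correlation between trees enough that voting "averages away" much of each tree's overfitting. That is the core of why ensembles win.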
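The variance argument can be checked with a small simulation. This is a sketch with made-up numbers: 50 "models" whose errors are unit-variance Gaussian noise, first fully independent, then sharing a common error component with an assumed pairwise correlation ρ = 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_models = 20000, 50

# Each row: 50 independent noisy "predictions" of a true value 0, variance 1
independent = rng.normal(size=(n_trials, n_models))
var_single = independent[:, 0].var()
var_avg_independent = independent.mean(axis=1).var()     # about 1/50

# Correlated case: every model shares part of its error (rho = 0.5)
rho = 0.5
shared = rng.normal(size=(n_trials, 1))
correlated = np.sqrt(rho) * shared + np.sqrt(1 - rho) * independent
var_avg_correlated = correlated.mean(axis=1).var()       # about rho + (1 - rho)/50

print("single model variance:     ", var_single)
print("average of 50, independent:", var_avg_independent)
print("average of 50, rho = 0.5:  ", var_avg_correlated)
```

With correlated errors the variance of the average levels off near ρ however many models you add, which is why a forest works so hard to decorrelate its trees.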

Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# n_estimators = number of trees; more is steadier, with diminishing returns
rf = RandomForestClassifier(n_estimators=300, max_depth=5)
rf.fit(X, y)

# near-free feature importances — forests naturally rank which feature helps most
for name, imp in zip(["sepal_len","sepal_wid","petal_len","petal_wid"], rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
Pitfall + practical scenario
"Deeper trees are better" is wrong for a single tree: one deep tree memorizes the training set (overfits) and generalizes worse. A random forest can afford deep trees only because averaging many decorrelated trees cancels much of their individual variance; it wins through the vote, not through any one tree. When you want the strongest single model on tabular data, a common choice is gradient boosting (XGBoost / LightGBM), also a tree ensemble, but one that uses boosting (many shallow trees, each correcting the errors of the ones before it) instead of bagging (independent trees voting in parallel). It still often beats deep learning in tabular-data competitions.
📌 Super-individual scenario: with a structured dataset in hand (user-behavior table, personal finances, experiment logs), don't reach for a neural net first. Random forest / XGBoost gives you a strong baseline plus a feature-importance ranking in a few lines with little tuning, pointing you to the variables the model relies on most.
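To see the single-tree versus ensemble contrast on real tabular data, here is a sketch using scikit-learn's built-in breast-cancer dataset and 5-fold cross-validation. scikit-learn's GradientBoostingClassifier stands in for XGBoost / LightGBM to avoid extra dependencies; the dataset and hyperparameters are illustrative choices, not a benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single deep tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 held-out folds for each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

On datasets like this the two ensembles typically come out ahead of the single tree, but the size of the gap depends on the data, so treat any one run as an illustration.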
Takeaway + question
💡 A single tree is interpretable but unstable; random forest trades a little interpretability for robustness via the "wisdom of the crowd" — a perfect mapping of distributed-redundancy thinking into ML.
🤔 Your familiar "replica voting / quorum consensus" and the forest's "tree voting" — how does each manufacture independence between members? Why is independence the linchpin of both?

Support Vector Machine (SVM)

Supervised · Classification · Kernel
One-line analogy

SVM finds the widest "buffer zone / DMZ" between two classes. Other classifiers are happy drawing any line that separates the sides; SVM insists on the line that is farthest from both — the wider the buffer, the more robust to new data. And that line is decided by only a handful of critical samples hugging the boundary (the support vectors), like the few constraints that are actually "tight" in an optimization problem.

What it solves + how it works

Pain point: there are infinitely many straight lines separating two classes — which generalizes best? SVM gives a clear answer: maximize the margin, the distance from the decision boundary to the nearest sample. Geometrically the margin equals 2/‖w‖, so "maximize the margin" equals "minimize ‖w‖," solved under the constraint that "every point lands correctly outside the buffer":

min ½‖w‖² s.t. yᵢ(wᵀxᵢ + b) ≥ 1

Intuition: smaller ‖w‖ → wider buffer; the constraint guarantees each sample sits on its own side, at least "one unit" from the boundary. The optimal solution is determined only by the few support vectors pressed against the boundary — delete the other samples and the boundary doesn't budge. This makes SVM especially stable in high-dimensional, small-sample settings.
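The "only the support vectors matter" claim can be tested directly. In this sketch, two made-up, well-separated clusters are fit with a linear SVM; a large C is assumed as an approximation of the hard-margin problem above. The model is then refit on the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated hypothetical clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-3, -3], size=(50, 2)),
               rng.normal(loc=[3, 3], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# A large C approximates the hard-margin formulation
clf = SVC(kernel="linear", C=1e3).fit(X, y)
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)

# Refit using ONLY the support vectors: the boundary should barely move
sv = clf.support_
clf_sv = SVC(kernel="linear", C=1e3).fit(X[sv], y[sv])

print("support vectors:", len(sv), "of", len(X))
print("margin width 2/||w||:", margin)
print("w from all data:       ", w)
print("w from support vectors:", clf_sv.coef_[0])
```

The two weight vectors should agree up to solver tolerance, even though the second fit saw only a handful of points.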

But data is often not linearly separable (no straight line works). SVM's killer move is the kernel trick: implicitly map the data into a higher-dimensional space where the two tangled classes become cleanly cuttable by a plane — like adding a computed column to a SQL query (say x²+y²), instantly making ring-shaped data separable. The beauty is the kernel never actually computes the high-dimensional coordinates, only the similarity between samples, saving astronomical compute:
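The "computed column" analogy can be tried literally. This sketch does the lift explicitly by appending r² = x² + y² as a third feature, then uses a plain linear SVM. It is not the kernel trick itself, which gets the same effect without ever materializing the extra coordinates. Dataset parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# In the original 2-D plane no straight line separates the rings
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add the computed column r2 = x^2 + y^2: the rings now sit at different
# heights, so a flat plane can separate them
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X, r2])
acc_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print("linear SVM, 2-D training accuracy:   ", acc_2d)
print("linear SVM, lifted training accuracy:", acc_3d)
```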

RBF kernel: K(x, x′) = exp(−γ‖x − x′‖²)

γ controls the "radius of influence": large γ → each point only affects its neighborhood, boundary bends (overfits easily); small γ → wide influence, smooth boundary. Together with C, which sets how heavily margin violations are penalized when the classes overlap, this is SVM's most important knob.
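A small sweep makes the effect of γ visible. This is a sketch on noisy ring data with an arbitrary train/test split; the three γ values are picked to span "too smooth" to "far too local" and are not tuned.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.2, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for gamma in (0.1, 1.0, 1000.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    results[gamma] = (clf.score(X_train, y_train),   # accuracy on seen data
                      clf.score(X_test, y_test),     # accuracy on unseen data
                      len(clf.support_))             # number of support vectors
    print(f"gamma={gamma}: train={results[gamma][0]:.2f}, "
          f"test={results[gamma][1]:.2f}, support vectors={results[gamma][2]}")
```

At very large γ nearly every training point becomes its own island and a support vector, and the gap between train and test accuracy is the overfitting the text warns about.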

Code
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# ring data: inner ring one class, outer ring another — no line can split
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4)

# RBF kernel: implicitly lift dimensions, making the rings separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("num support vectors:", len(clf.support_))  # only these samples define the boundary
print("training accuracy:", clf.score(X, y))      # scored on the fit data, so optimistic
Pitfall + practical scenario
"SVM is obsolete, killed by deep learning" — half true. On big data, images, and text it did cede ground to neural nets; but in small-sample (hundreds to thousands), high-dimensional settings (bioinformatics, small classification tasks), SVM is still accurate, stable, and needs no GPU. Picking the wrong tool isn't the algorithm's fault — there's no silver bullet, only fit.
📌 Super-individual scenario: you have only 300 labeled samples (say hand-labeled "high-quality / ordinary" articles). A neural net trained from scratch is likely to overfit. An RBF-kernel SVM is often a strong starting point for small-sample classification.
Takeaway + question
💡 SVM teaches two timeless ideas: "maximum margin" = leave slack to be robust; "kernel trick" = look from another space and the hard problem turns easy. These two outlive SVM itself.
🤔 "Lifting dimensions turns inseparable into separable" — is there an analog in your other domains (data modeling, problem decomposition, cross-disciplinary thinking): switch representation space and a tangled problem suddenly resolves?

Further Reading

Deep Questions

1. All four of today's algorithms are "supervised learning." What's their essential difference from — and common ground with — the LLMs of the past 15 days, in terms of "what is learned"?
Common ground: all fit a function f(input)→output from data, all minimize some loss — an LLM's next-token prediction is also classification (Softmax over a vocabulary of tens of thousands, i.e. a giant logistic regression). The differences are threefold: (1) features — classic ML needs you to hand-design features (area, word frequency), while deep learning learns representations automatically from raw data, the revolution of "representation learning"; (2) capacity and data — linear/SVM have few parameters and run on hundreds of samples, while LLMs have hundreds of billions of parameters and need vast data or they overfit; (3) interpretability — you can read logistic regression's weights one by one, while an LLM is a black box. Deeper insight: a neural network can be seen as a "tower of logistic regressions with automated feature engineering" — each layer is a linear transform + nonlinear activation, and the final layer is logistic regression. Grasp classic ML and you truly see how pivotal it is that deep learning "removed human feature engineering."
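The "Softmax is a giant logistic regression" remark can be verified in a few lines: with two classes whose scores are [z, 0], softmax gives exactly the sigmoid of z. The score value below is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

z = 1.7                                   # an arbitrary linear score w^T x + b
two_class = softmax(np.array([z, 0.0]))   # class scores: [z, 0]

print("softmax P(class 0):", two_class[0])
print("sigmoid(z):        ", sigmoid(z))  # identical up to rounding
```

An LLM's output layer is the same construction with tens of thousands of class scores instead of two.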
2. Why do tree ensembles like XGBoost still often beat deep learning on "tabular / structured data"? What does this imply for "should we just throw a big model at everything"?
Tabular data's traits: heterogeneous features (numeric/categorical mixed), modest dimensionality, limited samples, no spatial/temporal structure between features. Tree models are naturally good at this: (1) insensitive to feature scale, no normalization needed; (2) automatically handle nonlinearity and feature interactions (a path of successive splits on different features encodes an interaction); (3) robust to irrelevant features, which simply never get chosen for a split. Neural nets' strength, learning hierarchical representations from raw signals, has nothing to chew on when the table is "already hand-engineered features," and with few samples they overfit and need heavy tuning. Implication: match the model to the data structure; bigger isn't better. Images, text, and audio (high-dimensional, abundant raw data with strong spatial or sequential structure) are deep learning's home turf; but for the huge number of tabular prediction problems in enterprises, a few lines of XGBoost often match or beat a carefully tuned neural net while being far cheaper to train and easier to interpret. "Just throw a big model at it" is often using an aircraft carrier to deliver a parcel.
3. Random forest cuts variance via "many trees voting"; SVM seeks robustness via "maximum margin." Are these two anti-overfitting ideas actually the same thing underneath?
Different on the surface, but both fight the same enemy: overfitting, where the model memorizes noise instead of the rule. They bet on different mechanisms. Random forest takes the "averaging out noise" route: a single tree has high variance (sensitive to the training set), but the less correlated the trees' errors are, the closer averaging gets to shrinking variance to 1/N. This is the ensemble / statistical-averaging idea, the same family as "average repeated measurements" and "distributed multi-replica." SVM takes the "leave slack" route: instead of fitting the training points as tightly as possible, it pushes the boundary as far from the data as it can, leaving a buffer. This is the regularization / Occam's razor idea: "don't draw the rule hugging the samples; leave a buffer to survive perturbation." Unified view: both deliberately give up some fit on the training set to gain stability on unseen data, which is exactly the core tension of generalization (the bias-variance trade-off). Grasp this and you can ask of any ML method: "how is it buying robustness?"
4. These algorithms were invented roughly 30 to 220 years ago (least squares 1805, SVM 1995). Why are they still running at scale in production today?
Because they match the true complexity of most real problems. A counterintuitive point: a large share of industry's ML value comes not from the bleeding edge but from applying classic algorithms in the right place. Reasons: (1) interpretability is often mandatory: finance risk control, healthcare, hiring and other high-stakes settings are frequently required by regulation or policy to explain decisions; logistic regression's weights can be read and audited directly, while black-box big models are much harder to justify; (2) cost and latency: logistic-regression inference is one dot product, microseconds, zero GPU, while an LLM inference is orders of magnitude pricier; (3) data reality: most companies lack the data to train a big model but have a few thousand rows of tabular data, exactly classic ML's sweet spot; (4) robust and maintainable: simple models have clear failure modes and are easy to debug and monitor. The deeper lesson matters especially for the "AI super-individual": real engineering judgment isn't always chasing the newest, but knowing how heavy a tool each problem deserves. Whatever linear regression can solve, don't reach for a Transformer. That restraint is itself a scarce skill.