For 15 days we've been talking about LLMs. But the LLM is the tip of the pyramid; its base is the classic ML built up roughly between 1950 and 2000. Today we go back to the foundation: four algorithms still running at scale in production systems. What they share: they are interpretable, cheap, and often beat deep learning on tabular data. Understand these and you finally grasp what "machine learning" actually learns.
Linear regression is a scoring formula with learnable weights. Doing capacity planning you've hand-written things like cost = 0.3·CPU + 0.5·MEM + 0.2·IO — those coefficients you picked by gut. Linear regression does exactly the same thing, except it lets the data compute the optimal coefficients, aiming to minimize prediction error.
Pain point: you have historical "input features → numeric outcome" samples (house area → price, request count → latency) and want a rule to predict new ones. Linear regression assumes the outcome is a weighted sum of inputs:

ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b
Here xᵢ are features (area, room count), wᵢ are weights (how much each feature matters), b is the bias (baseline). How do we find the best w? Define a measure of "how badly we're wrong" — the Mean Squared Error (MSE): take each sample's prediction minus truth, square it, average.
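The MSE recipe above fits in a few lines. This is a minimal sketch; the predicted and actual values are made up for illustration and do not come from any dataset.

```python
def mean_squared_error(predictions, truths):
    """Average of squared (prediction - truth) over all samples."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, truths)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predicted vs. actual prices.
predicted = [310, 470, 720]
actual = [300, 480, 700]

mse = mean_squared_error(predicted, actual)
print(mse)  # errors are 10, -10, 20 -> squares 100, 100, 400 -> mean 200.0
```

Note how the single error of 20 contributes 400 of the 600 total: squaring lets the largest mistake dominate.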
Why square instead of absolute value? Three reasons: ① squares are differentiable everywhere, enabling gradient descent; ② big errors are amplified (an error of 10 is penalized 100× as much as an error of 1), pushing the model to avoid large mistakes; ③ there is a closed-form solution: with a column of ones appended to X to absorb the bias b, and XᵀX invertible, the single matrix formula w = (XᵀX)⁻¹Xᵀy gives the answer with no iteration. This is "least squares," published by Legendre in 1805, about 150 years before neural nets.
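The closed-form claim can be checked numerically. The sketch below assumes NumPy and a small made-up housing table; it appends a column of ones so the bias is learned as one more weight, then compares the normal-equations answer with NumPy's own least-squares routine.

```python
import numpy as np

# Illustrative data: [area, rooms] -> price
X = np.array([[50, 1], [80, 2], [120, 3], [200, 4]], dtype=float)
y = np.array([300, 480, 700, 1100], dtype=float)

# Append a column of ones so the bias b becomes the last weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: solve (X^T X) w = X^T y instead of forming the inverse explicitly.
w_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Reference answer from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print("normal equations:", w_normal)
print("np.linalg.lstsq :", w_lstsq)
```

Solving the linear system rather than inverting XᵀX is a common numerical choice; when features are nearly collinear, a least-squares solver is usually the safer route.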
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# features: [area, rooms]   label: price (10k)
X = np.array([[50, 1], [80, 2], [120, 3], [200, 4]])
y = np.array([300, 480, 700, 1100])

model = LinearRegression()
model.fit(X, y)                    # one line solves for optimal weights (closed form)

print(model.coef_)                 # weight w per feature
print(model.intercept_)            # bias b
print(model.predict([[100, 2]]))   # predict price of 100㎡/2-room
```
Logistic regression = linear regression + a valve that squashes any number into a "probability." It's like an admission controller with calibrated probabilities: instead of just "allow/deny," it says "this request is 87% likely to be normal traffic." It has "regression" in its name but does classification.
Pain point: many tasks want a yes/no, not a number: is this email spam? will this user churn? Linear regression outputs values in (−∞, +∞), unusable as a probability. Logistic regression wraps the weighted sum z = wᵀx + b in a Sigmoid function that squeezes it into (0, 1):

p = σ(z) = 1 / (1 + e^(−z))
Intuition: large z → p near 1 (almost certainly positive); very negative z → p near 0; z=0 → p=0.5 (maximally uncertain). This S-curve smoothly maps a "linear score" into a "probability." Its mathematical identity is the log-odds: wᵀx+b = log(p/(1−p)) — the model is linearly predicting the "log of the odds of winning."
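A quick numeric check of that intuition, as a minimal sketch using only the standard library. The function names `sigmoid` and `log_odds` are chosen here for illustration.

```python
import math


def sigmoid(z):
    """Squash any real score z into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Inverse of sigmoid: recover the linear score from a probability."""
    return math.log(p / (1.0 - p))


for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.3f}  log-odds={log_odds(p):+.3f}")
```

The last column reproduces z, which is the log-odds identity in action: the linear part of the model lives on the log-odds scale, and the sigmoid only translates it into a probability.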
Training uses not MSE but cross-entropy / log loss: when truth is positive, penalize by −log(p); when negative, by −log(1−p). Why switch losses? Sigmoid + MSE makes the loss non-convex (bumpy, gradient descent gets stuck in local minima); cross-entropy makes it convex again, and is equivalent to maximum likelihood estimation — the most natural choice mathematically.
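To get a feel for how log loss punishes confident mistakes, here is a small sketch for a single sample (standard library only; `log_loss` here is a hand-written helper, not the scikit-learn function of the same name).

```python
import math


def log_loss(y_true, p):
    """Cross-entropy for one sample: -log(p) if y_true is 1, else -log(1 - p)."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)


# The true label is positive (1) in all three cases.
for p in (0.9, 0.5, 0.1):
    print(f"p={p:.1f}  loss={log_loss(1, p):.3f}")
```

A confident correct answer (p=0.9) costs about 0.105, a coin flip about 0.693, and a confident wrong answer (p=0.1) about 2.303; the penalty grows without bound as p approaches 0.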
```python
from sklearn.linear_model import LogisticRegression

# features: [email length, count of 'free']   label: 1=spam 0=ham
X = [[20, 0], [15, 3], [200, 0], [30, 5]]
y = [0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba gives probabilities, not hard labels; that's the value
print(clf.predict_proba([[18, 4]]))   # [[P(ham), P(spam)]]
print(clf.predict([[18, 4]]))         # hard label after default 0.5 threshold
```
A decision tree is an auto-generated nested if-else routing table — exactly like the rule engines you've written, except the rules are learned from data. A random forest then replaces a single-point decision with quorum voting: grow hundreds of mutually different trees, let them vote, and majority rule cancels out any single tree's random bias — the same idea as using replica redundancy to defend against single-point failure in distributed systems.
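To make the "nested if-else" picture concrete, scikit-learn can print a fitted tree as indented rules. This is a sketch on the iris dataset; the depth limit and the short feature names are choices made here for readability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A deliberately shallow tree so the printed rules stay short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Short column names chosen for this sketch (not part of the dataset object).
feature_names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented `|---` line is one learned condition, and each `class:` line is a leaf, so the output reads like a routing table you could have written by hand.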
Pain point: linear models make you design features by hand and assume the relationship is linear. A decision tree assumes no shape: it greedily splits the data over and over, at each step asking "which feature, which threshold makes the two sides purest?" Purity is measured by Gini impurity:

Gini = 1 − Σₖ pₖ²
pₖ is the fraction of class k in a node. All one class → Gini=0 (purest); a 50/50 mix of two classes → Gini=0.5 (most mixed). The tree keeps picking the split that drops weighted Gini the most, until leaves are pure enough or a stopping rule such as a depth limit kicks in. This gives you a fully readable model that you can draw for a non-technical audience.
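The split search can be sketched for a single feature in plain Python, assuming the usual definition Gini = 1 − Σₖ pₖ². The data is hypothetical, and real implementations scan every feature and do it far more efficiently.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum over classes of (class fraction)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: each side's Gini weighted by its share of samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try the midpoint between neighboring values; keep the purest split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical single feature with binary labels.
values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]

print(gini(labels))                    # 0.5: a 50/50 node
print(best_threshold(values, labels))  # (6.5, 0.0): both sides end up pure
```

A full tree repeats this search recursively inside each side of the chosen split.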
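But a single tree has a fatal flaw: extremely high variance. Swap a few samples and it grows into a totally different tree, so it overfits easily. Random forest (Breiman 2001) uses two layers of randomness to decorrelate the trees: ① row sampling (bagging): each tree trains on a bootstrap sample drawn with replacement from the training set; ② feature sampling: at each split, a tree may only choose among a random subset of the features.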
Key math intuition: averaging N independent noisy predictions cuts variance to 1/N of a single one (the same statistics as taking the mean of repeated measurements to reduce noise). Bagging and feature sampling make the trees less correlated rather than truly independent, so in practice the reduction is smaller than 1/N, but voting still "averages away" much of each tree's overfitting. That is the core of why ensembles win.
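The 1/N effect is easy to simulate. The sketch below idealizes each "tree" as the true value plus independent unit-variance Gaussian noise, which real trees are not; it only illustrates the statistics.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N = 25          # number of predictors being averaged
TRIALS = 4000   # how many times the experiment is repeated


def noisy_prediction():
    """One predictor's output: true value 0 plus unit-variance Gaussian noise."""
    return rng.gauss(0.0, 1.0)


single = [noisy_prediction() for _ in range(TRIALS)]
averaged = [
    sum(noisy_prediction() for _ in range(N)) / N
    for _ in range(TRIALS)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"variance of one predictor:   {var_single:.3f}")
print(f"variance of the {N}-average: {var_averaged:.3f}  (1/N = {1 / N:.3f})")
```

If the predictors shared part of their noise, the second number would level off above 1/N, which is why a forest works to decorrelate its trees.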
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# n_estimators = number of trees; more is steadier, with diminishing returns
rf = RandomForestClassifier(n_estimators=300, max_depth=5)
rf.fit(X, y)

# near-free feature importances: forests naturally rank which feature helps most
for name, imp in zip(["sepal_len", "sepal_wid", "petal_len", "petal_wid"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```
SVM finds the widest "buffer zone / DMZ" between two classes. Other classifiers are happy drawing any line that separates the sides; SVM insists on the line that is farthest from both — the wider the buffer, the more robust to new data. And that line is decided by only a handful of critical samples hugging the boundary (the support vectors), like the few constraints that are actually "tight" in an optimization problem.
Pain point: there are infinitely many straight lines separating two classes, so which one generalizes best? SVM gives a clear answer: maximize the margin, the width of the buffer between the two classes. With the boundary scaled so that the nearest samples satisfy |wᵀx + b| = 1, each of them sits 1/‖w‖ from the boundary and the buffer is 2/‖w‖ wide, so "maximize the margin" equals "minimize ‖w‖," solved under the constraint that "every point lands correctly outside the buffer":

minimize ½‖w‖²  subject to  yᵢ(wᵀxᵢ + b) ≥ 1 for every sample i, with labels yᵢ ∈ {−1, +1}
Intuition: smaller ‖w‖ → wider buffer; the constraint guarantees each sample sits on its own side, at least "one unit" from the boundary. The optimal solution is determined only by the few support vectors pressed against the boundary — delete the other samples and the boundary doesn't budge. This makes SVM especially stable in high-dimensional, small-sample settings.
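The sketch below checks these ideas by hand on six made-up 2-D points, assuming the standard hard-margin constraint yᵢ(wᵀxᵢ + b) ≥ 1 with labels −1/+1. It does not run an SVM solver; it compares two hand-picked boundaries that both satisfy the constraint.

```python
import math

# Hypothetical 2-D samples: label -1 on the left, +1 on the right.
X = [(0, 0), (-1, 1), (-2, -1), (2, 0), (3, 1), (4, -1)]
y = [-1, -1, -1, 1, 1, 1]


def constraint_values(w, b):
    """y_i * (w . x_i + b) for every sample; the SVM constraint wants each >= 1."""
    return [yi * (w[0] * x0 + w[1] * x1 + b) for (x0, x1), yi in zip(X, y)]


def buffer_width(w):
    """Width of the buffer, 2 / ||w||, once the tightest constraint equals 1."""
    return 2.0 / math.hypot(w[0], w[1])


# Two candidate boundaries, both scaled so the tightest constraint is exactly 1.
candidates = {
    "vertical x=1": ((1.0, 0.0), -1.0),
    "tilted x+y=1": ((1.0, 1.0), -1.0),
}

for name, (w, b) in candidates.items():
    values = constraint_values(w, b)
    tight = [X[i] for i, v in enumerate(values) if v == 1.0]
    print(f"{name}: width={buffer_width(w):.3f}, tight points={tight}")
```

Both lines separate the classes, but the vertical one has the smaller ‖w‖ and therefore the wider buffer (2.0 versus about 1.414). Its only tight points are (0, 0) and (2, 0); those play the role of support vectors, while the other four samples have slack to spare.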
But data is often not linearly separable (no straight line works). SVM's killer move is the kernel trick: implicitly map the data into a higher-dimensional space where the two tangled classes become cleanly cuttable by a plane, like adding a computed column to a SQL query (say x²+y²) that instantly makes ring-shaped data separable. The beauty is that the kernel never actually computes the high-dimensional coordinates, only the similarity between pairs of samples, saving astronomical compute. For the widely used RBF (Gaussian) kernel, that similarity is:

K(x, x′) = exp(−γ‖x − x′‖²)
γ controls the "radius of influence": large γ → each point only affects its immediate neighborhood, so the boundary bends tightly around individual samples (overfits easily); small γ → wide influence, smooth boundary. Together with the penalty C, this is the most important knob of an RBF SVM.
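A few numbers make the role of γ tangible. This sketch assumes the standard RBF form K(x, x′) = exp(−γ‖x − x′‖²) and uses three made-up points; `rbf_kernel` is a hand-written helper for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Hypothetical points at distance 0, 1 and 3 from the origin.
origin, near, far = (0.0, 0.0), (1.0, 0.0), (3.0, 0.0)

for gamma in (0.1, 1.0, 10.0):
    print(
        f"gamma={gamma:>4}: "
        f"K(origin, origin)={rbf_kernel(origin, origin, gamma):.3f}  "
        f"K(origin, near)={rbf_kernel(origin, near, gamma):.3f}  "
        f"K(origin, far)={rbf_kernel(origin, far, gamma):.3f}"
    )
```

A point is always fully similar to itself (1.0). With γ=10, even a neighbor at distance 1 scores almost zero, so every sample speaks only for its immediate surroundings; with γ=0.1, the same neighbor still scores about 0.9.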
```python
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# ring data: inner ring one class, outer ring another; no line can split
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4)

# RBF kernel: implicitly lift dimensions, making the rings separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("num support vectors:", len(clf.support_))   # only a few samples decide the boundary
print("accuracy:", clf.score(X, y))
```
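For comparison, here is what the "computed column" idea looks like when the lift is done explicitly rather than through a kernel. This is a sketch with assumptions: the rings come from a small hand-rolled generator (a stand-in for `make_circles`, with radial noise), and the 0.5 threshold is hand-picked.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def ring(radius, n, noise=0.05):
    """n points scattered around a circle of the given radius."""
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        r = radius + rng.gauss(0.0, noise)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


inner = ring(0.4, 100)   # class 1
outer = ring(1.0, 100)   # class 0
points = inner + outer
labels = [1] * len(inner) + [0] * len(outer)

# The "computed column": squared distance from the origin.
lifted = [x * x + y * y for x, y in points]

# In the lifted feature a single threshold is enough (0.5 is hand-picked).
predictions = [1 if r2 < 0.5 else 0 for r2 in lifted]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print("accuracy with the lifted feature:", accuracy)
```

Here the right extra feature had to be guessed by a human. The RBF kernel gets a comparable effect without anyone naming the feature, at the price of tuning γ and C.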