Day 4 · 2026.05.27

The Essence of Linear Algebra

See numbers as geometry, see matrices as motion

"There are two kinds of people in the world: those who think of matrices as arrays of numbers, and those who think of them as linear transformations. The second kind have all the fun." — paraphrased after Sheldon Axler, Linear Algebra Done Right

Vector — list, arrow, and abstraction

Vector · Linear Algebra / Geometry

Vector Space

Intuition

The same vector looks like three different things to three kinds of people. To physicists, it is an arrow with direction: force, velocity, displacement. To programmers, it is a list of numbers: [2, 1.3, -0.7] — a row of data, the pixels of an image, a word embedding. To mathematicians, neither of those is the definition; they ask only: "can you add it, can you scale it, and do those operations obey a few simple rules?" If so, it lives in a vector space. Polynomials can be vectors. Functions can be vectors. Quantum states can be vectors.

3Blue1Brown identifies the meeting of these three views as the single most important thing in linear algebra: the dictionary between "arrow" and "list" is the bridge between geometry and computation. Write a numpy array — there is a geometric object behind it. Draw an arrow — there is a coordinate list behind it. Every formula in linear algebra is, essentially, a translation between these two sides.

Formal Definition

A vector space $V$ (over $\mathbb{R}$) is a set with two operations ${+}: V\times V\to V$ and ${\cdot}: \mathbb{R}\times V\to V$ satisfying eight axioms — associativity, commutativity, distributivity, zero element, additive inverse, and so on. The most common example is $\mathbb{R}^n = \{(x_1,\dots,x_n)\}$. Each symbol is a shadow of geometry: $+$ is the parallelogram law; $c\cdot v$ stretches the arrow by a factor of $c$ (and reverses it if $c<0$). The axioms are not pedantic — they exist so that the same theorems apply simultaneously to polynomial spaces, function spaces, and quantum state spaces.
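As a small illustration of the claim that polynomials obey the same rules as arrows and lists, the sketch below stores a polynomial as its list of coefficients. The `evaluate` helper and the two sample polynomials are hypothetical, chosen only for this example, and the checks cover just a couple of the axioms on one pair of vectors.

```python
import numpy as np

# Hypothetical example: represent a + b x + c x^2 by its coefficient list [a, b, c].
p = np.array([1.0, 2.0, 0.0])   # 1 + 2x
q = np.array([0.0, -1.0, 3.0])  # -x + 3x^2

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients (lowest degree first) at x."""
    return sum(c * x**k for k, c in enumerate(coeffs))

x0 = 1.5
# Adding coefficient lists is the same as adding the polynomials as functions.
sum_matches = np.isclose(evaluate(p + q, x0), evaluate(p, x0) + evaluate(q, x0))
# Scaling the list is the same as scaling the function.
scale_matches = np.isclose(evaluate(3.0 * p, x0), 3.0 * evaluate(p, x0))
# Two of the axioms, checked on this pair: commutativity and distributivity.
commutes = np.allclose(p + q, q + p)
distributes = np.allclose(2.0 * (p + q), 2.0 * p + 2.0 * q)
```

The list and the function are two views of one vector; the code only ever touches the list.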

Why It Is Beautiful

The triumph of abstraction: prove a theorem once, and you have a conclusion about physics, data, signals, and quantum states all at once. Discovering that superficially different things share the same structure is the central creed of the Bourbaki school and of modern mathematics. Linear algebra is its earliest and cleanest example: Euclidean geometry, list computation, and functional analysis are three rivers flowing into one sea. Hardy would say: this kind of "unifying the scattered" beauty is deeper than any formula.

Application + History & People

In AI, everything is a vector. A token in GPT-3 is embedded as a vector of 12288 dimensions; a 224×224 RGB ImageNet crop is a vector of $224 \times 224 \times 3 = 150528$ dimensions; an audio segment is a sequence of vectors. The intuition that "similar = cosine close, add/subtract = semantic operation" (king − man + woman ≈ queen) rests entirely on the algebra of vector spaces. In physics, a quantum state is a unit vector in a Hilbert space (a complex vector space with an inner product, often infinite-dimensional); superposition is just vector addition.

History: Hamilton invented quaternions in 1843 (a "4-D vector with multiplication"); Grassmann published his Ausdehnungslehre in 1844 introducing more general "extensive quantities" — astonishingly ahead of its time, almost no one understood it, and it was buried for half a century. Gibbs and Heaviside in the 1880s repackaged it into modern vector analysis, which physicists could finally use. The abstract definition of "vector space" itself had to wait until Peano wrote it down in 1888, and was popularized by Weyl in the 1920s; it gradually became the lingua franca of modern math.

Essence in one line: A vector is neither an arrow nor a list — it is the very act of "being addable and scalable."

Thinking Question

Treat all polynomials of degree less than 4, $a + bx + cx^2 + dx^3$, as a vector space. What is its dimension? What is its natural basis? Can you find a linear transformation (a map preserving addition and scaling) that does $f \mapsto f'$ (differentiation) on this space? Write down its matrix in the basis $\{1, x, x^2, x^3\}$, and you'll discover that differentiation is a matrix.

Matrix = Linear Transformation

Matrix as Linear Transformation · Geometry / Algebra

Linear Map

Intuition

Thinking of a matrix as "a square box of numbers" is the disaster of high school. Switch viewpoint: a matrix is a motion applied to space. This motion picks up all of $\mathbb{R}^2$ (or $\mathbb{R}^n$) and rearranges it according to some rule, but two rules cannot be broken: (1) the origin stays put; (2) any grid lines that started out parallel and evenly spaced must end up parallel and evenly spaced — no bending allowed. The set of all such motions is the set of "linear transformations," and each one can be recorded by a matrix.

The recording method is dead simple: write down where $\hat{i}=(1,0)$ lands as the first column; write down where $\hat{j}=(0,1)$ lands as the second column. The columns of a matrix are the destinations of the basis vectors. Once you see this, $M\mathbf{v}$ is no longer a formula — it asks: "If you carry the arrow $\mathbf{v}$ along with the motion $M$, where does it end up?" The answer is "the coordinates of $\mathbf{v}$, used as weights, summing the columns of $M$." Every matrix-multiplication formula grows out of this single sentence.
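The "columns are destinations" reading can be checked directly. In this minimal sketch the landing spots of $\hat{i}$ and $\hat{j}$ and the test vector are made-up values; the point is only that the library's product and the weighted sum of columns agree.

```python
import numpy as np

# Hypothetical motion: i-hat lands on (1, 2), j-hat lands on (-1, 1).
i_lands = np.array([1.0, 2.0])
j_lands = np.array([-1.0, 1.0])
M = np.column_stack([i_lands, j_lands])  # columns = destinations of the basis vectors

v = np.array([3.0, 2.0])

# Matrix-vector product, as the library computes it.
direct = M @ v
# The same thing read geometrically: 3 copies of column 1 plus 2 copies of column 2.
as_weighted_columns = v[0] * M[:, 0] + v[1] * M[:, 1]
```

Both expressions give $(1, 8)$: three steps along the new $\hat{i}$ plus two steps along the new $\hat{j}$.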

Formal Definition

A map $T: \mathbb{R}^n \to \mathbb{R}^m$ is linear if and only if for all $\mathbf{u},\mathbf{v}$ and scalars $c$, $T(\mathbf{u}+\mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$ and $T(c\mathbf{v}) = cT(\mathbf{v})$. Once you fix the standard basis $\{e_1,\dots,e_n\}$ of $\mathbb{R}^n$, the entire $T$ is determined by the $n$ values $T(e_j)$ — stack them as columns of an $m\times n$ matrix, and you have the matrix representation of $T$. The composition of two linear transformations corresponds to matrix multiplication: this is why the columns of $AB$ are $A$ applied to each column of $B$.
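A quick numerical sketch of "composition corresponds to matrix multiplication," using a quarter-turn rotation and a horizontal shear as example motions (both chosen arbitrarily for illustration):

```python
import numpy as np

theta = np.deg2rad(90)
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])  # counterclockwise quarter turn
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                        # horizontal shear

v = np.array([2.0, 1.0])

# "Shear first, then rotate" applied step by step ...
step_by_step = rotate @ (shear @ v)
# ... equals one application of the product matrix (the rightmost factor acts first).
composed = (rotate @ shear) @ v

# Column j of AB is A applied to column j of B.
AB = rotate @ shear
cols_match = all(np.allclose(AB[:, j], rotate @ shear[:, j]) for j in range(2))
```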

Why It Is Beautiful

This one translation — "matrix = motion" — fuses algebra and geometry. Abstract multiplication becomes composition of motions. $AB \neq BA$ is no longer mysterious: rotate-then-shear is different from shear-then-rotate. "Inverse matrix" is no longer a formula; it is "undo the motion." "Singular / determinant zero" is no longer a check condition; it is "the motion squashed space into a lower dimension" — information is lost, the action cannot be reversed. Nearly every abstract concept in linear algebra has a clean geometric counterpart. Few areas of math achieve such a perfect duality between algebra and geometry.
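The three geometric readings above (order matters, inverse as undo, singular as squashing) can each be seen on a 2×2 example. The matrices below are illustrative choices, not taken from any particular source.

```python
import numpy as np

rotate = np.array([[0.0, -1.0],
                   [1.0,  0.0]])   # 90-degree counterclockwise rotation
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])     # horizontal shear

# Order matters: the two composite motions are different matrices.
order_matters = not np.allclose(rotate @ shear, shear @ rotate)

# The inverse undoes the motion: applying it afterwards brings the vector home.
v = np.array([2.0, 1.0])
undone = np.linalg.inv(shear) @ (shear @ v)

# A singular motion: both basis vectors land on the same line, so the plane is flattened.
squash = np.array([[1.0, 2.0],
                   [2.0, 4.0]])
squash_det = np.linalg.det(squash)
# Two different inputs with the same output: the information needed to undo is gone.
same_output = np.allclose(squash @ np.array([2.0, 0.0]), squash @ np.array([0.0, 1.0]))
```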

Application + History & People

The 3D graphics pipeline: every game frame is a chain of 4×4 matrices carrying model coordinates into world space, then into camera space, then projecting onto the screen. An entire deep neural network can be restated as a chain of "matrix multiply + nonlinearity": each layer $\mathbf{h} = \sigma(W\mathbf{x}+\mathbf{b})$ has a $W$ that describes how this layer carries the input space. Transformer attention is also built from matrix multiplication: $QK^T$ gives the raw scores, a row-wise softmax of the scaled scores gives the attention distribution, and the resulting weights multiply $V$, giving $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$. AlphaFold translates the protein folding problem into a geometric transformation problem.
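To make the neural-network examples concrete, here is a minimal NumPy sketch of one dense layer and of scaled dot-product attention, assuming the standard form $\mathrm{softmax}(QK^T/\sqrt{d_k})\,V$. All sizes and the random inputs are arbitrary toy values, and the `relu` and `softmax` helpers are written out here rather than taken from a framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One dense layer h = sigma(W x + b); the sizes 4 -> 3 are arbitrary.
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
h = relu(W @ x + b)

# Scaled dot-product attention for 5 tokens with key dimension 8 (toy sizes).
n_tokens, d_k = 5, 8
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

scores = Q @ K.T / np.sqrt(d_k)      # raw similarity of every query with every key
weights = softmax(scores, axis=-1)   # each row is a probability distribution
output = weights @ V                 # each output row is a weighted average of rows of V
```

Apart from the two elementwise nonlinearities, every step is a matrix product.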

History: Arthur Cayley in 1858 first systematized "matrix algebra," treating a matrix as an independent object of study (not just shorthand for a system of equations). James Sylvester coined the name "matrix" (Latin for "womb," because it "gestates" the determinant). But the revolution of matrix thinking came in 1925 — Heisenberg, recovering on the island of Helgoland, invented the first form of quantum mechanics, "matrix mechanics." He did not even know what he wrote was a matrix; Born and Jordan recognized it for him. From that point on, matrices were not just a mathematical tool — they were the language of the universe.

Essence in one line: A matrix is not a grid of numbers; it is a motion on space — and its columns are the destinations of the basis vectors.

Thinking Question

Rotating $\mathbb{R}^2$ by 30° corresponds to a matrix $R$. Rotating by another 30° corresponds to applying $R$ again, i.e. $R^2$. If you combine the two steps into a single rotation by 60°, the corresponding matrix is just $R^2$. Without a lookup, use $R^2$ to derive $\cos 60°$ and $\sin 60°$. You will discover that the trigonometric identity $\cos(2\theta) = \cos^2\theta - \sin^2\theta$ is nothing more than the diagonal entry of a matrix product. The cornerstone theorem of trigonometry, it turns out, is a byproduct of linear algebra.

Eigenvector = Invariant Direction

Eigenvectors & Eigenvalues · Spectral Theory

Spectral Theory

Intuition

A matrix $A$, applied to most vectors, both stretches and rotates them — the arrow changes direction. But for a few very special directions, $A$ only stretches the vector (or compresses it, or flips its sign) without rotating it. The direction is preserved. Those special directions are the eigenvectors of $A$; the stretching factor is the corresponding eigenvalue.

Imagine you hold a ball of clay and squeeze it along some direction. The clay gets flattened, but points on the squeezing axis and on the two perpendicular axes only move closer to or farther from the origin; those three directions did not "tilt." Those three lines are the eigen-directions of that squeeze. Many linear transformations have a few of these "most natural" axes; finding them is seeing through to the essence of the transformation. 3Blue1Brown puts it: "Eigenvectors are the intrinsic face of a transformation."

Formal Definition

Let $A \in \mathbb{R}^{n\times n}$. If there exists a nonzero vector $\mathbf{v}$ and a scalar $\lambda$ such that

$$A\mathbf{v} = \lambda \mathbf{v}$$

then $\mathbf{v}$ is an eigenvector of $A$ and $\lambda$ is the corresponding eigenvalue. Each symbol maps directly to geometry: the left side $A\mathbf{v}$ is "carry $\mathbf{v}$ along through transformation $A$"; the right side $\lambda\mathbf{v}$ is "just stretch $\mathbf{v}$ along its own direction by a factor of $\lambda$." The equality means: that direction is respected by $A$. The standard way to find eigenvalues is to solve $\det(A-\lambda I)=0$. Why does this work? Because it asks: "Is there a nonzero vector squashed to zero by $A-\lambda I$?" — equivalently, "Is there a direction on which $A$ acts as exactly $\lambda$ times stretching?"
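The definition can be checked numerically. This sketch uses `np.linalg.eig` on a small symmetric matrix picked for illustration (its eigenvalues are 1 and 3) and verifies both $A\mathbf{v} = \lambda\mathbf{v}$ and $\det(A-\lambda I)=0$ up to rounding.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvectors are the columns

# Check A v = lambda v for each pair.
residuals = [np.linalg.norm(A @ eigenvectors[:, k] - eigenvalues[k] * eigenvectors[:, k])
             for k in range(2)]

# Each eigenvalue makes A - lambda I singular, i.e. det(A - lambda I) = 0.
dets = [np.linalg.det(A - lam * np.eye(2)) for lam in eigenvalues]
```

For this matrix the preserved directions are $(1,1)$, stretched by 3, and $(1,-1)$, left at its original length.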

Why It Is Beautiful

Eigenvectors reveal a deep fact: a diagonalizable linear transformation comes with its own coordinate system built in. If you switch to that coordinate system (the eigen-basis), the matrix becomes diagonal: all the apparent coupling vanishes, leaving only $n$ independent stretches. This is "diagonalization." It works whenever $A$ has $n$ linearly independent eigenvectors, which includes every real symmetric matrix. It is the standard tool for decomposing a high-dimensional coupled system into independent one-dimensional ones, the physicist's coveted "principal axes / normal modes."
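A minimal sketch of diagonalization for a matrix that is known to be diagonalizable (a symmetric 2×2 example, chosen for illustration). Changing to the eigen-basis leaves a diagonal matrix, and powers of $A$ reduce to powers of the individual eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, P = np.linalg.eig(A)   # columns of P form the eigen-basis

# Rewriting A in the eigen-basis leaves a diagonal matrix of eigenvalues.
D = np.linalg.inv(P) @ A @ P

# Payoff: powers of A reduce to powers of n independent numbers.
A_to_5_direct = np.linalg.matrix_power(A, 5)
A_to_5_via_D = P @ np.diag(eigenvalues**5) @ np.linalg.inv(P)
```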

What makes it even more beautiful: this single principle binds together a matrix (algebraic object), the intrinsic axes of a motion (geometric object), the natural frequencies of a vibrating system (physical object), and the steady-state distribution of a Markov chain (probabilistic object). One mathematical object, several disciplines.

Application + History & People

PageRank (the 1998 Google paper) at its core: model the entire web as a giant transition matrix $M$; the eigenvector belonging to the largest eigenvalue is the "importance" score of every page. PCA (Principal Component Analysis): compute eigenvectors of the data covariance matrix; the eigenvectors with the largest eigenvalues are the directions of largest variance, the foundation of dimensionality reduction and visualization. Quantum mechanics: the eigenvalues of the Hamiltonian operator $\hat{H}$ are the allowed energies, and the eigenvectors are energy eigenstates; the time-independent Schrödinger equation $\hat{H}\psi = E\psi$ is structurally an eigenvalue problem. Vibration analysis: the natural frequencies of buildings, bridges, and airplane wings are the square roots of the generalized eigenvalues of the stiffness and mass matrices. The famous 1940 Tacoma Narrows Bridge collapse is often told as an eigenmode driven into resonance by the wind; engineers now attribute it to aeroelastic flutter, a self-excited instability of the deck's torsional mode, but the motion that tore the bridge apart was still one of its eigenmodes.
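The PCA recipe mentioned above can be sketched in a few lines. The data here is synthetic (a cloud deliberately stretched along the direction $(1,1)$, with made-up spreads), so the "right answer" is known in advance; this is a bare-bones illustration, not a replacement for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along the direction (1, 1): hypothetical, for illustration only.
n = 2000
along = rng.normal(scale=3.0, size=n)    # large spread along (1, 1)/sqrt(2)
across = rng.normal(scale=0.5, size=n)   # small spread along (1, -1)/sqrt(2)
u = np.array([1.0, 1.0]) / np.sqrt(2)
w = np.array([1.0, -1.0]) / np.sqrt(2)
X = np.outer(along, u) + np.outer(across, w)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
variances, directions = np.linalg.eigh(cov)
first_pc = directions[:, -1]             # direction of largest variance

# Projecting onto the first principal component reduces 2-D data to 1-D.
X_reduced = X_centered @ first_pc
```

The sign of `first_pc` is arbitrary, so comparisons with the true direction should use the absolute value of the dot product.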

History: Euler had implicitly used principal-axis ideas in the 1750s while studying moments of inertia of rigid bodies. Cauchy formally introduced the characteristic equation in 1829, in his classification of quadric surfaces. The hybrid word "eigenvalue" comes from the German Eigenwert ("own value"), spread by Hilbert in his 1904 work on integral equations. The 20th-century development of spectral theory (Hilbert, von Neumann) became the mathematical foundation of quantum mechanics — a piece of mathematics laid out the language for a new physics before the physics had even arrived.

Essence in one line: Every transformation has a few axes it likes best; find them, and a complex motion decomposes into a few independent stretches.

Thinking Question

Does a pure rotation matrix on $\mathbb{R}^2$ (say 90°) have real eigenvalues? Trust your geometry first: rotation moves every direction, so no direction is preserved — it seems there should be none. But algebraically $\det(A-\lambda I)=0$ always has a solution (in the complex numbers). How can these two be reconciled? This points to a deep fact — why quantum mechanics cannot do without complex numbers.

Determinant = Volume Scaling Factor

Determinant as Volume Scale Factor · Geometry / Algebra

Determinant

Intuition

Take a unit square (side 1, area 1). Apply matrix $A$. It becomes a parallelogram. The signed area of that parallelogram is $\det A$. In $\mathbb{R}^3$, a unit cube becomes a parallelepiped, and its signed volume is $\det A$. The same holds in $\mathbb{R}^n$. The seemingly complicated "determinant formula" (with signs, cofactor expansions, the works) is just bookkeeping for this one geometric fact.

Where does the sign come from? If the transformation flips the orientation of space (mirror reflection — left hand becomes right hand), the determinant carries a negative sign. And $\det A = 0$? That means the unit square has been crushed into a line segment or a single point — the motion has collapsed at least one dimension, and the action is irreversible. This is the geometric soul behind "determinant zero ⇔ matrix singular ⇔ no unique solution to the system": information is gone, and there is no way to undo.

Formal Definition

For a $2\times 2$ matrix $A = \begin{pmatrix}a & b \\ c & d\end{pmatrix}$,

$$\det A = ad - bc$$

This is the standard area-of-a-parallelogram computation: $ad$ is the naive "base $\times$ height" estimate, and $bc$ is the correction for "tilt" that has to be subtracted off — together they carve out exactly the area of the figure spanned by the column vectors $(a,c)$ and $(b,d)$. The general $n\times n$ determinant can be defined by the Leibniz formula $\det A = \sum_\sigma \text{sgn}(\sigma) \prod_i a_{i,\sigma(i)}$, but the only geometric characterization worth memorizing is: "$\det A$ is the signed factor by which $A$ scales unit $n$-volume." All the properties ($\det(AB)=\det A \det B$, $\det A^{-1}=1/\det A$, $\det I = 1$) are logical consequences of that one geometric statement — total scaling of two motions = product of individual scalings; undoing = reciprocal; doing nothing = no change.
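A small numerical check of the area reading and of two of the listed properties. The `signed_area` helper and the matrices are illustrative choices made for this sketch.

```python
import numpy as np

def signed_area(col1, col2):
    """Signed area of the parallelogram spanned by two plane vectors."""
    return col1[0] * col2[1] - col2[0] * col1[1]

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # swaps the axes: a reflection, so orientation flips

area_A = signed_area(A[:, 0], A[:, 1])   # 3*2 - 1*1 = 5
det_A = np.linalg.det(A)
det_B = np.linalg.det(B)                 # -1: area preserved, orientation reversed

# Scaling factors multiply under composition and invert under undoing.
product_rule = np.isclose(np.linalg.det(A @ B), det_A * det_B)
inverse_rule = np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / det_A)
```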

Why It Is Beautiful

A matrix of $n^2$ numbers gets compressed to a single scalar, and that one scalar captures the most essential thing about the transformation — "is it invertible? how much geometric information capacity?" To collapse $n^2$ degrees of freedom into one and have it be the critical scalar — this kind of "compressed to the extreme yet retaining the soul" beauty is rare in mathematics.

A deeper layer: the determinant is the unique function satisfying three simple geometric properties (multilinearity, alternating, $\det I = 1$). The phenomenon of "a small number of plain axioms uniquely determine a complex formula" is a triumph of Bourbaki-style mathematics. Lockhart in A Mathematician's Lament uses similar examples to argue: "Math is not invented; it is squeezed out by an inescapable logic."

Application + History & People

The multivariable change-of-variables formula $\int f(\mathbf{y})\,d\mathbf{y} = \int f(\varphi(\mathbf{x}))\, |\det J_\varphi(\mathbf{x})|\,d\mathbf{x}$ has the Jacobian determinant as its local "scaling factor for tiny volumes"; this is the generalization of single-variable $du = u'(x)\,dx$ to many variables. In machine learning, normalizing flows (RealNVP/Glow) use the Jacobian determinant to track probability densities exactly through carefully designed invertible transformations: probability conservation in a generative model is, at heart, determinant bookkeeping. In physics, the Faddeev–Popov determinant in path integrals handles gauge invariance. In numerical linear algebra, the determinant is actually a poor singularity detector: its magnitude depends on scaling (for example, $0.1\,I$ in 100 dimensions has determinant $10^{-100}$ yet is perfectly invertible), and in floating point it easily underflows or overflows. Practical engineering prefers the condition number and singular values, a candid reminder that "a beautiful formula is not always the engineer's first pick."
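One way to see why the determinant makes an unreliable singularity detector is to compare it with the condition number on two hand-picked matrices. Both matrices are constructed for illustration: the first has a tiny determinant but is trivially invertible, the second has a harmless-looking determinant but is close to singular.

```python
import numpy as np

n = 100

# A perfectly well-behaved matrix: shrink every direction by a factor of 10.
well_conditioned = 0.1 * np.eye(n)
det_small = np.linalg.det(well_conditioned)    # about 1e-100, yet trivially invertible
cond_small = np.linalg.cond(well_conditioned)  # 1.0, the best possible value

# A nearly singular matrix whose determinant looks harmless.
nearly_singular = np.diag([1e8, 1e-8])
det_harmless = np.linalg.det(nearly_singular)  # about 1
cond_large = np.linalg.cond(nearly_singular)   # about 1e16
```

The condition number (ratio of largest to smallest singular value) ranks the two matrices the right way round; the determinant ranks them backwards.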

History: determinants actually predate matrices. The Japanese mathematician Seki Takakazu in 1683 used 3×3 determinants in his Method of Solving Hidden Problems; ten years later, in 1693, Leibniz independently described them in a letter to l'Hôpital, as a criterion for when a linear system has a solution. Cauchy in 1812 systematized the term "determinant" and the modern theory. The irony: matrices as standalone objects (Cayley 1858) came some 175 years after Seki; what people first cared about was a scalar test for solvability, and "matrix as a whole thing" was a later abstraction.

Essence in one line: The determinant is not a formula; it is "by how much did the motion scale space's volume" — crushed to 0 means uninvertible.

3Blue1Brown's Take

"If you only remember one thing about determinants — think of it as what $A$ does to areas/volumes. $\det A = 7$ means any region's area becomes 7 times bigger after $A$ acts; $\det A = -3$ means area triples and space is flipped; $\det A = 0$ means a dimension has collapsed." — Grant Sanderson

Thinking Question

If a $3\times 3$ matrix $A$ has three columns that are pairwise orthogonal (perpendicular) and each is of unit length (such a matrix is called orthogonal), what must $\det A$ be geometrically? Why can it only be $\pm 1$? These two values correspond to two kinds of "motion" — can you give each of them a direct intuitive name? (Hint: one preserves orientation; one flips it.)

Further Resources

3Blue1Brown · Essence of Linear Algebra (15 episodes; the geometric viewpoint's bible — strongly recommended before any textbook)
Sheldon Axler · Linear Algebra Done Right (deliberately avoids determinants until chapter 10 — a counter-traditional but beautiful pedagogical experiment)
Gilbert Strang · Introduction to Linear Algebra + MIT OCW 18.06 (the most beloved undergraduate course, applications-oriented)
Carl Meyer · Matrix Analysis and Applied Linear Algebra (the engineering bible — PageRank, PCA, all of it)
Page, Brin, Motwani, Winograd (1999) · The PageRank Citation Ranking (how an eigenvector made Google a trillion-dollar company)
Heisenberg (1925) · Über quantentheoretische Umdeutung kinematischer und mechanischer Beziehungen (the matrix-mechanics paper born on Helgoland)
Terence Tao · multiple notes on matrix analysis and random matrices on the What's new blog
Tristan Needham · Visual Differential Geometry and Forms (geometric viewpoint extended to differential forms; determinants as volume elements)

Deeper Reflections

1. Why do neural networks need "non-linearity"? What happens if you remove all activation functions?

Any stack of pure linear layers $W_n W_{n-1} \cdots W_1 \mathbf{x}$ is equivalent to a single matrix $W = W_n\cdots W_1$ applied once — no matter how deep, it is still one linear transformation. The expressive power of a linear transformation is fixed: it can do at most one finite-dimensional stretch/rotate/project, and cannot represent any "curved" decision boundary. Activation functions (ReLU, sigmoid) act by introducing a "fold" between layers, allowing stacked layers to achieve arbitrarily complex non-linear expressiveness. So "depth" matters only because the alternation of linear and non-linear composes — pure matrices alone cannot express the world.
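The collapse of stacked linear layers is easy to verify. In this sketch the layer shapes and random weights are arbitrary, and the last lines show the one property ReLU lacks (additivity), which is what keeps a stack with activations from collapsing the same way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three linear layers with arbitrary toy shapes 4 -> 5 -> 3 -> 2 (no biases, no activations).
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
W3 = rng.normal(size=(2, 3))
x = rng.normal(size=4)

deep_output = W3 @ (W2 @ (W1 @ x))

# The whole stack is one matrix applied once.
W = W3 @ W2 @ W1
collapsed_output = W @ x

def relu(z):
    return np.maximum(z, 0.0)

# ReLU is not linear: it fails additivity, so no single matrix can stand in for it.
a = np.array([1.0, -1.0])
b = np.array([-2.0, 2.0])
relu_is_additive = np.allclose(relu(a + b), relu(a) + relu(b))  # False
```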

2. Why does quantum mechanics use Hilbert spaces (complex vector spaces) and not reals?

A real rotation matrix of the plane (by any angle other than 0° or 180°) has no real eigenvalues. But physically, "measurement results" must be real numbers, and the action of an observable on superposed states must be expressible as "stretching along certain directions," which requires eigenvalue structure to exist. Hermitian matrices (complex matrices with $A^\dagger = A$) happen to have all-real eigenvalues and eigenvectors that can be chosen mutually orthogonal, a perfect match for "measurable physical quantities." At the same time, complex numbers record the phase information behind interference (the heart of the double-slit experiment). So complex numbers are not a convenience; they are the language forced on us by physical structure, a deep example of "the structure of linear algebra ↔ the structure of physics."
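Both halves of this argument can be seen numerically. The rotation is the 90° case, and the Hermitian matrix `H` is an arbitrary 2×2 example made up for this sketch (its eigenvalues happen to be 1 and 4).

```python
import numpy as np

# 90-degree rotation of the plane: no real direction is preserved.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
rotation_eigs = np.linalg.eigvals(R)   # the complex pair +i and -i

# A Hermitian matrix (equal to its own conjugate transpose); entries are arbitrary.
H = np.array([[2.0, 1.0 - 1.0j],
              [1.0 + 1.0j, 3.0]])
is_hermitian = np.allclose(H, H.conj().T)

# eigh assumes Hermitian input and returns real eigenvalues with orthonormal eigenvectors.
energies, states = np.linalg.eigh(H)
orthonormal = np.allclose(states.conj().T @ states, np.eye(2))
```

Over the complex numbers the rotation does have eigenvalues; they are just not real, which matches the geometric intuition that no real direction survives.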

3. Why does PageRank work? How does this connect to "democratic voting" at a deep level?

PageRank assumes a user randomly clicks links, and asks: in the steady state, what is the probability of being on each page? This is a Markov chain; the steady-state distribution is the eigenvector of the transition matrix for eigenvalue 1. The beauty lies in the self-consistency of its recursive definition: "an important page = a page linked to by important pages." That sounds circular, but the eigenvector equation $M\mathbf{v}=\mathbf{v}$ translates "circular" into "fixed point," and gives it a precise solution. The link to democratic voting is that each hyperlink acts as a vote, with votes from important pages weighted more heavily; this is an analogy rather than a theorem, and results such as Arrow's impossibility theorem about aggregating everyone's rankings are not themselves eigenvalue statements. Many "recursive prestige" systems (academic citations, social influence) share the same mathematical skeleton.
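The fixed-point idea can be sketched with power iteration on a tiny made-up web. The four-page link graph is hypothetical, and the damping factor 0.85 is the value commonly quoted for PageRank; this is a toy illustration of the eigenvector equation, not the production algorithm.

```python
import numpy as np

# Hypothetical 4-page web. links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = len(links)

# Column-stochastic transition matrix: M[i, j] = probability of moving from page j to page i.
M = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        M[i, j] = 1.0 / len(targets)

# Damping mixes in random jumps, which makes the steady state unique.
damping = 0.85
G = damping * M + (1.0 - damping) / n * np.ones((n, n))

# Power iteration: keep applying G until the distribution stops changing.
rank = np.full(n, 1.0 / n)
for _ in range(200):
    rank = G @ rank
```

After the loop, `rank` satisfies `G @ rank == rank` up to rounding: it is the eigenvector for eigenvalue 1. Page 3, which nobody links to, ends up with only the random-jump share $0.15/4$.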

4. Does the determinant still have geometric meaning in, say, 100 dimensions?

Yes, but physical intuition breaks. 100-dimensional "volume" is a completely legitimate mathematical concept: it is the Lebesgue measure of the parallelepiped spanned by 100 vectors. In probability theory, the density formula for a multivariate Gaussian contains $\frac{1}{\sqrt{(2\pi)^n \det\Sigma}}$, and this "high-dimensional ellipsoidal volume" shows up in uncertainty modeling in ML, denoising in diffusion models, and the ELBO in variational inference. But a counterintuitive fact: the determinant is a product of $n$ eigenvalues, so in high dimensions it is typically exponentially small or exponentially large in $n$, and for "almost all" random matrices it rushes toward 0 or ∞ unless the scaling is tuned exactly (the spread of the singular values themselves is described by results such as the Marchenko–Pastur law). This is why ML typically uses the log-determinant $\log\det$ rather than $\det$ directly: a numerical-stability issue reflecting genuine peculiarity in high-dimensional geometry.
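The case for $\log\det$ shows up already on a very ordinary covariance matrix. In this sketch `Sigma` is a made-up 500-dimensional example with variance 0.1 in every direction; `np.linalg.slogdet` returns the sign and the logarithm of the absolute determinant separately.

```python
import numpy as np

n = 500
Sigma = 0.1 * np.eye(n)   # an ordinary covariance: variance 0.1 in every direction

naive_det = np.linalg.det(Sigma)           # 0.1**500 underflows to 0.0 in float64
sign, logdet = np.linalg.slogdet(Sigma)    # sign = 1.0, logdet = 500 * log(0.1)

# Log of the Gaussian normalizing constant: -0.5 * (n log(2 pi) + log det Sigma).
log_normalizer = -0.5 * (n * np.log(2.0 * np.pi) + logdet)
```

The naive determinant reports 0 for a matrix that is perfectly invertible, while the log-determinant stays a moderate, finite number.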

5. Is there such a thing as "nonlinear algebra"? Why do we know so little about it?

Yes. It is called "Nonlinear Algebra" (championed by Sturmfels and others), studying polynomial systems, tensor decompositions, and algebraic geometry in data science. But it is far harder than linear algebra: the soul of linear algebra is "addition + scaling → everything is determined by a basis," and that structure lets us fully describe any linear transformation of $\mathbb{R}^n$ with $n^2$ numbers. The nonlinear world has no such compact description: a single quadratic form on $n$ variables already has $\binom{n+1}{2}$ coefficients, a quadratic map from $\mathbb{R}^n$ to $\mathbb{R}^n$ has $n$ times as many, and the count for degree $d$ grows on the order of $n^d$. This is why ML chose the path of "linear + simple nonlinear (ReLU) + many layers": composing many simple structures is easier to optimize and analyze than composing a few complex ones. Deep learning, in this sense, is the engineering triumph of "we only know how to use linear algebra, so let's use a lot of it."