CHAPTER 03 · THE PATCH

A small knob that changes everything.

With 493 features and only 203 training countries, plain OLS is a lost cause. Ridge, Lasso and Elastic Net each add one parameter — the regularisation strength \(\alpha\) — that pushes back against oversized coefficients. Sweep \(\alpha\) on a log scale and the test \(R^2\) climbs out of the hole. Lasso goes one step further and silences most of the features outright. One knob. Whole different model.
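
A minimal sketch of that sweep in scikit-learn. The data here is a synthetic stand-in with the same \(p > n\) flavour as the 493-feature, 203-country setup, not the chapter's actual table, so read the numbers as shape, not result.

```python
# Sweep alpha on a log scale for Ridge and Lasso; synthetic p > n data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=250, n_features=493, n_informative=25,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=203,
                                                    random_state=0)

alphas = np.logspace(-3, 3, 13)          # the one knob, swept on a log scale
models = {"Ridge": lambda a: Ridge(alpha=a),
          "Lasso": lambda a: Lasso(alpha=a, max_iter=10_000)}
for name, make in models.items():
    scores = [make(a).fit(X_train, y_train).score(X_test, y_test) for a in alphas]
    best = int(np.argmax(scores))
    print(f"{name:>5}: best test R^2 = {scores[best]:.3f} at alpha = {alphas[best]:.3g}")
```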

The three loss functions

OLS picks the coefficients that minimise squared error. When \(p > n\), infinitely many coefficient vectors achieve the same minimum, and the one OLS lands on is whichever the linear-algebra routine happened to prefer. Each model below adds a penalty term that breaks the tie, favouring smaller or sparser coefficients. The fit becomes a choice, not an accident.
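
A tiny numerical illustration of that tie, with made-up numbers: any null-space direction of \(X\) can be added to a solution without changing the fit, and NumPy's lstsq simply hands back the minimum-norm candidate.

```python
# With more features than samples, many coefficient vectors fit the training
# data exactly; np.linalg.lstsq returns the minimum-norm one.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))          # n = 5 samples, p = 12 features
y = rng.normal(size=5)

beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Add anything from the null space of X and the fit is still perfect.
null_dir = np.linalg.svd(X)[2][-1]    # a direction X maps to (numerically) zero
beta_other = beta_min_norm + 3.0 * null_dir

print(np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))  # True True
print(np.linalg.norm(beta_min_norm) < np.linalg.norm(beta_other))         # True
```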

the three regularisers, in one breath

Ridge shrinks every coefficient a little. Lasso shrinks some coefficients all the way to zero, so features disappear. Elastic Net mixes the two. The single knob \(\alpha\) controls how hard the penalty pushes.

Ridge · \(\ell_2\) penalty
\[ \hat\beta_{\text{ridge}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \|\beta\|_2^2 \right) \]

Penalises the sum of squared coefficients. Shrinks everything toward zero, nothing exactly to zero. Useful when you suspect many features each carry a little signal.
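
A quick check of that behaviour on synthetic data (an assumption for illustration, not the chapter's dataset): coefficients shrink as \(\alpha\) grows, but the count of exact zeros stays at zero.

```python
# Ridge coefficients shrink with alpha but never land exactly at zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)
for a in (0.1, 10.0, 1000.0):
    coef = Ridge(alpha=a).fit(X, y).coef_
    print(f"alpha={a:>7}: max |beta| = {np.abs(coef).max():8.2f}, "
          f"exact zeros = {(coef == 0).sum()}")
# max |beta| falls as alpha rises; the zero count stays at 0.
```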

Lasso · \(\ell_1\) penalty
\[ \hat\beta_{\text{lasso}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \|\beta\|_1 \right) \]

Penalises the sum of absolute coefficients. The \(\ell_1\) geometry forces many coefficients exactly to zero, turning Lasso into a feature selector and not just a shrinker.
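
The same kind of synthetic check, again assumed purely for illustration: raise \(\alpha\) and whole coefficients vanish.

```python
# Lasso drives more and more coefficients to exactly zero as alpha grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
for a in (0.01, 1.0, 10.0):
    coef = Lasso(alpha=a, max_iter=50_000).fit(X, y).coef_
    print(f"alpha={a:>5}: {np.count_nonzero(coef)} of {coef.size} coefficients survive")
```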

Elastic Net · blended penalty
\[ \hat\beta_{\text{enet}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \rho \|\beta\|_1 + \alpha (1-\rho) \|\beta\|_2^2 \right) \]

A convex blend of both penalties (here \(\rho = 0.5\)). Keeps Lasso’s sparsity and Ridge’s stability when features are correlated.
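
If you reach for scikit-learn, the blend is exposed as ElasticNet(alpha=..., l1_ratio=...), with l1_ratio playing the role of \(\rho\) above; note that scikit-learn also rescales the loss and the \(\ell_2\) term by constant factors, so its \(\alpha\) values do not map one-to-one onto the formulas here. The sketch below uses synthetic data, purely for illustration.

```python
# More l1 in the mix (higher l1_ratio) typically means fewer surviving coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
for rho in (0.1, 0.5, 0.9):
    coef = ElasticNet(alpha=1.0, l1_ratio=rho, max_iter=100_000).fit(X, y).coef_
    print(f"l1_ratio={rho}: {np.count_nonzero(coef)} nonzero of {coef.size}")
```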

why \(\ell_1\) zeroes things out and \(\ell_2\) doesn’t

The constrained-optimisation view: minimising squared error plus a penalty is equivalent to minimising squared error subject to a constraint \(\|\beta\|_q \le t\). The \(\ell_2\) ball is a smooth sphere — a tangent point between the squared-error contours (ellipses) and the sphere almost never lies on an axis, so no coefficient is exactly zero. The \(\ell_1\) ball is a polytope whose vertices sit on the coordinate axes; a tangent point often is a vertex, which zeroes out whichever coordinates that vertex lacks. Formally, the Lasso solution \(\hat\beta\) satisfies the stationarity condition
\[ 2X^\top (X\hat\beta - y) + \alpha s = 0, \quad s_j \in \begin{cases} \{\operatorname{sign}(\hat\beta_j)\} & \hat\beta_j \neq 0 \\ [-1, 1] & \hat\beta_j = 0 \end{cases} \]
where the subgradient \(s_j\) of \(|\beta_j|\) can take any value in \([-1, 1]\) when \(\hat\beta_j = 0\); this slack is exactly what lets Lasso park a coordinate at zero rather than push through.
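
A one-dimensional toy makes the slack concrete. For the scalar problem \(\min_b (z-b)^2 + \alpha|b|\) (a stand-in for one coordinate, not the full regression), the minimiser is the soft-threshold \(\operatorname{sign}(z)\max(|z| - \alpha/2, 0)\): whenever \(|z| \le \alpha/2\) the subgradient absorbs the gradient and the answer is exactly zero. The sketch below checks the closed form against a brute-force grid.

```python
# One-coordinate lasso: closed-form soft-threshold vs brute-force grid search.
import numpy as np

def soft_threshold(z, alpha):
    return np.sign(z) * max(abs(z) - alpha / 2.0, 0.0)

alpha = 2.0
grid = np.linspace(-5, 5, 200_001)
for z in (3.0, 0.7, -0.4):
    objective = (z - grid) ** 2 + alpha * np.abs(grid)
    brute = grid[np.argmin(objective)]
    print(f"z={z:+.1f}:  closed form {soft_threshold(z, alpha):+.3f}   "
          f"grid search {brute:+.3f}")
# z=+3.0 -> +2.0 (shrunk by alpha/2); z=+0.7 and z=-0.4 -> 0.0 (parked at zero).
```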

[Interactive figure: regularisation sweep]