CHAPTER 03 · THE PATCH

A small knob that changes everything.

With 493 features and only 203 training countries, plain OLS is a lost cause. Ridge, Lasso and Elastic Net each add one parameter — the regularisation strength \(\alpha\) — that pushes back against oversized coefficients. Sweep \(\alpha\) on a log scale and the test \(R^2\) climbs out of the hole. Lasso goes one step further and silences most of the features outright. One knob. Whole different model.
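
A minimal sketch of that sweep in scikit-learn. The data here is a synthetic stand-in with the same \(p > n\) flavour as the 493-feature, 203-country setup, not the chapter's actual table, so read the numbers as shape, not result.

```python
# Sweep alpha on a log scale for Ridge and Lasso; synthetic p > n data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=250, n_features=493, n_informative=25,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=203,
                                                    random_state=0)

alphas = np.logspace(-3, 3, 13)          # the one knob, swept on a log scale
models = {"Ridge": lambda a: Ridge(alpha=a),
          "Lasso": lambda a: Lasso(alpha=a, max_iter=10_000)}
for name, make in models.items():
    scores = [make(a).fit(X_train, y_train).score(X_test, y_test) for a in alphas]
    best = int(np.argmax(scores))
    print(f"{name:>5}: best test R^2 = {scores[best]:.3f} at alpha = {alphas[best]:.3g}")
```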

The three loss functions

OLS picks the coefficients that minimise squared error. When \(p > n\), infinitely many coefficient vectors achieve the same minimum, and the one OLS lands on is whichever the linear-algebra routine happened to prefer. Each model below adds a penalty term that breaks the tie, favouring smaller or sparser coefficients. The fit becomes a choice, not an accident.
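
A tiny numerical illustration of that tie, with made-up numbers: any null-space direction of \(X\) can be added to a solution without changing the fit, and NumPy's lstsq simply hands back the minimum-norm candidate.

```python
# With more features than samples, many coefficient vectors fit the training
# data exactly; np.linalg.lstsq returns the minimum-norm one.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))          # n = 5 samples, p = 12 features
y = rng.normal(size=5)

beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Add anything from the null space of X and the fit is still perfect.
null_dir = np.linalg.svd(X)[2][-1]    # a direction X maps to (numerically) zero
beta_other = beta_min_norm + 3.0 * null_dir

print(np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))  # True True
print(np.linalg.norm(beta_min_norm) < np.linalg.norm(beta_other))         # True
```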

the three regularisers, in one breath

Ridge shrinks every coefficient a little. Lasso shrinks some coefficients all the way to zero, so features disappear. Elastic Net mixes the two. The single knob \(\alpha\) controls how hard the penalty pushes.

Ridge · \(\ell_2\) penalty
\[ \hat\beta_{\text{ridge}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \|\beta\|_2^2 \right) \]

Penalises the sum of squared coefficients. Shrinks everything toward zero, nothing exactly to zero. Useful when you suspect many features each carry a little signal.
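
A quick check of that behaviour on synthetic data (an assumption for illustration, not the chapter's dataset): coefficients shrink as \(\alpha\) grows, but the count of exact zeros stays at zero.

```python
# Ridge coefficients shrink with alpha but never land exactly at zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)
for a in (0.1, 10.0, 1000.0):
    coef = Ridge(alpha=a).fit(X, y).coef_
    print(f"alpha={a:>7}: max |beta| = {np.abs(coef).max():8.2f}, "
          f"exact zeros = {(coef == 0).sum()}")
# max |beta| falls as alpha rises; the zero count stays at 0.
```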

Lasso · \(\ell_1\) penalty
\[ \hat\beta_{\text{lasso}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \|\beta\|_1 \right) \]

Penalises the sum of absolute coefficients. The \(\ell_1\) geometry forces many coefficients exactly to zero, turning Lasso into a feature selector and not just a shrinker.
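
The same kind of synthetic check, again assumed purely for illustration: raise \(\alpha\) and whole coefficients vanish.

```python
# Lasso drives more and more coefficients to exactly zero as alpha grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
for a in (0.01, 1.0, 10.0):
    coef = Lasso(alpha=a, max_iter=50_000).fit(X, y).coef_
    print(f"alpha={a:>5}: {np.count_nonzero(coef)} of {coef.size} coefficients survive")
```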

Elastic Net · blended penalty
\[ \hat\beta_{\text{enet}} = \arg\min_\beta \left( \| y - X\beta \|_2^2 + \alpha \rho \|\beta\|_1 + \alpha (1-\rho) \|\beta\|_2^2 \right) \]

A convex blend of both penalties (here \(\rho = 0.5\)). Keeps Lasso’s sparsity and Ridge’s stability when features are correlated.
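
If you reach for scikit-learn, the blend is exposed as ElasticNet(alpha=..., l1_ratio=...), with l1_ratio playing the role of \(\rho\) above; note that scikit-learn also rescales the loss and the \(\ell_2\) term by constant factors, so its \(\alpha\) values do not map one-to-one onto the formulas here. The sketch below uses synthetic data, purely for illustration.

```python
# More l1 in the mix (higher l1_ratio) typically means fewer surviving coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
for rho in (0.1, 0.5, 0.9):
    coef = ElasticNet(alpha=1.0, l1_ratio=rho, max_iter=100_000).fit(X, y).coef_
    print(f"l1_ratio={rho}: {np.count_nonzero(coef)} nonzero of {coef.size}")
```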

why \(\ell_1\) zeroes things out and \(\ell_2\) doesn’t

The constrained-optimisation view: minimising squared error plus a penalty is equivalent to minimising squared error subject to a constraint \(\|\beta\|_q \le t\). The \(\ell_2\) ball is a smooth sphere — a tangent point between the squared-error contours (ellipses) and the sphere almost never lies on an axis, so no coefficient is exactly zero. The \(\ell_1\) ball is a polytope whose vertices sit on the coordinate axes; a tangent point often is a vertex, which zeroes out whichever coordinates that vertex lacks. Formally, the Lasso solution \(\hat\beta\) satisfies the stationarity condition
\[ 2X^\top (X\hat\beta - y) + \alpha s = 0, \quad s_j \in \begin{cases} \{\operatorname{sign}(\hat\beta_j)\} & \hat\beta_j \neq 0 \\ [-1, 1] & \hat\beta_j = 0 \end{cases} \]
where the subgradient \(s_j\) of \(|\beta_j|\) can take any value in \([-1, 1]\) when \(\hat\beta_j = 0\); this slack is exactly what lets Lasso park a coordinate at zero rather than push through.
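
A one-dimensional toy makes the slack concrete. For the scalar problem \(\min_b (z-b)^2 + \alpha|b|\) (a stand-in for one coordinate, not the full regression), the minimiser is the soft-threshold \(\operatorname{sign}(z)\max(|z| - \alpha/2, 0)\): whenever \(|z| \le \alpha/2\) the subgradient absorbs the gradient and the answer is exactly zero. The sketch below checks the closed form against a brute-force grid.

```python
# One-coordinate lasso: closed-form soft-threshold vs brute-force grid search.
import numpy as np

def soft_threshold(z, alpha):
    return np.sign(z) * max(abs(z) - alpha / 2.0, 0.0)

alpha = 2.0
grid = np.linspace(-5, 5, 200_001)
for z in (3.0, 0.7, -0.4):
    objective = (z - grid) ** 2 + alpha * np.abs(grid)
    brute = grid[np.argmin(objective)]
    print(f"z={z:+.1f}:  closed form {soft_threshold(z, alpha):+.3f}   "
          f"grid search {brute:+.3f}")
# z=+3.0 -> +2.0 (shrunk by alpha/2); z=+0.7 and z=-0.4 -> 0.0 (parked at zero).
```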

[Interactive figure: regularisation sweep]