CHAPTER 01 · THE BAIT

What predicts GDP?

Day one on the job. You have 493 columns and one target, and no one has told you which columns matter. You do the obvious thing: you rank them by correlation with GDP. We cheat on your behalf and colour each column by its true role — causal, spurious, incidental — so you can see what correlation was always going to hide. Scatter, rank, then quiz yourself. You will lose.
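The obvious thing fits in a few lines. Here is a minimal sketch, assuming the features sit in a numeric array with one column per feature and no constant columns. The function name `rank_by_abs_correlation` and the toy columns (`driver`, `noise_a`, ...) are ours for illustration; the data is synthetic, not the chapter's dataset.

```python
import numpy as np

def rank_by_abs_correlation(X, y, names):
    """Return (name, r) pairs sorted by |Pearson r| with y, strongest first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # centre every column
    yc = y - y.mean()                # centre the target
    # Pearson r per column; a constant column would divide by zero here.
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    order = np.argsort(-np.abs(r))   # sort by |r|, descending
    return [(names[i], float(r[i])) for i in order]

# Synthetic toy data: one column that really drives y, three pure-noise columns.
rng = np.random.default_rng(0)
n = 200
driver = rng.normal(size=n)
y = 2.0 * driver + rng.normal(size=n)
X = np.column_stack([driver, rng.normal(size=(n, 3))])
names = ["driver", "noise_a", "noise_b", "noise_c"]

ranking = rank_by_abs_correlation(X, y, names)
for name, r in ranking:
    print(f"{name:>8}  r = {r:+.3f}")
```

In a toy like this the ranking looks trustworthy, because there is one real signal and only three noise columns. The chapter's point is what happens when the noise columns number in the hundreds.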

Legend: causal · spurious · incidental

Correlations with \(\log(\text{GDP})\), stratified by role

All 493 features are plotted on the same \(|r|\)-axis, one horizontal band per role. The causal and incidental swarms bleed into each other; the spurious swarm reaches surprisingly far to the right, and its strongest feature outranks roughly half of the causal ones. Use the toggle below to switch to the ranked-bar views instead.

Scatter explorer

Pick a feature and see it plotted against GDP per capita. Points are coloured by role, so you can watch a convincing-looking line emerge from a spurious column and a messy cloud appear around a causal one. Toggle log scales when the distribution is skewed.

Guess the role

Your turn. We hand you a feature and its scatter. You tell us what the codebook thinks it is: causal, spurious, or incidental. The answer flips up with a one-line justification. Keep score.

Takeaway

Correlation never told you what you thought it was telling you. Incidental columns match causal ones shoulder for shoulder. The strongest of 333 spurious features still lands at \(|r| \approx 0.47\), not because it means anything, but because hundreds of random shots at a target will put a few near the bullseye. Next chapter: you stop looking and start fitting. It goes poorly.
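You can watch the random-shots effect in a simulation. This is a sketch under stated assumptions, not a reconstruction of the chapter's data: features and target are independent Gaussian noise, and the row count `N_ROWS = 50` is a placeholder, because the chapter does not say how many observations sit behind each correlation. The helper `max_abs_r_of_noise` is a name we made up for illustration.

```python
import numpy as np

def max_abs_r_of_noise(n_rows, n_features, n_trials=200, seed=0):
    """For each trial, draw a noise target and n_features noise columns,
    then record the largest |Pearson r| among those columns."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_trials)
    for t in range(n_trials):
        y = rng.normal(size=n_rows)
        X = rng.normal(size=(n_rows, n_features))
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        maxima[t] = np.abs(r).max()
    return maxima

N_ROWS = 50  # assumption: the chapter does not state the number of rows

maxima_one = max_abs_r_of_noise(N_ROWS, n_features=1)
maxima_many = max_abs_r_of_noise(N_ROWS, n_features=333)

print(f"median best |r|, 1 noise column:    {np.median(maxima_one):.2f}")
print(f"median best |r|, 333 noise columns: {np.median(maxima_many):.2f}")
```

No column here has any relationship to the target, yet the best of 333 is reliably far stronger than a single noise column. How far depends on the row count: the typical size of a chance correlation shrinks roughly like \(1/\sqrt{n}\), so change `N_ROWS` and the bullseye moves.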