Day one on the job. You have 493 columns and one target, and no one has told you which
columns matter. You do the obvious thing: you rank them by correlation with GDP. We
cheat on your behalf and colour each column by its true role — causal,
spurious, incidental — so you can see what correlation was
always going to hide. Scatter, rank, then quiz yourself. You will lose.
Featured examples
A handful of picks from each role to anchor the intuition before the ranking chart.
The causal column lists features with a
plausible mechanism for national wealth; the incidental
column lists statistics that move with wealth but don’t cause it; the
spurious column lists features whose absurdity
is the point.
Correlations with \(\log(\text{GDP})\), stratified by role
All 493 features plotted on the same \(|r|\)-axis, one horizontal band per role.
The causal and incidental
swarms bleed into each other; the spurious swarm reaches
surprisingly far to the right — its strongest feature outranks roughly half of the causal
ones. Use the toggle below if you would rather see the ranked-bar views
instead.
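If you want to reproduce the banding by hand, here is a minimal sketch in plain Python. Everything in it is invented for illustration: the six GDP values, the three feature names, and their role labels are hypothetical stand-ins, not rows from the dataset. It computes \(|r|\) against \(\log(\text{GDP})\) for each feature and sorts within each role, which is roughly the computation the chart implies.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: six made-up countries, three made-up features.
gdp = [1_000, 4_000, 9_000, 20_000, 45_000, 60_000]
features = {
    # name: (role assigned by an imagined codebook, one value per country)
    "years_of_schooling": ("causal", [3, 6, 8, 11, 13, 12]),
    "cars_per_1000_people": ("incidental", [5, 40, 90, 300, 550, 600]),
    "letters_in_capital_name": ("spurious", [6, 9, 5, 8, 6, 10]),
}

# Correlate against log(GDP), as the chart heading does.
log_gdp = [math.log(g) for g in gdp]

# One band per role; each entry is (|r|, feature name).
by_role = defaultdict(list)
for name, (role, values) in features.items():
    by_role[role].append((abs(pearson_r(values, log_gdp)), name))

for role in ("causal", "incidental", "spurious"):
    for abs_r, name in sorted(by_role[role], reverse=True):
        print(f"{role:<11} {name:<26} |r| = {abs_r:.2f}")
```

Note what the role label does in this sketch: nothing. It is attached by hand and never enters the arithmetic, which is why the bands are free to overlap.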
Scatter explorer
Pick a feature and see it plotted against GDP per capita. Points are coloured by role,
so you can watch a convincing-looking line emerge from a spurious column and a messy
cloud appear around a causal one. Toggle log scales when the distribution is skewed.
Guess the role
Your turn. We hand you a feature and its scatter. You tell us what the codebook
thinks it is: causal, spurious, or incidental. The answer flips up with a one-line
justification. Keep score.
Takeaway
Correlation never told you what you thought it was telling you. Incidental
columns match causal ones shoulder for shoulder. The strongest of 333 spurious
features still lands at \(|r| \approx 0.47\), not because it means anything but because
hundreds of random shots at a target will put a few near the bullseye. Next chapter: you stop
looking and start fitting. It goes poorly.
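The random-shots claim is easy to check for yourself. The sketch below draws 333 columns of pure Gaussian noise, correlates each with a noise target, and compares the best \(|r|\) with the median one. Two things here are assumptions, not facts about the dataset: `N_ROWS = 40` is a made-up row count (the page never states how many rows it has), and real spurious features are not independent Gaussian noise. Treat the number it prints as an illustration of the mechanism, not a reproduction of the 0.47.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def abs_r_of_noise(n_rows, n_features, rng):
    """Return |r| between a noise target and each of n_features noise columns."""
    target = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    results = []
    for _ in range(n_features):
        column = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        results.append(abs(pearson_r(column, target)))
    return results


N_ROWS = 40        # assumption: hypothetical row count, not taken from the dataset
N_SPURIOUS = 333   # number of spurious features quoted in the text

rng = random.Random(0)  # fixed seed so reruns give the same draw
noise_rs = abs_r_of_noise(N_ROWS, N_SPURIOUS, rng)

median_r = sorted(noise_rs)[N_SPURIOUS // 2]
print(f"median |r| over {N_SPURIOUS} noise columns: {median_r:.2f}")
print(f"max    |r| over {N_SPURIOUS} noise columns: {max(noise_rs):.2f}")
```

Shrink `N_SPURIOUS` to 1 and the maximum collapses to a single ordinary draw; the impressive value comes from the number of attempts, not from any one column.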