Monday, May 19, 2014

Nonsense! Correlations!

In 1926, G. Udny Yule published a paper in the Journal of the Royal Statistical Society titled "Why do we Sometimes get Nonsense-Correlations between Time-Series?--A Study in Sampling and the Nature of Time-Series." He begins by describing the problem:
The graph above illustrates this problem for the time series of mortality (death rate) and marriages (marriage rate) in England from 1866 to 1911. Since these two time series are both decreasing over these years, they start out high together and end up low together, so their correlation coefficient is high (0.95). Their scatterplot is shown below, and if these data were from independent observations, such a high correlation could be quite informative. But since these series are correlated in time, this correlation likely only reflects their time trends and not any other connection. Such a correlation is called spurious.
More recently, a friend (thanks JCT) pointed me to the work of Tyler Vigen, explained very well in his video. He has written a program to find such nonsense correlations, with hilarious results, such as the high correlation between the per capita consumption of cheese in the US and the number of people who died by becoming entangled in their bedsheets.
This is great fun, and his work has been widely circulated on the internet lately, for example here, here, and here. Many of these discussions emphasize that "correlation does not imply causation," and some even invoke this xkcd cartoon.

But spurious correlations can be more insidious than these graphs illustrate. These time series only illustrate how correlating time-related variables can be misleading, but spurious correlation can also arise with data that are not correlated in time. For example, drunk driving and fatal accidents increased more in localities that had banned smoking in bars and restaurants than in those that had not (from this paper). One might be tempted to conclude that banning smoking causes these bad outcomes. But something more is at work here. Smokers needed to travel farther to find alternative localities that had not banned their addiction. They were on the road more and therefore had more accidents. In economics, such outcomes are often referred to as unintended consequences. Many other settings have similar lurking or confounding variables or behaviors that have nothing to do with time series but can also give rise to spurious correlations.

But if we have a correlation for two time series, how might we determine whether it is spurious or not? One approach for Yule's marriage and mortality data is the following: if there is a real, causal connection between mortality and marriages, then when mortality changes from one year to the next, we should see a related change in marriages over the same year. If these year-to-year changes are not related, then the correlation between mortality and marriages (0.95) is likely spurious and misleading. Below is a scatterplot of the year-to-year differences of mortality and marriages. Their correlation is 0.064, suggesting that the previous, time-dependent correlation of 0.95 is spurious and misleading.
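
The differencing check can be sketched in a few lines of Python. This is only an illustration with synthetic data, not Yule's actual series: the two series below are made-up downward trends with independent noise, so by construction there is no real connection between them. The helper names (`pearson`, `first_differences`), the slopes, the noise levels, and the random seed are all assumptions chosen for the example.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def first_differences(series):
    """Year-to-year changes: series[t] - series[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


# Synthetic stand-ins, NOT Yule's data: two downward trends over 46 "years"
# (the span 1866-1911 inclusive) with independent noise, so any correlation
# between them comes from the shared time trend alone.
rng = random.Random(0)
years = range(46)
series_a = [22.0 - 0.15 * t + rng.gauss(0, 0.5) for t in years]
series_b = [78.0 - 0.40 * t + rng.gauss(0, 1.5) for t in years]

# Correlation of the raw levels versus correlation of the yearly changes.
r_levels = pearson(series_a, series_b)
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

With series built this way, the correlation of the levels should come out high while the correlation of the changes should be much closer to zero, which is the same pattern as the 0.95 versus 0.064 above. A low correlation of differences does not prove there is no connection (a lagged effect, for instance, would not show up in same-year changes), so this is best read as a quick sanity check rather than a formal test.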
