Monday, November 30, 2015

How to lie, no, display data in context

These are figures from the classic book "How to Lie with Statistics" by Darrell Huff. They illustrate his admonition against dramatically rescaling a line graph (or a bar graph) by leaving zero off the vertical axis. This omission has long been a staple of those who want to mislead but, as the video from Vox below explains, it is not always wrong to omit zero from a graph, or chart as they call it. Seeing the data in context is what's important. I especially like the illustration of plotting body temperature on the Kelvin scale. We should keep in mind the phrase attributed to the late Harvard statistician Frederick Mosteller, "While it is easy to lie with statistics, it is even easier to lie without them."
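
The Kelvin example can be put in numbers. The sketch below is only an illustration: the two temperatures (98.6 and 103 degrees Fahrenheit) and the zoomed axis floor of 97 degrees are assumed values chosen for the example, not figures from the video, and `visible_change` is a hypothetical helper name.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


def visible_change(low, high, axis_min):
    """Share of the plotted height (axis_min up to high) taken up by the rise from low to high."""
    return (high - low) / (high - axis_min)


# Assumed example values: a normal temperature and a fever.
normal_k = fahrenheit_to_kelvin(98.6)
fever_k = fahrenheit_to_kelvin(103.0)

# Vertical axis starting at absolute zero versus an axis starting near the data.
zero_based = visible_change(normal_k, fever_k, axis_min=0.0)
zoomed = visible_change(normal_k, fever_k, axis_min=fahrenheit_to_kelvin(97.0))

print(f"axis from 0 K: the fever fills {zero_based:.1%} of the height")
print(f"axis from 97 F: the fever fills {zoomed:.1%} of the height")
```

On the zero-based Kelvin axis the fever takes up less than one percent of the height, so a medically important change all but disappears. Here it is the axis that includes zero that hides the context.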



Monday, November 23, 2015

Hungry Thanksgiving

Above is a map from Five-Thirty-Eight's survey of Thanksgiving eating traditions nationwide. It shows the most disproportionately common Thanksgiving side dish by region. I grew up in Florida, but we never got on the mac and cheese Thanksgiving bandwagon. Likewise, living now in Maryland, squash doesn't make it onto our plates. Our family seems to match the Midwest more, with green bean casserole and rolls (not biscuits). And for dessert we are looking forward to pumpkin, pecan, and apple pie. You can download their survey data for your own analysis here.

Monday, November 16, 2015

Independence in the Trumans' Wallpaper

This is a view inside the kitchen of President Harry S Truman’s home in Independence, Missouri, courtesy of the National Park Service. Notice the pattern of wear in the wallpaper from pulling the chain to turn on the wall lamp. Here's a close-up:




When Harry Truman left the Presidency he retired to a very modest and quiet life. He lived with his wife Bess in her family’s home in Independence, Missouri. He read five newspapers a day, no doubt many at the table shown in this photograph. One can imagine their morning routine of taking a seat at the kitchen table with a cup of coffee, leaning against the kitchen wall, and reaching up to pull the chain to turn on the lamp above the table. In so doing, perhaps knuckles hit the wall or the chain rubbed the wallpaper and wore it through. This resulted in a clustered pattern of wear from pulling the lamp’s chain and releasing it. But the wear on the wallpaper suggests that this targeting was not always exact and that the release, in hitting the wall, was not always consistent. These small, accumulated errors left a record over their many morning reads.


Such a targeting routine gives rise to the normal, bell-shaped pattern of wear that we have seen before. Only this time it develops in two dimensions. With a few assumptions we can derive the normal probability distribution that models and describes these actions.

In reaching for the lamp’s chain, perhaps Mr. Truman’s morning grogginess, or his intense attention to the news of the day, caused him to miss the target, reaching just a little too far to the left or a little too far to the right to grab the chain. Likewise, the marks indicate that his reach was sometimes a little too high or a little too low.

Let us first assume that these small errors, right and left or high and low, are independent of one another. This independence means that if his reach was too high one morning, this had no effect on how the reach left its mark when the light was turned off later in the day or on the next morning. He didn’t repeat the same too-high reach the next morning, nor did he overcompensate and leave a mark too low the next.

But this Independence goes beyond Missouri; it goes further than day-to-day variations. It applies more importantly to each individual action of turning on the lamp. We assume that at each targeting of the lamp’s chain the right-to-left targeting is independent of the up-and-down targeting. They are not consistently grabbing slightly up and simultaneously to the right, nor slightly down and at the same time to the left. On the contrary, independence would dictate that the other two possibilities, slightly up and to the left or slightly down and to the right, are equally represented motions. These independent actions leave their marks on the wallpaper in a clustered pattern of roughly circular shape. The pattern has no tendency to tilt up or down to the right or the left.

Next, since turning on a lamp is such a routine and repeated task, it is much more likely that their targeting error was small rather than large. It would be very unlikely that a reach would leave a mark far from the target chain. So our second assumption is that the bigger the error, the smaller its chance of occurrence. Small errors are much more likely.

Now, imagine that a high reach is just as likely as a low reach. Likewise, a reach to the right of the target is just as likely as a reach to the left. This would say that the marks fall symmetrically around the target and that the probability of a particular size error to the right is the same as the probability of the same size error to the left.

But here we make an even stronger assumption. Let us assume that errors at any given distance from the target have the same chance of occurrence in whatever direction they may land. This would mean that not just horizontal (right and left) errors or vertical (up and down) errors are considered. Targeting errors along any tilted diagonal are also possible. After all, Mr. Truman could have occasionally reached a little to the right and a little too high, falling northeast of the target. As we’ve said, it does not appear that they did this consistently, but however far in this tilted direction a mark was eventually made, we assume that the chance of such an occurrence is the same as the chance of an equally distant mark in a purely horizontal or vertical direction. What matters is not the direction of the error, whether up, down, right, left, or diagonal. The chance of occurrence of any error depends only on how far it is from the target.

We have these assumptions: 1) independence of horizontal and vertical targeting, 2) smaller errors are more likely than larger ones, 3) errors the same distance from the target have the same chance of occurrence and finally 4) the resulting probability density function that describes the results of these targeted actions is always positive, that is, no targeted misses are excluded from possibility. With these assumptions the bivariate normal probability density function can be derived.
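
To see what these assumptions look like together, here is a minimal simulation sketch. It assumes the conclusion (normal errors) and uses made-up values: the standard deviation of 1.0, the seed, and the number of marks are arbitrary choices, and nothing here is measured from the actual wallpaper. The function names are hypothetical.

```python
import math
import random


def simulate_marks(n, sigma=1.0, seed=0):
    """Draw n hypothetical wear marks (x, y) around a target at the origin.

    Horizontal and vertical errors are drawn independently of each other.
    """
    rng = random.Random(seed)
    return [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]


def correlation(points):
    """Sample correlation between the horizontal and vertical errors."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def quadrant_shares(points):
    """Share of marks landing northeast, northwest, southwest and southeast of the target."""
    counts = {"NE": 0, "NW": 0, "SW": 0, "SE": 0}
    for x, y in points:
        key = ("N" if y >= 0 else "S") + ("E" if x >= 0 else "W")
        counts[key] += 1
    return {key: count / len(points) for key, count in counts.items()}


marks = simulate_marks(20000)
print("correlation:", round(correlation(marks), 3))
print("quadrant shares:", quadrant_shares(marks))
```

A correlation near zero and roughly equal shares in the four quadrants correspond to a cluster with no tilt, which is what the independence assumption describes.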

In fact, this result has a long history. It has been derived many times and used in many contexts. It is most notably attributed to Herschel (1850), but was developed much earlier by Adrain (1805).
Let x represent the horizontal position and y represent the vertical position of a targeting mark, both measured from the target at the origin. Let f(x) (or f(y)) denote the probability density of the horizontal (or vertical) position. Independence tells us that the joint probability density of the positions x and y, denoted g(x,y), can be represented as the product of the probability densities for x and y individually. That is, g(x,y) = f(x) f(y). But we also have the assumption that the joint density of x and y depends only on the distance from the target origin. So that,
$$f(x)\,f(y) = h\!\left(\sqrt{x^2 + y^2}\right)$$
for some function h. If we let y = 0 then we see that h(x) = f(x) f(0). Substituting this back and dividing both sides by f(0)^2 gives
$$\frac{f(x)}{f(0)} \cdot \frac{f(y)}{f(0)} = \frac{f\!\left(\sqrt{x^2 + y^2}\right)}{f(0)}.$$
Now define
$$k(x^2) = \ln\frac{f(x)}{f(0)}.$$
Then
$$k(x^2) + k(y^2) = k(x^2 + y^2).$$
But the well-known solution to such a functional equation is given by the linear function k(x) = cx, for some constant c. Then
$$\ln\frac{f(x)}{f(0)} = c\,x^2$$
or
$$f(x) = f(0)\,e^{c x^2}.$$
We have a probability density (i.e. one that integrates to 1) only if we use our second assumption that larger errors are less likely to occur than smaller ones. This says that c must be negative. We can write such a negative constant as
$$c = -\frac{1}{2\sigma^2}$$
for some standard deviation $\sigma$. Then the function f(x) takes the form
$$f(x) = f(0)\,e^{-x^2/(2\sigma^2)}.$$
This is exactly the form of the probability density function of a normal random variable. To find f(0) we note that the area under a probability density must be one. This results in a properly scaled probability density for our horizontal (or vertical) position:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/(2\sigma^2)}.$$
This is the normal, bell-shaped probability distribution, centered at the origin and having a standard deviation of $\sigma$. The two-dimensional wallpaper wear pattern can then be considered a sample from the bivariate normal probability distribution
$$g(x,y) = f(x)\,f(y) = \frac{1}{2\pi\sigma^2}\,e^{-(x^2 + y^2)/(2\sigma^2)}.$$
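
As a quick numerical check, the sketch below takes the normal density in its standard form, f(x) = exp(-x^2 / (2 sigma^2)) / (sigma sqrt(2 pi)), and confirms two properties used in the derivation: the area under f is one, and the product f(x) f(y) is the same at two points equally far from the target. The value sigma = 1.0, the integration range, and the test points (3, 4) and (5, 0) are arbitrary choices for illustration.

```python
import math


def f(x, sigma=1.0):
    """Normal density centered at the origin with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))


def g(x, y, sigma=1.0):
    """Joint density of independent horizontal and vertical errors."""
    return f(x, sigma) * f(y, sigma)


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func from lo to hi."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width


# The area under the density should be very close to one.
area = integrate(f, -10.0, 10.0)

# (3, 4) and (5, 0) are both a distance of 5 from the target,
# so the joint density should agree at the two points.
diagonal = g(3.0, 4.0)
horizontal = g(5.0, 0.0)

print("area under f:", area)
print("g(3, 4) and g(5, 0):", diagonal, horizontal)
```

The second check is the third assumption in action: a diagonal miss and a purely horizontal miss of the same size have the same density.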
 
This same quantitative argument was also used by James Clerk Maxwell in 1860 in his study of the kinetic theory of gases.

There is more to see here. Notice the stains on the wall above each chair, about at head level. Was this the result of resting groggy, early morning heads? Next, close inspection might discern greater variability and therefore a wider spread of marks on the wallpaper in the left to right direction compared to the up and down direction. This concerns the behavior of those individual directions separately. One may be spread out a bit more than the other, that is, the horizontal direction might have a larger standard deviation than the vertical direction. Although this changes the expression of the bivariate normal density, with our assumptions, it is still normally distributed. The key requirement is the independence of the directions. Independence is concerned with how the up and down or left and right directions of action behave together and leave their marks. We would doubt independence of these individual motions only if they consistently left marks in a tilted directional pattern, rather than just the possible stretched direction seen here.
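
For the stretched case, the density can be written out as a sketch. The symbols $\sigma_x$ and $\sigma_y$ are introduced here for the horizontal and vertical standard deviations; they do not appear in the derivation above. With independent normal errors in each direction, the joint density is the product

$$g(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\,\exp\!\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right).$$

Its contours are ellipses aligned with the horizontal and vertical axes, stretched but not tilted. Setting $\sigma_x = \sigma_y = \sigma$ recovers the circular pattern.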

Finally, the National Park Service Rangers tell me that President Truman and his wife picked out this wallpaper in 1971. Mr. Truman died in 1972. His wife Bess Truman likely sat at the same table until her death in 1982. A large portion of the wallpaper wear shown here is most likely due to her turning the lamp on and off.