Monday, March 23, 2015

You Move Me

Here's an example of an interactive linear regression demonstration, much like the programs that are now included with nearly all basic statistics textbooks. This one works well: it's easy to drag outliers around or to place data points close to the mean, in both cases watching what happens to the slope and the intercept. It makes for many fun explorations. Via Statistical Modeling, Causal Inference, and Social Science.
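For readers without the demo at hand, here is a minimal sketch of the same experiment in Python. The data points are made up for illustration, and fit_line is a hypothetical helper written for this sketch, not part of the demonstration itself.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up points lying exactly on y = 2x + 1 (illustrative only).
base = [(x, 2 * x + 1) for x in range(1, 6)]

# Add one stray point at the mean of x (x = 3), well above the line.
near_mean = base + [(3, 17)]

# Add one stray point far from the mean of x, well below the line.
outlier = base + [(20, 11)]

base_fit = fit_line(base)
near_mean_fit = fit_line(near_mean)
outlier_fit = fit_line(outlier)

for label, fit in [("base", base_fit), ("near mean", near_mean_fit), ("outlier", outlier_fit)]:
    print(f"{label:>10}: slope = {fit[0]:.3f}, intercept = {fit[1]:.3f}")
```

In this sketch the stray point at the mean of x leaves the slope alone and only lifts the intercept, while the stray point far out in x drags the slope well below 2.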

Monday, March 16, 2015

Let it stop, let it stop, let it stop

Here in Washington, DC we certainly have not seen anywhere near the snow totals that Boston has seen this winter. But I am certainly tired of shoveling! Has the Snow Finally Stopped? is a question that Harry Enten at Five Thirty Eight's Data Lab has examined using National Oceanic and Atmospheric Administration (NOAA) data on the date of the last measurable snow over the last 50 years in various US cities. In the graphic above he plots the lower and upper quartiles of the date of the last snow along with the median. For Washington, DC here is a stemplot of the frequency distribution of those dates.

December  0 | 9   
          1 |   
          1 |   
          2 |   
          2 |   
          3 |   
January   0 |   
          0 |   
          1 |   
          1 | 5   
          2 |   
          2 | 57   
          3 | 1   
February  0 |   
          0 | 7   
          1 | 12234   
          1 | 5789   
          2 | 0334   
          2 | 577   
March     0 | 12   
          0 | 567888899   
          1 | 023344   
          1 | 8   
          2 | 011   
          2 | 5567   
          3 |   
April     0 |   
          0 | 777

It's interesting that the latest measurable snow in Washington, DC over these 50 years fell on April 7, and that it did so in three different years: 1972, 1990, and 2007. Reminds me of another early April pattern that held up for years.

Over 3/4 of these snow dates are behind us now. But there are still 11 years out of the past 50 that saw measurable snowfall after today's date of March 16. Based on these data, we can estimate the chance of more snow to be 11/50 = 22%. Let's bet against it!
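The count can be checked directly from the stemplot. This is a small sketch in Python: the dates are transcribed from the stems and leaves above, and season_key is a hypothetical helper that simply sorts December ahead of January so the dates run in winter order.

```python
# Last-snow dates transcribed from the stemplot, keyed by month number.
stemplot = {
    12: [9],
    1: [15, 25, 27, 31],
    2: [7, 11, 12, 12, 13, 14, 15, 17, 18, 19, 20, 23, 23, 24, 25, 27, 27],
    3: [1, 2, 5, 6, 7, 8, 8, 8, 8, 9, 9, 10, 12, 13, 13, 14, 14,
        18, 20, 21, 21, 25, 25, 26, 27],
    4: [7, 7, 7],
}

def season_key(month, day):
    """Order dates within one winter: December comes before January."""
    return (0 if month == 12 else month, day)

today = season_key(3, 16)
dates = [season_key(m, d) for m, days in stemplot.items() for d in days]

total = len(dates)
later = sum(1 for d in dates if d > today)   # winters with snow after March 16
behind = total - later                       # winters already done by March 16

share_later = later / total
share_behind = behind / total
print(f"{later} of {total} winters had later snow: {share_later:.0%}")
print(f"{behind} of {total} winters were finished: {share_behind:.0%}")
```

The estimate is just a relative frequency, so it treats each of the 50 winters as equally informative about this one.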

Monday, March 9, 2015

Daily Double - Maybe

Coming across the Daily Doubles in the TV game show Jeopardy! can quickly propel contestants into the lead. They do not have to compete with others to come up with the question, and they can wager a large sum. In 2014 contestant Arthur Chu upset Jeopardy! traditionalists by bouncing around the board hoping to hit upon a Daily Double. Using data from the fan site J!Archive, Nathan Yau from Flowing Data has tabulated the locations of the Daily Doubles for 31 seasons, totaling 13,633 Daily Doubles, and shown them as relative frequencies. The darker the color, the more likely a Daily Double is in that position. The fourth row is the most favored, though only by a small amount.

Monday, March 2, 2015

Gas Pump Dings

Here's another gas pump distribution. Inattentive users of the pump's nozzle have carelessly returned it to its home. In the process they have dinged the vertical panel on the left. The nozzle has its set home, but the users have sometimes started too high for its return, resulting in a few dings on the upper regions of the panel. Similarly, there are a few dings on the low end of the panel. The majority of the dings are in line with the nozzle's home. It's a distribution pattern we've seen often: few marks high, few marks low, and most marks in between.

Monday, February 23, 2015

Gas Pump Wear

The wear pattern in this image shows a frequency distribution of customers’ finger placement as they take their receipt from a gasoline pump. They likely target their grab at the center of the receipt. As they remove the receipt, the paper and their fingers rub paint off the dispenser. The wear pattern seems to be symmetric about the center, that is, the right side of the wear pattern is a near mirror image of the left side. This shows that users favored neither left nor right more often in removing their receipts. There is more wear near the center of the dispenser and less toward the edges, balanced about the central target point. The bell-shaped curve of wear appears to be a very good match for a normal curve.

Monday, February 16, 2015

Waldo, revisited

We've seen the scatterplot and marginal distributions of Waldo's location in a collection of Where's Waldo books. These were from Ben Blatt. Now Ronald Olsen has added a plot of the contours of a kernel density estimate of the joint frequency distribution of Waldo's location on the facing pages of the Waldo books. Darker color indicates higher density. Olsen then dynamically computes the optimal search path. Try it out. Can you find Waldo quicker?

Monday, February 9, 2015

Selective Correlation

Sociologist Gabriel Rossman of UCLA has an article "When Correlation Is Not Causation, But Something Much More Screwy" in The Atlantic. It graphically shows how calculating correlations on selected samples can produce misleading results. He imagines two characteristics of aspiring actors, ability (mind) and attractiveness (body), plotted in a scatterplot. Random observations from independent standard normal distributions are drawn to represent these variables in a population of actors. Most actors cluster around zero on each variable, with fewer well above or well below on mind or body. But since mind and body are chosen independently, there is no correlation shown in the plot, that is, no tendency of the plot to tilt with a positive or negative slope.

But then he imagines computing the correlation between mind and body from a sample of working actors. To get work, his actors have been selected by casting directors only if they have a high value for the sum of mind and body. He has marked these working, observed actors with small triangles in the plot. The remaining aspiring actors, non-working and unobserved, are marked with small circles. The plotted pattern of triangles for the working, observed actors has a definite negative correlation, suggesting, wrongly, that either the more able actors are not much to look at or the most attractive actors are dunces in their acting ability. Although this might fit some stereotypical caricatures, the pattern is entirely an artifact of the selection rule that produced the observed sample. He continues with an SAT example as well.
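The effect is easy to reproduce. What follows is a rough sketch in Python, not Rossman's own code: the population size, the random seed, and the casting cutoff of 2.0 on mind + body are all assumptions chosen for illustration, and pearson is a hypothetical helper.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2015)   # fixed seed (an assumption) so the run is repeatable
n_actors = 5000             # assumed population size

# Independent standard normal draws for ability and attractiveness.
mind = [rng.gauss(0, 1) for _ in range(n_actors)]
body = [rng.gauss(0, 1) for _ in range(n_actors)]

# Assumed selection rule: an actor works only if mind + body exceeds the cutoff.
cutoff = 2.0
working = [(m, b) for m, b in zip(mind, body) if m + b > cutoff]

r_all = pearson(mind, body)
r_working = pearson([m for m, _ in working], [b for _, b in working])

print(f"all actors:     n = {n_actors}, r = {r_all:.2f}")
print(f"working actors: n = {len(working)}, r = {r_working:.2f}")
```

Among all actors the correlation sits near zero, as it must for independent draws. Among the selected actors it turns clearly negative, because anyone who clears the cutoff with a low value on one variable needs a high value on the other.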

His scatterplot can be improved. Measurements with equal variability should be plotted in a square plot. The plotting characters for the unobserved and observed actors could be more pronounced. See an example below.