Monday, April 14, 2014

Death of a Favorite, Surprising Example

It happened this past week: the death of one of my favorite examples. This one is about an iconic Washington, DC event, the National Cherry Blossom Festival. Since 1912, Washington, DC has celebrated the arrival of spring with this festival. But the arrival of the blossoms doesn't always correspond with the date of the festival. Regular records of the monthly date when the Washington, DC cherry blossoms are at peak bloom have been collected by the National Park Service (NPS) since 1921. I've posted other uses of the cherry blossoms here and here.

The graphic above is from the Washington Post on March 27, 1987. It shows the frequency distribution of those monthly dates as a stem and leaf diagram. The stems are the monthly dates. The leaves are the years when that date was noted by the NPS as the peak bloom, the date when 70% of the trees are in full bloom. Like many frequency distributions, it is sparse for the early blooming in mid-March, becomes more frequent towards the end of March and into the first few weeks of April, and finally becomes sparse again towards mid-April. None of this is unusual or surprising. Unusual outliers in the peak blooming dates might be found in the early or late times (as I have often taught my students).

But something more surprising hides in plain sight near the most frequently occurring peak dates. Since those regular records began in 1921, a peak bloom was never noted on April 10. The surrounding dates of April 9 and April 11 have been noted as peak blooms several times, but not April 10. For this 1987 graphic, that was 65 years of missing April 10. A gap like this in a high-frequency region of the distribution is surprising. In my class in Basic Statistics we talk about possible causes: Is April 10 a commemorative date for the Park Service? Are they otherwise occupied on April 10? Do they take April 10 off? Or is this just a result of randomness?

I have kept track of this for the past 27 years, and through 2013 this pattern has held. Until 2014! This past week the peak bloom was recorded by the National Park Service on April 10, the first such time in 94 years. My longtime example has died. Randomness has filled in the gap.
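
For the "just randomness" explanation, a rough back-of-the-envelope calculation can help class discussion. The sketch below assumes, purely for illustration, that each year April 10 has the same fixed chance p of being the peak date and that years are independent. The values of p are hypothetical guesses, not estimates from the NPS records, and the helper name prob_never_observed is my own.

```python
def prob_never_observed(p_per_year: float, years: int) -> float:
    """Chance that a date is never the peak in `years` independent years,
    if it has probability `p_per_year` of being the peak in any one year."""
    return (1.0 - p_per_year) ** years


# 1921 through 2013 is 93 years of records without an April 10 peak.
YEARS_WITHOUT_APRIL_10 = 2013 - 1921 + 1

# Hypothetical yearly probabilities (assumptions, not fitted to the data).
for p in (0.02, 0.04, 0.06):
    chance = prob_never_observed(p, YEARS_WITHOUT_APRIL_10)
    print(f"p = {p:.2f}: P(no April 10 peak in {YEARS_WITHOUT_APRIL_10} years) = {chance:.4f}")
```

Under these assumptions, the larger the yearly chance for a date in the busy part of the distribution, the less likely such a long gap becomes. Keep in mind, though, that there are many dates in which a gap could have appeared, so some gap somewhere is less remarkable than a gap on one date named in advance.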

Here is a view of my classroom presentation slide with the new offending date filled in.



Monday, April 7, 2014

Scrabble Distribution

Here is a display of the frequency distribution of the letters used in Scrabble arranged alphabetically.
And here is the frequency distribution of letters arranged by their point value. Although the left-most stack is labeled with the letter "A", it is a stack of all the letters with a one-point value: A, E, I, L, N, O, R, S, T, U. The other stacks are labeled similarly. Not shown are the two blank tiles.
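
The regrouping from letters to point values can be sketched in a few lines of Python. This assumes the standard English-language tile set (98 lettered tiles plus the two blanks, which are left out here just as they are in the display). The names TILE_COUNTS, POINT_VALUES, and tiles_by_point_value are my own.

```python
# Tile counts for the standard English-language Scrabble set (blanks excluded).
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2, "I": 9,
    "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2, "Q": 1, "R": 6,
    "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1, "Y": 2, "Z": 1,
}

# Letters grouped by point value in the same edition.
POINT_VALUES = {
    1: "AEILNORSTU", 2: "DG", 3: "BCMP", 4: "FHVWY", 5: "K", 8: "JX", 10: "QZ",
}


def tiles_by_point_value(counts, point_values):
    """Return {point value: total number of tiles worth that many points}."""
    return {
        points: sum(counts[letter] for letter in letters)
        for points, letters in point_values.items()
    }


if __name__ == "__main__":
    # Print a simple text version of the stacked display, one row per point value.
    for points, total in sorted(tiles_by_point_value(TILE_COUNTS, POINT_VALUES).items()):
        print(f"{points:>2} points: {'#' * total} ({total})")
```

The one-point row dominates, which is the tall left-most stack in the second display.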


Monday, March 31, 2014

Kitchen Distribution

Here is a well-used cutting board. Every morning it protects the counter top from an errant knife cutting slices of bread for breakfast. The loaf is most often placed so that the cuts fall near the middle of the board, with the knife's blade repeatedly marring the front edge of the cutting board. Less often the cuts continue and extend off to the right or left of the middle. In this contest right seems to win out. This leaves us with a frequency distribution of the knife's marks skewed a bit to the left: the fewest marks along the left of the front edge, the most along the middle, and then a bit fewer marks on the right of the edge. This is a bell-shaped, although somewhat skewed to the left, pattern that we've seen often.

Monday, March 24, 2014

Quantifying Selfies

Selfiecity.net is a project that quantifies many aspects of selfie portraits in five cities around the world in plots of various frequency distributions. The image above shows a dot plot of the estimated age of selfie subjects. In this image 61.6% of the selfies analyzed are from women and 36.7% are from men. The average estimated age of the women is 23.3 years and of the men 26.7 years. Scrolling over the dots in this plot shows the actual selfie that is measured.
This image assesses the mood of the selfies and plots them on a smile scale from frowning to smiling. Another plot shows the smile ratings across the cities.
There are even plots that assess the tilt of the subject's head in the selfie, showing that on average women tilt their heads more than men. This is a fun project. Perhaps some of the participants could use the Selfie Help Book!



Monday, March 17, 2014

In Honor of Pi Day last Friday

In honor of Pi Day this past Friday, March 14, here is a chart I posted long ago, in 2008. It is still my favorite pie chart.

Monday, March 10, 2014

Conditional Probability

Here is a clever, interactive, simulation display of conditional probability from Victor Powell, via flowingdata.

Balls fall from the sky, uniformly distributed across the display window. Some hit a red shelf of adjustable width. Here it is set at a width so that
P(A) = 20% of all the falling balls hit the red shelf.
Another lower blue shelf, overlapping a bit with the red one, is set so that
P(B) = 12% of the balls hit the blue shelf.
A mere P(A and B) = 6% of the balls hit both shelves, indicated by the mixture of red and blue to get purple. But, as the simulation says,
"If we have a ball and we know it hit the red shelf, there's a 30.0% chance it also hit the blue shelf" and 
"If we have a ball and we know it hit the blue shelf, there's a 50.0% chance it also hit the red shelf". 

Below are connected bars showing, by their length, the color composition of the dropped balls. Here red and blue count the balls that hit only that shelf, and purple counts the balls that hit both. We can easily see how these proportions are obtained by visually estimating the fraction that purple makes up of the balls that hit the red shelf. That is the ratio purple / (red + purple) = 30%. Likewise, the fraction that purple makes up of the balls that hit the blue shelf is the ratio purple / (purple + blue) = 50%.
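
The same ratios can be checked with a small simulation. This is only a sketch of the idea, not Victor Powell's code: it assumes the balls land uniformly on the interval from 0 to 1, and the shelf positions RED and BLUE are my own choices, picked so that the widths are 20% and 12% and the overlap is 6%.

```python
import random

# Hypothetical shelf positions on a unit-width window (assumptions):
# red covers 20% of the width, blue covers 12%, and they overlap on 6%.
RED = (0.00, 0.20)
BLUE = (0.14, 0.26)


def hits(shelf, x):
    """True if a ball falling at horizontal position x lands on the shelf."""
    return shelf[0] <= x < shelf[1]


def simulate(n_balls, seed=0):
    """Drop n_balls uniformly at random and estimate the probabilities."""
    rng = random.Random(seed)
    red = blue = both = 0
    for _ in range(n_balls):
        x = rng.random()
        in_red, in_blue = hits(RED, x), hits(BLUE, x)
        red += in_red
        blue += in_blue
        both += in_red and in_blue
    return {
        "P(A)": red / n_balls,
        "P(B)": blue / n_balls,
        "P(A and B)": both / n_balls,
        "P(B | A)": both / red,   # purple / (red + purple)
        "P(A | B)": both / blue,  # purple / (purple + blue)
    }


if __name__ == "__main__":
    for name, value in simulate(100_000).items():
        print(f"{name}: {value:.3f}")
```

With enough balls the estimates should settle near the exact values, P(B | A) = 0.06 / 0.20 = 0.30 and P(A | B) = 0.06 / 0.12 = 0.50.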

Monday, March 3, 2014

Dancing Statistics

A still image from the project Communicating Psychology to the Public through Dance, produced by Lucy Irving, Elise Phillips, and Andy Field, supported by the British Psychological Society and IdeasTap. There are four videos: my favorite, Frequency Distributions, along with Sampling and Standard Error, Variance, and Correlation.

In this image from the first of the videos the dancers start out in one large unorganized group, some dancing with very slow movements, some with very quick movements, and, as one would then expect, more with movements of an intermediate speed. As they dance they sort themselves out, from the slower-moving dancers on the left to the more rapidly moving dancers on the right, building up a sample from a bell-shaped distribution. Very clever.

There is a video about correlation with dancers performing the same movements together or nearly opposite movements together. As they mention in the text of the video, these movements are just co-occurrences; one does not cause the other. Correlation is not causation.

There is a video about variation with dancers performing variations on the same set of movements.

Another is about sampling and standard error. In this dance, a single blue-shirted dancer performs his movements to indicate the four corners of a rectangle. He and the rectangle defined by his movements are termed the population. Then several red-shirted dancers mark four corners in their own styles, producing various quadrilaterals that estimate the rectangular shape of the blue-shirted dancer.

I do think calling that initial, single blue-shirted dancer a population could be misleading, especially since at the beginning of this video the text mentions "a large group (a population)". This, of course, is the usual view: the large group is the population from which we observe samples to estimate it. But perhaps better, in this setting, would be to talk more generally about a statistical model. This is a model for a dancer's movements. These movements depend on the physical aspects of the dancer: height, limb length, reach, flexibility, etc. They also depend on artistic intent, style, technique, etc.

The blue-shirted dancer specifies the results of a certain collection of all of these aspects. This becomes a parameter, a target. The red-shirted dancers sample from the model of movements to estimate this parameter. The variety and range of their movements display the sampling variability as they attempt to match the governing shape (parameter) of the blue-shirted dancer. This view is more general than viewing sampling as drawing from a fixed, large-group population.