Monday, April 21, 2014

Misleading Inverted Time Series

Here is a graphic published by Reuters that has been making the rounds on the web, here and here. It depicts the number of gun deaths in Florida. At first glance it appears that gun deaths dropped just after 2005, when Florida enacted its 'Stand Your Ground' law. The trouble is that the vertical scale is inverted: higher numbers are lower on the graph, not higher. The post on Business Insider showed the graph presented in a more standard way, with higher numbers plotted higher on the graph.
It's much clearer in this graph that gun deaths in Florida increased after 2005 and the enactment of the 'Stand Your Ground' law. So why produce a graphic that runs so counter to common expectations and understanding? Politics? Point of view? Artistic vision? Or, as the artist says, "personal preference"!

The designer is Christine Chan. She says her inspiration was this graph from the South China Morning Post titled "Iraq's Bloody Toll", which plots deaths as dripping blood. Artistic choice or not, this approach is misleading.

Monday, April 14, 2014

Death of a Favorite, Surprising Example

It happened this past week: the death of one of my favorite examples. This one is about an iconic Washington, DC event, the National Cherry Blossom Festival. Since 1912, Washington, DC has celebrated the arrival of spring with this festival. But the arrival of the blossoms doesn't always correspond with the date of the festival. Regular records of the monthly date when the Washington, DC cherry blossoms are at peak bloom have been collected by the National Park Service (NPS) since 1921. I've posted other uses of the cherry blossoms here and here.

The graphic above is from the Washington Post on March 27, 1987. It shows the frequency distribution of those monthly dates as a stem and leaf diagram. The stems are the monthly dates. The leaves are the years when that date was noted by the NPS as the peak bloom, the date when 70% of the trees are in full bloom. Like many frequency distributions, it is sparse for the early blooming in mid-March, becomes more frequent towards the end of March and into the first few weeks of April, and finally becomes sparse again towards mid-April. None of this is unusual or surprising. Unusual outliers in the peak blooming dates might be found in the early or late times (as I have often taught my students).

But something more surprising hides in plain sight near the most frequently occurring peak dates. Since those regular records began in 1921, a peak bloom was never noted on April 10. The surrounding dates of April 9 and April 11 have been noted as peak blooms several times, but not April 10. For this 1987 graphic, that was 65 years of missing April 10. A gap in such a high-frequency region of the distribution is surprising. In my class in Basic Statistics we talk about possible causes: Is April 10 a commemorative date for the Park Service? Are they otherwise occupied on April 10? Do they take April 10 off? Or is this just a result of randomness?

I have kept track of this for the past 27 years, and through 2013 this pattern has held. Until 2014! This past week the peak bloom was recorded by the National Park Service on April 10, the first such time in 94 years. My longtime example has died. Randomness has filled in the gap.

Here is a view of my classroom presentation slide with the new offending date filled in.

Monday, April 7, 2014

Scrabble Distribution

Here is a display of the frequency distribution of the letters used in Scrabble arranged alphabetically.
And here is the frequency distribution of letters arranged by their point value. Although the left-most stack is labeled with the letter "A", it is a stack of all the letters with a one-point value: A, E, I, L, N, O, R, S, T, U. The other stacks are labeled similarly. Not shown are the two blank tiles.
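The regrouping from letters to point values can be sketched in a few lines of Python. This is only an illustration, not the method used to make the graphics: the tile counts and point values below are an assumption (the standard English-language Scrabble set), and the names TILES and count_by_points are made up for this example.

```python
from collections import defaultdict

# Assumed: standard English-language Scrabble set, letter -> (tile count, point value).
# The two blank tiles are left out, as in the graphics.
TILES = {
    "A": (9, 1), "B": (2, 3), "C": (2, 3), "D": (4, 2), "E": (12, 1),
    "F": (2, 4), "G": (3, 2), "H": (2, 4), "I": (9, 1), "J": (1, 8),
    "K": (1, 5), "L": (4, 1), "M": (2, 3), "N": (6, 1), "O": (8, 1),
    "P": (2, 3), "Q": (1, 10), "R": (6, 1), "S": (4, 1), "T": (6, 1),
    "U": (4, 1), "V": (2, 4), "W": (2, 4), "X": (1, 8), "Y": (2, 4),
    "Z": (1, 10),
}


def count_by_points(tiles):
    """Add up the number of tiles at each point value."""
    totals = defaultdict(int)
    for count, points in tiles.values():
        totals[points] += count
    return dict(sorted(totals.items()))


by_points = count_by_points(TILES)

# Print one row of '#' marks per point value, a sideways version of the stacks.
for points, n in by_points.items():
    print(f"{points:>2} points: {'#' * n} ({n})")
```

Under this assumed tile set, the one-point stack holds 68 of the 98 lettered tiles, which is why it towers over the others.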

Monday, March 31, 2014

Kitchen Distribution

Here is a well-used cutting board. Every morning it protects the countertop from an errant knife cutting slices of bread for breakfast. The loaf is most often placed so that the cuts fall near the middle of the board, with the knife's blade repeatedly marring the front edge of the cutting board. Less often the cuts extend off to the right or left of the middle. In this contest, right seems to win out. This leaves us with a frequency distribution of the knife's marks skewed a bit to the left: the fewest marks along the left of the front edge, the most along the middle, and then somewhat fewer marks on the right of the edge. This is a bell-shaped pattern, although somewhat skewed to the left, that we've seen often.

Monday, March 24, 2014

Quantifying Selfies is a project that quantifies many aspects of selfie portraits in five cities around the world in plots of various frequency distributions. The image above shows a dot plot of the estimated age of selfie subjects. In this image 61.6% of the selfies analyzed are from women and 36.7% are from men. The average estimated age of the women is 23.3 years and of the men 26.7 years. Scrolling over a dot in this plot shows the actual selfie that was measured.
This image assesses the mood of the selfies and plots them on a smile scale from frowning to smiling. Another plot shows the smile ratings across the cities.
There are even plots that assess the tilt of the subject's head in the selfie, showing that on average women tilt their heads more than men. This is a fun project. Perhaps some of the participants could use the Selfie Help Book!

Monday, March 17, 2014

In Honor of Pi Day last Friday

In honor of Pi Day this past Friday, March 14, here is something I posted long ago, in 2008. It is still my favorite pie chart.

Monday, March 10, 2014

Conditional Probability

Here is a clever, interactive simulation of conditional probability from Victor Powell, via flowingdata.

Balls fall from the sky, uniformly distributed across the display window. Some hit a red shelf of adjustable width. Here it is set at a width so that
P(A) = 20% of all the falling balls hit the red shelf.
Another lower blue shelf, overlapping a bit with the red one, is set so that
P(B) = 12% of the balls hit the blue shelf.
A mere P(A and B) = 6% of the balls hit both shelves, indicated by the mixture of red and blue to get purple. But, as the simulation says,
"If we have a ball and we know it hit the red shelf, there's a 30.0% chance it also hit the blue shelf" and 
"If we have a ball and we know it hit the blue shelf, there's a 50.0% chance it also hit the red shelf". 

Below are connected bars showing, by their length, the color composition of the dropped balls. We can easily see how these proportions are obtained by visually estimating the fraction that purple makes up of the balls that hit the red shelf: the ratio purple / (red + purple) = 30%. Likewise, the fraction that purple makes up of the balls that hit the blue shelf is the ratio purple / (purple + blue) = 50%.
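The arithmetic behind those two statements can be checked with a short Python sketch. It is not part of Victor Powell's simulation; it simply applies the definition P(X | Y) = P(X and Y) / P(Y) to the three percentages quoted above. The variable and function names are my own, chosen for illustration.

```python
from fractions import Fraction

# Probabilities as read off the simulation settings described in the post.
p_a = Fraction(20, 100)       # ball hits the red shelf
p_b = Fraction(12, 100)       # ball hits the blue shelf
p_a_and_b = Fraction(6, 100)  # ball hits both shelves (purple)


def conditional(p_joint, p_given):
    """Return P(X | Y) = P(X and Y) / P(Y)."""
    if p_given == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return p_joint / p_given


p_b_given_a = conditional(p_a_and_b, p_a)  # hit blue, given it hit red
p_a_given_b = conditional(p_a_and_b, p_b)  # hit red, given it hit blue

# The same ratios expressed as segments of the connected bars.
purple = p_a_and_b
red_only = p_a - p_a_and_b
blue_only = p_b - p_a_and_b

print(f"P(B | A) = {float(p_b_given_a):.1%}")  # purple / (red + purple)
print(f"P(A | B) = {float(p_a_given_b):.1%}")  # purple / (purple + blue)
```

Using exact fractions avoids any rounding noise, so the results match the 30.0% and 50.0% shown by the simulation.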