Statpics: November 2014

Monday, November 24, 2014

Elementary Steps, Dr. Watson

“Ordering my cab to wait, I passed down the steps, worn hollow in the center by the ceaseless tread of drunken feet,” Dr. Watson in The Adventure of the Man with the Twisted Lip by Arthur Conan Doyle. (Bell-shaped carpet wear from the ceaseless feet at a restaurant in Snow Hill, Maryland.)

Monday, November 17, 2014

Data Literacy: It's Elementary

The Washington Post has an article this morning (17 November, 2014): "In elementary schools, lessons on data literacy," by IT reporter Mohana Ravindranath. She describes a "growing movement of educators creating lesson plans to teach students to collect and analyze data." One goal is "to derive opinions from measurable, real-world data." Another is to address the shortage of "managers and analysts who can make decisions based on big data analysis," according to management researcher Michael Chui. The Washington Post article goes on to quote Chui:

“It makes sense for us to be thinking about education, starting in early childhood, about concepts such as the difference between correlation and causation, what it means to have a bias as you think about data, conditional probability. These are things we as humans don’t naturally do . . . these are learned [concepts],” Chui said in an interview. He added that curricula should teach students about the realistic limitations of data sets — extraneous information, or sampling error, for instance.

The article describes students collecting their own data. Third-grade students collect daily temperature data, fifth-grade students record the hours of daylight and relate them to the earth's motions, and even kindergarten children are "recording predictions for whether it will be sunny outside the next day, or which foods will decompose fastest, along with the results."

For one science coordinator at an elementary school, the test of whether these lessons are effective is "ultimately if the kid's able to have a conversation about it and ask questions about it."

A great goal for students of all ages. That this is taught and expected of even elementary school students is inspiring.

(On a very minor display note: the introductory graphic to this story is an image of a computer monitor showing results from a school's Science Festival using software from Tuva Labs. Dot plots are displayed showing the arm spans by gender. I wonder about the zoom-in that is shown for one data point. It seems only to extract the same dot plot that's on the screen. That's something to ask a question about!)

Monday, November 10, 2014

Pie Rules

"There is no data that can be displayed in a pie chart, that cannot be displayed BETTER in some other type of chart," is a quote Wikipedia attributes to the late, great statistician John Tukey, (I've found no original source for the quote). It gets worse when Excel and/or graphic designers start adding chart junk of 3-D projections in hope of adding more visual interest.

Any reasonable comparisons in the above chart are impossible.

If you insist on using pie charts, Benjamin Starr at Visual News offers some history and sensible guidelines for using pie charts.

First, as the lead-in graphic above illustrates, display no more than five categories in a pie chart. Many small areas are difficult to compare.

Second, since wedges of a pie are difficult to compare side-by-side, Starr says don't use multiple pie charts for comparison. Use stacked bar charts instead, as shown above.

Third, make sure that the percentages add up to 100% and that all slices are drawn proportionately to the percentages they purport to represent.

And finally, order the slices from largest to smallest, starting at 12 o'clock and continuing either clockwise or counter-clockwise to aid in comprehension.
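The three guidelines that can be checked from the numbers alone (no more than five categories, percentages that total 100%, and slices ordered from largest to smallest) can be sketched as a small check. This is only an illustration: the function name `check_pie_rules`, its parameters, and the rounding tolerance are assumptions made here, not anything from Starr's article. It does not check whether the slices are drawn proportionately, and it says nothing about the advice against multiple pie charts.

```python
def check_pie_rules(slices, max_slices=5, tolerance=0.5):
    """Return a list of guideline violations for a proposed pie chart.

    `slices` is a list of (label, percent) pairs in drawing order,
    starting at 12 o'clock. All names here are hypothetical.
    """
    problems = []
    percents = [pct for _, pct in slices]

    # Guideline: display no more than five categories.
    if len(slices) > max_slices:
        problems.append(f"too many categories: {len(slices)} > {max_slices}")

    # Guideline: percentages should add up to 100% (within rounding).
    total = sum(percents)
    if abs(total - 100) > tolerance:
        problems.append(f"percentages sum to {total:g}, not 100")

    # Guideline: order the slices from largest to smallest.
    if percents != sorted(percents, reverse=True):
        problems.append("slices are not ordered from largest to smallest")

    return problems
```

An empty list means the data passes these three checks; for example, `check_pie_rules([("A", 50), ("B", 30), ("C", 20)])` reports no problems.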

Of course, we've posted some pie charts here and had some fun with them.

Monday, November 3, 2014

The Curse

The Curse of Dimensionality addresses the difficulty of dealing with multivariate data. It warns us that, for a set of data in high dimensions, local neighborhoods are almost certainly empty of data points and neighborhoods that are not empty are almost certainly not local.

In explaining this result, biostatistician Jeff Leek thought of a clever demonstration and got his graduate student Prasad Patil to build an interactive program to illustrate the Curse. In the screen shots above, samples of 100 points are randomly and uniformly generated in 1, 2, 3, and 4 dimensions in the unit cube. Subsets are examined in cubes with edge length of 1/2. In 1 dimension, the simulation contained 55% of the data in a line segment of length 1/2 (expected is, of course, 50%). In 2 dimensions, the simulation contained 31% in a square with sides of length 1/2 (expected is 25%). In 3 dimensions, the simulation contained 14% in a cube with sides of length 1/2 (expected is 12.5%). And finally, the simulation contained just 4% of the data in a 4-D cube with sides of length 1/2 (expected is 6.25%). As the dimension grows, smaller and smaller percentages of the data can be found in regions whose linear dimensions, our low-dimensional intuition tells us, are not small. Balancing the low variance of a large neighborhood against the low bias of a small neighborhood is incredibly difficult in high dimensions.
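The expected percentages quoted above are just (1/2)^d for dimension d, and the experiment is easy to imitate. What follows is a minimal sketch, not Leek and Patil's program: the function names are made up here, and the sub-cube is assumed to sit in the corner of the unit cube at the origin, which does not change the expected fraction for uniform data.

```python
import random


def expected_fraction(dim, edge=0.5):
    """Expected share of uniform points falling in a sub-cube of the given edge."""
    return edge ** dim


def simulated_fraction(dim, n_points=100, edge=0.5, seed=None):
    """Share of n_points uniform points in the unit cube that land in [0, edge)^dim."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        # The point is in the sub-cube only if every coordinate is below edge.
        if all(coord < edge for coord in point):
            inside += 1
    return inside / n_points


if __name__ == "__main__":
    for dim in range(1, 5):
        sim = simulated_fraction(dim)
        exp = expected_fraction(dim)
        print(f"{dim}-D: simulated {sim:.0%}, expected {exp:.2%}")
```

With only 100 points per run, the simulated percentages bounce around the expected values, much as the 55%, 31%, 14%, and 4% in the screen shots do.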

This is a nice way to help visualize the Curse.