Monday, November 17, 2014

Data Literacy: It's Elementary


The Washington Post has an article this morning (17 November 2014): "In elementary schools, lessons on data literacy," by IT reporter Mohana Ravindranath. She describes a "growing movement of educators creating lesson plans to teach students to collect and analyze data." One goal is "to derive opinions from measurable, real-world data." Another is to address the shortage of "managers and analysts who can make decisions based on big data analysis," according to management researcher Michael Chui. The Washington Post article goes on to quote Chui:
“It makes sense for us to be thinking about education, starting in early childhood, about concepts such as the difference between correlation and causation, what it means to have a bias as you think about data, conditional probability. These are things we as humans don’t naturally do . . . these are learned [concepts],” Chui said in an interview. He added that curricula should teach students about the realistic limitations of data sets — extraneous information, or sampling error, for instance.
The article describes students collecting their own data. Third-grade students collect daily temperature data, fifth-grade students record the hours of daylight and relate them to the earth's motions, and even kindergarten children are "recording predictions for whether it will be sunny outside the next day, or which foods will decompose fastest, along with the results."

One science coordinator at an elementary school says the measure of these lessons' effectiveness is "ultimately if the kid’s able to have a conversation about it and ask questions about it."

A great goal for students of all ages. That this is taught and expected of even elementary school students is inspiring.

(On a very minor display note: the introductory graphic to this story is an image of a computer monitor showing results from a school's Science Festival using software from Tuva Labs. Dot plots are displayed showing the arm spans by gender. I wonder about the zoom-in that is shown for one data point. It seems only to extract the same dot plot that's on the screen. That's something to ask a question about!)

Monday, November 10, 2014

Pie Rules

"There is no data that can be displayed in a pie chart, that cannot be displayed BETTER in some other type of chart," is a quote Wikipedia attributes to the late, great statistician John Tukey (I've found no original source for the quote). It gets worse when Excel and/or graphic designers start adding chart junk such as 3-D projections in hope of adding more visual interest.
Any reasonable comparisons in the above chart are impossible.

If you insist on using pie charts, Benjamin Starr at Visual News offers some history and sensible guidelines for using pie charts.

First, as the lead-in graphic above illustrates, display no more than five categories in a pie chart. Many small areas are difficult to compare.
Second, since wedges of a pie are difficult to compare side-by-side, Starr says don't use multiple pie charts for comparison. Use stacked bar charts instead, as shown above.
Third, make sure that the percentages add up to 100% and that all slices are drawn proportionately to the percentages they purport to represent.
And finally, order the slices from largest to smallest, starting at 12 o'clock and continuing either clockwise or counter-clockwise to aid in comprehension.
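
For readers who build their charts in code, here is a minimal sketch of how these guidelines might be checked before drawing anything. The function names, the dictionary layout, and the half-point rounding tolerance are my own assumptions, not anything from Starr's article.

```python
def check_pie_data(shares, max_slices=5, tolerance=0.5):
    """Return a list of guideline violations for a dict of category -> percent.

    max_slices and tolerance are illustrative defaults (assumptions).
    """
    problems = []
    if len(shares) > max_slices:
        problems.append(f"{len(shares)} categories; use at most {max_slices}")
    total = sum(shares.values())
    if abs(total - 100) > tolerance:
        problems.append(f"percentages sum to {total}, not 100")
    return problems


def order_slices(shares):
    """Order slices from largest to smallest, ready to draw from 12 o'clock."""
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    data = {"B": 30, "A": 50, "C": 20}  # made-up example data
    print(check_pie_data(data))
    print(order_slices(data))
```

An empty list from check_pie_data means none of the checked guidelines is violated; it says nothing about whether a pie chart is the right choice in the first place.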

Of course, we've posted some pie charts here and had some fun with them.




Monday, November 3, 2014

The Curse

The Curse of Dimensionality addresses the difficulty of dealing with multivariate data. It warns us that, for a set of data in high dimensions, local neighborhoods are almost certainly empty of data points and neighborhoods that are not empty are almost certainly not local. 

In explaining this result, biostatistician Jeff Leek thought of a clever demonstration and got his graduate student Prasad Patil to build an interactive program to illustrate the Curse. In the screen shots above, samples of 100 points are randomly and uniformly generated in 1, 2, 3, and 4 dimensions in the unit cube. Subsets are examined in cubes with edge length of 1/2. In 1 dimension, the simulation contained 55% of the data in a line segment of length 1/2 (expected is, of course, 50%). In 2 dimensions, the simulation contained 31% in a square with sides of length 1/2 (expected is 25%). In 3 dimensions, the simulation contained 14% in a cube with sides of length 1/2 (expected is 12.5%). And finally, the simulation contained just 4% of the data in a 4-D cube with sides of length 1/2 (expected is 6.25%). As the dimension grows, smaller and smaller percentages of the data can be found in regions whose linear dimensions, our low-dimensional intuition tells us, are not small. Balancing the low variance of a large neighborhood with the low bias of a small neighborhood is incredibly difficult in high dimensions.

This is a nice way to help visualize the Curse.
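
The arithmetic behind those expected percentages is simply (1/2) raised to the number of dimensions. Here is a minimal sketch that reproduces the idea. It is not Leek and Patil's interactive program; the function names, the seed, and the choice of the corner sub-cube are my own assumptions.

```python
import random


def expected_fraction(dim, edge=0.5):
    """Expected share of uniform points in a sub-cube with the given edge length."""
    return edge ** dim


def edge_for_fraction(fraction, dim):
    """Edge length a sub-cube needs in order to capture the given share of points."""
    return fraction ** (1.0 / dim)


def simulated_fraction(dim, n_points=100, edge=0.5, seed=0):
    """Share of n_points uniform points in the unit cube landing in [0, edge]^dim."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    inside = 0
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        if all(coord <= edge for coord in point):
            inside += 1
    return inside / n_points


if __name__ == "__main__":
    for dim in range(1, 5):
        print(dim, expected_fraction(dim), simulated_fraction(dim))
    # Edge length needed to hold 10% of the data in 10 dimensions
    print(edge_for_fraction(0.10, 10))
```

The last line shows the other half of the Curse: in 10 dimensions a cube must span roughly 79% of each axis to hold just 10% of the data, so a neighborhood that is not empty is hardly local.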

Monday, October 27, 2014

YADDA: There when you need them

This past week I gave a talk "Normal Distributions: Photographic Confusions" at the University of Maryland. On the way to the lecture room I pointed out this bell-shaped distribution of wear on a stairway door. Yet Another Door Distribution Again.

Monday, October 20, 2014

Are You Un-fashionably Late?

Here is a histogram of when people showed up to a party. The data come from FiveThirtyEight: the collection was begun by Walt Hickey and then crowdsourced from readers. Of 803 guest times submitted, the median guest arrived almost an hour after the party's start time. Four guests showed up over 3 hours late. How fashionable is that?
They also looked at a scatterplot of arrival time against number of party guests. They fit a regression, with an R² of only 5%, and the following interpretation: as host you should expect the mean guest to arrive 42 minutes after the party's start plus 4 minutes for every additional 10 guests. So comparing parties that differed by 10 guests, on average, the guests at the larger party arrived 4 minutes later.
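
The fitted line can be written out as a small function. This is only a sketch of the reported interpretation: I am assuming that the 42 minutes is the intercept (a party with zero guests) and that the slope is 0.4 minutes per guest; the function name is mine, and the original fit may have been parameterized differently.

```python
def expected_minutes_late(n_guests, intercept=42.0, slope_per_guest=0.4):
    """Predicted mean arrival delay in minutes for a party of n_guests.

    The intercept and slope are read off the reported interpretation
    (42 minutes plus 4 minutes per 10 guests); treating 42 as the
    zero-guest intercept is an assumption.
    """
    return intercept + slope_per_guest * n_guests


if __name__ == "__main__":
    small, large = 20, 30
    print(expected_minutes_late(small), expected_minutes_late(large))
    # Parties differing by 10 guests differ by about 4 minutes on average
    print(expected_minutes_late(large) - expected_minutes_late(small))
```

With an R² of only 5%, individual guests will scatter widely around whatever this line predicts.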

Monday, October 13, 2014

X is for .... oh just forget it!

Journalist David Goldenberg of FiveThirtyEight noted the animals most likely used to represent letters in a sample of 50 children's ABC books from 1820 to 2013. He notes that Zebra was used almost exclusively for the letter "Z". But note the letter "X". So few words begin with "X" that it was most often omitted entirely from alphabet books or, as Goldenberg says, handled by authors "lamely trotting out a fox or an ox and pointing out its last letter." The modern trend seems to be using scientific words such as Xiphias for swordfish.

Shown these results, one parent mentioned that Xenops, a genus of ovenbirds, was used in at least two of her son's items (books or toys) and was surprised that "D" for dolphin was not higher in the ranking. But I guess it's hard to top Dog and Duck. And Dr. Seuss's ABC, for the letter D, dreams up a Duck-Dog!


Monday, October 6, 2014

Happy Birthday Holidays

Like the title asks, "How common is your birthday?" From a decade of data from 1994 to 2004, the shades of color in this display seem to indicate that September is the most common month, and further plots show that September 9, 1999 had the most births. The least common? Holidays. Perhaps the expectant parents themselves and/or health care workers keep the expectant mothers away from delivery on New Year's Day, July 4th, Christmas and Christmas Eve, and several days in late November, since Thanksgiving can vary. But Leap Day, February 29th, is surely the least common. Via Visual News, via Redditor UCanDoEat.

This display brings to mind the classical birthday problem and its variations. The classical birthday problem considers the probability that, in a set of n people chosen randomly and independently, at least one pair shares a birthday. The usual assumption is that birthdays are uniformly distributed throughout the year. The display above shows this not to be the case. Bloom (1973), in the American Mathematical Monthly, showed that any non-uniform distribution of birthdays makes sharing more likely. It is well known that for n=23 people the chances are greater than even of sharing uniformly distributed birthdays. Munford (1977) showed that n=23 also gives greater-than-even chances for any non-uniform distribution. Berresford (1980) examined this with a non-uniform, data-based distribution of birthdays, illustrating that the surprising, counter-intuitive, and robust value of n=23 yields greater than even odds.
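
The n=23 result is easy to check for yourself. Below is a minimal sketch: an exact calculation under the uniform 365-day assumption, plus a small Monte Carlo routine that accepts any set of day weights. The function names are my own, and any non-uniform weights you feed in are your own assumption; they are not the birth data shown in the display above.

```python
import itertools
import random


def prob_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday,
    assuming birthdays are independent and uniform over `days` days."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def smallest_group(threshold=0.5, days=365):
    """Smallest n for which the sharing probability exceeds the threshold."""
    n = 1
    while prob_shared_birthday(n, days) <= threshold:
        n += 1
    return n


def simulated_shared_probability(n, weights, trials=5000, seed=0):
    """Monte Carlo estimate of the sharing probability for arbitrary day weights."""
    rng = random.Random(seed)
    day_ids = range(len(weights))
    cum_weights = list(itertools.accumulate(weights))
    hits = 0
    for _ in range(trials):
        birthdays = rng.choices(day_ids, cum_weights=cum_weights, k=n)
        if len(set(birthdays)) < n:  # at least one repeated day
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(smallest_group())
    print(prob_shared_birthday(22), prob_shared_birthday(23))
    print(simulated_shared_probability(23, [1.0] * 365))
```

Replacing the uniform weights with a lumpy set (more weight in September, less on holidays, say) lets you see Bloom's result at work: the estimated probability of a shared birthday should come out no lower than in the uniform case, up to simulation noise.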