Monday, December 15, 2014

Pie Chart of Dimension

My new favorite pie chart. Via Flowing Data.

Monday, December 8, 2014

Rivers of Dimension

How big are US rivers? Where do they flow? How much water do they contain? This graphic from Pacific Institute helps to answer these questions. Rivers merge into larger rivers, drawn as blue branches that widen on the way to their trunk's exit at the ocean. In the East Central US the flow is primarily into the Gulf of Mexico and appears in this graphic as a great branching tree of water, with some branches stretching across the country, nearly to Canada. Many dimensions (i.e., variables) are shown in this graphic: Longitude, Latitude (and the associated nominal variable of State Name), Direction, and Water Flow (with river width drawn proportional to the square root of its estimated average annual flow volume), all drawn across the forested regions (in green) of the US. Perhaps another (seventh) variable indicating transit time from the tip of a branch to the ocean could also be color coded. Is such information easily available? I don't know.

The map is reminiscent of Minard's acclaimed map of Napoleon's march. Via Scientific Illustration and joerojasburke.

Monday, December 1, 2014


From the site JoeyCloud, here is a histogram on a piano keyboard where the histogram bars are drawn with their length corresponding to "how often each key gets pressed relative to the rest." This example is Flight of the Bumble Bee by Rimsky-Korsakov. You can select others from a short menu of classical works or from your own uploaded MIDI file. Via Flowing Data.

Monday, November 24, 2014

Elementary Steps, Dr. Watson

“Ordering my cab to wait, I passed down the steps, worn hollow in the center by the ceaseless tread of drunken feet,” Dr. Watson in The Adventure of the Man with the Twisted Lip by Arthur Conan Doyle. (Bell-shaped carpet wear from the ceaseless feet at a restaurant in Snow Hill, Maryland.)

Monday, November 17, 2014

Data Literacy: It's Elementary

The Washington Post has an article this morning (17 November 2014): "In elementary schools, lessons on data literacy," by IT reporter Mohana Ravindranath. She describes a "growing movement of educators creating lesson plans to teach students to collect and analyze data." One goal is "to derive opinions from measurable, real-world data." Another is to address the shortage of "managers and analysts who can make decisions based on big data analysis," according to management researcher Michael Chui. The Washington Post article goes on to quote Chui:
“It makes sense for us to be thinking about education, starting in early childhood, about concepts such as the difference between correlation and causation, what it means to have a bias as you think about data, conditional probability. These are things we as humans don’t naturally do . . . these are learned [concepts],” Chui said in an interview. He added that curricula should teach students about the realistic limitations of data sets — extraneous information, or sampling error, for instance.
The article describes students collecting their own data. Third-grade students collect daily temperature data, fifth-grade students record the hours of daylight and relate them to the earth's motions, and even kindergarten children are "recording predictions for whether it will be sunny outside the next day, or which foods will decompose fastest, along with the results."

According to one science coordinator at an elementary school, the test of whether these lessons are effective is "ultimately if the kid’s able to have a conversation about it and ask questions about it."

A great goal for students of all ages. That this is taught and expected of even elementary school students is inspiring.

(On a very minor display note: the introductory graphic to this story is an image of a computer monitor showing results from a school's Science Festival using software from Tuva Labs. Dot plots are displayed showing the arm spans by gender. I wonder about the zoom-in that is shown for one data point. It seems only to extract the same dot plot that's on the screen. That's something to ask a question about!)

Monday, November 10, 2014

Pie Rules

"There is no data that can be displayed in a pie chart, that cannot be displayed BETTER in some other type of chart," is a quote Wikipedia attributes to the late, great statistician John Tukey (I've found no original source for the quote). It gets worse when Excel and/or graphic designers start adding chart junk such as 3-D projections in the hope of adding more visual interest.
Any reasonable comparisons in the above chart are impossible.

If you insist on using pie charts, Benjamin Starr at Visual News offers some history and sensible guidelines for using pie charts.

First, as the lead-in graphic above illustrates, display no more than five categories in a pie chart. Many small areas are difficult to compare.
Second, since wedges of a pie are difficult to compare side-by-side, Starr says don't use multiple pie charts for comparison. Use stacked bar charts instead, as shown above.
Third, make sure that the percentages add up to 100% and that all slices are drawn proportionately to the percentages they purport to represent.
And finally, order the slices from largest to smallest, starting at 12 o'clock and continuing either clockwise or counter-clockwise to aid in comprehension.
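
For anyone who wants to check their data before reaching for a pie chart, here is a rough Python sketch of the guidelines above. The function name `check_pie_data`, the five-slice limit as a parameter, and the 0.5-point rounding tolerance are my own illustrative assumptions, not anything from Starr's article. The rule about multiple pies is a design choice that a data check cannot catch, so it is left out.

```python
def check_pie_data(percentages, max_slices=5, tol=0.5):
    """Check a dict of label -> percentage against the pie chart guidelines.

    Returns (problems, ordered): a list of warning strings and the slices
    sorted from largest to smallest. `tol` is an assumed allowance for
    rounding error in the percentages.
    """
    problems = []

    # Guideline: no more than five categories.
    if len(percentages) > max_slices:
        problems.append(
            f"{len(percentages)} categories; use at most {max_slices}"
        )

    # Guideline: the percentages should add up to 100%.
    total = sum(percentages.values())
    if abs(total - 100.0) > tol:
        problems.append(f"percentages sum to {total:g}, not 100")

    # Guideline: order slices largest to smallest (draw from 12 o'clock).
    ordered = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)
    return problems, ordered


# Made-up example data, purely for illustration.
shares = {"A": 20, "B": 45, "C": 35}
problems, ordered = check_pie_data(shares)
print(problems)
print(ordered)
```

An empty list of problems only means the numbers pass these checks; whether a pie is the best display is still a judgment call.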

Of course, we've posted some pie charts here and had some fun with them.

Monday, November 3, 2014

The Curse

The Curse of Dimensionality addresses the difficulty of dealing with multivariate data. It warns us that, for a set of data in high dimensions, local neighborhoods are almost certainly empty of data points and neighborhoods that are not empty are almost certainly not local. 

In explaining this result, biostatistician Jeff Leek thought of a clever demonstration and got his graduate student Prasad Patil to build an interactive program to illustrate the Curse. In the screen shots above, samples of 100 points are randomly and uniformly generated in 1, 2, 3, and 4 dimensions in the unit cube. Subsets are examined in cubes with edge length of 1/2. In 1 dimension, the simulation contained 55% of the data in a line segment of length 1/2 (expected is, of course, 50%). In 2 dimensions, the simulation contained 31% in a square with sides of length 1/2 (expected is 25%). In 3 dimensions, the simulation contained 14% in a cube with sides of length 1/2 (expected is 12.5%). And finally, the simulation contained just 4% of the data in a 4-D cube with sides of length 1/2 (expected is 6.25%). As the dimension grows, smaller and smaller percentages of the data can be found in regions whose linear dimensions, our low-dimensional intuition tells us, are not small. Balancing the low variance of a large neighborhood against the low bias of a small neighborhood is incredibly difficult in high dimensions.

This is a nice way to help visualize the Curse.
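
The arithmetic is easy to reproduce. Below is a minimal Python sketch of the same kind of experiment; it is not the code behind Leek and Patil's program. I assume the sub-cube sits in the corner at the origin, and the function names are my own. The expected share of points in a sub-cube of edge 1/2 in d dimensions is (1/2)^d, which gives the 50%, 25%, 12.5%, and 6.25% quoted above.

```python
import random


def fraction_in_subcube(n_points, dim, edge=0.5, seed=None):
    """Share of n_points uniform points in the unit cube [0, 1]^dim
    that land in the corner sub-cube [0, edge]^dim (an assumed placement)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        if all(coord <= edge for coord in point):
            inside += 1
    return inside / n_points


def expected_fraction(dim, edge=0.5):
    """Expected share of uniform points in a sub-cube: edge ** dim."""
    return edge ** dim


def edge_for_fraction(fraction, dim):
    """Edge length a sub-cube needs to hold `fraction` of the data."""
    return fraction ** (1.0 / dim)


# Samples of 100 points, as in the screen shots; 10-D added for contrast.
for dim in (1, 2, 3, 4, 10):
    observed = fraction_in_subcube(100, dim, seed=dim)
    print(f"{dim:2d}-D: observed {observed:.0%}, "
          f"expected {expected_fraction(dim):.2%}")
```

Turning the question around shows the other half of the Curse: `edge_for_fraction(0.5, 10)` is about 0.93, so in 10 dimensions a cube holding half the data must span roughly 93% of each axis. A neighborhood that is not empty is no longer local.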