Tuesday, November 6, 2007

The Sound of a Chair



Not really statistical and not totally visual, but here is an interesting 3D representation of data. Using time, frequency, and volume, this tone yields a 3D graphic in the form of a chair. Follow the link to designer Matthew Plummer Fernandez, then click on sound/chair to hear the sound of a chair.

Thursday, October 18, 2007

State of the Union Statistics


Here is an interactive view of the State of the Union addresses from all the US Presidents. Larger word clouds indicate the words that were used more frequently. You can select a President and compare his address (in red) with another's (in white). Up and down arrows can help eliminate word clustering. The bar chart at the bottom indicates the total number of words in each address. A very nice use of frequency statistics.

George W. Bush has a fondness for Julie, shown on the right. This is Julie Aigner-Clark who developed the Baby Einstein videos and cashed in when she sold out to Disney. Of course, the videos have now been shown to be ineffective. Hear the NPR story.

Friday, October 5, 2007

Beware correlations on averages

A recent article on Cuba brought to mind how correlations on averages can be very different from correlations on individuals. Due to unfortunate economic conditions, the 1990s found the Cuban people low on both food and fuel. They ate less (daily energy intake declined from 2,899 kcal in 1988 to 1,863 kcal in 1993), and they exercised more as a result of widespread use of bicycles and walking as alternative means of transportation.

As a result, obesity declined, as did deaths attributed to diabetes, coronary heart disease and stroke. This corresponds with much of what is known about increasing the length of life through caloric restriction. Laboratory tests on many organisms have shown a negative correlation between caloric consumption and length of life. The example from Cuba is a natural experiment on humans that appears to indicate that daily calorie intake is negatively correlated with life expectancy. Of course, the Cuban example has a confounding variable of increased exercise. So we can't say directly that the less you eat the longer you live. But contrast this with data from the UN through the FAO on average daily calorie intake and average life expectancy by country, shown above. The correlation is positive 0.72. Of course, this correlation is confounded with wealth, health care, etc.

If individuals generally respond to decreased caloric intake the same way that the citizens of Cuba did, then hypothetical scatterplots for individuals within each country would have a negative correlation:

Here the ellipsoidal regions indicate confidence regions for the mean caloric intake and length of life for individuals in each country. The tilt in the ellipses indicates the negative correlation. Now I don't have such data. But the Cuban example suggests that the correlations may well be negative within a country but positive across the countries of the world.

A nice example that tells us again to beware of correlations on averages. They may not reflect the correlations on individuals.
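Since I don't have individual-level data, here is a small simulated sketch of the idea in Python. Every number in it is made up for illustration: the five country means, the spread of individual intake, and the assumed within-country effect of one year of life lost per extra 100 kcal. It only shows that negative correlations within each country and a positive correlation across country averages can coexist.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2007)  # fixed seed so the sketch is repeatable

# Hypothetical countries: (mean daily kcal, mean life expectancy in years).
# Assumption: wealthier countries both eat more and live longer.
country_means = [(1900, 50), (2300, 58), (2700, 66), (3100, 72), (3500, 78)]

within, avg_kcal, avg_life = [], [], []
for mean_kcal, mean_life in country_means:
    kcal = [rng.gauss(mean_kcal, 150) for _ in range(500)]
    # Assumed within-country effect: each extra 100 kcal costs one year.
    life = [mean_life - 0.01 * (k - mean_kcal) + rng.gauss(0, 1.0) for k in kcal]
    within.append(pearson(kcal, life))
    avg_kcal.append(sum(kcal) / len(kcal))
    avg_life.append(sum(life) / len(life))

across = pearson(avg_kcal, avg_life)

print("within-country correlations:", [round(r, 2) for r in within])
print("correlation of country averages:", round(across, 2))
```

The within-country correlations come out clearly negative while the correlation of the five averages is strongly positive, which is the pattern the hypothetical ellipses describe.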

Wednesday, October 3, 2007

Normal Dining?


Here is an image of a bell-shaped distribution on the baseboards of booths at a restaurant in Bethesda, Maryland. Could it be that the pattern is caused when the wait staff set, clear, or deliver food to the table? Imagine a server scuffing the baseboards with his shoes, rubbing the wood smooth in a pattern that shows typical foot placement. Or is the pattern the residual polish left after customers scuff the wood next to their seats as they slide into the booth? I think I prefer the former explanation.
What do you think?

Monday, September 24, 2007

Gapminder


An amazing interactive Flash implementation shown here of scatterplot graphics related to world development. Multiple dimensions, variable transformations, geographic anchors, time series visualization, etc. An outstanding interactive graphic!

World's Income Distribution


An interactive Flash demonstration showing the distribution of the world's income and the percentage of the world's population that falls under the poverty line of $1 a day and how it changes over time and by region.
From www.gapminder.org

A 6-Dimensional Conditional Slice


Here is a graphic from the US Geological Survey via the World Wildlife Fund. It shows 6 dimensions of data across a conditional slice of the US at latitude 35° N. It is similar to an earlier post on rainfall.

Sunday, August 12, 2007

Power of the Histogram


The new CBS game show called "Power of 10" has contestants estimate what Americans think. The host of the game is comedian Drew Carey. One question posed to a contestant was "What percentage of Americans think Drew Carey is a presidential candidate and not a comedian?" The contestant must provide a guess. Low-valued questions allow for a relatively large margin of error. For example, a $1,000 question has a margin of error of 40%. The next power of 10 question, for $10,000, has a margin of error of 30%, and so on up to $10,000,000, where the exact answer is required. One contestant has already won $1,000,000 on the show. To help them with their guesses, contestants are given the results of an audience poll. This is given in the form of the histogram shown above.

For the Drew Carey question, the modal guess is 10%; the mean and median are in the 20s.

By the way, the answer to the question above was 9%, well within the contestant's range of 2% to 42%. She won $1,000. Of course the audience's modal answer was much closer, showing the wisdom of crowds.

Monday, August 6, 2007

A distribution in the clouds


Here is an endpaper from a book displayed at a site on the Visual Telling of Stories via www.strangemaps.wordpress.com. The book is likely about travel across the US with commentary on geology, geography, etc. A close-up is shown here. The elevation profile towards the bottom has, above it, a frequency distribution (in clouds) of the annual rainfall in each part of the country.

Friday, August 3, 2007

Statistical Clock


Using data from the World Health Organization, the CIA factbook, the US Census Bureau and other sources you can get an unsettling look at the world. The up-to-the-second version is at poodwaddle.com.

Tuesday, July 31, 2007

Cake Chart


You've heard of pie charts. Here are Cake Charts. The cake's decoration is designed to be the proportions of each cake's ingredients. As the site says, "Decoration becomes information."

Such data are called compositional data. The previous post had compositional data on national flags. Here is one of my research papers on modeling compositional data.

Friday, July 20, 2007

Pie Flags


Here is a site that shows the flags of the nations of the world as pie charts of their colors. Click on any of the site's charts to see which country it belongs to.

Thursday, July 19, 2007

What color is PG-13?



Here's someone who has attempted to compare the frequency distributions of the colors on movie posters given their MPAA rating. As you might expect the more kid-friendly the movie, the more bright colors you see. NC-17 and R rated movies are more "dark and fleshy".

Wednesday, July 11, 2007

Plotting A Divided Court



On July 5, 2007 the Washington Post published the table above in the article “Parsing a Divided Court.” This is a similarity matrix. Displayed for each pair of Supreme Court Justices are the percentages of time that they agreed in non-unanimous rulings. We can use the method of principal co-ordinates by Gower "Some distance properties of latent roots and vector methods used in multivariate analysis" (1966, Biometrika, 53, pp. 325-338) to obtain points in Euclidean 2-space whose distances approximately mirror these similarity scores (distance^2 = 2(1 – similarity)). Justices with high similarity scores are plotted as points close together. Justices with low similarity scores are plotted as points far apart.

The method starts with a similarity matrix A. The diagonal elements of A are ones, indicating a justice’s perfect similarity with themselves. (Alternatively, a distance matrix D = (d[i,j]) could be used to build A, with diagonal entries equal to zero and the a[i,j] off diagonal entries equal to -½*d[i,j]^2).

The matrix A is transformed by subtracting from each row the mean of that row of A, then subtracting from each column the mean of that column of A, then adding the overall mean of A to each entry. The eigenvectors and eigenvalues of this transformed version of A are then computed. The elements of the first two eigenvectors, each normalized to unit length and multiplied by the square root of its eigenvalue, give the co-ordinates of the nine justices in 2-space. This provides the visual representation of the original similarity matrix shown above. A measure of goodness of fit is given by the sum of the first two eigenvalues divided by the trace of the transformed A, here about 75%.

An arbitrary algebraic sign places Justices Stevens, Ginsburg, Souter, and Breyer on the left of our plot and Justices Thomas, Scalia, Roberts and Alito on the right. Justice Kennedy is the most centrist justice falling just a little right of center.
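For anyone who wants to try the method, here is a minimal NumPy sketch of the steps described above. It is not the code behind the plot. The 4 x 4 similarity matrix is invented for illustration and is not the Washington Post table, and the function name principal_coordinates is my own. Unit-length eigenvectors are scaled by the square roots of their eigenvalues, the usual convention in Gower's method.

```python
import numpy as np

def principal_coordinates(A, k=2):
    """Principal co-ordinates from a symmetric similarity matrix A."""
    A = np.asarray(A, dtype=float)
    row_means = A.mean(axis=1, keepdims=True)
    col_means = A.mean(axis=0, keepdims=True)
    # Double-centre: remove row means and column means, add back the grand mean.
    B = A - row_means - col_means + A.mean()
    eigvals, eigvecs = np.linalg.eigh(B)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale unit eigenvectors by the square roots of their eigenvalues.
    coords = eigvecs[:, :k] * np.sqrt(np.clip(eigvals[:k], 0, None))
    fit = eigvals[:k].sum() / np.trace(B)      # goodness of fit
    return coords, fit

# Hypothetical similarities for four judges: two blocs of two.
A = [[1.0, 0.8, 0.3, 0.2],
     [0.8, 1.0, 0.4, 0.3],
     [0.3, 0.4, 1.0, 0.9],
     [0.2, 0.3, 0.9, 1.0]]

coords, fit = principal_coordinates(A)
print(np.round(coords, 3))
print("goodness of fit:", round(float(fit), 3))
```

The two blocs land on opposite sides of the first axis. Which bloc ends up on the left depends on the arbitrary sign of the eigenvector, just as in the plot of the justices.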

An anagram note: The first letters of the justices' last names: KGB'S STARS
(May 2010: My mnemonic KGB'S STARS for the justices' last names has held out through two replacements on the Court. Rehnquist left and Roberts replaced him. Souter left and Sotomayor replaced him.
But now it may fail. Stevens left and Kagan is the current candidate to replace him. Perhaps STARK KGBS. Not very good.)

Friday, July 6, 2007

Do Women Really Talk More Than Men?


In their article published today in Science, "Are Women Really More Talkative Than Men?" Mehl, et al. address what has become a cultural "truth" quantified in the book "The Female Brain" by Brizendine. There it is claimed that women use about 20,000 words a day while men use about 7,000. But Mehl and colleagues have studied the actual conversations of nearly 400 people over the past 8 years. The histograms above from their online supplement display the findings. Women spoke an average of 16,215 words a day with a standard deviation of 7,301 words. Men spoke an average of 15,669 words a day with a standard deviation of 8,633 words. A one-sided t-test results in a p-value of 0.2479. Hardly a significant finding. Each group spoke about 16,000 words a day with large individual variation. This illustrates perfectly the quote by novelist Ivy Compton-Burnett, "There is more difference within the sexes than between them."
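A rough check of that test can be made from the summary statistics alone. The sketch below is only an approximation under stated assumptions: the group sizes of 210 women and 186 men are my assumption (the post says only "nearly 400 people"), the variances are not pooled, and the normal distribution stands in for the t distribution, which is reasonable at these sample sizes. It will not reproduce 0.2479 exactly.

```python
import math

def one_sided_test_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample statistic and an approximate upper-tail p-value."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    t = (mean1 - mean2) / se
    # Normal approximation to the upper tail: P(Z > t)
    p = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p

# Means and standard deviations from the post; group sizes are assumed.
t_stat, p_value = one_sided_test_from_summary(16215, 7301, 210, 15669, 8633, 186)
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.3f}")
```

The difference of 546 words a day is well under one standard error, so the p-value lands near 0.25, in line with the figure quoted above.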

Tuesday, July 3, 2007

Cycle Man


On Friday, June 29, 2007, Aubrey Huff, first baseman for the Baltimore Orioles, hit for the cycle. In one game, he had a single, a double, a triple, and a home run. This has happened 276 times in major league history and only three times for the Baltimore Orioles, and it is the first time it has happened at home in Baltimore and at Oriole Park at Camden Yards. In the major leagues he is the third player to hit for the cycle in 2007. So what is the distribution of such an event?

Interestingly, just this year Huber and Glen have modeled the distribution of such rare events in their article, “Modeling Rare Baseball Events – Are They Memoryless?” in the Journal of Statistics Education. They model three baseball events as a Poisson Process: no-hit games, triple plays, and hitting for the cycle. Using data from 1901 through 2004 they get the distribution shown above of how many cycles are seen in a given season. The observed results are a bit short on seasons with exactly 2 players hitting for the cycle, and there are a few more seasons than expected with exactly 1 player hitting for the cycle. The expected results are shown in red: a Poisson distribution with the observed mean of 2.19.

Over the time period examined, nearly 160,000 games were played, and only 0.14% of these games had a player hit for the cycle. As Huber and Glen found, the cycle is rarer than a triple play but not as rare as a no-hitter.

So how closely does this match a Poisson process? One measure is a Poissonness Plot developed by Hoaglin (American Statistician 1980). If f[k] represents the observed frequency of a random count k, then the log of the Poisson density is -λ + k log(λ) - log(k!), so log(f[k]) + log(k!) is, apart from a constant, linear in k. In a plot of log(f[k]) + log(k!) against k, a straight line would therefore indicate a good fit to a Poisson distribution. The slope of this line is an estimate of log(λ). Here the slope is 0.9583, so that an estimate of λ is 2.607. Of course the maximum likelihood estimate is simply the mean number of cycles per season, 2.19. Since the variance of a Poisson distribution is also equal to λ, the sample variance provides another estimate, 2.90. Note the Poissonness plot estimate falls in between the two.
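The Poissonness plot calculation is easy to sketch in Python. The frequencies below are not Huber and Glen's observed counts. They are idealized expected counts for 104 seasons at λ = 2.19, used only to show that the slope recovers log(λ) when the data are exactly Poisson. The helper names are my own.

```python
import math

def poissonness_points(freqs):
    """Points (k, log f[k] + log k!) for each count k with f[k] > 0."""
    return [(k, math.log(f) + math.lgamma(k + 1))   # lgamma(k + 1) = log(k!)
            for k, f in enumerate(freqs) if f > 0]

def least_squares_line(points):
    """Ordinary least-squares slope and intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

# Idealized (not observed) frequencies: expected seasons with k cycles.
true_lambda, n_seasons = 2.19, 104
freqs = [n_seasons * math.exp(-true_lambda) * true_lambda ** k / math.factorial(k)
         for k in range(8)]

slope, intercept = least_squares_line(poissonness_points(freqs))
lambda_hat = math.exp(slope)
print(f"slope = {slope:.4f}, estimated lambda = {lambda_hat:.3f}")
```

With real counts the points scatter around the line, and the fitted slope (0.9583 in the cycle data, giving exp(0.9583) ≈ 2.607) need not agree with the sample mean.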

Another way of checking for a Poisson process is to check the inter-arrival times, that is, the number of games between any two players hitting for the cycle. These inter-arrival times should follow an exponential distribution. The figure below shows the cumulative relative frequency of the observed time between cycles for the period 1901 through 2004. Also shown is the theoretical exponential cumulative probability distribution with a mean equal to the observed mean of 719.533 games. This indicates that the cycle process is memoryless. Even if there have been 1000 games without any major league player hitting for a cycle, you would still have to wait on average over 719 games more before one does.
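The memoryless property can be illustrated with a short simulation. This sketch assumes the gaps between cycles are exactly exponential with the observed mean of 719.533 games. The simulated gaps are not the historical data, and the 1000-game and 500-game values are just examples.

```python
import math
import random

MEAN_GAP = 719.533  # observed mean number of games between cycles

def survival(t, mean=MEAN_GAP):
    """P(T > t) for an exponential waiting time with the given mean."""
    return math.exp(-t / mean)

# Memorylessness: P(T > s + t | T > s) equals P(T > t).
s, t = 1000, 500
conditional = survival(s + t) / survival(s)

# Simulation: among gaps longer than s games, average the extra wait.
rng = random.Random(1901)  # arbitrary fixed seed
gaps = [rng.expovariate(1 / MEAN_GAP) for _ in range(200_000)]
long_waits = [g - s for g in gaps if g > s]
mean_extra_wait = sum(long_waits) / len(long_waits)

print(f"P(T > {t}) = {survival(t):.4f}, conditional = {conditional:.4f}")
print(f"mean extra wait after {s} games: {mean_extra_wait:.1f}")
```

The extra wait after a 1000-game drought averages about 719 games, the same as the wait from a fresh start.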

For Aubrey Huff it was certainly a special night. Not only did he hit for the cycle, he also had the 1000th major league hit and the 200th double of his career. Even with all that, the Orioles still lost the game to the LA Angels 9 to 7.

Monday, June 25, 2007

Shooting a Fence



"Fig. 10.—When deviations in all directions are equally probable, as in the case of shots fired at a target by an expert marksman, the "frequencies" will arrange themselves in the manner shown by the bullets in compartments above. A line drawn along the tops of these columns would be a 'normal probability curve.' Diagram by C. H. Popenoe."

From a 1918 book by Popenoe and Johnson titled "Applied Eugenics". Even though that field has been discredited, they have a clever way of motivating the normal distribution.

"Suppose an expert marksman shoots a thousand times at the center of a certain picket in a picket fence, and that there is no wind or any other source of constant error that would distort his aim. In the long run, the greatest number of his shots would be in the picket aimed at, and of his misses there would be just as many on one side as on the other, just as many above as below the center. Now if all the shots, as they struck the fence, could drop into a box below, which had a compartment for each picket, it would be found at the end of his practice that the compartments were filled up unequally, most bullets being in that representing the middle picket and least in the outside ones. The intermediate compartments would have intermediate numbers of bullets. The whole scheme is shown in Fig. 11. [actually Fig. 10] If a line be drawn to connect the tops of all the columns of bullets, it will make a rough curve or graph, which represents a typical chance distribution. It will be evident to anyone that the distribution was really governed by "chance," i.e., a multiplicity of causes too complex to permit detailed analysis. The imaginary sharp-shooter was an expert, and he was trying to hit the same spot with each shot. The deviation from the center is bound to be the same on all sides."

Monday, June 11, 2007

A Skewed Runway




This is a composite picture obtained from the United States Geological Survey. The top picture is a view of the runway called 1L at Washington-Dulles International Airport. Look at the tire skid pattern from the landing airplanes.

The real runway is many times longer than it is wide. The images shown here have been stretched across the width of the runway and shrunk along the length to better see the pattern of use. Aside from this mild distortion, the tire skid pattern has not been altered.

The aiming point for landing pilots is indicated by two broad white rectangular marking stripes about 1,000 feet from the end of the runway. Touchdown zone markers are groups of one, two, or three rectangular bars every 500 feet arranged on either side of the runway center line.

This runway at Washington-Dulles airport is so long (11,500 feet) that pilots need not land exactly on the aiming point for safe operation. This along with the fact that they definitely don’t want to land short of the runway accounts for the skewed tire skid pattern. Note that although the skid pattern is skewed along the length of the runway it is symmetric across the width of the runway.

The skewness along the length of the runway shows that most of the airplanes land within 1,000 feet of the aiming point, but some land, as indicated by the skid pattern, much further down the runway. The symmetry across the width of the runway is, of course, due to the two sets of wheels of the landing gear and how accurately the pilots hit the centerline of the runway.

The second runway picture shows the entire 1L/19R runway (albeit distorted to fit the page). Notice now the U-shaped distribution of tire skids, as we see the accumulated skid marks from airplanes landing from both directions.

Poisson Parking




The picture shows an aerial photograph, courtesy of the United States Geological Survey, of the main visitors' parking facility of a business in Maryland on a Sunday morning in April. Look at the line of 13 parking spaces at the bottom of the picture. Notice the pattern in the oil stains leaked from the cars that park in those spaces. More oil is leaked in the spaces closest to the building.

Parking lots permit both the dynamic and static viewing of statistical processes. A time lapse film of customers entering and leaving a parking lot could allow us to estimate arrival rates, lengths of stay, or number of parked customers as time progresses. But automobiles, not being the cleanest of vehicles, leave their mark. This process is often modeled by a Poisson distribution. The actions of this process can be seen in static pictures as well.

It being Sunday, no one is visiting the company, but the many previous visitors have left their marks. Notice the oil stains in the parking places. Much more oil is leaked and deposited in the places used most often. These places are the ones closest to the building. The one parking space closest to the building is a place for drivers with handicaps. Skip this space and examine the pattern starting with the first non-handicapped parking space. This first space will have oil stains when there are one or more cars in the lot. The second space will have oil stains when there are two or more cars in the lot, and so on. Thus, as the distance from the office front door to the car increases, the amount of leaked oil decreases. This shows the steady state distribution of a multi-server queue. It takes the form of a truncated Poisson distribution, distributing Poisson probability only among the counts from zero up to the number of parking spaces.
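Here is a small Python sketch of that truncated Poisson argument. It assumes drivers always take the open space nearest the door, uses 12 spaces (the 13 in the picture minus the handicapped space), and picks a mean of 4 cars purely for illustration. Nothing here is estimated from the photograph.

```python
import math

def truncated_poisson_pmf(mean, max_count):
    """Poisson probabilities renormalized over the counts 0..max_count."""
    weights = [mean ** n / math.factorial(n) for n in range(max_count + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def space_occupancy(mean, spaces):
    """P(space k is in use) = P(at least k cars in the lot), k = 1..spaces."""
    pmf = truncated_poisson_pmf(mean, spaces)
    return [sum(pmf[k:]) for k in range(1, spaces + 1)]

pmf = truncated_poisson_pmf(mean=4.0, max_count=12)   # hypothetical mean
occupancy = space_occupancy(mean=4.0, spaces=12)

for k, p in enumerate(occupancy, start=1):
    print(f"space {k:2d}: in use {p:6.1%} of the time")
```

The fraction of time each space is in use falls steadily with distance from the door, which is the fading oil-stain pattern in the picture.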

Tri-modal distributions



This is an image of tennis courts at Wimbledon via earth.google.com. Examine the wear pattern in the grass along the baselines.

In most of these courts three distinct areas of wear are obvious. Two are on each corner of the baseline to deliver the ball into the service boxes. The third is a more centrally placed area near the baseline where the players position themselves to anticipate the return volley. Additional wear up and down the baseline is added as the games progress. These are examples of tri-modal distributions, with three prominent areas of most frequent use. Other areas of wear can also be seen inside each of the service boxes, perhaps indicating wear due to doubles play.

Two of these courts seem to have more bi-modal wear patterns indicating mainly the serving positions of the players. Are they not playing a baseline game on these courts? Are these courts used exclusively for service practice?

Thursday, June 7, 2007

Quincunx: a designer's view


A very interesting quincunx or Galton board illustration by Bob O'Keefe and Springer publishers from a few years back. Most obvious is the segregation of the colors. What could cause this? Perhaps a magnet? May the force be with you!
More subtle is the resulting distribution. It should look more bell-shaped rather than this triangular shape. You could get a triangular distribution from the sum of two uniform random variables, but it would require no pins to jostle the balls, no central hole for them to fall through, and a more symmetric supply of black and white balls. Something like this:



Imagine someone has loaded the balls in the equal-sided diamond shape shown. The balls are held, waiting to fall, above the v-shaped retainer. Suppose the entire retainer is removed, all at once. The balls fall straight down, through the slots, into the waiting bins below. The retainer is then replaced and refilled with balls. This is the image we see. This would achieve the resulting triangular distribution.

This, of course, still doesn't explain the segregation of the colors!
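The triangular shape from the diamond loading can be checked with a few lines of Python. The sketch assumes a diamond with n balls along each side (n = 5 is an arbitrary choice) and compares the share of balls in each column with the distribution of the sum of two discrete uniform variables. The function names are my own.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete variables."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def diamond_column_heights(n):
    """Balls in each vertical column of a diamond with n balls per side."""
    return [n - abs(c - (n - 1)) for c in range(2 * n - 1)]

n = 5                                    # hypothetical diamond size
uniform = [1 / n] * n                    # discrete uniform on n values
triangular = convolve(uniform, uniform)  # sum of two uniforms

heights = diamond_column_heights(n)      # balls that drop into each bin
bin_shares = [h / sum(heights) for h in heights]

print(heights)
print([round(p, 2) for p in triangular])
```

The bin shares match the convolution term by term, so dropping the diamond straight down does give a triangular distribution rather than a bell shape.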

Highway Spacing Proportional to Population Density


A map of the US Interstate System from Chris Yates via www.strangemaps.com
Note how the spacing tracks population density: the highways crowd together where the population is dense and spread apart where it is sparse. The distance from Daytona Beach to Tampa, Florida is only 132 miles. Compare that to the distance from San Antonio to Los Angeles: 1202 miles.

Galton's Quincunx





A Quincunx at the Instituto Butantan, São Paulo, Brazil.

The model was the invention of Sir Francis Galton, one of the English gentry scientists of the 19th century. Galton was a cousin of Charles Darwin, and like Darwin, devoted himself to scientific explorations. He made significant contributions to meteorology, forensic science, and statistics. In the 1870s Galton developed a device to study dispersion of random events. His device consists of an array of pins that allows lead shot, encased behind glass, to cascade through. As a ball of shot falls it strikes a pin and falls randomly to the right or to the left, each equally likely. From there, the shot falls to the next level of pins where it repeats this random walk downward. The shot is collected in separate bins at the bottom of the device. The pattern of shot accumulated in the bins illustrates the variability associated with this simulation of a binomial experiment. He called his device a quincunx, due to the arrangement of pins like the pips on the number five side of a die. An illustration of Galton’s original quincunx can be found at the Galton Institute.
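A quincunx is simple to simulate. The Python sketch below assumes 10 rows of pins and 5000 pieces of shot, both arbitrary choices, with each bounce going right or left with probability one half as described above. It compares the bin counts with the binomial expectation.

```python
import random
from math import comb

def drop_shot(rows, rng):
    """Number of rightward bounces for one piece of shot."""
    return sum(rng.random() < 0.5 for _ in range(rows))

def run_quincunx(rows, shots, seed=1877):
    """Bin counts after dropping `shots` pieces through `rows` rows of pins."""
    rng = random.Random(seed)            # arbitrary fixed seed
    bins = [0] * (rows + 1)
    for _ in range(shots):
        bins[drop_shot(rows, rng)] += 1
    return bins

rows, shots = 10, 5000                   # hypothetical board size and load
bins = run_quincunx(rows, shots)
expected = [shots * comb(rows, k) / 2 ** rows for k in range(rows + 1)]
mean_bin = sum(k * b for k, b in enumerate(bins)) / shots

for k, (b, e) in enumerate(zip(bins, expected)):
    print(f"bin {k:2d}: observed {b:5d}  expected {e:7.1f}")
```

The observed counts pile up in the middle bins and thin out toward the edges, following the binomial distribution with 10 trials and probability one half.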


A Watery Histogram




A view of the side of an office building in Washington, DC after a rain shower.


What to look for:


Notice the water on the wall, leaking from the downspout.


Statistical Concept: A histogram. The pattern of leaking water resembles the way a Galton board, also called Galton's Quincunx or binomial board, illustrates the binomial probability distribution. It is also a diffusion pattern, showing horizontal spread as the water seeps sideways and down into the porous brick wall.