Monday, November 3, 2014

The Curse

The Curse of Dimensionality addresses the difficulty of dealing with multivariate data. It warns us that, for a set of data in high dimensions, local neighborhoods are almost certainly empty of data points and neighborhoods that are not empty are almost certainly not local. 

In explaining this result, biostatistician Jeff Leek thought of clever a demonstration and got his graduate student Prasad Patil to build an interactive program to illustrate the Curse. In the screen shots above, samples of 100 points are randomly and uniformly generated in 1,2,3, and 4 dimensions in the unit cube. Subsets are examined in cubes with edge length of 1/2. In 1-dimension, the simulation contained 55% of the data in a line segment of length 1/2 (expected is, of course, 50%). In 2-dimensions, the simulation contained 31 % in a square with sides of length 1/2 (expected is 25%). In 3-dimensions, the simulation contained 14% in a cube with sides of length 1/2 (expected is 12.5%). And finally, the simulation contained just 4% of the data in a 4-D cube with sides of length 1/2 (expected is 6.25%). As the dimension grows, smaller and smaller percentages of the data can be found in regions with linear dimensions, that our low dimensional intuition tells us, are not small. Balancing the variance of a large neighborhood with the low bias of a small neighborhood is incredibly difficult in high dimensions.

This is a nice way to help visualize the Curse.

No comments: