Revision as of 12:46, 16 April 2014
Curse of Dimensionality
A slecture by ECE student Haonan Yu
Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Needle in a haystack
You lost your needle in a haystack. If the world you live in has just one dimension, then you are lucky because the haystack is a line segment. Finding the needle only requires starting from one end of the segment and heading towards the other end until you hit the needle:
Not bad, right? If you are a 2D creature, a needle in a haystack becomes a bit more trouble because it can be any point in a square:
There is much more space for the possible location: within the haystack, the needle can move along either the $ x $-axis or the $ y $-axis. Things get worse and worse as the number of dimensions grows. You can hardly imagine how creatures in high-dimensional worlds would search for something they have lost, because even in three dimensions this task turns out to be frustrating (as everyday experience shows):
The difficulty of finding the needle starting from a certain point in a high-dimensional haystack implies an important fact: points tend to be distributed more sparsely (i.e., they are more dissimilar to each other) in high-dimensional space, because the Euclidean distance between two points is large as long as they are far apart along at least one dimension, even if they are quite close along all the other dimensions. This fact can be illustrated quantitatively by the following example. In $ N $-dimensional space, suppose we want to count all the neighbors around a central point within a hypercube with edge length equal to 1. We also consider a bigger hypercube around the central point with edge length equal to $ 1+2\epsilon $, i.e., the smaller hypercube is nested inside the bigger one. As a special case, in 3D space we have the following scenario:
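The sparsity claim can be checked numerically. The sketch below (the helper name is illustrative, and the setup assumes points drawn uniformly from the unit hypercube) measures the average distance from each point to its nearest neighbor for a fixed number of points as the dimension grows:

```python
import random

def avg_nearest_neighbor_dist(n_points, n_dims, seed=0):
    """Average Euclidean distance from each point to its nearest
    neighbor, for points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(pts):
        # squared distances to every other point, take the closest
        nearest = min(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(pts) if j != i
        )
        total += nearest
    return total / n_points

for d in [1, 2, 10, 100]:
    print(f"dim={d:3d}  avg nearest-neighbor distance="
          f"{avg_nearest_neighbor_dist(100, d):.3f}")
```

With the number of points held fixed, the average nearest-neighbor distance keeps growing with the dimension: the same 100 points become ever more isolated from one another.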
Now, if we assume the points are evenly distributed in the space, then the ratio of the number of points in the smaller hypercube to the number of points in the bigger hypercube is given by
$ \left(\frac{1}{1+2\epsilon}\right)^N $
If we take $ \epsilon=0.01 $, then for $ N= $ 1, 2, 3, 5, 10, 100, and 1000 the ratio equals 0.98, 0.96, 0.94, 0.91, 0.82, 0.14, and $ 2.51\times 10^{-9} $, respectively. As $ N $ grows, the neighbors around a given point become so few that nearly all points end up inside the thin $ \epsilon $-shell between the two hypercubes. To maintain a certain number of neighbors, exponentially many new points would have to be added to the space.
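The shrinking ratio is easy to verify numerically; a minimal Python sketch (the function name is illustrative):

```python
def inner_fraction(n_dims, eps=0.01):
    """Fraction of evenly distributed points that fall inside the inner
    hypercube (edge length 1) relative to the outer one (edge 1 + 2*eps)."""
    return (1.0 / (1.0 + 2.0 * eps)) ** n_dims

for n in [1, 2, 3, 5, 10, 100, 1000]:
    print(f"N={n:5d}  ratio={inner_fraction(n):.2e}")
```

For $ N=1000 $ the ratio is on the order of $ 10^{-9} $: essentially every point lives in the shell, not in the inner cube.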
Curse of dimensionality
Math theorems often generalize from low dimensions to higher dimensions. If you know how to compute gradients for a function of two variables, then you can probably figure out how to do it for multi-variable functions. However, this is not the case when you try to generalize your two-dimensional density estimator to hundreds or thousands of dimensions. In theory, the estimator should still work in high dimensions, but only if you collect an amount of training data that grows exponentially with the dimension, to avoid sparsity. In practice, the available training data are often quite limited.
Your two-dimensional estimator stops working in high dimensions because the limited training data in your hand are not statistically significant due to their sparsity, i.e., they do not carry enough information to recover the true underlying distribution. The concept of statistical significance can be illustrated in the one-dimensional case as follows. Suppose you draw some points from a normal distribution and try to estimate its parameters. It makes a big difference whether you draw just 10 points or 100 points from the distribution.
In the figures above, the first shows the true underlying normal distribution, the second the density estimated from 10 points, and the third the density estimated from 100 points. Too few training points result in a poor estimate.
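The effect of sample size can be sketched in a few lines of Python; the helper names here are illustrative, with the standard library's `random.gauss` and `statistics` module standing in for a real density estimator:

```python
import random
import statistics

def estimate_normal(n, seed):
    """Estimate (mean, stdev) of N(0, 1) from n sampled points."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return statistics.mean(xs), statistics.stdev(xs)

def avg_mean_error(n, trials=200):
    """Average absolute error of the estimated mean over many repetitions.
    The true mean is 0, so the error is just the estimate's magnitude."""
    return sum(abs(estimate_normal(n, s)[0]) for s in range(trials)) / trials

print("avg |error| with  10 points:", avg_mean_error(10))
print("avg |error| with 100 points:", avg_mean_error(100))
```

Averaged over many repetitions, the estimate from 100 points is markedly closer to the truth than the one from 10 points, matching the figures.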
The curse of dimensionality describes the fact that you cannot possibly obtain an exponential number of training points in a high-dimensional space, much as if, after being cursed in the 1D world, you could not easily obtain more than 10 training points in the example above. This makes it very difficult to reliably estimate a density function. The same difficulty arises when seeking a decision boundary between classes with a discriminative classifier: you never know whether the boundary drawn to separate the training points merely separates some particular training samples from the whole population or actually generalizes well to other data. This failure to reliably estimate and generalize is called overfitting. Even with enough training data in a low-dimensional space, overfitting can still occur for various reasons (e.g., high model complexity, improperly collected training data, etc.). However, in a high-dimensional space, overfitting is much more likely to happen.
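As one concrete illustration of overfitting (a supplementary sketch, not a method from the text above), the 1-nearest-neighbor classifier below memorizes purely random labels perfectly on its training set, yet performs at chance on fresh data; all names are hypothetical:

```python
import random

def one_nn_predict(train_x, train_y, x):
    # 1-nearest-neighbor: return the label of the closest training point
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]

def accuracy(train_x, train_y, xs, ys):
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(ys)

rng = random.Random(0)
dim = 50  # high-dimensional inputs with purely random labels: nothing to learn
train_x = [[rng.random() for _ in range(dim)] for _ in range(30)]
train_y = [rng.randint(0, 1) for _ in range(30)]
test_x = [[rng.random() for _ in range(dim)] for _ in range(200)]
test_y = [rng.randint(0, 1) for _ in range(200)]

# Each training point is its own nearest neighbor, so training accuracy
# is perfect, while test accuracy stays near the 0.5 chance level.
print("train accuracy:", accuracy(train_x, train_y, train_x, train_y))
print("test accuracy: ", accuracy(train_x, train_y, test_x, test_y))
```

A perfect score on the training set here says nothing about the population: the classifier has simply memorized its 30 sparse high-dimensional samples.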
Questions and comments
If you have any questions, comments, etc., please post them on this page.