
By S. Subha Surya and R. Nandhini

Definition: The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) and that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

The curse of dimensionality is one buzzword for many problems. Data analysis tools based on learning principles infer knowledge, or information, from available learning samples. Obviously, the models built through learning are only valid in the range, or volume, of the space where learning data are available. Whatever the model or class of models, generalization to data that are very different from all learning points is impossible. In other words, reliable generalization is possible through interpolation but not through extrapolation.
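To make the interpolation-versus-extrapolation point concrete, here is a minimal sketch (assuming NumPy; the cubic degree and the sine target are arbitrary illustrative choices, not from the original): a polynomial fitted on samples confined to [0, 1] predicts reasonably inside that interval but fails at x = 2, far from all learning points.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50)                               # learning samples confined to [0, 1]
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)

model = np.poly1d(np.polyfit(x, y, deg=3))       # least-squares cubic fit

for t in (0.5, 0.9, 2.0):                        # 2.0 lies outside the learning volume
    print(f"x={t}: model {model(t):+.2f}, true {np.sin(2 * np.pi * t):+.2f}")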

The number of training samples
What would the probability density function look like if the dimensionality is very high? For a 7-dimensional space where each variable can take 20 possible values, the 7-d histogram contains 20^7 (about 1.3 billion) cells. Distributing a training set of any reasonable size (say, 1000 samples) among this many cells leaves virtually all the cells empty.
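A minimal sketch of this counting argument (assuming NumPy; the uniform random data and the grid sizes are illustrative choices): discretize each coordinate into 20 levels and count how many of the 20^d cells a 1000-point sample actually occupies.

import numpy as np

rng = np.random.default_rng(0)
n_samples, levels = 1000, 20

for d in (1, 2, 3, 7):
    points = rng.random((n_samples, d))
    bins = np.floor(points * levels).astype(int)        # cell index per coordinate
    occupied = len({tuple(row) for row in bins})        # number of distinct occupied cells
    total = levels ** d
    print(f"d={d}: {occupied} of {total} cells occupied ({occupied / total:.2e})")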

Accuracy and overfitting
In theory, the higher the dimensionality, the lower the error and the better the performance. However, in realistic pattern recognition (PR) problems, the opposite is often true. Why?
The assumption that the pdf behaves like a Gaussian is only approximately true. When increasing the dimensionality, we may be overfitting the training set. The problem: excellent performance on the training set, but poor performance on new data points that are in fact very close to the data within the training set.
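A minimal sketch of this effect (assuming NumPy and scikit-learn; the sample sizes, synthetic labels, and logistic-regression classifier are illustrative choices, not from the original): labels depend on only two informative features, and pure-noise features are added in growing batches. Train accuracy typically climbs toward 1.0 while test accuracy degrades, even though the underlying problem is unchanged.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_informative = 100, 2

for n_noise in (0, 10, 50, 200):
    X_inf = rng.normal(size=(n, n_informative))
    y = (X_inf.sum(axis=1) > 0).astype(int)      # labels depend only on informative features
    X = np.hstack([X_inf, rng.normal(size=(n, n_noise))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=50, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"{n_informative + n_noise:4d} features: "
          f"train acc {clf.score(X_tr, y_tr):.2f}, test acc {clf.score(X_te, y_te):.2f}")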

Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Noise removal: positive effect on query accuracy.

Applications:
Customer relationship management
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection

Given x ∈ R^N, the goal is to find a linear transformation matrix U of size N×K such that y = U^T x ∈ R^K, where K << N.
Idea: represent vectors using a set of basis vectors in an appropriate lower-dimensional space.
(1) Higher-dimensional space representation: x = a1 v1 + a2 v2 + ... + aN vN, where v1, ..., vN is a basis of the N-dimensional space.

(2) Lower-dimensional space representation: x ≈ b1 u1 + b2 u2 + ... + bK uK, where u1, ..., uK is a basis of the K-dimensional subspace.
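The text does not name how U is chosen, but PCA is the standard choice: take the columns of U to be the top-K eigenvectors of the sample covariance matrix. A minimal NumPy sketch (the random data here are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, K = 10, 2                               # original and reduced dimensionality
X = rng.normal(size=(500, N))              # rows are samples x in R^N

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)     # N x N sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
U = eigvecs[:, ::-1][:, :K]                # N x K matrix of top-K eigenvectors

Y = X_centered @ U                         # y = U^T x applied to every sample
print(U.shape, Y.shape)                    # (10, 2) (500, 2)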

From a theoretical point of view, increasing the number of features should lead to better performance (assuming independent features). In practice, however, the inclusion of more features often leads to worse performance (the curse of dimensionality): the number of training examples needed grows exponentially with the dimensionality.
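A one-line illustration of that exponential growth (assuming, hypothetically, that 10 samples per axis give adequate coverage in one dimension):

m = 10                                     # samples per axis, assumed adequate in 1-D
for d in (1, 2, 3, 7):
    print(f"d={d}: need about {m**d:,} samples for the same density")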
