
Revision as of 14:41, 29 March 2008

Support Vector Machines

1. Training requires solving a quadratic programming problem, which can be computationally intensive.

2. Finding the right kernel function, which SVMs need in order to make the data linearly separable, is a non-trivial task.

3. Choosing a kernel function and optimizing the cost function are done as separate steps (neural networks, where these are done simultaneously, claim this as an advantage over SVMs).
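As a small illustration of point 2 (a sketch, not from the original post; the `gamma`, `degree`, and `c` values are arbitrary assumptions that would normally be tuned), here are two common kernel functions an SVM might use:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian (RBF) kernel: an implicit mapping to an
    # infinite-dimensional feature space; gamma must be tuned.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def poly_kernel(x, y, degree=2, c=1.0):
    # Polynomial kernel of a given degree.
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

# Identical points have RBF similarity 1; it decays with distance.
print(rbf_kernel((1, 2), (1, 2)))   # 1.0
print(rbf_kernel((0, 0), (3, 4)))   # near 0
```

Which kernel (and which parameter values) works best depends on the data, which is exactly why kernel selection is non-trivial.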

Perceptron (with FLD)

1. Requires the data to be linearly separable. If the classification accuracy of the perceptron is poor, kernel methods (e.g., SVMs) might be required.

2. If the required class means and covariances are not known, they can be estimated from the training set, using parameter estimation methods such as the maximum likelihood estimate or the maximum a posteriori estimate.

3. Regularization might be required (when inverting the covariance matrix) to avoid overfitting.
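Points 2 and 3 can be sketched together (illustrative pure-Python code, not from the original post; the ridge value `lam` is an assumed hyperparameter): the Fisher direction is w = (S_w + λI)⁻¹(μ₁ − μ₀), where the λI term keeps the within-class scatter matrix invertible.

```python
def fld_direction_2d(mu0, mu1, sw, lam=1e-3):
    # Fisher direction w = inv(S_w + lam*I) (mu1 - mu0) for 2-D data.
    # The ridge term lam*I regularizes a (near-)singular scatter matrix.
    a, b = sw[0][0] + lam, sw[0][1]
    c, d = sw[1][0], sw[1][1] + lam
    det = a * d - b * c
    dm = (mu1[0] - mu0[0], mu1[1] - mu0[1])
    # Apply the inverse of the 2x2 matrix to the mean difference.
    return ((d * dm[0] - b * dm[1]) / det,
            (-c * dm[0] + a * dm[1]) / det)

# With identity within-class scatter, w is parallel to mu1 - mu0.
w = fld_direction_2d((0.0, 0.0), (2.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
print(w)
```

In practice the means and scatter matrix would themselves be maximum-likelihood estimates computed from the training set.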

KNN Classification

1. This classification method gives very good results when a large amount of training data is available, but storing and searching all of that data at test time is expensive.
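The method itself is simple enough to sketch in a few lines (illustrative pure-Python code with a made-up toy dataset; `k=3` is an assumed choice): classify a query point by majority vote among its k nearest training points.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of ((x, y), label) pairs. Sort by Euclidean
    # distance to the query and take a majority vote among the
    # k nearest neighbors.
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
         ((5, 5), 'b'), ((5, 6), 'b')]
print(knn_classify(train, (0.5, 0.5)))  # 'a'
```

Note that every training point is stored and scanned per query, which is why kNN benefits so much from large training sets yet scales poorly with them.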

From: yamini.nimmagadda.1 | Date: Fri, 07 Mar 2008 17:03:13 -0500 | Subject: Distance Metric Learning

1) For given input data, with no a priori knowledge, choosing an appropriate distance metric is very important. Distance metrics are used in density estimation methods (Parzen windows), clustering (k-means), instance-based classification methods (nearest neighbors), etc. Euclidean distance is used in most cases, but when the relationship between data points is non-linear, selecting a distance metric is a challenge. Here is a reference addressing this issue: [http://www.citeulike.org/user/sdvillal/article/673356]
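As a small illustration of how the metric choice matters (a pure-Python sketch, not from the referenced paper), the Mahalanobis distance generalizes Euclidean distance by weighting dimensions with the inverse covariance of the data; with an identity covariance it reduces exactly to the Euclidean case:

```python
import math

def euclidean(x, y):
    # Ordinary straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mahalanobis_2d(x, y, inv_cov):
    # Mahalanobis distance for 2-D points, given the inverse
    # covariance matrix inv_cov as [[a, b], [c, d]].
    d0, d1 = x[0] - y[0], x[1] - y[1]
    (a, b), (c, d) = inv_cov
    return math.sqrt(d0 * (a * d0 + b * d1) + d1 * (c * d0 + d * d1))

identity = [[1.0, 0.0], [0.0, 1.0]]
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))                 # 5.0
print(mahalanobis_2d(p, q, identity))  # 5.0 (same as Euclidean)
```

Metric learning methods effectively learn a non-identity `inv_cov` (or a more general transformation) from the data instead of assuming the Euclidean default.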

Memory Constraint

Several classifiers, such as SVMs and artificial neural networks, are memory-constrained. Training and testing with huge data (data with many features) often requires large memory resources. This problem can be reduced by dimension reduction [http://en.wikipedia.org/wiki/Dimension_reduction]. Dimension reduction may either improve or degrade the accuracy of the classifiers.

  • If the data has many dimensions, some dimensions may be redundant or contribute only noise. These are eliminated in dimension reduction. In this case, accuracy improves and the memory requirements are also reduced by a large amount.
  • If the data has few dimensions, and all of them contribute significantly to the accuracy of the classifier, there might be a loss in accuracy from reducing the number of dimensions. In this case, though the memory requirements are greatly reduced, the accuracy suffers considerably.

To maintain or improve accuracy under dimension reduction, it is better to cluster the data first. Cluster-preserving dimension reduction retains the structure present in the data prior to classification.
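The simplest instance of the first bullet above is eliminating near-constant dimensions outright (a pure-Python sketch with a toy dataset; the threshold `tau` is an assumed parameter, and real pipelines would use PCA or a cluster-preserving projection instead):

```python
def variance_threshold(data, tau=1e-8):
    # data: list of equal-length feature vectors. Drop dimensions
    # whose variance is below tau (near-constant, hence carrying
    # little information), and report which columns were kept.
    n, d = len(data), len(data[0])
    keep = []
    for j in range(d):
        col = [row[j] for row in data]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > tau:
            keep.append(j)
    reduced = [[row[j] for j in keep] for row in data]
    return reduced, keep

data = [[1.0, 5.0, 0.0],
        [2.0, 5.0, 0.0],
        [3.0, 5.0, 0.0]]
reduced, kept = variance_threshold(data)
print(kept)  # [0] -- only the first dimension varies
```

Here memory drops from three stored features per sample to one, with no information lost, since the discarded columns were constant.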
