m
Line 3: Line 3:
 
Please give your thoughts in the Forum discussion on this topic, "Separate Pages for Glossary Terms".
 
Please give your thoughts in the Forum discussion on this topic, "Separate Pages for Glossary Terms".
  
=== Glossary ===
+
= Glossary =
  
 
== [[Active Learning_Old Kiwi]] ==
 
== [[Active Learning_Old Kiwi]] ==
Line 85: Line 85:
 
Hilbert Space: A Hilbert space named after the German mathematician David Hilbert, generalizes the notion of Euclidean space in a way that extends methods of vector algebra from the two-dimensional plane and three-dimensional space to infinite-dimensional space. A Hilbert space is an inner product space which is complete under the induced metric. A Hilbert space is a special class of Banach space in the norm induced by the inner product, since the norm and the inner product both induce the same metric. Any finite-dimensional inner product space is a Hilbert space, but it is worth mentioning that some authors require the space to be infinite dimensional for it to be called a Hilbert space.
 
Hilbert Space: A Hilbert space named after the German mathematician David Hilbert, generalizes the notion of Euclidean space in a way that extends methods of vector algebra from the two-dimensional plane and three-dimensional space to infinite-dimensional space. A Hilbert space is an inner product space which is complete under the induced metric. A Hilbert space is a special class of Banach space in the norm induced by the inner product, since the norm and the inner product both induce the same metric. Any finite-dimensional inner product space is a Hilbert space, but it is worth mentioning that some authors require the space to be infinite dimensional for it to be called a Hilbert space.
  
#=== Reference ===
+
#== Reference ==
 
## Hilbert space definition and proposition
 
## Hilbert space definition and proposition
 
## Wikepedia-Hilbert space
 
## Wikepedia-Hilbert space

Revision as of 10:38, 28 February 2008

When inserting a new item into the glossary, put the term in *'s and []'s. Write a short description in the glossary. Then create a new page by clicking on the "?," and paste your short description there, too. You can link to any term in the glossary by simply putting the word in []'s from any other page. e.g. Glossary.

Please give your thoughts in the Forum discussion on this topic, "Separate Pages for Glossary Terms".

Glossary

Active Learning_Old Kiwi

Determining the optimal measurements to make under a cost constraint. A measurement is "optimal" when it is expected to give the most new information about the parameters of a model. Active learning is thus an application of decision theory to the process of learning. It is also known as experiment design.

Artificial Neural Networks: Ripley Chapter 5

Duda, Hart, and Stork Chapter 6

Nueral networks are a family of function approximation techniques, when the function is approximated,

$ f:x \rightarrow z $

is modeled as a composition of simple functions $ f_i $'s

$ f=f_n\bullet f_{n-1}\cdot\cdot\cdot f_1 $

Bayes Decision Rule_Old Kiwi

Bayes' decision rule creates an objective function which minimizes the probability of error (misclassification). This method assumes a known distribution for the feature vectors given each class.

Bayesian Parameter Estimation_Old Kiwi

Bayesian Parameter Estimation is a technique for parameter estimation which uses probability densities as estimates of the parameters instead of exact deterministic ones, as [MLE]? does.

Central Limit Theorem_Old Kiwi

The CLT explains why many distributions tend to be close to the normal distribution. The important point is that the "random variable" being observed should be the sum or mean of many independent random variables. (variables need not be iid)(See the PROOF )

Classes_Old Kiwi

The categories into which we are trying to classify the patterns.

Classification_Old Kiwi

Assigning a class to a measurement, or equivalently, identifying the probabilistic source of a measurement. The only statistical model that is needed is the conditional model of the class variable given the measurement. This conditional model can be obtained from a joint model or it can be learned directly. The former approach is generative since it models the measurements in each class. It is more work, but it can exploit more prior knowledge, needs less data, is more modular, and can handle missing or corrupted data. Methods include mixture models and Hidden Markov Models. The latter approach is discriminative since it focuses only on discriminating one class from another. It can be more efficient once trained and requires fewer modeling assumptions. Methods include logistic regression, generalized linear classifiers, and nearest-neighbor. See "Discriminative and Learning".

Clustering_Old Kiwi

Grouping similar objects in a multidimensional space. It is useful for constructing new features which are abstractions of the existing features. Some algorithms, like k-means, simply partition the feature space. Other algorithms, like single-link agglomeration, create nested partitionings which form a taxonomy.

Consistent Estimator_Old Kiwi

An unbiased estimator that converges more closely to the true value as the sample size increases is called a consistent estimator.

Curse of Dimensionality: Refers to the problem caused by exponential growth of hypervolume as a function of dimensionality. This term was coined by Richard Bellman in 1961.

Discriminant Analysis: Constructing new features via linear combination so that classification is easier. The notion of "easier" is quantified by Fisher's criterion, which applies exactly to Gaussian classes with equal variance and approximately to other models. Variants like Flexible discriminant analysis consider nonlinear combinations as well. See Duda&Hart, "Eigenfaces vs. Fisherfaces", and "Flexible Discriminant and Mixture Models".

Discriminant Function: One way to represent classifiers. Given classes w1, w2, .., wk and feature vector x, which is in n-dimensional space, the discriminant functions g1(x), g2(x), .., gk(x) where g#(x) maps n-dimensional space to real numbers, are used to make decisions. Decisions are defined as w# if g#(x) >= gj(x) for all j.

Distance: See metric.

Drawbacks of Euclidean distance (E.D): The drawback of the E.D approach is that it assumes that the sample points are distributed about the sample mean in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of a 'test point' belonging to the set to depend not only on the distance from the sample mean but also on the DIRECTION.

Dual Problem: Every linear programming problem, referred to as a primal problem, can be converted in a dual problem, which provides an upper bound to the optimal value of the primal problem. In matrix form, we express the primal problem as:

Maximize tex: c^Tx

Subject to tex: Ax \geq b, x\geq 0

The corresponding dual problem is:

Minimize tex: b^Ty

Subject to tex: A^Ty\geq c, y\geq 0

Where y is used instead of x as the variable vector.

There are two ideas fundamental to duality theory. One is the fact that the dual of a dual linear program is the original primal linear program. Additionally, every feasible solution for a linear program gives a bound on the optimal value of the objective function of its dual.

Article on Dual Problems

Euclidean Distance (ED): Is the distance between two points in a cartesian coordinate system that is "measured by a ruler". The Euclidean distance between the points X=[x1,x2,...xn]? and Y=[y1,y2,...yn]? is defined as dist(X,Y)=sum{(xi-yi)^2}.

Evidence: It a scale factor that insures that the posterior probabilities sum to one.

Feature Vector: Vector containing a numerical representation of features.

Features: The measurements which represent the data. The statistical model one uses is crucially dependent on the choice of features. Hence it is useful to consider alternative representations of the same measurements (i.e. different features). For example, different representations of the color values in an image. General techniques for finding new representations include discriminant analysis, principal component analysis, and clustering.

Fisher Linear Discriminant: Fisher's linear discriminant is a classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space. The projection maximizes the distance between the means of the two classes while minimizing the variance within each class. See Lecture 10 for detailed explanation.

Generalized Rayleigh Quotient: tex: J(\vec{w}) = \frac{\vec{w}^{\top}S_B \vec{w}}{\vec{w}^{\top}S_w \vec{w}}

Generic Property: In mathematics, properties that hold for typical examples are called generic properties. For instance, a generic property of a class of functions is one that is true of most of those functions, as in the statements, " A generic polynomial does not have a root at zero," or "A generic matrix is invertible." As another example, a generic property of a space is a property that holds at most points of the space, as in the statementm, "if f: M->N is a smooth function between smooth manifolds, then a generic point of N is not a critical value of f" (This is by Sard's theorem.)

Hilbert Space: A Hilbert space named after the German mathematician David Hilbert, generalizes the notion of Euclidean space in a way that extends methods of vector algebra from the two-dimensional plane and three-dimensional space to infinite-dimensional space. A Hilbert space is an inner product space which is complete under the induced metric. A Hilbert space is a special class of Banach space in the norm induced by the inner product, since the norm and the inner product both induce the same metric. Any finite-dimensional inner product space is a Hilbert space, but it is worth mentioning that some authors require the space to be infinite dimensional for it to be called a Hilbert space.

  1. == Reference ==
    1. Hilbert space definition and proposition
    2. Wikepedia-Hilbert space
    3. Mathworld-Hibert space

Histogram Density Estimation: Histogram Density Estimation is one of the primitive and easiest non-parametric density estimation methods. The given feature space is divided into equally-spaced bins or cells. The number of training feature samples that fall into each category is computed, (i.e) if 'n_i' is the number of feature samples that fall into the bin 'i', and 'V' is the volume of each cell (the volume of all the cells are equal because they are equi-spaced), the density is given by p(x) = n_i/V.

Informative Prior: In the Bayesian framework, the prior distribution of the variable is called an informative prior if it provide some specific information about the variable. Informative priors allows to incorporate a prior somewhat defined knowledge about the random variable under consideration. (See also: Uninformative Prior and Prior)

K-means : A parametric algorithm for clustering data into exactly k clusters. First, define some initial cluster parameters. Second, assign data points to clusters. Third, recompute better cluster parameters, given the data assignment. Iterate back to step two. It is related to the Expectation-Maximization algorithm for fitting a mixture of Gaussians.

Kernel Functions: These functions operate in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the mappings of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. Kernel functions have been introduced for sequence data, text, images, as well as vectors. (REF: wiki)

Lagrange Multipliers: In mathematical optimization problems, the method of Lagrange multipliers, named after Joseph Louis Lagrange, is a method for finding the extrema of a function of several variables subject to one or more constraints; it is the basic tool in nonlinear constrained optimization. Simply put, the technique is able to determine where on a particular set of points (such as a circle, sphere, or plane) a particular function is the smallest (or largest).

Likelihood Principle: The likelihood principle is a principle of statistical inference which asserts that all of the information in a sample is contained in the likelihood function. A likelihood function arises from a conditional probability distribution considered as a function of its second argument, holding the first fixed. Eg: consider a model which gives the probability density function of observable random variable X as a function of parameter O. Then for a specific value x of X, the function L(O|x) = P(X=x|O) is a likelihood function of O. It gives a measure of how likely any particular value of O is, if we know that X has a value x. Two likelihood functions are equivalent if one is a scalar multiple of the other.

Linear Discriminant Functions (LDF): Functions that are linear combinations of x.

g(x) = transpose(w)*x + w0

Where w is the weight vector and w0 is the bias as threshold. In the two category case g(x) defines the decisions surface where: w1, when g(x) > 0 w2, when g(x) < 0

LUT - Look-Up Table: In classification applications, LUT's can be used for comparing decisions, as long as memory is available. This method requires a discrete feature space that is reasonably small. (This method is typically not mentioned in textbooks.)

Mahalanobis Distance: Differs from Euclidean Distance in that it takes into account the correlations in the data (by using the covariance matrix). Mahalanobis distance is widely used in cluster analysis and other classification techniques. (Ref - Wikipedia)

Manhattan Distance: Also known as taxicab metric. The Manhattan distance between two points (X,Y) in a cartesian system is defined as sum{|xi-yi|}. This is equal to the length of all the paths connecting X and Y in a combination of horizontal and vertical segments.

MAP or Maximum A Posteriori: A parameter estimation heuristic that seeks parameter values that maximize the posterior density of the parameter. This implies that you have a prior distribution over the parameter, i.e. that you are Bayesian. MAP estimation has the highest chance of getting the parameter exactly right. But for predicting future data, it can be worse than Maximum Likelihood; predictive estimation is a better Bayesian method for that purpose. MAP is also not invariant to reparameterization.

Mercer's Condition Valid kernel functions are sometimes more precisely referred to as Mercer kernels , because they must satisfy Mercer's condition. for any given tex: g(\vec{x}) such that tex: \int{g(\vec{x})^{2}d\vec{x} is finite,

tex: \int{K(\vec{x},\vec{z})g(\vec{x})g(\vec{z})d\vec{x}d\vec{z}} \geq 0

We say that tex: K(\vec{x},\vec{z}) satisfies Mercy's condition if the above equation must hold for every tex: g(\vec{x}). A kernel function tex: K(\vec{x},\vec{z}) must be continuous, symmetric, and have a positive definite gram matrix. Such a tex: K(\vec{x},\vec{z}) means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function tex: K(\vec{x},\vec{z}).

  • Reference:

1. Mercer's Condition 2. Wikepedia-Mercer's Condition

Metric: Also known as distance. Is a function that defines the distance between two elements of a set. Examples of metrics are the Manhattan, Euclidean and, Mahalanobis.

[MLE]? or Maximum Likelihood Estimation: A parameter estimation heuristic that seeks parameter values that maximize the likelihood function for the parameter to calculate the best way of fitting a mathematical model to some data. This ignores the prior distribution and so is inconsistent with Bayesian probability theory, but it works reasonably well. (See Also: Lecture 7 and [BPE]?)

Non-parametric Model: Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term nonparametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.

Parameter Estimation: Density estimation when the density is assumed to be in a specific parametric family. Special cases include maximum likelihood, maximum a posteriori, unbiased estimation, and predictive estimation.

Parametric Classifiers: We find parametric decision boundaries to approximate true decision boundaries between classes. (Discussed in Lecture 9)

Parametric Model: A parametric model is a set of related mathematical equations in which alternative scenarios are defined by changing the assumed values of a set of fixed coefficients (parameters). In statistics, a parametric model is a parametrized family of probability distributions, one of which is presumed to describe the way a population is distributed.

Patterns: The items that we are trying to classify are vectors of these features.

Pazen Window : Parzen windows are very similar to K nearest neighborhoods(KNN). Both methods can generate very complex decision boundaries. The main difference is that instead of looking at the k closest points to a piece of training data, all points within a fixed distance are considered. In practice, one difference is that datasets with large gaps get treated much different. kNN will pick far away points, but it is possible that relatively small Parzen Windows will actually enclose zero points. In this case, only the priors can be used to classify. Unfortunately, this dataset had many holes in it at the fringes Thhe Parzen-window density estimate using n training samples and the window function tex: \pi is defined by

tex: p_n(x) = 1/n \sum_{i=1}^{n} 1/V_n \pi({x-x_i}/h_n)

The estimate tex: p_n(x) is an average of (window) functions. Usually the window function has its maximum at the origin and its values become smaller when we move further away from the origin. Then each training sample is contributing to the estimate in accordance with its distance from x.

  • Reference:

1. Pazen-window 2. Pazen-window density estimation 3. Pazen-window density estimation

Penalty Methods: In optimization, penalty methods are used to reformulate a constraint optimization problem into several unconstrained optimization problems. It can be shown that the solutions of these derived unconstrained optimization problems will converge to the solution of the original constrained problem. For example, a minimization problem with inequality constraints can be recast as as unconstrained problems by adding a penalty term which value is a function of the original constraints. Except for minor variations penalty methods are very similar to the so-called Barrier methods.

Perceptron: Maps an input to a single binary output value. (First introduced in Lecture 9)

Posterior: The probability, using prior knowledge, that a case belongs to a group.

Principal Component Analysis: Constructing new features which are the principal components of a data set. The principal components are random variables of maximal variance constructed from linear combinations of the input features. Equivalently, they are the projections onto the principal component axes, which are lines that minimize the average squared distance to each point in the data set. To ensure uniqueness, all of the principal component axes must be orthogonal. PCA is a maximum-likelihood technique for linear regression in the presence of Gaussian noise on both x and y. In some cases, PCA corresponds to a Fourier transform, such as the DCT used in JPEG image compression.

Prior: The probability that adjusts the classification of cases by using known information, hence the name prior. (See Also: Informative Prior and Uninformative Prior)

Quadratic Programming Problem: A linearly constrained optimization problem with a quadratic objective function is called a quadratic program (QP). Quadratic programming is often considered as a discipline in and of itself owing to its vast number of applications.

Reinforcement learning: A reinforcement learning machine learns to perform actions which will maximize rewards (or minimize punishments).

Slack Variable: In linear programming , a slack variable is referred to as an additional variable that has been introduced to the optimization problem to turn a inequality constraint into an equality constraint. This procedure is often performed to formulate linear optimization problems into a form which can be efficiently solve using a fast algorithm: the simplex algorithm. As a result a slack variable is always positive since this is a requirement for variables in the simplex method.

Supervised learning: Imagine an organism or machine which experiences a series of inputs from different sensors: x1, x2, x3, x4, . . . The machine is also given desired outputs y1, y2, . . ., and its goal is to learn to produce the correct output given a new input x. (See also: Unsupervised Learning)

Support Vector Machines or SVM: Algorithms which define optimal hyperplanes as the hyperplane whose distance from the nearest training patters (the support vectors) is maximum. http://en.wikipedia.org/wiki/Support_vector_machine

Taxicab Distance: see Manhattan Distance.

Testing: When we present a trained recognition system with an unknown pattern and ask it to classify it, we call that testing.

Training: When we present a pattern recognition with a set of classified patterns so that it can learn the characteristics of the set, we call it training.

Unbiased Estimator: An estimator is said to be unbiased if and only if the expected value (average) of its estimate is the true value of the thing estimated.

Uninformative Prior: An uninformative prior gives very little, a more vague or a general information about the variable. It is important to realize that the term uninformative does not mean no information. (See also: Informative Prior and Prior)

Unsupervised learning: The goal of the machine is to build a model of x that can be used for reasoning, decision making, predicting things, communicating etc. (See also: Supervised Learning)

Alumni Liaison

Abstract algebra continues the conceptual developments of linear algebra, on an even grander scale.

Dr. Paul Garrett