Revision as of 09:10, 21 April 2010 by Gmodeloh (Talk | contribs)

In Lecture 11, we continued our discussion of Parametric Density Estimation techniques. We discussed the Maximum Likelihood Estimation (MLE) method and look at a couple of 1-dimension examples for case when feature in dataset follows Gaussian distribution. First, we looked at case where mean parameter was unknown, but variance parameter is known. Then we followed with another example where both mean and variance where unknown. Finally, we looked at the slight "bias" problem when calculating the variance.

Below are the notes from lecture.

Maximum Likelihood Estimation (MLE)


General Principles

Given vague knowledge about a situation and some training data (i.e. feature vector values for which the class is known) $ \vec{x}_l, \qquad l=1,\ldots,\text{hopefully large number} $

we want to estimate $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $

  1. Assume a parameter form for $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $
  2. Use training data to estimate the parameters of $ p(\vec{x}|\omega_i) $, e.g. if you assume $ p(\vec{x}|\omega_i)=\mathcal{N}(\mu,\Sigma) $, then need to estimate $ \mu $ and $ \Sigma $.
  3. Hope that as cardinality of training set increases, estimate for parameters converges to true parameters.


Let $ \mathcal{D}_i $ be the training set for class $ \omega_i, \qquad i=1,\ldots,k $. Assume elements of $ \mathcal{D}_i $ are i.i.d. with $ p(\vec{x}|\omega_i) $. Choose a parametric form for $ p(\vec{x}|\omega_i) $.

$ p(\vec{x}|\omega_i, \vec{\Theta}_i) $ where $ \vec{\Theta}_i $ are the parameters.


How to estimate $ \vec{\Theta}_i $?

  • Consider each class separately
    • $ \vec{\Theta}_i \to \vec{\Theta} $
    • $ \mathcal{D}_i \to \mathcal{D} $
    • Let $ N = \vert \mathcal{D}_i \vert $
    • samples $ \vec{x}_1,\vec{x}_2,\ldots,\vec{x}_N $ are independent

So $ p(\mathcal{D}|\vec{\Theta})=\prod_{j=1}^N(p(x_j|\vec{\Theta})) $

Definition: The maximum likelihood estimate of $ \vec{\Theta} $ is the value $ \hat{\Theta} $ that maximizes $ p(\mathcal{D}|\vec{\Theta}) $.

Observe. $ \hat{\Theta} $ also maximizes $ l(\Theta)=\text{ln} p(\mathcal{D}|\vec{\Theta}) = \sum_{l=1}^N \text{ln} p(\vec{x}_l|\vec{\Theta}) $ where ln is the logarithm base of natural number $ e $.

$ l(\Theta) $ called log likelihood.

$ \hat{\Theta}=\arg\max_\vec{\Theta}l(\vec{\Theta}) $

If $ p(\mathcal{D}|\vec{\Theta}) $ is a derivative function of $ \hat{\Theta} \in \mathbb{R}^p $, the max can be obtained by $ \bigtriangledown_\Theta $ descent. Data does not have to be normally distributed to use this technique.


--Gmodeloh 13:37, 21 April 2010 (UTC)

Alumni Liaison

Basic linear algebra uncovers and clarifies very important geometry and algebra.

Dr. Paul Garrett