Revision as of 08:17, 11 May 2010 by Mboutin (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin

Maximum Likelihood Estimation (MLE)


General Principles

Given vague knowledge about a situation and some training data (i.e. feature vector values for which the class is known) $ \vec{x}_l, \qquad l=1,\ldots,\text{hopefully large number} $

we want to estimate $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $

  1. Assume a parameter form for $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $
  2. Use training data to estimate the parameters of $ p(\vec{x}|\omega_i) $, e.g. if you assume $ p(\vec{x}|\omega_i)=\mathcal{N}(\mu,\Sigma) $, then need to estimate $ \mu $ and $ \Sigma $.
  3. Hope that as cardinality of training set increases, estimate for parameters converges to true parameters.


Let $ \mathcal{D}_i $ be the training set for class $ \omega_i, \qquad i=1,\ldots,k $. Assume elements of $ \mathcal{D}_i $ are i.i.d. with $ p(\vec{x}|\omega_i) $. Choose a parametric form for $ p(\vec{x}|\omega_i) $.

$ p(\vec{x}|\omega_i, \vec{\Theta}_i) $ where $ \vec{\Theta}_i $ are the parameters.


How to estimate $ \vec{\Theta}_i $?

  • Consider each class separately
    • $ \vec{\Theta}_i \to \vec{\Theta} $
    • $ \mathcal{D}_i \to \mathcal{D} $
    • Let $ N = \vert \mathcal{D}_i \vert $
    • samples $ \vec{x}_1,\vec{x}_2,\ldots,\vec{x}_N $ are independent

So $ p(\mathcal{D}|\vec{\Theta})=\prod_{j=1}^N(p(x_j|\vec{\Theta})) $

Definition: The maximum likelihood estimate of $ \vec{\Theta} $ is the value $ \hat{\Theta} $ that maximizes $ p(\mathcal{D}|\vec{\Theta}) $.

Observe. $ \hat{\Theta} $ also maximizes $ l(\Theta)=\text{ln} p(\mathcal{D}|\vec{\Theta}) = \sum_{l=1}^N \text{ln} p(\vec{x}_l|\vec{\Theta}) $ where ln is the logarithm base of natural number $ e $.

$ l(\Theta) $ called log likelihood.

$ \hat{\Theta}=\arg\max_\vec{\Theta}l(\vec{\Theta}) $

If $ p(\mathcal{D}|\vec{\Theta}) $ is a derivative function of $ \hat{\Theta} \in \mathbb{R}^p $, the max can be obtained by $ \bigtriangledown_\Theta $ descent. Data does not have to be normally distributed to use this technique.


Example 1: 1D Gaussian with unknown mean

$ p(\vec{x}|\vec{\Theta})=\mathcal{N}(\mu,\Sigma) $ where $ \mu $ is unknown and $ \Sigma $ is known.

Then $ \vec{\Theta}=\mu $. We have $ p(\mathcal{D}|\mu)=\prod_{l=1}^N(p(x_l|\mu)) $

$ \text{ln }p(\mathcal{D}|\mu)=\sum_{l=1}^N\text{ln }p(x_l|\mu) = \sum_{l=1}^N\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu) $

$ \Rightarrow \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \bigtriangledown_{\mu}(\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu)) $

$ \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \sum^{-1}(x_l - \mu) $

Looking for $ \hat{\mu} $ such that $ \bigtriangledown l(\mu)|_{\hat{\mu}}=0 $

$ \Longleftrightarrow \sum_{l=1}^N \sum^{-1}(x_l - \mu) = 0 $

$ \Longleftrightarrow \sum_{l=1}^N x_l - \sum_{l=1}^N \hat{\mu} = 0 $

$ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N x_l $

M.L. estimate for mean is the sample mean

Example 2: 1D Gaussian with unknown mean and variance

$ \hat{\Theta}=(\Theta_1,\Theta_2) \text{where} \Theta_1=\mu \text{and} \Theta_2=\Sigma=\sigma^2 $

$ \ln p(x_l|\mu,\sigma^2)=\frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2 $

$ \Rightarrow \ln p(\mathcal{D}|\mu,\sigma^2)=\sum_{l=1}^N \ln p(x_l|\mu,\sigma^2) = \sum_{l=1}^N \frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2 $

$ \bigtriangledown_{(\mu,\sigma^2)}\ln p(\mathcal{D}|\mu,\sigma^2)= \sum_{l=1}^N \left( \begin{array}{c} \frac{\partial}{\partial\mu} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \frac{\partial}{\partial\sigma^2} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \end{array}\right) = \sum_{l=1}^N \left( \begin{array}{c} \frac{1}{\sigma^2}(x_l-\mu) \\ \frac{-1}{\sigma^2}+\frac{(x_l-\mu)^2}{2\sigma^2} \end{array}\right) $

$ = \left( \begin{array}{c} \frac{1}{\sigma^2}(\sum_{l=1}^N x_l - N_{\mu}) \\ \frac{1}{2\sigma^2} (\sum_{l=1}^N \frac{(x_l-\mu)^2}{\sigma^2} - N_{\mu}) \\ \end{array}\right)= \left( \begin{array}{c} 0 \\ 0 \\ \end{array}\right) $

So we get $ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N \vec{x}_l $ and $ \hat{\Sigma}=\frac{1}{N}\sum_{l=1}^N(\vec{x_l}-\hat{\mu})(\vec{x}_l-\hat{\mu})^T $

Slight problem "bias" of $ \hat{\Sigma} $

The expected value of $ \hat{\Sigma} $ over all data sets of size $ N $ is not the true variance $ \Sigma $.

Claim: $ \mathbb{E}(\hat{\Sigma})=\frac{N-1}{N}\Sigma $

Proof in 1D:

$ \mathbb{E}(\hat{\Sigma})=\mathbb{E}(\frac{1}{N}\Sigma_{l=1}^N (x_l-\hat{\mu})^2) = \frac{1}{N}\Sigma_{l=1}^N \mathbb{E}(x_l x_l - 2x_l\hat{\mu} + \hat{\mu}\hat{\mu}) $

where $ \hat{\mu}=\Sigma_{m=1}^N \frac{x_m}{N} $

$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N(\mathbb{E}(x_l x_l) - 2\mathbb{E}(x_l \Sigma_{m=1}^N \frac{x_m}{N}) + \mathbb{E}(\Sigma_{m=1}^N \frac{x_m}{N} \Sigma_{s=1}^N \frac{x_s}{N})) $

$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N [\mathbb{E}(x_l x_l) - \frac{2}{N}\Sigma_{m=1}^N \mathbb{E}(x_l x_m) + \frac{1}{N^2} \Sigma_{m,s=1}^N \mathbb{E}(x_m x_s)] $

by independence, for $ l \neq m $

$ \mathbb{E}(x_l x_m) = \mathbb{E}(x_l)\mathbb{E}(x_m) = \mu \mu = \mu^2 $

if $ l = m \Rightarrow \mathbb{E}(x_l x_l) = \mathbb{E}(x_l^2) $ and similarly when $ l = m $

Write $ \mathbb{E}(x_l^2) = \mathbb{E}(x^2) $

$ \Rightarrow \mathbb{E}(\hat{\Sigma}) = \frac{1}{N} \Sigma_{l=1}^N [ \mathbb{E}(x^2) - \frac{2}{N} \Sigma_{l \neq m} \mu^2 - \frac{2}{N} \Sigma_{l=m} \mathbb{E}(x^2) + \frac{1}{N^2} \Sigma_{m \neq s} \mu^2 + \frac{1}{N^2} \Sigma_{m = s} \mathbb{E}(x^2)] $

$ = \frac{1}{N} \Sigma_{l=1}^N (1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2 $

$ = \frac{1}{N} \cdot N ((1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2) $

$ = \frac{N-1}{N}(\mathbb{E}(x^2)-\mu^2) = \frac{N-1}{N}(\mathbb{E}((x-\mu)^2)) = \frac{N-1}{N} \sigma^2 $


End of lecture 11

--Gmodeloh 13:37, 21 April 2010 (UTC)


Back to ECE662 Spring 2010, Prof. Boutin

Alumni Liaison

Questions/answers with a recent ECE grad

Ryne Rayburn