=Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin=
== Maximum Likelihood Estimation (MLE) ==
----
'''General Principles'''
Given vague knowledge about a situation and some training data (i.e. feature vector values for which the class is known) $ \vec{x}_l, \qquad l=1,\ldots,\text{hopefully large number} $
we want to estimate $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $
- Assume a parameter form for $ p(\vec{x}|\omega_i), \qquad i=1,\ldots,k $
- Use training data to estimate the parameters of $ p(\vec{x}|\omega_i) $, e.g. if you assume $ p(\vec{x}|\omega_i)=\mathcal{N}(\mu,\Sigma) $, then need to estimate $ \mu $ and $ \Sigma $.
- Hope that, as the cardinality of the training set increases, the estimate of the parameters converges to the true parameters (see the small simulation below).
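As a quick illustration of the last point, here is a small simulation (a sketch, not from the lecture, assuming 1D Gaussian data and NumPy; the true parameters and sample sizes are made up): the ML estimates approach the true parameters as the training set grows.

<pre>
import numpy as np

rng = np.random.default_rng(5)
true_mu, true_sigma = 2.0, 1.5

# As the training set grows, the ML estimates approach the true parameters.
for N in (10, 100, 1000, 10000):
    x = rng.normal(true_mu, true_sigma, size=N)
    mu_hat = x.mean()                           # ML estimate of the mean
    sigma2_hat = np.mean((x - mu_hat) ** 2)     # ML estimate of the variance (1/N)
    print(N, mu_hat, sigma2_hat)
</pre>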
Let $ \mathcal{D}_i $ be the training set for class $ \omega_i, \qquad i=1,\ldots,k $. Assume elements of $ \mathcal{D}_i $ are i.i.d. with $ p(\vec{x}|\omega_i) $. Choose a parametric form for $ p(\vec{x}|\omega_i) $.
$ p(\vec{x}|\omega_i, \vec{\Theta}_i) $ where $ \vec{\Theta}_i $ are the parameters.
How to estimate $ \vec{\Theta}_i $?
- Consider each class separately
- $ \vec{\Theta}_i \to \vec{\Theta} $
- $ \mathcal{D}_i \to \mathcal{D} $
- Let $ N = \vert \mathcal{D} \vert $
- samples $ \vec{x}_1,\vec{x}_2,\ldots,\vec{x}_N $ are independent
So $ p(\mathcal{D}|\vec{\Theta})=\prod_{l=1}^N p(\vec{x}_l|\vec{\Theta}) $
Definition: The maximum likelihood estimate of $ \vec{\Theta} $ is the value $ \hat{\Theta} $ that maximizes $ p(\mathcal{D}|\vec{\Theta}) $.
Observe: $ \hat{\Theta} $ also maximizes $ l(\vec{\Theta})=\ln p(\mathcal{D}|\vec{\Theta}) = \sum_{l=1}^N \ln p(\vec{x}_l|\vec{\Theta}) $, where $ \ln $ denotes the natural logarithm (base $ e $).
$ l(\vec{\Theta}) $ is called the log-likelihood.
$ \hat{\Theta}=\arg\max_\vec{\Theta}l(\vec{\Theta}) $
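A small numerical aside (a sketch, not from the lecture, assuming SciPy's <code>norm.pdf</code> for the Gaussian density and made-up toy data): for large $ N $ the raw product $ p(\mathcal{D}|\vec{\Theta}) $ underflows in floating point, which is one practical reason to work with the log-likelihood; both have the same maximizer.

<pre>
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)    # toy training set D

mu, sigma = 0.0, 1.0                             # a candidate parameter value
p = norm.pdf(x, loc=mu, scale=sigma)             # p(x_l | Theta) for each sample

likelihood = np.prod(p)                          # product underflows to 0.0 here
log_likelihood = np.sum(np.log(p))               # stable; same arg max over Theta
print(likelihood, log_likelihood)
</pre>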
If $ p(\mathcal{D}|\vec{\Theta}) $ is a differentiable function of $ \vec{\Theta} \in \mathbb{R}^p $, the maximum can be found by setting $ \bigtriangledown_{\vec{\Theta}} l(\vec{\Theta}) = 0 $ (or, numerically, by gradient ascent on $ l $). The data does not have to be normally distributed to use this technique.
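Below is a minimal numerical sketch of this principle (assuming a 1D Gaussian model and SciPy's general-purpose optimizer; the parameterization and toy data are illustrative only): the ML estimate is obtained by minimizing the negative log-likelihood with a gradient-based method.

<pre>
import numpy as np
from scipy.optimize import minimize

# Toy 1D training data; in practice these would be the samples in D.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(theta, data):
    """Negative log-likelihood of a 1D Gaussian, theta = (mu, log sigma^2)."""
    mu, log_var = theta
    var = np.exp(log_var)          # parameterize by log-variance so that var > 0
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (data - mu) ** 2 / var)

# Maximize l(Theta) by minimizing -l(Theta) with a gradient-based optimizer.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_hat, var_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, var_hat)             # close to the sample mean and ML variance
</pre>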
'''Example 1: 1D Gaussian with unknown mean'''
$ p(\vec{x}|\vec{\Theta})=\mathcal{N}(\mu,\Sigma) $ where $ \mu $ is unknown and $ \Sigma $ is known.
Then $ \vec{\Theta}=\mu $. We have $ p(\mathcal{D}|\mu)=\prod_{l=1}^N(p(x_l|\mu)) $
$ \ln p(\mathcal{D}|\mu)=\sum_{l=1}^N\ln p(x_l|\mu) = \sum_{l=1}^N\left[\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu)\right] $
$ \Rightarrow \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \bigtriangledown_{\mu}\left(\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu)\right) $
$ \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \Sigma^{-1}(x_l - \mu) $
Looking for $ \hat{\mu} $ such that $ \bigtriangledown l(\mu)|_{\hat{\mu}}=0 $
$ \Longleftrightarrow \sum_{l=1}^N \Sigma^{-1}(x_l - \hat{\mu}) = 0 $
$ \Longleftrightarrow \sum_{l=1}^N x_l - \sum_{l=1}^N \hat{\mu} = 0 $
$ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N x_l $
The M.L. estimate of the mean is the sample mean.
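A quick numerical check of this result (a sketch with made-up data; the known variance $ \sigma^2 $ and the grid range are arbitrary): maximizing the log-likelihood over a grid of candidate means recovers the sample mean.

<pre>
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                                   # known variance
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=1000)

def log_likelihood(mu):
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

grid = np.linspace(2.0, 4.0, 2001)             # candidate values of mu
mu_star = grid[np.argmax([log_likelihood(m) for m in grid])]
print(x.mean(), mu_star)                       # grid maximizer matches sample mean
</pre>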
'''Example 2: 1D Gaussian with unknown mean and variance'''
$ \vec{\Theta}=(\Theta_1,\Theta_2) $ where $ \Theta_1=\mu $ and $ \Theta_2=\Sigma=\sigma^2 $
$ \ln p(x_l|\mu,\sigma^2)=\frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2 $
$ \Rightarrow \ln p(\mathcal{D}|\mu,\sigma^2)=\sum_{l=1}^N \ln p(x_l|\mu,\sigma^2) = \sum_{l=1}^N \left[\frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2\right] $
$ \bigtriangledown_{(\mu,\sigma^2)}\ln p(\mathcal{D}|\mu,\sigma^2)= \sum_{l=1}^N \left( \begin{array}{c} \frac{\partial}{\partial\mu} \ln p(x_l|\mu,\sigma^2) \\ \frac{\partial}{\partial\sigma^2} \ln p(x_l|\mu,\sigma^2) \\ \end{array}\right) = \sum_{l=1}^N \left( \begin{array}{c} \frac{1}{\sigma^2}(x_l-\mu) \\ \frac{-1}{2\sigma^2}+\frac{(x_l-\mu)^2}{2\sigma^4} \end{array}\right) $
$ = \left( \begin{array}{c} \frac{1}{\sigma^2}\left(\sum_{l=1}^N x_l - N\mu\right) \\ \frac{1}{2\sigma^2} \left(\sum_{l=1}^N \frac{(x_l-\mu)^2}{\sigma^2} - N\right) \\ \end{array}\right)= \left( \begin{array}{c} 0 \\ 0 \\ \end{array}\right) $
So we get $ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N x_l $ and $ \hat{\sigma}^2=\frac{1}{N}\sum_{l=1}^N(x_l-\hat{\mu})^2 $; in the general vector case, $ \hat{\mu}=\frac{1}{N}\sum_{l=1}^N \vec{x}_l $ and $ \hat{\Sigma}=\frac{1}{N}\sum_{l=1}^N(\vec{x}_l-\hat{\mu})(\vec{x}_l-\hat{\mu})^T $.
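These formulas translate directly into code. The following sketch (assuming NumPy and an invented 2D toy distribution) computes the ML estimates of the mean vector and covariance matrix from a sample matrix whose rows are the $ \vec{x}_l $.

<pre>
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=500)  # rows are samples x_l

N = X.shape[0]
mu_hat = X.mean(axis=0)                        # (1/N) sum_l x_l
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N          # (1/N) sum_l (x_l - mu)(x_l - mu)^T
print(mu_hat)
print(Sigma_hat)
</pre>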
Slight problem: the "bias" of $ \hat{\Sigma} $.
The expected value of $ \hat{\Sigma} $ over all data sets of size $ N $ is not the true variance $ \Sigma $.
Claim: $ \mathbb{E}(\hat{\Sigma})=\frac{N-1}{N}\Sigma $
Proof in 1D:
$ \mathbb{E}(\hat{\Sigma})=\mathbb{E}\left(\frac{1}{N}\sum_{l=1}^N (x_l-\hat{\mu})^2\right) = \frac{1}{N}\sum_{l=1}^N \mathbb{E}(x_l x_l - 2x_l\hat{\mu} + \hat{\mu}\hat{\mu}) $
where $ \hat{\mu}=\sum_{m=1}^N \frac{x_m}{N} $
$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \sum_{l=1}^N\left(\mathbb{E}(x_l x_l) - 2\,\mathbb{E}\left(x_l \sum_{m=1}^N \frac{x_m}{N}\right) + \mathbb{E}\left(\sum_{m=1}^N \frac{x_m}{N} \sum_{s=1}^N \frac{x_s}{N}\right)\right) $
$ \mathbb{E}(\hat{\Sigma})= \frac{1}{N} \sum_{l=1}^N \left[\mathbb{E}(x_l x_l) - \frac{2}{N}\sum_{m=1}^N \mathbb{E}(x_l x_m) + \frac{1}{N^2} \sum_{m,s=1}^N \mathbb{E}(x_m x_s)\right] $
by independence, for $ l \neq m $
$ \mathbb{E}(x_l x_m) = \mathbb{E}(x_l)\mathbb{E}(x_m) = \mu \mu = \mu^2 $
if $ l = m $, then $ \mathbb{E}(x_l x_m) = \mathbb{E}(x_l^2) $; similarly, $ \mathbb{E}(x_m x_s) = \mathbb{E}(x_m^2) $ when $ m = s $.
Write $ \mathbb{E}(x_l^2) = \mathbb{E}(x^2) $
$ \Rightarrow \mathbb{E}(\hat{\Sigma}) = \frac{1}{N} \sum_{l=1}^N \left[ \mathbb{E}(x^2) - \frac{2}{N} \sum_{m \neq l} \mu^2 - \frac{2}{N} \sum_{m = l} \mathbb{E}(x^2) + \frac{1}{N^2} \sum_{m \neq s} \mu^2 + \frac{1}{N^2} \sum_{m = s} \mathbb{E}(x^2)\right] $
$ = \frac{1}{N} \sum_{l=1}^N \left[\left(1 - \frac{1}{N}\right)\mathbb{E}(x^2) + \left(\frac{1}{N} - 1\right)\mu^2\right] $
$ = \frac{1}{N} \cdot N \left(\left(1 - \frac{1}{N}\right)\mathbb{E}(x^2) + \left(\frac{1}{N} - 1\right)\mu^2\right) $
$ = \frac{N-1}{N}(\mathbb{E}(x^2)-\mu^2) = \frac{N-1}{N}(\mathbb{E}((x-\mu)^2)) = \frac{N-1}{N} \sigma^2 $
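The bias is easy to see numerically. A small simulation (a sketch with arbitrary values of $ N $, $ \sigma^2 $, and trial count) averages the ML variance estimate over many independent data sets of size $ N $ and compares it to $ \frac{N-1}{N}\sigma^2 $.

<pre>
import numpy as np

rng = np.random.default_rng(3)
N, trials = 5, 100000
true_var = 4.0

# Draw many independent data sets of size N and average the ML variance estimate.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, N))
var_hat = samples.var(axis=1, ddof=0)          # ddof=0 is the 1/N (ML) estimator
print(var_hat.mean())                          # approximately (N-1)/N * true_var
print((N - 1) / N * true_var)                  # = 3.2 here
</pre>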
End of lecture 11
--Gmodeloh 13:37, 21 April 2010 (UTC)