Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin

Maximum Likelihood Estimation (MLE)

General Principles

Given vague knowledge about a situation and some training data (i.e. feature vector values for which the class is known) $\vec{x}_l, \qquad l=1,\ldots,\text{hopefully large number}$

we want to estimate $p(\vec{x}|\omega_i), \qquad i=1,\ldots,k$

Assume a parameter form for $p(\vec{x}|\omega_i), \qquad i=1,\ldots,k$
Use training data to estimate the parameters of $p(\vec{x}|\omega_i)$ , e.g. if you assume $p(\vec{x}|\omega_i)=\mathcal{N}(\mu,\Sigma)$ , then need to estimate $\mu$ and $\Sigma$ .
Hope that as cardinality of training set increases, estimate for parameters converges to true parameters.

Let $\mathcal{D}_i$ be the training set for class $\omega_i, \qquad i=1,\ldots,k$ . Assume elements of $\mathcal{D}_i$ are i.i.d. with $p(\vec{x}|\omega_i)$ . Choose a parametric form for $p(\vec{x}|\omega_i)$ .

$p(\vec{x}|\omega_i, \vec{\Theta}_i)$ where $\vec{\Theta}_i$ are the parameters.

How to estimate $\vec{\Theta}_i$ ?

Consider each class separately
- $\vec{\Theta}_i \to \vec{\Theta}$
- $\mathcal{D}_i \to \mathcal{D}$
- Let $N = \vert \mathcal{D}_i \vert$
- samples $\vec{x}_1,\vec{x}_2,\ldots,\vec{x}_N$ are independent

So $p(\mathcal{D}|\vec{\Theta})=\prod_{j=1}^N(p(x_j|\vec{\Theta}))$

Definition: The maximum likelihood estimate of $\vec{\Theta}$ is the value $\hat{\Theta}$ that maximizes $p(\mathcal{D}|\vec{\Theta})$ .

Observe. $\hat{\Theta}$ also maximizes $l(\Theta)=\text{ln} p(\mathcal{D}|\vec{\Theta}) = \sum_{l=1}^N \text{ln} p(\vec{x}_l|\vec{\Theta})$ where ln is the logarithm base of natural number $$ e $$ .

$l(\Theta)$ called log likelihood.

$\hat{\Theta}=\arg\max_\vec{\Theta}l(\vec{\Theta})$

If $p(\mathcal{D}|\vec{\Theta})$ is a derivative function of $\hat{\Theta} \in \mathbb{R}^p$ , the max can be obtained by $\bigtriangledown_\Theta$ descent. Data does not have to be normally distributed to use this technique.

Example 1: 1D Gaussian with unknown mean

$p(\vec{x}|\vec{\Theta})=\mathcal{N}(\mu,\Sigma)$ where $\mu$ is unknown and $\Sigma$ is known.

Then $\vec{\Theta}=\mu$ . We have $p(\mathcal{D}|\mu)=\prod_{l=1}^N(p(x_l|\mu))$

$\text{ln }p(\mathcal{D}|\mu)=\sum_{l=1}^N\text{ln }p(x_l|\mu) = \sum_{l=1}^N\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu)$

$\Rightarrow \bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \bigtriangledown_{\mu}(\frac{-1}{2}\ln(2\pi|\Sigma|)-\frac{1}{2}(x_l-\mu)\Sigma^{-1}(x_l - \mu))$

$\bigtriangledown_{\mu} \ln p(\mathcal{D}|\mu)= \sum_{l=1}^N \sum^{-1}(x_l - \mu)$

Looking for $\hat{\mu}$ such that $\bigtriangledown l(\mu)|_{\hat{\mu}}=0$

$\Longleftrightarrow \sum_{l=1}^N \sum^{-1}(x_l - \mu) = 0$

$\Longleftrightarrow \sum_{l=1}^N x_l - \sum_{l=1}^N \hat{\mu} = 0$

$\hat{\mu}=\frac{1}{N}\sum_{l=1}^N x_l$

M.L. estimate for mean is the sample mean

Example 2: 1D Gaussian with unknown mean and variance

$\hat{\Theta}=(\Theta_1,\Theta_2) \text{where} \Theta_1=\mu \text{and} \Theta_2=\Sigma=\sigma^2$

$\ln p(x_l|\mu,\sigma^2)=\frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2$

$\Rightarrow \ln p(\mathcal{D}|\mu,\sigma^2)=\sum_{l=1}^N \ln p(x_l|\mu,\sigma^2) = \sum_{l=1}^N \frac{-1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_l-\mu)^2$

$\bigtriangledown_{(\mu,\sigma^2)}\ln p(\mathcal{D}|\mu,\sigma^2)= \sum_{l=1}^N \left( \begin{array}{c} \frac{\partial}{\partial\mu} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \frac{\partial}{\partial\sigma^2} \ln p(\mathcal{D}|\mu,\sigma^2) \\ \end{array}\right) = \sum_{l=1}^N \left( \begin{array}{c} \frac{1}{\sigma^2}(x_l-\mu) \\ \frac{-1}{\sigma^2}+\frac{(x_l-\mu)^2}{2\sigma^2} \end{array}\right)$

$= \left( \begin{array}{c} \frac{1}{\sigma^2}(\sum_{l=1}^N x_l - N_{\mu}) \\ \frac{1}{2\sigma^2} (\sum_{l=1}^N \frac{(x_l-\mu)^2}{\sigma^2} - N_{\mu}) \\ \end{array}\right)= \left( \begin{array}{c} 0 \\ 0 \\ \end{array}\right)$

So we get $\hat{\mu}=\frac{1}{N}\sum_{l=1}^N \vec{x}_l$ and $\hat{\Sigma}=\frac{1}{N}\sum_{l=1}^N(\vec{x_l}-\hat{\mu})(\vec{x}_l-\hat{\mu})^T$

Slight problem "bias" of $\hat{\Sigma}$

The expected value of $\hat{\Sigma}$ over all data sets of size $$ N $$ is not the true variance $\Sigma$ .

Claim: $\mathbb{E}(\hat{\Sigma})=\frac{N-1}{N}\Sigma$

Proof in 1D:

$\mathbb{E}(\hat{\Sigma})=\mathbb{E}(\frac{1}{N}\Sigma_{l=1}^N (x_l-\hat{\mu})^2) = \frac{1}{N}\Sigma_{l=1}^N \mathbb{E}(x_l x_l - 2x_l\hat{\mu} + \hat{\mu}\hat{\mu})$

where $\hat{\mu}=\Sigma_{m=1}^N \frac{x_m}{N}$

$\mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N(\mathbb{E}(x_l x_l) - 2\mathbb{E}(x_l \Sigma_{m=1}^N \frac{x_m}{N}) + \mathbb{E}(\Sigma_{m=1}^N \frac{x_m}{N} \Sigma_{s=1}^N \frac{x_s}{N}))$

$\mathbb{E}(\hat{\Sigma})= \frac{1}{N} \Sigma_{l=1}^N [\mathbb{E}(x_l x_l) - \frac{2}{N}\Sigma_{m=1}^N \mathbb{E}(x_l x_m) + \frac{1}{N^2} \Sigma_{m,s=1}^N \mathbb{E}(x_m x_s)]$

by independence, for $l \neq m$

$\mathbb{E}(x_l x_m) = \mathbb{E}(x_l)\mathbb{E}(x_m) = \mu \mu = \mu^2$

if $l = m \Rightarrow \mathbb{E}(x_l x_l) = \mathbb{E}(x_l^2)$ and similarly when $$ l = m $$

Write $\mathbb{E}(x_l^2) = \mathbb{E}(x^2)$

$\Rightarrow \mathbb{E}(\hat{\Sigma}) = \frac{1}{N} \Sigma_{l=1}^N [ \mathbb{E}(x^2) - \frac{2}{N} \Sigma_{l \neq m} \mu^2 - \frac{2}{N} \Sigma_{l=m} \mathbb{E}(x^2) + \frac{1}{N^2} \Sigma_{m \neq s} \mu^2 + \frac{1}{N^2} \Sigma_{m = s} \mathbb{E}(x^2)]$

$= \frac{1}{N} \Sigma_{l=1}^N (1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2$

$= \frac{1}{N} \cdot N ((1 - \frac{1}{N})\mathbb{E}(x^2) + (\frac{1}{N} - 1)\mu^2)$

$= \frac{N-1}{N}(\mathbb{E}(x^2)-\mu^2) = \frac{N-1}{N}(\mathbb{E}((x-\mu)^2)) = \frac{N-1}{N} \sigma^2$

End of lecture 11

--Gmodeloh 13:37, 21 April 2010 (UTC)

Back to ECE662 Spring 2010, Prof. Boutin

Noteslecture11ECE662S10 - Rhea

Notes for Lecture 11, ECE662, Spring 2010, Prof. Boutin

Maximum Likelihood Estimation (MLE)

Alumni Liaison