ECE662Selecture zhenpengMLE - Rhea

Expected Value of MLE estimate over standard deviation and expected deviation

A slecture by ECE student Zhenpeng Zhao

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

1. Motivation

The estimate most likely converges to the true value as the number of training samples increases.
It is simpler than alternative methods such as Bayesian techniques.

2. MLE as a Parametric Density Estimation

Statistical Density Theory Context
- Given $c$ classes and some knowledge about the features $x \in \mathbb{R}^n$ (or some other space).
- Given training data $x_j\sim\rho(x)=\sum\limits_{i=1}^{c}\rho(x|w_i) Prob(w_i)$, where the class $w_{ij}$ of $x_j$ is known, $\forall{j}=1,...,N$ ($N$ hopefully large enough).
- In order to make a decision, we need to estimate either $\rho(x|w_i)$ and $Prob(w_i)$ $\rightarrow$ use Bayes rule, or only $\rho(x|w_i)$ $\rightarrow$ use the Neyman-Pearson criterion.
- To estimate these quantities, use the training data.

The parametric pdf|Prob estimation problem
- Let $D=\{x_1,x_2,...,x_N\}$, where each $x_j$ is drawn independently from some probability law.
- Choose a parametric form $\rho(x|\theta)$ for the pdf of $x$, or $Prob(x|\theta)$ for the probability of $x$, where $\theta$ is an unknown parameter vector.
- Use $D$ to estimate $\theta$.

Definition: The maximum likelihood estimate of $\theta$ is the value $\hat{\theta}$ that maximizes $\rho_D(D|\theta)$ if $x$ is a continuous R.V., or $Prob(D|\theta)$ if $x$ is a discrete R.V.

Observation: By independence, $\rho(D|\theta)=\rho(x_1,x_2,...,x_N|\theta) = \prod\limits_{j=1}^{N}\rho(x_j|\theta)$
- Simple Example One:

Suppose we want to estimate the priors $Prob(w_1), Prob(w_2)$ for $c=2$ classes.

Let $Prob(w_1)=P$ $\Rightarrow$ $Prob(w_2)=1-P$, with $P$ as the unknown parameter ($\theta=P$).

Let $w_{ij}$ be the class of $x_j$, $j\in\{1,2,...,N\}$, where $x_j\sim \rho(x)$.

$Prob(D|P) = \prod\limits_{j=1}^{N} Prob(w_{ij}|P)$

$= \prod\limits_{j:\,w_{ij}=w_1} Prob(w_{ij}|P)\prod\limits_{j:\,w_{ij}=w_2}Prob(w_{ij}|P)$

$= P^{N_1}\cdot(1-P)^{N-N_1}$

where the first product runs over the samples with $w_{ij}=w_1$, the second over the samples with $w_{ij}=w_2$, and $N_1$ is the number of samples from class 1.

$Prob(D|P)$ is infinitely differentiable with respect to $P$, so an interior local maximum is where the derivative equals 0.

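$\frac{d}{dP} Prob(D|P)=\frac{d}{dP} P^{N_1}(1-P)^{N-N_1}$

$=N_1P^{N_1-1}(1-P)^{N-N_1}-(N-N_1)P^{N_1}(1-P)^{N-N_1-1}$

$=P^{N_1-1}(1-P)^{N-N_1-1}[N_1(1-P)-(N-N_1)P]=0$

$\Rightarrow$ So either $P=0$, or $P=1$, or $N_1(1-P)-(N-N_1)P=0$. When $0<N_1<N$ the likelihood is 0 at $P=0$ and $P=1$, so the maximum is given by the third condition:

$N_1(1-P)-(N-N_1)P=0 \Leftrightarrow N_1-NP=0 \Leftrightarrow P=\frac{N_1}{N}$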
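The closed form $P=\frac{N_1}{N}$ can be checked numerically. The following is a minimal sketch, not part of the original lecture: it evaluates the log-likelihood $N_1\ln P+(N-N_1)\ln(1-P)$ on a grid and compares the grid maximizer with $\frac{N_1}{N}$. The counts ($N=20$, $N_1=7$), the grid resolution, and the function names are illustrative assumptions.

```python
import math

def log_likelihood(p, n1, n):
    """Log of P^n1 * (1-P)^(n-n1); only valid for 0 < p < 1."""
    return n1 * math.log(p) + (n - n1) * math.log(1.0 - p)

def grid_mle(n1, n, steps=1000):
    """Return the grid point in (0, 1) with the largest log-likelihood."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: log_likelihood(p, n1, n))

# Hypothetical counts: 7 of the 20 training samples belong to class 1.
N, N1 = 20, 7
p_grid = grid_mle(N1, N)   # maximizer found by brute-force search
p_closed = N1 / N          # closed-form MLE derived above
print(p_grid, p_closed)
```

Working with the logarithm avoids underflow for large $N$ and does not move the maximizer, because the logarithm is monotonically increasing.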

- Simple Example Two: Continuous R.V.: Estimate mean of Gaussian with Known $\Sigma$

$\rho(\vec{x}|\vec{\mu})=N(\vec{\mu},\Sigma)$, where $\vec{\mu}$ is unknown and $\Sigma$ is known.

$\rho(D|\vec{\mu}) = \prod\limits_{j=1}^{N}\rho(x_j|\vec{\mu})$

Observe that the MLE $\hat{\theta}$ also maximizes $\ln\rho_D(D|\theta)$, since $\ln$ is monotonically increasing.

$\ln\rho_D(D|\vec{\mu}) = \sum\limits_{j=1}^{N}\ln\left(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}\right)\right)$

= $\sum\limits_{j=1}^{N}\left[\ln\left(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\right)-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2}\right]$

which is infinitely differentiable with respect to $\vec{\mu}$, so local maxima are where $\nabla=0$.

compute $\nabla$ , $\nabla_{\vec{\mu}}ln\rho_{D}(D|\vec{\mu})$

= $\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}} (ln(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}})$ $-\frac{(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})}{2})$

= $-1/2\sum\limits_{j=1}^{N}\nabla_{\vec{\mu}}[(x_j-\vec{\mu})^T\Sigma^{-1}(x_j-\vec{\mu})]$

= $-1/2\sum\limits_{j=1}^{N} \begin{bmatrix} \frac{\partial}{\partial\mu_1} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \frac{\partial}{\partial\mu_2} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ \frac{\partial}{\partial\mu_n} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix}$

But $\frac{\partial}{\partial\mu_i} (x_j-{\mu})^T\Sigma^{-1}(x_j-{\mu})$

= $\left(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T\right)\Sigma^{-1}(x_j-\mu)+(x_j-\mu)^T\Sigma^{-1}\frac{\partial}{\partial \mu_i}(x_j-\mu)$

= $2\left(\frac{\partial}{\partial \mu_i}(x_j-\mu)^T\right)\Sigma^{-1}(x_j-\mu)$, since $\Sigma^{-1}$ is symmetric

= $2(0,0,...,0,-1,0,...,0)\Sigma^{-1}(x_j-\mu)$, with the $-1$ in position $i$

= $-2\vec{e_i}^{T}\Sigma^{-1}(x_j-\mu)$

so, $\nabla{ln\rho_D(D|\mu)} = -1/2\sum\limits_{j=1}^{N}$ $\begin{bmatrix} -2\vec{e_1}^{T}\Sigma^{-1}(x_j-{\mu})\\ -2\vec{e_2}^{T}\Sigma^{-1}(x_j-{\mu})\\ \vdots \\ -2\vec{e_n}^{T}\Sigma^{-1}(x_j-{\mu})\\ \end{bmatrix}$

= $\sum\limits_{j=1}^{N}\begin{bmatrix} \vec{e_1}^{T}\\ \vec{e_2}^{T}\\ \vdots \\ \vec{e_n}^{T}\\ \end{bmatrix}\Sigma^{-1}(x_j-\mu)$, where the $\vec{e_i}$ are the standard basis vectors of the feature space, so the stacked matrix is the identity

= $\sum\limits_{j=1}^{N}\Sigma^{-1}(x_j-\mu)$

= $\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu)$, which we set to 0

$\Rightarrow \Sigma\Sigma^{-1}\sum\limits_{j=1}^{N}(x_j-\mu) = \Sigma \cdot 0$

$\Rightarrow \sum\limits_{j=1}^{N}(x_j-\mu) = 0$

$\Rightarrow \frac{1}{N}\sum\limits_{j=1}^{N}x_j = \mu$

$\rightarrow$ the sample mean is the maximum likelihood estimate for $\mu$
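To make this result concrete, the sketch below evaluates the gradient $\Sigma^{-1}\sum_{j}(x_j-\mu)$ for a small two-dimensional data set and shows that it vanishes at the sample mean but not at another point. The data, the covariance $\Sigma = \begin{bmatrix}2&1\\1&2\end{bmatrix}$, and all function names are illustrative assumptions, not part of the original lecture.

```python
def sample_mean(points):
    """Componentwise average of a list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def grad_log_likelihood(points, mu, sigma_inv):
    """Sigma^{-1} * sum_j (x_j - mu), the gradient of the log-likelihood in mu."""
    dim = len(mu)
    residual = [sum(p[d] - mu[d] for p in points) for d in range(dim)]
    return [sum(sigma_inv[r][c] * residual[c] for c in range(dim))
            for r in range(dim)]

# Hypothetical 2-D data and a known covariance Sigma = [[2, 1], [1, 2]],
# whose inverse is (1/3) * [[2, -1], [-1, 2]].
data = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0), (6.0, 2.0)]
sigma_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]

mu_hat = sample_mean(data)
grad_at_mle = grad_log_likelihood(data, mu_hat, sigma_inv)        # zero vector
grad_elsewhere = grad_log_likelihood(data, [0.0, 0.0], sigma_inv)  # nonzero
print(mu_hat, grad_at_mle, grad_elsewhere)
```

Note that the estimate does not depend on $\Sigma$: any invertible covariance gives the same zero of the gradient.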

- Example three: 1-D Gaussian, both $\mu$ and $\sigma^2$ unknown

$\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$

We have $\ln\rho(x_k|\mu,\sigma^2) = \ln\left(\frac{1}{\sqrt{2\pi}\sigma}\cdot e^{-\frac{(x_k-\mu)^2}{2\sigma^2}}\right)$

= $-\frac{1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2$

$\ln\rho_D(D|\mu, \sigma^2)$

= $\ln\prod\limits_{k=1}^{N}\rho(x_k|\mu,\sigma^2)$

= $\sum\limits_{k=1}^{N}\left(-\frac{1}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}(x_k-\mu)^2\right)$

$\nabla_{\mu,\sigma^2}\ln\rho_D(D|\mu,\sigma^2)$

= $\begin{bmatrix} \frac{\partial}{\partial \mu}ln\rho_D(D|\mu,\sigma^2)\\ \frac{\partial}{\partial \sigma^2}ln\rho_D(D|\mu,\sigma^2)\\ \end{bmatrix}$

= $\begin{bmatrix} \frac{\partial}{\partial \mu}(-\frac{N}{2}ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \frac{\partial}{\partial \sigma^2}(-\frac{N}{2}ln(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{k=1}^{N}(x_k-\mu)^2)\\ \end{bmatrix}$

= $\begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}- \frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix}$

= $\begin{bmatrix} \frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)\\ -\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+ \frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2 \end{bmatrix}$ set to be 0

From $\frac{1}{\sigma^2}\sum\limits_{k=1}^{N} (x_k-\mu)=0$ $\Leftrightarrow$ $\sum\limits_{k=1}^{N}x_k-N\mu=0$

$\Leftrightarrow \mu=\frac{1}{N}\sum\limits_{k=1}^{N}x_k$, which is the sample mean.

From $-\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}-$ $\frac{-1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2=0$ and $\hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k \Rightarrow$

$-\frac{N}{2}\cdot\frac{2\pi}{2\pi\sigma^2}+$ $\frac{1}{2\sigma^4}\sum\limits_{k=1}^{N}(x_k-\mu)^2=0 \Leftrightarrow$

$-\frac{N}{2}+\frac{1}{2\sigma^2}$ $\sum\limits_{k=1}^{N}(x_k-\mu)^2=0 \Leftrightarrow$

$\frac{1}{2\sigma^2}=$ $\frac{N}{2}\cdot \frac{1}{\sum\limits_{k=1}^{N}(x_k-\mu)^2}$ $\Leftrightarrow$

$\sigma^2 = \frac{1}{N}\cdot \sum\limits_{k=1}^{N}(x_k-\hat{\mu})^2 = \hat{\sigma}^2$, which is the MLE of $\sigma^2$
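As an illustration of the two formulas, the sketch below computes $\hat{\mu}$ and $\hat{\sigma}^2$ for a small sample and confirms that perturbing either parameter lowers the log-likelihood. The sample values, the perturbation size, and the function names are illustrative assumptions.

```python
import math

def gaussian_mle(xs):
    """Return (mu_hat, var_hat) using the 1/N formulas derived above."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

def log_likelihood(xs, mu, var):
    """ln rho_D(D | mu, sigma^2) for i.i.d. 1-D Gaussian samples."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)

# Hypothetical sample.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = gaussian_mle(xs)
best = log_likelihood(xs, mu_hat, var_hat)

# Log-likelihood at nearby parameter values, excluding the MLE itself.
nearby = [log_likelihood(xs, mu_hat + dm, var_hat + dv)
          for dm in (-0.1, 0.0, 0.1) for dv in (-0.1, 0.0, 0.1)
          if (dm, dv) != (0.0, 0.0)]
print(mu_hat, var_hat, best > max(nearby))
```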

In general, when $x\sim N(\vec{\mu}, \Sigma)$, $x\in \mathbb{R}^n$, with $\vec{\mu}$ and $\Sigma$ unknown, the MLEs for $\vec{\mu}$ and $\Sigma$ are: $\hat{\mu}=\frac{1}{N}\sum\limits_{k=1}^{N}x_k$, $\hat{\Sigma} = \frac{1}{N}\sum\limits_{k=1}^{N}(x_k-\hat{\mu})(x_k-\hat{\mu})^T$

$\Sigma$ is non-singular, but $\hat{\Sigma}$ can be singular $\Rightarrow$ no inverse. This happens when the number of points $N$ is smaller than $n$, the dimension of the feature space.
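The singularity is easy to reproduce. The sketch below builds $\hat{\Sigma}$ in exact rational arithmetic for points in $\mathbb{R}^3$ and compares its determinant for $N=2$ and $N=4$ points. The data points and helper names are illustrative assumptions; exact fractions are used only so that a zero determinant is not blurred by floating-point rounding.

```python
from fractions import Fraction

def mle_covariance(points):
    """Sigma_hat = (1/N) * sum_k (x_k - mu_hat)(x_k - mu_hat)^T, computed exactly."""
    n = len(points)
    dim = len(points[0])
    mu = [sum(Fraction(p[d]) for p in points) / n for d in range(dim)]
    return [[sum((p[r] - mu[r]) * (p[c] - mu[c]) for p in points) / n
             for c in range(dim)] for r in range(dim)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical data in R^3 (n = 3).
few_points = [(1, 2, 3), (4, 0, 1)]                          # N = 2 < n
more_points = [(1, 2, 3), (4, 0, 1), (0, 5, 2), (3, 3, 7)]   # N = 4 > n

det_few = det3(mle_covariance(few_points))    # singular estimate
det_more = det3(mle_covariance(more_points))  # invertible estimate
print(det_few, det_more)
```

With two points, every centered sample lies along a single direction, so $\hat{\Sigma}$ has rank at most 1 and cannot be inverted.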

What happens when we repeat the sampling and the estimation?

Sample $i$: $(x_1^i, x_2^i,...,x_N^i) \Rightarrow \hat{\mu}^i = \frac{1}{N}\sum\limits_{k=1}^{N}x_k^i$

- $E(\hat{\mu})=?$

We have $E(\hat{\mu})= E\left(\frac{1}{N}\sum\limits_{k=1}^{N}x_k\right) = \frac{1}{N}\sum\limits_{k=1}^{N}E(x_k) = \frac{1}{N}\sum\limits_{k=1}^{N}\mu = \mu$

But how far do we expect to deviate from the mean?

$E(|\hat{\mu}-\mu|^2) = E((\hat{\mu}-\mu)\cdot(\hat{\mu}-\mu)) = E(\hat{\mu}\cdot\hat{\mu}-\hat{\mu}\cdot{\mu}-{\mu}\cdot\hat{\mu}+{\mu}\cdot{\mu})$

$=E(\hat{\mu}\cdot\hat{\mu})-2\mu\cdot E(\hat{\mu})+\mu \cdot \mu$

$=E(\hat{\mu}\cdot\hat{\mu})-\mu\cdot\mu$

$=E\left(\frac{1}{N}\sum\limits_{k=1}^{N}x_k \cdot \frac{1}{N}\sum\limits_{j=1}^{N}x_j\right)-\mu\cdot\mu$

$=\frac{1}{N^2}\sum\limits_{k=1}^{N}\sum\limits_{j=1}^{N}E(x_k \cdot x_j)-\mu\cdot\mu$

$=\frac{1}{N^2}\left[\sum\limits_{k,j=1,k\neq j}^{N}E(x_k)\cdot E(x_j)+\sum\limits_{k=1}^{N}E(x_k\cdot x_k)\right]-\mu\cdot\mu$, using independence for $k\neq j$

$=\frac{1}{N^2}\left[N\cdot (N-1)\mu\cdot \mu+\sum\limits_{k=1}^{N}E(x\cdot x)\right]-\mu\cdot\mu$

$=-\frac{1}{N}\mu\cdot\mu+\frac{1}{N^2}\sum\limits_{k=1}^{N}E(x\cdot x)$

By $E[(x-\mu)\cdot(x-\mu)] = \sigma^2 \Rightarrow E(x \cdot x)-\mu\cdot\mu = \sigma^2 \Rightarrow E(x \cdot x) = \sigma^2+\mu\cdot\mu$

So: $E(|\hat{\mu}-\mu|^2) = -\frac{1}{N}\mu \cdot \mu + \frac{1}{N}(\sigma^2+\mu \cdot \mu) = \frac{1}{N}\sigma^2$
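The $\frac{1}{N}\sigma^2$ result can be checked by simulating the repeated sampling described above. The following is a rough Monte Carlo sketch, not an exact computation: it draws many data sets of size $N$, computes $\hat{\mu}$ for each, and averages $|\hat{\mu}-\mu|^2$. The parameter values, the number of trials, the fixed seed, and the function name are illustrative assumptions.

```python
import random

def mean_squared_error_of_mean(mu, sigma, n, trials, seed=0):
    """Average of (mu_hat - mu)^2 over `trials` simulated data sets of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(sample) / n
        total += (mu_hat - mu) ** 2
    return total / trials

# Hypothetical true parameters and data-set size.
mu, sigma, n = 1.0, 2.0, 10
estimate = mean_squared_error_of_mean(mu, sigma, n, trials=20000)
theory = sigma ** 2 / n  # the sigma^2 / N result derived above
print(estimate, theory)
```

The simulated value fluctuates around the theoretical one and should get closer to it as the number of trials grows.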

Bias: The maximum likelihood estimate of the variance $\sigma^2$ is biased, which means that the expected value of the sample variance over all data sets of size $N$ is not equal to the true variance:

$E\left[\frac{1}{N}\sum\limits_{k=1}^{N}(x_k-\bar{x})^2\right] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$

But as $N \rightarrow \infty$, the expected value of the MLE of $\sigma^2$ approaches $\sigma^2$.
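The bias and its disappearance for large data sets can also be seen in simulation. The sketch below is a Monte Carlo approximation under assumed parameter values: it averages the $\frac{1}{N}$ variance estimate over many simulated Gaussian data sets for a small and a larger data-set size and compares both with $\frac{N-1}{N}\sigma^2$. The sizes (5 and 50), the number of trials, the seed, and the function name are illustrative assumptions.

```python
import random

def average_mle_variance(mu, sigma, n, trials, seed=0):
    """Average of the 1/N variance estimate over simulated data sets of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        total += sum((x - x_bar) ** 2 for x in sample) / n
    return total / trials

# Hypothetical true standard deviation and two data-set sizes.
sigma = 2.0
small_n, large_n = 5, 50
avg_small = average_mle_variance(0.0, sigma, small_n, trials=20000)
avg_large = average_mle_variance(0.0, sigma, large_n, trials=20000)

# Values predicted by the (N-1)/N * sigma^2 formula above.
expected_small = (small_n - 1) / small_n * sigma ** 2
expected_large = (large_n - 1) / large_n * sigma ** 2
print(avg_small, expected_small)
print(avg_large, expected_large)
```

Both averages fall below the true variance $\sigma^2$, and the gap shrinks as the data-set size grows.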


Questions and comments

If you have any questions, comments, etc. please post them on https://kiwi.ecn.purdue.edu/rhea/index.php/ECE662Selecture_ZHenpengMLE_Ques.

Back to ECE662, Spring 2014
