(55 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | + | [[Category:slecture]] | |
+ | [[Category:ECE662Spring2014Boutin]] | ||
+ | [[Category:ECE]] | ||
+ | [[Category:ECE662]] | ||
+ | [[Category:pattern recognition]] | ||
− | + | <center><font size= 4> | |
+ | Tutorial on Maximum Likelihood Estimation: A Parametric Density Estimation Method | ||
+ | </font size> | ||
− | + | A [http://www.projectrhea.org/learning/slectures.php slecture] by Sudhir Kylasa | |
+ | Partially based on the [[2014_Spring_ECE_662_Boutin_Statistical_Pattern_recognition_slectures|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]]. | ||
+ | </center> | ||
+ | |||
[[Media:Mle-tutorial.pdf|MLE Tutorial in PDF Format]] | [[Media:Mle-tutorial.pdf|MLE Tutorial in PDF Format]] | ||
− | |||
<br> | <br> | ||
− | + | ---- | |
+ | ---- | ||
= Motivation = | = Motivation = | ||
− | |||
---- | ---- | ||
Line 41: | Line 49: | ||
<span class="texhtml">''L''(θ / ''x'') = ''f''(''x'' / θ)</span> | <span class="texhtml">''L''(θ / ''x'') = ''f''(''x'' / θ)</span> | ||
− | + | x is regarded as fixed, and <span class="texhtml">θ</span> is regarded as the variable for L. The log-likelihood function<br>is defined as <span class="texhtml">''l''(θ / ''x'') = ''l'''o'''g '''L'''''<b>(θ / ''x'')</b></span> | |
The Maximum Likelihood Estimate (or MLE) is the value <math>\hat{ \theta } = \hat{\theta(x)} \in \Theta</math><br>maximizing <span class="texhtml">''L''(θ / ''x'')</span>, provided it exists: | The Maximum Likelihood Estimate (or MLE) is the value <math>\hat{ \theta } = \hat{\theta(x)} \in \Theta</math><br>maximizing <span class="texhtml">''L''(θ / ''x'')</span>, provided it exists: | ||
Line 53: | Line 61: | ||
= What is Likelihood function ? = | = What is Likelihood function ? = | ||
− | If the probability of an event X dependent on model parameters p is written as<br>'''<span class="texhtml">''P''(''X'' | ''p'')</span><br>''' | + | If the probability of an event X dependent on model parameters p is written as<br>'''<span class="texhtml">''P''(''X'' | ''p'')</span>'''<br> |
+ | |||
+ | then we talk about the likelihood <br> | ||
+ | |||
+ | <span class="texhtml">''L''(''p'' | ''X'')</span> | ||
+ | |||
+ | that is the likelihood of the parameters given the data. | ||
+ | |||
+ | For most sensible models, we will find that certain data are more probable than other data. | ||
+ | |||
+ | The aim of maximum likelihood estimation is to find the parameter value(s) that makes the | ||
+ | |||
+ | observed data most likely. This is because the likelihood of the parameters given the data | ||
+ | |||
+ | is defined to be equal to the probability of the data given the parameters. If we were in the | ||
+ | |||
+ | business of making predictions based on a set of solid assumptions, then we would be | ||
+ | |||
+ | interested in probabilities - the probability of certain outcomes occurring or not occurring. | ||
+ | |||
+ | However, in the case of data analysis, we have already observed all the data: once they have | ||
+ | |||
+ | been observed they are fixed, there is no 'probabilistic' part to them anymore (the word data | ||
+ | |||
+ | comes from the Latin word meaning 'given'). We are much more interested in the likelihood | ||
+ | |||
+ | of the model parameters that underly the fixed data. The following is the relation between the | ||
+ | |||
+ | likelihood and the probability spaces:'''<br>''' | ||
+ | |||
+ | <u>'''Probability:'''</u> | ||
+ | |||
+ | Knowing Parameters <math>\rightarrow</math> Prediction of Outcomes | ||
+ | |||
+ | <u>'''Likelihood:'''</u> | ||
+ | |||
+ | Observation of Data <math>\rightarrow</math> Estimation of Parameters | ||
+ | |||
+ | <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | = Method = | ||
+ | |||
+ | Maximum likelihood (ML) estimates need not exist nor be unique. In this section, we show how to compute ML <br>estimates when they exist and are unique. For computational convenience, the ML estimate<br>is obtained by maximizing the log-likelihood function, log L(<span class="texhtml">θ / ''x''</span>). This is because the <br>two functions log L(<span class="texhtml">θ / ''x''</span>) and L(<span class="texhtml">θ / ''x''</span>) are monotonically related to each other<br>so the same ML estimate is obtained by maximizing either one. Assume that the log-likelihood<br>function is differentiable, if <span class="texhtml">θ<sub>''M''''L''''E''</sub></span> exists, it must satisfy the following partial <br>differential equation known as the likelihood equation: | ||
+ | |||
+ | <math>\frac{d}{d\theta}\left( log L(\theta/x) \right) = 0</math> | ||
+ | |||
+ | at <span class="texhtml">θ$ = $θ<sub>''M''''L''''E''</sub></span>. This is because maximum or minimum of a continuously differentiable<br>function implies that its first derivatives vanishes at such points. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | The likelihood equation represents a necessary condition for the existence of an MLE estimate. <br>An additional condition must also be satisfied to ensure that log L(<span class="texhtml">θ / ''x''</span>) is a maximum and not<br>minimum, since the first derivative cannot reveal this. To be a maximum, the shape of the log-likelihood<br>function should be convex in the neighborhood of <span class="texhtml">θ<sub>''M''''L''''E''</sub></span>. This can be checked by <br>calculating the second derivatives of the log-likelihoods and showing whether they are all negative<br>at <span class="texhtml">θ$ = $θ<sub>''M''''L''''E''</sub></span>. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <math>\frac{d^{2}}{d\theta^{2}}\left( log L(\theta/x) \right) < 0</math> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | = Properties = | ||
+ | |||
+ | Some general properties (advantages and disadvantages) of the Maximum Likelihood Estimate are as follows: | ||
+ | |||
+ | *For large data samples (large N) the likelihood function L approaches a Gaussian distribution | ||
+ | *Maximum Likelihood estimates are usually consistent. | ||
+ | *For large N the estimates converge to the true value of the parameters which are estimated. | ||
+ | *Maximum Likelihood Estimates are usually unbiased (asymptotically). | ||
+ | *Maximum Likelihood Estimate is efficient: (the estimates have the smallest variance). | ||
+ | *Maximum Likelihood Estimate is sufficient: (it uses all the information in the observations). | ||
+ | *The solution from the Maximum Likelihood Estimate is unique. | ||
+ | |||
+ | On the other hand, we must know the correct probability distribution for the problem at hand. <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | = Numerical Examples using Maximum Likelihood Estimates = | ||
+ | |||
+ | In the following section, we discuss the applications of MLE procedure in estimating unknown<br>parameters of various common density distributions. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | == Estimating prior probability using MLE == | ||
+ | |||
+ | Consider a two-class classification problem, with classes (<span class="texhtml">ω<sub>1</sub>,ω<sub>2</sub></span>) and let<br>Prob(<span class="texhtml">ω<sub>1</sub></span>) = <span class="texhtml">''p''</span> and Prob(<span class="texhtml">ω<sub>2</sub></span>) = <span class="texhtml">1 − ''p''</span> (here <span class="texhtml">''p''</span> is the unknown parameter). By using<br>MLE, we can estimate <span class="texhtml">''p''</span> as follows: | ||
+ | |||
+ | Let the sample be <math>\mathcal{D} = (x_{1}, x_{2}, \ldots, x_{N})</math>. Let <span class="texhtml">ω<sub>''i''''j'''''</sub></span>''''' '''''be the class of the feature vector x<sub></sub>j<br>(N is the sample size and <span class="texhtml" />N<sub>1</sub> is the number of feature vectors belonging to class <span class="texhtml">ω<sub>1</sub></span>). | ||
+ | |||
+ | Also assume that samples'''''<math>x_{1}, x_{2} \ldots, x_{N}</math> '''''are independent events. <br> | ||
+ | |||
+ | <math>Prob(\mathcal{D}/p) = \prod_{j=1}^{N} Prob(\omega_{ij} / p) </math> <br> <math>= p^{N_1} * (1-p)^{N - N_1}</math> | ||
+ | |||
+ | Please note that <math>Prob(\mathcal{D}/p)</math> is infinitely differentiable function of <span class="texhtml">''p''</span>, so the local maxima lies where its derivative is zero. | ||
+ | |||
+ | <math>\frac{d}{dp} \left( p^{N_1} * (1-p)^{N - N_1} \right) = 0 </math> <br> <math>N_1 * p^{N_1 - 1} * (1 - p)^{N - N_1} - (N - N_1) * (1 - p)^{N - N_1 - 1} * p^{N} = 0 </math> <br> <math>p^{N_1} * (1-p)^{N - N_1 - 1} = 0 </math> <br> | ||
+ | |||
+ | Solving the above equation for <span class="texhtml">''p''</span>, we get the following: | ||
+ | |||
+ | <math>p^{N_1} * (1-p)^{N - N_1 - 1} = 0 </math> <br> <span class="texhtml">''N''<sub>1</sub> * (1 − ''p'') = (''N'' − ''N''<sub>1</sub>) * ''p''</span> <br> | ||
+ | |||
+ | So p is either 0 or 1 by eq. CHANGE HERE and p is <span class="texhtml">(''N''<sub>1</sub> / ''N'')</span> by eq. CHANGE HERE. Hence this | ||
+ | |||
+ | proves that taking the frequencies for probabilities of the feature vectors is optimum and using | ||
+ | |||
+ | MLE we showed that p is maximized. Hence the class probabilities are optimum | ||
+ | |||
+ | (the likelihood function is maximized using MLE). | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | == [[|Estimating <span class="texhtml">μ</span> of a Gaussian distribution when <span class="texhtml">Σ</span> is known]] == | ||
+ | |||
+ | In this section, we estimate the value of <span class="texhtml">μ</span> (<span class="texhtml">μ<sub>''M''''L''''E''</sub></span>), when the covariance matrix (<span class="texhtml">Σ</span>) in known, | ||
+ | |||
+ | for gaussian distribution in n-dimensional feature space. | ||
+ | |||
+ | <math>\rho( \mathcal{D} / \mu) = \prod_{j = 1}^{N} \rho ( x_{j} / \mu ) </math> <br> log likelihood function is <br> [[|<math>log \rho( \mathcal{D} / \mu) = \sum_{j = 1}^{N} log (\rho ( x_{j} / \mu ) ) </math> ]][[|]] <br> <math>= \sum_{j=1}^{N} log\left[ \frac{1}{ | ||
+ | (2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}} | ||
+ | } | ||
+ | exp^{-\frac{(x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu)}{2}} | ||
+ | \right] </math> <br> <math>= \sum_{j=1}^{N} log \left[ \frac{1}{ | ||
+ | (2 \pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}} | ||
+ | } | ||
+ | \right] | ||
+ | - \frac{1}{2} \left[(x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \right] | ||
+ | </math> <br> | ||
+ | |||
+ | This function is infinitely differentiable function of unknown parameters (<span class="texhtml">μ</span>)'s. To find the maxima, we set the derivative of this eq. [[|]] to 0. | ||
+ | |||
+ | <math> | ||
+ | \begin{bmatrix} | ||
+ | \frac{d}{d\mu_1} \\ | ||
+ | \frac{d}{d\mu_2} \\ | ||
+ | \vdots \\ | ||
+ | \frac{d}{d\mu_N} \\ | ||
+ | \end{bmatrix}_{n \times 1} | ||
+ | = | ||
+ | \begin{bmatrix} | ||
+ | 0 \\ | ||
+ | 0 \\ | ||
+ | \vdots \\ | ||
+ | 0 \\ | ||
+ | \end{bmatrix}_{n \times 1} | ||
+ | </math> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | Differentiating the log likelihood functions yields the following results: | ||
+ | |||
+ | <br> | ||
+ | |||
+ | <math> | ||
+ | \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) = | ||
+ | \frac{-1}{2} \sum_{j=1}^{N} \triangledown_{\vec{\mu}} | ||
+ | \left[ | ||
+ | (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) | ||
+ | \right] | ||
+ | </math> <br> <math>=\frac{-1}{2} \sum_{j=1}^{N} | ||
+ | \begin{bmatrix} | ||
+ | \frac{d}{d\mu_1} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ | ||
+ | \frac{d}{d\mu_2} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ | ||
+ | \vdots \\ | ||
+ | \frac{d}{d\mu_N} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ | ||
+ | \end{bmatrix}_{n \times 1} | ||
+ | </math> <br> | ||
+ | |||
+ | But we know that, | ||
+ | |||
+ | <math> | ||
+ | \frac{d}{d\mu_{i}} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) = | ||
+ | \left[ | ||
+ | \frac{d}{d\mu} (x_j - \mu)^{T} | ||
+ | \right] {\Sigma}^{-1}(x_j - \mu) + | ||
+ | (x_j - \mu)^{T} \Sigma^{-1} | ||
+ | \left[ | ||
+ | \frac{d}{d\mu} (x_j - \mu) | ||
+ | \right] </math> <br> <math> | ||
+ | = | ||
+ | 2 \left[ | ||
+ | \frac{d}{d\mu} (x_j - \mu)^{T} | ||
+ | \right] {\Sigma}^{-1}(x_j - \mu) | ||
+ | </math> <br> <math> = -2 e_i^{T} {\Sigma}^{-1}(x_j - \mu) </math> <br> | ||
+ | |||
+ | where <math>e_i^{T} = [0,0,0, \ldots, 1, \ldots] </math> and 1 is located at position i in the array | ||
+ | |||
+ | <math> | ||
+ | \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) | ||
+ | = | ||
+ | \frac{-1}{2} \sum_{j=1}^{N} | ||
+ | \begin{bmatrix} | ||
+ | -2 e_{1}^{T} {\Sigma}^{-1} (x_j - \mu) \\ | ||
+ | -2 e_{2}^{T} {\Sigma}^{-1} (x_j - \mu) \\ | ||
+ | \vdots \\ | ||
+ | -2 e_{N}^{T} {\Sigma}^{-1} (x_j - \mu) \\ | ||
+ | \end{bmatrix}_{n \times 1} </math> <br> <math> | ||
+ | = | ||
+ | \sum_{j=1}^{N} | ||
+ | \begin{bmatrix} | ||
+ | e_{1}^{T} \\ | ||
+ | e_{2}^{T} \\ | ||
+ | \vdots \\ | ||
+ | e_{N}^{T} \\ | ||
+ | \end{bmatrix}_{n \times 1} {\Sigma}^{-1} (x_j - \mu) | ||
+ | </math> <br> | ||
+ | |||
+ | Notice that, <math> | ||
+ | \begin{bmatrix} | ||
+ | e_{1}^{T} \\ | ||
+ | e_{2}^{T} \\ | ||
+ | \vdots \\ | ||
+ | e_{N}^{T} | ||
+ | \end{bmatrix}_{n \times 1} </math> is the identity matrix. | ||
+ | |||
+ | <br> | ||
+ | |||
+ | So the above equation reduces to. <br> | ||
+ | |||
+ | [[|<math> | ||
+ | \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) | ||
+ | = | ||
+ | \sum_{j=1}^{N} {\Sigma}^{-1} (x_j - \mu) | ||
+ | = | ||
+ | {\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) | ||
+ | </math> ]]<br> | ||
+ | |||
+ | By solving the equation. [[|]] for <span class="texhtml">μ</span>. <br> | ||
+ | |||
+ | <math>{\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) = 0 </math><br> <math>\sum {\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) = 0 </math><br> <math>\sum_{j=1}^{N} (x_j - \mu) = 0 </math><br> <math>\sum_{j=1}^{N} (x_j) - \sum_{j=1}^{N} (\mu) = 0 </math><br> <math>\mu = \frac{1}{N} (\sum_{j=1}^{N}(x_j)) = \mu_{MLE} </math><br> | ||
+ | |||
+ | Hence, we proved that using MLE the sample mean is the maximum likelihood estimate of any given sample. | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | == [[|Estimating <span class="texhtml">μ</span> and <span class="texhtml">σ<sup>2</sup></span> for 1-D gaussian distribution using MLE]] == | ||
+ | |||
+ | ---- | ||
+ | |||
+ | In this subsection, we estimate the <span class="texhtml">μ</span> and <span class="texhtml">σ<sup>2</sup></span> for one-dimensional gaussian data. Here <span class="texhtml">θ</span> = <span class="texhtml">(θ<sub>1</sub>,θ<sub>2</sub>)</span> are <span class="texhtml">(μ,σ<sup>2</sup>)</span>, | ||
+ | |||
+ | which are unknown parameters. We estimate these parameter using the procedure discussed in the section [[|]]. | ||
+ | |||
+ | The log-likelihood function for this case is given by the following equation: <br> <math> | ||
+ | log \rho(x_k / \mu, \sigma^2) = | ||
+ | log \left[ | ||
+ | \frac{1}{\surd{2\pi}\sigma^2}exp(-\frac{x_k - \mu}{2\sigma^2}) | ||
+ | \right] | ||
+ | </math><br> <math> | ||
+ | = | ||
+ | -\frac{1}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2}(x_k - \mu)^2 | ||
+ | log \rho(\mathcal{D}/\mu,\sigma^2) | ||
+ | </math><br> | ||
+ | |||
+ | <math> | ||
+ | = | ||
+ | \prod_{k=1}^{N} log \rho (x_k / \mu,\sigma^2 ) | ||
+ | </math> | ||
+ | |||
+ | <br> | ||
+ | |||
+ | Since <math>(x_1, x_2, \ldots, x_N)</math> are I.I.D's, the density function can be written in product form as follows: | ||
+ | |||
+ | [[|<math> | ||
+ | \rho(\mathcal{D}/\mu,\sigma^2) = \rho (( x_1, x_2, \ldots, x_N ) / \mu,\sigma^2) | ||
+ | = | ||
+ | \rho(x_1/\mu,\sigma^2)\rho(x_2/\mu,\sigma^2) \ldots \rho(x_N/\mu,\sigma^2) | ||
+ | </math>]]<br> <math> | ||
+ | log \rho(\mathcal{D}/\mu,\sigma^2) = | ||
+ | log \prod_{k=1}^{N} \rho (x_k / \mu,\sigma^2 ) | ||
+ | </math><br> <math> | ||
+ | = | ||
+ | \sum_{k=1}^{N} \left[ | ||
+ | -\frac{1}{2} log (1\pi\sigma^2) - \frac{1}{2\sigma^2} (x_k - \mu)^2 | ||
+ | \right] | ||
+ | </math><br> | ||
+ | |||
+ | Differentiating and setting the eq. [[|]] to 0, yields the following equations: | ||
+ | |||
+ | <br> <math> | ||
+ | \triangledown_{\mu,\sigma^2} \rho(\mathcal{D}/\mu,\sigma^2) = | ||
+ | \begin{bmatrix} | ||
+ | \frac{d}{d\mu} log \rho (\mathcal{D}/\mu,\sigma^2) \\ | ||
+ | \frac{d}{d\sigma^2} log \rho (\mathcal{D}/\mu,\sigma^2) \\ | ||
+ | \end{bmatrix}_{2 \times 1} | ||
+ | </math> <br> <math> | ||
+ | = | ||
+ | \begin{bmatrix} | ||
+ | \frac{d}{d\mu} (-\frac{N}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{N}(x_k - \mu)) \\ | ||
+ | \frac{d}{d\sigma^2} (-\frac{N}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{N}(x_k - \mu))\\ | ||
+ | \end{bmatrix}_{2 \times 1} | ||
+ | </math> <br> [[|<math> | ||
+ | = | ||
+ | \begin{bmatrix} | ||
+ | \frac{1}{\sigma^2} \sum_{k=1}^{N}(x_k - \mu) \\ | ||
+ | -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{N}(x_k - \mu)^2\\ | ||
+ | \end{bmatrix}_{2 \times 1} | ||
+ | </math>]]<br> | ||
+ | |||
+ | Solving the eq. [[|]] for <span class="texhtml">μ</span> and <span class="texhtml">σ<sup>2</sup></span> yields the following: <br> <math> | ||
+ | \frac{1}{\sigma^2} \left( | ||
+ | \sum_{k=1}^{N}(x_k - \mu) | ||
+ | \right) = 0 | ||
+ | </math> <br> <math> | ||
+ | \sum_{k=1}^{N}(x_k - \mu) = 0 | ||
+ | </math> <br> <math> | ||
+ | \hat{\mu} = \mu = \frac{1}{N}\sum_{k=1}^{N}(x_k) | ||
+ | -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{N}(x_k - \mu)^2 = 0 | ||
+ | </math> <br> <math> | ||
+ | \sum_{k=1}^{N}(x_k - \mu)^2 = N\sigma^2 | ||
+ | </math> <br> <math> | ||
+ | \hat{\sigma^2} = \sigma^2 = \frac{ | ||
+ | \sum_{k=1}^{N}(x_k - \hat{\mu})^2 | ||
+ | }{N} | ||
+ | </math> <br> which are the mean and standard deviation of the empirical data. | ||
+ | |||
+ | In general for N-dimensional Gaussian data, when X <span class="texhtml">˜</span> N(<span class="texhtml">μ,Σ</span>) where X <math>\in \mathcal{R}^n</math>, | ||
+ | |||
+ | with <math>\vec{\mu}</math> and <span class="texhtml">Σ</span> are unknown parameters, then MLE for <span class="texhtml">μ</span> and <span class="texhtml">Σ</span> are as follows: <br> <math> | ||
+ | \hat{\mu} = \frac{1}{N}\sum_{k=1}^{N}(\vec{x_k}) | ||
+ | </math> <br> <math> | ||
+ | \hat{\sigma^2} = \frac{ | ||
+ | \sum_{k=1}^{N}(\vec{x_k} - \vec{\mu})(\vec{x_k} - \vec{\mu})^T | ||
+ | }{N} | ||
+ | </math> <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | == How far does the <math>\hat{\mu}</math> deviate from the true <span class="texhtml">μ</span> when MLE is used. == | ||
+ | |||
+ | ---- | ||
+ | |||
+ | As discussed in section [[|]], we know that | ||
+ | |||
+ | <math> | ||
+ | E[\hat{\mu}] = \frac{1}{N}\sum_{k=1}^{N}(\vec{x_k}) | ||
+ | = \mu | ||
+ | </math> <br> | ||
+ | |||
+ | Now we compute the expected value of <math>(|\hat{\mu} - \mu|)^2</math> as follows: | ||
+ | |||
+ | <math>E[ (|\hat{\mu} - \mu|)^2 ] = E[(\hat{\mu} - \mu)(\hat{\mu} - \mu)] </math> <br> <math> =E[\hat{\mu}\mu - \mu\hat{\mu} - \hat{\mu}\mu + \mu\mu] </math> <br> <math> =E[\hat{\mu}\hat{\mu}] - 2 E[\mu\hat{\mu}] + E[\mu\mu] </math> <br> Substituting E[<math>\hat{\mu}</math>] = <span class="texhtml">μ</span> we have <math> =E[\hat{\mu}\hat{\mu}] - \mu\mu </math> <br> <math> =E(\frac{1}{N} \sum_{k=1}^{N}X_k * \sum_{j=1}^{N}X_j) - \mu\mu </math> <br> <math> =\frac{1}{N^2}\sum_{j,k=1}^{N}E[X_jX_k] - \mu\mu </math> <br> | ||
+ | |||
+ | <br> Treating <span class="texhtml">''X''<sub>''j''</sub></span> as random variables in the above equations, we have | ||
+ | |||
+ | <math> | ||
+ | E[ (|\hat{\mu} - \mu|)^2 ] =\frac{1}{N^2}\left[ | ||
+ | \sum_{j,k=1}^{N}E(X_j)E(X_k) + \sum_{j,k=1}^{N}E(X_j * X_k) | ||
+ | \right] - \mu\mu </math> <br> <math> =\frac{1}{N^2}\left[ | ||
+ | N(N-1)\mu\mu + N*E[X^2] | ||
+ | \right] \mu\mu </math> <br> <math>E[ (|\hat{\mu} - \mu|)^2 ] = \frac{1}{N}E[X^2] - \frac{1}{N} \mu\mu | ||
+ | </math> <br> | ||
+ | |||
+ | But, we know that | ||
+ | |||
+ | <span class="texhtml">''E''[ | ''X'' − μ | <sup>2</sup>] = ''E''[(''X'' − μ)(''X'' − μ)] = σ<sup>2</sup></span> <br> <span class="texhtml">''E''[''X'' * ''X''] − μ<sup>2</sup> = σ<sup>2</sup></span> <br> <span class="texhtml">''E''[''X'' * ''X''] = σ<sup>2</sup> * μ<sup>2</sup></span> <br> | ||
+ | |||
+ | <br> Therefore, we have the following result: <br> <math>E[ (|\hat{\mu} - \mu|)^2 ] = \frac{1}{N}(\sigma^2 * \mu^2) - \frac{1}{N}(\mu\mu) </math> <br> <math>= \frac{1}{N} \sigma^2 </math> <br> | ||
+ | |||
+ | So the expected value of <math>(|\hat{\mu} - \mu|)^2</math> is proportional to the true standard deviation. | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | == Estimate for <span class="texhtml">Σ</span> is biased, when MLE is used. == | ||
+ | |||
+ | ---- | ||
+ | |||
+ | <br> In the following section, we will show that the estimate for covariance is biased (when <span class="texhtml">μ</span> is unknown) | ||
+ | |||
+ | when MLE is used to estimate its value and equals to true covariance when <span class="texhtml">μ</span> is known. <br> <math>E[ \hat{\Sigma} ] = E[ \frac{1}{N} \sum_{k=1}^{N} (X_k - \mu)(X_k - \mu)^T] </math> <br> <math> = \sum_{k=1}^{N} E[ \frac{1}{N} (X_k - \mu)(X_k - \mu)^T] </math> <br> <math> = \frac{1}{N} \sum_{k=1}^{N} E [ | ||
+ | X_k X_k^T - X_K \mu^T - \mu X^T + \mu \mu^T | ||
+ | ] </math> <br> <math> = \frac{1}{N} N E[ X_K X_K^T ] - \mu\mu^T </math> <br> <span class="texhtml"> = ''E''[''X''''X'''''<b><sup></sup></b>'''''T'' − μμ<sup></sup>''T'']'''</span>'''''<br> <span class="texhtml"> = </span>''E''[''''''<i>X</i><sup>''T''</sup> − 2μμ<sup>''T''</sup> − μμ<sup>''T''</sup>] <br> <span class="texhtml"> = ''E''[(''X'' − μ)(''X'' − μ)<sup>''T''</sup>] = Σ</span> <br> | ||
+ | |||
+ | <br> If the <span class="texhtml">μ</span> is known, then it turns out that Estimated value of <math>\hat{\Sigma}</math> is equal to true <span class="texhtml">Σ</span>. | ||
+ | |||
+ | We now show the derivation that in case where <span class="texhtml">μ</span> is not known, then estimated value of <math>\hat{\Sigma}</math> is not equal to true <span class="texhtml">Σ</span> | ||
+ | |||
+ | <br> <math>E[ \hat{\Sigma} ] = \frac{1}{N} E[ \sum_{k=1}^{N} (X_k - \hat{\mu})(X_k - \hat{\mu})^T] </math> <br> <math> = \frac{1}{N} \sum_{k=1}^{N} E [ | ||
+ | X_k X_k^T - X_K \hat{\mu}^T - \hat{\mu} X^T + \hat{\mu} \hat{\mu}^T | ||
+ | ] </math> '''<br>''' | ||
+ | |||
+ | <math>= \frac{1}{N} \sum_{k=1}^{N} ( E[ X_k X_k^T ] - E[X_K \hat{\mu}^T] - E[ \hat{\mu} X^T] - E[\hat{\mu} \hat{\mu}^T] )</math><br> <br> | ||
+ | |||
+ | <math> = \frac{1}{N} \sum_{k=1}^{N} \left( | ||
+ | E[ X_k X_k^T ] - E [(\frac{1}{N}\sum_{l=1}^{N}X_l)X_k] | ||
+ | \right) | ||
+ | - E [\frac{1}{N}\sum_{l=1}^{N}X_k X_l^T] | ||
+ | + E[\frac{1}{N}\sum_{l=1}^{N}X_l \frac{1}{N}\sum_{m=1}^{N}X_m^T] </math> <br> | ||
+ | |||
+ | <br> Splitting the summation above into two parts, l = k and l <math>\neq</math> k, we have the following: | ||
+ | |||
+ | <br> <math> | ||
+ | E[ \hat{\Sigma} ] = \frac{1}{N} \sum_{k=1}^{N} \{ | ||
+ | E[X X^T] | ||
+ | - \frac{1}{N} \Sigma_{l=1, l \neq k}^{N}E(X_l) E (X_k^T) | ||
+ | -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) | ||
+ | -\frac{1}{N}\Sigma_{l=1, l \neq k}^{N} E(X_k) E(X_l^T) | ||
+ | </math> <br> <math> | ||
+ | -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) | ||
+ | + \frac{1}{N^2} \Sigma_{l,m=1, l \neq m}^{K} E(X_l)E(X_m^T) | ||
+ | + \frac{1}{N^2} \Sigma_{l,m=1, l = m}^{K} E(X_l)E(X_l^T) | ||
+ | \} | ||
+ | </math> <br> <math> | ||
+ | = \frac{1}{N} \sum_{k=1}^{N} \{ | ||
+ | E[X X^T] - \frac{1}{N} \Sigma_{l=1, l \neq k}^{N}\mu\mu^T | ||
+ | -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) | ||
+ | -\frac{1}{N}\Sigma_{l=1, l \neq k}^{N} \mu\mu^T | ||
+ | </math> <br> <math> | ||
+ | |||
+ | -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) | ||
+ | + \frac{1}{N^2} \Sigma_{l,m=1, l \neq m}^{K} \mu\mu^T | ||
+ | + \frac{1}{N^2} \Sigma_{l,m=1, l = m}^{K} E(X_l)E(X_l^T) | ||
+ | \} | ||
+ | </math> <br> <math> | ||
+ | E[ \hat{\Sigma} ] = \frac{1}{N} \left[ | ||
+ | N(1 - \frac{1}{N} E( X X^T ) + N( \frac{1}{N} - 1) \mu\mu^T | ||
+ | \right] </math> <br> <math> = (1 - \frac{1}{N} ) \left[ | ||
+ | E( X X^T ) - \mu\mu^T | ||
+ | \right] </math> <br> <math> = ( \frac{N - 1}{N} ) E(X X^T - \mu\mu^T)) </math> <br> <math> = ( \frac{N - 1}{N} )E[(X - \mu)(X - \mu)^T] \neq \Sigma | ||
+ | </math> <br> | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | <br> | ||
+ | |||
+ | == Binomial Distribution == | ||
+ | |||
+ | ---- | ||
+ | |||
+ | This section discusses the estimation of p in a typical binomial distribution. Assuming that parameter p, is unknown | ||
+ | |||
+ | in a binomial distribution give by the equation below: | ||
+ | |||
+ | <br> <math>Prob(r/p) = \binom{n}{r}p^r(1-p)^{n-r}</math> <br> <math>log Prob(r/p) = log \binom{n}{r} = r log ( p )+ (n-r)log(1-p) </math> <br> <math>log Prob(\mathcal{D}/p) = log \prod_{i=1}^{N} Prob( r_i / p) </math> <br> <math> = \sum_{i=1}^{N}log[ Prob(r_i/p) ] </math> <br> <math>= \sum_{i=1}^{N}[ log \binom{n}{r_i} + r_i log p + (n - r_i) log(1 - p) ]</math> <br> | ||
+ | |||
+ | <br> To find the maximum, we set the derivative of log-likelihood function to be 0: | ||
+ | |||
+ | <br> <math>\frac{d}{dp}log( \mathcal{D}/p) = 0 </math> <br> <math>\sum_{i=1}^{N} [ \frac{r_i}{p} + \frac{n - r_i}{1 - p} (-1) ] = 0 </math> <br> <math>\frac{1}{p} \sum_{i=1}^{N} r_i = \frac{1}{1-p} \sum_{i=1}^{N}(n - r_i) </math> <br> <math>(1-p)\sum_{i=1}^{N} r_i = p \sum_{i=1}^{N}(n - r_i) </math> <br> <math>p = \frac{1}{N} \sum_{i=1}^{N} r_i/n </math> <br> | ||
+ | |||
+ | Hence p, which is the unknown parameter is equal to the average of number of successes in the sample space. | ||
+ | |||
+ | ---- | ||
+ | |||
+ | ---- | ||
+ | |||
+ | References: | ||
+ | |||
+ | ---- | ||
+ | |||
+ | 1. http://en.wikipedia.org/wiki/Maximum_likelihood<br>2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification. <br>3. Mimi Boutin, ECE 662 Pattern Recognition Lectures. <br>4. http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Bayesian.pdf<br>5. http://www.saylor.org/site/wp-content/uploads/2011/11/CS405-6.2.3.2.2.1-WIKIPEDIA.pdf<br>6. http://www.colorado.edu/economics/morey/7818/estimation/maxlik/maxlik.pdf<br>7. https://files.nyu.edu/mrg217/public/mle_introduction1.pdf<br>8. http://www.cs.haifa.ac.il/~rita/ml_course/lectures/MLE_Tutorial.pdf<br>9. In Jae Myung, Tutorial on maximum likelihood estimation | ||
+ | ---- | ||
+ | ---- | ||
+ | ==[[ECE662_Spring2014_MLE_tutorial_slecture_review|Questions and comments]]== | ||
+ | |||
+ | If you have any questions, comments, etc. please post them on [[ECE662_Spring2014_MLE_tutorial_slecture_review|this page]]. | ||
+ | ---- | ||
+ | [[2014_Spring_ECE_662_Boutin|Back to ECE662, Spring 2014]] |
Latest revision as of 09:49, 22 January 2015
Tutorial on Maximum Likelihood Estimation: A Parametric Density Estimation Method
A slecture by Sudhir Kylasa
Partially based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Contents
- 1 Motivation
- 2 Definition
- 3 What is Likelihood function ?
- 4 Method
- 5 Properties
- 6 Numerical Examples using Maximum Likelihood Estimates
- 6.1 Estimating prior probability using MLE
- 6.2 [[|Estimating μ of a Gaussian distribution when Σ is known]]
- 6.3 [[|Estimating μ and σ2 for 1-D gaussian distribution using MLE]]
- 6.4 How far does the $ \hat{\mu} $ deviate from the true μ when MLE is used.
- 6.5 Estimate for Σ is biased, when MLE is used.
- 6.6 Binomial Distribution
- 6.7 Questions and comments
Motivation
Suppose one wishes to determine just how biased an unfair coin is. Call the probability of
tossing a HEAD is p. The goal then is to determine p.
Also suppose the coin is tossed 80 times: i.e., the sample might be something like x1 = H,
x2 = T, …, x8 = T, and the count of number of HEADS, "H" is observed.
The probability of tossing TAILS is 1 − p. Suppose the outcome is 49 HEADS and 31 TAILS,
and suppose the coin was taken from a box containing three coins: one which gives HEADS
with probability p = 1 / 3, one which gives HEADS with probability p = 1 / 2 and another which
gives HEADS with probability p = 2 / 3. The coins have lost their labels, so which one it was is
unknown. Clearly the probability mass function for this experiment is binomial distribution with
sample size equal to 80, number of successes equal to 49 but different values of p. We have
the following probability mass functions for each of the above mentioned cases:
$ Pr(H = 49 | p = {1}/{3}) = \binom{80}{49}(1/3)^{49}(1 - 1/3)^31 \approx 0.000 $
$ Pr(H = 49 | p = {1}/{2}) = \binom{80}{49}(1/2)^{49}(1 - 1/2)^31 \approx 0.012 $
$ Pr(H = 49 | p = {2}/{3}) = \binom{80}{49}(2/3)^{49}(1 - 2/3)^31 \approx 0.054 $
Based on the above equations, we can conclude that the coin with p = 2 / 3 was more likely
to be picked up for the observations which we were given to begin with.
Definition
The generic situation is that we observe a n-dimensional random vector X with probability
density (or mass) function f(x / θ). It is assumed that θ is a fixed, unknown constant
belonging to the set $ \Theta \subset \mathbb{R}^{n} $.
For $ x \in \mathbb{R}^{n} $, the likelihood function of θ is defined as
L(θ / x) = f(x / θ)
x is regarded as fixed, and θ is regarded as the variable for L. The log-likelihood function
is defined as l(θ / x) = log L(θ / x)
The Maximum Likelihood Estimate (or MLE) is the value $ \hat{ \theta } = \hat{\theta(x)} \in \Theta $
maximizing L(θ / x), provided it exists:
$ L(\hat{\theta}/(x)) = \underset{\theta}{argmax}[ L(\theta/x) ] $
What is Likelihood function ?
If the probability of an event X dependent on model parameters p is written as
P(X | p)
then we talk about the likelihood
L(p | X)
that is the likelihood of the parameters given the data.
For most sensible models, we will find that certain data are more probable than other data.
The aim of maximum likelihood estimation is to find the parameter value(s) that makes the
observed data most likely. This is because the likelihood of the parameters given the data
is defined to be equal to the probability of the data given the parameters. If we were in the
business of making predictions based on a set of solid assumptions, then we would be
interested in probabilities - the probability of certain outcomes occurring or not occurring.
However, in the case of data analysis, we have already observed all the data: once they have
been observed they are fixed, there is no 'probabilistic' part to them anymore (the word data
comes from the Latin word meaning 'given'). We are much more interested in the likelihood
of the model parameters that underly the fixed data. The following is the relation between the
likelihood and the probability spaces:
Probability:
Knowing Parameters $ \rightarrow $ Prediction of Outcomes
Likelihood:
Observation of Data $ \rightarrow $ Estimation of Parameters
Method
Maximum likelihood (ML) estimates need not exist nor be unique. In this section, we show how to compute ML
estimates when they exist and are unique. For computational convenience, the ML estimate
is obtained by maximizing the log-likelihood function, log L(θ / x). This is because the
two functions log L(θ / x) and L(θ / x) are monotonically related to each other
so the same ML estimate is obtained by maximizing either one. Assume that the log-likelihood
function is differentiable, if θM'L'E exists, it must satisfy the following partial
differential equation known as the likelihood equation:
$ \frac{d}{d\theta}\left( log L(\theta/x) \right) = 0 $
at θ$ = $θM'L'E. This is because maximum or minimum of a continuously differentiable
function implies that its first derivatives vanishes at such points.
The likelihood equation represents a necessary condition for the existence of an MLE estimate.
An additional condition must also be satisfied to ensure that log L(θ / x) is a maximum and not
minimum, since the first derivative cannot reveal this. To be a maximum, the shape of the log-likelihood
function should be convex in the neighborhood of θM'L'E. This can be checked by
calculating the second derivatives of the log-likelihoods and showing whether they are all negative
at θ$ = $θM'L'E.
$ \frac{d^{2}}{d\theta^{2}}\left( log L(\theta/x) \right) < 0 $
Properties
Some general properties (advantages and disadvantages) of the Maximum Likelihood Estimate are as follows:
- For large data samples (large N) the likelihood function L approaches a Gaussian distribution
- Maximum Likelihood estimates are usually consistent.
- For large N the estimates converge to the true value of the parameters which are estimated.
- Maximum Likelihood Estimates are usually unbiased (asymptotically).
- Maximum Likelihood Estimate is efficient: (the estimates have the smallest variance).
- Maximum Likelihood Estimate is sufficient: (it uses all the information in the observations).
- The solution from the Maximum Likelihood Estimate is unique.
On the other hand, we must know the correct probability distribution for the problem at hand.
Numerical Examples using Maximum Likelihood Estimates
In the following section, we discuss the applications of MLE procedure in estimating unknown
parameters of various common density distributions.
Estimating prior probability using MLE
Consider a two-class classification problem, with classes (ω1,ω2) and let
Prob(ω1) = p and Prob(ω2) = 1 − p (here p is the unknown parameter). By using
MLE, we can estimate p as follows:
Let the sample be $ \mathcal{D} = (x_{1}, x_{2}, \ldots, x_{N}) $. Let ωi'j be the class of the feature vector xj
(N is the sample size and <span class="texhtml" />N1 is the number of feature vectors belonging to class ω1).
Also assume that samples$ x_{1}, x_{2} \ldots, x_{N} $ are independent events.
$ Prob(\mathcal{D}/p) = \prod_{j=1}^{N} Prob(\omega_{ij} / p) $
$ = p^{N_1} * (1-p)^{N - N_1} $
Please note that $ Prob(\mathcal{D}/p) $ is infinitely differentiable function of p, so the local maxima lies where its derivative is zero.
$ \frac{d}{dp} \left( p^{N_1} * (1-p)^{N - N_1} \right) = 0 $
$ N_1 * p^{N_1 - 1} * (1 - p)^{N - N_1} - (N - N_1) * (1 - p)^{N - N_1 - 1} * p^{N} = 0 $
$ p^{N_1} * (1-p)^{N - N_1 - 1} = 0 $
Solving the above equation for p, we get the following:
$ p^{N_1} * (1-p)^{N - N_1 - 1} = 0 $
N1 * (1 − p) = (N − N1) * p
So p is either 0 or 1 by eq. CHANGE HERE and p is (N1 / N) by eq. CHANGE HERE. Hence this
proves that taking the frequencies for probabilities of the feature vectors is optimum and using
MLE we showed that p is maximized. Hence the class probabilities are optimum
(the likelihood function is maximized using MLE).
[[|Estimating μ of a Gaussian distribution when Σ is known]]
In this section, we estimate the value of μ (μM'L'E), when the covariance matrix (Σ) in known,
for gaussian distribution in n-dimensional feature space.
$ \rho( \mathcal{D} / \mu) = \prod_{j = 1}^{N} \rho ( x_{j} / \mu ) $
log likelihood function is
[[|$ log \rho( \mathcal{D} / \mu) = \sum_{j = 1}^{N} log (\rho ( x_{j} / \mu ) ) $ ]][[|]]
$ = \sum_{j=1}^{N} log\left[ \frac{1}{ (2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}} } exp^{-\frac{(x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu)}{2}} \right] $
$ = \sum_{j=1}^{N} log \left[ \frac{1}{ (2 \pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}} } \right] - \frac{1}{2} \left[(x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \right] $
This function is infinitely differentiable function of unknown parameters (μ)'s. To find the maxima, we set the derivative of this eq. [[|]] to 0.
$ \begin{bmatrix} \frac{d}{d\mu_1} \\ \frac{d}{d\mu_2} \\ \vdots \\ \frac{d}{d\mu_N} \\ \end{bmatrix}_{n \times 1} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \end{bmatrix}_{n \times 1} $
Differentiating the log likelihood functions yields the following results:
$ \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) = \frac{-1}{2} \sum_{j=1}^{N} \triangledown_{\vec{\mu}} \left[ (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \right] $
$ =\frac{-1}{2} \sum_{j=1}^{N} \begin{bmatrix} \frac{d}{d\mu_1} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ \frac{d}{d\mu_2} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ \vdots \\ \frac{d}{d\mu_N} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) \\ \end{bmatrix}_{n \times 1} $
But we know that,
$ \frac{d}{d\mu_{i}} (x_j - \mu)^{T} \Sigma^{-1}(x_j - \mu) = \left[ \frac{d}{d\mu} (x_j - \mu)^{T} \right] {\Sigma}^{-1}(x_j - \mu) + (x_j - \mu)^{T} \Sigma^{-1} \left[ \frac{d}{d\mu} (x_j - \mu) \right] $
$ = 2 \left[ \frac{d}{d\mu} (x_j - \mu)^{T} \right] {\Sigma}^{-1}(x_j - \mu) $
$ = -2 e_i^{T} {\Sigma}^{-1}(x_j - \mu) $
where $ e_i^{T} = [0,0,0, \ldots, 1, \ldots] $ and 1 is located at position i in the array
$ \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) = \frac{-1}{2} \sum_{j=1}^{N} \begin{bmatrix} -2 e_{1}^{T} {\Sigma}^{-1} (x_j - \mu) \\ -2 e_{2}^{T} {\Sigma}^{-1} (x_j - \mu) \\ \vdots \\ -2 e_{N}^{T} {\Sigma}^{-1} (x_j - \mu) \\ \end{bmatrix}_{n \times 1} $
$ = \sum_{j=1}^{N} \begin{bmatrix} e_{1}^{T} \\ e_{2}^{T} \\ \vdots \\ e_{N}^{T} \\ \end{bmatrix}_{n \times 1} {\Sigma}^{-1} (x_j - \mu) $
Notice that, $ \begin{bmatrix} e_{1}^{T} \\ e_{2}^{T} \\ \vdots \\ e_{N}^{T} \end{bmatrix}_{n \times 1} $ is the identity matrix.
So the above equation reduces to.
[[|$ \triangledown_{\vec{\mu}} log (\mathcal{D}/\vec{\mu}) = \sum_{j=1}^{N} {\Sigma}^{-1} (x_j - \mu) = {\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) $ ]]
By solving the equation. [[|]] for μ.
$ {\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) = 0 $
$ \sum {\Sigma}^{-1} \sum_{j=1}^{N} (x_j - \mu) = 0 $
$ \sum_{j=1}^{N} (x_j - \mu) = 0 $
$ \sum_{j=1}^{N} (x_j) - \sum_{j=1}^{N} (\mu) = 0 $
$ \mu = \frac{1}{N} (\sum_{j=1}^{N}(x_j)) = \mu_{MLE} $
Hence, we proved that using MLE the sample mean is the maximum likelihood estimate of any given sample.
[[|Estimating μ and σ2 for 1-D gaussian distribution using MLE]]
In this subsection, we estimate the μ and σ2 for one-dimensional gaussian data. Here θ = (θ1,θ2) are (μ,σ2),
which are unknown parameters. We estimate these parameter using the procedure discussed in the section [[|]].
The log-likelihood function for this case is given by the following equation:
$ log \rho(x_k / \mu, \sigma^2) = log \left[ \frac{1}{\surd{2\pi}\sigma^2}exp(-\frac{x_k - \mu}{2\sigma^2}) \right] $
$ = -\frac{1}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2}(x_k - \mu)^2 log \rho(\mathcal{D}/\mu,\sigma^2) $
$ = \prod_{k=1}^{N} log \rho (x_k / \mu,\sigma^2 ) $
Since $ (x_1, x_2, \ldots, x_N) $ are I.I.D's, the density function can be written in product form as follows:
[[|$ \rho(\mathcal{D}/\mu,\sigma^2) = \rho (( x_1, x_2, \ldots, x_N ) / \mu,\sigma^2) = \rho(x_1/\mu,\sigma^2)\rho(x_2/\mu,\sigma^2) \ldots \rho(x_N/\mu,\sigma^2) $]]
$ log \rho(\mathcal{D}/\mu,\sigma^2) = log \prod_{k=1}^{N} \rho (x_k / \mu,\sigma^2 ) $
$ = \sum_{k=1}^{N} \left[ -\frac{1}{2} log (1\pi\sigma^2) - \frac{1}{2\sigma^2} (x_k - \mu)^2 \right] $
Differentiating and setting the eq. [[|]] to 0, yields the following equations:
$ \triangledown_{\mu,\sigma^2} \rho(\mathcal{D}/\mu,\sigma^2) = \begin{bmatrix} \frac{d}{d\mu} log \rho (\mathcal{D}/\mu,\sigma^2) \\ \frac{d}{d\sigma^2} log \rho (\mathcal{D}/\mu,\sigma^2) \\ \end{bmatrix}_{2 \times 1} $
$ = \begin{bmatrix} \frac{d}{d\mu} (-\frac{N}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{N}(x_k - \mu)) \\ \frac{d}{d\sigma^2} (-\frac{N}{2} log (2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{k=1}^{N}(x_k - \mu))\\ \end{bmatrix}_{2 \times 1} $
[[|$ = \begin{bmatrix} \frac{1}{\sigma^2} \sum_{k=1}^{N}(x_k - \mu) \\ -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{N}(x_k - \mu)^2\\ \end{bmatrix}_{2 \times 1} $]]
Solving the eq. [[|]] for μ and σ2 yields the following:
$ \frac{1}{\sigma^2} \left( \sum_{k=1}^{N}(x_k - \mu) \right) = 0 $
$ \sum_{k=1}^{N}(x_k - \mu) = 0 $
$ \hat{\mu} = \mu = \frac{1}{N}\sum_{k=1}^{N}(x_k) -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{k=1}^{N}(x_k - \mu)^2 = 0 $
$ \sum_{k=1}^{N}(x_k - \mu)^2 = N\sigma^2 $
$ \hat{\sigma^2} = \sigma^2 = \frac{ \sum_{k=1}^{N}(x_k - \hat{\mu})^2 }{N} $
which are the mean and standard deviation of the empirical data.
In general for N-dimensional Gaussian data, when X ˜ N(μ,Σ) where X $ \in \mathcal{R}^n $,
with $ \vec{\mu} $ and Σ are unknown parameters, then MLE for μ and Σ are as follows:
$ \hat{\mu} = \frac{1}{N}\sum_{k=1}^{N}(\vec{x_k}) $
$ \hat{\sigma^2} = \frac{ \sum_{k=1}^{N}(\vec{x_k} - \vec{\mu})(\vec{x_k} - \vec{\mu})^T }{N} $
How far does the $ \hat{\mu} $ deviate from the true μ when MLE is used.
As discussed in section [[|]], we know that
$ E[\hat{\mu}] = \frac{1}{N}\sum_{k=1}^{N}(\vec{x_k}) = \mu $
Now we compute the expected value of $ (|\hat{\mu} - \mu|)^2 $ as follows:
$ E[ (|\hat{\mu} - \mu|)^2 ] = E[(\hat{\mu} - \mu)(\hat{\mu} - \mu)] $
$ =E[\hat{\mu}\mu - \mu\hat{\mu} - \hat{\mu}\mu + \mu\mu] $
$ =E[\hat{\mu}\hat{\mu}] - 2 E[\mu\hat{\mu}] + E[\mu\mu] $
Substituting E[$ \hat{\mu} $] = μ we have $ =E[\hat{\mu}\hat{\mu}] - \mu\mu $
$ =E(\frac{1}{N} \sum_{k=1}^{N}X_k * \sum_{j=1}^{N}X_j) - \mu\mu $
$ =\frac{1}{N^2}\sum_{j,k=1}^{N}E[X_jX_k] - \mu\mu $
Treating Xj as random variables in the above equations, we have
$ E[ (|\hat{\mu} - \mu|)^2 ] =\frac{1}{N^2}\left[ \sum_{j,k=1}^{N}E(X_j)E(X_k) + \sum_{j,k=1}^{N}E(X_j * X_k) \right] - \mu\mu $
$ =\frac{1}{N^2}\left[ N(N-1)\mu\mu + N*E[X^2] \right] \mu\mu $
$ E[ (|\hat{\mu} - \mu|)^2 ] = \frac{1}{N}E[X^2] - \frac{1}{N} \mu\mu $
But, we know that
E[ | X − μ | 2] = E[(X − μ)(X − μ)] = σ2
E[X * X] − μ2 = σ2
E[X * X] = σ2 * μ2
Therefore, we have the following result:
$ E[ (|\hat{\mu} - \mu|)^2 ] = \frac{1}{N}(\sigma^2 * \mu^2) - \frac{1}{N}(\mu\mu) $
$ = \frac{1}{N} \sigma^2 $
So the expected value of $ (|\hat{\mu} - \mu|)^2 $ is proportional to the true standard deviation.
Estimate for Σ is biased, when MLE is used.
In the following section, we will show that the estimate for covariance is biased (when μ is unknown)
when MLE is used to estimate its value and equals to true covariance when μ is known.
$ E[ \hat{\Sigma} ] = E[ \frac{1}{N} \sum_{k=1}^{N} (X_k - \mu)(X_k - \mu)^T] $
$ = \sum_{k=1}^{N} E[ \frac{1}{N} (X_k - \mu)(X_k - \mu)^T] $
$ = \frac{1}{N} \sum_{k=1}^{N} E [ X_k X_k^T - X_K \mu^T - \mu X^T + \mu \mu^T ] $
$ = \frac{1}{N} N E[ X_K X_K^T ] - \mu\mu^T $
= E[X'XT − μμT]
= E['XT − 2μμT − μμT]
= E[(X − μ)(X − μ)T] = Σ
If the μ is known, then it turns out that Estimated value of $ \hat{\Sigma} $ is equal to true Σ.
We now show the derivation that in case where μ is not known, then estimated value of $ \hat{\Sigma} $ is not equal to true Σ
$ E[ \hat{\Sigma} ] = \frac{1}{N} E[ \sum_{k=1}^{N} (X_k - \hat{\mu})(X_k - \hat{\mu})^T] $
$ = \frac{1}{N} \sum_{k=1}^{N} E [ X_k X_k^T - X_K \hat{\mu}^T - \hat{\mu} X^T + \hat{\mu} \hat{\mu}^T ] $
$ = \frac{1}{N} \sum_{k=1}^{N} ( E[ X_k X_k^T ] - E[X_K \hat{\mu}^T] - E[ \hat{\mu} X^T] - E[\hat{\mu} \hat{\mu}^T] ) $
$ = \frac{1}{N} \sum_{k=1}^{N} \left( E[ X_k X_k^T ] - E [(\frac{1}{N}\sum_{l=1}^{N}X_l)X_k] \right) - E [\frac{1}{N}\sum_{l=1}^{N}X_k X_l^T] + E[\frac{1}{N}\sum_{l=1}^{N}X_l \frac{1}{N}\sum_{m=1}^{N}X_m^T] $
Splitting the summation above into two parts, l = k and l $ \neq $ k, we have the following:
$ E[ \hat{\Sigma} ] = \frac{1}{N} \sum_{k=1}^{N} \{ E[X X^T] - \frac{1}{N} \Sigma_{l=1, l \neq k}^{N}E(X_l) E (X_k^T) -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) -\frac{1}{N}\Sigma_{l=1, l \neq k}^{N} E(X_k) E(X_l^T) $
$ -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) + \frac{1}{N^2} \Sigma_{l,m=1, l \neq m}^{K} E(X_l)E(X_m^T) + \frac{1}{N^2} \Sigma_{l,m=1, l = m}^{K} E(X_l)E(X_l^T) \} $
$ = \frac{1}{N} \sum_{k=1}^{N} \{ E[X X^T] - \frac{1}{N} \Sigma_{l=1, l \neq k}^{N}\mu\mu^T -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) -\frac{1}{N}\Sigma_{l=1, l \neq k}^{N} \mu\mu^T $
$ -\frac{1}{N} \Sigma_{l=1, l = k}^{N} E( X_k X_k^T ) + \frac{1}{N^2} \Sigma_{l,m=1, l \neq m}^{K} \mu\mu^T + \frac{1}{N^2} \Sigma_{l,m=1, l = m}^{K} E(X_l)E(X_l^T) \} $
$ E[ \hat{\Sigma} ] = \frac{1}{N} \left[ N(1 - \frac{1}{N} E( X X^T ) + N( \frac{1}{N} - 1) \mu\mu^T \right] $
$ = (1 - \frac{1}{N} ) \left[ E( X X^T ) - \mu\mu^T \right] $
$ = ( \frac{N - 1}{N} ) E(X X^T - \mu\mu^T)) $
$ = ( \frac{N - 1}{N} )E[(X - \mu)(X - \mu)^T] \neq \Sigma $
Binomial Distribution
This section discusses the estimation of p in a typical binomial distribution. Assuming that parameter p, is unknown
in a binomial distribution give by the equation below:
$ Prob(r/p) = \binom{n}{r}p^r(1-p)^{n-r} $
$ log Prob(r/p) = log \binom{n}{r} = r log ( p )+ (n-r)log(1-p) $
$ log Prob(\mathcal{D}/p) = log \prod_{i=1}^{N} Prob( r_i / p) $
$ = \sum_{i=1}^{N}log[ Prob(r_i/p) ] $
$ = \sum_{i=1}^{N}[ log \binom{n}{r_i} + r_i log p + (n - r_i) log(1 - p) ] $
To find the maximum, we set the derivative of log-likelihood function to be 0:
$ \frac{d}{dp}log( \mathcal{D}/p) = 0 $
$ \sum_{i=1}^{N} [ \frac{r_i}{p} + \frac{n - r_i}{1 - p} (-1) ] = 0 $
$ \frac{1}{p} \sum_{i=1}^{N} r_i = \frac{1}{1-p} \sum_{i=1}^{N}(n - r_i) $
$ (1-p)\sum_{i=1}^{N} r_i = p \sum_{i=1}^{N}(n - r_i) $
$ p = \frac{1}{N} \sum_{i=1}^{N} r_i/n $
Hence p, which is the unknown parameter is equal to the average of number of successes in the sample space.
References:
1. http://en.wikipedia.org/wiki/Maximum_likelihood
2. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification.
3. Mimi Boutin, ECE 662 Pattern Recognition Lectures.
4. http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Bayesian.pdf
5. http://www.saylor.org/site/wp-content/uploads/2011/11/CS405-6.2.3.2.2.1-WIKIPEDIA.pdf
6. http://www.colorado.edu/economics/morey/7818/estimation/maxlik/maxlik.pdf
7. https://files.nyu.edu/mrg217/public/mle_introduction1.pdf
8. http://www.cs.haifa.ac.il/~rita/ml_course/lectures/MLE_Tutorial.pdf
9. In Jae Myung, Tutorial on maximum likelihood estimation
Questions and comments
If you have any questions, comments, etc. please post them on this page.