Revision as of 23:21, 13 May 2014

WARNING: THIS MATERIAL WAS PLAGIARIZED FROM DUDA AND HART!!!!!

Bayesian Parameter Estimation: Gaussian Case

A slecture by ECE student Shaobo Fang

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Introduction: Bayesian Estimation

Although the estimates obtained from Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE) are often similar or even identical, the key ideas behind MLE and BPE are completely different. In MLE, the parameter to be estimated is treated as a fixed but unknown number (or several numbers if there is more than one parameter), while in BPE the parameter is treated as a random variable (or random vector).

To start with, Bayes' formula is written in the following form given the sample set $ \mathcal{D} $:


$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $

From the above equation, it can be concluded that both the class-conditional densities and the prior probabilities can be obtained from the training data.

Now assume that we are working in the supervised case with labelled training data, that is, all samples in the training data can be separated accurately into c subsets $ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c $.

Assuming further that the prior probabilities are known, so that $ P(w_i|D) = P(w_i) $, and that the samples in $ \mathcal{D}_i $ carry no information about the other classes, the above equation can be developed into the following form:

$ P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)} $
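As a minimal sketch of how this formula is evaluated, the snippet below normalizes the products of class-conditional densities and priors for a single query point. The function name `class_posteriors` and the numbers are illustrative assumptions, not part of the original lecture.

```python
def class_posteriors(class_densities, priors):
    """Return P(w_i | x, D) for every class i.

    class_densities[i] stands for p(x | w_i, D_i) evaluated at the query point x,
    priors[i] stands for P(w_i). Both lists are assumed to have c entries.
    """
    joint = [d * p for d, p in zip(class_densities, priors)]
    evidence = sum(joint)  # the denominator: sum over all c classes
    return [j / evidence for j in joint]


# Hypothetical two-class example with made-up density values and equal priors.
posteriors = class_posteriors([0.20, 0.10], [0.5, 0.5])
print(posteriors)
```

With equal priors, the class whose density is larger at x receives the larger posterior probability.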

Now, assume that a set of $ n $ independent samples $ \mathcal{D} = \{x_1, x_2, ... , x_n \} $ is obtained from a certain class, and that each sample is drawn from a density of known parametric form $ p(x|\theta) $. In BPE, we consider $ \theta $ to be a random variable (or random vector). The density of $ x $ given the sample set $ \mathcal{D} $, written $ p(x|D) $, can then be computed as:

$ p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta $


Bayesian Parameter Estimation: General Theory

To better understand the Bayesian Parameter Estimation (BPE) technique, we first briefly discuss the general theory. In BPE, the parameter $ \theta $ is unknown and is treated as a random variable (or random vector). Although $ \theta $ itself is unknown, it is assumed to have a known prior distribution $ p(\theta) $. Hence, to estimate $ \theta $, both the information in the prior and the information in the set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ need to be used. The form of the density of a sample x given $ \theta $ is also assumed to be known, and is denoted $ p(x|\theta) $.

From the previous section we have already obtained:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $

Furthermore, by Bayes' theorem,

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $

Although we are very close already, we still need to express the density of the sample set $ \mathcal{D} $ in terms of the individual samples $ x_k $. Since the samples are assumed to be independent, the density of $ \mathcal{D} $ given $ \theta $ is:

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $

Now write the set containing the first n samples as $ \mathcal{D}^n = \{x_1, x_2, ... x_n\} $. Since $ p(x_n|\theta) $ is assumed to be known and $ p(D^{n-1}|\theta) $ is the density of the first n-1 samples, the above equation can be written recursively as:


$ p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) $

Hence, substituting this recursion into Bayes' theorem, the Bayesian Parameter Estimation equation takes the following recursive form:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
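The recursive update above can be sketched numerically on a grid of candidate values for $ \theta $. The sketch below assumes, purely for illustration, that $ \theta $ is the mean of a Gaussian with known standard deviation; the grid, the prior N(0, 2^2), and the sample values are made up, and the integral in the denominator is approximated by a Riemann sum.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def recursive_posterior(samples, grid, prior, sigma):
    """Apply p(theta|D^n) proportional to p(x_n|theta) p(theta|D^{n-1}), one sample at a time.

    grid  : evenly spaced candidate values of theta
    prior : p(theta) evaluated on the grid
    sigma : known standard deviation of p(x|theta) (an assumption of this sketch)
    """
    step = grid[1] - grid[0]
    posterior = list(prior)
    for x in samples:
        # Numerator: likelihood of the new sample times the previous posterior.
        posterior = [gaussian_pdf(x, t, sigma) * p for t, p in zip(grid, posterior)]
        # Denominator: Riemann-sum approximation of the integral over theta.
        norm = sum(posterior) * step
        posterior = [p / norm for p in posterior]
    return posterior


grid = [-5.0 + 0.01 * i for i in range(1001)]      # theta in [-5, 5]
prior = [gaussian_pdf(t, 0.0, 2.0) for t in grid]  # assumed prior N(0, 2^2)
samples = [1.2, 0.8, 1.1, 0.9]                     # made-up data
posterior = recursive_posterior(samples, grid, prior, sigma=1.0)
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * 0.01
print(posterior_mean)
```

Each pass through the loop uses the previous posterior as the prior for the next sample, so the samples can be processed one at a time without storing the full likelihood.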




Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $ p(\mu|\mathcal{D}) $

As was done for MLE, we start with the simple case in which only the mean $ \mu $ is unknown. As usual, we assume each sample $ x_k $ is normally distributed:

$ p(x|\mu) \sim N(\mu, \sigma^2) $

and the parameter $ \mu $ has the prior distribution:

$ p(\mu) \sim N(\mu_0, \sigma_0^2), $

since $ \mu $ is treated not as a fixed number but as a random variable.


Using Bayes' formula and the derivation from the previous section, the posterior density is easily obtained:

$ p(\mu|D) = \alpha \prod_{k = 1}^n p(x_k|\mu)p(\mu) $

where $ \alpha $ is a normalization ('scale') factor introduced to simplify the derivation. Note that $ \alpha $ is completely independent of $ \mu $.

As $ x_k $ is normally distributed, we substitute the known density functions for $ p(x_k|\mu) $ and $ p(\mu) $:

$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $

Substituting $ p(x_k|\mu) $ and $ p(\mu) $ into $ p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right]p(\mu) $, where the prior $ p(\mu) $ appears only once and lies outside the product, we obtain:

$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $
$ p(\mu|D) = \alpha \frac{1}{(2\pi\sigma^2)^{n/2}} \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2} -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}] $

Absorbing the factors that do not depend on $ \mu $ into the scaling factor, first as $ \alpha' $ and then as $ \alpha'' $:

$ p(\mu|D) = \alpha' \exp[-\frac{1}{2}(\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2} + (\frac{\mu-\mu_0}{\sigma_0})^{2})] $
$ p(\mu|D) = \alpha'' \exp[-\frac{1}{2}[(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu]] $

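Since the exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a Gaussian density, which can be written as:

$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}] $

Equating the coefficients of $ \mu^2 $ and $ \mu $ in the two expressions for $ p(\mu|D) $ gives $ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} $ and $ \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2} $. Solving for $ \mu_n $:

$ \mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $

where $ \bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.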

Since a Gaussian density is determined by its mean and variance, the variance $ \sigma_n^2 $ associated with $ \mu_n $ is obtained in the same way:

$ \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2} $
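The closed-form expressions for $ \mu_n $ and $ \sigma_n^2 $ can be evaluated directly. The following is a minimal sketch; the function name `gaussian_posterior` and the sample values are illustrative assumptions.

```python
def gaussian_posterior(samples, sigma, mu0, sigma0):
    """Return (mu_n, sigma_n^2) of p(mu|D) for a Gaussian with known sigma.

    sigma  : known standard deviation of p(x|mu)
    mu0    : mean of the prior p(mu)
    sigma0 : standard deviation of the prior p(mu)
    """
    n = len(samples)
    sample_mean = sum(samples) / n
    denom = n * sigma0 ** 2 + sigma ** 2
    # Weighted combination of the sample mean and the prior mean.
    mu_n = (n * sigma0 ** 2 / denom) * sample_mean + (sigma ** 2 / denom) * mu0
    var_n = sigma0 ** 2 * sigma ** 2 / denom
    return mu_n, var_n


# Made-up data: four samples, sigma = 1, prior N(0, 2^2).
mu_n, var_n = gaussian_posterior([1.2, 0.8, 1.1, 0.9], sigma=1.0, mu0=0.0, sigma0=2.0)
print(mu_n, var_n)
```

The posterior mean lies between the prior mean and the sample mean, with the weight on the sample mean growing with n.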


Observation:

As $ n \to \infty $, $ \sigma_n \to 0 $, so $ p(\mu|D) $ becomes more sharply peaked around $ \mu_n $.
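To illustrate this observation numerically, the sketch below evaluates the posterior variance and the weight placed on the sample mean for increasing sample sizes. The values sigma = 1 and sigma_0 = 2 are arbitrary assumptions chosen for illustration.

```python
def posterior_variance(n, sigma, sigma0):
    """sigma_n^2 = sigma0^2 sigma^2 / (n sigma0^2 + sigma^2)."""
    return sigma0 ** 2 * sigma ** 2 / (n * sigma0 ** 2 + sigma ** 2)


def data_weight(n, sigma, sigma0):
    """Coefficient of the sample mean in the expression for mu_n."""
    return n * sigma0 ** 2 / (n * sigma0 ** 2 + sigma ** 2)


sizes = [1, 10, 100, 1000]
variances = [posterior_variance(n, 1.0, 2.0) for n in sizes]
weights = [data_weight(n, 1.0, 2.0) for n in sizes]
for n, v, w in zip(sizes, variances, weights):
    print(f"n={n:5d}  sigma_n^2={v:.5f}  weight on sample mean={w:.5f}")
```

The variance shrinks roughly like sigma^2 / n, while the weight on the sample mean approaches 1, so the prior matters less as more samples arrive.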

The Univariate Case: $ p(x|\mathcal{D}) $

Having obtained the posterior density $ p(\mu|\mathcal{D}) $ for the mean, the remaining task is to estimate the "class-conditional" density $ p(x|D) $.

Following Section 3.4.2 of Duda and Hart [2] and Prof. Boutin's lecture notes [1]:


$ p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu $
$ p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu $


$ p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp[-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n) $


Where $ f(\sigma, \sigma_n) $ is defined as:


$ f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu $

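The integral $ f(\sigma, \sigma_n) $ does not depend on $ x $, so as a function of $ x $ the density is proportional to $ \exp[-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}] $. Hence, $ p(x|D) $ is normally distributed as: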

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
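As a sanity check on this result, the integral $ \int p(x|\mu)p(\mu|\mathcal{D})d\mu $ can be approximated numerically and compared with the density of $ N(\mu_n, \sigma^2 + \sigma_n^2) $. The sketch below uses a plain Riemann sum; the parameter values, the query point x, and the integration range are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (var is the variance)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_numeric(x, sigma2, mu_n, var_n, half_width=10.0, steps=20001):
    """Riemann-sum approximation of the integral of p(x|mu) p(mu|D) over mu."""
    lo = mu_n - half_width
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        mu = lo + i * h
        total += gaussian_pdf(x, mu, sigma2) * gaussian_pdf(mu, mu_n, var_n)
    return total * h


# Illustrative values: sigma^2 = 1, mu_n = 16/17, sigma_n^2 = 4/17.
mu_n, var_n, sigma2 = 16.0 / 17.0, 4.0 / 17.0, 1.0
x = 1.5
numeric = predictive_numeric(x, sigma2, mu_n, var_n)
closed_form = gaussian_pdf(x, mu_n, sigma2 + var_n)
print(numeric, closed_form)
```

The two values should agree closely, which reflects that the uncertainty about $ \mu $ adds $ \sigma_n^2 $ to the known variance $ \sigma^2 $.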


References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.

