  
 
<center><font size= 5>
 
Bayesian Parameter Estimation: Gaussian Case
 
</font size>
 
  
A [http://www.projectrhea.org/learning/slectures.php slecture] by [[ECE]] student Shaobo Fang  
  
Partly based on the [[2014_Spring_ECE_662_Boutin_Statistical_Pattern_recognition_slectures|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]].  
 
</center>
 
  
== ''Introduction: Bayesian Estimation'' ==
  
  
According to Duda's Chapter 3.3, although the estimates obtained from Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE) are often nearly identical, the conceptual difference is significant: in MLE the parameter <math>\theta</math> is treated as a fixed, unknown quantity, while in BPE <math>\theta</math> is treated as a random variable (or random vector) described by a probability distribution.
  
To start with, given the sample set <math>\mathcal{D}</math>, Bayes' formula becomes:
  
  
<center>
<math>P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)}</math>
 
</center>
 
  
As the above equation suggests, the training data can be used to help determine both the class-conditional densities and the prior probabilities.
  
Furthermore, since we are treating the supervised case with labelled training data, the training samples can be separated by class into c subsets <math>\mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c</math>. Hence, the above equation can be further developed into the following form:
  
 
<center>
<math>P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)}</math>
 
</center>
 
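To make the use of this formula concrete, here is a minimal Python sketch that evaluates the class posteriors for a single test point once the class-conditional densities and priors have been estimated from the labelled subsets. The Gaussian densities and prior values below are illustrative assumptions, not part of the lecture material.

<pre>
import numpy as np

def class_posteriors(x, class_densities, priors):
    """Evaluate P(w_i|x,D) from estimated class-conditional densities
    p(x|w_i,D_i) and class priors P(w_i), following the formula above."""
    likelihoods = np.array([p(x) for p in class_densities])
    unnormalized = likelihoods * np.array(priors)
    return unnormalized / unnormalized.sum()

# Hypothetical example: two classes whose class-conditional densities were
# estimated (here simply assumed Gaussian) from their training subsets D_1, D_2.
def gaussian_density(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

densities = [gaussian_density(0.0, 1.0), gaussian_density(2.0, 1.5)]
priors = [0.6, 0.4]
print(class_posteriors(1.0, densities, priors))  # the two posteriors sum to 1
</pre>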
  
Now, assume that <math>p(x)</math> has a known parametric form determined by the parameter vector <math>\theta</math>, and that a set of <math>N</math> independent samples <math>\mathcal{D} = \{x_1, x_2, ... , x_N \}</math> has been drawn from a given class. Treating <math>\theta</math> as a random variable, we consider the continuous case.
  
 
<math>p(x|D)</math> can be computed as:
 
<center><math>p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta</math></center>


== ''Bayesian Parameter Estimation: General Theory'' ==
  
  
To provide a better understanding of the Bayesian Parameter Estimation (BPE) technique, we first discuss the general theory, which applies to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density <math>p(x|\theta)</math> is assumed to be known, but the value of the parameter vector <math>\theta</math> is not known exactly.

2. The initial knowledge about <math>\theta</math> is assumed to be contained in a known a priori density <math>p(\theta)</math>.

3. The rest of the knowledge about <math>\theta</math> is contained in a set <math>\mathcal{D}</math> of n samples <math>x_1, x_2, ... , x_n</math> drawn independently according to the unknown probability density <math>p(x)</math>.

From the previous section we have already obtained:
 
<center><math>p(x|D) = \int p(x|\theta)p(\theta|D)d\theta</math></center>
 
  
Furthermore, by Bayes' theorem,
  
 
<center><math>p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta}</math></center>
  
 
Next, the dependence on the sample set <math>\mathcal{D}</math> must be expressed in terms of the individual samples <math>x_k</math>. By the independence assumption,
 
<center><math>p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta)</math></center>
 
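In practice this product is usually evaluated as a sum of log-likelihoods to avoid numerical underflow. A minimal sketch, assuming purely for illustration that <math>p(x_k|\theta)</math> is a univariate Gaussian with <math>\theta = (\mu, \sigma)</math>:

<pre>
import numpy as np

def log_likelihood(samples, mu, sigma):
    """log p(D|theta) = sum_k log p(x_k|theta), for a univariate Gaussian p(x_k|theta)."""
    samples = np.asarray(samples, dtype=float)
    return np.sum(-0.5 * ((samples - mu) / sigma) ** 2
                  - np.log(np.sqrt(2.0 * np.pi) * sigma))

D = [2.1, 1.8, 2.5, 2.0]   # made-up sample values
print(log_likelihood(D, mu=2.0, sigma=0.5))
</pre>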
  
Now, denote a sample set containing <math>n</math> samples as <math>\mathcal{D}^n = \{x_1, x_2, ... x_n\}</math>. Since <math>p(x_n|\theta)</math> is known and <math>p(D^{n-1}|\theta)</math> is the corresponding likelihood of the first <math>n-1</math> samples, the product above can be written recursively:
 
<center><math> p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta) </math></center>
  
Substituting this recursion into Bayes' theorem, the Bayesian parameter estimation update takes the following recursive form:
  
 
<center><math>p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta}</math></center>
 
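This recursion suggests a simple numerical procedure: represent <math>p(\theta|D^{n-1})</math> on a grid of <math>\theta</math> values, multiply by <math>p(x_n|\theta)</math>, and renormalize after each new sample. The sketch below assumes, purely for illustration, that <math>\theta</math> is the unknown mean of a univariate Gaussian with known <math>\sigma</math>:

<pre>
import numpy as np

def recursive_posterior(samples, theta_grid, prior, sigma=1.0):
    """Update p(theta|D^n) on a grid using
    p(theta|D^n) proportional to p(x_n|theta) * p(theta|D^(n-1))."""
    posterior = np.asarray(prior, dtype=float)
    for x_n in samples:
        likelihood = np.exp(-0.5 * ((x_n - theta_grid) / sigma) ** 2)
        posterior = likelihood * posterior
        posterior /= np.trapz(posterior, theta_grid)  # the normalizing integral
    return posterior

theta = np.linspace(-5.0, 5.0, 1001)
prior = np.exp(-0.5 * theta ** 2)      # hypothetical N(0, 1) prior, unnormalized
prior /= np.trapz(prior, theta)
posterior = recursive_posterior([1.2, 0.8, 1.5], theta, prior, sigma=1.0)
print(theta[np.argmax(posterior)])     # posterior mode after three samples
</pre>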
== ''Bayesian Parameter Estimation: Gaussian Case'' ==
 
== ''The Univariate Case: <math>p(\mu|\mathcal{D})</math>'' ==
 
  
As in MLE, we first consider the simple case in which the mean <math>\mu</math> is the only unknown parameter. We assume each sample <math>x_k</math> is normally distributed:
  
<center><math>p(x|\mu) \sim N(\mu, \sigma^2)</math></center>

and that the parameter <math>\mu</math> has the prior distribution

<center><math>p(\mu) \sim N(\mu_0, \sigma_0^2),</math></center>

since <math>\mu</math> is not estimated as a single number but treated as a random variable; the known variance <math>\sigma^2</math> and the prior parameters <math>\mu_0, \sigma_0^2</math> are assumed given.


Using Bayes' formula and the derivation from the previous section, the posterior density of <math>\mu</math> is easily obtained:
  
 
<center><math>p(\mu|D) = \alpha \prod_{k = 1}^n p(x_k|\mu)p(\mu)</math></center>
 
  
where <math>\alpha</math> is a normalization factor that does not depend on <math>\mu</math>.
  
Since <math>x_k</math> and <math>\mu</math> are normally distributed, we substitute <math>p(x_k|\mu)</math> and <math>p(\mu)</math> with their known density functions:
  
 
<center><math>p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]</math></center>
 
 
<center><math>p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]</math></center>
  
Substituting these densities into <math>p(\mu|D) = \alpha \prod_{k = 1}^n p(x_k|\mu)p(\mu)</math>, we obtain:
  
 
<center><math>p(\mu|D) = \alpha \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]</math></center>
 
 
<center><math>p(\mu|D) = \alpha \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}  -\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]</math></center>
 
  
Collecting all factors that do not depend on <math>\mu</math> into updated scaling factors <math>\alpha'</math> and <math>\alpha''</math>, this simplifies to:
  
<center><math>p(\mu|D) = \alpha' exp [-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}  -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}]</math></center>

<center><math>p(\mu|D) = \alpha'' exp [-\frac{1}{2}((\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2  -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu)]</math></center>
  
Finally, compare the derived <math>p(\mu|D)</math> with a Gaussian distribution written in standard form:
  
 
<center><math>p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}]</math></center>
  
Matching coefficients with the standard form, <math>\mu_n</math> and <math>\sigma_n^2</math> can be obtained accordingly:
  
 
<center><math>\mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x_n} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0</math></center>
 
where <math>\bar{x_n}</math> is the sample mean and <math>n</math> is the sample size. The variance <math>\sigma_n^2</math> associated with <math>\mu_n</math> is obtained correspondingly as:


<center><math>\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}</math></center>


Both parameters matter here: since <math>p(\mu|D)</math> is Gaussian, its mean <math>\mu_n</math> and variance <math>\sigma_n^2</math> determine it completely.
  
Observation: as <math>n \to \infty</math>, <math>\sigma_n \to 0</math>, and <math>p(\mu|D)</math> becomes more and more sharply peaked around <math>\mu_n</math>; the contribution of the prior vanishes and <math>\mu_n</math> approaches the sample mean <math>\bar{x_n}</math>.
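Since the update is available in closed form for the Gaussian case, <math>\mu_n</math> and <math>\sigma_n^2</math> can be computed directly from the formulas above. The following sketch uses made-up sample values and prior parameters purely for illustration:

<pre>
import numpy as np

def gaussian_posterior_params(samples, mu0, sigma0_sq, sigma_sq):
    """Return (mu_n, sigma_n^2) for a Gaussian likelihood with known sigma^2
    and a N(mu0, sigma0^2) prior on mu, following the formulas above."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    x_bar = samples.mean()
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * x_bar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq

D = [2.1, 1.8, 2.5, 2.0, 2.2]   # made-up training samples
print(gaussian_posterior_params(D, mu0=0.0, sigma0_sq=1.0, sigma_sq=0.25))
</pre>

Increasing the number of samples shrinks <math>\sigma_n^2</math> and pulls <math>\mu_n</math> toward the sample mean, matching the observation above.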
 
  
 
== ''The Univariate Case: <math>p(x|\mathcal{D})</math>'' ==
 
  
  
Having obtained the posterior density <math>p(\mu|\mathcal{D})</math>, with its mean <math>\mu_n</math> and variance <math>\sigma_n^2</math> now known, the final step is to estimate the class-conditional density <math>p(x|\mathcal{D})</math>.
  
 
Based on Duda's chapter 3.4.2 and Prof. Mimi's notes:
 
  
 
<center><math>p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu</math></center>
 
 
<center><math>p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}]  \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu</math></center>
 
  
Finally, substituting <math>\sigma_n^2</math> and <math>\mu_n</math> and completing the square in the exponent, the density <math>p(x|\mathcal{D})</math> is obtained:


<center><math>p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n)</math></center>


where <math>f(\sigma,\sigma_n)</math> is a factor that does not depend on <math>x</math>:


<center><math>f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu</math></center>
 
  
 
Hence, <math>p(x|D)</math> is normally distributed as:
 
<center><math>p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)</math></center>
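Since the resulting density is itself Gaussian, it can be evaluated directly once <math>\mu_n</math> and <math>\sigma_n^2</math> are known. A minimal sketch with hypothetical parameter values:

<pre>
import numpy as np

def predictive_density(x, mu_n, sigma_n_sq, sigma_sq):
    """Evaluate p(x|D) = N(mu_n, sigma^2 + sigma_n^2) at x."""
    var = sigma_sq + sigma_n_sq
    return np.exp(-0.5 * (x - mu_n) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical values for mu_n and sigma_n^2, as produced by the formulas above.
mu_n, sigma_n_sq, sigma_sq = 2.05, 0.04, 0.25
print(predictive_density(np.array([1.0, 2.0, 3.0]), mu_n, sigma_n_sq, sigma_sq))
</pre>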
 
----

== ''References'' ==

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience, Second Edition, 2000.

----
 
  
==[[Question_and_Comments_on_BPE|Questions and comments]]==
  
If you have any questions, comments, etc. please post them on [[Question_and_Comments_on_BPE|this page]].
  
  
