Revision as of 03:37, 29 April 2014
A slecture by ECE student Shaobo Fang
Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Introduction: Bayesian Estimation
According to Chapter 3.3 of Duda's book [2], although the answers we get by Bayesian parameter estimation (BPE) will generally be nearly identical to those obtained by maximum likelihood estimation, the conceptual difference is significant. In maximum likelihood estimation the parameter $ \theta $ is fixed, while in Bayesian estimation $ \theta $ is considered to be a random variable.
By definition, given the sample set $ \mathcal{D} $, Bayes' formula becomes:
$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $
As the above equation suggests, we can use the information provided by the training data to help determine both the class-conditional densities and the prior probabilities.
Furthermore, since we are treating the supervised case, we can separate the training samples by class into c subsets $ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c $. Accordingly:
$ P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)} $
Now, assume $ p(x) $ has a known parametric form. We are given a set of $ N $ independent samples $ \mathcal{D} = \{x_1, x_2, ... , x_N \} $, and we view $ \theta $ as a random variable. In the continuous case, $ p(x|D) $ can be computed as:
$ p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta $
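The integral above can be approximated numerically. The following is a minimal sketch, assuming a scalar $ \theta $ that is the mean of a Gaussian with known standard deviation; the helper names (`gaussian_pdf`, `predictive_density`), the grid, and the example posterior are hypothetical choices made for illustration, not part of the original lecture.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x (broadcasts over arrays)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z * z) / (std * np.sqrt(2.0 * np.pi))

def predictive_density(x, theta_grid, posterior_on_grid, sigma):
    """Approximate p(x|D) = integral of p(x|theta) p(theta|D) dtheta by a Riemann sum.

    theta_grid must be evenly spaced; posterior_on_grid holds p(theta|D) on that grid.
    """
    dtheta = theta_grid[1] - theta_grid[0]
    integrand = gaussian_pdf(x, theta_grid, sigma) * posterior_on_grid
    return np.sum(integrand) * dtheta

# Hypothetical example: the posterior over theta is assumed to be N(1, 0.5^2).
theta_grid = np.linspace(-10.0, 10.0, 4001)
posterior_on_grid = gaussian_pdf(theta_grid, 1.0, 0.5)
p_at_1 = predictive_density(1.0, theta_grid, posterior_on_grid, sigma=1.0)
```

For this Gaussian choice the integral also has a closed form, a Gaussian with variance $ 1^2 + 0.5^2 $, which gives a convenient check on the grid approximation.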
Bayesian Parameter Estimation: General Theory
We first start with a generalized approach which can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:
1. The form of the density $ p(x|\theta) $ is assumed to be known, but the value of the parameter vector $ \theta $ is not known exactly.
2. The initial knowledge about $ \theta $ is assumed to be contained in a known a priori density $ p(\theta) $.
3. The rest of the knowledge about $ \theta $ is contained in a set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ drawn independently according to the unknown probability density $ p(x) $.
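Accordingly, we already know:
$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $
and, by Bayes' theorem,
$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $
Now, since we want to express this equation in terms of the samples $ x_k $, by the independence assumption,
$ p(D|\theta) = \prod_{k=1}^n p(x_k|\theta) $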
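Hence, if the set $ \mathcal{D} $ contains n samples, we can denote it as $ \mathcal{D}^n = \{x_1, x_2, ... , x_n\} $.
Combining this definition with the independence assumption gives:
$ p(D^n|\theta) = p(x_n|\theta)p(D^{n-1}|\theta) $
Using this equation, we can write Bayesian parameter estimation in recursive form:
$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $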
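The idea of updating the posterior one sample at a time can be sketched on a discretized parameter grid. This is only an illustration under stated assumptions: a scalar $ \theta $, a Gaussian likelihood with known $ \sigma = 1 $, a $ N(0, 1) $ prior, and made-up sample values; the function names are hypothetical.

```python
import numpy as np

def gaussian_likelihood(x, theta, sigma=1.0):
    """p(x|theta) for a Gaussian with mean theta and known standard deviation sigma."""
    z = (x - theta) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

def recursive_posterior(samples, theta_grid, prior_on_grid, likelihood):
    """Fold in one sample at a time: the new posterior is proportional to
    p(x_n|theta) times the previous posterior, renormalized on the grid."""
    dtheta = theta_grid[1] - theta_grid[0]
    posterior = prior_on_grid / (np.sum(prior_on_grid) * dtheta)
    for x in samples:
        unnormalized = likelihood(x, theta_grid) * posterior
        posterior = unnormalized / (np.sum(unnormalized) * dtheta)
    return posterior

theta_grid = np.linspace(-10.0, 10.0, 4001)
prior_on_grid = gaussian_likelihood(theta_grid, 0.0, 1.0)  # assumed N(0, 1) prior
samples = [0.8, 1.3, 0.9, 1.6]                              # made-up data
posterior = recursive_posterior(samples, theta_grid, prior_on_grid, gaussian_likelihood)

dtheta = theta_grid[1] - theta_grid[0]
posterior_mean = np.sum(theta_grid * posterior) * dtheta
```

Because each step only needs the previous posterior and the newest sample, the samples can be processed in any order and the final density on the grid is the same up to rounding.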
Bayesian Parameter Estimation: Gaussian Case
The Univariate Case: $ p(\mu|\mathcal{D}) $
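Consider the case where $ \mu $ is the only unknown parameter. For simplicity we assume:
$ p(x|\mu) \sim N(\mu, \sigma^2) $
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $
From the previous section, the following expression can be obtained using Bayes' formula:
$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right]p(\mu) $
where $ \alpha $ is a normalization factor independent of $ \mu $.
Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with:
$ p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] $
$ p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $
The equation now becomes:
$ p(\mu|D) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}] $
Absorbing the constant factors into a new scaling factor $ \beta $,
$ p(\mu|D) = \beta exp [-\frac{1}{2}(\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2} + (\frac{\mu-\mu_0}{\sigma_0})^{2})] $
and collecting the terms that depend on $ \mu $ (with a further scaling factor $ \gamma $),
$ p(\mu|D) = \gamma exp [-\frac{1}{2}((\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu)] $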
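Furthermore, since the exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a Gaussian density:
$ p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}] $
Finally, by matching coefficients, the estimate $ \mu_n $ can be obtained:
$ \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $
where $ \bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k $ is the sample mean and $ n $ is the sample size.
The variance $ \sigma_n^2 $ of this Gaussian posterior, associated with $ \mu_n $, is:
$ \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} $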
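As a small sketch of how these quantities could be computed, the function below assumes the standard closed forms for a Gaussian prior with known $ \sigma $: $ \mu_n $ is a weighted average of the sample mean and $ \mu_0 $ with weights $ n\sigma_0^2/(n\sigma_0^2+\sigma^2) $ and $ \sigma^2/(n\sigma_0^2+\sigma^2) $, and $ \sigma_n^2 = \sigma_0^2\sigma^2/(n\sigma_0^2+\sigma^2) $. The function name and the sample values are hypothetical.

```python
def posterior_mean_params(samples, sigma, mu0, sigma0):
    """Return (mu_n, var_n) of the Gaussian posterior p(mu|D).

    Assumes p(x|mu) ~ N(mu, sigma^2) with known sigma and prior p(mu) ~ N(mu0, sigma0^2).
    """
    n = len(samples)
    xbar = sum(samples) / n                      # sample mean
    denom = n * sigma0**2 + sigma**2
    mu_n = (n * sigma0**2 / denom) * xbar + (sigma**2 / denom) * mu0
    var_n = (sigma0**2 * sigma**2) / denom       # posterior variance sigma_n^2
    return mu_n, var_n

# Made-up data: four samples, sigma = sigma0 = 1, prior mean 0.
mu_n, var_n = posterior_mean_params([0.8, 1.3, 0.9, 1.6], sigma=1.0, mu0=0.0, sigma0=1.0)
```

As $ n $ grows, the weight on the sample mean approaches 1 and `var_n` shrinks toward 0, so the prior matters less and less.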
The Univariate Case: $ p(x|\mathcal{D}) $
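Having obtained the posterior density $ p(\mu|\mathcal{D}) $ for the mean, with parameters $ \mu_n $ and $ \sigma_n^2 $, the remaining task is to estimate the "class-conditional" density $ p(x|D) $.
Based on Duda's text [2],
$ p(x|D) = \int p(x|\mu)p(\mu|D)d\mu = \frac{1}{2\pi\sigma\sigma_n}exp[-\frac{1}{2}\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}]f(\sigma, \sigma_n) $
where $ f(\sigma, \sigma_n) $ is defined as:
$ f(\sigma, \sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}(\mu - \frac{\sigma_n^2x+\sigma^2\mu_n}{\sigma^2+\sigma_n^2})^2]d\mu $
Since $ f(\sigma, \sigma_n) $ does not depend on $ x $, $ p(x|D) $ is normally distributed as:
$ p(x|D) \sim N(\mu_n, \sigma^2+\sigma_n^2) $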
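A short worked example may make the result concrete. It assumes the standard closed forms for $ \mu_n $ and $ \sigma_n^2 $ and the predictive density $ N(\mu_n, \sigma^2 + \sigma_n^2) $, with made-up values $ \sigma = \sigma_0 = 1 $, $ \mu_0 = 0 $, $ n = 4 $ and sample mean $ \bar{x}_n = 1.15 $:

```latex
\begin{align*}
\mu_n &= \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\bar{x}_n
        + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0
       = \frac{4}{5}(1.15) + \frac{1}{5}(0) = 0.92 \\
\sigma_n^2 &= \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2} = \frac{1}{5} = 0.2 \\
p(x|\mathcal{D}) &\sim N(\mu_n,\ \sigma^2+\sigma_n^2) = N(0.92,\ 1.2)
\end{align*}
```

The predictive variance 1.2 exceeds $ \sigma^2 = 1 $ because it also carries the remaining uncertainty about $ \mu $.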
\subsection{Experiment of Bayesian Parameter Estimation}
\paragraph{Design}
Assume n samples were obtained from the class $ \mathcal{D} $ of unknown mean $ \mu $ (known $ \sigma $). Assume,
$ p(x|\mu) \sim N(\mu, \sigma^2) $
$ p(\mu) \sim N(\mu_0, \sigma_0^2) $
Here $ \sigma = \sigma_0 $ is held constant and $ \mu_0 = 0 $ (the particular value assumed for $ \mu_0 $ turns out to matter little; this will be verified shortly). Based on the sample data $ x_i \in \mathcal{D}, i = 1,2,3,...,n $, we want to estimate $ \mu $.
The following results will be obtained:
\begin{enumerate}
\item The impact of $\mu_0$ on the estimate $\hat{\mu}$
\item The impact of the sample size $n$ on estimation accuracy
\end{enumerate}
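An experiment of this kind could be simulated roughly as follows. This is a sketch under stated assumptions, not the code used to produce the figures: the true mean, the range of $\mu_0$ values, the sample size $n = 20$, the seed and the function names are all hypothetical, and the estimate is taken to be the standard closed-form posterior mean for a Gaussian prior with known $\sigma$.

```python
import numpy as np

def posterior_mean(xbar, n, sigma, mu0, sigma0):
    """Closed-form posterior mean for a N(mu0, sigma0^2) prior and known sigma."""
    denom = n * sigma0**2 + sigma**2
    return (n * sigma0**2 / denom) * xbar + (sigma**2 / denom) * mu0

def run_experiment(true_mu, sigma, sigma0, mu0_values, n, trials, seed=0):
    """For each prior mean mu0, estimate mu on `trials` independent data sets of size n.

    Returns the average and the variance of the estimates for each mu0.
    The same simulated data sets are reused for every mu0.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(true_mu, sigma, size=(trials, n))
    xbars = data.mean(axis=1)                    # one sample mean per trial
    means, variances = [], []
    for mu0 in mu0_values:
        estimates = posterior_mean(xbars, n, sigma, mu0, sigma0)
        means.append(estimates.mean())
        variances.append(estimates.var())
    return np.array(means), np.array(variances)

mu0_values = np.linspace(-1.0, 1.0, 5)           # hypothetical range of prior means
means, variances = run_experiment(true_mu=0.0, sigma=1.0, sigma0=1.0,
                                  mu0_values=mu0_values, n=20, trials=50)
```

In this sketch the averaged estimate rises linearly with $\mu_0$, with slope $\sigma^2/(n\sigma_0^2+\sigma^2)$, while the variance of the estimates across trials does not depend on $\mu_0$ at all, since changing $\mu_0$ only adds a constant to every estimate.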
\paragraph{Results} \begin{center} \includegraphics[scale=1]{BPE_1.png}
Figure 21. The impact of $\mu_0$ on estimated $\hat{\mu}$ averaged over 50 samples
\includegraphics[scale=1]{BPE_2.png}
Figure 22. The impact of $\mu_0$ on the variance of estimated $\hat{\mu}$ over 50 samples
\end{center}
The estimated mean shifts upward as $\mu_0$ increases. \textbf{Based on the experiment, it can be concluded that the most accurate estimate is obtained when $ \mu_0 = \mu $. However, according to the plot, even if $\mu_0$ differs from $\mu$, the estimation error is still acceptable (in our case, within the range [-0.1, +0.06]).} The variance of the estimated mean, on the other hand, can be assumed to be the same as that of the \textbf{real empirical mean}.
\begin{center}
\includegraphics[scale=0.7]{e23456.png}
Figure 23. The impact of the sample size $n$ on the accuracy of the estimated distribution shape (sample sizes = 2,3,4,5,6)
\includegraphics[scale=0.6]{ece662_14.png}
Figure 24. The impact of the sample size $n$ on the accuracy of the estimated distribution shape (sample sizes = 4,10,20,50,100)
\end{center}
\paragraph{Conclusion} Figures 23 and 24 demonstrate that, with an insufficient sample size, the estimated density predicts the distribution of the points poorly.
References
[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.
[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.