WARNING: THIS MATERIAL WAS PLAGIARIZED FROM DUDA AND HART!!!!!
Bayesian Parameter Estimation: Gaussian Case
A slecture by ECE student Shaobo Fang
Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Introduction: Bayesian Estimation
Although the estimators obtained from Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE) are similar or even identical in many cases, the underlying ideas behind MLE and BPE are quite different. In Maximum Likelihood Estimation, the parameter to be estimated is treated as a fixed but unknown quantity (or several quantities if there is more than one parameter), while in BPE the parameter is treated as a random variable (vector) with a known prior distribution.
To start with, Bayes' formula can be rewritten in the following form, conditioned on the set of training samples $ \mathcal{D} $:
$ P(w_i|x,D) = \frac{p(x|w_i,D)P(w_i|D)}{\sum_{j = 1}^c p(x|w_j,D)P(w_j|D)} $
From the equation above it can be seen that both the class-conditional densities and the priors can be estimated from the training data.
Now, assume that we are working on a supervised case with labelled training data, so that all training samples can be separated accurately into $ c $ subsets $ \mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c $.
Hence, the above equation can be rewritten in the following form:
$ P(w_i|x,D) = \frac{p(x|w_i,D_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,D_j)P(w_j)} $
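As a concrete illustration of this decision rule, the minimal Python sketch below evaluates the class posteriors for a one-dimensional, two-class example; the Gaussian class-conditional estimates and the prior values are assumptions chosen only for illustration, not values from the lecture.

import numpy as np
from scipy.stats import norm

# Class posteriors P(w_i | x, D) computed from per-class estimates.
# The class-conditional parameters and the priors below are illustrative assumptions.
class_means = [0.0, 3.0]     # means estimated from D_1 and D_2
class_stds  = [1.0, 1.5]     # standard deviations estimated from D_1 and D_2
priors      = [0.4, 0.6]     # P(w_1), P(w_2)

def class_posteriors(x):
    """Evaluate P(w_i | x, D) for every class at a single point x."""
    likelihoods = np.array([norm.pdf(x, m, s) for m, s in zip(class_means, class_stds)])
    joint = likelihoods * np.array(priors)   # p(x | w_i, D_i) * P(w_i)
    return joint / joint.sum()               # normalize by the sum over classes

print(class_posteriors(1.0))                 # the two posteriors sum to 1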
Now, assume that a set of $ N $ independent samples $ \mathcal{D} = \{x_1, x_2, ... , x_N \} $ has been drawn from a certain class, and that each sample follows a density of known parametric form $ p(x|\theta) $. To form the BPE estimate, we treat $ \theta $ as a random variable (vector) and integrate it out, as shown below.
$ p(x|D) $ can be computed as:
$ p(x|D) = \int p(x, \theta|D)d\theta = \int p(x|\theta)p(\theta|D)d\theta $
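When $ \theta $ is one-dimensional, this integral can be approximated numerically. The sketch below assumes a Gaussian form for $ p(x|\theta) $ and a stand-in posterior $ p(\theta|D) $, both chosen purely for illustration, and integrates over a grid of $ \theta $ values.

import numpy as np
from scipy.stats import norm

# Numerical approximation of p(x|D) = ∫ p(x|θ) p(θ|D) dθ on a grid of θ values.
# The Gaussian likelihood and the stand-in posterior p(θ|D) are illustrative assumptions.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
posterior = norm.pdf(theta, loc=1.0, scale=0.5)   # stand-in for p(θ|D)
posterior /= posterior.sum() * dtheta             # normalize on the grid

def predictive_density(x, sigma=1.0):
    """Approximate p(x|D) by summing p(x|θ) p(θ|D) over the θ grid."""
    likelihood = norm.pdf(x, loc=theta, scale=sigma)   # p(x|θ) for every grid value of θ
    return (likelihood * posterior).sum() * dtheta

print(predictive_density(0.0))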
Bayesian Parameter Estimation: General Theory
To provide a better understanding of the Bayesian Parameter Estimation (BPE) technique, we first briefly discuss the general theory. In the BPE method, $ \theta $ is treated as a random variable (vector) and is therefore unknown. Although $ \theta $ itself is unknown, we assume that it has a known prior distribution $ p(\theta) $. Hence, to estimate the parameter $ \theta $, both the information contained in the prior and the information contained in the set $ \mathcal{D} $ of $ n $ samples $ x_1, x_2, ... , x_n $ must be used. Since the training data are labelled, the density of a sample $ x $ given the parameter $ \theta $, denoted $ p(x|\theta) $, is assumed to have a known parametric form.
From the previous section we have already obtained:

$ p(x|D) = \int p(x|\theta)p(\theta|D)d\theta $
Furthermore, by Bayes' Theorem (after some rearrangement),

$ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta} $
Now, since the samples $ x_k $ are assumed to be drawn independently, the likelihood of the whole set factorizes as:

$ p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta) $
Hence, if the set $ \mathcal{D} $ contains $ n $ samples, we can denote it as $ \mathcal{D}^n = \{x_1, x_2, ..., x_n\} $.
Combining this notation with the equation above gives the recursion:

$ p(D^n|\theta) = p(x_n|\theta)p(D^{n-1}|\theta) $
Using this relation, the Bayesian parameter estimate can be updated recursively as each new sample arrives:

$ p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta} $
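This recursion can be illustrated with a short numerical sketch that keeps $ p(\theta|D^n) $ on a discrete grid and updates it one sample at a time; the Gaussian likelihood, the prior, and the sample values below are assumptions made only for the example.

import numpy as np
from scipy.stats import norm

# Recursive update  p(θ|D^n) ∝ p(x_n|θ) p(θ|D^{n-1}),
# with the posterior represented on a discrete grid of θ values.
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
posterior = norm.pdf(theta, loc=0.0, scale=2.0)    # prior p(θ) = p(θ|D^0), an assumption
posterior /= posterior.sum() * dtheta

sigma = 1.0                                        # known spread of p(x|θ), an assumption
samples = [1.2, 0.7, 1.5, 0.9]                     # x_1, ..., x_n, illustrative values

for x_n in samples:
    likelihood = norm.pdf(x_n, loc=theta, scale=sigma)   # p(x_n|θ) on the grid
    posterior = likelihood * posterior                   # numerator of the update
    posterior /= posterior.sum() * dtheta                # divide by ∫ p(x_n|θ) p(θ|D^{n-1}) dθ

print("posterior mean after", len(samples), "samples:", (theta * posterior).sum() * dtheta)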
Bayesian Parameter Estimation: Gaussian Case
The Univariate Case: $ p(\mu|\mathcal{D}) $
Consider the case where $ \mu $ is the only unknown parameter, with the variance $ \sigma^2 $ known. For simplicity we assume:

$ p(x|\mu) \sim N(\mu, \sigma^2) $

$ p(\mu) \sim N(\mu_0, \sigma_0^2) $
From the previous section, the following expression can be obtained using Bayes' formula:

$ p(\mu|D) = \alpha \prod_{k = 1}^n p(x_k|\mu)p(\mu) $
where $ \alpha $ is a scaling factor that depends on $ \mathcal{D} $ but is independent of $ \mu $.
Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with their Gaussian forms:

$ p(x_k|\mu) = \frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{1}{2}(\frac{x_k - \mu}{\sigma})^2] $

$ p(\mu) = \frac{1}{\sqrt{2\pi}\sigma_0}\exp[-\frac{1}{2}(\frac{\mu - \mu_0}{\sigma_0})^2] $
The expression for $ p(\mu|D) $ has now become:

$ p(\mu|D) = \alpha \prod_{k = 1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{1}{2}(\frac{x_k - \mu}{\sigma})^2] \frac{1}{\sqrt{2\pi}\sigma_0}\exp[-\frac{1}{2}(\frac{\mu - \mu_0}{\sigma_0})^2] $
Absorbing the factors that do not depend on $ \mu $ into new scaling factors $ \alpha' $ and $ \alpha'' $ correspondingly,

$ p(\mu|D) = \alpha' \exp[-\frac{1}{2}(\sum_{k = 1}^n (\frac{\mu - x_k}{\sigma})^2 + (\frac{\mu - \mu_0}{\sigma_0})^2)] = \alpha'' \exp[-\frac{1}{2}((\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 - 2(\frac{1}{\sigma^2}\sum_{k = 1}^n x_k + \frac{\mu_0}{\sigma_0^2})\mu)] $
Since this exponent is a quadratic function of $ \mu $, $ p(\mu|D) $ is again a normal density, which we write as:

$ p(\mu|D) = \frac{1}{\sqrt{2\pi}\sigma_n}\exp[-\frac{1}{2}(\frac{\mu - \mu_n}{\sigma_n})^2] $
Finally, equating coefficients, the estimate $ \mu_n $ can be obtained:

$ \mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 $
where $ \bar{x}_n = \frac{1}{n}\sum_{k = 1}^n x_k $ is the sample mean and $ n $ is the sample size.
The variance $ \sigma_n^2 $ associated with $ \mu_n $ is obtained correspondingly as:

$ \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} $
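The two closed-form expressions above translate directly into a few lines of Python; the prior parameters, the known variance, and the data in this sketch are illustrative assumptions.

import numpy as np

# Closed-form posterior parameters for the univariate Gaussian case with known variance.
mu_0, sigma_0 = 0.0, 2.0      # prior p(mu) ~ N(mu_0, sigma_0^2), illustrative assumption
sigma = 1.0                   # known standard deviation of p(x|mu), illustrative assumption

x = np.array([1.2, 0.7, 1.5, 0.9, 1.1])   # illustrative training samples
n, x_bar = len(x), x.mean()

mu_n = (n * sigma_0**2 / (n * sigma_0**2 + sigma**2)) * x_bar \
     + (sigma**2 / (n * sigma_0**2 + sigma**2)) * mu_0
sigma_n_sq = (sigma_0**2 * sigma**2) / (n * sigma_0**2 + sigma**2)

print("posterior p(mu|D) ~ N(%.4f, %.4f)" % (mu_n, sigma_n_sq))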
Observation:
As the number of training samples $ n $ increases, $ \sigma_n^2 $ decreases, so $ p(\mu|D) $ becomes more and more sharply peaked around $ \mu_n $.
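A quick numerical check of this observation, using the same illustrative prior and known variance as in the sketch above, shows $ \sigma_n^2 $ shrinking roughly like $ 1/n $:

# Illustrative check: sigma_n^2 decreases as n grows, so p(mu|D) concentrates around mu_n.
sigma_0, sigma = 2.0, 1.0     # same illustrative assumptions as above
for n in [1, 10, 100, 1000]:
    sigma_n_sq = (sigma_0**2 * sigma**2) / (n * sigma_0**2 + sigma**2)
    print("n = %4d   sigma_n^2 = %.5f" % (n, sigma_n_sq))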
The Univariate Case: $ p(x|\mathcal{D}) $
Having obtained the posterior density $ p(\mu|D) $ for the mean, the remaining task is to compute the "class-conditional" density $ p(x|D) $.
Based on Duda's chapter 3.4.2 and Prof. Mimi's notes:

$ p(x|D) = \int p(x|\mu)p(\mu|D)d\mu = \frac{1}{2\pi\sigma\sigma_n}\exp[-\frac{1}{2}\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma, \sigma_n) $
where $ f(\sigma, \sigma_n) $ is defined as:

$ f(\sigma, \sigma_n) = \int \exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2\sigma_n^2}(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2 + \sigma_n^2})^2]d\mu $
Since $ f(\sigma, \sigma_n) $ does not depend on $ x $, $ p(x|D) $ is normally distributed as:

$ p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
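As a sanity check, the closed-form density $ N(\mu_n, \sigma^2 + \sigma_n^2) $ can be compared against a direct numerical evaluation of $ \int p(x|\mu)p(\mu|D)d\mu $; the parameter values in the sketch below are illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Compare the closed-form predictive density N(mu_n, sigma^2 + sigma_n^2)
# with a numerical evaluation of the integral over mu.
mu_n, sigma_n_sq, sigma = 1.05, 0.19, 1.0   # illustrative posterior parameters and known variance

x = 0.5
closed_form = norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma**2 + sigma_n_sq))

mu_grid = np.linspace(mu_n - 10, mu_n + 10, 4001)
dmu = mu_grid[1] - mu_grid[0]
integrand = norm.pdf(x, loc=mu_grid, scale=sigma) * norm.pdf(mu_grid, loc=mu_n, scale=np.sqrt(sigma_n_sq))
numerical = integrand.sum() * dmu

print(closed_form, numerical)   # the two values should agree closely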
References
[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.
[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.
Questions and comments
If you have any questions, comments, etc. please post them on this page.