[[Category:ECE662]]
<center><font size= 5>
[[ECE662_Bayesian_Parameter_Estimation_S14_SF|Bayesian Parameter Estimation: Gaussian Case]]
</font size>
A [https://www.projectrhea.org/learning/slectures.php slecture] by [[ECE]] student [[user:SFang | Shaobo Fang]]
Loosely based on the [[2014_Spring_ECE_662_Boutin|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]].
</center>
----
----
== '''Introduction: Bayesian Estimation''' ==

According to Section 3.3 of Duda et al. [2], the answers obtained by Bayesian parameter estimation (BPE) are generally nearly identical to those obtained by maximum likelihood estimation, but the conceptual difference is significant. In maximum likelihood estimation the parameter <math>\theta</math> is treated as a fixed but unknown quantity, while in Bayesian estimation <math>\theta</math> is treated as a random variable.

Given the set of training samples <math>\mathcal{D}</math>, Bayes' formula becomes:
<center>
<math>P(w_i|x,\mathcal{D}) = \frac{p(x|w_i,\mathcal{D})P(w_i|\mathcal{D})}{\sum_{j = 1}^c p(x|w_j,\mathcal{D})P(w_j|\mathcal{D})}</math>
</center>
As the above equation suggests, we can use the information provided by the training data to help determine both the class-conditional densities and the prior probabilities.

Furthermore, since we are treating the supervised case, we can separate the training samples by class into <math>c</math> subsets <math>\mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_c</math>, where the samples in <math>\mathcal{D}_j</math> carry no information about class <math>w_i</math> for <math>i \neq j</math>. Assuming in addition that the prior probabilities are known, so that <math>P(w_i|\mathcal{D}) = P(w_i)</math>, the formula simplifies to:
<center>
<math>P(w_i|x,\mathcal{D}) = \frac{p(x|w_i,\mathcal{D}_i)P(w_i)}{\sum_{j = 1}^c p(x|w_j,\mathcal{D}_j)P(w_j)}</math>
</center>
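The formula above can be sketched in a few lines of Python. This is only an illustration and not part of the lecture material: the helper names <code>gaussian_pdf</code> and <code>class_posteriors</code>, the two Gaussian class-conditional densities, and the equal priors are all assumptions made up for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def class_posteriors(x, class_densities, priors):
    """Return P(w_i | x, D) for every class i.

    class_densities[i] is a callable standing in for p(x | w_i, D_i);
    priors[i] stands in for P(w_i).
    """
    joint = [density(x) * prior for density, prior in zip(class_densities, priors)]
    evidence = sum(joint)  # denominator: sum over all c classes
    return [j / evidence for j in joint]


# Hypothetical two-class problem (made-up densities and priors).
densities = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),
    lambda x: gaussian_pdf(x, 2.0, 1.0),
]
priors = [0.5, 0.5]
posteriors = class_posteriors(1.0, densities, priors)
```

Because <math>x = 1</math> lies midway between the two assumed class means and the priors are equal, both posteriors come out equal in this toy setting.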
Now assume that <math>p(x)</math> has a known parametric form with parameter <math>\theta</math>, and that we are given a set of <math>n</math> independent samples <math>\mathcal{D} = \{x_1, x_2, ... , x_n \}</math>. Viewing <math>\theta</math> as a continuous random variable, <math>p(x|\mathcal{D})</math> can be computed as:
<center>
<math>p(x|\mathcal{D}) = \int p(x, \theta|\mathcal{D})d\theta = \int p(x|\theta)p(\theta|\mathcal{D})d\theta</math>
</center>
The second equality holds because, once <math>\theta</math> is known, <math>x</math> is independent of the samples, i.e. <math>p(x|\theta,\mathcal{D}) = p(x|\theta)</math>.
----
== '''Bayesian Parameter Estimation: General Theory''' ==

We first start with a generalized approach which can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density <math>p(x|\theta)</math> is assumed to be known, but the value of the parameter vector <math>\theta</math> is not known exactly.

2. The initial knowledge about <math>\theta</math> is assumed to be contained in a known a priori density <math>p(\theta)</math>.

3. The rest of the knowledge about <math>\theta</math> is contained in a set <math>\mathcal{D}</math> of n samples <math>x_1, x_2, ... , x_n</math> drawn independently according to the unknown probability density <math>p(x)</math>.
Accordingly, we already know that:
<center><math>p(x|\mathcal{D}) = \int p(x|\theta)p(\theta|\mathcal{D})d\theta</math></center>
and, by Bayes' theorem,
<center><math>p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)p(\theta)}{\int p(\mathcal{D}|\theta)p(\theta)d\theta}</math></center>

To express this in terms of the individual samples <math>x_k</math>, we use the independence assumption:
<center><math>p(\mathcal{D}|\theta) = \prod_{k = 1}^n p(x_k|\theta)</math></center>

To make the number of samples explicit, denote a set of <math>n</math> samples by <math>\mathcal{D}^n = \{x_1, x_2, ... , x_n\}</math>. Combining this notation with the equation above gives:
<center><math> p(\mathcal{D}^n|\theta) = p(\mathcal{D}^{n-1}|\theta)p(x_n|\theta) </math></center>

Using this equation, the posterior density can be computed recursively, one sample at a time, starting from <math>p(\theta|\mathcal{D}^0) = p(\theta)</math>:
<center><math>p(\theta|\mathcal{D}^n) = \frac{p(x_n|\theta)p(\theta|\mathcal{D}^{n-1})}{\int p(x_n|\theta)p(\theta|\mathcal{D}^{n-1})d\theta}</math></center>
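The recursive update can be illustrated numerically. The following Python sketch is an assumption-laden illustration rather than the lecture's method: it discretizes <math>\theta</math> on an equally spaced grid, approximates the integral in the denominator by a Riemann sum, and uses a made-up flat prior, a unit-variance Gaussian likelihood, and three made-up samples. The names <code>recursive_posterior</code> and <code>gaussian_likelihood</code> are hypothetical.

```python
import math


def recursive_posterior(samples, grid, prior, likelihood):
    """Update p(theta | D^n) from p(theta | D^(n-1)) one sample at a time.

    grid: equally spaced candidate values of theta
    prior: values of p(theta) on the grid
    likelihood: callable likelihood(x, theta) standing in for p(x | theta)
    """
    step = grid[1] - grid[0]
    posterior = list(prior)  # p(theta | D^0) = p(theta)
    for x in samples:
        unnormalized = [likelihood(x, t) * p for t, p in zip(grid, posterior)]
        # Riemann-sum approximation of the integral in the denominator.
        evidence = sum(unnormalized) * step
        posterior = [u / evidence for u in unnormalized]
    return posterior


def gaussian_likelihood(x, mu, sigma=1.0):
    """p(x | mu) for a Gaussian with known standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical setup: flat prior on [-5, 5] and three made-up samples.
step = 0.01
grid = [-5.0 + step * i for i in range(1001)]
flat_prior = [1.0 / 10.0] * len(grid)
samples = [0.8, 1.1, 1.4]

posterior = recursive_posterior(samples, grid, flat_prior, gaussian_likelihood)
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * step
```

With a flat prior the posterior mean should land close to the sample mean of the made-up data (1.1), which gives a quick sanity check on the recursion.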
----
== '''Bayesian Parameter Estimation: Gaussian Case''' ==
== ''The Univariate Case: <math>p(\mu|\mathcal{D})</math>'' ==

Consider the case where the mean <math>\mu</math> is the only unknown parameter and the variance <math>\sigma^2</math> is known. For simplicity we assume:
<center><math>p(x|\mu) \sim N(\mu, \sigma^2)</math></center>
and
<center><math>p(\mu) \sim N(\mu_0, \sigma_0^2)</math></center>
where <math>\mu_0</math> and <math>\sigma_0^2</math> are known.

From the previous section, the following expression follows directly from Bayes' formula:
<center><math>p(\mu|\mathcal{D}) = \alpha \left[\prod_{k = 1}^n p(x_k|\mu)\right]p(\mu)</math></center>
where <math>\alpha</math> is a normalization factor that depends on <math>\mathcal{D}</math> but is independent of <math>\mu</math>.
Now substitute the Gaussian forms of <math>p(x_k|\mu)</math> and <math>p(\mu)</math>:
<center><math>p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]</math></center>
<center><math>p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]</math></center>

The equation becomes:
<center><math>p(\mu|\mathcal{D}) = \alpha \left[\prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]\right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]</math></center>

Absorbing the factors that do not depend on <math>\mu</math> into updated scaling factors <math>\alpha'</math> and <math>\alpha''</math>, we obtain
<center><math>p(\mu|\mathcal{D}) = \alpha' \exp[-\frac{1}{2}(\sum_{k=1}^n(\frac{\mu-x_k}{\sigma})^{2} + (\frac{\mu-\mu_0}{\sigma_0})^{2})]</math></center>
<center><math>p(\mu|\mathcal{D}) = \alpha'' \exp [-\frac{1}{2}[(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu]]</math></center>

Since the exponent is a quadratic function of <math>\mu</math>, <math>p(\mu|\mathcal{D})</math> is again a Gaussian density, which we write as:
<center><math>p(\mu|\mathcal{D}) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}]</math></center>

Equating the coefficients of <math>\mu^2</math> and <math>\mu</math> in the two expressions gives:
<center><math>\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2}</math></center>
where <math>\bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k</math> is the sample mean and <math>n</math> is the sample size. Solving for <math>\mu_n</math>:
<center><math>\mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0</math></center>
The variance <math>\sigma_n^2</math> associated with <math>\mu_n</math> is obtained in the same way:
<center><math>\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}</math></center>
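The two closed-form expressions for <math>\mu_n</math> and <math>\sigma_n^2</math> translate directly into code. The Python sketch below is a minimal illustration; the function name <code>posterior_mean_params</code> and the sample values are made up for the example, and the known variance <math>\sigma^2</math> and the prior parameters <math>\mu_0</math>, <math>\sigma_0^2</math> are passed in as plain numbers.

```python
def posterior_mean_params(samples, mu0, sigma0_sq, sigma_sq):
    """Return (mu_n, sigma_n_sq) of the Gaussian posterior p(mu | D).

    samples: observed x_1, ..., x_n
    mu0, sigma0_sq: mean and variance of the Gaussian prior p(mu)
    sigma_sq: known variance of p(x | mu)
    """
    n = len(samples)
    x_bar = sum(samples) / n  # sample mean
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq / denom) * x_bar + (sigma_sq / denom) * mu0
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq


# Made-up example: three samples, prior N(0, 1), known variance 1.
mu_n, sigma_n_sq = posterior_mean_params([1.0, 2.0, 3.0], 0.0, 1.0, 1.0)
```

For these made-up values, <math>n = 3</math> and <math>\bar{x}_n = 2</math>, so the formulas give <math>\mu_n = 1.5</math> and <math>\sigma_n^2 = 0.25</math>: the posterior mean sits between the prior mean and the sample mean, weighted toward the data.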

Observation:

As <math>n \to \infty</math>, <center><math>\sigma_n \to 0</math></center>
so <math>p(\mu|\mathcal{D})</math> becomes more sharply peaked around <math>\mu_n</math>.
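
As a rough sketch of how fast this happens (assuming <math>\sigma_0 \neq 0</math> and that the sample mean stays bounded), the closed-form expressions for <math>\mu_n</math> and <math>\sigma_n^2</math> derived above can be rewritten as:
<center><math>\sigma_n^2 = \frac{\sigma^2}{n + \sigma^2/\sigma_0^2} \approx \frac{\sigma^2}{n}</math></center>
for large <math>n</math>, and
<center><math>\mu_n - \bar{x}_n = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}(\mu_0 - \bar{x}_n) \to 0</math></center>
In other words, the posterior variance shrinks roughly like <math>1/n</math>, and the influence of the prior mean <math>\mu_0</math> on <math>\mu_n</math> fades as more samples are observed.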
== ''The Univariate Case: <math>p(x|\mathcal{D})</math>'' ==

Having obtained the posterior density <math>p(\mu|\mathcal{D})</math> for the mean, the remaining task is to compute the "class-conditional" density <math>p(x|\mathcal{D})</math>.

Following Section 3.4.2 of Duda et al. [2] and Prof. Boutin's lecture notes [1]:

<center><math>p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu</math></center>
<center><math>p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu</math></center>

<center><math>p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n)</math></center>

where <math>f(\sigma, \sigma_n)</math> is defined as:

<center><math>f(\sigma,\sigma_n) = \int \exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu</math></center>

The integrand of <math>f</math> is an unnormalized Gaussian in <math>\mu</math>, so the value of the integral depends only on <math>\sigma</math> and <math>\sigma_n</math> and not on <math>x</math>. As a function of <math>x</math>, <math>p(x|\mathcal{D})</math> is therefore proportional to <math>\exp[-\frac{1}{2}\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}]</math>. Hence, <math>p(x|\mathcal{D})</math> is normally distributed as:

<center><math>p(x|\mathcal{D}) \sim N(\mu_n, \sigma^2 + \sigma_n^2)</math></center>
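
This result can be checked numerically. The Python sketch below is an illustration under stated assumptions: it compares the closed form <math>N(\mu_n, \sigma^2 + \sigma_n^2)</math> with a midpoint-rule approximation of <math>\int p(x|\mu)p(\mu|\mathcal{D})d\mu</math>. The function names, the integration range of ten posterior standard deviations around <math>\mu_n</math>, and the numerical values are all made up for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_closed_form(x, mu_n, sigma_n_sq, sigma_sq):
    """p(x | D) as the Gaussian N(mu_n, sigma^2 + sigma_n^2)."""
    return normal_pdf(x, mu_n, sigma_sq + sigma_n_sq)


def predictive_numeric(x, mu_n, sigma_n_sq, sigma_sq, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of p(x|mu) p(mu|D) over mu.

    The range mu_n +/- half_width posterior standard deviations is an
    assumption; the integrand is negligible outside it.
    """
    sd = math.sqrt(sigma_n_sq)
    lower = mu_n - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps):
        mu = lower + (i + 0.5) * h
        total += normal_pdf(x, mu, sigma_sq) * normal_pdf(mu, mu_n, sigma_n_sq)
    return total * h


# Made-up values: posterior N(1.5, 0.25), known variance 1, query point x = 2.
closed = predictive_closed_form(2.0, 1.5, 0.25, 1.0)
numeric = predictive_numeric(2.0, 1.5, 0.25, 1.0)
```

The two values should agree to many decimal places, which illustrates that the extra term <math>\sigma_n^2</math> in the variance accounts for the remaining uncertainty about <math>\mu</math>.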

----
----
== '''References''' ==

[1]. [https://engineering.purdue.edu/~mboutin/ Mireille Boutin], "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, D. Stork, ''Pattern Classification''. Wiley-Interscience. Second Edition, 2000.
Latest revision as of 07:31, 29 April 2014