Bayesian Parameter Estimation with examples
A slecture by ECE student Yu Wang
Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.
Introduction: Bayesian Estimation
Suppose that we have an observable random variable $ X $ for an experiment that takes values in a set $ S $. Suppose that the distribution of $ X $ depends on a parameter $ \theta $ taking values in a parameter space $ \Theta $. We will denote the probability density function of $ X $ for a given value of $ \theta $ by $ f(x \mid \theta) $ for $ x \in S $ and $ \theta \in \Theta $. Of course, our data variable $ X $ is almost always vector-valued, and the parameter $ \theta $ may be vector-valued as well.
In Bayesian analysis, named for the famous Thomas Bayes, we treat the parameter $ \theta $ as a random variable with a given probability density function $ h(\theta) $ for $ \theta \in \Theta $. The corresponding distribution is called the prior distribution of $ \theta $ and is intended to reflect our knowledge (if any) of the parameter before we gather data. After observing $ x \in S $, we then use Bayes' theorem to compute the conditional probability density function of $ \theta $ given $ X = x $.
First recall that the joint probability density function of $ (X, \theta) $ is the mapping on $ S \times \Theta $ given by

$ (x, \theta) \mapsto h(\theta) f(x \mid \theta) $

Next recall that the (marginal) probability density function $ f $ of $ X $ is given by

$ f(x) = \sum_{\theta \in \Theta} h(\theta) f(x \mid \theta), \quad x \in S $

if the parameter has a discrete distribution, or

$ f(x) = \int_\Theta h(\theta) f(x \mid \theta) \, d\theta, \quad x \in S $

if the parameter has a continuous distribution. Finally, the conditional probability density function of $ \theta $ given $ X = x $ is

$ h(\theta \mid x) = \frac{h(\theta) f(x \mid \theta)}{f(x)}, \quad \theta \in \Theta, \; x \in S $
The conditional distribution of $ \theta $ given $ X = x $ is called the posterior distribution, and is an updated distribution given the information in the data. Finally, if $ \theta $ is a real parameter, the conditional expected value $ \mathbb{E}(\theta \mid X) $ is the Bayes estimator of $ \theta $. Recall that $ \mathbb{E}(\theta \mid X) $ is a function of $ X $ and, among all functions of $ X $, is closest to $ \theta $ in the mean square sense. Thus, once we collect the data and observe $ X = x $, the estimate of $ \theta $ is $ \mathbb{E}(\theta \mid X = x) $.
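As a small worked illustration (a standard textbook example, not part of the original discussion): suppose $ X = (X_1, \ldots, X_n) $ records $ n $ independent Bernoulli trials with unknown success probability $ \theta $, and take the uniform prior $ h(\theta) = 1 $ on $ \Theta = [0, 1] $. If the observed data contain $ k $ successes, then

$ h(\theta \mid x) = \frac{h(\theta) f(x \mid \theta)}{f(x)} \propto \theta^k (1 - \theta)^{n - k}, \quad \theta \in [0, 1] $

which is the Beta$(k + 1,\, n - k + 1)$ density, and the Bayes estimator is its mean,

$ \mathbb{E}(\theta \mid X = x) = \frac{k + 1}{n + 2} $

For example, $ k = 7 $ successes in $ n = 10 $ trials give the estimate $ 8/12 \approx 0.67 $.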
Bayesian Parameter Estimation: General Theory
We start with a general approach that can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:
1. The form of the density $ p(x|\theta) $ is assumed to be known, but the value of the parameter vector $ \theta $ is not known exactly.
2. The initial knowledge about $ \theta $ is assumed to be contained in a known a priori density $ p(\theta) $.
3. The rest of the knowledge about $ \theta $ is contained in a set $ \mathcal{D} $ of n samples $ x_1, x_2, ... , x_n $ drawn independently according to the unknown probability density $ p(x) $.
Accordingly, we already know the desired class-conditional density in terms of the parameter,

$ p(x|\mathcal{D}) = \int p(x|\theta)\, p(\theta|\mathcal{D}) \, d\theta $

and, by Bayes' theorem,

$ p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int p(\mathcal{D}|\theta)\, p(\theta) \, d\theta} $
Now, since we want to express this in terms of the samples $ x_k $, the independence assumption gives

$ p(\mathcal{D}|\theta) = \prod_{k = 1}^{n} p(x_k|\theta) $
Hence, if the sample set $ \mathcal{D} $ contains n samples, we can write $ \mathcal{D}^n = \{x_1, x_2, \ldots, x_n\} $. Combining this notation with the equation above gives the recursion

$ p(\mathcal{D}^n|\theta) = p(x_n|\theta)\, p(\mathcal{D}^{n-1}|\theta) $

Using this recursion, the Bayesian parameter estimate can be updated one sample at a time:

$ p(\theta|\mathcal{D}^n) = \frac{p(x_n|\theta)\, p(\theta|\mathcal{D}^{n-1})}{\int p(x_n|\theta)\, p(\theta|\mathcal{D}^{n-1}) \, d\theta} $
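To make the recursion concrete, here is a minimal numerical sketch in Python (the Bernoulli likelihood, the grid, and the sample values are hypothetical choices for illustration, not part of the theory above); it discretizes the parameter space so the integral in the denominator becomes a sum:

import numpy as np

# Discretized stand-in for the parameter space Theta
theta = np.linspace(0.01, 0.99, 99)
posterior = np.ones_like(theta) / theta.size          # start from a uniform prior p(theta)

samples = [1, 0, 1, 1, 0, 1, 1]                       # hypothetical observations x_1, ..., x_n

for x_k in samples:
    likelihood = theta if x_k == 1 else 1.0 - theta   # p(x_k | theta) for a Bernoulli model
    posterior = likelihood * posterior                # p(x_k | theta) p(theta | D^{k-1})
    posterior /= posterior.sum()                      # normalization (the denominator integral)

bayes_estimate = np.sum(theta * posterior)            # E(theta | D), the Bayes estimate
print(bayes_estimate)

After all seven updates the posterior is the same (up to discretization) as the batch posterior built from the full product $ \prod_k p(x_k|\theta) $, which is the point of the recursive form.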
Bayesian Parameter Estimation: Gaussian Case
The Univariate Case: $ p(\mu|\mathcal{D}) $
Consider the case where $ \mu $ is the only unknown parameter. For simplicity we assume Gaussian forms for both the likelihood and the prior:

$ p(x|\mu) \sim N(\mu, \sigma^2), \qquad p(\mu) \sim N(\mu_0, \sigma_0^2) $

where $ \sigma^2 $, $ \mu_0 $ and $ \sigma_0^2 $ are all known.
From the previous section, the following expression is obtained directly using Bayes' formula:

$ p(\mu|\mathcal{D}) = \alpha \prod_{k = 1}^{n} p(x_k|\mu)\, p(\mu) $

where $ \alpha $ is a normalization factor that does not depend on $ \mu $.
Now, substitute $ p(x_k|\mu) $ and $ p(\mu) $ with their Gaussian densities:

$ p(x_k|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right], \qquad p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right] $

The equation then becomes

$ p(\mu|\mathcal{D}) = \alpha' \exp\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu - x_k}{\sigma}\right)^2 + \left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right)\right] = \alpha'' \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right] $

where the updated scaling factors $ \alpha' $ and $ \alpha'' $ absorb all terms that do not depend on $ \mu $.
Since this exponent is quadratic in $ \mu $, the posterior $ p(\mu|\mathcal{D}) $ is again a normal density. Writing it in the standard Gaussian form

$ p(\mu|\mathcal{D}) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] $

and equating the coefficients of $ \mu^2 $ and $ \mu $ gives

$ \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\bar{x}_n + \frac{\mu_0}{\sigma_0^2} $
Finally, solving for $ \mu_n $ yields

$ \mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 $

where $ \bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k $ is the sample mean and $ n $ is the sample size.
The variance $ \sigma_n^2 $ associated with $ \mu_n $ follows correspondingly as

$ \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2} $
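As a quick plug-in check (the numbers are illustrative only, not from the lecture): take $ \mu_0 = 0 $, $ \sigma_0^2 = 1 $, $ \sigma^2 = 1 $, and $ n = 4 $ observations with sample mean $ \bar{x}_4 = 2 $. Then

$ \mu_4 = \frac{4 \cdot 1}{4 \cdot 1 + 1} \cdot 2 + \frac{1}{4 \cdot 1 + 1} \cdot 0 = 1.6, \qquad \sigma_4^2 = \frac{1 \cdot 1}{4 \cdot 1 + 1} = 0.2 $

so only four observations already pull the posterior mean most of the way from the prior mean $ \mu_0 = 0 $ toward the sample mean $ \bar{x}_4 = 2 $.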
Observation:
As $ n $ increases, $ \sigma_n^2 $ decreases, so $ p(\mu|\mathcal{D}) $ becomes more and more sharply peaked around $ \mu_n $.
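A short Python check of this observation (all numbers here are made up for illustration; the update formulas are the ones derived above):

import numpy as np

# Simulated data: true mean 2.0, known sigma^2 = 1, prior N(mu_0 = 0, sigma_0^2 = 1)
rng = np.random.default_rng(0)
mu_0, sigma_0_sq, sigma_sq = 0.0, 1.0, 1.0
x = rng.normal(2.0, np.sqrt(sigma_sq), size=1000)

for n in (1, 10, 100, 1000):
    x_bar = x[:n].mean()
    mu_n = (n * sigma_0_sq * x_bar + sigma_sq * mu_0) / (n * sigma_0_sq + sigma_sq)
    sigma_n_sq = sigma_0_sq * sigma_sq / (n * sigma_0_sq + sigma_sq)
    print(n, round(mu_n, 3), round(sigma_n_sq, 5))

The printed $ \sigma_n^2 $ values shrink roughly like $ \sigma^2 / n $, so the posterior indeed concentrates around $ \mu_n $.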
The Univariate Case: $ p(x|\mathcal{D}) $
Having obtained the posterior density $ p(\mu|\mathcal{D}) $ for the mean, the remaining task is to estimate the "class-conditional" density $ p(x|\mathcal{D}) $.
Based on Duda's Chapter 3.4.2 [2] and Prof. Boutin's notes [1]:

$ p(x|\mathcal{D}) = \int p(x|\mu)\, p(\mu|\mathcal{D}) \, d\mu = \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n) $

where $ f(\sigma, \sigma_n) $ is defined as

$ f(\sigma, \sigma_n) = \int \exp\left[-\frac{1}{2}\,\frac{\sigma^2 + \sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2 + \sigma_n^2}\right)^2\right] d\mu $
Since $ f(\sigma, \sigma_n) $ does not depend on $ x $, $ p(x|\mathcal{D}) $ is normally distributed as

$ p(x|\mathcal{D}) \sim N(\mu_n, \sigma^2 + \sigma_n^2) $
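As a final sanity check, the closed form can be compared against a brute-force evaluation of the integral $ \int p(x|\mu)\, p(\mu|\mathcal{D})\, d\mu $ in Python (the specific values of $ \mu_n $, $ \sigma^2 $ and $ \sigma_n^2 $ below are the illustrative numbers used earlier, not anything prescribed by the theory):

import numpy as np

mu_n, sigma_sq, sigma_n_sq = 1.6, 1.0, 0.2     # illustrative values from the worked numbers above

def normal_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x = 2.5                                        # an arbitrary test point
mu_grid = np.linspace(-10.0, 10.0, 200001)     # fine grid for the integral over mu
dmu = mu_grid[1] - mu_grid[0]
integral = np.sum(normal_pdf(x, mu_grid, sigma_sq) * normal_pdf(mu_grid, mu_n, sigma_n_sq)) * dmu
closed_form = normal_pdf(x, mu_n, sigma_sq + sigma_n_sq)
print(integral, closed_form)                   # the two values agree closely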
References
[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.
[2]. R. Duda, P. Hart, Pattern Classification. Wiley-Interscience. Second Edition, 2000.
Questions and comments
If you have any questions, comments, etc. please post them on this page.