Difference between revisions of "Bayes Parameter Estimation with examples" - Rhea

Revision as of 09:16, 1 May 2014

Bayesian Parameter Estimation with examples

Loosely based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.

Introduction: Bayesian Estimation

Suppose that we have an observable random variable $\bs X$ for an experiment that takes values in a set $S$. Suppose that the distribution of $\bs X$ depends on a parameter $\theta$ taking values in a parameter space $\Theta$. We will denote the probability density function of $\bs X$ for a given value of $\theta$ by $f(\bs x \mid \theta)$ for $\bs x \in S$ and $\theta \in \Theta$. The data variable $\bs X$ is almost always vector-valued, and the parameter $\theta$ may also be vector-valued.

In Bayesian analysis, named for Thomas Bayes, we treat the parameter $\theta$ as a random variable with a given probability density function $h(\theta)$ for $\theta \in \Theta$. The corresponding distribution is called the prior distribution of $\theta$ and is intended to reflect our knowledge (if any) of the parameter before we gather data. After observing $\bs x \in S$, we use Bayes' theorem to compute the conditional probability density function of $\theta$ given $\bs X=\bs x$.

First recall that the joint probability density function of $(\bs X,\theta)$ is the mapping on $S \times \Theta$ given by
\[ (\bs{x}, \theta) \mapsto h(\theta) f(\bs{x} \mid \theta) \]
Next recall that the (marginal) probability density function $f$ of $\bs X$ is given by
\[ f(\bs{x}) = \sum_{\theta \in \Theta} h(\theta) f(\bs{x} \mid \theta), \quad \bs{x} \in S \]
if the parameter has a discrete distribution, or
\[ f(\bs{x}) = \int_\Theta h(\theta) f(\bs{x} \mid \theta) \, d\theta, \quad \bs{x} \in S \]
if the parameter has a continuous distribution. Finally, the conditional probability density function of $\theta$ given $\bs X= \bs x$ is
\[ h(\theta \mid \bs{x}) = \frac{h(\theta) f(\bs{x} \mid \theta)}{f(\bs{x})}; \quad \theta \in \Theta, \; \bs{x} \in S \]
The conditional distribution of $\theta$ given $\bs X=\bs x$ is called the posterior distribution, and is an updated distribution, given the information in the data.

If $\theta$ is a real parameter, the conditional expected value $\mathbb{E}(\theta \mid \bs X)$ is the Bayes estimator of $\theta$. Recall that $\mathbb{E}(\theta \mid \bs X)$ is a function of $\bs X$ and, among all functions of $\bs X$, is closest to $\theta$ in the mean square sense. Thus, once we collect the data and observe $\bs X=\bs x$, the estimate of $\theta$ is $\mathbb{E}(\theta \mid \bs X = \bs x)$.
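As a small illustration of the discrete case, the sketch below computes the posterior $h(\theta \mid \bs x)$ and the Bayes estimate $\mathbb{E}(\theta \mid \bs X = \bs x)$ for a coin whose bias $\theta$ is restricted to three candidate values. The function names, the uniform prior, and the data are hypothetical choices made only for this example.

```python
def posterior(prior, likelihood, x):
    """Return h(theta | x) for a discrete parameter space.

    prior:      dict mapping theta -> h(theta)
    likelihood: function (x, theta) -> f(x | theta)
    """
    # Joint density h(theta) f(x | theta) evaluated at the observed x.
    joint = {theta: h * likelihood(x, theta) for theta, h in prior.items()}
    # Marginal f(x) = sum over theta of h(theta) f(x | theta).
    marginal = sum(joint.values())
    return {theta: j / marginal for theta, j in joint.items()}


def bernoulli_likelihood(x, theta):
    """f(x | theta) for a sequence x of independent 0/1 outcomes."""
    heads = sum(x)
    tails = len(x) - heads
    return theta ** heads * (1 - theta) ** tails


# Hypothetical setup: three candidate biases with a uniform prior.
prior = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}
x = [1, 1, 0, 1]  # assumed observations: three heads, one tail

post = posterior(prior, bernoulli_likelihood, x)

# Bayes estimate: the posterior mean E(theta | X = x).
bayes_estimate = sum(theta * p for theta, p in post.items())
print(post, bayes_estimate)
```

With mostly heads in the data, the posterior shifts weight toward the larger candidate bias, and the Bayes estimate moves above the prior mean of 0.5.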

Bayesian Parameter Estimation: General Theory

We start with a general approach that can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are as follows:

1. The form of the density $p(x|\theta)$ is assumed to be known, but the value of the parameter vector $\theta$ is not known exactly.

2. The initial knowledge about $\theta$ is assumed to be contained in a known a priori density $p(\theta)$ .

3. The rest of the knowledge about $\theta$ is contained in a set $\mathcal{D}$ of $n$ samples $x_1, x_2, \ldots, x_n$ drawn independently according to the unknown probability density $p(x)$.

Accordingly, we already know:

p(x|D) = \int p(x|\theta)p(\theta|D)d\theta

and, by Bayes' theorem,

p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int p(D|\theta)p(\theta)d\theta}

Now, since we want to express this equation in terms of the samples $x_k$, we use the independence assumption:

p(D|\theta) = \prod_{k = 1}^n p(x_k|\theta)

Hence, if the set $\mathcal{D}$ contains $n$ samples, we denote it as $\mathcal{D}^n = \{x_1, x_2, \ldots, x_n\}$.

Combining this notation with the equation above:

p(D^n|\theta) = p(D^{n-1}|\theta)p(x_n|\theta)

Using this equation, Bayesian parameter estimation can be written in recursive form:

p(\theta|D^n) = \frac{p(x_n|\theta)p(\theta|D^{n-1})}{\int p(x_n|\theta)p(\theta|D^{n-1})d\theta}
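The recursion above can be checked numerically. The following is a minimal sketch, assuming a scalar $\theta$ discretized on a grid, a Gaussian likelihood with known standard deviation, and a Gaussian prior; the grid, the sample values, and all function names are assumptions made for illustration. It updates the posterior one sample at a time and compares the result with the batch posterior computed from $p(D|\theta)p(\theta)$.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2 * math.pi) * std)


def normalize(weights, step):
    """Scale weights so that their Riemann sum over the grid equals 1."""
    total = sum(weights) * step
    return [w / total for w in weights]


def recursive_update(posterior_prev, grid, step, x_n, sigma):
    """p(theta | D^n) from p(theta | D^{n-1}) and the new sample x_n."""
    unnormalized = [gaussian_pdf(x_n, th, sigma) * p
                    for th, p in zip(grid, posterior_prev)]
    return normalize(unnormalized, step)


def batch_posterior(prior, grid, step, samples, sigma):
    """p(theta | D) computed at once from the full likelihood product."""
    unnormalized = [p * math.prod(gaussian_pdf(x, th, sigma) for x in samples)
                    for th, p in zip(grid, prior)]
    return normalize(unnormalized, step)


# Assumed setup: theta on [-5, 5], prior N(0, 2^2), known sigma = 1.
step = 0.01
grid = [-5 + i * step for i in range(1001)]
prior = normalize([gaussian_pdf(th, 0.0, 2.0) for th in grid], step)
samples = [1.2, 0.7, 1.9, 1.1]
sigma = 1.0

posterior = prior
for x_n in samples:
    posterior = recursive_update(posterior, grid, step, x_n, sigma)

batch = batch_posterior(prior, grid, step, samples, sigma)
max_gap = max(abs(a - b) for a, b in zip(posterior, batch))
posterior_mean = sum(th * p for th, p in zip(grid, posterior)) * step
print(max_gap, posterior_mean)
```

The two posteriors agree up to floating-point error, which reflects the fact that the recursion only regroups the factors of the likelihood product.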

Bayesian Parameter Estimation: Gaussian Case

The Univariate Case: $p(\mu|\mathcal{D})$

Consider the case where $\mu$ is the only unknown parameter. For simplicity we assume:

p(x|\mu) \sim N(\mu, \sigma^2)

and

p(\mu) \sim N(\mu_0, \sigma_0^2)

From the previous section, the following expression follows from Bayes' formula:

p(\mu|D) = \alpha \left[ \prod_{k = 1}^n p(x_k|\mu) \right] p(\mu)

where $\alpha$ is a normalization factor that depends on $\mathcal{D}$ but is independent of $\mu$.

Now, substitute $p(x_k|\mu)$ and $p(\mu)$ with:

p(x_k|\mu) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}]

p(\mu) = \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]

The equation has now become:

p(\mu|D) = \alpha \left[ \prod_{k = 1}^n \frac{1}{(2\pi\sigma^2)^{1/2}}\exp[-\frac{1}{2}(\frac{x_k-\mu}{\sigma})^{2}] \right] \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2}]

p(\mu|D) = \alpha \frac{1}{(2\pi\sigma^2)^{n/2}} \frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2} -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}]

Absorbing the factors that do not depend on $\mu$ into updated scaling factors $\alpha'$ and $\alpha''$,

p(\mu|D) = \alpha' \exp[-\frac{1}{2}(\frac{\mu-\mu_0}{\sigma_0})^{2} -\frac{1}{2}\sum_{k=1}^n(\frac{x_k-\mu}{\sigma})^{2}]

p(\mu|D) = \alpha'' \exp[-\frac{1}{2}[(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 -2(\frac{1}{\sigma^2}\sum_{k=1}^nx_k + \frac{\mu_0}{\sigma_0^2})\mu]]

Since the exponent is a quadratic function of $\mu$, $p(\mu|D)$ is again a Gaussian density, which we write as:

p(\mu|D) = \frac{1}{(2\pi\sigma_n^2)^{1/2}}\exp[-\frac{1}{2}(\frac{\mu-\mu_n}{\sigma_n})^{2}]
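To make the step from this Gaussian form to $\mu_n$ and $\sigma_n^2$ explicit, one way to proceed is to expand its exponent and match the coefficients of $\mu^2$ and $\mu$ against the exponent of the likelihood times the prior, $-\frac{1}{2}(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2})\mu^2 + (\frac{1}{\sigma^2}\sum_k x_k + \frac{\mu_0}{\sigma_0^2})\mu$. The following is a sketch of that calculation; terms that do not depend on $\mu$ are absorbed into the normalization.

```latex
% Exponent of the Gaussian form N(\mu_n, \sigma_n^2), up to terms constant in \mu:
-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2
  = -\frac{1}{2}\left[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu\right] + \text{const}

% Matching the coefficients of \mu^2 and \mu:
\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2},
\qquad
\frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\bar{x}_n + \frac{\mu_0}{\sigma_0^2},
\qquad
\bar{x}_n = \frac{1}{n}\sum_{k=1}^n x_k
```

Solving these two relations for $\mu_n$ and $\sigma_n^2$ yields the closed-form expressions given next.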

Finally, the posterior mean $\mu_n$ can be obtained:

\mu_n = (\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2})\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0

where $\bar{x}_n$ is the sample mean and $n$ is the sample size.

The variance $\sigma_n^2$ of this Gaussian posterior is obtained correspondingly as:

\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}
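The two closed-form expressions translate directly into code. The sketch below is one possible implementation; the function name `posterior_params` and the numerical values are hypothetical and chosen only to illustrate the formulas.

```python
def posterior_params(samples, sigma, mu_0, sigma_0):
    """Return (mu_n, sigma_n^2) for a Gaussian likelihood with known sigma
    and a Gaussian prior N(mu_0, sigma_0^2) on the mean."""
    n = len(samples)
    x_bar = sum(samples) / n  # sample mean
    denom = n * sigma_0 ** 2 + sigma ** 2
    # Weighted combination of the sample mean and the prior mean.
    mu_n = (n * sigma_0 ** 2 / denom) * x_bar + (sigma ** 2 / denom) * mu_0
    var_n = sigma_0 ** 2 * sigma ** 2 / denom
    return mu_n, var_n


# Assumed example values.
samples = [1.2, 0.7, 1.9, 1.1]
mu_n, var_n = posterior_params(samples, sigma=1.0, mu_0=0.0, sigma_0=2.0)
print(mu_n, var_n)
```

The posterior mean is a weighted average of $\bar{x}_n$ and $\mu_0$, so it always lies between the two; the weights depend on $n$ and on the ratio of $\sigma^2$ to $\sigma_0^2$.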

Observation:

As $n \to \infty$, $\sigma_n \to 0$, so $p(\mu|D)$ becomes more sharply peaked around $\mu_n$.
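This behavior can be seen by evaluating the formulas for increasing sample sizes. The sketch below assumes $\sigma = 1$ and $\sigma_0 = 2$, which are arbitrary illustrative values, and tabulates the posterior standard deviation $\sigma_n$ together with the weight that $\mu_n$ places on the sample mean.

```python
def posterior_std(n, sigma, sigma_0):
    """sigma_n for n samples."""
    return (sigma_0 ** 2 * sigma ** 2 / (n * sigma_0 ** 2 + sigma ** 2)) ** 0.5


def data_weight(n, sigma, sigma_0):
    """Coefficient of the sample mean in mu_n."""
    return n * sigma_0 ** 2 / (n * sigma_0 ** 2 + sigma ** 2)


ns = [1, 10, 100, 1000]
stds = [posterior_std(n, 1.0, 2.0) for n in ns]
weights = [data_weight(n, 1.0, 2.0) for n in ns]

for n, s, w in zip(ns, stds, weights):
    print(n, s, w)
```

As $n$ grows, $\sigma_n$ decreases roughly like $\sigma/\sqrt{n}$ and the weight on the sample mean approaches 1, so the prior matters less and less.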

The Univariate Case: $p(x|\mathcal{D})$

Having obtained the posterior density $p(\mu|\mathcal{D})$ for the mean, the remaining task is to compute the "class-conditional" density $p(x|\mathcal{D})$.

Following Section 3.4.2 of Duda's text [2] and Prof. Boutin's notes [1]:

p(x|\mathcal{D}) = \int p(x|\mu)p(\mu|\mathcal{D})d\mu

p(x|\mathcal{D}) = \int \frac{1}{\sqrt{2 \pi } \sigma} \exp[{-\frac{1}{2} (\frac{x-\mu}{\sigma})^2}] \frac{1}{\sqrt{2 \pi } \sigma_n} \exp[{-\frac{1}{2} (\frac{\mu-\mu_n}{\sigma_n})^2}] d\mu

p(x|\mathcal{D}) = \frac{1}{2\pi\sigma\sigma_n} \exp [-\frac{1}{2} \frac{(x-\mu_n)^2}{\sigma^2 + \sigma_n^2}]f(\sigma,\sigma_n)

Where $f(\sigma, \sigma_n)$ is defined as:

f(\sigma,\sigma_n) = \int exp[-\frac{1}{2}\frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n ^2}(\mu - \frac{\sigma_n^2 x+\sigma^2 \mu_n}{\sigma^2+\sigma_n^2})^2]d\mu

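The integral $f(\sigma,\sigma_n)$ is a Gaussian integral in $\mu$ whose value, $\sqrt{2\pi}\,\sigma\sigma_n/\sqrt{\sigma^2+\sigma_n^2}$, does not depend on $x$. Hence, $p(x|D)$ is normally distributed as: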

p(x|D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)
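As a numerical sanity check of this result, the sketch below approximates the integral $\int p(x|\mu)p(\mu|\mathcal{D})d\mu$ with a midpoint rule and compares it with the density of $N(\mu_n, \sigma^2 + \sigma_n^2)$. The parameter values, the integration range, and the function names are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (math.sqrt(2 * math.pi) * std)


def predictive_numeric(x, sigma, mu_n, sigma_n, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of p(x|mu) p(mu|D) over mu.

    The range mu_n +/- half_width * sigma_n is assumed wide enough that the
    posterior mass outside it is negligible.
    """
    lo = mu_n - half_width * sigma_n
    d_mu = 2 * half_width * sigma_n / steps
    total = 0.0
    for i in range(steps):
        mu = lo + (i + 0.5) * d_mu
        total += gaussian_pdf(x, mu, sigma) * gaussian_pdf(mu, mu_n, sigma_n)
    return total * d_mu


def predictive_closed_form(x, sigma, mu_n, sigma_n):
    """Density of N(mu_n, sigma^2 + sigma_n^2) at x."""
    return gaussian_pdf(x, mu_n, math.sqrt(sigma ** 2 + sigma_n ** 2))


# Assumed example values.
sigma, mu_n, sigma_n = 1.0, 1.15, 0.5
x = 2.0
numeric = predictive_numeric(x, sigma, mu_n, sigma_n)
closed = predictive_closed_form(x, sigma, mu_n, sigma_n)
print(numeric, closed)
```

The variance $\sigma^2 + \sigma_n^2$ combines the noise in a single observation with the remaining uncertainty about $\mu$, which is why the predictive density is wider than $p(x|\mu)$ for any fixed $\mu$.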

References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. R. Duda, P. Hart, and D. Stork, "Pattern Classification," Wiley-Interscience, Second Edition, 2000.

