Latest revision as of 09:52, 22 January 2015


Bayesian Parameter Estimation with examples

A slecture by ECE student Yu Wang

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.



Introduction: Bayesian Estimation

First recall that the joint probability density function of $ (\mathbf X,\theta) $ is the mapping on $ S \times \Theta $ given by

$ (x, \theta) \mapsto \pi(\theta) f(x \mid \theta) $

Next recall that the (marginal) probability density function f of $ X $ is given by

$ f(x) = \sum_{\theta \in \Theta} \pi(\theta) f(x | \theta), \quad x \in S $

if the parameter has a discrete distribution, or

$ f(x) = \int_\Theta \pi(\theta) f(x| \theta) \, d\theta, \quad x\in S $

if the parameter has a continuous distribution. Finally, by Bayes' rule, the conditional probability density function of $ \theta $ given $ X = x $, namely the posterior, is

$ h(\theta \mid x) = \frac{\pi(\theta) f(x \mid \theta)}{f(x)}; \quad \theta \in \Theta, \; x\in S $

The Bayesian parameter estimate (BPE) is defined as the posterior mean $ E(\theta \mid x) $.
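The definitions above can be checked numerically. The following is a minimal sketch, not part of the original lecture: it approximates the posterior mean by a Riemann sum over a grid of parameter values. The function name `posterior_mean_on_grid`, the uniform prior, and the single Bernoulli observation are all assumptions chosen for illustration.

```python
def posterior_mean_on_grid(prior, likelihood, x, grid):
    """Approximate E(theta | x) with a Riemann sum over equally spaced grid points."""
    # Unnormalized posterior pi(theta) * f(x | theta) at every grid point.
    weights = [prior(t) * likelihood(x, t) for t in grid]
    # The sum of the weights plays the role of the marginal f(x) (up to the grid spacing).
    evidence = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / evidence


# Hypothetical example: uniform prior on (0, 1) and one Bernoulli observation x = 1.
grid = [(i + 0.5) / 1000 for i in range(1000)]   # midpoints of 1000 equal cells
uniform_prior = lambda t: 1.0
bernoulli = lambda x, t: t if x == 1 else 1.0 - t

estimate = posterior_mean_on_grid(uniform_prior, bernoulli, 1, grid)
print(estimate)   # close to the exact posterior mean 2/3
```

With a uniform prior and one observed head, the posterior is proportional to $ \theta $, so the exact posterior mean is 2/3; the grid approximation should agree to several decimal places.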



Bayesian Parameter Estimation: Bernoulli Case with Beta distribution as prior

The probability density function of the beta distribution with shape parameters $ \alpha,\beta > 0 $, for $ 0 \le x \le 1 $, is

$ f(x;\alpha,\beta) = \frac{1}{ B(\alpha,\beta)} x^{\alpha-1}(1-x)^{\beta-1} $

Recall that the Bernoulli distribution has probability mass function (given p)

$ g(x \mid p) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\} $

So, with n i.i.d. samples, the likelihood function will be:

$ l(\mathbf{X} \mid p) = p^{\sum{x_i}} (1 - p)^{n-\sum{x_i}}, \quad x_i \in \{0, 1\} $

Thus, by Bayes' rule with a $ \mathrm{Beta}(\alpha,\beta) $ prior on $ p $, the posterior is

$ posterior(p \mid \mathbf{X}) = Cp^{Y+\alpha -1} (1 - p)^{n-Y+\beta-1}, \quad 0 \le p \le 1 $

where $ Y=\sum{x_i} $ and C is a normalizing constant. This is a Beta density with parameters $ \alpha+Y $ and $ \beta+n-Y $.

Therefore, $ \mathbb{E}(p \mid X) = \frac{\alpha+Y}{\alpha+\beta+n} $
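Because the posterior has the form of a Beta density with parameters $ \alpha+Y $ and $ \beta+n-Y $, the update reduces to simple counting, and the posterior mean is the first parameter divided by the sum of both. A minimal sketch follows; the function names and the sample data are hypothetical and only illustrate the update.

```python
def beta_bernoulli_posterior(alpha, beta, samples):
    """Return the Beta posterior parameters after observing 0/1 samples."""
    y = sum(samples)          # number of ones (heads)
    n = len(samples)
    return alpha + y, beta + n - y


def posterior_mean(alpha, beta, samples):
    """Posterior mean of p, i.e. the mean of the Beta posterior."""
    a_post, b_post = beta_bernoulli_posterior(alpha, beta, samples)
    return a_post / (a_post + b_post)


flips = [1, 0, 1, 1, 0, 1, 1, 1]            # hypothetical data: 6 heads in 8 flips
print(beta_bernoulli_posterior(2.0, 1.0, flips))
print(posterior_mean(2.0, 1.0, flips))      # equals (2 + 6) / (2 + 1 + 8)
```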


Bayesian Parameter Estimation: Example

The objective of the following experiments is to evaluate how varying parameters affect density estimation:

1. 1D Binomial data density estimation when varying the number of training samples

2. 1D Binomial data density estimation using different prior distribution.

3. 2D synthetic data density estimation when updating our prior guess.

The 1D Binomial test is based on flipping a biased coin. The probability that the coin lands heads is p, so the probability of tails is 1-p. In this experiment, we introduce another well-known estimator, the maximum a posteriori (MAP) estimator. We introduce MAP when comparing MLE and BPE because it can be treated as an intermediate step between the two: like BPE, it takes the prior into account. MAP can be defined as follows:

$ \hat{\theta}_{\mathrm{MAP}}(x)= \underset{\theta}{\operatorname{arg\,max}} \ f(x | \theta) \, \pi(\theta) $
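For the Beta prior and Bernoulli likelihood of the previous section, all three estimators have closed forms: the MLE is the sample frequency $ Y/n $, the MAP estimate is the mode of the Beta posterior, and the BPE is its mean. The sketch below writes them side by side; the function names and the example numbers are assumptions for illustration, and the MAP formula is only valid when both posterior parameters exceed 1.

```python
def mle(y, n):
    """Maximum likelihood estimate: fraction of heads."""
    return y / n


def map_estimate(alpha, beta, y, n):
    """Mode of the Beta(alpha + y, beta + n - y) posterior."""
    a_post, b_post = alpha + y, beta + n - y
    if a_post <= 1 or b_post <= 1:
        raise ValueError("posterior mode formula needs both parameters > 1")
    return (a_post - 1) / (a_post + b_post - 2)


def bpe(alpha, beta, y, n):
    """Posterior mean of the Beta(alpha + y, beta + n - y) posterior."""
    return (alpha + y) / (alpha + beta + n)


# Hypothetical case: Beta(4, 2) prior (mean 2/3), 3 heads in 5 flips.
print(mle(3, 5), map_estimate(4, 2, 3, 5), bpe(4, 2, 3, 5))
```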

First, we examine how the number of training samples affects BPE, MLE and MAP. The question is which estimator performs best when training data is scarce. To answer it, we set up the biased-coin problem as follows:

1. the number of training samples varies from 5 to 200 in steps of 10

2. in each case we use the same prior knowledge: $ \theta $ follows a Beta distribution (mean = 2/3)

3. in each case we run 30 trials, which gives a reasonable mean and variance; the ground truth of p is 2/3.
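The setup above can be sketched as a small simulation. This is not the code used to produce the figures: the prior parameters $ \alpha = 4, \beta = 2 $ are an assumption (the text only states that the prior mean is 2/3), and the function name `simulate` and the fixed seed are hypothetical.

```python
import random
from statistics import mean, pvariance


def simulate(true_p=2 / 3, alpha=4.0, beta=2.0, trials=30, seed=0):
    """Return {n: {estimator: (mean, variance)}} over repeated coin-flip trials."""
    rng = random.Random(seed)
    summary = {}
    for n in range(5, 201, 10):                       # n = 5, 15, ..., 195
        estimates = {"MLE": [], "MAP": [], "BPE": []}
        for _ in range(trials):
            # Number of heads in n flips of a coin with P(head) = true_p.
            y = sum(rng.random() < true_p for _ in range(n))
            estimates["MLE"].append(y / n)
            estimates["MAP"].append((alpha + y - 1) / (alpha + beta + n - 2))
            estimates["BPE"].append((alpha + y) / (alpha + beta + n))
        summary[n] = {name: (mean(v), pvariance(v)) for name, v in estimates.items()}
    return summary


summary = simulate()
for n in (5, 55, 195):
    print(n, summary[n])
```

Under these assumptions the BPE and MAP estimates are affine functions of the MLE with slope below 1, so their trial-to-trial variance can never exceed that of the MLE for the same data.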


Recall the probability density function of the beta distribution discussed above.

Figure 1: Posterior mean with increasing number of samples

In Figure 1, all curves converge to the true mean as the number of training samples increases. However, when the number of samples is small, BPE gives a better estimate because it takes all prior information into account, whereas MAP shows a large offset even though it also includes some prior information. In terms of mean value, the performance of MLE falls between BPE and MAP.

Figure 2: Variance of $ \hat{p} $ with different prior information

Looking more closely at the variance in each case, MLE tends to have a larger variance, especially when the number of samples is small, which means MLE is more uncertain about what it tries to estimate. BPE and MAP, on the other hand, have smaller variance because the prior information limits the uncertainty to a certain range. We can infer that if our prior distribution has a narrower peak at the true mean, rather than a Beta distribution with a wide ramp, the estimated variance will be much smaller.

Figure 3: Beta distribution when varying $ \alpha,\beta $

Figure 2 supports the inference above. In this case, we temporarily let the Beta distribution have true mean equal to 0.5 and adjust its two parameters ($ \alpha $ and $ \beta $) to obtain different variances, which represent the uncertainty of our initial guess. Figure 3 shows how the Beta distribution changes with different parameters. Returning to Figure 2, we can conclude that the certainty of the prior knowledge determines the variance of our estimate.
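To make the link between the Beta parameters and the prior uncertainty concrete, the sketch below evaluates the standard mean and variance formulas of the Beta distribution for several symmetric choices $ \alpha = \beta $, all of which have mean 0.5. The specific parameter values are assumptions; the ones used for Figure 3 are not stated in the text.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Symmetric priors keep the mean at 0.5 while the variance shrinks as alpha = beta grows.
for a in (1, 2, 5, 20):
    print(a, beta_mean(a, a), beta_variance(a, a))
```

For $ \alpha = \beta = a $ the variance simplifies to $ 1 / (4(2a+1)) $, so larger parameters describe a more confident prior around the same mean.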

Figure 4: Variance of $ \hat{p} $ with increasing number of samples

One may also ask how the number of samples affects the variance. Figure 4 shows that, starting from a very small number of samples (5 in our case), the variance first rises and then falls toward zero. The reason is that when the number of samples is very small, the prior dominates, so the estimate is essentially a reflection of the prior, which tends to have a small variance.
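One way to see this rise and fall analytically, assuming the plotted quantity is the trial-to-trial variance of the posterior-mean estimate under a fixed $ \mathrm{Beta}(\alpha,\beta) $ prior (the text does not state this explicitly), is the short derivation below. It uses only $ Y \sim \mathrm{Binomial}(n, p) $ and the posterior-mean formula for the Beta prior.

```latex
\begin{align*}
\hat{p}_{\mathrm{BPE}} &= \frac{\alpha + Y}{\alpha + \beta + n},
  \qquad Y \sim \mathrm{Binomial}(n, p), \\
\operatorname{Var}\bigl(\hat{p}_{\mathrm{BPE}}\bigr)
  &= \frac{\operatorname{Var}(Y)}{(\alpha + \beta + n)^2}
   = \frac{n\,p(1-p)}{(\alpha + \beta + n)^2}, \\
\frac{d}{dn}\,\frac{n}{(\alpha + \beta + n)^2}
  &= \frac{\alpha + \beta - n}{(\alpha + \beta + n)^3}.
\end{align*}
```

Under this assumption the variance grows while $ n < \alpha + \beta $, peaks near $ n = \alpha + \beta $, and then decays toward zero, which is the shape described for Figure 4 whenever $ \alpha + \beta $ exceeds the smallest sample size.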

Figure 5: Posterior mean with different initial guess

Figure 6: Prior: Beta distribution with various parameters

Now let us discuss what happens when our prior knowledge is biased, say the true mean is 0.6 but we model our prior as a Gaussian centered at 0.2. Still using the problem formulated before, where the ground truth is 2/3, we force our prior to be biased. As Figures 5 and 6 show, four initial guesses are tried with a relatively small number of samples. The results from different priors vary considerably, and the effect of the prior is dominant in this case. With more data, this effect is attenuated and the data become the dominant influence. Figure 7 shows how the posterior updates under different priors.
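The fading influence of a biased prior can be made explicit for the Beta prior: the posterior mean is a weighted average of the prior mean and the sample frequency, with the prior's weight equal to $ (\alpha+\beta)/(\alpha+\beta+n) $. The sketch below uses a hypothetical biased prior $ \mathrm{Beta}(2, 8) $ (mean 0.2) and an idealized sample frequency equal to the ground truth 2/3; neither value is taken from the figures.

```python
def prior_weight(alpha, beta, n):
    """Weight given to the prior mean in the posterior mean after n samples."""
    return (alpha + beta) / (alpha + beta + n)


def posterior_mean_from_summary(alpha, beta, sample_mean, n):
    """Posterior mean written as a weighted average of prior mean and sample mean."""
    w = prior_weight(alpha, beta, n)
    return w * alpha / (alpha + beta) + (1.0 - w) * sample_mean


# Hypothetical biased prior Beta(2, 8) with mean 0.2; data frequency fixed at 2/3.
for n in (5, 10, 100, 1000):
    print(n, prior_weight(2, 8, n), posterior_mean_from_summary(2, 8, 2 / 3, n))
```

With 10 samples the prior and the data are weighted equally under this assumed prior, whereas with 1000 samples the prior contributes about one percent, which matches the qualitative behaviour described above.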

Figure 7: Posterior: likelihood $ \times $ prior

Secondly, I will discuss how to update the prior recursively to reach a better estimate. In this experiment, assume an intruding UFO is detected by global radar in the year 2050. With advanced technology, the UFO can add Gaussian noise to its position to deceive our radar. However, the aliens don't know we have learned Bayesian estimation.

For simplicity, we limit the detection of the UFO to a certain range [3:5,4:6]; the true location is [3.5,4.5], which is unknown to us. What we know from our military radar is shown in Figure 8. The illusions created by the aliens follow a Gaussian distribution with standard deviation 2 centered at the true location.

Figure 8: UFO location on radar

To start off, our initial guess is a uniform distribution over the region we limit. In this case, we have 100 observations on our radar. After each observation, we set the new prior to the previous posterior. Figure 9 illustrates three stages of our detection. As we can see, as more data is collected, our prior becomes more concentrated. In other words, the confidence of detection grows with the observations.
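The recursive update can be sketched on a discrete grid of candidate locations. This is an illustrative reconstruction, not the code behind Figure 9: the grid spacing of 0.05, the isotropic Gaussian likelihood, the fixed random seed, and the function name `recursive_update` are all assumptions.

```python
import math
import random


def recursive_update(observations, cells, sigma):
    """Fold observations into a grid belief one at a time, starting from a uniform prior."""
    belief = [1.0 / len(cells)] * len(cells)
    for ox, oy in observations:
        # Multiply the current prior by the Gaussian likelihood of this observation
        # for every candidate location (cx, cy).
        belief = [
            b * math.exp(-((ox - cx) ** 2 + (oy - cy) ** 2) / (2.0 * sigma ** 2))
            for b, (cx, cy) in zip(belief, cells)
        ]
        total = sum(belief)
        belief = [b / total for b in belief]   # normalized posterior is the next prior
    return belief


rng = random.Random(0)
true_x, true_y, sigma = 3.5, 4.5, 2.0
observations = [(rng.gauss(true_x, sigma), rng.gauss(true_y, sigma)) for _ in range(100)]

# Candidate locations covering the region [3, 5] x [4, 6].
cells = [(3.0 + 0.05 * i, 4.0 + 0.05 * j) for i in range(41) for j in range(41)]
belief = recursive_update(observations, cells, sigma)

est_x = sum(b * cx for b, (cx, _) in zip(belief, cells))
est_y = sum(b * cy for b, (_, cy) in zip(belief, cells))
print(est_x, est_y, max(belief))
```

Normalizing after every observation keeps the numbers in a safe floating-point range and makes each intermediate belief a proper distribution, so it can be inspected at any stage in the same way as the panels of Figure 9.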

Figure 9: Updating prior with data: first line represents X,Y coordinates, second line is the updated prior distribution



References

[1]. Mireille Boutin, "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

[2]. Virtual Laboratories, "Virtual Laboratories in Probability and Statistics"

