[http://balthier.ecn.purdue.edu/index.php/ECE662#Class_Lecture_Notes Class Lecture Notes]
LECTURE THEME:

- Maximum Likelihood Estimation and Bayesian Parameter Estimation

**See also:** [Comparison of MLE and Bayesian Parameter Estimation]

**Parametric Estimation** of Class Conditional Density
.. |classcond1| image:: tex
:alt: tex: p(\vec{x}|w_i)
.. |vectheta1| image:: tex
:alt: tex: \vec{\theta}
The class conditional density |classcond1| can be estimated from training data. We denote the parameter to be estimated as |vectheta1|. Two methods of estimation are discussed:
MLE ([Maximum Likelihood Estimation])
BPE ([Bayesian Parameter Estimation])
.. |Dsample1k| image:: tex
:alt: tex: D_1, \ldots, D_c
.. |classes1k| image:: tex
:alt: tex: \omega_1, \ldots, \omega_c
.. |Di| image:: tex
:alt: tex: D_i
.. |Dj| image:: tex
:alt: tex: D_j, i \neq j
**Maximum Likelihood Estimation**
Let "c" denote the number of classes. D, the entire collection of sample data. |Dsample1k| represent the classification of data into classes |classes1k|. It is assumed that:
- Samples in |Di| give no information about the samples in |Dj|, and - Each sample is drawn independently.
.. |vec_thetai| image:: tex
:alt: tex: \vec{\theta_i}
.. |X_normal| image:: tex
:alt: tex: X \sim N(\mu,\sigma^2)
.. |theta1| image:: tex
:alt: tex: \vec{\theta}=[\mu,\sigma^2]
Example: The class conditional density |classcond1| depends on the parameter |vec_thetai|. If the class conditional density is |X_normal|, then |theta1|.
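As a concrete sketch of this example (illustrative code, not part of the original notes; assumes numpy is available), the closed-form MLE of |theta1| is just the sample mean and the (biased) sample variance:

```python
import numpy as np

# Illustrative sketch (not from the lecture notes): draw samples from
# X ~ N(mu, sigma^2) and recover theta = [mu, sigma^2] via the
# closed-form maximum likelihood estimates.
rng = np.random.default_rng(0)
mu_true, sigma2_true = 2.0, 1.5
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=1000)

mu_hat = x.mean()                        # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of the variance (divides by n)

print(f"theta_hat = [{mu_hat:.3f}, {sigma2_hat:.3f}]")  # close to [2.0, 1.5]
```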
.. |D_collecX| image:: tex
:alt: tex: D=\{\vec{X_1}, \ldots, \vec{X_n}\}
Let n be the size of the training sample, and let |D_collecX|. Then,
.. |XgivenOmTh| image:: tex
:alt: tex: p(\vec{X}|\omega_i,\vec{\theta_i})
.. |XgivenTh| image:: tex
:alt: tex: p(\vec{X}|\vec{\theta})
.. |likelihood1| image:: tex
:alt: tex: p(D|\vec{\theta})=\displaystyle \prod_{k=1}^n p(\vec{X_k}|\vec{\theta})
For a single class, |XgivenOmTh| reduces to |XgivenTh|.

The **Likelihood Function** is then defined as

|likelihood1|

which is maximized to obtain the parameter estimate.
.. |loglikelihood1| image:: tex
:alt: tex: l(\vec{\theta})=log p(D|\vec{\theta})=\displaystyle log(\prod_{k=1}^n p(\vec{X_k}|\vec{\theta}))=\displaystyle \sum_{k=1}^n log(p(\vec{X_k}|\vec{\theta}))
Since the logarithm is a monotonic function, maximizing the likelihood is the same as maximizing the log of the likelihood, which is defined as |loglikelihood1|.
"l" is the log likelihood function.
Maximize the log likelihood function with respect to |jinha_theta|:
.. |jinha_theta| image:: tex
:alt: tex: \vec{\theta}
|jinha_est_theta|
.. |jinha_est_theta| image:: tex
:alt: tex: \rightarrow \hat{\theta} = \arg\max_{\vec{\theta}} \left( l (\vec{\theta}) \right)
If |jinha_ltheta| is a differentiable function,
.. |jinha_ltheta| image:: tex
:alt: tex: l(\vec{\theta})
let |jinha_vectheta| be a 1 by p vector; then the gradient operator is
.. |jinha_vectheta| image:: tex
:alt: tex: \vec{\theta} = \left[ \theta_1, \theta_2, \cdots , \theta_p \right]
|jinha_nabia|
.. |jinha_nabia| image:: tex
:alt: tex: \nabla_{\vec{\theta}} = \left[ \frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \cdots , \frac{\partial}{\partial\theta_p} \right]^{t}
Then, we can compute the first derivative of the log likelihood function,
|jinha_fd_ltheta|
.. |jinha_fd_ltheta| image:: tex
:alt: tex: \rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = \sum_{k=1}^{n} \nabla_{\vec{\theta}} \left[ log(p(\vec{x_k} | \vec{\theta})) \right]
and set this first derivative equal to zero:
|jinha_fd_0|
.. |jinha_fd_0| image:: tex
:alt: tex: \rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = 0
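When this equation has no closed-form solution, the maximization can be carried out numerically. A minimal sketch (assuming scipy is available; all numbers are made up) that minimizes the negative log likelihood for 1-D Gaussian data:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch: maximize l(theta) numerically by minimizing the negative log
# likelihood. theta = [mu, log(sigma)] is used so the optimizer cannot
# leave the valid region sigma > 0. All numbers here are assumptions.
rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=500)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to 3.0 and 2.0
```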
**Example: the Gaussian case**

Assume that the covariance matrix is known, so only the mean must be estimated.
|jinha_mult_normal|
.. |jinha_mult_normal| image:: tex
:alt: tex: p(\vec{x_k} | \vec{\mu}) = \frac{1}{ \left( (2\pi)^{d} |\Sigma| \right)^{\frac{1}{2}}} exp \left[ - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right]
**Step 1: Take log**
|jinha_log_normal|
.. |jinha_log_normal| image:: tex
:alt: tex: log p(\vec{x_k} | \vec{\mu}) = -\frac{1}{2} log \left( (2\pi)^d |\Sigma| \right) - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu})
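As a quick sanity check of this expression (illustrative, assuming scipy; the numbers are made up), the hand-derived log density can be compared against a library implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sanity check with made-up numbers: the hand-derived log density above
# matches scipy's multivariate normal logpdf.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
xk = np.array([0.5, 0.2])

d = len(mu)
diff = xk - mu
log_p_manual = (-0.5 * np.log((2 * np.pi) ** d * np.linalg.det(Sigma))
                - 0.5 * diff @ np.linalg.solve(Sigma, diff))
log_p_scipy = multivariate_normal.logpdf(xk, mean=mu, cov=Sigma)

print(np.isclose(log_p_manual, log_p_scipy))  # True
```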
**Step 2: Take derivative**
|jinha_fd_log_normal|
.. |jinha_fd_log_normal| image:: tex
:alt: tex: \frac{\partial}{\partial\vec{\mu}} \left( log p(\vec{x_k} | \vec{\mu}) \right) = \frac{1}{2} \left[ (\vec{x_k} - \vec{\mu})^t \Sigma^{-1}\right]^t + \frac{1}{2} \left[ \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right] = \Sigma^{-1} (\vec{x_k} - \vec{\mu})
**Step 3: Equate to 0**
|jinha_eqtozero|
.. |jinha_eqtozero| image:: tex
:alt: tex: \sum_{k=1}^{n} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) = 0
|jinha_eqtozero2|
.. |jinha_eqtozero2| image:: tex
:alt: tex: \rightarrow \Sigma^{-1} \sum_{k=1}^{n} (\vec{x_k} - \vec{\mu}) = 0
|jinha_eqtozero3|
.. |jinha_eqtozero3| image:: tex
:alt: tex: \rightarrow \Sigma^{-1} \left[ \sum_{k=1}^{n} \vec{x_k} - n \vec{\mu}\right] = 0
|jinha_eqtozero4|
.. |jinha_eqtozero4| image:: tex
:alt: tex: \Longrightarrow \hat{\vec{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \vec{x_k}
This is the sample mean for a sample of size n.
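A short numerical check of this result (toy data, assuming numpy): the per-coordinate sample mean satisfies the score equation from Step 3:

```python
import numpy as np

# Toy check: with Sigma known, the MLE of the mean of a multivariate
# Gaussian is the per-coordinate sample mean, and it satisfies the
# score equation from Step 3 up to floating-point error.
rng = np.random.default_rng(3)
mu_true = np.array([1.0, 2.0, 3.0])
Sigma = np.diag([1.0, 0.5, 2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=1000)  # n x d data matrix

mu_hat = X.mean(axis=0)  # (1/n) * sum_k x_k
score = np.linalg.solve(Sigma, (X - mu_hat).sum(axis=0))

print(mu_hat)                   # close to [1.0, 2.0, 3.0]
print(np.allclose(score, 0.0))  # True: Step 3 holds at mu_hat
```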
[MLE Examples: Exponential and Geometric Distributions]
[MLE Examples: Binomial and Poisson Distributions]
Advantages of MLE:

- Simple
- Converges to the true parameter value as the sample size grows
- Asymptotically unbiased (though biased for small N; see the sketch below)
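The following sketch (an assumed toy setup, not from the notes) makes the small-N bias concrete for the Gaussian variance estimate, whose MLE divides by N rather than N-1:

```python
import numpy as np

# Toy demonstration of the small-N bias: the Gaussian variance MLE
# divides by N, so E[sigma2_hat] = (N-1)/N * sigma^2. Averaging over
# many repeated small samples exposes the bias.
rng = np.random.default_rng(4)
N, trials = 5, 100_000

samples = rng.normal(0.0, 1.0, size=(trials, N))   # true variance is 1.0
deviations = samples - samples.mean(axis=1, keepdims=True)
sigma2_mle = (deviations ** 2).mean(axis=1)

print(sigma2_mle.mean())  # about 0.8 = (N-1)/N, not 1.0
```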
**Bayesian Parameter Estimation**
For a given class, let |x_khh| be the feature vector of the class and let |theta_khh| be the parameter of the pdf of |x_khh| to be estimated.

Let |D_khh|, where |xx_khh| are the training samples of the class.
.. |x_khh| image:: tex
:alt: tex: \bf{x}
.. |D_khh| image:: tex
:alt: tex: D= \{ \mathbf{x}_1, \mathbf{x}_2, \cdots , \mathbf{x}_n \}
.. |xx_khh| image:: tex
:alt: tex: \mathbf{x}_1, \mathbf{x}_2, \cdots , \mathbf{x}_n
Note that |theta_khh| is a random variable with probability density |p_theta_khh|. The class conditional density given the training data is then
.. |theta_khh| image:: tex
:alt: tex: \bf{ \theta }
.. |p_theta_khh| image:: tex
:alt: tex: p( \bf { \theta } )
.. image:: tex
:alt: tex: \qquad p(\mathbf{x} \vert D)=\displaystyle \int p(\mathbf{x} \vert \mathbf{\theta} ) p(\mathbf{\theta} \vert D) d \mathbf{\theta }
where
.. image:: tex
:alt: tex: \qquad p(\mathbf{\theta} \vert D)=\frac {\displaystyle p(D \vert \mathbf{\theta} ) p(\mathbf{\theta} )} {\displaystyle \int p(D \vert \mathbf{\theta} ) p(\mathbf{\theta} ) d \mathbf{\theta } }
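For models without a closed-form posterior, the integrals above can be approximated on a grid. A minimal sketch under assumed toy choices (Bernoulli samples with a uniform prior on the parameter; not from the lecture):

```python
import numpy as np

# Grid-based sketch of p(theta | D) for an assumed toy model: Bernoulli
# samples with a uniform prior p(theta) on [0, 1]. The normalizing
# integral in the denominator is approximated by numerical quadrature.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical training samples
theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter

prior = np.ones_like(theta)                # p(theta): uniform
heads = data.sum()
likelihood = theta ** heads * (1 - theta) ** (len(data) - heads)  # p(D|theta)
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta)    # normalize: p(theta|D)

print(theta[posterior.argmax()])  # posterior mode, about 0.75 = 6/8
```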
- Example
Here is a good example: http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html
**EXAMPLE: Bayesian Inference for Gaussian Mean**
Consider the univariate case, where the variance is assumed to be known.
Here's a summary of results:
* Univariate Gaussian density |bi_gm_1|
* Prior density of the mean |bi_gm_2|
* Posterior density of the mean |bi_gm_3|
where
* |bi_gm_4|
* |bi_gm_5|
* |bi_gm_6|
Finally, the class conditional density is given by
|bi_gm_7|
.. |bi_gm_1| image:: tex
:alt: tex:p(x|\mu)\sim N(\mu,\sigma^{2})
.. |bi_gm_2| image:: tex
:alt: tex:p(\mu)\sim N(\mu_{0},\sigma_{0}^{2})
.. |bi_gm_3| image:: tex
:alt: tex:p(\mu|D)\sim N(\mu_{n},\sigma_{n}^{2})
.. |bi_gm_4| image:: tex
:alt: tex:\mu_{n}=\left(\frac{n\sigma_{0}^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\right)\hat{\mu}_{n}+\frac{\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\mu_{0}
.. |bi_gm_5| image:: tex
:alt: tex:\sigma_{n}^{2}=\frac{\sigma_{0}^{2}\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}
.. |bi_gm_6| image:: tex
:alt: tex:\hat{\mu}_{n}=\frac{1}{n}\sum_{k=1}^{n}x_{k}
.. |bi_gm_7| image:: tex
:alt: tex: p(x|D)\sim N(\mu_{n},\sigma^{2}+\sigma_{n}^{2})
.. |hsantos_sigma1| image:: tex
:alt: tex: \sigma^{2}
.. |hsantos_sigma2| image:: tex
:alt: tex: \sigma_{n}^{2}
The above formula can be interpreted as follows: in making a prediction for a single new observation, the variance of the estimate has two components:

1) |hsantos_sigma1| - the inherent variance within the distribution of x, i.e., the variance that would never be eliminated even with perfect information about the underlying distribution model;

2) |hsantos_sigma2| - the variance introduced by the estimation of the mean "mu"; this component can be eliminated given exact prior information or a very large training set (N goes to infinity).
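A small sketch of this decomposition with assumed toy numbers (not from the notes): the |hsantos_sigma1| term stays fixed while the |hsantos_sigma2| term shrinks with n:

```python
import numpy as np

# Toy numbers (assumptions, not from the notes): with n observations,
# the predictive density p(x|D) has variance sigma^2 + sigma_n^2. The
# first term is irreducible; the second shrinks as n grows.
sigma2 = 1.0             # known variance of x
mu0, sigma02 = 0.0, 4.0  # prior N(mu0, sigma0^2) on the mean
n = 10

rng = np.random.default_rng(5)
x = rng.normal(2.0, np.sqrt(sigma2), size=n)

mu_hat_n = x.mean()
mu_n = (n * sigma02 * mu_hat_n + sigma2 * mu0) / (n * sigma02 + sigma2)
sigma2_n = sigma02 * sigma2 / (n * sigma02 + sigma2)

print("predictive mean:", mu_n)
print("predictive variance:", sigma2 + sigma2_n)  # 1.0 + about 0.098
```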
.. image:: BayesianInference_GaussianMean_small.jpg
The above figure illustrates Bayesian inference for the mean of a Gaussian distribution whose variance is assumed to be known. The curves show the prior distribution over 'mu' (the curve labeled N=0), which in this case is itself Gaussian, along with the posterior distributions for an increasing number N of data points. The figure makes clear that as the number of data points increases, the posterior distribution peaks more sharply around the true value of the mean. This phenomenon is known as *Bayesian learning*.
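A few lines of code (with an assumed prior and data; a sketch, not the code behind the figure) reproduce the qualitative behavior, with the posterior variance shrinking as N grows:

```python
import numpy as np

# Assumed prior and data (a sketch, not the code behind the figure):
# the posterior N(mu_N, sigma2_N) over the mean concentrates as N grows.
sigma2, mu0, sigma02, mu_true = 1.0, 0.0, 1.0, 0.8
rng = np.random.default_rng(6)
x = rng.normal(mu_true, np.sqrt(sigma2), size=100)

for N in [0, 1, 2, 10, 100]:
    mu_hat = x[:N].mean() if N > 0 else 0.0  # sample mean (unused at N=0)
    mu_N = (N * sigma02 * mu_hat + sigma2 * mu0) / (N * sigma02 + sigma2)
    sigma2_N = sigma02 * sigma2 / (N * sigma02 + sigma2)
    print(N, round(mu_N, 3), round(sigma2_N, 4))
# sigma2_N: 1.0, 0.5, 0.3333, 0.0909, 0.0099 -- the posterior sharpens.
```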
**For more information:**
[Parametric Estimators]