Revision as of 08:57, 22 April 2014 by Wang957


Bayes rule in practice
A slecture by Lu Wang

(partially based on Prof. Mireille Boutin's ECE 662 lecture)



1. Bayes rule for Gaussian data

    Given data x ∈ R^d and N categories {w_i}, i=1,2,…,N, we decide which category the data belongs to by computing the posterior probability of each of the N categories given x and picking the category with the largest one. Mathematically, this can be written as:


$ \max_{w_{i}} \rho \left(w_{i}|x\right) $

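According to Bayes rule:

$ \rho \left(w_{i}|x\right) = \frac{\rho \left(x|w_{i}\right)Prob(w_{i})}{\rho \left(x\right)} $

The denominator ρ(x) does not depend on w_i, so it does not affect which category attains the maximum:

$ \arg\max_{w_{i}} \rho \left(w_{i}|x\right) = \arg\max_{w_{i}} \rho \left(x|w_{i}\right)Prob(w_{i}) $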
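As a small numerical illustration of why the normalizing term ρ(x) can be ignored when picking a category, the sketch below uses made-up likelihood and prior values for three categories at one fixed observation x. All numbers and variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical values for three categories at one fixed observation x.
likelihoods = np.array([0.05, 0.20, 0.10])  # rho(x | w_i), assumed values
priors = np.array([0.5, 0.2, 0.3])          # Prob(w_i), assumed values summing to 1

unnormalized = likelihoods * priors         # rho(x | w_i) * Prob(w_i)
evidence = unnormalized.sum()               # rho(x), identical for every category
posteriors = unnormalized / evidence        # rho(w_i | x) by Bayes rule

# Dividing by a positive constant does not change which entry is largest.
best_unnormalized = int(np.argmax(unnormalized))
best_posterior = int(np.argmax(posteriors))
print(best_unnormalized, best_posterior)
```

Both indices agree, so in practice only the product of likelihood and prior needs to be compared across categories.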

In our case, the data in each category w_i is Gaussian with mean μ_i and covariance matrix Σ_i. So we have,

$ \rho \left(x|w_{i}\right) = \frac{1}{(2\pi)^{\frac{d}{2}}|\mathbf{\Sigma}_{i}|^{\frac{1}{2}}}\exp\left[{-\frac{1}{2}(x - \mu_{i})^T\mathbf{\Sigma}_{i}^{-1}(x - \mu_{i})}\right] $

where d is the dimension of x. Let

$ \begin{align}g_{i}(x) &= \ln(\rho \left(x|w_{i}\right)Prob(w_{i})) \\ &= -\frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln(|\mathbf{\Sigma}_{i}|)-{\frac{1}{2}(x - \mu_{i})^T\mathbf{\Sigma}_{i}^{-1}(x - \mu_{i})}+\ln(Prob(w_{i})) \end{align} $

Since the logarithm is strictly increasing, maximizing g_i(x) selects the same category:

$ \begin{align}\arg\max_{w_{i}} \rho \left(w_{i}|x\right) &= \arg\max_{w_{i}} \rho \left(x|w_{i}\right)Prob(w_{i}) \\ &= \arg\max_{w_{i}} g_{i}(x) \end{align} $


For the two-class case, define the discriminant function

$ g\left(x\right) = g_{1}\left(x\right) - g_{2}\left(x\right) $

and decide w_1 if g(x) > 0; otherwise decide w_2.
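The two-class rule can be sketched in NumPy as follows, computing g_i(x) as ln(ρ(x|w_i)Prob(w_i)) for a Gaussian class density. This is a minimal sketch, not a reference implementation: the function names `log_discriminant` and `decide`, and the class means, covariances and priors, are all hypothetical choices made for the example.

```python
import numpy as np

def log_discriminant(x, mu, sigma, prior):
    """g_i(x) = ln( rho(x | w_i) * Prob(w_i) ) for a Gaussian class density."""
    d = mu.shape[0]
    diff = x - mu
    # slogdet returns (sign, ln|det|); this avoids overflow in the determinant.
    _, logdet = np.linalg.slogdet(sigma)
    # Solve sigma * z = diff rather than forming the inverse explicitly.
    mahalanobis = diff @ np.linalg.solve(sigma, diff)
    return (-0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
            - 0.5 * mahalanobis + np.log(prior))

def decide(x, params1, params2):
    """Return 1 if g(x) = g_1(x) - g_2(x) > 0, otherwise 2."""
    g = log_discriminant(x, *params1) - log_discriminant(x, *params2)
    return 1 if g > 0 else 2

# Hypothetical class parameters: (mean, covariance, prior).
class1 = (np.array([0.0, 0.0]), np.eye(2), 0.5)
class2 = (np.array([3.0, 3.0]), np.eye(2), 0.5)

label_a = decide(np.array([0.5, -0.2]), class1, class2)
label_b = decide(np.array([2.8, 3.1]), class1, class2)
print(label_a, label_b)
```

With equal priors and identical covariances, as assumed here, the rule reduces to assigning x to the class with the nearer mean.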

2. How to estimate parameters for Gaussian data

    If the Gaussian parameters μ_i and Σ_i of category w_i are unknown, we need to estimate them from data samples.

2.1  Generate training and testing samples
    Given data samples with known categories, we can estimate the parameters of the Gaussian distribution. The data samples are divided into training samples and testing samples: μ_i and Σ_i are first estimated from the training samples, and the estimates are then evaluated on the testing samples. Generally, the more training samples, the more accurate the estimate will be. It is also important to select training samples that represent the distribution of the population.
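One simple way to divide labeled samples is a random split, sketched below. The helper name `split_train_test`, the 30% test fraction, the fixed seeds, and the synthetic two-class data are assumptions made for illustration; the source does not prescribe a particular splitting scheme.

```python
import numpy as np

def split_train_test(samples, labels, test_fraction=0.3, seed=0):
    """Randomly split labeled samples into training and testing subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))          # shuffled sample indices
    n_test = int(round(test_fraction * len(samples)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (samples[train_idx], labels[train_idx],
            samples[test_idx], labels[test_idx])

# Synthetic data for illustration: two Gaussian clouds with known categories.
rng = np.random.default_rng(1)
samples = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                     rng.normal(3.0, 1.0, size=(50, 2))])
labels = np.array([1] * 50 + [2] * 50)

x_train, y_train, x_test, y_test = split_train_test(samples, labels)
print(x_train.shape, x_test.shape)
```

Shuffling before splitting is one way to keep both subsets representative of the population; with strongly unbalanced categories, splitting each category separately may be preferable.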

2.2 How to estimate μi and Σi
    There are various methods to estimate a density. Generally speaking, they fall into parametric and non-parametric density estimation methods.
    Parametric methods usually derive a mathematical expression for each parameter in terms of the training data. Classic parametric methods include Maximum Likelihood Estimation (MLE) and Bayesian Parameter Estimation (BPE). However, parametric methods can result in a large error rate if the density assumption (such as Gaussian in our case) is incorrect.
    Non-parametric methods do not require prior knowledge of the form of the density. In fact, they do not estimate parameters; instead, they estimate the density at each point to be classified. Parzen windows and K-nearest neighbors (KNN) are two well-known non-parametric methods. Two common concerns about non-parametric methods are their computational complexity and the number of training samples they require.
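To make the non-parametric idea concrete, here is a minimal sketch of a Parzen window density estimate with a Gaussian window. The function name `parzen_density`, the window width `h`, and the sample values are assumptions for illustration. Note that every query point requires a pass over all training samples, which is where the computational cost comes from.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen window estimate of the density at x using a Gaussian window.

    samples has shape (n_samples, dim); h is the window width (assumed > 0).
    """
    n_samples, dim = samples.shape
    u = (x - samples) / h                              # scaled offsets to each sample
    # Standard Gaussian window evaluated at each scaled offset.
    window = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (dim / 2.0)
    return window.sum() / (n_samples * h ** dim)

# Hypothetical 1-D training samples clustered around 0.
train = np.array([[0.0], [0.2], [-0.1], [0.1]])

density_near = parzen_density(np.array([0.05]), train, h=0.5)
density_far = parzen_density(np.array([5.0]), train, h=0.5)
print(density_near, density_far)
```

For classification, such an estimate would be computed separately from the training samples of each category and combined with the priors in the same way as the Gaussian density.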

2.3 Estimate μ and Σ through MLE    

    Let D = {x_1,x_2,...,x_N} be a set of N iid samples from a Gaussian distribution with μ and Σ unknown (here N is the number of samples, not the number of categories). The MLEs of μ and Σ are

$ \widehat{\mu} = \frac{1}{N}\sum_{k=1}^{N}x_{k} $
$ \widehat{\mathbf{\Sigma}} = \frac{1}{N}\sum_{k=1}^{N}(x_{k}-\widehat{\mu})(x_{k}-\widehat{\mu})^{T} $

See Tutorial on Maximum Likelihood Estimation: A Parametric Density Estimation Method for more details.
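The two MLE formulas above translate directly into a few lines of NumPy. The sketch below assumes the samples are stored one per row; the function name `gaussian_mle` and the three data points are hypothetical. Note the division by N (not N-1), which matches the MLE and corresponds to `bias=True` in `np.cov`.

```python
import numpy as np

def gaussian_mle(samples):
    """Maximum likelihood estimates of the mean and covariance.

    samples has shape (N, d), one sample per row.
    """
    n_samples = samples.shape[0]
    mu_hat = samples.mean(axis=0)                    # (1/N) * sum of x_k
    centered = samples - mu_hat                      # x_k - mu_hat for every k
    sigma_hat = centered.T @ centered / n_samples    # (1/N) * sum of outer products
    return mu_hat, sigma_hat

# Hypothetical data: three samples in two dimensions.
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat)
print(sigma_hat)
```

In a classifier, this estimation would be run once per category on that category's training samples to obtain each pair of mean and covariance.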
