Bayes rule in practice
A slecture by Lu Wang
(partially based on Prof. Mireille Boutin's ECE 662 lecture)
1. Bayes rule for Gaussian data
Given data x ∈ R^d and N categories {w_i}, i = 1, 2, …, N, we decide which category the data belongs to by computing the posterior probability of each of the N categories and picking the category with the largest probability. Mathematically, this decision rule can be written as:

$$x \in w_i \quad \text{if} \quad P(w_i \mid x) \ge P(w_j \mid x), \;\; \forall j = 1, 2, \ldots, N.$$
According to Bayes rule:

$$P(w_i \mid x) = \frac{p(x \mid w_i)\, P(w_i)}{p(x)},$$

where p(x) is the same for all categories and can therefore be ignored when comparing posteriors.
In our case, the data in each category is Gaussian distributed. So we have

$$p(x \mid w_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right).$$
Let

$$g_i(x) = \ln\big(p(x \mid w_i)\, P(w_i)\big).$$

Since the logarithm is monotonically increasing, maximizing g_i(x) is equivalent to maximizing the posterior. Now we have

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(w_i).$$
For the two-class case, we generate the discriminant function as

$$g(x) = g_1(x) - g_2(x),$$

and decide w_1 if g(x) > 0, and w_2 otherwise.
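The two-class rule above can be sketched in code. This is a minimal illustration, not part of the original lecture: it evaluates g_i(x) for two Gaussian classes with assumed (example) means, covariances, and priors, and picks the class with the larger value.

```python
import numpy as np

def discriminant(x, mu, sigma, prior):
    """g_i(x) = ln p(x|w_i) + ln P(w_i) for a Gaussian class density."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

def classify(x, mu1, sigma1, p1, mu2, sigma2, p2):
    """Decide class 1 if g(x) = g1(x) - g2(x) > 0, else class 2."""
    g = discriminant(x, mu1, sigma1, p1) - discriminant(x, mu2, sigma2, p2)
    return 1 if g > 0 else 2
```

For example, with equal priors, identity covariances, and means at the origin and at (4, 4), a point near the origin is assigned to class 1.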
2. How to evaluate parameters for Gaussian data
If the Gaussian parameters μi and Σi are unknown for category wi, we need to estimate them from the available data samples.
2.1 Generate training and testing samples
Given data samples with known categories, we are able to estimate the parameters of the Gaussian distribution. The samples are divided into training samples and testing samples: μi and Σi are first estimated from the training samples, and the resulting classifier is then evaluated on the testing samples. Generally, the more training samples we have, the more accurate the estimates will be. It is also important to select training samples that represent the distribution of the population.
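The split described above can be sketched as follows. The 70/30 fraction and the random seed are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

def split_samples(data, train_fraction=0.7, seed=0):
    """Randomly divide samples into training and testing sets.
    train_fraction and seed are example values, not from the lecture."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_fraction * len(data))
    return data[idx[:n_train]], data[idx[n_train:]]
```

Shuffling before splitting helps both subsets reflect the population distribution, which matters for the representativeness point above.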
2.2 How to estimate μi and Σi
There are various methods to estimate the density distribution. Generally speaking, we have parametric and non-parametric density estimation methods.
Parametric methods usually derive a mathematical expression, involving the training data, for each parameter. Classic parametric methods include Maximum Likelihood Estimation (MLE) and Bayesian Parametric Estimation (BPE). However, parametric methods can result in a large error rate if the density assumption (such as Gaussian in our case) is incorrect.
Non-parametric methods do not require prior knowledge of the density distribution. In fact, they do not estimate parameters at all; instead, they estimate the density at each point to be classified. The Parzen window and K-nearest neighbors (KNN) are two well-known non-parametric methods. Two common concerns about non-parametric methods are their computational complexity and the number of training samples they require.
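As a small illustration of the non-parametric idea, here is a minimal KNN classifier sketch (this example is mine, not from the lecture): no density parameters are fitted, and each query point is labeled by a majority vote among its k nearest training samples.

```python
import numpy as np

def knn_classify(x, train_x, train_y, k=3):
    """Assign x the majority label among its k nearest training samples."""
    dists = np.linalg.norm(train_x - x, axis=1)   # distances to all training points
    nearest = train_y[np.argsort(dists)[:k]]      # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

Note that every query requires distances to all training samples, which illustrates the computational-complexity concern mentioned above.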
2.3 Estimate μ and Σ through MLE
Let D = {x_1, x_2, ..., x_N} be a set of i.i.d. samples from a Gaussian distribution with μ and Σ unknown. The MLEs for μ and Σ are

$$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} x_k, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{N} (x_k - \hat{\mu})(x_k - \hat{\mu})^T.$$
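The MLE formulas above translate directly into code. A minimal sketch (the function name is mine): the covariance uses the biased divisor N, as the MLE prescribes, rather than the unbiased N − 1.

```python
import numpy as np

def mle_gaussian(samples):
    """MLE of a Gaussian's parameters from an (N, d) array of samples:
    mu_hat = sample mean, Sigma_hat = (1/N) sum (x_k - mu_hat)(x_k - mu_hat)^T."""
    mu_hat = samples.mean(axis=0)
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / len(samples)   # biased estimator, divisor N
    return mu_hat, sigma_hat
```

This matches NumPy's `np.cov(samples.T, bias=True)` for the covariance part.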
See Tutorial on Maximum Likelihood Estimation: A Parametric Density Estimation Method for more details.