Revision as of 11:07, 22 April 2014


Bayes rule in practice
A slecture by Lu Wang

(partially based on Prof. Mireille Boutin's ECE 662 lecture)



1. Bayes rule for Gaussian data

    Given data x ∈ Rn and N categories {wi}, i=1,2,…,N, we decide which category the data belongs to by computing the posterior probability of each of the N categories and picking the category with the largest posterior. Mathematically, this can be written as:


$ \max_{w_{i}} \rho \left(w_{i}|x\right) $

According to Bayes rule, the evidence ρ(x) is the same for every class and can be dropped, so:

$ \max_{w_{i}} \rho \left(w_{i}|x\right) = \max_{w_{i}} \rho \left(x|w_{i}\right)Prob(w_{i}) $

In our case, the data in each class is Gaussian, so we have

$ \rho \left(x|w_{i}\right) = \frac{1}{(2\pi)^{\frac{n}{2}}|\mathbf{\Sigma}_{i}|^{\frac{1}{2}}}\mbox{exp}\left[{-\frac{1}{2}(x - \mu_{i})^T\mathbf{\Sigma}_{i}^{-1}(x - \mu_{i})}\right] $
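As a sanity check on this density formula, here is a minimal numpy sketch; the function name `gaussian_density` and its interface are illustrative, not from the source:

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the class-conditional density rho(x|w_i) of an
    n-dimensional Gaussian with mean mu and covariance sigma."""
    x = np.atleast_1d(x).astype(float)
    mu = np.atleast_1d(mu).astype(float)
    sigma = np.atleast_2d(sigma).astype(float)
    n = x.size
    diff = x - mu
    # Normalizing constant (2*pi)^(n/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve
    exponent = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return np.exp(exponent) / norm
```

For a standard 1-D normal this reproduces the familiar peak value 1/sqrt(2π) at the mean.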

Let

$ \begin{align}g_{i}(x) &= ln(\rho \left(x|w_{i}\right)Prob(w_{i})) \\ &= -\frac{n}{2}ln(2\pi)-\frac{1}{2}ln(|\mathbf{\Sigma}_{i}|)-{\frac{1}{2}(x - \mu_{i})^T\mathbf{\Sigma}_{i}^{-1}(x - \mu_{i})}+ln(Prob(w_{i})) \end{align} $

Since the logarithm is monotonically increasing, maximizing ρ(x|wi)Prob(wi) is equivalent to maximizing gi(x). Now we have,


$ \begin{align}\max_{w_{i}} \rho \left(w_{i}|x\right) &= \max_{w_{i}} \rho \left(x|w_{i}\right)Prob(w_{i}) \\ &= \max_{w_{i}} g_{i}(x) \end{align} $


For the two-class case, define the discriminant function as

$ g\left(x\right) = g_{1}\left(x\right) - g_{2}\left(x\right); $
decide w1 if g(x) > 0;
else decide w2.
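The decision rule above can be sketched in numpy. The helper names `g` and `classify` and the way parameters are packed into tuples are illustrative assumptions, not from the source:

```python
import numpy as np

def g(x, mu, sigma, prior):
    """Log-discriminant g_i(x) for one class, keeping every term
    from the derivation above."""
    x = np.atleast_1d(x).astype(float)
    mu = np.atleast_1d(mu).astype(float)
    sigma = np.atleast_2d(sigma).astype(float)
    n = x.size
    diff = x - mu
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            - 0.5 * diff @ np.linalg.solve(sigma, diff)
            + np.log(prior))

def classify(x, params1, params2):
    """Decide w1 if g(x) = g1(x) - g2(x) > 0, else decide w2.
    Each params tuple is (mu, sigma, prior)."""
    return 1 if g(x, *params1) - g(x, *params2) > 0 else 2
```

With equal priors and unit variances, a point at one class mean is assigned to that class, as expected.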

2. How to evaluate parameters for Gaussian data

    If the Gaussian parameters μi and Σi are unknown for category wi, we need to estimate them from the data samples.

2.1  Generate training and testing samples
    Given the data samples with known category, we are able to estimate the parameters for the Gaussian distribution. Data samples are divided into training samples and testing samples, where μi and Σi are estimated from the training samples first, and then evaluated based on the testing samples. Generally, the more training samples, the more accurate the estimation will be. Also, it is important to select training samples that can represent the distribution of the population.

2.2 How to estimate μi and Σi
    There are various methods to estimate the density distribution. Generally speaking, we have parametric and non-parametric density estimation methods.
    Parametric methods typically derive a mathematical expression involving the training data for each parameter. Classic parametric methods include maximum likelihood estimation (MLE) and Bayesian parameter estimation (BPE). However, parametric methods can produce a large error rate if the assumed density (Gaussian, in our case) is incorrect.
    Non-parametric methods do not require foreknowledge of the density distribution. In fact, they do not estimate parameters at all; instead, they estimate the density directly at each point to be classified. Parzen windows and k-nearest neighbors (KNN) are two well-known non-parametric methods. Two common concerns about non-parametric methods are their computational complexity and the number of training samples they require.

2.3 Estimate μ and Σ through MLE    

    Let D = {x1,x2,...,xN} be a set of iid samples drawn from a Gaussian distribution with μ and Σ unknown. The MLEs for μ and Σ are

$ \widehat{\mu} = \frac{1}{N}\sum_{k=1}^{N}x_{k} $
$ \widehat{\mathbf{\Sigma}} = \frac{1}{N}\sum_{k=1}^{N}(x_{k}-\widehat{\mu})(x_{k}-\widehat{\mu})^{T} $

See Tutorial on Maximum Likelihood Estimation: A Parametric Density Estimation Method for more details.
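The two MLE formulas can be sketched directly in numpy (note the 1/N factor, which makes the Σ estimate the biased sample covariance; the function name is illustrative):

```python
import numpy as np

def mle_gaussian(samples):
    """Maximum likelihood estimates of mu and Sigma from an
    (N, n) array of iid Gaussian samples."""
    samples = np.atleast_2d(samples).astype(float)
    N = samples.shape[0]
    # mu_hat = (1/N) * sum of the samples
    mu_hat = samples.mean(axis=0)
    # Sigma_hat = (1/N) * sum of (x_k - mu_hat)(x_k - mu_hat)^T
    centered = samples - mu_hat
    sigma_hat = centered.T @ centered / N
    return mu_hat, sigma_hat
```

For the two 1-D samples {0, 2} this gives μ̂ = 1 and Σ̂ = 1, which is easy to verify by hand.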



3. Example: data classification based on the Bayes rule

    Assume a two-class case where the total population across classes 1 and 2 is 10,000, the prior probability of class 1 is 0.3, and hence the prior probability of class 2 is 0.7. The data of category 1 are distributed as N(1, 2), while those of category 2 are distributed as N(4, 1).

    We randomly choose 200 training samples from the population pool, so the proportion of class 1 in the training samples won't necessarily match its proportion in the whole population. We then randomly pick another 500 test samples from the population. Keep in mind that we know the true class of every data point, whether it is a training or a test sample. We build the discriminant function from the labeled training samples, and then measure accuracy by comparing the decisions made for the test samples against their ground truth.
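The experiment described above can be sketched as follows. A few assumptions not stated in the source: N(μ, σ²) is read with the second argument as a variance, the class priors are re-estimated from the training labels rather than plugged in as the known 0.3/0.7, and the random seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: 10000 points with prior 0.3 for class 1 and 0.7 for class 2.
# Assumption: N(1, 2) means mean 1, variance 2 (so std = sqrt(2)).
n_pop = 10000
labels = rng.choice([1, 2], size=n_pop, p=[0.3, 0.7])
data = np.where(labels == 1,
                rng.normal(1.0, np.sqrt(2.0), n_pop),
                rng.normal(4.0, 1.0, n_pop))

# Randomly draw 200 training samples and 500 disjoint test samples.
idx = rng.permutation(n_pop)
train, test = idx[:200], idx[200:700]

def g(x, mu, var, prior):
    """1-D log-discriminant g_i(x) from the derivation above."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var) \
           + np.log(prior)

# Estimate mu, sigma^2, and prior for each class from the training set
# (the prior could instead be fixed at the known 0.3 / 0.7).
params = {}
for c in (1, 2):
    xc = data[train][labels[train] == c]
    params[c] = (xc.mean(), xc.var(), len(xc) / len(train))

# Classify the test samples and compare with ground truth.
decisions = np.where(
    g(data[test], *params[1]) > g(data[test], *params[2]), 1, 2)
accuracy = (decisions == labels[test]).mean()
print(f"test accuracy: {accuracy:.3f}")
```

Because the class means (1 and 4) are well separated relative to the spreads, the test accuracy should be well above the 0.7 achievable by always guessing class 2.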
