Latest revision as of 10:16, 20 May 2013

Lecture 7, ECE662: Decision Theory

Lecture notes for ECE662 Spring 2008, Prof. Boutin.

Other lectures: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,

Lecture Content

Maximum Likelihood Estimation and Bayesian Parameter Estimation
Parametric Estimation of Class Conditional Density

Relevant Links

The class conditional density $p(\vec{x}|w_i)$ can be estimated using training data. We denote the parameter of estimation as $\vec{\theta}$ . There are two methods of estimation discussed.

MLE: Maximum Likelihood Estimation

BPE: Bayesian Parameter Estimation

Maximum Likelihood Estimation

Let "c" denote the number of classes. D, the entire collection of sample data. $D_1, \ldots, D_c$ represent the classification of data into classes $\omega_1, \ldots, \omega_c$ . It is assumed that: - Samples in $$ D_i $$ give no information about the samples in $D_j, i \neq j$ , and - Each sample is drawn independently.

Example: The class conditional density $p(\vec{x}|w_i)$ depends on parameter $\vec{\theta_i}$ . If $X ~ N(\mu,\sigma^2)$ denotes the class conditional density; then $\vec{\theta}=[\mu,\sigma^2]$ .

Let n be the size of training sample, and $D=\{\vec{X_1}, \ldots, \vec{X_n}\}$ . Then,

$p(\vec{X}|\omega_i,\vec{\theta_i})$ equals $p(\vec{X}|\vec{\theta})$ for a single class.

The Likelihood Function is, then, defined as $p(D|\vec{\theta})=\displaystyle \prod_{k=1}^n p(\vec{X_k}|\vec{\theta})$ , which needs to be maximized for obtaining the parameter.

Since logarithm is a monotonic function, maximizing the Likelihood is same as maximizing log of Likelihood which is defined as $l(\vec{\theta})=log p(D|\vec{\theta})=\displaystyle log(\prod_{k=1}^n p(\vec{X_k}|\vec{\theta}))=\displaystyle \sum_{k=1}^n log(p(\vec{X_k}|\vec{\theta}))$ .

"l" is the log likelihood function.

Maximize log likelyhood function with respect to $\vec{\theta}$

$\rightarrow \hat{\theta} = argmax \left( l (\vec{\theta}) \right)$

If $l(\vec{\theta})$ is a differentiable function

Let $\vec{\theta} = \left[ \theta_1, \theta_2, \cdots , \theta_p \right]$ be 1 by p vector, then

$\nabla_{\vec{\theta}} = \left[ \frac{\partial}{\partial\theta_1} \frac{\partial}{\partial\theta_2} \cdots \frac{\partial}{\partial\theta_p} \right]^{t}$

Then, we can compute the first derivatives of log likelyhood function,

$\rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = \sum_{k=1}^{n} \nabla_{\vec{\theta}} \left[ log(p(\vec{x_k} | \vec{\theta})) \right]$

and equate this first derivative to be zero

$\rightarrow \nabla_{\vec{\theta}} ( l (\vec{\theta}) ) = 0$

Example of Guassian case

Assume that covariance matrix are known.

$p(\vec{x_k} | \vec{\mu}) = \frac{1}{ \left( (2\pi)^{d} |\Sigma| \right)^{\frac{1}{2}}} exp \left[ - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right]$

Step 1: Take log

$log p(\vec{x_k} | \vec{\mu}) = -\frac{1}{2} log \left( (2\pi)^d |\Sigma| \right) - \frac{1}{2} (\vec{x_k} - \vec{\mu})^{t} \Sigma^{-1} (\vec{x_k} - \vec{\mu})$

Step 2: Take derivative

$\frac{\partial}{\partial\vec{\mu}} \left( log p(\vec{x_k} | \vec{\mu}) \right) = \frac{1}{2} \left[ (\vec{x_k} - \vec{\mu})^t \Sigma^{-1}\right]^t + \frac{1}{2} \left[ \Sigma^{-1} (\vec{x_k} - \vec{\mu}) \right] = \Sigma^{-1} (\vec{x_k} - \vec{\mu})$

Step 3: Equate to 0

$\sum_{k=1}^{n} \Sigma^{-1} (\vec{x_k} - \vec{\mu}) = 0$

$\rightarrow \Sigma^{-1} \sum_{k=1}^{n} (\vec{x_k} - \vec{\mu}) = 0$

$\rightarrow \Sigma^{-1} \left[ \sum_{k=1}^{n} \vec{x_k} - n \vec{\mu}\right] = 0$

$\Longrightarrow \hat{\vec{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \vec{x_k}$

This is the sample mean for a sample size n.

Examples of MLE: Exponential and Geometric Distributions

Examples of MLE: Binomial and Poisson Distributions

Advantages of MLE:

Simple
Converges
Asymptotically unbiased (though biased for small N)

Bayesian Parameter Estimation

For a given class, let $$ x $$ be feature vector of the class and $\theta$ be parameter of pdf of $$ x $$ to be estimated.

And let $D= \{ x_1, x_2, \cdots, x_n \}$ , where $$ x $$ are training samples of the class

Note that $\theta$ is random variable with probability density $p(\theta)$

where

EXAMPLE: Bayesian Inference for Gaussian Mean

The univariate case. The variance is assumed to be known.

Here's a summary of results:

Univariate Gaussian density $p(x|\mu)\sim N(\mu,\sigma^{2})$

Prior density of the mean $p(\mu)\sim N(\mu_{0},\sigma_{0}^{2})$

Posterior density of the mean $p(\mu|D)\sim N(\mu_{n},\sigma_{n}^{2})$

where

$\mu_{n}=\left(\frac{n\sigma_{0}^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\right)\hat{\mu}_{n}+\frac{\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\mu_{0}$

$\sigma_{n}^{2}=\frac{\sigma_{0}^{2}\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}$

$\hat{\mu}_{n}=\frac{1}{n}\sum_{k=1}^{n}x_{k}$

Finally, the class conditional density is given by $p(x|D)\sim N(\mu_{n},\sigma^{2}+\sigma_{n}^{2})$

The above formulas can be interpreted as: in making prediction for a single new observation, the variance of the estimate will have two components: 1) $\sigma^{2}$ - the inherent variance within the distribution of x, i.e. the variance that would never be eliminated even with perfect information about the underlying distribution model; 2) $\sigma_{n}^{2}$ - the variance introduced from the estimation of the mean vector "mu", this component can be eliminated given exact prior information or very large training set ( N goes to infinity);

BayesianInference GaussianMean small Old Kiwi.jpg

The above figure illustrates the Bayesian inference for the mean of a Gaussian distribution, for which the variance is assumed to be known. The curves show the prior distribution over 'mu' (the curve labeled N=0), which in this case is itself Gaussian, along with the posterior distributions for increasing number N of data points. The figure makes clear that as the number of data points increase, the posterior distribution peaks around the true value of the mean. This phenomenon is known as *Bayesian learning*.

For more information

Back to ECE662, Spring 2008, Prof. Boutin

@@ Line 1: / Line 1: @@
-[[ECE662:ECE662_Old Kiwi#Class_Lecture_Notes Class|Class Lecture Notes]]
+=Lecture 7, [[ECE662]]: Decision Theory=
-== Lecture Objective ==
+Lecture notes for [[ECE662:BoutinSpring08_Old_Kiwi|ECE662 Spring 2008]], Prof. [[user:mboutin|Boutin]].
+Other lectures: [[Lecture 1 - Introduction_Old Kiwi|1]],
+[[Lecture 2 - Decision Hypersurfaces_Old Kiwi|2]],
+[[Lecture 3 - Bayes classification_Old Kiwi|3]],
+[[Lecture 4 - Bayes Classification_Old Kiwi|4]],
+[[Lecture 5 - Discriminant Functions_Old Kiwi|5]],
+[[Lecture 6 - Discriminant Functions_Old Kiwi|6]],
+[[Lecture 7 - MLE and BPE_Old Kiwi|7]],
+[[Lecture 8 - MLE, BPE and Linear Discriminant Functions_Old Kiwi|8]],
+[[Lecture 9 - Linear Discriminant Functions_Old Kiwi|9]],
+[[Lecture 10 - Batch Perceptron and Fisher Linear Discriminant_Old Kiwi|10]],
+[[Lecture 11 - Fischer's Linear Discriminant again_Old Kiwi|11]],
+[[Lecture 12 - Support Vector Machine and Quadratic Optimization Problem_Old Kiwi|12]],
+[[Lecture 13 - Kernel function for SVMs and ANNs introduction_Old Kiwi|13]],
+[[Lecture 14 - ANNs, Non-parametric Density Estimation (Parzen Window)_Old Kiwi|14]],
+[[Lecture 15 - Parzen Window Method_Old Kiwi|15]],
+[[Lecture 16 - Parzen Window Method and K-nearest Neighbor Density Estimate_Old Kiwi|16]],
+[[Lecture 17 - Nearest Neighbors Clarification Rule and Metrics_Old Kiwi|17]],
+[[Lecture 18 - Nearest Neighbors Clarification Rule and Metrics(Continued)_Old Kiwi|18]],
+[[Lecture 19 - Nearest Neighbor Error Rates_Old Kiwi|19]],
+[[Lecture 20 - Density Estimation using Series Expansion and Decision Trees_Old Kiwi|20]],
+[[Lecture 21 - Decision Trees(Continued)_Old Kiwi|21]],
+[[Lecture 22 - Decision Trees and Clustering_Old Kiwi|22]],
+[[Lecture 23 - Spanning Trees_Old Kiwi|23]],
+[[Lecture 24 - Clustering and Hierarchical Clustering_Old Kiwi|24]],
+[[Lecture 25 - Clustering Algorithms_Old Kiwi|25]],
+[[Lecture 26 - Statistical Clustering Methods_Old Kiwi|26]],
+[[Lecture 27 - Clustering by finding valleys of densities_Old Kiwi|27]],
+[[Lecture 28 - Final lecture_Old Kiwi|28]],
+----
+----
+== Lecture Content ==
 * Maximum Likelihood Estimation and Bayesian Parameter Estimation
 * Parametric Estimation of Class Conditional Density
-See also: [[Comparison of MLE and Bayesian Parameter Estimation_Old Kiwi]]
+== Relevant Links==
+*MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]
+**[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
+**[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
+*BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
+*[[Comparison of MLE and Bayesian Parameter Estimation_Old Kiwi|Comparison of MLE and Bayesian Parameter Estimation]]
+*[[Parametric Estimators_Old Kiwi|Parametric Estimators]]
+----
+-----
 The class conditional density <math>p(\vec{x}|w_i)</math> can be estimated using training data. We denote the parameter of estimation as <math>\vec{\theta}</math>. There are two methods of estimation discussed.
-MLE [[Maximum Likelihood Estimation_Old Kiwi]]
+MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]
-BPE [[Bayesian Parameter Estimation_Old Kiwi]]
+BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
@@ Line 80: / Line 119: @@
 This is the sample mean for a sample size n.
-[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi]]
+[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
-[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi]]
+[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
 '''Advantages of MLE:'''
@@ Line 104: / Line 143: @@
 [[Image:Equation2_Old Kiwi.png]]
-'''Here is a good [http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html example]
-.'''
 == EXAMPLE: Bayesian Inference for Gaussian Mean ==
@@ Line 137: / Line 173: @@
 The above figure illustrates the Bayesian inference for the mean of a Gaussian distribution, for which the variance is assumed to be known. The curves show the prior distribution over 'mu' (the curve labeled N=0), which in this case is itself Gaussian, along with the posterior distributions for increasing number N of data points. The figure makes clear that as the number of data points increase, the posterior distribution peaks around the true value of the mean. This phenomenon is known as *Bayesian learning*.
+----
+----
+==For more information==
-'''For more information:'''
+*MLE: [[Maximum Likelihood Estimation_Old Kiwi| Maximum Likelihood Estimation]]
+**[[MLE Examples: Exponential and Geometric Distributions_Old Kiwi|Examples of MLE: Exponential and Geometric Distributions ]]
-[[Parametric Estimators_Old Kiwi]]
+**[[MLE Examples: Binomial and Poisson Distributions_Old Kiwi|Examples of MLE: Binomial and Poisson Distributions]]
+*BPE: [[Bayesian Parameter Estimation_Old Kiwi|Bayesian Parameter Estimation]]
+*[[Comparison of MLE and Bayesian Parameter Estimation_Old Kiwi|Comparison of MLE and Bayesian Parameter Estimation]]
+*[[Parametric Estimators_Old Kiwi|Parametric Estimators]]
+----
+[[ECE662:BoutinSpring08_Old_Kiwi|Back to ECE662, Spring 2008, Prof. Boutin]]
-== Lectures ==
+[[Category:lecture notes]]
-[[Lecture_1_-_Introduction_Old Kiwi| [1]]] [http://balthier.ecn.purdue.edu/index.php/Lecture_2_-_Decision_Hypersurfaces 2] [http://balthier.ecn.purdue.edu/index.php/Lecture_3_-_Bayes_classification 3]
-[http://balthier.ecn.purdue.edu/index.php/Lecture_4_-_Bayes_Classification 4] [http://balthier.ecn.purdue.edu/index.php/Lecture_5_-_Discriminant_Functions 5] [http://balthier.ecn.purdue.edu/index.php/Lecture_6_-_Discriminant_Functions 6] [http://balthier.ecn.purdue.edu/index.php/Lecture_7_-_MLE_and_BPE 7] [http://balthier.ecn.purdue.edu/index.php/Lecture_8_-_MLE%2C_BPE_and_Linear_Discriminant_Functions 8] [http://balthier.ecn.purdue.edu/index.php/Lecture_9_-_Linear_Discriminant_Functions 9] [http://balthier.ecn.purdue.edu/index.php/Lecture_10_-_Batch_Perceptron_and_Fisher_Linear_Discriminant 10] [http://balthier.ecn.purdue.edu/index.php/Lecture_11_-_Fischer%27s_Linear_Discriminant_again 11] [http://balthier.ecn.purdue.edu/index.php/Lecture_12_-_Support_Vector_Machine_and_Quadratic_Optimization_Problem 12] [http://balthier.ecn.purdue.edu/index.php/Lecture_13_-_Kernel_function_for_SVMs_and_ANNs_introduction 13] [http://balthier.ecn.purdue.edu/index.php/Lecture_14_-_ANNs%2C_Non-parametric_Density_Estimation_%28Parzen_Window%29 14] [http://balthier.ecn.purdue.edu/index.php/Lecture_15_-_Parzen_Window_Method 15] [http://balthier.ecn.purdue.edu/index.php/Lecture_16_-_Parzen_Window_Method_and_K-nearest_Neighbor_Density_Estimate 16] [http://balthier.ecn.purdue.edu/index.php/Lecture_17_-_Nearest_Neighbors_Clarification_Rule_and_Metrics 17] [http://balthier.ecn.purdue.edu/index.php/Lecture_18_-_Nearest_Neighbors_Clarification_Rule_and_Metrics%28Continued%29 18]
-[http://balthier.ecn.purdue.edu/index.php/Lecture_19_-_Nearest_Neighbor_Error_Rates 19]
-[http://balthier.ecn.purdue.edu/index.php/Lecture_20_-_Density_Estimation_using_Series_Expansion_and_Decision_Trees 20]

Difference between revisions of "Lecture 7 - MLE and BPE Old Kiwi" - Rhea