[[Category:ECE662]]
[[Category:pattern recognition]]
[[Category:MATLAB]]


<center><font size= 4>
'''Bayes Rule for Minimizing Risk''' <br />
</font size>

<font size=2>
A slecture by ECE student Dennis Lee

Partly based on the [[2014_Spring_ECE_662_Boutin|ECE662 Spring 2014 lecture]] material of [[user:mboutin|Prof. Mireille Boutin]].
</center>


[[Media:Slecture_Bayes_error_DennisLee.pdf|Click here for PDF version]]


----

== Introduction ==

In class we discussed Bayes rule for minimizing the probability of error.  Our goal is to generalize this rule to minimize risk instead of probability of error.  For simplicity, we deal with the two-class case.  We then derive Bayes rule for minimizing risk and illustrate it with examples using 1D and 2D features.

We briefly motivate this topic with the following scenario.  Suppose that a doctor has to classify a tumor as cancerous or benign.  Misclassifying a cancerous tumor as benign is the more serious error, since we would like the patient to be treated quickly if cancer is present.  Therefore, we assign a high cost to that decision.  We show how to incorporate this cost into Bayes rule in this article.

----

== Bayes rule for minimizing risk ==

Let <math> x \in \mathbb R^n </math> be a feature vector.  Let <math>
q_i(x) </math> be the posterior probability of class <math> i </math>
(denoted as <math> \omega_i </math>) given <math>
x </math>, and let <math> P_i </math> be the prior probability for <math>\omega_i
</math>.  Let <math> p_i(x) </math> be the class conditional density for
class <math> i </math>.  Denote <math> c_{ij} </math> as the
cost of deciding <math> x \in \omega_i </math> when <math> \omega_j </math> is the
true class.  The conditional cost of assigning <math> x \in \omega_i </math> given <math> x </math> is
  
 
<center>
<math>r_i(x) =  c_{i1} q_1(x) + c_{i2} q_2(x) \text{......Eq.(1)}</math>
</center>
  
where <math> i = 1, 2 </math>.

We assign <math> x </math> so that the cost is minimized:

<center>
<math>r_1(x) \lessgtr^{\omega_1}_{\omega_2} r_2(x) \text{......Eq.(2)}</math>
</center>
  
which says to decide <math> \omega_1 </math> if <math> r_1(x) < r_2(x) </math> and <math> \omega_2 </math> otherwise.
If we make the decision this way, the total cost becomes

<center>
<math>\begin{align}
E[r(x)] &= \int \text{min}[ r_1(x), r_2(x) ] p(x) dx \\
     &= \int \text{min} [c_{11} q_1(x) + c_{12} q_2(x),
        c_{21} q_1(x) + c_{22} q_2(x)] p(x) dx \\
     &= \int \text{min} [c_{11} P_1 p_1(x) + c_{12} P_2 p_2(x),
        c_{21} P_1 p_1(x) + c_{22} P_2 p_2(x) ] dx \\
     &= \int_{R_1} c_{11} P_1 p_1(x) + c_{12} P_2 p_2(x) dx +
        \int_{R_2} c_{21} P_1 p_1(x) + c_{22} P_2 p_2(x) dx \\
     &= (c_{21} P_1 + c_{22} P_2) +
        \int_{R_1} (c_{11} - c_{21}) P_1 p_1(x) + (c_{12} -
c_{22}) P_2 p_2(x) dx
\end{align} \text{......Eq.(3)}</math>
</center>
  
where <math> R_1 </math> and <math> R_2 </math> partition <math> \mathbb R^n </math>
according to the decision rule from Eq. (2), and
we use <math> \int_{R_2} p_i(x) dx = 1 - \int_{R_1} p_i(x) dx </math> in
the last line.
To minimize Eq. (3), we make the
integrand as negative as possible:

<center>
<math>(c_{12} - c_{22}) P_2 p_2(x) \lessgtr^{\omega_1}_{\omega_2}
(c_{21} - c_{11}) P_1 p_1(x) \text{......Eq.(4)}</math>
</center>

which is equivalent to assigning <math> x \in \omega_1 </math> if <math> x </math>
makes the integrand negative, and <math> x \in \omega_2 </math>
otherwise.  The Bayes test for minimum cost can now be stated as

<center>
<math>\frac{p_1(x)}{p_2(x)} \gtrless^{\omega_1}_{\omega_2}
\frac{(c_{12} - c_{22})P_2}{(c_{21} - c_{11})P_1} \text{......Eq.(5)}</math>
</center>
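
To make the test concrete, here is a minimal Python sketch of the decision in Eqs. (4)-(5).  The function name, the Gaussian densities, and the cost values below are illustrative choices, not part of the original derivation.

<pre>
from scipy.stats import norm

def bayes_min_risk_decision(p1_x, p2_x, P1, P2, c11=0.0, c12=1.0, c21=1.0, c22=0.0):
    """Decide omega_1 (return 1) or omega_2 (return 2) for one sample.

    p1_x, p2_x : class conditional densities p_1(x), p_2(x) evaluated at x
    P1, P2     : prior probabilities of omega_1 and omega_2
    c_ij       : cost of deciding omega_i when omega_j is the true class

    Implements Eqs. (4)-(5): decide omega_1 when
    (c21 - c11) * P1 * p1(x) > (c12 - c22) * P2 * p2(x).
    """
    return 1 if (c21 - c11) * P1 * p1_x > (c12 - c22) * P2 * p2_x else 2

# Usage with 1D Gaussian class conditional densities (illustrative values):
x = 1.0
p1 = norm.pdf(x, loc=0.0, scale=1.0)   # p_1(x): N(0, 1) density at x
p2 = norm.pdf(x, loc=2.0, scale=1.0)   # p_2(x): N(2, 1) density at x
print(bayes_min_risk_decision(p1, p2, P1=0.5, P2=0.5, c12=5.0, c21=1.0))
</pre>

Writing the test in the cross-multiplied form of Eq. (4) avoids dividing by <math> p_2(x) </math> when it is very small.
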
 
----

== Example 1: 1D features ==

Consider two classes and <math> x \in \mathbb R </math>.  Let <math>
p_1(x) = N(\mu_1, \sigma_1) </math> and <math> p_2(x) = N(\mu_2,
\sigma_2) </math>.  For simplicity, let <math> c_{11} = c_{22} = 0 </math>.
From Eq. (4) we assign <math> x \in \omega_1 </math> if

<center>
<math>\begin{align}
c_{21} P_1 p_1(x) &> c_{12} P_2 p_2(x) \\
\iff \text{ln}(c_{21} P_1) + \text{ln}(p_1(x)) &>
    \text{ln}(c_{12} P_2) + \text{ln}(p_2(x)) \\
\iff \text{ln}(c_{21} P_1) - \text{ln}(\sigma_1)
    - \frac{1}{2}\text{ln}(2 \pi)
    - \frac{(x - \mu_1)^2}{2 \sigma_1^2} &>
    \text{ln}(c_{12} P_2) - \text{ln}(\sigma_2)
    - \frac{1}{2}\text{ln}(2 \pi)
    - \frac{(x - \mu_2)^2}{2 \sigma_2^2}
\end{align} \text{......Eq.(6)}</math>
</center>

so the discriminant function becomes

<center>
<math>\frac{1}{2} \left( \frac{1}{\sigma_2^2} -
    \frac{1}{\sigma_1^2} \right) x^2
    + \left( \frac{\mu_1}{\sigma_1^2} -
    \frac{\mu_2}{\sigma_2^2} \right) x
    + \frac{1}{2} \left( \frac{\mu_2^2}{\sigma_2^2}
    - \frac{\mu_1^2}{\sigma_1^2} \right)
    + \text{ln} \left( \frac{\sigma_2}{\sigma_1}  \right)
    + \text{ln} \left( \frac{P_1 c_{21}}{P_2 c_{12}} \right)
    > 0 \text{......Eq.(7)}</math>
</center>

which has the form

<center>
<math>a x^2 + b x + c > 0</math>
</center>

where

<center>
<math>a = \frac{1}{2} \left( \frac{1}{\sigma_2^2} -
    \frac{1}{\sigma_1^2} \right),</math>
</center>

<center>
<math>b = \frac{\mu_1}{\sigma_1^2} -
    \frac{\mu_2}{\sigma_2^2},</math>
</center>

and

<center>
<math>c = \frac{1}{2} \left( \frac{\mu_2^2}{\sigma_2^2}
    - \frac{\mu_1^2}{\sigma_1^2} \right)
    + \text{ln} \left( \frac{\sigma_2}{\sigma_1}  \right)
    + \text{ln} \left( \frac{P_1 c_{21}}{P_2 c_{12}} \right).</math>
</center>

This form is similar to Bayes rule for minimizing error,
except that the prior term <math> \text{ln}(P_1/P_2) </math> is replaced by
<math> \text{ln} \left( \frac{P_1 c_{21}}{P_2 c_{12}} \right) </math>;
the extra term <math> \text{ln}(c_{21}/c_{12}) </math> shifts the decision
thresholds.  An equivalent formulation is to decide <math>
x \in \omega_1 </math> if

<center>
<math>\frac{p_1(x)}{p_2(x)} > \frac{P_2 c_{12}}{P_1 c_{21}}</math>
</center>

or

<center>
<math>\frac{p_1(x)}{p_2(x)} > \lambda</math>
</center>

where

<center>
<math>\lambda =  \frac{P_2 c_{12}}{P_1 c_{21}}.</math>
</center>

This has the same form as a Neyman-Pearson likelihood ratio test, except that
the threshold <math> \lambda </math> is set by the priors and the costs rather
than by a constraint on the false alarm rate.
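
As a concrete check of Eq. (7), the sketch below computes the coefficients <math> a </math>, <math> b </math>, <math> c </math> and verifies that the quadratic test agrees with the likelihood ratio test against <math> \lambda </math>.  The means, variances, priors, and costs are made-up illustrative values, not numbers from the article.  Note, for instance, that with <math> P_1 = P_2 </math>, <math> c_{12} = 5 </math>, and <math> c_{21} = 1 </math>, the threshold is <math> \lambda = 5 </math>, so we decide <math> \omega_1 </math> only when <math> p_1(x) </math> is at least five times <math> p_2(x) </math>.

<pre>
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for illustration (not from the article).
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 3.0, 2.0
P1, P2 = 0.5, 0.5
c12, c21 = 5.0, 1.0          # c11 = c22 = 0 as in the example

# Coefficients of the quadratic discriminant a x^2 + b x + c from Eq. (7).
a = 0.5 * (1.0 / sigma2**2 - 1.0 / sigma1**2)
b = mu1 / sigma1**2 - mu2 / sigma2**2
c = (0.5 * (mu2**2 / sigma2**2 - mu1**2 / sigma1**2)
     + np.log(sigma2 / sigma1) + np.log(P1 * c21 / (P2 * c12)))

def decide(x):
    """Return 1 if a x^2 + b x + c > 0 (decide omega_1), else 2."""
    return 1 if a * x**2 + b * x + c > 0 else 2

# Sanity check against the likelihood ratio form p1(x)/p2(x) > lambda.
lam = (P2 * c12) / (P1 * c21)
x = 1.2
ratio = norm.pdf(x, mu1, sigma1) / norm.pdf(x, mu2, sigma2)
print(decide(x), ratio > lam)   # decide(x) == 1 exactly when ratio > lam
</pre>
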
  
 
----

== Example 2: 2D features ==

Let <math> x \in \mathbb R^2 </math> have normal class conditional densities

<center>
<math>
p_i(x) = \frac{1}{(2 \pi)^{n/2}} | \Sigma_i |^{-1/2}
\text{exp}\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x
- \mu_i)  \right]
</math>
</center>

where <math> i = 1, 2 </math> and <math> n = 2 </math> is the feature dimension.
For simplicity, let <math> c_{11} = c_{22} = 0 </math>.
Similar to Eq. (6),
we decide <math> x \in \omega_1 </math> if

<center>
<math>\begin{align}
c_{21} P_1 p_1(x) &> c_{12} P_2 p_2(x) \\
\iff \text{ln}(c_{21} P_1) + \text{ln}(p_1(x)) &>
    \text{ln}(c_{12} P_2) + \text{ln}(p_2(x)) \\
\iff \text{ln}(c_{21} P_1) - \frac{1}{2}\text{ln}|\Sigma_1|
    - \frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) &>
    \text{ln}(c_{12} P_2) - \frac{1}{2}\text{ln} |\Sigma_2|
    - \frac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)
\end{align}
\text{......Eq.(8)}
</math>
</center>

so the discriminant function becomes

<center>
<math>\frac{1}{2} x^T (\Sigma_2^{-1} - \Sigma_1^{-1}) x +
x^T( \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2 ) +
\frac{1}{2} ( \mu_2^T \Sigma_2^{-1} \mu_2 - \mu_1^T
\Sigma_1^{-1} \mu_1 ) +
\text{ln} \left( \frac{P_1 c_{21}}{P_2 c_{12}} \right) +
\frac{1}{2} \text{ln} \left( \frac{| \Sigma_2 |}{| \Sigma_1 |} \right)
> 0 \text{......Eq.(9)}</math>
</center>

which has the form

<center>
<math>x^T A x + b^T x + c > 0</math>
</center>

where

<center>
<math>A = \frac{1}{2} (\Sigma_2^{-1} - \Sigma_1^{-1}),</math>
</center>

<center>
<math>b = \Sigma_1^{-1} \mu_1 - \Sigma_2^{-1} \mu_2,</math>
</center>

and

<center>
<math>c = \frac{1}{2} ( \mu_2^T \Sigma_2^{-1} \mu_2 - \mu_1^T
\Sigma_1^{-1} \mu_1 ) +
\text{ln} \left( \frac{P_1 c_{21}}{P_2 c_{12}} \right) +
\frac{1}{2} \text{ln} \left( \frac{| \Sigma_2 |}{| \Sigma_1 |} \right).
</math>
</center>

As in the 1D case, we decide <math> x \in \omega_1 </math> if

<center>
<math>\frac{p_1(x)}{p_2(x)} > \lambda</math>
</center>

where

<center>
<math>\lambda =  \frac{P_2 c_{12}}{P_1 c_{21}}.</math>
</center>

When <math> \Sigma_1 = \Sigma_2 </math>, the quadratic term vanishes and the Bayes classifier
reduces to a linear discriminant function.
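
Before turning to a specific illustration, here is a minimal Python sketch that assembles <math> A </math>, <math> b </math>, and <math> c </math> from Eq. (9) and applies the test.  The function names and the example parameters are assumptions made for illustration, not values from the article.

<pre>
import numpy as np

def quadratic_discriminant(mu1, mu2, Sigma1, Sigma2, P1, P2, c12, c21):
    """Build A, b, c of g(x) = x^T A x + b^T x + c from Eq. (9),
    assuming c11 = c22 = 0 as in the text."""
    S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = 0.5 * (S2inv - S1inv)
    b = S1inv @ mu1 - S2inv @ mu2
    c = (0.5 * (mu2 @ S2inv @ mu2 - mu1 @ S1inv @ mu1)
         + np.log(P1 * c21 / (P2 * c12))
         + 0.5 * np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)))
    return A, b, c

def decide(x, A, b, c):
    """Decide omega_1 (1) if g(x) > 0, else omega_2 (2)."""
    return 1 if x @ A @ x + b @ x + c > 0 else 2

# Illustrative parameters (not the ones from Eqs. (10)-(12)):
A, b, c = quadratic_discriminant(np.array([0., 0.]), np.array([3., 3.]),
                                 np.eye(2), 2 * np.eye(2),
                                 P1=0.5, P2=0.5, c12=5.0, c21=1.0)
print(decide(np.array([1.0, 1.0]), A, b, c))
</pre>
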
To give a specific illustration, we generate data from two
classes with a common covariance matrix <math> \Sigma_1 = \Sigma_2 = \Sigma </math> and take

<center>
<math>\mu_1 =
\left[
\begin{array}{cc}
8 & 1
\end{array}
\right]^T
\text{......Eq.(10)}</math>
</center>

<center>
<math>\mu_2 =
\left[
\begin{array}{cc}
7 & 7
\end{array}
\right]^T
\text{......Eq.(11)}</math>
</center>

<center>
<math>\Sigma =
\left[
\begin{array}{cc}
6 & 1 \\
1 & 6
\end{array}
\right].
\text{......Eq.(12)}</math>
</center>
  
 
<center>[[Image:C12_c21.png|600px|thumb|left|Fig 1: Data for class 1 (crosses) and class 2 (circles).
In all cases, Prob(<math>\omega_1</math>) = Prob(<math>\omega_2</math>) = 0.5.
Misclassified points are shown in red.
Values of <math>\mu_1</math>, <math>\mu_2</math>, and <math>\Sigma</math> are given in
Eqs. (10) - (12).
As the figures show, the separating hyperplanes shift depending
on the values of <math>c_{12}</math> and <math>c_{21}</math>.]]</center>
  
The data is classified using the discriminant function
from Eq. (9).  To see the effects of
<math> c_{12} </math> and <math> c_{21} </math>, we vary their values and examine
how the separating hyperplane shifts in Fig. 1.
We examine the following cases (a sketch that reproduces this experiment follows the list):

* When <math> c_{12} = 1 </math> and <math> c_{21} = 1 </math>, the costs of misclassifying classes 1 and 2 are equal, and the rule reduces to Bayes rule for minimizing the probability of error.  The separating hyperplane is positioned to minimize the probability of error, as Fig. 1(a) shows.

* When <math> c_{12} = 5 </math> and <math> c_{21} = 1 </math>, the cost of misclassifying class 2 is high.  Thus, the separating hyperplane shifts toward class 1, so fewer points from class 2 are misclassified, as Fig. 1(b) shows.

* When <math> c_{12} = 1 </math> and <math> c_{21} = 5 </math>, the cost of misclassifying class 1 is high, so the separating hyperplane shifts toward class 2.  As a result, fewer points from class 1 are misclassified, as Fig. 1(c) shows.
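
The following Python sketch reproduces the experiment behind Fig. 1 under stated assumptions: the sample size and random seed are not given in the article, so they are chosen arbitrarily, and the sketch only counts misclassified points for the three cost settings rather than reproducing the plots.  It uses the parameters of Eqs. (10)-(12) and the discriminant of Eq. (9), which is linear here because the classes share <math> \Sigma </math>.

<pre>
import numpy as np

rng = np.random.default_rng(0)               # seed chosen arbitrarily
mu1, mu2 = np.array([8., 1.]), np.array([7., 7.])
Sigma = np.array([[6., 1.], [1., 6.]])       # Eqs. (10)-(12)
P1 = P2 = 0.5
n = 200                                      # samples per class (not specified in the article)

X1 = rng.multivariate_normal(mu1, Sigma, n)  # class 1 data
X2 = rng.multivariate_normal(mu2, Sigma, n)  # class 2 data

Sinv = np.linalg.inv(Sigma)
for c12, c21 in [(1., 1.), (5., 1.), (1., 5.)]:
    # Eq. (9) with Sigma_1 = Sigma_2: the quadratic term drops out.
    b = Sinv @ (mu1 - mu2)
    c = 0.5 * (mu2 @ Sinv @ mu2 - mu1 @ Sinv @ mu1) + np.log(P1 * c21 / (P2 * c12))
    g = lambda X: X @ b + c                  # decide omega_1 where g(x) > 0
    err1 = np.sum(g(X1) <= 0)                # class 1 points assigned to omega_2
    err2 = np.sum(g(X2) > 0)                 # class 2 points assigned to omega_1
    print(f"c12={c12}, c21={c21}: misclassified {err1} from class 1, {err2} from class 2")
</pre>
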
  
 
 
----

== Summary and Conclusions ==

We have derived Bayes rule for minimizing risk.  The rule
can be stated as

<center>
<math>\frac{p_1(x)}{p_2(x)} \gtrless^{\omega_1}_{\omega_2}
\frac{(c_{12} - c_{22})P_2}{(c_{21} - c_{11})P_1}.</math>
</center>

To illustrate this rule, we have given two examples dealing with
1D and 2D features.  In both cases, the decision boundary
shifts depending on the costs.  Figure 1 demonstrates this for the
2D case: when the cost of misclassifying class <math> i </math> is
high, the separating hyperplane shifts to reduce the number
of points misclassified from class <math> i </math>.  We hope that this material provides the reader with a more general understanding of Bayes rule.

----
 
== References ==

[1]. K. Fukunaga, ''Introduction to Statistical Pattern Recognition'' (Academic, New York, 1972).

[2]. [https://engineering.purdue.edu/~mboutin/ Mireille Boutin], "ECE662: Statistical Pattern Recognition and Decision Making Processes," Purdue University, Spring 2014.

----

== Questions and comments ==

If you have any questions, comments, etc. please post them on [[Comments_Bayes_Rule_Dennis_Lee|this page]].
