Difference between revisions of "Lecture 4 - Bayes Classfication OldKiwi" - Rhea

Revision as of 11:44, 16 March 2008

**Bayes decision rule for continuous features**

Let |xvector| be a random vector taking values in |realn|. It is characterized by its pdf (probability density function) and its cdf (cumulative distribution function, also called simply the probability distribution function).

.. |realn| image:: tex

alt: tex: \Re^{n}

.. |xvector| image:: tex

alt: tex: \mathbf{x} = \left[ x_1, x_2, \cdots,x_n \right] ^{\mathbf{T}}

The probability distribution function or cdf is defined as:

.. image:: tex

alt: tex: P({x}) = P(x_1,\cdots,x_n) = Pr\{X_1 \le x_1, \cdots, X_n \le x_n\}

The probability density function is defined as:

|1st part|

|2nd part|

.. |1st part| image:: tex

alt: tex: { p({x}) = p(x_1,\cdots , x_n) = }

.. |2nd part| image:: tex

alt: tex: {\displaystyle \lim_{\Delta x_i \rightarrow 0 ,\phantom{0}\\ \forall i } \frac{Pr\{x_1 \le X_1 \le x_1+ \Delta x_1, \cdots, x_n \le X_n \le x_n+ \Delta x_n\}}{\Delta x_1 \Delta x_2 \cdots \Delta x_n} }

and,

.. |classes1k| image:: tex

alt: tex: \omega_1, \cdots, \omega_K

Each of the classes |classes1k| has its own "conditional density":

|conddensity1|

.. |conddensity1| image:: tex

alt: tex: p(x|w_i), i =1,\ldots,K

Each |conddensityxwi| is called the "class i density", in contrast to the "unconditional density function of x", also called the "mixture density of x", given by:

.. image:: tex

alt: tex: \displaystyle p({x}) = \sum_{k=1}^{K}P(w_k)p(x|w_k)
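The mixture density can be evaluated numerically once the priors and class densities are fixed. The following is a minimal sketch, assuming two univariate Gaussian class densities; the helper names (``gaussian_pdf``, ``mixture_density``) and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, priors, class_densities):
    """Unconditional density p(x) = sum_k P(w_k) p(x|w_k)."""
    return sum(P * p(x) for P, p in zip(priors, class_densities))

# Hypothetical two-class setup (not taken from the lecture).
priors = [0.7, 0.3]                       # P(w_1), P(w_2); must sum to 1
class_densities = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),  # p(x|w_1)
    lambda x: gaussian_pdf(x, 3.0, 1.0),  # p(x|w_2)
]

p_at_1 = mixture_density(1.0, priors, class_densities)
print(p_at_1)
```

Because the priors sum to one and each class density integrates to one, the mixture is itself a valid density.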

.. |conddensityxwi| image:: tex

alt: tex: p(x|w_i)

.. |class_i| image:: tex

alt: tex: \omega_i

**Addendum to the lecture:** Since the classes |class_i| are discrete, P(|class_i|) is not a probability distribution function (cdf) of |class_i|. Rather, it is a probability mass function (pmf). Refer to Duda and Hart, page 21.

Bayes Theorem:

.. image:: tex

alt: tex: P(w_i|{x}) = \frac{\displaystyle p(x|w_i)P(w_i)}{\displaystyle {\sum_{k=1}^{K}p(x|w_k)P(w_k)}}

Bayes Rule: Given X=x, decide |classi| if

|decision11|

|decision12|

.. |decision11| image:: tex

alt: tex: P(w_i|x) \ge P(w_j|x), \forall j

.. |decision12| image:: tex

alt: tex: \Longleftrightarrow p(x|w_i) \frac{\displaystyle P(w_i)}{\displaystyle \sum_{k=1}^{K}p(x|w_k)P(w_k)} \ge p(x|w_j) \frac{\displaystyle P(w_j)}{\displaystyle \sum_{k=1}^{K}p(x|w_k)P(w_k)} , \forall j
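Since the denominator in Bayes' theorem is the same for every class, the decision can be made from the products p(x|w_i)P(w_i) alone. The sketch below illustrates this under assumed Gaussian class densities; the function names and parameter values are hypothetical, not part of the lecture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Univariate normal density N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, class_densities):
    """P(w_i|x) for every class i, via Bayes' theorem."""
    joint = [P * p(x) for P, p in zip(priors, class_densities)]
    evidence = sum(joint)  # p(x), the mixture density
    return [j / evidence for j in joint]

def bayes_decide(x, priors, class_densities):
    """Index of the class maximizing p(x|w_i) P(w_i); no normalization needed."""
    joint = [P * p(x) for P, p in zip(priors, class_densities)]
    return max(range(len(joint)), key=joint.__getitem__)

# Hypothetical two-class setup.
priors = [0.7, 0.3]
class_densities = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),
    lambda x: gaussian_pdf(x, 3.0, 1.0),
]

for x in (0.0, 1.0, 3.0):
    print(x, posteriors(x, priors, class_densities), bayes_decide(x, priors, class_densities))
```

The class returned by ``bayes_decide`` always coincides with the class of largest posterior, because dividing every product by the same positive p(x) does not change their order.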

The Bayes rule to minimize the expected loss ([Loss Functions]), or "Risk":

- We consider a slightly more general setting of K+2 classes:

|classes1k|, D, O, where D="doubt class" and O="outlier/other class"

.. |losswlwk| image:: tex

alt: tex: L(w_l|w_k)

.. |classk| image:: tex

alt: tex: w_k

.. |classl| image:: tex

alt: tex: w_l

.. |losswkwk| image:: tex

alt: tex: L(w_k|w_k)=0, \forall k

Let |losswlwk| be the loss incurred by deciding |classl| when the true class is |classk|. Usually, |losswkwk|

If every misclassification is equally bad we define:

.. image:: tex

alt: tex: L(w_l|w_k)= \begin{cases} 0, & l=k \ \text{("correct")} \\ 1, & l \neq k \ \text{("incorrect")} \end{cases}

We could also include the cost of doubting:

.. image:: tex

alt: tex: L(w_l|w_k)= \begin{cases} 0, & l=k \\ 1, & l \neq k \\ d, & w_l=D \end{cases}

Example: Two classes of fish in a lake: trout and catfish

|troutcatfish|

|catfishtrout|

|trouttrout|

|catfishcatfish|

.. |troutcatfish| image:: tex

alt: tex:L(trout|catfish) = \$2

.. |catfishtrout| image:: tex

alt: tex:L(catfish|trout) = \$3

.. |trouttrout| image:: tex

alt: tex:L(trout|trout) = 0

.. |catfishcatfish| image:: tex

alt: tex:L(catfish|catfish)= 0

and, the cost of doubting: |doubt|

.. |doubt| image:: tex

alt: tex:L(D|catfish) = L(D|trout) = \$0.50

The expected loss for deciding a class |classi| given X=x (the "Risk") is defined as:

.. |classi| image:: tex

alt: tex: w_i

.. image:: tex

alt: tex:R(w_i|x) := \displaystyle \sum_{k=1}^{K} L(w_i|w_k)P(w_k|x)
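Applied to the trout/catfish losses above, the risk of each possible decision (including doubt) is a weighted sum of one row of the loss table. The following is a small sketch; the posterior values are hypothetical, and the names ``LOSS``, ``conditional_risk`` and ``min_risk_decision`` are introduced only for this illustration.

```python
# Loss table LOSS[decision][true_class] in dollars, from the fish example.
LOSS = {
    "trout":   {"trout": 0.0, "catfish": 2.0},
    "catfish": {"trout": 3.0, "catfish": 0.0},
    "D":       {"trout": 0.5, "catfish": 0.5},  # doubt
}

def conditional_risk(decision, posterior):
    """R(decision|x) = sum_k L(decision|w_k) P(w_k|x)."""
    return sum(LOSS[decision][k] * posterior[k] for k in posterior)

def min_risk_decision(posterior):
    """Decision with the smallest conditional risk."""
    return min(LOSS, key=lambda d: conditional_risk(d, posterior))

# Hypothetical posterior P(w_k|x) for some observed fish x.
posterior = {"trout": 0.6, "catfish": 0.4}
risks = {d: conditional_risk(d, posterior) for d in LOSS}
print(risks, min_risk_decision(posterior))
```

With these assumed posteriors the risks are about 0.8 for deciding trout, 1.8 for deciding catfish and 0.5 for doubting, so doubting is the cheapest decision even though trout is the more probable class.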

Consider the classifier c(x), a rule that gives a class |classi1| for every feature vector x. The risk of c(x) is given by

.. |classi1| image:: tex

alt: tex: w_i, i=1,\ldots,K

.. image:: tex

alt: tex:R(c(x)|x) =\displaystyle \sum_{k=1}^{K} L(c(x)|w_k)P(w_k|x)

The overall risk:

.. image:: tex

alt: tex:R := \displaystyle \int R(c(x)|x)p(x)dx

In order to minimize R we need to minimize R(c(x)|x) for every feature vector x.
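As a worked instance of this pointwise minimization, take the 0-1 loss with doubt cost d defined earlier. The derivation below is a sketch that assumes the posteriors over the K regular classes sum to one and leaves the outlier class O aside:

```latex
\begin{aligned}
R(w_i|x) &= \sum_{k=1}^{K} L(w_i|w_k)P(w_k|x) = \sum_{k \neq i} P(w_k|x) = 1 - P(w_i|x), \\
R(D|x)   &= \sum_{k=1}^{K} d \, P(w_k|x) = d .
\end{aligned}
```

Under these assumptions, the minimum-risk classifier picks the class with the largest posterior, unless 1 - max_i P(w_i|x) > d, in which case doubting is cheaper.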

Example: The losses for wrong decisions can also be represented by a loss matrix. Here is an instance of a loss matrix:

.. image:: LossMatrix.jpg

If a patient who has cancer is diagnosed as normal, the incurred loss is large. The loss incurred when a healthy patient is diagnosed as having cancer is much smaller. If the diagnosis is correct, no loss is incurred.

The optimal solution minimizes the expected loss, which is the sum, over all wrong decisions, of the probability of making that wrong decision multiplied by the loss it incurs.

Suppose that, in a population, the probability of a wrong decision is 5% when the patient has cancer and 30% when the patient is healthy. The expected loss based on the values in the loss matrix can then be calculated as follows:

Using the expected loss formula:

.. image:: tex

alt: tex:E[L] = 500*(5/100) + 0*(95/100) + 10*(30/100) + 0*(70/100) = 28.
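The arithmetic can be checked with a few lines of code. This sketch assumes, from the formula above and the surrounding text, that 500 is the loss for a missed cancer and 10 is the loss for a false alarm; the loss matrix image itself is not reproduced here, and the variable names are illustrative.

```python
# Losses read off the formula above (assumed to match the loss matrix image).
loss_missed_cancer = 500   # diagnosed normal, patient has cancer
loss_false_alarm = 10      # diagnosed cancer, patient is healthy

# Conditional probabilities of a wrong decision, as stated in the example.
p_wrong_given_cancer = 0.05
p_wrong_given_healthy = 0.30

expected_loss = (
    loss_missed_cancer * p_wrong_given_cancer
    + 0 * (1 - p_wrong_given_cancer)      # correct decision on a cancer patient
    + loss_false_alarm * p_wrong_given_healthy
    + 0 * (1 - p_wrong_given_healthy)     # correct decision on a healthy patient
)
print(expected_loss)
```

Note that this figure adds the two conditional expected losses (25 and 3). Weighting each term by the class priors P(cancer) and P(healthy), which the example does not specify, would give the expected loss per patient in the population.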

More generally, we encounter similar issues in rare event detection. In such cases, failing to detect a rare event is far more serious than raising a false alarm (reporting a rare event when none has occurred). Whenever the impact of misclassification is asymmetric or non-uniform, risk is a much more comprehensive performance metric than measures such as percentage accuracy.

Bayes rule to minimize the risk R:

- Choose a class |classi| that has the minimum risk |Rigivenx|, i.e., choose |classi| such that

|minrisk1|

|minrisk2|

|minrisk3|

|minrisk4|

.. |Rigivenx| image:: tex

alt: tex: R(w_i|x)

.. |minrisk1| image:: tex

alt: tex: R(w_i|x) \le R(w_j|x), \forall j

.. |minrisk2| image:: tex

alt: tex: \Longleftrightarrow \displaystyle \sum_{k=1}^{K} L(w_i|w_k)P(w_k|x) \le \displaystyle \sum_{k=1}^{K} L(w_j|w_k)P(w_k|x), \forall j \\

.. |minrisk3| image:: tex

alt: tex: \Longleftrightarrow \displaystyle \sum_{k=1}^{K} L(w_i|w_k)p(x|w_k)\frac{\displaystyle P(w_k)}{\displaystyle \sum_{l=1}^{K}p(x|w_l)P(w_l)} \\\le \displaystyle \sum_{k=1}^{K}L(w_j|w_k)p(x|w_k)\frac{\displaystyle P(w_k)}{\displaystyle \sum_{l=1}^{K}p(x|w_l)P(w_l)} , \text{for all j}

.. |minrisk4| image:: tex

alt: tex: \Longleftrightarrow \displaystyle \sum_{k=1}^{K} L(w_i|w_k)p(x|w_k)P(w_k) \le \displaystyle \sum_{k=1}^{K} L(w_j|w_k)p(x|w_k)P(w_k), \text{for all j}
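The last equivalence says the minimum-risk decision can be found without computing p(x): it suffices to compare the unnormalized sums of L(w_i|w_k) p(x|w_k) P(w_k). The following is a minimal sketch of that comparison; the function names and the likelihood and prior values are hypothetical, and the loss matrix reuses the trout/catfish losses from the earlier example.

```python
def risk_scores(loss, likelihoods, priors):
    """Unnormalized risks sum_k L(w_i|w_k) p(x|w_k) P(w_k), one per decision i."""
    joint = [p * P for p, P in zip(likelihoods, priors)]
    return [sum(l_ik * j for l_ik, j in zip(row, joint)) for row in loss]

def min_risk_class(loss, likelihoods, priors):
    """Index i of the decision with the smallest unnormalized risk."""
    scores = risk_scores(loss, likelihoods, priors)
    return min(range(len(scores)), key=scores.__getitem__)

# loss[i][k] = L(w_i|w_k), with w_1 = trout and w_2 = catfish.
loss = [
    [0.0, 2.0],
    [3.0, 0.0],
]
likelihoods = [0.2, 0.5]  # hypothetical p(x|w_1), p(x|w_2) at the observed x
priors = [0.5, 0.5]       # hypothetical P(w_1), P(w_2)

scores = risk_scores(loss, likelihoods, priors)
print(scores, min_risk_class(loss, likelihoods, priors))
```

Dividing every score by the same positive p(x) would turn them into the conditional risks R(w_i|x) without changing which one is smallest.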

**For more information on this topic:**

[Bayesian Decision Theory]

Previous: [Lecture 3] Next: [Lecture 5]

Experiments and notes

Bayes Classification: Experiments and Notes_OldKiwi: Experiments with synthetic data. These experiments show the behavior of Bayes classification when the classes have highly correlated features.

