Bayes' Theorem and its Application to Pattern Recognition
History
Bayes' Theorem takes its name from the mathematician Thomas Bayes. For accurate and detailed information about him, you might want to read his biography by Prof. D.R. Bellhouse \cite{Bellhouse}.
Note
This tutorial assumes familiarity with the following:
- The axioms of probability
- Definition of conditional probability
Bayes' Theorem
Let us revisit conditional probability through an example and then gradually move onto Bayes' theorem.
Example
Problem: In Spring 2014, in the Computer Science (CS) Department of Purdue University, 200 students registered for the course CS180 (Problem Solving and Object Oriented Programming). 30% of the registered students are CS majors and the rest are non-majors. From the student registration data we observe that 80% of the CS majors are males, whereas only 40% of the non-majors are males. Find the following:
- The probability that a randomly selected student is a CS major.
- The probability that the selected student is a CS major and a male.
- The probability that the selected student is a male.
- Given that the selected student is a male, what is the probability that he is a CS major? How is this different from the probability computed in part 1?
Solution:
- Notation: Let $ CS $ be the event that the selected student is a computer science major and let $ M $ be the event that the selected student is a male. Therefore, we can define the four events shown below and summarize the information given in the problem in the form of a table \cite{Triola} or in the form of a probability tree.
$ CS\equiv \text{CS major; } \overline{CS}\equiv \text{non-major; } M\equiv \text{male; } \overline{M}\equiv \text{female} $
The elements of the table excluding the legends (or captions) can be considered as a 3x3 matrix. Let (1,1) represent the first cell of the matrix. The content of (1,1) is computed first, followed by content of (1,2) and then (1,3). The same is then done with the second and third rows of the matrix.
The probability tree in Figure \ref{fig:Probability tree} is drawn by considering events as sequential. The number of branches in the probability tree depends on the number of events (i.e., how much you know about the system). The numbers on the branches denote the conditional probabilities.
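The counts that go into the table follow directly from the percentages in the problem statement. The sketch below rebuilds them in Python. The function name `contingency_table` and the layout (rows: CS, non-CS, total; columns: male, female, total) are assumptions made for illustration and may not match the original table's orientation.

```python
from fractions import Fraction


def contingency_table(total, p_cs, p_male_given_cs, p_male_given_non_cs):
    """Return a 3x3 table of counts.

    Rows: CS, non-CS, total. Columns: male, female, total.
    (This layout is an assumption made for illustration.)
    """
    cs = total * p_cs
    non_cs = total - cs
    cs_male = cs * p_male_given_cs
    non_cs_male = non_cs * p_male_given_non_cs
    rows = [
        [cs_male, cs - cs_male, cs],
        [non_cs_male, non_cs - non_cs_male, non_cs],
    ]
    # The last row holds the column totals.
    rows.append([rows[0][k] + rows[1][k] for k in range(3)])
    return [[int(x) for x in row] for row in rows]


# Exact fractions avoid floating-point rounding in the counts.
table = contingency_table(200, Fraction(3, 10), Fraction(8, 10), Fraction(4, 10))
for label, row in zip(("CS", "non-CS", "Total"), table):
    print(f"{label:>6}: male={row[0]:3d} female={row[1]:3d} total={row[2]:3d}")
```

The resulting counts (48 male CS majors, 104 males, 200 students) are the ones used in the table-based computations of parts 1 to 4.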
- From the table,
$ \textbf{P}(CS) = \frac{\text{No. of CS majors}}{\text{Total no. of Students}} = \frac{60}{200} = 0.3 $
This is nothing but the fraction of the total students who are CS majors.
- From the table,
$ \textbf{P}(CS\cap M) = \frac{\text{No. of CS majors who are also males}}{\text{Total no. of Students}} = \frac{48}{200} = 0.24 $
Using the probability tree we can interpret $ (CS\cap M) $ as the occurrence of event $ CS $ followed by the occurrence of event $ M $. Therefore,
$ \textbf{P}(CS\cap M) = \textbf{P}(CS)\times\textbf{P}(M\vert CS) = 0.3\times0.8 = 0.24 \text{ (multiplication rule)} $
- From the table,
$ \textbf{P}(M) = \frac{\text{Total no. males}}{\text{Total no. of Students}} = \frac{104}{200} = 0.52 $
From the probability tree it is clear that the event $ M $ can occur in 2 ways. Therefore we get, $ \textbf{P}(M) = \textbf{P}(M\vert CS)\times \textbf{P}(CS) + \textbf{P}(M\vert \overline{CS})\times \textbf{P}(\overline{CS}) = 0.8\times 0.3 + 0.4\times 0.7 = 0.52\text{ (total probability theorem)} $
- From the table,
$ \textbf{P}(CS\vert M) = \frac{\text{No. of males who are CS majors}}{\text{Total no. of males}} = \frac{48}{104} = 0.4615 $
Now let us compute the same using the probability tree. If you carefully observe the tree it is evident that the computation is not direct. So let us start from the definition of conditional probability, i.e.,$ \textbf{P}(CS\vert M) = \frac{\textbf{P}(CS\cap M)}{\textbf{P}(M)} $
Expanding the numerator using multiplication rule,$ \textbf{P}(CS\vert M) = \frac{\textbf{P}(M\vert CS)\times \textbf{P}(CS)}{\textbf{P}(M)} $
Using the total probability theorem in the denominator, $ \textbf{P}(CS\vert M) = \frac{\textbf{P}(M\vert CS)\times \textbf{P}(CS)}{\textbf{P}(M\vert CS)\times \textbf{P}(CS) + \textbf{P}(M\vert \overline{CS})\times \textbf{P}(\overline{CS})} = \frac{0.8\times 0.3}{0.8\times 0.3 + 0.4\times 0.7} = 0.4615 $
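The probability-tree computations of parts 2 to 4 can be checked numerically. The following is a minimal sketch that uses only the three numbers given in the problem; the variable names are chosen here for illustration.

```python
# Quantities given in the problem statement.
p_cs = 0.3              # P(CS)
p_m_given_cs = 0.8      # P(M | CS)
p_m_given_not_cs = 0.4  # P(M | not CS)

p_not_cs = 1.0 - p_cs

# Part 2: multiplication rule.
p_cs_and_m = p_m_given_cs * p_cs

# Part 3: total probability theorem (M can occur via CS or via not CS).
p_m = p_m_given_cs * p_cs + p_m_given_not_cs * p_not_cs

# Part 4: definition of conditional probability.
p_cs_given_m = p_cs_and_m / p_m

print(f"P(CS and M) = {p_cs_and_m:.2f}")
print(f"P(M)        = {p_m:.2f}")
print(f"P(CS | M)   = {p_cs_given_m:.4f}")
```

The values agree with the table-based results: 0.24, 0.52 and 0.4615.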
Observation:
From part 4 and part 1 we observe that $ \textbf{P}(CS\vert M) > \textbf{P}(CS) $, i.e., 0.4615 > 0.3. What does this mean? How did the probability that a randomly selected student is a CS major change when you were informed that the student is a male? Why did it increase?
Explanation:
In part 1 of the problem we knew nothing about the selected student, so we computed the probability using only the percentage of CS majors in the course. In computing this probability the sample space was the set of all students in the course.
In part 4 of the problem we were informed that event $M$ has occurred, i.e., we got partial information. What did we do with this information? We used it to revise the probability, i.e., our prior belief, in this case $ \textbf{P}(CS) $. $ \textbf{P}(CS) $ is called the prior because that is what we knew about the outcome before being informed about the occurrence of event $M$. We revised the prior by shrinking the sample space from all students in the course to only the males in the course. The increase from the prior to the revised probability is justified by the fact that males make up a larger share of the CS majors (80%) than of the non-majors (40%).
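The idea of revising a probability by shrinking the sample space can be made concrete by listing all 200 students and counting. This is an illustrative sketch: the tuple encoding of a student and the helper `fraction_cs` are assumptions introduced here, and the counts are the ones implied by the problem statement.

```python
# One (major, gender) tuple per registered student.
students = (
    [("CS", "M")] * 48
    + [("CS", "F")] * 12
    + [("non-CS", "M")] * 56
    + [("non-CS", "F")] * 84
)


def fraction_cs(sample_space):
    """Fraction of the given sample space that consists of CS majors."""
    return sum(1 for major, _ in sample_space if major == "CS") / len(sample_space)


# Prior: the sample space is every student in the course.
prior = fraction_cs(students)

# Revised probability: the sample space shrinks to the males only.
males = [s for s in students if s[1] == "M"]
revised = fraction_cs(males)

print(f"P(CS)     = {prior:.4f} over {len(students)} students")
print(f"P(CS | M) = {revised:.4f} over {len(males)} males")
```

The same counting function is applied twice; only the sample space passed to it changes.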
Inference:
So, what do we learn from this example?
- We are supposed to revise our beliefs when we get information. Doing this will help us predict the outcome more accurately.
- In this example we computed probabilities using two different methods: constructing a table and constructing a probability tree. In practice one could use either method to solve a problem.
In general, for events $ A_{1},\dots,A_{n} $ and an event $ B $ with $ \textbf{P}(B)>0 $, Bayes' theorem states that
$ \textbf{P}(A_{i}\vert B) = \frac{\textbf{P}(B\vert A_{i})\times \textbf{P}(A_{i})}{\sum_{j=1}^{n}\textbf{P}(B\vert A_{j})\times \textbf{P}(A_{j})} $
where $ n $ is the number of events $ A_{i} $ in the sample space. Note that the events $ A_{i} $ should be mutually exclusive and exhaustive, as shown in Figure \ref{fig:Venn diagram}. In Figure \ref{fig:Venn diagram} the green colored region corresponds to event $ B $. Bayes' theorem can be understood better by visualizing the events as sequential, as depicted in the probability tree. When additional information is obtained about a subsequent event, it is used to revise the probability of the initial event. The revised probability is called the posterior. In other words, we initially have a cause-effect model where we want to predict whether event $ B $ will occur or not, given that event $ A_{i} $ has occurred.
We then move to the inference model, where we are told that event $B$ has occurred and our goal is to infer whether event $A_{i}$ has occurred or not \cite{Bertsekas}.
In summary, Bayes' Theorem \cite{Sivia} provides us a simple technique to turn information about the probability of different effects (outcomes) from each possible cause into information about the probable cause given the effect (outcome).
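For $ n $ mutually exclusive and exhaustive causes, the step from the cause-effect model to the inference model can be sketched as a small function. The name `posterior` and its argument names are assumptions made for illustration; the function takes the priors $ \textbf{P}(A_{i}) $ and the likelihoods $ \textbf{P}(B\vert A_{i}) $ and returns the revised probabilities $ \textbf{P}(A_{i}\vert B) $.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for each i.

    priors[i] is P(A_i) and likelihoods[i] is P(B | A_i).
    The events A_i are assumed mutually exclusive and exhaustive,
    so the priors must sum to 1.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (exhaustive, mutually exclusive events)")
    # Numerators: P(B | A_i) * P(A_i), by the multiplication rule.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: P(B), by the total probability theorem.
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) is zero, so conditioning on B is undefined")
    return [j / evidence for j in joint]


# The CS180 example: causes are (CS, non-CS), the observed effect is "male".
result = posterior([0.3, 0.7], [0.8, 0.4])
print([round(p, 4) for p in result])
```

Applied to the CS180 example, the first entry is $ \textbf{P}(CS\vert M) $ and the entries sum to 1, since exactly one of the causes must have occurred.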
Algebra of Sets
A Note on Sets of Different Sizes
We can categorize the sets we encounter in three ways:
- A set $ A $ is finite if it contains a finite number of elements, i.e. the number of elements in the set is some natural number $ n $. Then we can list the elements in $ A $; e.g. $ A = \{x_1,...,x_n\} $.
- A set is countable if its elements can be put into a one-to-one correspondence (a bijection) with the integers. In this course, when we say countable, we mean countably infinite. We may write $ A=\{x_1,x_2,...\} $. The set of rationals is a countable set.
- A set is uncountable if it is not finite or countable. A set that is uncountable cannot be written as $ \{x_1,x_2,...\} $. Note that the set of reals as well as any interval in R is uncountable.
Note that a finite or countable space (a collection of sets) may contain elements that are uncountable. For example the set {Ø,[0,1]} is a finite set with 2 elements but the interval [0,1] is uncountable.
If you are interested to learn more about countable and uncountable sets, you may find this Math Squad tutorial useful.
We will often consider indexed collections of sets such as $ \{A_i, i\in I\} $,
where $ I $ is called the index set.
The index set can be
- finite: $ I=\{1,...,n\} $ for some natural number $ n $, so that the collection of sets is $ \{A_1,...,A_n\} $
- countable: $ I $ is the set of natural numbers i.e. $ I=\{1,2,3,...\} $, so the collection is $ \{A_1,A_2,A_3,...\} $
- uncountable: so the collection is $ \{A_{\alpha}, \alpha $∈$ I\} $ for an uncountable set $ I $. If $ I=R $, the set of reals, then each set in the collection can be written as $ A_{\alpha} $ for some real number $ \alpha $
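Definition $ \qquad $ The union of an indexed family of sets is defined as
$ \bigcup_{i\in I} A_i = \{\omega\in S : \omega\in A_i \text{ for at least one } i\in I\} $
Note that if $ I $ is finite, we can write the union as
$ \bigcup_{i=1}^{n} A_i $
If $ I $ is countable, we can write the union as
$ \bigcup_{i=1}^{\infty} A_i $
If $ I $ is uncountable, we can write the union as
$ \bigcup_{\alpha\in I} A_{\alpha} $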
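Definition $ \qquad $ The intersection of an indexed family of sets is defined as
$ \bigcap_{i\in I} A_i = \{\omega\in S : \omega\in A_i \text{ for all } i\in I\} $
Note that if $ I $ is finite, we can write the intersection as
$ \bigcap_{i=1}^{n} A_i $
If $ I $ is countable, we can write the intersection as
$ \bigcap_{i=1}^{\infty} A_i $
If $ I $ is uncountable, we can write the intersection as
$ \bigcap_{\alpha\in I} A_{\alpha} $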
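For a finite index set, the union and intersection of an indexed family can be computed directly. The sketch below represents the family as a Python dictionary that maps each index $ i $ to the set $ A_i $; this representation and the function names are assumptions made for illustration, and the code does not cover countable or uncountable index sets.

```python
def union(family):
    """Elements that belong to A_i for at least one index i."""
    result = set()
    for a in family.values():
        result |= a
    return result


def intersection(family):
    """Elements that belong to A_i for every index i."""
    sets = list(family.values())
    if not sets:
        # With an empty index set the intersection would be the whole
        # sample space, which this sketch does not know about.
        raise ValueError("intersection needs at least one set")
    result = set(sets[0])
    for a in sets[1:]:
        result &= a
    return result


# Index set I = {1, 2, 3}.
family = {1: {1, 2, 3}, 2: {2, 3, 4}, 3: {3, 4, 5}}
print(union(family))
print(intersection(family))
```

Here the union is $ \{1,2,3,4,5\} $ and the intersection is $ \{3\} $, the only element common to all three sets.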
Definition $ \qquad $ $ \{A_i, i $∈$ I\} $ is disjoint if $ A_i $∩$ A_j $ = Ø ∀ i,j∈I, i≠j.
Definition $ \qquad $ The collection $ \{A_i, i $∈$ I\} $ is a partition of $ S $, the sample space, if it is disjoint and if
$ \bigcup_{i\in I} A_i = S $
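The two conditions in the definition of a partition can be checked mechanically for a finite family over a finite sample space. This is a minimal sketch; the dictionary representation of the family and the names `is_disjoint` and `is_partition` are assumptions made for illustration.

```python
from itertools import combinations


def is_disjoint(family):
    """True if A_i and A_j share no element for every pair i != j."""
    return all(a.isdisjoint(b) for a, b in combinations(family.values(), 2))


def is_partition(family, sample_space):
    """True if the family is disjoint and its union is the sample space."""
    covered = set().union(*family.values())
    return is_disjoint(family) and covered == set(sample_space)


S = range(1, 7)  # sample space {1, ..., 6}
print(is_partition({1: {1, 2}, 2: {3}, 3: {4, 5, 6}}, S))  # disjoint and covers S
print(is_partition({1: {1, 2}, 2: {2, 3, 4, 5, 6}}, S))    # overlaps at 2
print(is_partition({1: {1, 2}, 2: {3, 4}}, S))             # misses 5 and 6
```

Only the first family is a partition of $ S $; the second fails the disjointness condition and the third fails the covering condition.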
References
- M. Comer. ECE 600. Class Lecture. Random Variables and Signals. Faculty of Electrical Engineering, Purdue University. Fall 2013.