Bayes Decision Theory - Continuous and Discrete Features
Continuous Features
Continuing from the last essay, we will now improve on the model in the following ways:
- Allowing the use of more than one feature, such as adding the shape of the cards as another feature.
- Allowing more than two states of nature, such as having a deck that also contains clubs and hearts.
- Allowing actions other than merely deciding the state of nature.
- Introducing a loss function.
Allowing the use of more than one feature just means that we replace the scalar y with the feature vector Y, where Y lies in a d-dimensional Euclidean space $ R^d $, called the feature space. Allowing more than two states of nature provides a useful generalization at small notational expense. Allowing more actions also opens up the possibility of rejection, i.e., refusing to make a decision in cases that are too close to call. This can be very useful if being indecisive is not too costly. The loss function states exactly how costly each action is, and is used to convert a probability determination into a decision. Loss functions enable us to treat situations where certain errors are more costly than others, although we will often only be looking at cases where all errors are equally costly.
Putting this together, let {x1,...,xc} be the finite set of c states of nature and let {k1,...,ka} be the finite set of a possible actions. The loss function λ(ki|xj) describes the loss incurred for taking action ki when the state of nature is xj. Let Y be a d-component vector-valued random variable, and let p(Y|xj) be the conditional probability density function for Y given that xj is the true state of nature. As discussed before, P(xj) is the prior probability that nature is in state xj, so by using Bayes formula we can find the posterior probability P(xj|Y):
$ P(x_j|\mathbf{Y})= \frac{p(\mathbf{Y}|x_j)P(x_j)}{p(\mathbf{Y})} \qquad\qquad\qquad\qquad (1) $
where the evidence p(Y) is the density
$ p(\mathbf{Y})= \sum_{j=1}^c p(\mathbf{Y}|x_j)P(x_j) \qquad\qquad\qquad\qquad (2) $
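As a rough illustration of equations (1) and (2), the sketch below computes posteriors for a feature vector under an assumed model in which each feature is independently Gaussian given the state of nature. That model, the function names, and the numbers are hypothetical choices made for this example and do not come from the essay.

```python
import math


def gaussian_pdf(y, mean, std):
    """Univariate normal density evaluated at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def class_conditional(Y, means, stds):
    """p(Y|x_j) under the assumed independent-Gaussian model:
    the product of the per-feature densities."""
    density = 1.0
    for y, m, s in zip(Y, means, stds):
        density *= gaussian_pdf(y, m, s)
    return density


def posteriors(Y, priors, params):
    """Return the list [P(x_1|Y), ..., P(x_c|Y)]."""
    joint = [class_conditional(Y, means, stds) * prior
             for (means, stds), prior in zip(params, priors)]
    evidence = sum(joint)                   # equation (2)
    return [jt / evidence for jt in joint]  # equation (1)


# Hypothetical problem: c = 3 states of nature, d = 2 features.
priors = [0.5, 0.3, 0.2]
params = [([0.0, 0.0], [1.0, 1.0]),   # (means, standard deviations) for x1
          ([2.0, 2.0], [1.0, 1.0]),   # for x2
          ([4.0, 0.0], [1.0, 1.0])]   # for x3
post = posteriors([1.0, 1.0], priors, params)
```

Here the observation is equally far from the means of x1 and x2, so their likelihoods coincide and the posteriors for those two states differ only through the priors.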
Now, suppose we observe a particular feature vector Y, and we decide to take an action ki. If the state of nature is xj, then from the definition of the loss function above we will incur the loss λ(ki|xj). Because P(xj|Y) is the probability that the true state of nature is xj, the expected loss associated with taking action ki can be expressed as:
$ R(k_i|\mathbf{Y})= \sum_{j=1}^c \lambda(k_i|x_j)P(x_j|\mathbf{Y}) \qquad\qquad\qquad(3) $
In decision theory terminology, an expected loss is called a risk, and R(ki|Y) is called the conditional risk. So whenever we have an observation Y, we can minimize the expected loss by choosing the action that minimizes the conditional risk. To minimize the overall risk, compute the conditional risk in equation 3 for i = 1,...,a and then select the action ki for which R(ki|Y) is minimum. The resulting minimum overall risk is called the Bayes risk and is denoted by $ R^* $.
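The decision rule above can be sketched in a few lines. The loss table below is a made-up example with two states of nature and three actions, where the third action is a rejection with a fixed cost of 0.25; the function names are likewise hypothetical.

```python
def conditional_risks(loss, post):
    """R(k_i|Y) = sum_j loss[i][j] * P(x_j|Y), as in equation (3).

    loss[i][j] is the loss for action k_i when the state is x_j;
    post[j] is the posterior P(x_j|Y).
    """
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, post)) for row in loss]


def bayes_action(loss, post):
    """Return (index of the minimum-risk action, its conditional risk)."""
    risks = conditional_risks(loss, post)
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks[best]


# Hypothetical loss table: rows are actions, columns are states of nature.
loss = [[0.00, 1.00],   # k1: decide x1
        [1.00, 0.00],   # k2: decide x2
        [0.25, 0.25]]   # k3: reject (assumed fixed cost)

confident = bayes_action(loss, [0.90, 0.10])    # risks are about 0.10, 0.90, 0.25
close_call = bayes_action(loss, [0.55, 0.45])   # risks are about 0.45, 0.55, 0.25
```

With these assumed numbers, the confident observation leads to deciding x1, while the close call leads to rejection because the rejection cost is lower than the risk of either decision.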
Discrete Features
In many practical applications, the components of the feature vector are binary, ternary or higher integer values, so that Y can assume only one of m discrete values {v1,...,vm}. In these cases, integrals of the probability density function p(Y|xj) are replaced by sums of the form
$ \qquad\qquad \sum_{\mathbf{Y}} P(\mathbf{Y}|x_j) \qquad\qquad\qquad\qquad\qquad\qquad\qquad (4) $
where we understand that the summation is over all values of Y in the discrete distribution. Bayes formula then involves probabilities, rather than probability densities. So we have:
$ P(x_j|\mathbf{Y})= \frac{P(\mathbf{Y}|x_j)P(x_j)}{P(\mathbf{Y})} \qquad\qquad\qquad\qquad (5) $
where
$ P(\mathbf{Y})= \sum_{j=1}^c P(\mathbf{Y}|x_j)P(x_j) \qquad\qquad\qquad\qquad (6) $
However, the definition of the conditional risk R(ki|Y) is unchanged, and the aim remains the same: select the action that minimizes the overall risk.
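For the discrete case, equations (5) and (6) reduce to table lookups. The following is a minimal sketch, assuming Y takes one of three values labelled v1, v2, v3 and that there are two states of nature; the probability tables are invented for illustration.

```python
def discrete_posteriors(value, priors, likelihood):
    """Return [P(x_j | Y = value)] for every state of nature.

    likelihood[j][value] holds the probability P(Y = value | x_j).
    """
    joint = [table[value] * prior for table, prior in zip(likelihood, priors)]
    evidence = sum(joint)                   # equation (6)
    return [jt / evidence for jt in joint]  # equation (5)


# Hypothetical tables: each row is a probability mass function over
# the values of Y, so it sums to 1 (a sum of the form in equation (4)).
priors = [0.6, 0.4]
likelihood = [{"v1": 0.7, "v2": 0.2, "v3": 0.1},   # P(Y|x1)
              {"v1": 0.1, "v2": 0.3, "v3": 0.6}]   # P(Y|x2)

post_v3 = discrete_posteriors("v3", priors, likelihood)   # approximately [0.2, 0.8]
```

The resulting posteriors can be passed to the same conditional-risk computation as in the continuous case, since only the way the posterior is obtained has changed.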