I wanted to understand why picking the most likely is the best you can do (better than choosing randomly from an identical distribution) so I worked it out as follows.
Consider a random experiment with 2 outcomes, 0 and 1.
Let $ E_0 $ be the event that outcome 0 occurs and $ E_1 $ be the event that outcome 1 occurs.
Let $ Pr(E_0) = p $ and $ Pr(E_1) = 1-p $, where $ p $ is some fixed but arbitrary probability.
Assume, without loss of generality, that $ p \ge 0.5 $ (relabelling the outcomes if necessary).
Consider a joint, independent random experiment intended to predict the outcome of the first.
Let $ F_0 $ be the event that outcome 0 is predicted and $ F_1 $ be the event that outcome 1 is predicted.
Let $ Pr(F_0) = q $ and $ Pr(F_1) = 1-q $.
The probability of error is
$ P_{err} = Pr(\{(E_0, F_1),(E_1, F_0)\}) = Pr(\{(E_0, F_1)\}) + Pr(\{(E_1, F_0)\}) $.
By independence, $ Pr(\{(E_0, F_1)\}) = Pr(\{E_0\}) Pr(\{F_1\}) $ and $ Pr(\{(E_1, F_0)\}) = Pr(\{E_1\}) Pr(\{F_0\}) $.
So, $ P_{err} = p(1-q) + (1-p)q = p - 2pq + q = (1-2p)q + p $.
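As a quick numerical check (not part of the original derivation), here is a small Python sketch that evaluates the error probability both ways, directly as $ p(1-q) + (1-p)q $ and in the simplified form $ (1-2p)q + p $, and confirms the two expressions agree:

```python
# Probability of a prediction error, written two ways.
def p_err_direct(p, q):
    return p * (1 - q) + (1 - p) * q

def p_err_simplified(p, q):
    return (1 - 2 * p) * q + p

# The two forms agree (up to floating-point rounding) for any p and q.
for p in (0.5, 0.6, 0.8, 0.95):
    for q in (0.0, 0.25, 0.5, 0.75, 1.0):
        assert abs(p_err_direct(p, q) - p_err_simplified(p, q)) < 1e-12
```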
We would like to choose $ q\in[0,1] $ to minimize
$ P_{err} $. Since $ P_{err} $ is linear
in $ q $, the extrema are at the endpoints. Hence,
evaluating at $ q=0 $ and $ q=1 $ gives
$ P_{err} = p $ and $ P_{err} = 1-p $ respectively;
since $ p \ge \frac{1}{2} $ implies $ 1-p \le p $, the
minimal $ P_{err} $ is $ 1-p $, attained at
$ q=1 $. Thus the optimal strategy for predicting
the outcome of the first experiment is to always (with
probability 1) predict the more likely outcome.
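The endpoint argument can also be checked numerically. This is a rough sketch (the grid over $ q $ is arbitrary and purely for illustration); for each $ p > \frac{1}{2} $ the scan finds the minimum error $ 1-p $ at $ q = 1 $:

```python
# P_err as a function of the prediction probability q, for fixed p.
def p_err(p, q):
    return (1 - 2 * p) * q + p

# For p > 0.5 the coefficient (1 - 2p) is negative, so P_err decreases
# in q and its minimum over [0, 1] is at q = 1, where it equals 1 - p.
for p in (0.6, 0.75, 0.9, 0.99):
    qs = [i / 1000 for i in range(1001)]
    best_q = min(qs, key=lambda q: p_err(p, q))
    print(p, best_q, p_err(p, best_q), 1 - p)
```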
Furthermore, on the interval $ p\in[\frac{1}{2}, 1] $,
the minimal $ P_{err} = 1-p $ is a strictly decreasing function of $ p $. That is,
the closer $ p $ is to $ \frac{1}{2} $, the
worse the outcome can be predicted (the higher $ P_{err} $ is),
and the farther $ p $ is from $ \frac{1}{2} $,
the better it can be predicted. This is consistent with the
information theoretic description of entropy (which has its
maximum at $ p=\frac{1}{2} $) as the "average uncertainty
in the outcome". Clearly the less uncertain the outcome is,
the better we should expect to be able to predict it.
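Purely as an illustration (using the standard binary entropy function, not anything from the derivation above), the following sketch prints the minimal error $ 1-p $ next to the entropy $ H(p) $ in bits; both shrink as $ p $ moves from $ \frac{1}{2} $ toward 1:

```python
import math

# Binary entropy in bits, with the usual convention 0 * log(0) = 0.
def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"p={p:.1f}  H(p)={binary_entropy(p):.3f} bits  min P_err={1 - p:.2f}")
```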
As a concrete example, consider two approaches for predicting
an experiment with $ p=.8 $ (i.e. $ E_0 $ occurs
with probability .8 and $ E_1 $ occurs with probability
.2). In the first approach we always predict $ E_0 $
(hence $ q = Pr(F_0) = 1, Pr(F_1) = 0 $). With this
approach we have $ Pr(\{(E_0, F_0)\}) = .8 $,
$ Pr(\{(E_0, F_1)\}) = 0 $, $ Pr(\{(E_1, F_0)\}) = .2 $,
$ Pr(\{(E_1, F_1)\}) = 0 $. So $ P_{err} = .2 $.
In the second approach we predict randomly according to the distribution of the first experiment (i.e. $ q = Pr(F_0) = .8, Pr(F_1) = .2 $). With this approach we have $ Pr(\{(E_0, F_0)\}) = .64 $, $ Pr(\{(E_0, F_1)\}) = .16 $, $ Pr(\{(E_1, F_0)\}) = .16 $, $ Pr(\{(E_1, F_1)\}) = .04 $. So $ P_{err} = .32 $, substantially worse.
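The same two approaches can be simulated directly. A minimal Monte Carlo sketch, assuming a simulate helper of my own naming and using only the standard library:

```python
import random

def simulate(p, q, trials=100_000, seed=0):
    """Estimate P_err when the outcome is 0 with probability p and the
    prediction is 0 with probability q, drawn independently."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        outcome = 0 if rng.random() < p else 1
        prediction = 0 if rng.random() < q else 1
        errors += outcome != prediction
    return errors / trials

# Always predict the more likely outcome (q = 1) versus predicting
# randomly from the same distribution (q = p = 0.8).
print(simulate(0.8, 1.0))  # roughly .2
print(simulate(0.8, 0.8))  # roughly .32
```

With enough trials the estimates settle near .2 and .32, matching the calculations above.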
--Jvaught 19:59, 26 January 2010 (UTC)