I wanted to understand why picking the most likely outcome is the best you can do (better than choosing randomly from an identical distribution), so I worked it out as follows.
Consider a random experiment with 2 outcomes, 0 and 1.
Let $ \displaystyle E_0 $ be the event that outcome 0 occurs and $ \displaystyle E_1 $ be the event that outcome 1 occurs.
Let $ \displaystyle Pr(E_0) = p $ and $ \displaystyle Pr(E_1) = 1-p $, where $ \displaystyle p $ is some fixed but arbitrary probability.
Assume, without loss of generality, that $ \displaystyle p \ge \frac{1}{2} $ (relabelling the outcomes if necessary).
Consider a joint, independent random experiment intended to predict the outcome of the first.
Let $ \displaystyle F_0 $ be the event that outcome 0 is predicted and $ \displaystyle F_1 $ be the event that outcome 1 is predicted.
Let $ \displaystyle Pr(F_0) = q $ and $ \displaystyle Pr(F_1) = 1-q $.
The probability of error is
$ \displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) $.
By independence,
$ \displaystyle Pr(E_0 \cap F_1) = Pr(E_0) \cdot Pr(F_1) $
$ \displaystyle Pr(E_1 \cap F_0) = Pr(E_1) \cdot Pr(F_0) $.
So,
$ \displaystyle P_{err} = Pr(E_0) \cdot Pr(F_1) + Pr(E_1) \cdot Pr(F_0) = p(1-q) + (1-p)q $.
We would like to choose $ \displaystyle q\in[0,1] $ to minimize $ \displaystyle P_{err} $. Since $ \displaystyle P_{err} $ is linear in $ \displaystyle q $, it attains its minimum at an endpoint of the interval. At $ \displaystyle q=0 $ we get $ \displaystyle P_{err}=p $, and at $ \displaystyle q=1 $ we get $ \displaystyle P_{err}=1-p $; since $ \displaystyle p \ge \frac{1}{2} $, the minimum is $ \displaystyle 1-p $, attained at $ \displaystyle q=1 $. Thus the optimal strategy for predicting the outcome of the first experiment is to always (with probability 1) predict the more likely outcome.
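To sanity-check this, here is a minimal Python sketch (my own addition, not part of the original post; the helper name p_err is made up) that evaluates $ \displaystyle P_{err}(q) = p(1-q) + (1-p)q $ over a grid of $ \displaystyle q $ for several values of $ \displaystyle p \ge \frac{1}{2} $ and confirms the minimum lands at $ \displaystyle q=1 $:

```python
# Hypothetical helper: the error probability p*(1-q) + (1-p)*q.
def p_err(p, q):
    return p * (1 - q) + (1 - p) * q

for p in (0.5, 0.6, 0.8, 0.95):
    qs = [i / 1000 for i in range(1001)]          # grid over [0, 1]
    errs = [p_err(p, q) for q in qs]
    best_q = qs[errs.index(min(errs))]
    # For p > 1/2 the unique minimizer is q = 1 with error 1 - p;
    # at p = 1/2 the error is constant in q, so every q is optimal.
    print(f"p={p}: min P_err = {min(errs):.3f} at q = {best_q}")
```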
Furthermore, on the interval $ \displaystyle p\in[\frac{1}{2}, 1] $, the minimal error probability $ \displaystyle 1-p $ is a strictly decreasing function of $ \displaystyle p $. That is, the closer $ \displaystyle p $ is to $ \displaystyle \frac{1}{2} $, the worse the outcome can be predicted (the higher $ \displaystyle P_{err} $ is), and the farther $ \displaystyle p $ is from $ \displaystyle \frac{1}{2} $, the better it can be predicted. This is consistent with the information-theoretic description of entropy (which has its maximum at $ \displaystyle p=\frac{1}{2} $) as the "average uncertainty in the outcome". Clearly, the less uncertain the outcome is, the better we should expect to be able to predict it.
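To make the parallel concrete (again just a sketch added here, using the standard definition of binary entropy), the following tabulates $ \displaystyle H(p) = -p\log_2 p - (1-p)\log_2(1-p) $ next to the minimal error $ \displaystyle 1-p $; both fall as $ \displaystyle p $ moves from $ \displaystyle \frac{1}{2} $ toward 1:

```python
import math

def binary_entropy(p):
    # H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0 by convention.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(" p     H(p)    min P_err (= 1 - p)")
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    print(f"{p:.2f}   {binary_entropy(p):.3f}   {1 - p:.3f}")
```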
As a concrete example, consider two approaches for predicting an experiment with $ \displaystyle p=.8 $ (i.e. $ \displaystyle E_0 $ occurs with probability .8 and $ \displaystyle E_1 $ occurs with probability .2). In the first approach we always predict outcome 0 (hence $ \displaystyle q = Pr(F_0) = 1, Pr(F_1) = 0 $). With this approach we have $ \displaystyle Pr(E_0 \cap F_0) = .8 $, $ \displaystyle Pr(E_0 \cap F_1) = 0 $, $ \displaystyle Pr(E_1 \cap F_0) = .2 $, $ \displaystyle Pr(E_1 \cap F_1) = 0 $. So $ \displaystyle P_{err} = .2 $.
In the second approach we predict randomly according to the distribution of the first experiment (i.e. $ \displaystyle q = Pr(F_0) = .8, Pr(F_1) = .2 $). With this approach we have $ \displaystyle Pr(E_0 \cap F_0) = .64 $, $ \displaystyle Pr(E_0 \cap F_1) = .16 $, $ \displaystyle Pr(E_1 \cap F_0) = .16 $, $ \displaystyle Pr(E_1 \cap F_1) = .04 $. So $ \displaystyle P_{err} = .32 $, substantially worse.
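A small Monte Carlo simulation (my own sketch, not part of the original argument; the function name simulate is made up) reproduces these two error rates by sampling the experiment and the prediction independently:

```python
import random

def simulate(p, q, trials=100_000, seed=0):
    # Estimate P_err when outcome 0 occurs with probability p and the
    # independent prediction is 0 with probability q.
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        outcome = 0 if rng.random() < p else 1
        prediction = 0 if rng.random() < q else 1
        errors += outcome != prediction
    return errors / trials

p = 0.8
print("always predict outcome 0:", simulate(p, q=1.0))   # ~0.20
print("match the distribution:  ", simulate(p, q=0.8))   # ~0.32
```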
--Jvaught 19:59, 26 January 2010 (UTC)