I wanted to understand why picking the most likely outcome is the best you can do (better than choosing randomly from an identical distribution), so I worked it out as follows.

Consider a random experiment with 2 outcomes, 0 and 1.

Let $ \displaystyle E_0 $ be the event that outcome 0 occurs and $ \displaystyle E_1 $ be the event that outcome 1 occurs.

Let $ \displaystyle Pr(E_0) = p $ and $ \displaystyle Pr(E_1) = 1-p $, where $ \displaystyle p $ is some fixed but arbitrary probability.

Assume, without loss of generality, that $ \displaystyle p \ge \frac{1}{2} $ (relabelling the outcomes if necessary).


Consider a second, independent random experiment intended to predict the outcome of the first.

Let $ \displaystyle F_0 $ be the event that outcome 0 is predicted and $ \displaystyle F_1 $ be the event that outcome 1 is predicted.

Let $ \displaystyle Pr(F_0) = q $ and $ \displaystyle Pr(F_1) = 1-q $.


The probability of error is

$ \displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) $.

By independence,

$ \displaystyle Pr(E_0 \cap F_1) = Pr(E_0) \cdot Pr(F_1) $

$ \displaystyle Pr(E_1 \cap F_0) = Pr(E_1) \cdot Pr(F_0) $.

So,

$ \displaystyle P_{err} = Pr(E_0) \cdot Pr(F_1) + Pr(E_1) \cdot Pr(F_0) = p(1-q) + (1-p)q = (1-2p)q + p $.


We would like to choose $ \displaystyle q\in[0,1] $ to minimize $ \displaystyle P_{err} $. Since $ \displaystyle P_{err} $ is linear in $ \displaystyle q $, the extrema are at the endpoints. Evaluating at $ \displaystyle q=0 $ and $ \displaystyle q=1 $ gives $ \displaystyle P_{err} = p $ and $ \displaystyle P_{err} = 1-p $ respectively; since $ \displaystyle p \ge \frac{1}{2} $ implies $ \displaystyle 1-p \le p $, the minimal $ \displaystyle P_{err} $ is $ \displaystyle 1-p $, attained at $ \displaystyle q=1 $. Thus the optimal strategy for predicting the outcome of the first experiment is to always (with probability 1) predict the more likely outcome.
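
To double-check this numerically, here is a minimal Python sketch (the helper name p_err is just illustrative) that evaluates $ \displaystyle P_{err}(q) = p(1-q) + (1-p)q $ on a grid of $ \displaystyle q $ values and confirms the minimum sits at the endpoint $ \displaystyle q=1 $ whenever $ \displaystyle p \ge \frac{1}{2} $.

<pre>
# Evaluate P_err(q) = p(1-q) + (1-p)q on a grid of q values and report
# the minimizing q; for any p >= 0.5 the minimum is at q = 1 with value 1 - p.
def p_err(p, q):
    return p * (1 - q) + (1 - p) * q

p = 0.8  # any fixed p >= 0.5
qs = [i / 100 for i in range(101)]
best_q = min(qs, key=lambda q: p_err(p, q))
print(best_q, p_err(p, best_q))  # expect 1.0 and 1 - p = 0.2 (up to rounding)
</pre>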

Furthermore, on the interval $ \displaystyle p\in[\frac{1}{2}, 1] $, $ \displaystyle P_{err} $ is a strictly decreasing function. That is, the closer $ \displaystyle p $ is to $ \displaystyle \frac{1}{2} $, the worse the outcome can be predicted (the higher $ \displaystyle P_{err} $ is), and the farther $ \displaystyle p $ is from $ \displaystyle \frac{1}{2} $ the better it can be predicted. This is consistent with the information theoretic description of entropy (which has its maximum at $ \displaystyle p=\frac{1}{2} $) as the "average uncertainty in the outcome". Clearly the less uncertain the outcome is, the better we should expect to be able to predict it.
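
As a rough numerical illustration of that parallel, the following sketch (the helper name binary_entropy is just illustrative) compares the minimal error $ \displaystyle 1-p $ with the binary entropy $ \displaystyle H(p) = -p\log_2 p - (1-p)\log_2(1-p) $ at a few values of $ \displaystyle p $; both shrink as $ \displaystyle p $ moves away from $ \displaystyle \frac{1}{2} $.

<pre>
# Compare the minimal prediction error 1 - p with the binary entropy H(p);
# both are largest at p = 1/2 and decrease as p approaches 1.
import math

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.99):
    print(p, round(1 - p, 3), round(binary_entropy(p), 3))
</pre>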

As a concrete example, consider two approaches for predicting an experiment with $ \displaystyle p=.8 $ (i.e. $ \displaystyle E_0 $ occurs with probability .8 and $ \displaystyle E_1 $ occurs with probability .2). In the first approach we always predict $ \displaystyle E_0 $ (hence $ \displaystyle q = Pr(F_0) = 1, Pr(F_1) = 0 $). With this approach we have $ \displaystyle Pr(E_0 \cap F_0) = .8 $, $ \displaystyle Pr(E_0 \cap F_1) = 0 $, $ \displaystyle Pr(E_1 \cap F_0) = .2 $, $ \displaystyle Pr(E_1 \cap F_1) = 0 $. So $ \displaystyle P_{err} = .2 $.

In the second approach we predict randomly according to the distribution of the first experiment (i.e. $ \displaystyle q = Pr(F_0) = .8, Pr(F_1) = .2 $). With this approach we have $ \displaystyle Pr(E_0 \cap F_0) = .64 $, $ \displaystyle Pr(E_0 \cap F_1) = .16 $, $ \displaystyle Pr(E_1 \cap F_0) = .16 $, $ \displaystyle Pr(E_1 \cap F_1) = .04 $. So $ \displaystyle P_{err} = .32 $, substantially worse.
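
As a sanity check on these numbers, here is a minimal Monte Carlo sketch (the helper name error_rate is just illustrative) that simulates both approaches for $ \displaystyle p=.8 $; the estimates should land near .2 and .32 respectively.

<pre>
# Simulate the prediction experiment: draw the true outcome with Pr(0) = p
# and the prediction with Pr(0) = q, then estimate the error rate.
import random

def error_rate(p, q, trials=100000):
    errors = 0
    for _ in range(trials):
        outcome = 0 if random.random() < p else 1
        prediction = 0 if random.random() < q else 1
        errors += (outcome != prediction)
    return errors / trials

print(error_rate(0.8, 1.0))  # always predict outcome 0: about 0.2
print(error_rate(0.8, 0.8))  # predict randomly by the distribution: about 0.32
</pre>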

--Jvaught 19:59, 26 January 2010 (UTC)
