(continuation of pretty-fication process. still unfinished.)
Line 1: Line 1:
I wanted to understand why picking the most likely is the best you can do (better than choosing randomly from an identical distribution) so I worked it out as follows.
+
== Theorem ==
 +
 
 +
Choosing the most likely outcome is the best you can do (better than choosing randomly from an identical distribution).
 +
 
 +
=== Proof ===
  
 
Consider a random experiment with 2 outcomes, 0 and 1.
 
Consider a random experiment with 2 outcomes, 0 and 1.
  
Let <math>\displaystyle E_0</math> be the event that outcome 0 occurs and <math>\displaystyle E_1</math> be the event that outcome 1 occurs.
+
Let <i>E<sub>0</sub></i> be the event that outcome 0 occurs and <i>E<sub>1</sub></i> be the event that outcome 1 occurs.
  
Let <math>\displaystyle Pr(E_0) = p</math> and <math>\displaystyle Pr(E_1) = 1-p</math>, where <math>\displaystyle p</math> is some fixed but arbitrary probability.
+
Let Pr(<i>E<sub>0</sub></i>) = <i>p</i> and Pr(<i>E<sub>1</sub></i>) = 1-<i>p</i>, where <i>p</i> is some fixed probability.
  
Assume, without loss of generality, that <math>\displaystyle p \ge \frac{1}{2}</math> (relabelling the outcomes if necessary).
+
Assume, without loss of generality, that <i>p</i> &ge; 1/2.  That is, that <i>E<sub>0</sub></i> is the most likely outcome.
  
  
Consider a joint, independent random experiment intended to predict the outcome of the first.   
+
Consider an independent random experiment intended to predict the outcome of the first.   
  
Let <math>\displaystyle F_0</math> be the event that outcome 0 is predicted and <math>\displaystyle F_1</math> be the event that outcome 1 is predicted.   
+
Let <i>F<sub>0</sub></i> be the event that outcome 0 is predicted and <i>F<sub>1</sub></i> be the event that outcome 1 is predicted.   
  
Let <math>\displaystyle Pr(F_0) = q</math> and <math>\displaystyle Pr(F_1) = 1-q</math>
+
Let Pr(<i>F<sub>0</sub></i>) = <i>q</i> and Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i>.
  
  
The probability of error is
+
An error occurs when we predict incorrectly.  The probability of error is
  
 
<math>\displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0)</math>.
 
<math>\displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0)</math>.
Line 32: Line 36:
  
  
We would like to choose <math>\displaystyle q\in[0,1]</math> to minimize <math>\displaystyle P_{err}</math>.  Since <math>\displaystyle P_{err}</math> is linear in <math>\displaystyle q</math>, the extrema are at the endpointsHence, evaluating at <math>\displaystyle q=0</math> and <math>\displaystyle q=1</math>, the minimal <math>\displaystyle P_{err}</math> is <math>\displaystyle 1-p</math> at <math>\displaystyle q=1</math>.  Thus the optimal strategy for predicting the outcome of the first experiment is to always (with probability 1) predict the more likely outcome.
+
We would like to choose <i>q</i> &isin; [0, 1] to minimize <i>P<sub>err</sub></i>.   
  
Futhermore, on the interval <math>\displaystyle p\in[\frac{1}{2}, 1]</math>, <math>\displaystyle P_{err}</math> is a strictly decreasing function.  That is, the closer <math>\displaystyle p</math> is to <math>\displaystyle \frac{1}{2}</math>, the worse it can be predicted (the higher <math>\displaystyle P_{err}</math> is), and the farther <math>\displaystyle p</math> is from <math>\displaystyle \frac{1}{2}</math> the better it can be predicted.  This is consistent with the information theoretic description of entropy (which has its maximum at <math>\displaystyle p=\frac{1}{2}</math>) as the "average uncertainty in the outcome"Clearly the less uncertain the outcome is, the better we should expect to be able to predict it.
+
Since <i>P<sub>err</sub></i> is linear in <i>q</i>, the extrema are at the endpoints.   
  
As a concrete example, consider two approaches for predicting an experiment with <math>\displaystyle p=.8</math> (i.e. <math>\displaystyle E_0</math> occurs with probability .8 and <math>\displaystyle E_1</math> occurs with probability .2).  In the first approach we always predict <math>\displaystyle E_0</math> (hence <math>\displaystyle q = Pr(F_0) = 1, Pr(F_1) = 0</math>). With this approach we have <math>\displaystyle Pr(\{(E_0, F_0)\}) = .8</math>, <math>\displaystyle Pr(\{(E_0, F_1)\}) = 0</math>, <math>\displaystyle Pr(\{(E_1, F_0)\}) = .2</math>, <math>\displaystyle Pr(\{(E_1, F_1)\}) = 0</math>.  So <math>\displaystyle P_{err} = .2</math>.
+
Evaluating at <i>q</i> = 0 and <i>q</i> = 1, we find the minimal <i>P<sub>err</sub></i> is 1-<i>p</i> at <i>q</i> = 1 (since <i>p</i> &ge; 1/2). 
 +
 
 +
<u>Thus, the optimal strategy for predicting the outcome of the first experiment is:</u>
 +
 
 +
<b>Always (with probability <i>q</i> = 1) predict <i>E<sub>0</sub></i>, the most likely outcome.</b>
 +
 
 +
 
 +
Futhermore, <i>P<sub>err</sub></i> is a strictly decreasing function of <i>p</i> on the interval <i>p</i> &isin; [1/2, 1].
 +
 
 +
That is, as <i>p</i> goes from 1/2 to 1, the accuracy of the prediction increases.
 +
 
 +
This is consistent with the information theoretic description of entropy (which has its maximum at <i>p</i> = 1/2) as the "average uncertainty in the outcome". 
 +
 
 +
Clearly, the less uncertain the outcome is, the better we should expect to be able to predict it.
 +
 
 +
 
 +
== Example ==
 +
 
 +
Consider two approaches for predicting an experiment with <i>p</i> = 0.8 (i.e. <i>E<sub>0</sub></i> occurs with probability 0.8 and <i>E<sub>1</sub></i> occurs with probability 0.2).   
 +
 
 +
=== Approach 1 ===
 +
 
 +
In the first approach we proceed as above and always predict <i>F<sub>0</sub></i> (i.e Pr(<i>F<sub>0</sub></i>) = <i>q</i> = 1, Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i> = 0).
 +
 
 +
With this approach we have
 +
 
 +
<math>\displaystyle Pr(E_0 \cap F_0) = 0.8</math>
 +
 
 +
<math>\displaystyle Pr(E_0 \cap F_1) = 0</math>
 +
 
 +
<math>\displaystyle Pr(E_1 \cap F_0) = 0.2</math>
 +
 
 +
<math>\displaystyle Pr(E_1 \cap F_1) = 0</math>
 +
 
 +
<math>\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.2</math>
 +
 
 +
=== Approach 2 ===
 +
 
 +
In the second approach we predict randomly according to the distribution of the first experiment (i.e. Pr(<i>F<sub>0</sub></i>) = <i>q</i> = 0.8, Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i> = 0.2). 
 +
 
 +
With this approach we have
 +
 
 +
<math>\displaystyle Pr(E_0 \cap F_0) = 0.64</math>
 +
 
 +
<math>\displaystyle Pr(E_0 \cap F_1) = 0.16</math>
 +
 
 +
<math>\displaystyle Pr(E_1 \cap F_0) = 0.16</math>
 +
 
 +
<math>\displaystyle Pr(E_1 \cap F_1) = 0.04</math>
 +
 
 +
<math>\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.32</math>
 +
 
 +
The error in the second approach is greater than the error in the first approach.
 +
 
 +
 
 +
<b>Therefore, always choosing the most likely outcome is better than choosing randomly according to the distribution.</b>
  
In the second approach we predict randomly according to the distribution of the first experiment (i.e. q = Pr(F_0) = .8, Pr(F_1) = .2).  With this approach we have <math>\displaystyle Pr(\{(E_0, F_0)\}) = .64</math>, <math>\displaystyle Pr(\{(E_0, F_1)\}) = .16</math>, <math>\displaystyle Pr(\{(E_1, F_0)\}) = .16</math>, <math>\displaystyle Pr(\{(E_1, F_1)\}) = .04</math>. So <math>\displaystyle P_{err} = .32</math>, substantially worse.
 
  
 
--[[User:Jvaught|Jvaught]] 19:59, 26 January 2010 (UTC)
 
--[[User:Jvaught|Jvaught]] 19:59, 26 January 2010 (UTC)
 +
 +
Minor edits and reformatting, [[User:Pritchey|Pritchey]] 10:55, 28 January 2010 (UTC)

Revision as of 05:55, 28 January 2010

Theorem

Choosing the most likely outcome is the best you can do (better than choosing randomly from an identical distribution).

Proof

Consider a random experiment with 2 outcomes, 0 and 1.

Let E0 be the event that outcome 0 occurs and E1 be the event that outcome 1 occurs.

Let Pr(E0) = p and Pr(E1) = 1-p, where p is some fixed probability.

Assume, without loss of generality, that p ≥ 1/2. That is, that E0 is the most likely outcome.


Consider an independent random experiment intended to predict the outcome of the first.

Let F0 be the event that outcome 0 is predicted and F1 be the event that outcome 1 is predicted.

Let Pr(F0) = q and Pr(F1) = 1-q.


An error occurs when we predict incorrectly. The probability of error is

$ \displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) $.

By independence,

$ \displaystyle Pr(E_0 \cap F_1) = Pr(E_0) \cdot Pr(F_1) $

$ \displaystyle Pr(E_1 \cap F_0) = Pr(E_1) \cdot Pr(F_0) $.

So,

$ \displaystyle P_{err} = Pr(E_0) \cdot Pr(F_1) + Pr(E_1) \cdot Pr(F_0) = p(1-q) + (1-p)q $.


We would like to choose q ∈ [0, 1] to minimize Perr.

Since Perr is linear in q, the extrema are at the endpoints.

Evaluating at q = 0 and q = 1, we find the minimal Perr is 1-p at q = 1 (since p ≥ 1/2).

Thus, the optimal strategy for predicting the outcome of the first experiment is:

Always (with probability q = 1) predict E0, the most likely outcome.


Futhermore, Perr is a strictly decreasing function of p on the interval p ∈ [1/2, 1].

That is, as p goes from 1/2 to 1, the accuracy of the prediction increases.

This is consistent with the information theoretic description of entropy (which has its maximum at p = 1/2) as the "average uncertainty in the outcome".

Clearly, the less uncertain the outcome is, the better we should expect to be able to predict it.


Example

Consider two approaches for predicting an experiment with p = 0.8 (i.e. E0 occurs with probability 0.8 and E1 occurs with probability 0.2).

Approach 1

In the first approach we proceed as above and always predict F0 (i.e Pr(F0) = q = 1, Pr(F1) = 1-q = 0).

With this approach we have

$ \displaystyle Pr(E_0 \cap F_0) = 0.8 $

$ \displaystyle Pr(E_0 \cap F_1) = 0 $

$ \displaystyle Pr(E_1 \cap F_0) = 0.2 $

$ \displaystyle Pr(E_1 \cap F_1) = 0 $

$ \displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.2 $

Approach 2

In the second approach we predict randomly according to the distribution of the first experiment (i.e. Pr(F0) = q = 0.8, Pr(F1) = 1-q = 0.2).

With this approach we have

$ \displaystyle Pr(E_0 \cap F_0) = 0.64 $

$ \displaystyle Pr(E_0 \cap F_1) = 0.16 $

$ \displaystyle Pr(E_1 \cap F_0) = 0.16 $

$ \displaystyle Pr(E_1 \cap F_1) = 0.04 $

$ \displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.32 $

The error in the second approach is greater than the error in the first approach.


Therefore, always choosing the most likely outcome is better than choosing randomly according to the distribution.


--Jvaught 19:59, 26 January 2010 (UTC)

Minor edits and reformatting, Pritchey 10:55, 28 January 2010 (UTC)

Alumni Liaison

Prof. Math. Ohio State and Associate Dean
Outstanding Alumnus Purdue Math 2008

Jeff McNeal