Latest revision as of 09:40, 1 February 2010

Optimality of Always Predicting the Most Likely Outcome

Theorem

Choosing the most likely outcome is the best you can do (better than choosing randomly from any distribution).

Proof

Consider a random experiment with 2 outcomes, 0 and 1.

Let E₀ be the event that outcome 0 occurs and E₁ be the event that outcome 1 occurs.

Let Pr(E₀) = p and Pr(E₁) = 1-p, where p is some fixed but arbitrary probability.

Assume, without loss of generality, that p ≥ 1/2. That is, that E₀ is the most likely outcome.

Consider a joint, independent random experiment intended to predict the outcome of the first.

Let F₀ be the event that outcome 0 is predicted and F₁ be the event that outcome 1 is predicted.

Let Pr(F₀) = q and Pr(F₁) = 1-q.

An error occurs when we predict incorrectly. The probability of error is

$\displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0)$ .

By independence,

$\displaystyle Pr(E_0 \cap F_1) = Pr(E_0) \cdot Pr(F_1)$

$\displaystyle Pr(E_1 \cap F_0) = Pr(E_1) \cdot Pr(F_0)$ .

So,

$\displaystyle P_{err} = Pr(E_0) \cdot Pr(F_1) + Pr(E_1) \cdot Pr(F_0) = p(1-q) + (1-p)q = (1-2p)q + p$ .

We would like to choose q ∈ [0, 1] to minimize P_err.

Since P_err is linear in q, the extrema are at the endpoints.

Evaluating at q = 0 and q = 1, we find the minimal P_err is 1-p at q = 1 (since p ≥ 1/2).

Thus, the optimal strategy for predicting the outcome of the first experiment is:

Always (with probability q = 1) predict E₀, the most likely outcome.

Futhermore, P_err is a strictly decreasing function of p on the interval p ∈ [1/2, 1].

That is, as p goes from 1/2 to 1, the accuracy of the prediction increases.

This is consistent with the information theoretic description of entropy (which has its maximum at p = 1/2) as the "average uncertainty in the outcome".

Clearly, the less uncertain the outcome is, the better we should expect to be able to predict it.

Example

Consider two approaches for predicting an experiment with p = 0.8 (i.e. E₀ occurs with probability 0.8 and E₁ occurs with probability 0.2).

Approach 1

In the first approach we proceed as above and always predict F₀ (i.e Pr(F₀) = q = 1, Pr(F₁) = 1-q = 0).

With this approach we have

$\displaystyle Pr(E_0 \cap F_0) = 0.8$

$\displaystyle Pr(E_0 \cap F_1) = 0$

$\displaystyle Pr(E_1 \cap F_0) = 0.2$

$\displaystyle Pr(E_1 \cap F_1) = 0$

$\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.2$

Approach 2

In the second approach we predict randomly according to the distribution of the first experiment (i.e. Pr(F₀) = q = 0.8, Pr(F₁) = 1-q = 0.2).

With this approach we have

$\displaystyle Pr(E_0 \cap F_0) = 0.64$

$\displaystyle Pr(E_0 \cap F_1) = 0.16$

$\displaystyle Pr(E_1 \cap F_0) = 0.16$

$\displaystyle Pr(E_1 \cap F_1) = 0.04$

$\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.32$

The error in the second approach is greater than the error in the first approach.

Therefore, always choosing the most likely outcome is better than choosing randomly according to the distribution.

--Jvaught 19:59, 26 January 2010 (UTC)

Minor edits and reformatting, Pritchey 10:55, 28 January 2010 (UTC)

Back to ECE662_topic2_discussions

Back to 2010 Spring ECE 662 mboutin

@@ Line 1: / Line 1: @@
-I wanted to understand why picking the most likely is the best you
+= Optimality of Always Predicting the Most Likely Outcome =
-can do (better than choosing randomly from an identical distribution)
-so I worked it out as follows.
+== Theorem ==
-Consider a random experiment with 2 outcomes.  Let <math>E_0</math>
+Choosing the most likely outcome is the best you can do (better than choosing randomly from any distribution).
-which occurs with fixed but arbitrary probability <math>p</math> be
-the event that outcome 0 occurs and <math>E_1</math> which occurs
-with probability <math>1-p</math> be the event that outcome 1 occurs.
-Assume without loss of generality that <math>p \ge \frac{1}{2}</math>
-(relabelling the outcomes if necessary).
+=== Proof ===
-Consider a joint, independent random experiment intended to
+Consider a random experiment with 2 outcomes, 0 and 1.
-predict the outcome of the first.  Let <math>F_0</math> be the
-event that outcome 0 is predicted and <math>F_1</math> be the
-event that outcome 1 is predicted.  Let <math>q</math> be the
-probability of <math>F_0</math>.
+Let <i>E<sub>0</sub></i> be the event that outcome 0 occurs and <i>E<sub>1</sub></i> be the event that outcome 1 occurs.
-Then the probability of error <math>P_{err} = Pr(\{(E_0, F_1),
+Let Pr(<i>E<sub>0</sub></i>) = <i>p</i> and Pr(<i>E<sub>1</sub></i>) = 1-<i>p</i>, where <i>p</i> is some fixed but arbitrary probability.
-(E_1, F_0)\}) = Pr(\{(E_0, F_1)\}) + Pr(\{(E_1, F_0)\})</math>.
-By independence <math>Pr(\{(E_0, F_1)\}) = Pr(\{E_0\}) \cdot
-Pr(\{F_1\})</math> and <math>Pr(\{(E_1, F_0)\}) = Pr(\{E_1\})
-\cdot Pr(\{F_0\})</math>.  So <math>P_{err} = p(1-q) + (1-p)q
-= p - 2pq + q = (1-2p)q + p</math>.
+Assume, without loss of generality, that <i>p</i> &ge; 1/2.  That is, that <i>E<sub>0</sub></i> is the most likely outcome.
-We would like to choose <math>q\in[0,1]</math> to minimize
-<math>P_{err}</math>.  Since <math>P_{err}</math> is linear
-in <math>q</math>, the extrema are at the endpoints.  Hence,
-evaluating at <math>q=0</math> and <math>q=1</math>, the
-minimal <math>P_{err}</math> is <math>1-p</math> at
-<math>q=1</math>.  Thus the optimal strategy for predicting
-the outcome of the first experiment is to always (with
-probability 1) predict the more likely outcome.
+Consider a joint, independent random experiment intended to predict the outcome of the first.
-Futhermore, on the interval <math>p\in[\frac{1}{2}, 1]</math>,
+Let <i>F<sub>0</sub></i> be the event that outcome 0 is predicted and <i>F<sub>1</sub></i> be the event that outcome 1 is predicted.
-<math>P_{err}</math> is a strictly decreasing function.  That is,
-the closer <math>p</math> is to <math>\frac{1}{2}</math>, the
-worse it can be predicted (the higher <math>P_{err}</math> is),
-and the farther <math>p</math> is from <math>\frac{1}{2}</math>
-the better it can be predicted.  This is consistent with the
-information theoretic description of entropy (which has its
-maximum at <math>p=\frac{1}{2}</math>) as the "average uncertainty
-in the outcome".  Clearly the less uncertain the outcome is,
-the better we should expect to be able to predict it.
+Let Pr(<i>F<sub>0</sub></i>) = <i>q</i> and Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i>.
-As a concrete example, consider two approaches for predicting
-an experiment with <math>p=.8</math> (i.e. <math>E_0</math> occurs
-with probability .8 and <math>E_1</math> occurs with probability
-.2).  In the first approach we always predict <math>E_0</math>
-(hence <math>q = Pr(F_0) = 1, Pr(F_1) = 0</math>).  With this
-approach we have <math>Pr(\{(E_0, F_0)\}) = .8</math>,
-<math>Pr(\{(E_0, F_1)\}) = 0</math>, <math>Pr(\{(E_1, F_0)\}) = .2</math>,
-<math>Pr(\{(E_1, F_1)\}) = 0</math>.  So <math>P_{err} = .2</math>.
-In the second approach we predict randomly according to the
+An error occurs when we predict incorrectly.  The probability of error is
-distribution of the first experiment (i.e. q = Pr(F_0) = .8,
-Pr(F_1) = .2).  With this approach we have
+<math>\displaystyle P_{err} = Pr((E_0 \cap F_1) \cup (E_1 \cap F_0)) = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0)</math>.
-<math>Pr(\{(E_0, F_0)\}) = .64</math>, <math>Pr(\{(E_0, F_1)\}) = .16</math>,
-<math>Pr(\{(E_1, F_0)\}) = .16</math>, <math>Pr(\{(E_1, F_1)\}) = .04</math>.
+By independence,
-So <math>P_{err} = .32</math>, substantially worse.
+<math>\displaystyle Pr(E_0 \cap F_1) = Pr(E_0) \cdot Pr(F_1)</math>
+<math>\displaystyle Pr(E_1 \cap F_0) = Pr(E_1) \cdot Pr(F_0)</math>.
+So,
+<math>\displaystyle P_{err} = Pr(E_0) \cdot Pr(F_1) + Pr(E_1) \cdot Pr(F_0) = p(1-q) + (1-p)q = (1-2p)q + p</math>.
+We would like to choose <i>q</i> &isin; [0, 1] to minimize <i>P<sub>err</sub></i>.
+Since <i>P<sub>err</sub></i> is linear in <i>q</i>, the extrema are at the endpoints.
+Evaluating at <i>q</i> = 0 and <i>q</i> = 1, we find the minimal <i>P<sub>err</sub></i> is 1-<i>p</i> at <i>q</i> = 1 (since <i>p</i> &ge; 1/2).
+<u>Thus, the optimal strategy for predicting the outcome of the first experiment is:</u>
+<b>Always (with probability <i>q</i> = 1) predict <i>E<sub>0</sub></i>, the most likely outcome.</b>
+Futhermore, <i>P<sub>err</sub></i> is a strictly decreasing function of <i>p</i> on the interval <i>p</i> &isin; [1/2, 1].
+That is, as <i>p</i> goes from 1/2 to 1, the accuracy of the prediction increases.
+This is consistent with the information theoretic description of entropy (which has its maximum at <i>p</i> = 1/2) as the "average uncertainty in the outcome".
+Clearly, the less uncertain the outcome is, the better we should expect to be able to predict it.
+== Example ==
+Consider two approaches for predicting an experiment with <i>p</i> = 0.8 (i.e. <i>E<sub>0</sub></i> occurs with probability 0.8 and <i>E<sub>1</sub></i> occurs with probability 0.2).
+=== Approach 1 ===
+In the first approach we proceed as above and always predict <i>F<sub>0</sub></i> (i.e Pr(<i>F<sub>0</sub></i>) = <i>q</i> = 1, Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i> = 0).
+With this approach we have
+<math>\displaystyle Pr(E_0 \cap F_0) = 0.8</math>
+<math>\displaystyle Pr(E_0 \cap F_1) = 0</math>
+<math>\displaystyle Pr(E_1 \cap F_0) = 0.2</math>
+<math>\displaystyle Pr(E_1 \cap F_1) = 0</math>
+<math>\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.2</math>
+=== Approach 2 ===
+In the second approach we predict randomly according to the distribution of the first experiment (i.e. Pr(<i>F<sub>0</sub></i>) = <i>q</i> = 0.8, Pr(<i>F<sub>1</sub></i>) = 1-<i>q</i> = 0.2).
+With this approach we have
+<math>\displaystyle Pr(E_0 \cap F_0) = 0.64</math>
+<math>\displaystyle Pr(E_0 \cap F_1) = 0.16</math>
+<math>\displaystyle Pr(E_1 \cap F_0) = 0.16</math>
+<math>\displaystyle Pr(E_1 \cap F_1) = 0.04</math>
+<math>\displaystyle P_{err} = Pr(E_0 \cap F_1) + Pr(E_1 \cap F_0) = 0.32</math>
+The error in the second approach is greater than the error in the first approach.
+<b>Therefore, always choosing the most likely outcome is better than choosing randomly according to the distribution.</b>
 --[[User:Jvaught|Jvaught]] 19:59, 26 January 2010 (UTC)
+Minor edits and reformatting, [[User:Pritchey|Pritchey]] 10:55, 28 January 2010 (UTC)
+----
+[[ECE662_topic2_discussions|Back to ECE662_topic2_discussions]]
+[[ 2010 Spring ECE 662 mboutin|Back to 2010 Spring ECE 662 mboutin]]

Difference between revisions of "EE662Sp10OptimalPrediction" - Rhea