Revision as of 11:48, 8 May 2010 by Roy8 (Talk | contribs)

Thursday, April 8th 2010
(Continuation of the linear discriminant of lecture 19)

Let's define the vectors $ \vec{y}_{i} $, $ i \in \left (1,N \right) $ such that $ \vec{y}_{i}\in \mathcal{D} $ and $ \mathcal{D} = \mathcal{D}_{1} \cup \mathcal{D}_{2} $.

Let's define $ \mathcal{D} $ such that $ \mathcal{D} = \mathcal{D}_{1} \cup \mathcal{D}_{2} $.

There exists a vector $ \vec{c} $ for which:

$ \begin{align} \vec{c} \cdot \left ( 1, y_{1}, y_{2}, ... , y_{n} \right ) & > 0, \forall \left ( y_{1}, ... , y_{n} \right ) \in \mathcal{D}_{1} \\ & < 0, \forall \left ( y_{1}, ... , y_{n} \right ) \in \mathcal{D}_{2} \\ \end{align} $

We can transform the second inequality into:
$ \vec{c} \cdot \left ( -1, -y_{1}, -y_{2}, ... , -y_{n} \right ) > 0, \forall \left ( y_{1}, ... , y_{n} \right ) \in \mathcal{D}_{2} $


Example in 1D:
Fig1.png

Fig2.png $ g(y_{1}) = -y_{1} + 3.5 $

$ g(y_{1}) = 0 $ is the decision hyperplane.

$ \vec{c} \cdot \left ( X_{1}, X_{2} \right ) = 0 $ is equivalent here to $ 3.5 X_{1} - X_{2} = 0 $ which is the decision hyperplane in the new space.
$ \vec{c} = \left ( 3.5, -1 \right ) $


$ \vec{c} $ is not unique. It can be any vector with the same direction, it's norm does not count.

$ \mathcal{D}_{1} \begin{cases} \vec{y} = (3) \\ \vec{y} = (2) \\ \vec{y} = (1) \\ \vec{y} = (-1) \\ \vec{y} = (-2) \\ \vec{y} = (-3) \end{cases} \mathcal{D}_{2} \begin{cases} \vec{y} = (4) \\ \vec{y} = (5) \\ \vec{y} = (6) \\ \vec{y} = (7) \end{cases} g(y_{1}) = cst.y_{1} + cst $

                      $ \Downarrow $

$ new \mathcal{D}_{1} \begin{cases} \vec{y} = (1,3) \\ \vec{y} = (1,2) \\ \vec{y} = (1,1) \\ \vec{y} = (1,-1) \\ \vec{y} = (1,-2) \\ \vec{y} = (1,-3) \end{cases} new \mathcal{D}_{2} \begin{cases} \vec{y} = (1,4) \\ \vec{y} = (1,5) \\ \vec{y} = (1,6) \\ \vec{y} = (1,7) \end{cases} $

$ \vec{c} \cdot \vec{y} > 0, \vec{y} \in \mathcal{D}_{1} $
$ \vec{c} \cdot \vec{y} < 0, \vec{y} \in \mathcal{D}_{2} $

We modify $ \mathcal{D}_{2} $ such that $ \vec{c} \cdot \vec{y} > 0, \forall \vec{y} \in \mathcal{D} $:
$ new \mathcal{D}_{2} \begin{cases} \vec{y} = (-1,-4) \\ \vec{y} = (-1,-5) \\ \vec{y} = (-1,-6) \\ \vec{y} = (-1,-7) \end{cases} $

$ \vec{c} $ is not unique.

1) Norm ambiguity

If $ \vec{c} $ separates the data, i.e. $ \vec{c} \cdot \vec{y} > 0, \forall \vec{y} \in \mathcal{D} $, then $ \lambda\vec{c} $ also separates it, $ \forall \lambda \in \mathcal{R}_{>0} $.
$ \vec{c} \cdot \vec{y} > 0 \Rightarrow \lambda (\vec{c} \cdot \vec{y}) > 0 $
We can set $ \left | \vec{c} \right | = 1 $.

2) Separation location ambiguity
Fig3.png
Here we can take $ \vec{c} = (\beta,-1) $, with any $ \beta \in (3,4) $.

Fig4.png
Fig7.png
To uniquely define $ \vec{c} $, introduce margin $ b>0 $ and ask that $ \vec{c} $ be the minimum length vector such that $ \vec{c} \cdot \vec{y} \ge b, \forall \vec{y} \in \mathcal{D} $ (after transformation). Then take the separation as $ \vec{c} \cdot \vec{y} = 0 $.

Observe

$ \exists \vec{y}_{i_{0}} \in \mathcal{D} $ such that $ \vec{c} \cdot \vec{y}_{i_{0}} = b $
Otherwise write $ \vec{c} \cdot \vec{y}_{i} = b + \epsilon_{i}, \epsilon_{i} > 0 $
Let $ \vec{c}^\prime = \dfrac{b}{b+\epsilon_{i_{0}}}\vec{c} $ where $ \epsilon_{i_{0}} = min_{i} \epsilon_{i} $ Observe that $ \left | \vec{c}^\prime \right | < \left | \vec{c} \right | $ and $ \vec{c}^\prime \cdot \vec{y} = \dfrac{b}{b+\epsilon_{i_{0}}}\vec{c} \cdot \vec{y} = \dfrac{b}{b+\epsilon_{i_{0}}} ( b + \epsilon_{i} ) \ge b, \forall \epsilon_{i} $ (equality when $ i = i_{0} $)


$ \vec{c} \cdot \vec{y}_{i_{0}} = b $ means $ \vec{y}_{i_{0}} $ belongs to the plane with normal vector $ \vec{c} $ and at distance $ \dfrac{b}{\left | \vec{y}_{i_{0}} \right |} $ from the origin (no other $ \vec{y}_{i} \in \mathcal{D} $ is closer to it).

Typically its impossible to satisfy $ \vec{c} \cdot \vec{y} > 0, \forall \vec{y} \in \mathcal{D} $. Try to do "as best as you can".

Idea:
Try to minimize the number of misclassified training samples.

Let $ J(\vec{c}) = \left | \vec{y}_{1} \in \mathcal{D} \right | \vec{c} \cdot \vec{y}_{1} \le 0 $ be the cost function. Hard to minimize !

Other idea "Perceptron criteria function".

Let $ J_{p}(\vec{c}) = \sum_{\vec{y}_{i} misclassified <br> \vec{y}_{i} \in \mathcal{D}} -\vec{c} \cdot \vec{y}_{i} $

Measures how "far away" you are from classifying all training data correctly.

Geometric interpretation

Fig6.png Distance from yi to plane $ \vec{c} \cdot \vec{y} = 0 $ is:

$ \left | projection\, of\, \vec{y}_{i}\, onto\, \dfrac{\vec{c}}{\left | \vec{c} \right |} \right | = \left | \vec{y}_{i} \cdot \dfrac{\vec{c}}{\left | \vec{c} \right |} \right | = \dfrac{}{\left | \vec{c} \right |} . \left | \vec{y}_{i} \cdot \vec{c} \right | $

$ (\vec{c} \cdot \vec{y}_{i})\, is\, negative\, when\, \vec{y}_{i}\, is\, misclassified $



So $ -\vec{c} \cdot \vec{y}_{i} $ is proportional to the distance from yi to the right side (being right).
Can use gradient descent procedure to minimize $ J_{p}(\vec{c}) $ $ \nabla J_{p}(\vec{c}) = \begin{pmatrix} \dfrac{\partial}{\partial c_{1}} \\ \dfrac{\partial}{\partial c_{2}} \\ \vdots \\ \dfrac{\partial}{\partial c_{n+1}} \end{pmatrix} J_{p}(\vec{c}) = \begin{pmatrix} \dfrac{\partial}{\partial c_{1}} \\ \dfrac{\partial}{\partial c_{2}} \\ \vdots \\ \dfrac{\partial}{\partial c_{n+1}} \end{pmatrix} \sum_{\vec{y}_{i}\, misclassified} -\vec{c} \cdot \vec{y}_{i} = \begin{pmatrix} \sum_{\vec{y}_{i}\, miscl.} -\vec{y}_{i,1} \\ \sum_{\vec{y}_{i}\, miscl.} -\vec{y}_{i,2} \\ \vdots \\ \sum_{\vec{y}_{i}\, miscl.} -\vec{y}_{i,n+1} \end{pmatrix} $

Follow basic gradient descent procedure:

  • initialize with guess $ \vec{c}_{1} $
  • upgrade to better guess by moving in direction of steepest descent
    $ \vec{c}_{2} = \vec{c}_{1} - \underbrace{\eta(1)}_{step\, size} \nabla J_{p}(\vec{c}_{1}) $
  • iterate $ \vec{c}_{k+1} = \vec{c}_{k} - \eta(k) \nabla J_{p}(\vec{c}_{k}) $ until convergence


Interpretation:
$ \vec{c}_{k+1} = \vec{c}_{k} - \underbrace{\eta(k) \sum_{\vec{y}_{i}\, misclassified} -\vec{y}_{i}}_{ \begin{matrix} update\, by\, adding\, a\, constant\, \\ multiple\, of\, the\, sum\, of\, all\, \\ misclassified\, samples \end{matrix} } $

When dealing with multiple classes, 2 possible ways:

  • take the classes 2 by 2 -> 2(n+1) separations
  • or take each class against the rest -> (n+1) separations




(--roy8 17:48, 8 May 2010 (UTC))

Alumni Liaison

Ph.D. 2007, working on developing cool imaging technologies for digital cameras, camera phones, and video surveillance cameras.

Buyue Zhang