(See Lecture 10_OldKiwi)
The following theorem, due to Novikoff (1962), proves the convergence of the perceptron on linearly separable samples. This proof is taken from Learning Kernel Classifiers: Theory and Algorithms by Ralf Herbrich.
Consider the following definitions:
A training set $ z=(x,y)\in\mathcal{Z}^m $ of $ m $ labeled examples, together with a weight vector $ c $ defining the candidate separating hyperplane.
"Functional margin" on example $ (x_i,y_i)\in z $ is defined as $ \tilde{\gamma}_i(c) = y_i\langle x_i,c \rangle $;
"Functional margin" on training set $ z $ is defined as $ \tilde{\gamma}_z(c) = \min_{(x_i,y_i)\in z} \tilde{\gamma}_i(c) $;
"Geometrical margin" on example $ (x_i,y_i)\in z $ is defined as $ \gamma_i(c) = \frac{\tilde{\gamma}_i(c)}{\|c\|} $;
"Geometrical margin" on training set $ z $ is defined as $ \gamma_z(c) = \frac{\tilde{\gamma}_z(c)}{\|c\|} $.
Let $ \psi = \max_{x_i\in x} \|\phi(x_i)\| $, where $ \phi(x_i) $ is the feature vector for $ x_i $.
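To make these definitions concrete, here is a minimal Python/NumPy sketch that computes the functional margin, the geometrical margin, and $ \psi $ on a small toy training set. The identity feature map, the toy data, and the function names are illustrative assumptions and not part of the original notes.
<pre>
import numpy as np

def margins(X, y, c):
    """Functional and geometrical margins of weight vector c on the training set (X, y)."""
    functional = y * (X @ c)                        # gamma_tilde_i(c) = y_i <x_i, c>
    functional_z = functional.min()                 # gamma_tilde_z(c) = min_i gamma_tilde_i(c)
    geometric_z = functional_z / np.linalg.norm(c)  # gamma_z(c) = gamma_tilde_z(c) / ||c||
    return functional_z, geometric_z

def radius(X):
    """psi = max_i ||phi(x_i)||, here with phi taken as the identity map (an assumption)."""
    return np.linalg.norm(X, axis=1).max()

# Toy linearly separable data with labels in {-1, +1}; values chosen only for illustration.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
c = np.array([1.0, 1.0])                            # a candidate weight vector
print(margins(X, y, c), radius(X))
</pre>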
Now let $ c_k $ be the final solution vector after $ k $ update steps, writing $ x_i $ as shorthand for its feature vector $ \phi(x_i) $. Then the last update of the Perceptron algorithm is
$ c_k = c_{k-1} + y_ix_i $,
where $ (x_i,y_i) $ is the misclassified example that triggered the update.
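As a minimal sketch of this update rule (assuming $ c_0=0 $, an identity feature map, and labels in $ \{-1,+1\} $; the function name and the epoch limit are illustrative assumptions), a Perceptron training loop can be written in Python as follows.
<pre>
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Run the Perceptron; return the final weight vector and the number of updates k."""
    c = np.zeros(X.shape[1])                 # c_0 = 0
    k = 0                                    # number of updates (mistakes) made so far
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(c, x_i) <= 0:    # mistake: functional margin is not positive
                c = c + y_i * x_i            # update: c_k = c_{k-1} + y_i x_i
                k += 1
                mistakes += 1
        if mistakes == 0:                    # converged: every example has positive margin
            break
    return c, k
</pre>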
For any unit-length vector $ c^* $ (i.e. $ \|c^*\|=1 $) that separates the training set with positive margin $ \gamma_z(c^*)>0 $, we have
$ \langle c^*,c_k \rangle = \langle c^*,c_{k-1} \rangle + y_i \langle c^*,x_i \rangle \geq \langle c^*,c_{k-1}\rangle + \gamma_z(c^*) \geq \ldots \geq k\gamma_z(c^*), $
where the inequalities follow from repeatedly applying the same argument back to step 0, where we assume $ c_0=0 $. Similarly, by the algorithm definition an update is made only on a mistake, so $ y_i\langle c_{k-1},x_i \rangle \leq 0 $ and $ \|x_i\|^2 \leq \psi^2 $, giving
$ \|c_k\|^2 = \|c_{k-1}\|^2 + 2y_i\langle c_{k-1},x_i \rangle + \|x_i\|^2 \leq \|c_{k-1}\|^2 + \psi^2 \leq \ldots \leq k\psi^2. $
Then by the Cauchy-Schwarz inequality and $ \|c^*\|=1 $, we have
$ k\gamma_z(c^*) \leq \langle c^*,c_k \rangle \leq \|c^*\| \cdot \|c_k\| \leq \sqrt{k}\,\psi. $
It follows that the number of updates (mistakes) made by the Perceptron algorithm has a finite upper bound, i.e.
$ k\leq \left( \frac{\psi}{\gamma_z(c^*)}\right)^2. $
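As an illustrative numerical check (not part of the original proof), the two sketches above can be combined to confirm the bound on the toy data: taking $ c^* $ to be any unit-length separating vector, for instance the normalized final Perceptron weights, the observed number of updates $ k $ satisfies $ k\leq(\psi/\gamma_z(c^*))^2 $.
<pre>
# Assumes X, y, margins, radius, and perceptron from the sketches above.
c_final, k = perceptron(X, y)
c_star = c_final / np.linalg.norm(c_final)   # a unit-length separating vector
_, gamma_z = margins(X, y, c_star)           # geometrical margin of c_star on z
psi = radius(X)
print(k, (psi / gamma_z) ** 2)               # Novikoff bound: k <= (psi / gamma_z)^2
assert k <= (psi / gamma_z) ** 2
</pre>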