Line 16: | Line 16: | ||
Let <math>\psi = \max_{x_i\in x} \|\phi(x_i)\|</math>, where <math>\phi(x_i)</math> is the feature vector for <math>x_i</math>. | Let <math>\psi = \max_{x_i\in x} \|\phi(x_i)\|</math>, where <math>\phi(x_i)</math> is the feature vector for <math>x_i</math>. | ||
+ | |||
+ | Now let <math>c_k</math> be a final solution vector after <math>k</math> steps. Then the last update of the Perceptron algorithm has | ||
+ | |||
+ | <math>c_k = c_{k-1} + y_ix_i</math>. | ||
+ | |||
+ | For any vector, |c*|, we have | ||
+ | |||
+ | |cc_kinnerprod|, | ||
+ | |||
+ | .. |c*| image:: tex | ||
+ | :alt: tex: c^* | ||
+ | |||
+ | .. |cc_kinnerprod| image:: tex | ||
+ | :alt: tex: \langle c^*,c_k \rangle = \langle c^*,c_{k-1} \rangle + y_i \langle c^*,x_i \rangle \geq \langle c^*,c_{k-1}\rangle + \gamma_z(c^*) \geq \ldots \geq k\gamma_z(c^*) | ||
+ | |||
+ | where the inequalities follow from repeated applications up to step 0 where we assume |c0|. Similarly, by the algorithm definition, | ||
+ | |||
+ | |cknorm| | ||
+ | |||
+ | .. |c0| image:: tex | ||
+ | :alt: tex: c_0=0 | ||
+ | |||
+ | .. |cknorm| image:: tex | ||
+ | :alt: tex: \|c_k\|^2 = \|c_{k-1}\|^2 + 2y_i\langle c_{k-1},x_i \rangle + \|x_i\|^2 \leq \|c_{k-1}\|^2 + \psi^2 \leq \ldots \leq k\psi^2. | ||
+ | |||
+ | Then by the Cauchy-Schwartz inequality, we have | ||
+ | |||
+ | |kgamz|. | ||
+ | |||
+ | .. |kgamz| image:: tex | ||
+ | :alt: tex: k\gamma_z(c^*) \leq \langle c^*,c_k \rangle \leq \|c^*\| \cdot \|c_k\| \leq \sqrt{k} \psi | ||
+ | |||
+ | It follows, then, that the number of required iterations of the Perceptron algorithm has a finite upper bound, i.e. | ||
+ | |||
+ | |kbound| | ||
+ | |||
+ | .. |kbound| image:: tex | ||
+ | :alt: tex: k\leq \left( \frac{\psi}{\gamma_z(c^*)}\right)^2 | ||
.. |zdef| image:: tex | .. |zdef| image:: tex | ||
Line 46: | Line 84: | ||
.. |xi| image:: tex | .. |xi| image:: tex | ||
:alt: tex: x_i | :alt: tex: x_i | ||
− | |||
− | |||
− | |||
− | |||
.. |stagekdef| image:: tex | .. |stagekdef| image:: tex | ||
Line 59: | Line 93: | ||
.. |k| image:: tex | .. |k| image:: tex | ||
:alt: tex: k | :alt: tex: k | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |
Revision as of 13:10, 28 March 2008
(See Lecture 10_OldKiwi)
The following theorem, due to Novikoff (1962), proves the convergence of a perceptron using linearly-separable samples. This proof was taken from Learning Kernel Classifiers, Theory and Algorithms By Ralf Herbrich
Consider the following definitions:
A training set $ z=(x,y)\in\mathbb{Z}^m $
"Functional margin" on example $ (x_i,y_i)\in z $ is defined as $ \tilde{\gamma}_i = y_i\langle x_i,c \rangle $;
"Functional margin" on training set $ z $ is defined as $ \tilde{\gamma}_z = \min_{(x_i,y_i)\in z} \tilde{\gamma}_i $;
"Geometrical margin" on example $ (x_i,y_i)\in z $ is defined as $ \gamma_i = \frac{\tilde{\gamma}_i(c)}{\|c\|} $;
"Geometrical margin" on training set $ z $ is defined as $ \gamma_z = \frac{\tilde{\gamma}_z(c)}{\|c\|} $.
Let $ \psi = \max_{x_i\in x} \|\phi(x_i)\| $, where $ \phi(x_i) $ is the feature vector for $ x_i $.
Now let $ c_k $ be a final solution vector after $ k $ steps. Then the last update of the Perceptron algorithm has
$ c_k = c_{k-1} + y_ix_i $.
For any vector, |c*|, we have
|cc_kinnerprod|,
.. |c*| image:: tex
:alt: tex: c^*
.. |cc_kinnerprod| image:: tex
:alt: tex: \langle c^*,c_k \rangle = \langle c^*,c_{k-1} \rangle + y_i \langle c^*,x_i \rangle \geq \langle c^*,c_{k-1}\rangle + \gamma_z(c^*) \geq \ldots \geq k\gamma_z(c^*)
where the inequalities follow from repeated applications up to step 0 where we assume |c0|. Similarly, by the algorithm definition,
|cknorm|
.. |c0| image:: tex
:alt: tex: c_0=0
.. |cknorm| image:: tex
:alt: tex: \|c_k\|^2 = \|c_{k-1}\|^2 + 2y_i\langle c_{k-1},x_i \rangle + \|x_i\|^2 \leq \|c_{k-1}\|^2 + \psi^2 \leq \ldots \leq k\psi^2.
Then by the Cauchy-Schwartz inequality, we have
|kgamz|.
.. |kgamz| image:: tex
:alt: tex: k\gamma_z(c^*) \leq \langle c^*,c_k \rangle \leq \|c^*\| \cdot \|c_k\| \leq \sqrt{k} \psi
It follows, then, that the number of required iterations of the Perceptron algorithm has a finite upper bound, i.e.
|kbound|
.. |kbound| image:: tex
:alt: tex: k\leq \left( \frac{\psi}{\gamma_z(c^*)}\right)^2
.. |zdef| image:: tex
:alt: tex: z=(x,y)\in\mathbb{Z}^m
.. |z| image:: tex
:alt: tex: z
.. |xiyiinz| image:: tex
:alt: tex: (x_i,y_i)\in z
.. |gammaitildeDef| image:: tex
:alt: tex: \tilde{\gamma}_i = y_i\langle x_i,c \rangle
.. |gammaztildeDef| image:: tex
:alt: tex: \tilde{\gamma}_z = \min_{(x_i,y_i)\in z} \tilde{\gamma}_i
.. |gammaidef| image:: tex
:alt: tex: \gamma_i = \frac{\tilde{\gamma}_i(c)}{\|c\|}
.. |gammazdef| image:: tex
:alt: tex: \gamma_z = \frac{\tilde{\gamma}_z(c)}{\|c\|}
.. |psidef| image:: tex
:alt: tex: \psi = \max_{x_i\in x} \|\phi(x_i)\|
.. |feature| image:: tex
:alt: tex: \phi(x_i)
.. |xi| image:: tex
:alt: tex: x_i
.. |stagekdef| image:: tex
:alt: tex: c_k = c_{k-1} + y_ix_i
.. |ck| image:: tex
:alt: tex: c_k
.. |k| image:: tex
:alt: tex: k