A proof of the convergence of the Perceptron
ECE662: Statistical decision theory and decision making processes
(Supplement to Lecture 10 of the lecture notes)
The following theorem, due to Novikoff (1962), proves the convergence of a perceptron using linearly separable samples. This proof is taken from Learning Kernel Classifiers: Theory and Algorithms by Ralf Herbrich.
Consider the following definitions:
A training set $ z=(x,y)\in\mathcal{Z}^m $ consisting of $ m $ labeled examples $ (x_i,y_i) $, with labels $ y_i\in\{-1,+1\} $;
The "functional margin" of a weight vector $ c $ on an example $ (x_i,y_i)\in z $ is defined as $ \tilde{\gamma}_i(c) = y_i\langle x_i,c \rangle $;
The functional margin of $ c $ on the training set $ z $ is defined as $ \tilde{\gamma}_z(c) = \min_{(x_i,y_i)\in z} \tilde{\gamma}_i(c) $;
The "geometric margin" of $ c $ on an example $ (x_i,y_i)\in z $ is defined as $ \gamma_i(c) = \frac{\tilde{\gamma}_i(c)}{\|c\|} $;
The geometric margin of $ c $ on the training set $ z $ is defined as $ \gamma_z(c) = \frac{\tilde{\gamma}_z(c)}{\|c\|} $.
Let $ \psi = \max_{x_i\in x} \|\phi(x_i)\| $, where $ \phi(x_i) $ is the feature vector for $ x_i $; in what follows we simply write $ x_i $ for its feature vector, so that $ \|x_i\|\leq\psi $ for every training example.
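To make these definitions concrete, here is a minimal sketch in Python (the toy data, labels, and candidate vector $ c $ below are made up purely for illustration and are not part of the original proof) that computes the functional margin, the geometric margin, and $ \psi $:

 import numpy as np

 # Toy, linearly separable data (illustrative only): rows are feature vectors x_i.
 x = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
 y = np.array([1, 1, -1, -1])        # labels y_i in {-1, +1}
 c = np.array([1.0, 1.0])            # a candidate weight vector

 functional_margins = y * (x @ c)                               # gamma~_i(c) = y_i <x_i, c>
 functional_margin_z = functional_margins.min()                 # gamma~_z(c) = min_i gamma~_i(c)
 geometric_margin_z = functional_margin_z / np.linalg.norm(c)   # gamma_z(c)
 psi = np.linalg.norm(x, axis=1).max()                          # psi = max_i ||x_i||

 print(functional_margin_z, geometric_margin_z, psi)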
Now let $ c_k $ be the weight vector after $ k $ update steps, and let $ (x_i,y_i) $ be the example on which the $ k $-th (final) update is made. The last update of the Perceptron algorithm is then
$ c_k = c_{k-1} + y_ix_i $.
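For concreteness, here is a minimal sketch of the update loop being analyzed, under the standard convention that an update is made only on a misclassified example, i.e. when $ y_i\langle c,x_i\rangle\leq 0 $ (the function name and loop structure are illustrative, not taken from the lecture notes):

 import numpy as np

 def perceptron(x, y, max_passes=1000):
     """Run the perceptron update rule; return the final weights and the update count k."""
     c = np.zeros(x.shape[1])              # c_0 = 0, as assumed in the proof
     k = 0                                 # number of updates performed
     for _ in range(max_passes):
         made_update = False
         for xi, yi in zip(x, y):
             if yi * np.dot(c, xi) <= 0:   # example is misclassified
                 c = c + yi * xi           # c_k = c_{k-1} + y_i x_i
                 k += 1
                 made_update = True
         if not made_update:               # every example is correctly classified
             break
     return c, k

With this update rule in hand, the proof proceeds as follows.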
For any unit-norm vector $ c^* $ (i.e. $ \|c^*\|=1 $, so that $ \gamma_z(c^*)=\tilde{\gamma}_z(c^*) $), we have
$ \langle c^*,c_k \rangle = \langle c^*,c_{k-1} \rangle + y_i \langle c^*,x_i \rangle \geq \langle c^*,c_{k-1}\rangle + \gamma_z(c^*) \geq \ldots \geq k\gamma_z(c^*) $,
where the first inequality uses $ y_i\langle c^*,x_i \rangle \geq \tilde{\gamma}_z(c^*) = \gamma_z(c^*) $, and the remaining inequalities follow by applying the same argument to each earlier update, back to step 0, where we assume $ c_0=0 $. Similarly, since the algorithm updates only on misclassified examples, so that $ y_i\langle c_{k-1},x_i \rangle \leq 0 $, and since $ \|x_i\|\leq\psi $,
$ \|c_k\|^2 = \|c_{k-1}\|^2 + 2y_i\langle c_{k-1},x_i \rangle + \|x_i\|^2 \leq \|c_{k-1}\|^2 + \psi^2 \leq \ldots \leq k\psi^2. $
Then by the Cauchy-Schwarz inequality, we have
$ k\gamma_z(c^*) \leq \langle c^*,c_k \rangle \leq \|c^*\| \cdot \|c_k\| \leq \sqrt{k} \psi $.
It follows that the number of updates $ k $ made by the Perceptron algorithm has a finite upper bound, namely
$ k\leq \left( \frac{\psi}{\gamma_z(c^*)}\right)^2. $
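As a numerical sanity check (purely illustrative, with made-up data and a hand-picked unit-norm reference separator $ c^* $), one can count the updates made on a small separable set and verify that the count never exceeds the bound:

 import numpy as np

 x = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
 y = np.array([1, 1, -1, -1])
 c_star = np.array([1.0, 1.0]) / np.sqrt(2.0)     # unit-norm separator, ||c*|| = 1

 psi = np.linalg.norm(x, axis=1).max()            # psi = max_i ||x_i||
 gamma_z = (y * (x @ c_star)).min()               # geometric margin gamma_z(c*)

 c, k = np.zeros(2), 0                            # c_0 = 0
 while True:
     mistakes = [(xi, yi) for xi, yi in zip(x, y) if yi * np.dot(c, xi) <= 0]
     if not mistakes:                             # converged: no misclassified examples
         break
     xi, yi = mistakes[0]
     c, k = c + yi * xi, k + 1                    # c_k = c_{k-1} + y_i x_i

 print(k, "<=", (psi / gamma_z) ** 2)             # k never exceeds (psi / gamma_z(c*))^2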