Class notes for ECE662 Spring 2010 Lecture 21 (Makeup Lecture #1)
Friday, April 9th 2010
For anyone that was unable to attend. Feel free to correct any mistakes!
(...Continuing the "Gradient Descent Procedure for Perceptron" discussion from last time)
Batch Method Similar idea, but uses all samples to update our $ \vec{c}_{k} $ value.
See this student page on the topic.
Theorem: if samples are linearly separable, then the batch perceptron algorithm...
(using all samples)
(w/ constant step size $ \eta(k) = const, ~~\forall k $)
...will terminate after a finite number of steps. (note that we don't know how many.)
(Proof: see DHS text)
"Online Method"
Uses one sample at a time
Consider the sequence: $ \{\vec{y}_{j}\}_{j=1}^{\infty} = \{\vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... , \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... \}\text{ where } d=|\mathcal{D}| $
begin with our guess, $ \vec{c}_{k} $.
Let $ \vec{z}_{1} $ be the first mis-classified sample in the sequence above: Say $ \vec{z}_{1} = \vec{y}_{j1} $
Now, update our solution:
$ \vec{c}_{2} = \vec{c}_{1} + const\cdot\vec{z}_{1} $
Again: Let $ \vec{z}_{2} $ be the next mis-classified sample in the sequence above: Say $ \vec{z}_{2} = \vec{y}_{j2}, ~~j2 > j1 $
Update the solution:
$ \vec{c}_{3} = \vec{c}_{2} + const\cdot\vec{z}_{2} $
Or more generally, $ \vec{c}_{k+1} = \vec{c}_{k} + const\cdot\vec{z}_{k} $
We hope that this gets better with time. Note that the time to compute our next step is lower, but more steps overall may be required for convergence (vs gradient descent.)
Theorem: if samples are linearly separable, then the online perceptron will converge after a finite number of steps. (Proof: see DHS text)
Does this mean it will find a good separation? No guarantee... but it will converge. Note that this algo. is only occasionally used in practice. (If you are interested in this from an algorithmic standpoint, examine the proofs in the text.)
Note: We could minimize other cost functions, e.g.
$ J(\vec{c}) = \sum_{\vec{y}~misclassified} (\vec{c}\cdot\vec{y}_{i})^{2} $
This one is very flat near the $ \vec{c}=0 $ boundary though. We can revise it:
$ J(\vec{c}) = \sum_{\vec{y}~misclassified} (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2} $
But this one tends to drag the decision boundary toward outliers (i.e. gives too much weight to outliers). So how about:
$ J(\vec{c}) = \sum_{\vec{y}~misclassified}\frac{ (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2}}{|\vec{y}_{i}|^{2}} $
etc. Sometimes a pre-processor can be used to help eliminate outliers as well.
A better method:
Now we will formulate our problem as a Least Squares Procedure.
We want: $ \vec{c} $ such that: $ \vec{c}\cdot\vec{y}_{i}>0, ~~ \forall\vec{y}_{i}\in \mathcal{D} $
--> "Inequalities"
So, choose numbers $ b_{i}>0 $ (see later which ones to choose)
Consider: $ \vec{c}\cdot\vec{y_{i}}=b_{i},~~ \forall i=1..d $
--> "Equalities"
Usually, very many equations (samples, d) and very few unknowns (#dimnensions=n+1).
if d=n+1 (#samples = number of features, plus transformation)
This is a "square problem," and can be solved by matrix inversion (if $ determinant(\vec{Y})\ne0 $)
Matrix equation: $ \vec{Y}\vec{c}=\vec{b} $
if d>>n+1 (many more samples and features)
This is an "over-constraint system"
Typically, there is no solution.
We seek $ \vec{c} $ that is "close" to a solution by minimizing the L2 norm:
$ \|\vec{Y}\vec{c}-\vec{b}\|_{L2} $
which has solution:
$ \vec{c} = (Y^{T}Y)^{-1}Y^{T}\vec{b},~ \text{if } |Y^{T}Y|\ne0 $
(c = "Pseudo-inverse of Y" * b)
Or, if $ |Y^{T}Y|\ne0 $ then:
$ \vec{c} = \lim_{\varepsilon\rightarrow 0}(Y^{T}Y+\varepsilon\mathit{1})^{-1}Y^{T}\vec{b} $
Which always exists.
(note that i wanted $ \mathit{1} $ here to be a vector of ones. anybody know the correct latex?)
BUT! This solution depends on the choice of $ \vec{b} $
How to choose b?
One choice links to Fisher Linear Discriminant (See: DHS section 3.8.2)
But to get there, lets talk about projection first:
Context:
given training data from 2 classes,
$ \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d} \in \mathbb{R} $ (no additional dimension transform yet)
In application, we have a tendency to "overmeasure" and get too many features. So you ask yourself, "what can I get rid of?"
n is big?
Difficult to separate classes.
Instead, try to find a projection (to a lower dimensional subspace, e.g. a line)
$ \pi:\mathcal{R}^{n}\rightarrow\mathcal{R}^{\bar{n}},~ \bar{n}<n $
such that:
$ \pi(\vec{y}_{1}), \pi(\vec{y}_{2}), ..., \pi(\vec{y}_{d}) $ area easier to separate.
Assume $ \bar{n}=1 $
We will attempt to find a straight line through the origin, such that the projection of the data onto the line is separated as best as possible.
(somebody should draw the pretty picture of 2-d data being projected onto a line. basically, draw a line perpendicularly from a data point onto the line we chose. We can choose any line throughthe origin... some will be better than others.)
Note that the direction of our line, $ \vec{w} $, might not separate well. We want to find the best!
To do this, assume:
$ \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d1} $ belong to the 1^{st} class.
$ \vec{y}_{d1+1}, \vec{y}_{d1+2}, ..., \vec{y}_{d} $ belong to the 2^{nd} class.
Let $ \vec{w} $ be the unit tangent to the line (origin? can anyone clarify this?)
Consider: the sample means of the projection for each class:
$ \tilde{m}_{1}=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}(\vec{w}\cdot\vec{y}_{i}) $
$ \tilde{m}_{2}=\frac{1}{d-d_{1}}\sum_{i=d_{1}+1}^{d}(\vec{w}\cdot\vec{y}_{i}) $
Consider: the scatter of the projections for each class:
$ \tilde{s}_{1}=\sum_{i=1}^{d_{1}}(\vec{w}\cdot\vec{y}_{i}-\tilde{m}_{1})^{2} $
$ \tilde{s}_{2}=\sum_{i=d_{1}+1}^{d}(\vec{w}\cdot\vec{y}_{i}-\tilde{m}_{2})^{2} $
Definition: The Fisher Linear Discriminant is:
$ J(\vec{w}) = \frac{|\tilde{m}_{1}-\tilde{m}_{2}|^{2}}{\tilde{s_{1}}^{2}+\tilde{s_{2}}^{2}} $
(Note: wikipedia says the numerator is squared, but Mimi did not write it on the board. Clarification?--Mreeder 21:56, 9 April 2010 (UTC))
This is the "diff f projected means" (squared) over "the sum of scatters of each class"
Observe:
Can't we just say "means are far away" for a good separator?
Merely demanding $ |\tilde{m}_{1}-\tilde{m}_{2}| $ to be big would not do!
Expanding this, we write:
$ |\tilde{m}_{1}-\tilde{m}_{2}| =|\vec{w}\cdot(m_{1} - m_{2})| $
Note! Removing the W vector, this means the difference of "class means" not "projection means"!
$ =|\vec{w}\cdot(m_{1} - m_{2})|~\xrightarrow{|\vec{w}|\rightarrow\infty}~\infty $
this is now a "constraint problem", which is harder to solve. So, fix $ |\vec{w}|=1 $, but then this is a "constraint optimization problem" (did i screw up this description? --Mreeder 21:56, 9 April 2010 (UTC))
But Fisher LInear Discriminant is independent of norm of W, $ |\vec{w}| $
The scatter (on the denominator) "fixes the scaling issue"
$ J(\vec{w}) $ bypasses this problem because:
since:
$ |\tilde{m}_{1}-\tilde{m}_{2}| = ... $
(ugh latex gets so old so fast--Mreeder 21:56, 9 April 2010 (UTC))
$ =|\vec{w}|\cdot\Big|\frac{\vec{w}}{|\vec{w}|}\cdot(m_{1} - m_{2})\Big| $
Similarly,
$ \tilde{s}_{1}=\sum_{i=1}^{d_{1}}(\vec{w}\cdot\vec{y}_{i}-\tilde{m}_{1})^{2} =... $
(latex...--Mreeder 22:04, 9 April 2010 (UTC))
$ ...=|\vec{w}|^{2}\sum_{i=1}^{d_{1}}( \frac{\vec{w}}{|\vec{w}|} \vec{y_{i}} - \frac{\vec{w}}{|\vec{w}|}m_{1}) $
and:
$ \tilde{s}_{2}=|\vec{w}|^{2}\sum_{i=d_{1}+1}^{d}( \frac{\vec{w}}{|\vec{w}|} \vec{y_{i}} - \frac{\vec{w}}{|\vec{w}|}m_{2}) $
Final Thoughts
Why not just simply take:
$ J(\vec{w}) = \frac{|\tilde{m}_{1} - \tilde{m}_{2}| ^{2}}{|\vec{w}|^{2}} $???