Revision as of 16:34, 9 April 2010 by Mreeder (Talk | contribs)


ECE662 Spring 2010 Makeup Lecture Notes #1

Friday, April 9th 2010
For anyone that was unable to attend. Feel free to correct any mistakes!

(...Continuing the "Gradient Descent Procedure" discussion from last time)

Batch Method Similar idea, but uses all samples to update our $ \vec{c}_{k} $ value.

Theorem: if samples are linearly separable, then the batch perceptron algorithm...
(using all samples)
(w/ constant step size $ \eta(k) = const, \forall k $)

...will terminate after a finite number of steps. (note that we don't know how many.)
(Proof: see DHS text)


"Online Method"
Uses one sample at a time
Consider the sequence: $ \{\vec{y}_{j}\}_{j=1}^{\infty} = \{\vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... , \vec{y}_{d}, \vec{y}_{1}, \vec{y}_{2}, ... \} $ where $ d=|\mathcal{D}| $
begin with our guess, $ \vec{c}_{k} $.

Let $ \vec{z}_{1} $ be the first mis-classified sample in the sequence above: Say $ \vec{z}_{1} = \vec{y}_{j1} $

Now, update our solution:
$ \vec{c}_{2} = \vec{c}_{1} + const\cdot\vec{z}_{1} $

Again: Let $ \vec{z}_{2} $ be the next mis-classified sample in the sequence above: Say $ \vec{z}_{2} = \vec{y}_{j2}, j2 > j1 $
Update the solution:
$ \vec{c}_{3} = \vec{c}_{2} + const\cdot\vec{z}_{2} $

Or more generally, $ \vec{c}_{k+1} = \vec{c}_{k} + const\cdot\vec{z}_{k} $

We hope that this gets better with time. Note that the time to compute our next step is lower, but more steps overall may be required for convergence (vs gradient descent.)

Theorem: if samples are linearly separable, then the online perceptron will converge after a finite number of steps. (Proof: see DHS text)

Does this mean it will find a good separation? No guarantee... but it will converge. Note that this algo. is only occasionally used in practice. (If you are interested in this from an algorithmic standpoint, examine the proofs in the text.)

Note: We could minimize other cost functions, e.g.
$ J(\vec{c}) = \sum_{\vec{y},misclassified} (\vec{c}\cdot\vec{y}_{i})^{2} $
This one is very flat near the $ \vec{c}=0 $ boundary though. We can revise it:
$ J(\vec{c}) = \sum_{\vec{y},misclassified} (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2} $
But this one tends to drag the decision boundary toward outliers (i.e. gives too much weight to outliers). So how about:
$ J(\vec{c}) = \sum_{\vec{y},misclassified}\frac{ (\vec{c}\cdot\vec{y}_{i} - \vec{b})^{2}}{|\vec{y}_{i}|^{2}} $
etc. Sometimes a pre-processor can be used to help eliminate outliers as well.

A better method:
Now we will formulate our problem as a Least Squares Procedure.
We want: $ \vec{c} $ such that: $ \vec{c}\cdot\vec{y}_{i}>0, \forall\vec{y}_{i}\in \mathcal{D} $
--> "Inequalities"

So, choose numbers b_{i}>0 (see later which ones to choose)
Consider: $ \vec{c}\cdot\vec{y}_{i}=b_{i}, \foralli=1..d $
--> "Equalities"

Usually, very many equations (samples, d) and very few unknowns (#dimnensions=n+1).

if d=n+1 (#samples = number of features, plus transformation)
This is a "square problem," and can be solved by matrix inversion (if $ determinant(\vec{Y})\ne0 $)
Matrix equation: $ \vec{Y}\vec{c}=\vec{b} $

if d>>n+1 (many more samples and features)
This is an "over-constraint system"
Typically, there is no solution.
We seek $ \vec{c} $ that is "close" to a solution by minimizing the L2 norm:
$ \|\vec{Y}\vec{c}-\vec{b}\|_{L2} $

which has solution:
$ \vec{c} = (Y^{T}Y)^{-1}Y^{T}\vec{b}, if |Y^{T}Y|\ne0 $
(c = "Pseudo-inverse of Y" * b)
Or, if $ |Y^{T}Y|\ne0 $ then:
$ \vec{c} = \lim_{\varepsilon\rightarrow 0}(Y^{T}Y+\varepsilon\mathit{1})^{-1}Y^{T}\vec{b} $
Which always exists.
(note that i wanted $ \mathit{1} $ here to be a vector of ones. anybody know the correct latex?)

BUT! This solution depends on the choice of $ \vec{b} $
How to choose b?

One choice links to Fisher Linear Discriminant (See: DHS section 3.8.2)

But to get there, lets talk about projection first:

Context:
given training data from 2 classes,
$ \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d} \in \mathbb{R} $ (no additional dimension transform yet)
In application, we have a tendency to "overmeasure" and get too many features. So you ask yourself, "what can I get rid of?"

n is big?
Difficult to separate classes.
Instead, try to find a projection (to a lower dimensional subspace, e.g. a line)
$ \pi:\mathcal{R}^{n}\rightarrow\mathcal{R}^{\bar{n}}, \bar{n}<n $
such that:
$ pi{\vec{y}_{1}}, pi{\vec{y}_{2}}, ..., pi{\vec{y}_{d}} $ area easier to separate.


Assume $ \bar{n}=1 $
We will attempt to find a straight line through the origin, such that the projection of the data onto the line is separated as best as possible.

(somebody should draw the pretty picture of 2-d data being projected onto a line. basically, draw a line perpendicularly from a data point onto the line we chose. We can choose any line throughthe origin... some will be better than others.)

Note that the direction of our line, $ \vec{W} $, might not separate well. We want to find the best!

To do this, assume:
$ \vec{y}_{1}, \vec{y}_{2}, ..., \vec{y}_{d1} $ belong to the 1^{st} class.
$ \vec{y}_{d1+1}, \vec{y}_{d1+2}, ..., \vec{y}_{d} $ belong to the 2^{nd} class.
Let $ \vec{W} $ be the unit tangent to the line (origin? can anyone clarify this?)
Consider: the sample means of the projection for each class:
$ \tilde{m}_{1}=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}(\vec{W}\cdot\vec{y}_{i}) $
$ \tilde{m}_{2}=\frac{1}{d-d_{1}}\sum_{i=d_{1}+1}^{d}(\vec{W}\cdot\vec{y}_{i}) $

Consider: the scatter of the projections for each class:

$ \tilde{s}_{1}=\sum_{i=1}^{d_{1}}(\vec{W}\cdot\vec{y}_{i}-\tilde{m}_{1})^{2} $
$ \tilde{s}_{2}=\sum_{i=d_{1}+1}^{d}(\vec{W}\cdot\vec{y}_{i}-\tilde{m}_{2})^{2} $

Definition: The Fisher Linear Discriminant is:
$ J(\vec{W}) = \frac{|\tilde{m}_{1}-\tilde{m}_{2}|^{2}}{} $
(Note: wikipedia says the numerator is squared, but Mimi did not write it on the board. Clarification?)


Back to 2010 Spring ECE 662 mboutin

Alumni Liaison

Correspondence Chess Grandmaster and Purdue Alumni

Prof. Dan Fleetwood