Revision as of 20:02, 13 May 2014


Logistic regression

A slecture by ECE student Borui Chen

Partly based on the ECE662 Spring 2014 lecture material of Prof. Mireille Boutin.


Introduction

In machine learning, classification is a central problem. A linear classifier is an algorithm that makes its classification decision for a new test data point based on a linear combination of the features.

Linear classifiers fall into two classes, generative models and discriminative models:

  • The generative model measures the joint distribution of the data and the class.
    • Examples are the Naive Bayes classifier and Linear Discriminant Analysis.
  • The discriminative model makes no assumption about the joint distribution of the data. Instead, it takes the data as given and tries to maximize the conditional density (Prob(class|data)) directly.
    • Examples are Logistic Regression, the Perceptron and the Support Vector Machine.

Intuition and derivation of Logistic Regression

Consider a simple classification problem. The goal is to tell whether a person is male or female based on one feature: hair length. The data is given as $ (x_i,y_i) $, where $ i $ indexes the training set, $ x_i $ is the hair length in inches, and $ y_i=1 $ indicates that the person is female and $ y_i=0 $ that the person is male. Assuming women tend to have longer hair, the distribution of the training data will look like this:

Cbr intuition.png

Clearly, if a person's hair is comparatively long, being female is more likely. If another person has hair length 10, we could say that person is more likely to be female. However, we have no description of how long is long enough to say a person is female. So we introduce a probability.

The intuition of logistic regression, in this example, is to assign a continuous probability to every possible value of hair length, so that for longer hair the probability of being female is close to 1 and for shorter hair it is close to 0. This is done by taking a linear combination of the feature and a constant and feeding it into a logistic function:

$ \begin{align} a_i &= \beta_0+\beta_1 x_i\\ &=\beta^T x_i \end{align} $
$ L(a) = \frac{1}{1+e^{(-a)}} $

Then the logistic function of $ x_i $ would be:

$ L(x_i) = Pr(Y_i=1|x_i) = Pr(female|x_i)= \frac{1}{1+e^{(-\beta^T x_i)}} $
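
As a rough illustration of the formula above, the following Python sketch evaluates the logistic function for the one-feature hair-length example. The coefficient values `BETA0` and `BETA1` are hypothetical numbers chosen only for demonstration; they are not fitted to any data from this slecture.

```python
import math


def logistic(a: float) -> float:
    """Logistic function L(a) = 1 / (1 + exp(-a)).

    The two branches are algebraically identical; the second one
    avoids overflow of exp(-a) when a is a large negative number.
    """
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def prob_female(hair_length: float, beta0: float, beta1: float) -> float:
    """Pr(Y=1 | x) = L(beta0 + beta1 * x), with Y=1 meaning female."""
    return logistic(beta0 + beta1 * hair_length)


# Hypothetical coefficients, for illustration only (not fitted values).
BETA0, BETA1 = -6.0, 1.0

for length in (2.0, 6.0, 10.0):
    print(length, prob_female(length, BETA0, BETA1))
```

With these assumed coefficients, longer hair gives a probability closer to 1 and shorter hair a probability closer to 0, which is the behavior described above.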

After fitting $ \beta $ with an optimization algorithm, the curve looks like the following:

Cbr intuition 2.png

Having this curve, we can develop a decision rule:

$ \text{This person is } \begin{cases} \text{female}, & \text{if }L(x_i)\ge 0.5\\ \text{male}, & \text{if }L(x_i) < 0.5 \end{cases} $
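
The decision rule above can be sketched in a few lines of Python. The function name `classify` and the coefficients passed to it are hypothetical and only serve the example; the 0.5 threshold is the one stated in the rule.

```python
import math


def classify(hair_length: float, beta0: float, beta1: float) -> str:
    """Apply the rule: female if L(x) >= 0.5, male otherwise."""
    a = beta0 + beta1 * hair_length
    p_female = 1.0 / (1.0 + math.exp(-a))
    return "female" if p_female >= 0.5 else "male"


# Hypothetical coefficients, for illustration only (not fitted values).
print(classify(10.0, -6.0, 1.0))
print(classify(2.0, -6.0, 1.0))
```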

More generally, the feature vector can have more than one dimension. With $ d $ features and a leading constant 1 for the intercept:

$ x = (1,x_1,x_2,...,x_d)^T $
$ \beta = (\beta_0,\beta_1,\beta_2,...,\beta_d)^T $

Note:

  • the decision boundary is linear:
$ \beta^T x_i = 0 $
In 1-D it is a point, in 2-D it is a line, and so on.
  • $ \beta^T x_i $ is the log of the odds:
$ \beta^T x_i = \log\frac{Pr(female|x_i)}{Pr(male|x_i)}=\log\frac{Pr(female|x_i)}{1-Pr(female|x_i)} $
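
Both points in the note follow directly from the definition of the logistic function. A short derivation, using only the symbols already defined in this section, might read:

```latex
\begin{align}
L(a) &= \frac{1}{1+e^{-a}}, \qquad a = \beta^T x_i \\
1 - L(a) &= \frac{e^{-a}}{1+e^{-a}} \\
\frac{L(a)}{1-L(a)} &= \frac{1}{e^{-a}} = e^{a} \\
\log\frac{L(a)}{1-L(a)} &= a = \beta^T x_i \\
L(a) \ge 0.5 \;\iff\; \frac{L(a)}{1-L(a)} \ge 1 \;&\iff\; \beta^T x_i \ge 0
\end{align}
```

The last line shows why thresholding the probability at 0.5 gives a boundary that is linear in $ x_i $.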

With this setup, the goal is to find the $ \beta $ that makes the curve fit the data best. This leads to Maximum Likelihood Estimation and Newton's method.

Maximum Likelihood Estimation

For logistic regression, we need to find the best fit of the curve to the training data. To do this, we choose $ \beta $ such that the likelihood, i.e. the joint probability of the observed labels given the features,

$ Pr(Y_1=y_1,...,Y_n=y_n|x_1...x_n) $

is maximized.

Assuming the training samples are independent, the likelihood function factorizes:

$ \begin{align} Pr(Y_1=y_1,...,Y_n=y_n|x_1...x_n) = \prod_{i=1}^n Pr(Y_i=y_i|x_i,\beta) \end{align} $
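
Since each label is 0 or 1, each factor in the product can be written as $ Pr(Y_i=y_i|x_i,\beta) = L(x_i)^{y_i}(1-L(x_i))^{1-y_i} $, and taking the logarithm turns the product into a sum. The Python sketch below evaluates this log-likelihood for the one-feature case. The toy data and the candidate coefficients are made up for illustration, and the function names are hypothetical; the sketch only evaluates the objective and does not perform the maximization.

```python
import math


def softplus(a: float) -> float:
    """Numerically stable log(1 + exp(a))."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def log_likelihood(beta0: float, beta1: float, xs, ys) -> float:
    """Sum over i of y_i*log L(x_i) + (1 - y_i)*log(1 - L(x_i)).

    With a_i = beta0 + beta1*x_i, each term simplifies to
    y_i*a_i - log(1 + exp(a_i)).
    """
    total = 0.0
    for x, y in zip(xs, ys):
        a = beta0 + beta1 * x
        total += y * a - softplus(a)
    return total


# Made-up toy data: hair length in inches, y = 1 for female, 0 for male.
xs = [2.0, 3.0, 9.0, 10.0]
ys = [0, 0, 1, 1]

# Compare two hypothetical candidates for beta.
print(log_likelihood(0.0, 0.0, xs, ys))   # every probability equals 0.5
print(log_likelihood(-6.0, 1.0, xs, ys))  # separates the toy data better
```

A candidate $ \beta $ that assigns high probability to the observed labels yields a log-likelihood closer to 0, so maximizing this quantity is equivalent to maximizing the likelihood itself.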


Numerical optimization
