Piyush Rai
Machine Learning (CS771A)
Aug 17, 2016
Logistic Regression
Recap: Linear Regression
Squared loss (linear regression): $L(w) = \sum_{n=1}^N (y_n - w^\top x_n)^2$

Regularized squared loss (ridge regression): $L_{reg}(w) = \sum_{n=1}^N (y_n - w^\top x_n)^2 + \lambda\, w^\top w$
Simple, convex loss functions in both cases. A closed-form solution for w can be found; w can also be computed more efficiently using gradient-based methods.
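As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the closed-form solution for the regularized squared loss, $w = (X^\top X + \lambda I_D)^{-1} X^\top y$; the toy data and the regularization strength lam are made-up placeholders.

```python
import numpy as np

# Made-up toy data: N = 100 examples, D = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.1  # regularization strength (illustrative value)

# Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_hat)  # should be close to w_true
```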
The probabilistic view assumes a Gaussian likelihood on each response:

$p(y_n \mid x_n, w) = \mathcal{N}(y_n \mid w^\top x_n, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_n - w^\top x_n)^2}{2\sigma^2}\right)$

and the likelihood of all N independent training responses is

$p(y \mid X, w) = \prod_{n=1}^N p(y_n \mid x_n, w) = \left(\frac{1}{2\pi\sigma^2}\right)^{N/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^\top x_n)^2\right)$
MLE maximizes this likelihood, or equivalently minimizes the negative log-likelihood

$\mathrm{NLL}(w) = -\log p(y \mid X, w) = \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - w^\top x_n)^2 + \text{const.}$

Can also combine the likelihood with a prior over w, e.g., a multivariate Gaussian prior with zero mean: $p(w) = \mathcal{N}(w \mid 0, \lambda^2 I_D) \propto \exp(-w^\top w / 2\lambda^2)$
The prior allows encoding our prior beliefs about w. It acts as a regularizer and, in this case, encourages the final solution to shrink towards the prior's mean (zero).
Can now solve for w using MAP estimation, i.e., maximizing the log
posterior or minimizing the negative of the log posterior w.r.t. w
$\hat{w}_{MAP} = \arg\min_w \left[\, \mathrm{NLL}(w) - \log p(w) \,\right] = \arg\min_w \sum_{n=1}^N (y_n - w^\top x_n)^2 + \frac{\sigma^2}{\lambda^2}\, w^\top w$
The optimization and probabilistic views lead to the same objective functions (though the probabilistic view also enables a full Bayesian treatment of the problem).
Today's Plan
Logistic Regression
$p(y_n = 1 \mid x_n, w) = f(x_n) = \mu_n = \sigma(w^\top x_n) = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)} = \frac{1}{1 + \exp(-w^\top x_n)}$

$p(y_n = 0 \mid x_n, w) = 1 - \mu_n$
The sigmoid first computes a real-valued score $w^\top x = \sum_{d=1}^D w_d x_d$ and squashes it into (0, 1) to turn this score into a probability.
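A minimal numpy sketch of this computation (not part of the slides); the names sigmoid and predict_proba are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)); maps any real score into (0, 1).
    # For extreme scores, scipy.special.expit is a more robust alternative.
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # p(y = 1 | x, w) = sigma(w^T x), computed for every row x of X
    return sigmoid(X @ w)

# Toy usage with made-up numbers
w = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [3.0, -1.0]])
print(predict_proba(w, X))  # probabilities in (0, 1)
```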
$p(y = 1 \mid x, w) = \mu = \sigma(w^\top x) = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$

$p(y = 0 \mid x, w) = 1 - \mu = 1 - \sigma(w^\top x) = \frac{1}{1 + \exp(w^\top x)}$

The log-odds of the two classes is therefore a linear function of the input:

$\log \frac{p(y = 1 \mid x, w)}{p(y = 0 \mid x, w)} = \log \exp(w^\top x) = w^\top x$
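A quick numeric check (illustrative, with made-up numbers) that the log-odds indeed equals the linear score $w^\top x$:

```python
import numpy as np

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])

score = w @ x                        # w^T x
p1 = 1.0 / (1.0 + np.exp(-score))    # p(y = 1 | x, w)
p0 = 1.0 - p1                        # p(y = 0 | x, w)

print(np.log(p1 / p0), score)        # both are 0.0 for this choice of w and x
```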
Consider the following per-example loss function for logistic regression:

$\ell(y_n, f(x_n)) = \begin{cases} -\log(\mu_n) & y_n = 1 \\ -\log(1 - \mu_n) & y_n = 0 \end{cases}$

This loss function makes intuitive sense: it is close to zero when the probability assigned to the correct label is close to 1, and grows without bound as that probability approaches 0.

The above loss function can be combined and written more compactly (the cross-entropy loss) as

$\ell(y_n, f(x_n)) = -y_n \log(\mu_n) - (1 - y_n) \log(1 - \mu_n)$

This is a function of the unknown parameter w since $\mu_n = \sigma(w^\top x_n)$.
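A small sketch (not from the slides) checking that the piecewise and compact forms of the loss agree; the function names are illustrative.

```python
import numpy as np

def loss_piecewise(y, mu):
    # -log(mu) if the true label is 1, -log(1 - mu) if it is 0
    return -np.log(mu) if y == 1 else -np.log(1.0 - mu)

def loss_compact(y, mu):
    # cross-entropy form: -y log(mu) - (1 - y) log(1 - mu)
    return -y * np.log(mu) - (1 - y) * np.log(1.0 - mu)

for y in (0, 1):
    for mu in (0.1, 0.5, 0.9):
        assert np.isclose(loss_piecewise(y, mu), loss_compact(y, mu))
```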
The total loss on the training data is the sum of the per-example losses:

$L(w) = \sum_{n=1}^N \ell(y_n, f(x_n)) = -\sum_{n=1}^N \left[ y_n \log(\mu_n) + (1 - y_n) \log(1 - \mu_n) \right]$

Plugging in $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$, we get

$L(w) = -\sum_{n=1}^N \left( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \right)$

Adding an $\ell_2$ regularizer gives the regularized loss

$L_{reg}(w) = -\sum_{n=1}^N \left( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \right) + \lambda \|w\|^2$
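A minimal sketch of the plugged-in loss (the helper name logistic_loss is an illustrative choice, not from the slides); np.logaddexp(0, s) computes log(1 + exp(s)) without overflow.

```python
import numpy as np

def logistic_loss(w, X, y, lam=0.0):
    # L(w) = -sum_n [ y_n w^T x_n - log(1 + exp(w^T x_n)) ] + lam * ||w||^2
    s = X @ w                                    # scores w^T x_n
    return -np.sum(y * s - np.logaddexp(0.0, s)) + lam * np.dot(w, w)
```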
The same objective arises from a probabilistic view: model each label as a Bernoulli random variable with parameter $\mu_n$, where $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$. The likelihood is

$p(y \mid X, w) = \prod_{n=1}^N p(y_n \mid x_n, w) = \prod_{n=1}^N \mu_n^{y_n} (1 - \mu_n)^{1 - y_n}$

and the negative log-likelihood is

$\mathrm{NLL}(w) = -\sum_{n=1}^N \left( y_n \log \mu_n + (1 - y_n) \log(1 - \mu_n) \right)$

Plugging in $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$, we get

$\mathrm{NLL}(w) = -\sum_{n=1}^N \left( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \right)$
Not surprisingly, the NLL expression is the same as the loss function above (though here we arrived at it more directly).
To minimize

$L(w) = -\sum_{n=1}^N \left( y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \right)$

take the derivative with respect to w:

$\frac{\partial L(w)}{\partial w} = -\sum_{n=1}^N \frac{\partial}{\partial w}\left[ y_n w^\top x_n - \log(1 + \exp(w^\top x_n)) \right] = -\sum_{n=1}^N \left( y_n - \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)} \right) x_n = \sum_{n=1}^N (\mu_n - y_n)\, x_n = X^\top (\mu - y)$
Can't get a closed-form solution for w by setting this derivative to zero. Need to use iterative methods (e.g., gradient descent) to solve for w.
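The gradient expression $X^\top(\mu - y)$ translates directly into code; a small numpy sketch (illustrative, not from the slides):

```python
import numpy as np

def gradient(w, X, y):
    # dL/dw = X^T (mu - y), with mu_n = sigmoid(w^T x_n)
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (mu - y)
```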
The gradient descent update (new value of w in terms of the previous value):

$w^{(t+1)} = w^{(t)} - \eta \sum_{n=1}^N (\mu_n^{(t)} - y_n)\, x_n$

where $\eta$ is the learning rate and $\mu_n^{(t)} = \sigma(w^{(t)\top} x_n)$ is the predicted label probability for $x_n$ using $w = w^{(t)}$ from the previous iteration.

Note that the updates give larger weight to those examples on which the current model makes larger mistakes, as measured by $(\mu_n^{(t)} - y_n)$.
Note: computing the gradient in every iteration requires all the data, so GD can be expensive when N is very large. A cheaper alternative is to do the update using only a small, randomly chosen minibatch of data; this is known as Stochastic Gradient Descent (SGD), which runs faster per iteration and often converges faster overall.
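A minimal sketch of minibatch SGD for this objective (illustrative; the batch size, learning rate, and epoch count are made-up defaults):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, batch_size=32, epochs=100, seed=0):
    # Minibatch SGD: each step uses the gradient X_b^T (mu_b - y_b)
    # computed on a randomly chosen minibatch b.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        idx = rng.permutation(N)
        for start in range(0, N, batch_size):
            b = idx[start:start + batch_size]
            mu_b = 1.0 / (1.0 + np.exp(-(X[b] @ w)))
            w -= eta * X[b].T @ (mu_b - y[b])
    return w
```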
Newton's method (a second-order method) uses both the gradient and the Hessian. The update is

$w^{(t+1)} = w^{(t)} - (H^{(t)})^{-1} g^{(t)}$

Since the gradient is $g = \frac{\partial L(w)}{\partial w} = -\sum_{n=1}^N (y_n - \mu_n)\, x_n$ with $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$, the Hessian is

$H = \frac{\partial^2 L(w)}{\partial w\, \partial w^\top} = \frac{\partial}{\partial w}\left[\frac{\partial L(w)}{\partial w}\right]^\top = -\frac{\partial}{\partial w} \sum_{n=1}^N (y_n - \mu_n)\, x_n^\top = \sum_{n=1}^N \frac{\partial \mu_n}{\partial w}\, x_n^\top$

Since $\frac{\partial \mu_n}{\partial w} = \mu_n (1 - \mu_n)\, x_n$, we have

$H = \sum_{n=1}^N \mu_n (1 - \mu_n)\, x_n x_n^\top = X^\top S X$

where $S$ is the diagonal matrix with $S_{nn} = \mu_n (1 - \mu_n)$.
Plugging $g^{(t)} = X^\top (\mu^{(t)} - y)$ and $H^{(t)} = X^\top S^{(t)} X$ into the Newton update:

$w^{(t+1)} = w^{(t)} - (X^\top S^{(t)} X)^{-1} X^\top (\mu^{(t)} - y)$
$\quad = w^{(t)} + (X^\top S^{(t)} X)^{-1} X^\top (y - \mu^{(t)})$
$\quad = (X^\top S^{(t)} X)^{-1} \left[ (X^\top S^{(t)} X)\, w^{(t)} + X^\top (y - \mu^{(t)}) \right]$
$\quad = (X^\top S^{(t)} X)^{-1} X^\top \left[ S^{(t)} X w^{(t)} + y - \mu^{(t)} \right]$
$\quad = (X^\top S^{(t)} X)^{-1} X^\top S^{(t)} \left[ X w^{(t)} + S^{(t)^{-1}} (y - \mu^{(t)}) \right]$
$\quad = (X^\top S^{(t)} X)^{-1} X^\top S^{(t)} \hat{y}^{(t)}$

where $\hat{y}^{(t)} = X w^{(t)} + S^{(t)^{-1}} (y - \mu^{(t)})$.

Each iteration is thus a weighted least squares problem, with $\hat{y}^{(t)}$ and the weights changing in each iteration. The weight $S_n^{(t)}$ is the $n$-th diagonal element of $S^{(t)}$; this scheme is known as Iteratively Reweighted Least Squares (IRLS).
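A minimal numpy sketch of these IRLS/Newton updates (illustrative; it assumes the unregularized objective and includes no safeguards for perfectly separable data):

```python
import numpy as np

def irls_logistic(X, y, iters=25, tol=1e-8):
    # Each iteration solves the weighted least squares system
    # (X^T S X) w = X^T S yhat, with S_nn = mu_n (1 - mu_n).
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        s = np.clip(mu * (1.0 - mu), 1e-10, None)  # diagonal of S, clipped for stability
        yhat = X @ w + (y - mu) / s                # working response
        XtS = X.T * s                              # X^T S without forming S explicitly
        w_new = np.linalg.solve(XtS @ X, XtS @ yhat)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```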
For multiclass classification with K classes, logistic regression uses one weight vector $w_\ell$ per class (collected in $W$), with class probabilities given by the softmax $\mu_{n\ell} = \frac{\exp(w_\ell^\top x_n)}{\sum_{k=1}^K \exp(w_k^\top x_n)}$, and the likelihood

$p(Y \mid X, W) = \prod_{n=1}^N \prod_{\ell=1}^K \mu_{n\ell}^{y_{n\ell}}$ (likelihood function)

where $y_{n\ell} = 1$ if the true class of example $n$ is $\ell$, and $y_{n\ell'} = 0$ for all other $\ell' \neq \ell$
Can do MLE/MAP for W similar to the binary logistic regression case
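A small sketch of the softmax probabilities $\mu_{n\ell}$ used in the multiclass likelihood (illustrative; W is assumed here to be the D x K matrix whose columns are the class weight vectors):

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, with the per-row max subtracted for numerical stability
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multiclass_proba(W, X):
    # mu[n, l] = exp(w_l^T x_n) / sum_k exp(w_k^T x_n)
    return softmax(X @ W)
```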