Linear Regression: The Aim
We are given:
1) A set of m training examples, each with n features.
2) The desired numeric output for each training example.
We desire:
To find a formula that calculates y from x. The formula is going to be
linear. That is, given a training example, our formula would predict:

    y_predict = w_0 + x_1·w_1 + x_2·w_2 + ... + x_n·w_n

Now if there are n features, and there are more than n training
examples, then it might be impossible for a straight line (or hyper-plane)
to pass through all the points.
So we need an error formula, that judges how good our current weights
are.
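The linear formula above can be sketched in a few lines of Python (the names `predict`, `w`, and `x` are illustrative, not from the slides):

```python
# A minimal sketch of the linear prediction formula, assuming the weights
# w = [w0, w1, ..., wn] and a training example x = [x1, ..., xn] are
# plain Python lists.

def predict(w, x):
    """y_predict = w0 + x1*w1 + x2*w2 + ... + xn*wn"""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

# Example: w0=1, w1=2, w2=3 on x=(4, 5) gives 1 + 2*4 + 3*5 = 24
print(predict([1, 2, 3], [4, 5]))  # 24
```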
Linear Regression: The Error
Remember:  y_predict = w_0 + x_1·w_1 + x_2·w_2 + ... + x_n·w_n
Say, we have m training examples. How good is our line (or hyper-plane)
in fitting the given data? We would like to take something like the average
error. A couple of ways to do this are:
1) (1/m) · Σ_{i=1..m} ( y_predict − y_desired )²
   This formula takes the average of the squared errors (Mean of the
   Squared Errors).

2) (1/2m) · Σ_{i=1..m} ( y_predict − y_desired )²
   More commonly used. It is half of the mean squared error. The 2 in the
   denominator is just so that the algebra simplifies when we later
   differentiate.
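The two error measures can be sketched as follows (a minimal illustration; the function names and sample values are made up):

```python
# Sketch of the two error measures, assuming y_predict and y_desired are
# lists of equal length m.

def mean_squared_error(y_predict, y_desired):
    m = len(y_predict)
    return sum((p - d) ** 2 for p, d in zip(y_predict, y_desired)) / m

def half_mean_squared_error(y_predict, y_desired):
    # The extra 2 in the denominator simplifies the derivative later on.
    return mean_squared_error(y_predict, y_desired) / 2

print(mean_squared_error([3, 5], [1, 1]))       # (4 + 16) / 2 = 10.0
print(half_mean_squared_error([3, 5], [1, 1]))  # 5.0
```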
So we now know (a) Our formula and (b) We can specify how good
or bad it is.
First let's change the notation a little bit:

    error = (1/2m) · Σ_{i=1..m} ( y_predict^(i) − y_desired^(i) )²
          = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²
Here by h_w(x^(i)) we denote our predicted value on the i-th training example.
The w in the subscript means that this predicted value (obviously) depends
on the current values of all the w_j.
Linear Regression: Updating Notation
Remember from the previous slide:

    error = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²

That is, in the formula for y_predict, x_j^(i) would be multiplied by w_j (as shown above).
Error is only a function of weights
Remember:

    error = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²
Now note something. The training examples x^(i) are given to us. They do
not change. The desired outputs y^(i) are given to us. They do not change
either. They are all constants.
The only thing that changes (and we are trying to change it) are the
weights, w_j.
So the error is a function of all the weights. So we can write the error as a
function J(w):

    J(w) = error = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²
Reducing the Error
Remember:

    J(w) = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²
Now the question worth a billion dollars: How do we update the weights to
reduce the error?
Suppose there is only one feature. So we have just two weights: w_0 and
w_1. As the error is a function of these two weights we can write: J(w) =
J(w_0, w_1)
So the graph of J(w) may look like a bowl-shaped surface:

    [Figure: surface plot of the error. The two horizontal axes are w_0 and
    w_1; the vertical axis is J(w_0, w_1), that is, the error J(w).]

The gradient of J points in the direction in which the error increases
fastest. To reduce the error, then, we need to move in the opposite
direction to the gradient. How fast we move is determined by a
user-specified constant α (the so-called learning rate), so the full update
rule becomes:

    w_0 ← w_0 − α · ∂J/∂w_0   (and similarly for w_1)
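A single downhill step can be sketched numerically, before we do any calculus: estimate the gradient by finite differences and move against it (all data values here are invented for illustration):

```python
# One gradient-descent step on J(w0, w1) for a single feature, using a
# numerical estimate of the gradient.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # y_desired for each example

def J(w0, w1):
    m = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def grad(w0, w1, eps=1e-6):
    # Central finite differences approximate dJ/dw0 and dJ/dw1.
    dj_dw0 = (J(w0 + eps, w1) - J(w0 - eps, w1)) / (2 * eps)
    dj_dw1 = (J(w0, w1 + eps) - J(w0, w1 - eps)) / (2 * eps)
    return dj_dw0, dj_dw1

alpha = 0.1                    # the learning rate
w0, w1 = 0.0, 0.0
g0, g1 = grad(w0, w1)
w0, w1 = w0 - alpha * g0, w1 - alpha * g1   # move against the gradient
print(J(0.0, 0.0) > J(w0, w1))  # True: one step downhill reduced the error
```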
Reducing the Error
Remember:

    J(w) = (1/2m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) )²

And our update rules are:

    w_j ← w_j − α · ∂J/∂w_j
The only problem remaining is that of calculus: differentiating J with
respect to all the weights. After all the algebra, here is the result we get
after differentiation:

    w_j ← w_j − (α/m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) ) · x_j^(i)

In words:

    w_j ← w_j − (α/m) · Sum_over_all_training_examples ( y_predicted − y_desired ) · (the feature multiplying w_j)
So we update all the weights (w_0, w_1, ..., w_n) like this. After having done
that, we recalculate all the y_predicted with the new w's.
Then we repeat the procedure: update all the (w_0, w_1, ..., w_n) for the 2nd
time, and so on.
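The whole procedure can be sketched as a short loop. As an assumption for brevity (not stated on the slides), each example is given a leading feature fixed at 1, so w_0 is updated like every other weight:

```python
# Batch gradient descent for linear regression: update every w_j from the
# old predictions, then recompute predictions, and repeat.

def train_linear(X, y, alpha=0.1, iterations=1000):
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iterations):
        # error_i = y_predicted^(i) - y_desired^(i), using the current w
        errors = [sum(wj * xj for wj, xj in zip(w, x)) - yi
                  for x, yi in zip(X, y)]
        # w_j <- w_j - (alpha/m) * sum_i error_i * x_j^(i)
        w = [wj - (alpha / m) * sum(e * x[j] for e, x in zip(errors, X))
             for j, wj in enumerate(w)]
    return w

# Data generated from y = 1 + 2*x1; the learned weights should be close.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = train_linear(X, y)
print(w)  # approximately [1.0, 2.0]
```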
Numeric Example
Let's say there are two features per training example. Also assume that
we have just two training examples:

    [Table: the two training examples and the iteration-by-iteration
    weight updates.]

Note that the only thing that has changed in the 2nd iteration is
the values of error_1 and error_2.
Logistic Regression: The Aim
Here is what the aim is: We are given training examples. With each
example there is a label: the example either belongs to class 1, or class 0.
(NOTE: there are only 2 classes.)
(This class 1 and class 0 could be anything, depending on the application.)
For every training example, the value of y_predict (which we also denoted by
h_w(x^(i))) is given by:

    h_w(x^(i)) = 1 / ( 1 + e^−(w_0 + w_1·x_1 + ... + w_n·x_n) )
The following properties are true about the above function:
1) The value of h_w(x^(i)) is always between 0 and 1.
2) As w_0 + w_1·x_1 + ... + w_n·x_n goes to +infinity (that is, the point goes
   very far away from the line on one side), h_w(x^(i)) goes closer and
   closer to 1.
3) As w_0 + w_1·x_1 + ... + w_n·x_n goes to −infinity (that is, the point goes
   very far away from the line on the other side), h_w(x^(i)) goes closer
   and closer to 0.

So with the above formula, h_w(x^(i)) is interpreted as:
The probability that the given point belongs to class 1.
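The three properties can be checked numerically (a minimal sketch; `h` takes the already-computed sum z = w_0 + w_1·x_1 + ... + w_n·x_n):

```python
import math

# The logistic function h = 1 / (1 + e^(-z)).

def h(z):
    return 1.0 / (1.0 + math.exp(-z))

print(0.0 < h(0.0) < 1.0)   # True: always strictly between 0 and 1
print(h(20.0))              # ~1.0: far from the line on one side
print(h(-20.0))             # ~0.0: far from the line on the other side
```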
Now we would like to do the same as before: define a measure J(w) of how
good or bad the current weights are at doing the job.
Then update the weights according to:

    w_j ← w_j − α · ∂J/∂w_j

This is quite similar to linear regression. However, this time our definition
of J(w) is:

    J(w) = −(1/m) · Σ_{i=1..m} [ y^(i) · log h_w(x^(i)) + (1 − y^(i)) · log( 1 − h_w(x^(i)) ) ]
Logistic Regression: Separating line
    J(w) = −(1/m) · Σ_{i=1..m} [ y^(i) · log h_w(x^(i)) + (1 − y^(i)) · log( 1 − h_w(x^(i)) ) ]
You can note the following properties by substituting values above
(recall that h_w(x^(i)) is the probability that x^(i) belongs to class 1). Let's
say there is only one training example. Then:
1. When h_w(x^(i)) = 0, and y^(i) = 0, the error J(w) = 0. (As we would have
   liked.)
2. When h_w(x^(i)) = 1, and y^(i) = 1, the error J(w) = 0. (As we would have
   liked.)
3. When h_w(x^(i)) = 1, and y^(i) = 0, the error J(w) is very large. (As we
   would have liked.)
4. When h_w(x^(i)) = 0, and y^(i) = 1, the error J(w) is very large. (As we
   would have liked.)
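These four cases can be verified for a single training example. Since log(0) is undefined, the sketch below uses probabilities very close to 0 and 1 rather than the exact extremes:

```python
import math

# The logistic error for one example (m = 1):
# J = -[ y*log(h) + (1 - y)*log(1 - h) ]

def J_single(h, y):
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

eps = 1e-12
print(J_single(eps, 0))      # ~0: confident and correct
print(J_single(1 - eps, 1))  # ~0: confident and correct
print(J_single(1 - eps, 0))  # very large: confident and wrong
print(J_single(eps, 1))      # very large: confident and wrong
```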
Logistic Regression: Separating line
So we have defined this time:

    J(w) = −(1/m) · Σ_{i=1..m} [ y^(i) · log h_w(x^(i)) + (1 − y^(i)) · log( 1 − h_w(x^(i)) ) ]
With this J(w), the update rule for the weights comes out as identical to
that in linear regression (quite fantastic!).
    w_j ← w_j − (α/m) · Σ_{i=1..m} ( h_w(x^(i)) − y^(i) ) · x_j^(i)

In words:

    w_j ← w_j − (α/m) · Sum_over_all_training_examples ( y_predicted − y_desired ) · (the feature multiplying w_j)
So to update the weights, we run exactly the same steps as for linear
regression. The y_desired this time is either 0 or 1 for every training
example (its label or class). And the way we compute y_predicted (which is
the same as h_w(x^(i))) for logistic regression is:

    h_w(x^(i)) = 1 / ( 1 + e^−(w_0 + w_1·x_1 + ... + w_n·x_n) )

That's the only difference from linear regression as far as the algorithm of
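Putting it together, here is a sketch of logistic regression training. It is the linear-regression loop with the logistic function applied to each prediction; the leading 1 in each example (so w_0 is handled uniformly) and all data values are illustrative assumptions:

```python
import math

# Batch gradient descent for logistic regression: same update rule as
# linear regression, but y_predicted passes through 1/(1 + e^(-z)).

def train_logistic(X, y, alpha=0.4, iterations=5000):
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iterations):
        preds = [1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
                 for x in X]
        # w_j <- w_j - (alpha/m) * sum_i (pred_i - y_i) * x_j^(i)
        w = [wj - (alpha / m) * sum((p - yi) * x[j]
                                    for p, yi, x in zip(preds, y, X))
             for j, wj in enumerate(w)]
    return w

# One feature; class 0 below x1 = 2.5, class 1 above it.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
w = train_logistic(X, y)

def classify(x1):
    p = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1)))
    return 1 if p > 0.5 else 0

print([classify(x) for x in (1.0, 2.0, 3.0, 4.0)])  # [0, 0, 1, 1]
```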