
Machine Learning

(Summary)
Linear Regression: The Aim
We are given:
1) A bunch of training data, each with n features
2) The desired numeric output against each training example.

We desire:
To make a formula that calculates y from x. The formula is going to be
linear. That is, given a training example, our formula would predict:

ypredict = w0 + x1w1 + x2w2 + ... + xnwn

where the wj are the weights.

Note that if there is only one feature it is an equation of a straight line in


2D: ypredict = w0 + x1w1. If there are two features, it is a plane in 3D:
ypredict = w0 + x1w1 + x2w2
Similarly, with n features, it is an n-dimensional hyper-plane.
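As a sketch in Python (the function name and the numbers are illustrative, not from the original), the prediction formula looks like this:

```python
def predict(weights, features):
    """Linear prediction: w0 + x1*w1 + x2*w2 + ... + xn*wn.

    weights holds n+1 values (weights[0] is w0, the intercept);
    features holds the n feature values of one example.
    """
    y = weights[0]  # w0 is not multiplied by any feature
    for w, x in zip(weights[1:], features):
        y += w * x
    return y

# One feature: a straight line in 2D, y = w0 + x1*w1
print(predict([2.0, 3.0], [4.0]))            # 14.0
# Two features: a plane in 3D, y = w0 + x1*w1 + x2*w2
print(predict([2.0, 3.0, 4.0], [1.0, 5.0]))  # 25.0
```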

Now if there are n features, and there are more than n training
examples, then it might be impossible for a straight line (or hyper-plane)
to pass through all the points.
So we need an error formula, that judges how good our current weights
are.
Linear Regression: The Error
Remember ypredict = w0 + x1w1 + x2w2 + ... + xnwn
Say, we have m training examples. How good is our line (or hyper-plane)
in fitting the given data? We would like to take something like the average
error. A couple of ways to do this are:
1) (1/m) Σi=1..m ( ypredict - ydesired )^2
   This formula takes the average of the squared errors (Mean of the
   Squared Errors).

2) (1/2m) Σi=1..m ( ypredict - ydesired )^2
   More commonly used. It is half of the mean squared error. The 2 in
   the denominator is just so that the algebra simplifies when we
   later differentiate.
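The second (more commonly used) error formula can be sketched as follows (a minimal sketch; the names and numbers are illustrative):

```python
def half_mse(y_predict, y_desired):
    """(1/2m) times the sum of squared errors over the m training examples."""
    m = len(y_predict)
    return sum((yp - yd) ** 2 for yp, yd in zip(y_predict, y_desired)) / (2 * m)

# Example: predictions 25 and 35 against desired outputs 22 and 30
print(half_mse([25, 35], [22, 30]))  # (3**2 + 5**2) / (2*2) = 8.5
```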

So we now know (a) Our formula and (b) We can specify how good
or bad it is.

The idea behind learning now comes. We need to change


the wi appropriately so that the error becomes smaller. It is
an idea that is repeated in all of supervised learning:
Linear Regression, Logistic Regression, Neural Networks.
Linear Regression: The Error (Updating
Notation)
So the only remaining piece is that we want to do the
following:
Look at the error change the weights to reduce the error Look at the
new error Change the weights to reduce the error even further and so
on

First let's change the notation a little bit:

(1/2m) Σi=1..m ( ypredict - ydesired )^2  =  (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2
Here by hw ( x (i ) ) we denote our predicted value on the i th training example.
The w in the subscript means, that this predicted value (obviously) depends
on the current values of all the w j .
Linear Regression: Updating Notation
Remember from the previous slide: error = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2

Notice that x(i) above is not a single number. A training example is
not a single number; it comes with several numbers (features). To see this,
consider what our predicted value on the 3rd training example, x(3), would be:

ypredict = w0 + x1(3)w1 + x2(3)w2 + ... + xn(3)wn

That gives us a further notation:


xj(i), as shown above, is the j-th feature of the i-th training example.

That is, in the formula for ypredict, xj(i) would be multiplied by wj (as shown above).
Error is only a function of weights
Remember: error = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2

Now note something. The training examples, x (i) are given to us. They do
not change. The desired outputs y(i) are given to us. They do not change
either. They are all constants.

The only thing that changes (and we are trying to change that) are the
weights, wj.

So the error is a function of all the weights. So we can write the error as a
function J(w):

J(w) = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2
Reducing the Error
Remember J(w) = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2

Now the question worth a billion dollars: How do we update the weights to
reduce the error?
Suppose there is only one feature. So we have just two weights: w0 and
w1. As the error is a function of these two weights, we can write: J(w) =
J(w0, w1)
So the graph of J(w) may look like a bowl-shaped surface.
[Figure: the two horizontal axes are w0 and w1; the vertical axis is
J(w0, w1), that is, the error J(w).]

Our current values of w0 and w1, could be


anywhere on the plot. However we want to
change the current values of w0, w1, so that
we get to the minimum of the graph.

That is, we wish that w0, w1 have the values


that they have at the minimum. Then
obviously, J(w0, w1) would have achieved its
minimum possible value.
Reducing the Error
Remember J(w) = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2

Here, as an example, our current values of w


are shown to be 2 and 0 (see the dotted lines
on the horizontal axes).

Our error is then given by J(2, 0), which is


shown by the red dot. But we would want our
current position to reach the minimum of the
plot.
Say we take the gradient of the error plot with respect to w0. That is: ∂J/∂w0.

Notice that if we update w0 with w0 := w0 - ∂J/∂w0, then we would
move towards the minimum point on the plot.

That is, we need to move in the opposite direction to the gradient. How
fast we move is determined by a user-specified constant α (the so-called
learning rate), so the full update rule becomes: w0 := w0 - α ∂J/∂w0
Reducing the Error
Remember J(w) = (1/2m) Σi=1..m ( hw(x(i)) - y(i) )^2

And our update rules are: wj := wj - α ∂J/∂wj

The only problem remaining is that of calculus: differentiating J over all the
weights. After all the algebra, here is the result we get after differentiation:

wj := wj - (α/m) Σi=1..m ( hw(x(i)) - y(i) ) xj(i)

That is: wj := wj - (α/m) Sum_over_all_training_examples (ypredicted - ydesired)(the feature multiplying wj)
So we update all the weights (w0, w1, ..., wn) like this. After having done that, we
recalculate all the ypredicted with the new ws.
Then we repeat the procedure: update all the (w0, w1, ..., wn) for the 2nd
time, and so on.
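The update rule above can be sketched as a single gradient-descent step (a minimal sketch; `alpha` is the learning rate α, and all names are illustrative):

```python
def predict(w, x):
    # h_w(x) = w0 + w1*x1 + ... + wn*xn
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def gradient_step(w, X, y, alpha):
    """One simultaneous update of every weight:
    wj := wj - (alpha/m) * sum_i (h_w(x(i)) - y(i)) * xj(i),
    where the 'feature' multiplying w0 is the constant 1."""
    m = len(X)
    errors = [predict(w, x) - yd for x, yd in zip(X, y)]
    new_w = [w[0] - alpha / m * sum(errors)]          # update for w0
    for j in range(len(X[0])):                        # updates for w1..wn
        new_w.append(w[j + 1] - alpha / m *
                     sum(e * x[j] for e, x in zip(errors, X)))
    return new_w
```

All new weights are computed from the old weights before any of them is overwritten, which is what updating all the wj "like this" in one pass means.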
Numeric Example
Let's say there are two features per training example. Also assume that
we have just two training examples:

Training example 1: x1 = 1, x2 = 5, ydesired = 22


Training example 2: x1 = 3, x2 = 6, ydesired = 30

Now our ypredicted = w0 + w1x1 + w2x2.


Assume that initially: w0 = 2; w1 = 3; w2 = 4.

Let's go through one iteration of updating the weights.


First let's calculate ypredicted = hw(x(i)) over the two training examples:

We have hw(x(1)) = 2 + (3)(1) + (4)(5) = 25 (note that our desired value was 22)

We have hw(x(2)) = 2 + (3)(3) + (4)(6) = 35 (note that our desired value was 30)

So we have hw(x(1)) - y(1) = 25 - 22 = 3 (let's call this error1)

And hw(x(2)) - y(2) = 35 - 30 = 5 (let's call this error2)
Numeric Example: Updating the weights
Recall from the previous slide

Training example 1: x1(1) = 1, x2(1) = 5, y(1) = 22


Training example 2: x1(2) = 3, x2(2) = 6, y(2) = 30

hw(x(1)) - y(1) = 25 - 22 = 3 (error1)

And hw(x(2)) - y(2) = 35 - 30 = 5 (error2)

Let's assume that α = 1.

Now let's update the weights:

w0 = w0 - α/m [ (error1)(1) + (error2)(1) ] = 2 - 1/2 ( 3(1) + 5(1) ) = -2
w1 = w1 - α/m [ (error1)(x1(1)) + (error2)(x1(2)) ] = 3 - 1/2 ( 3(1) + 5(3) ) = -6
w2 = w2 - α/m [ (error1)(x2(1)) + (error2)(x2(2)) ] = 4 - 1/2 ( 3(5) + 5(6) ) = -18.5

Note that to update w1 we are multiplying the errors of the training
examples by the corresponding values of the feature x1 (and similarly
for w2 with x2).
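The arithmetic above can be checked with a short script (a sketch reproducing this slide's numbers; variable names are illustrative):

```python
# Training data from this example: two examples, two features each
X = [[1, 5], [3, 6]]
y = [22, 30]
w = [2.0, 3.0, 4.0]        # initial w0, w1, w2
alpha, m = 1.0, len(X)

# Predictions and errors
h = [w[0] + w[1] * x1 + w[2] * x2 for x1, x2 in X]
errors = [hi - yi for hi, yi in zip(h, y)]
print(h)       # [25.0, 35.0]
print(errors)  # [3.0, 5.0]

# One simultaneous weight update (w0's "feature" is the constant 1)
w0 = w[0] - alpha / m * (errors[0] * 1 + errors[1] * 1)              # -2.0
w1 = w[1] - alpha / m * (errors[0] * X[0][0] + errors[1] * X[1][0])  # -6.0
w2 = w[2] - alpha / m * (errors[0] * X[0][1] + errors[1] * X[1][1])  # -18.5
print(w0, w1, w2)
```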


Numeric Example: How to go to the next
iteration
That was one iteration of updates to the weights w.

To do the next iteration we do the following:


1) Take the updated weights, and re-compute hw(x(1)), hw(x(2)), and from
them error1 and error2.
2) Use these errors to update the weights for the second time, using
exactly the same formulae as before:
w0 = w0 - α/m [ (error1)(1) + (error2)(1) ]
w1 = w1 - α/m [ (error1)(x1(1)) + (error2)(x1(2)) ]
w2 = w2 - α/m [ (error1)(x2(1)) + (error2)(x2(2)) ]

Note that the only thing that has changed in the 2nd iteration is
the values of error1 and error2.

Similarly, a third or a fourth iteration of the weight updates can be
carried out as well.
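Those repeated iterations can be sketched as a loop. Note that with α = 1 the steps overshoot on this particular data, so this sketch uses a smaller, illustrative learning rate of 0.01 to show the error actually shrinking:

```python
def predict(w, x):
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def cost(w, X, y):
    """J(w) = (1/2m) * sum of squared errors."""
    m = len(X)
    return sum((predict(w, x) - yd) ** 2 for x, yd in zip(X, y)) / (2 * m)

X, y = [[1, 5], [3, 6]], [22, 30]
w = [2.0, 3.0, 4.0]
alpha, m = 0.01, len(X)

print(cost(w, X, y))  # 8.5 before any updates
for _ in range(500):
    errors = [predict(w, x) - yd for x, yd in zip(X, y)]
    grads = [sum(errors)] + [sum(e * x[j] for e, x in zip(errors, X))
                             for j in range(len(X[0]))]
    w = [wj - alpha / m * g for wj, g in zip(w, grads)]
print(cost(w, X, y))  # far smaller: the repeated updates reduce the error
```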
Logistic Regression
It is almost the same concept as regression --- even though the purpose is
quite different.

Here is what the aim is: We are given training examples. With each
example there is a label: the example either belongs to class 1 or to class 0.
(NOTE: there are only 2 classes.)
(This class 1 and class 0 could be anything, depending on the application.)

So this time ydesired = 0 or 1, depending on the class to which the training


example belongs.
And ypredict is a value between 0 and 1: It is the probability that the
training example belongs to class 1.

So if ydesired = 1, then ypredict should be a value close to 1.


If ydesired = 0, then ypredict should be a value close to 0.

How is it done? What does it mean? Let's see.


Logistic Regression: Separating line
Suppose there are only 2 features per training example:
x1 and x2.

Here we have drawn all the training examples with the


two features. The training examples with class = 1, are
shown as the circles. The training examples with class =
0, are shown as red crosses.
So in this case, what we want is for the algorithm to learn a separating
line. This line would have the equation w 0 + w1x1 + w2x2 = 0.
(Note that every line can be written in this form, with 0 on the right-hand
side. For example, if a line has the equation 3x + 5y - 3 = 7, then it can
also be written as: 3x + 5y - 10 = 0.)
For example this blue line shown on the left, does a good
job in separating the points belonging to class = 1 and
class = 0.

Once such a line has been learned by the algorithm (that


is the weights wj have been learned), we can predict the
class of a new example that comes.
Let's say a new example (with unknown class) is given to
us, with feature values x1 and x2. We would calculate w0
+ w1x1 + w2x2. If this value is > 0, then it belongs to one
class (class 1); if it is < 0, it belongs to the other (class 0).
Logistic Regression: Separating line
In general, we might have more than 2 features. If we have n features, we
would like to learn a good separating hyper-plane with the equation: w0 +
w1x1 + ... + wnxn = 0.
But here is a slight problem. If a new example comes, we said on the
previous slide that in order to predict its class we would compute: w0 +
w1x1 + ... + wnxn (after having learned the weights from the training data).

Let's call w0 + w1x1 + ... + wnxn = val


Now if val > 0, we would predict class 1, and if val < 0 we would predict
class 0.

However, if we do the above, we are throwing away some information.


The value w0 + w1x1 + ... + wnxn tells us not only which side of the
separating plane the point is on, but also how far it is from that separating
plane. The farther the point is from the separating plane, the larger the
magnitude of val, and the more confident we should be that our prediction
is correct, because the point is so far on one side, away from the
uncertainty of being right on top of the separating line.

[Figure: sample points labelled val = -6, val = -1, val = +2, val = +8.]
Logistic Regression: Separating line
So we want to use not just the sign, but also the value of w0 + w1x1 + ... +
wnxn.

For every training example, the value of ypredict (which we also denoted by
hw(x(i))) is given by:

hw(x(i)) = 1 / ( 1 + e^-(w0 + w1x1 + ... + wnxn) )
The following properties are true about the above function:
1) The value of hw(x(i)) is always between 0 and 1.
2) As w0 + w1x1 + ... + wnxn goes to +infinity (that is, the point goes very
   far away from the line on one side), hw(x(i)) goes closer and
   closer to 1.
3) As w0 + w1x1 + ... + wnxn goes to -infinity (that is, the point goes very
   far away from the line on the other side), hw(x(i)) goes closer and
   closer to 0.

So with the above formula, hw(x(i)) is interpreted as:
the probability that the given point belongs to class 1.

So if hw(x(i)) is close to 1, then the probability that it belongs to class 1 is


high.
If hw(x(i)) is close to 0, then the probability that it belongs to class 1 is
low (and hence the probability that it belongs to class 0 is high).
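The function above is the sigmoid, and it can be sketched as follows (names are illustrative, not from the original):

```python
import math

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, x):
    """h_w(x) = 1 / (1 + e^-(w0 + w1*x1 + ... + wn*xn)):
    the probability that example x belongs to class 1."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return sigmoid(z)

print(sigmoid(8))   # close to 1: far on one side of the line
print(sigmoid(0))   # 0.5: exactly on the separating line
print(sigmoid(-6))  # close to 0: far on the other side
```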
Logistic Regression: Separating line
We would like to learn the weights w so that:
For the training examples whose ydesired = 1, we have hw(x(i)) close to 1.
For the training examples whose ydesired = 0, we have hw(x(i)) close to 0.

Now we would like to do the same as before: define a measure J(w) of how
good or bad the current weights are at doing the job.

Then update the weights according to: wj := wj - α ∂J/∂wj

This is quite similar to linear regression. However, this time our definition
of J(w) is:

J(w) = -(1/m) Σi=1..m [ y(i) log hw(x(i)) + (1 - y(i)) log(1 - hw(x(i))) ]
Logistic Regression: Separating line

J(w) = -(1/m) Σi=1..m [ y(i) log hw(x(i)) + (1 - y(i)) log(1 - hw(x(i))) ]

You can note the following properties by substituting values above
(recall that hw(x(i)) is the probability that x(i) belongs to class 1). Let's
say there is only one training example. Then:

1. When hw(x(i)) = 0, and y(i) = 0, the error J(w) = 0. (As we would have
liked)
2. When hw(x(i)) = 1, and y(i) = 1, the error J(w) = 0. (As we would have
liked)
3. When hw(x(i)) = 1, and y(i) = 0, the error J(w) is very large. (As we
would have liked)
4. When hw(x(i)) = 0, and y(i) = 1, the error J(w) is very large. (As we
would have liked)
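The four cases above can be checked numerically (a sketch; `h` is nudged slightly away from exactly 0 and 1, since log(0) is undefined, but in the limit the behavior matches the cases above):

```python
import math

def cross_entropy(h, y, eps=1e-12):
    """J for one example: -( y*log(h) + (1-y)*log(1-h) )."""
    h = min(max(h, eps), 1 - eps)  # keep log() finite at the endpoints
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

print(cross_entropy(0.001, 0))  # near 0: confident and correct (case 1)
print(cross_entropy(0.999, 1))  # near 0: confident and correct (case 2)
print(cross_entropy(0.999, 0))  # large:  confident and wrong   (case 3)
print(cross_entropy(0.001, 1))  # large:  confident and wrong   (case 4)
```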
Logistic Regression: Separating line
So we have defined this time:

J(w) = -(1/m) Σi=1..m [ y(i) log hw(x(i)) + (1 - y(i)) log(1 - hw(x(i))) ]

With this J(w), the update rule for the weights comes out as identical to
that in linear regression (quite fantastic!).

wj := wj - (α/m) Σi=1..m ( hw(x(i)) - y(i) ) xj(i)

That is: wj := wj - (α/m) Sum_over_all_training_examples (ypredicted - ydesired)(the feature multiplying wj)

So to update the weights, we run exactly the same steps as for linear
regression. The ydesired this time is either 0 or 1 for every training
example (its label or class). And the way we compute ypredicted (which is
the same as hw(x(i))) for logistic regression is:

hw(x(i)) = 1 / ( 1 + e^-(w0 + w1x1 + ... + wnxn) )

That's the only difference from linear regression as far as the algorithm of
updating the weights is concerned.
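The logistic update can be sketched exactly like the linear-regression step, with only hw swapped for the sigmoid (a minimal sketch; names are illustrative):

```python
import math

def h(w, x):
    # Logistic prediction: sigmoid of the linear combination
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_step(w, X, y, alpha):
    """Same form as the linear-regression update:
    wj := wj - (alpha/m) * sum_i (h_w(x(i)) - y(i)) * xj(i);
    the only change is that h_w is now the sigmoid."""
    m = len(X)
    errors = [h(w, x) - yd for x, yd in zip(X, y)]
    new_w = [w[0] - alpha / m * sum(errors)]
    for j in range(len(X[0])):
        new_w.append(w[j + 1] - alpha / m *
                     sum(e * x[j] for e, x in zip(errors, X)))
    return new_w
```

Repeating this step on labeled data pushes hw toward 1 on class-1 examples and toward 0 on class-0 examples.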
