
Notes from Andrew Ng's

Machine Learning Course

Travis Johnson

2012-06-14

(All notes from https://www.coursera.org/course/ml. Department of
Engineering Science and Applied Mathematics, Northwestern University,
Evanston, IL, USA.)
1 Introduction
1.1 Welcome
Machine learning:
1. Grew out of work in AI
2. A new capability for computers
Examples of machine learning:
1. Database mining, since we have large datasets from the
growth of automation/web (web click data, medical
records, biology, engineering)
2. Applications that can't be programmed manually:
autonomous helicopters, handwriting recognition, most
of natural language processing, computer vision
3. Self-customizing programs (Amazon, Netflix, etc.)
4. Understanding human learning: brain, real AI.
1.2 What is machine learning?
Machine learning is the field of study that gives computers
the ability to learn without being explicitly programmed.
More formally (from Tom Mitchell): Well-posed learning
problem: A computer program is said to learn from experience
E with respect to some task T and some performance
measure P, if its performance on T, as measured by P,
improves with experience E.
Suppose your email program watches which emails you
do or do not mark as spam, and based on that learns how
to better filter spam. What is the task T in this setting?
Classifying emails as spam or not spam is T. Watching you
label emails as spam or not spam is E, and the fraction of
correctly labeled emails is P.
The machine learning algorithms we'll cover are supervised
learning, unsupervised learning, reinforcement learning, and
recommender systems. We'll also talk about practical usage
of these algorithms.
1.3 Supervised Learning
Let's say you want to predict housing prices. Given
the plot (plot 1743), we could put a straight line through it,
or a quadratic, or something else. This is a supervised learning
problem since we gave the algorithm the right answers, which in
this case were the house prices. This is a regression problem,
since we predict a continuous-valued output, the price.
Another example: breast cancer. Maybe we want to
predict malignant vs. benign, for various tumor sizes. (Plot
1746.) This is a CLASSIFICATION problem, since we try
to predict zero or one. We are not limited to 2 classes; we
could have multiclass as well.
We can also extend this to having multiple variables:
maybe age and tumor size. We might even have tons more!
Maybe even infinitely many variables?
1.4 Unsupervised Learning
Here, our datasets do not include the labels! So, we just
look for structure in the data. We might just try to cluster
data: this is how Google News works, by clustering news
articles together. Some other applications:
1. Organize computing clusters
2. Social network analysis
3. Market segmentation
4. Astronomical data analysis
Cocktail party problem: Put two microphones in a room
with two speakers talking at the same time. The cocktail party
algorithm will split the voices out. Can be done with one
line of code (listing 1).
Listing 1: Cocktail Party Algorithm
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
We'll use Octave, which is basically the same as MATLAB.
Typically we prototype in Octave and then implement things
properly in another language.
2 Linear Regression in One Variable
2.1 Model Representation
We'll use the Portland, OR training set: price (in $1000s)
vs. size (in square feet). This is a supervised
learning example because we have the right answer, and
also an example of regression.
More formally, we have a dataset:
1. m - number of training examples (47 in the Portland, OR
example)
2. x's - input variables/features
3. y's - output variable/target variable.
We will use (x, y) to denote one training example, and
(x^{(i)}, y^{(i)}) to denote the i-th training example.
The hypothesis is a function h that maps from x's to y's.
The first thing we'll need to decide: how do we represent
h? For this problem,

h_\theta(x) = \theta_0 + \theta_1 x    (1)

or h(x) in shorthand. This function fits some straight
line to the data points, and is known as linear
regression with one variable, or univariate linear regression.
2.2 Cost Function
We call the \theta_i the parameters. We'll need to determine \theta_0
and \theta_1 to evaluate things. We'd like to come up with values
\theta_i that correspond to a good fit. Therefore, we'll pick \theta_i
such that h_\theta(x) is close to y for our training examples
(x, y). We can formulate this as a minimization problem:

\min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (2)

where m is the number of training examples and 2m is just
a normalization factor to make the math easier.
Typically we redefine this as: the cost function J(\theta_0, \theta_1)
defined as

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (3)

and then solve the minimization problem

\min_{\theta_0, \theta_1} J(\theta_0, \theta_1).    (4)
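As a concrete sketch, here is one way to compute (3) in Octave; the
name computeCost and the assumption that X is the m x 2 design matrix
with a leading column of ones are mine, not from the lecture:

function J = computeCost(X, y, theta)
  % Squared-error cost, eq. (3).
  m = size(X, 1);            % number of training examples
  errors = X * theta - y;    % h_theta(x^(i)) - y^(i), all i at once
  J = (errors' * errors) / (2 * m);
end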
2.3 Cost function Intuition
Consider a simplified cost function with h_\theta(x) = \theta_1 x (or,
\theta_0 = 0). For fixed \theta_1, h_\theta(x) is a function of x, while
J(\theta_1) is a function of the parameter \theta_1.
We can plot J(\theta_1), which gives (plot 1829), derived from
plotting the error J(\theta_1) for a bunch of different values of
\theta_1. Furthermore, we want to use the \theta_1 that corresponds
to the minimum error, so in this case we pick \theta_1 = 1.
2.4 Cost function Intuition II
If we instead consider the 2d version, we need to consider
both parameters. We can do this with a contour plot.
(See plot 1833.)
This isn't very important though; we can't really use plots
to find the optimum in higher dimensions. We need an algorithm.
2.5 Gradient Descent
Gradient descent is a super sweet algorithm for finding the
minimum of our cost function J. Our problem is

\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)    (5)

where we can initialize \theta_0 = 0 and \theta_1 = 0 and keep changing
these \theta_i to make J lower.
The intuition is that we look locally around the point we're at,
and take a tiny step in a descent direction. One important
point is that it is possible to find local minima that are
not global minima, depending on which point we start at!
Formalizing this definition a bit, we repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    (for j = 0, j = 1)    (6)

The parameter \alpha is called the learning rate. We'll say
more about it. It is important that this update must be
simultaneous.
2.6 Gradient Descent Intuition
We call

\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    (7)

the derivative term. Suppose we just wanted to solve

\min_{\theta_1} J(\theta_1).    (8)

Then gradient descent will give

\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)    (9)

The derivative is the slope of a tangent to the function, which means
that if, at the point where we're evaluating it, the cost function
J(\theta_1) has a positive (negative) slope, then the update will
decrease (increase) \theta_1, as we'd expect; in both cases, we
move toward the minimum.
Now consider \alpha: If \alpha is too small, then gradient descent
will be too slow. On the other hand, if \alpha is too large, then
gradient descent will overshoot the minimum, and may even
fail to converge, or diverge.
What will happen if we initialize gradient descent AT a
local minimum? Well, the derivative will be zero there, so
the update won't move the iterate! Relatedly, since the derivative
shrinks as we approach a minimum, gradient descent automatically
takes smaller steps, so we have no need to decrease \alpha over time.
2.7 Gradient Descent for Linear Regression
Now we need to apply gradient descent to the linear regression
idea, where we had h_\theta(x) = \theta_0 + \theta_1 x and associated cost
function J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2.
We'd like to solve the problem

\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)    (10)

via the gradient descent algorithm, where we repeat until
convergence

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    for j = 0, j = 1.    (11)
The key to applying gradient descent is that we need to
calculate the derivative term:

\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (12)
  = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2    (13)
  = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot \begin{cases} 1 & j = 0 \\ x^{(i)} & j = 1. \end{cases}    (14)
Then the gradient descent algorithm is to repeat until
convergence

\begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} := \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \begin{bmatrix} 1 \\ x^{(i)} \end{bmatrix}.    (15)
It turns out that for the cost function used with linear
regression, we always have a convex function, so gradient
descent will always converge to the global optimum.
This is also known as batch gradient descent; the name
comes from the fact that each step of gradient descent uses all the
training examples.
This will scale better than the linear algebraic answers.
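A minimal Octave sketch of update (15), assuming X carries a leading
column of ones; the name gradientDescent and the arguments alpha and
num_iters are mine:

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  % Batch gradient descent for linear regression, eq. (15).
  m = size(X, 1);
  for iter = 1:num_iters
    % Simultaneous update of every theta_j, using all m examples.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
  end
end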
2.8 What's next?
Two extensions:
1. We can solve for \theta_0 and \theta_1 directly, without needing an
iterative algorithm.
2. We might also use a bunch more features! In this case, we
introduce x_1, x_2, x_3, and x_4 with y still being the target.
We won't be able to plot these, though. We use a feature
matrix X, where each row is a training data point
and each column is a feature, and y is the target vector.
3 Linear Algebra Review
3.1 Matrices and Vectors
A matrix is a rectangular array of numbers:

\begin{bmatrix} 1402 & 191 \\ 1371 & 821 \\ 949 & 1437 \\ 147 & 1448 \end{bmatrix}  or  \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}    (16)

The dimension of the matrix is the number of rows \times the
number of columns. The first matrix is in R^{4 \times 2}, for example.
We call A_{ij} the (i, j) entry of the matrix; it refers to
the i-th row and the j-th column.
A vector is a special matrix: an n \times 1 matrix. We might
say a vector is an element of R^n, and y_i is the i-th row of a
vector. Careful here! Some people use 0-indexing; we (and
Octave/MATLAB) use 1-indexing.
Usually, people use upper case to refer to matrices and
lower case to refer to vectors and scalars.
3.2 Addition and Scalar Multiplication
Matrices can be added elementwise: (A + B)_{ij} = A_{ij} + B_{ij}. We can
only add matrices of identical sizes.
Scalar multiplication is defined as (\lambda A)_{ij} = \lambda A_{ij}.
We can take combinations of operands, and just follow the
typical order of operations.
3.3 Matrix-Vector Multiplication
To multiply an m \times n matrix A by an n \times 1 vector x (resulting
in an m \times 1 vector y), we have

y_i = \sum_{j=1}^{n} A_{ij} x_j    (17)

We can evaluate a hypothesis (say h(x) = -40 + 0.25x) at a whole
data vector x = (2104, 1416, 1534, 852)^T all at once by a
matrix-vector multiplication, by introducing a data matrix with a
column of ones:

\begin{bmatrix} 1 & 2104 \\ 1 & 1416 \\ 1 & 1534 \\ 1 & 852 \end{bmatrix} \begin{bmatrix} -40 \\ 0.25 \end{bmatrix}    (18)

which says that the prediction is just the data matrix times
the parameters \theta! Neat!
3.4 Matrix-Matrix Multiplication
Similarly, we can define the product of an m \times n matrix A
and an n \times o matrix B as the m \times o result matrix C = AB:

C_{ik} = (AB)_{ik} = \sum_{j=1}^{n} A_{ij} B_{jk}    (19)

We can apply multiple competing hypotheses by matrix
multiplication, where the data matrix is as in the last
section and the columns of the hypothesis matrix are the
parameter vectors.
USE LINEAR ALGEBRA LIBRARIES for this!
3.5 Matrix Multiplication Properties
We should be careful with matrix multiplication, though!
There are several sticking points:
1. If A, B are matrices, then in general AB \neq BA!
(Matrix multiplication is not commutative!)
2. Matrix multiplication is associative: ABC = (AB)C =
A(BC); but be careful about orders!
3. Special matrix: I is the identity matrix, and for any
matrix A, AI = IA = A. (Here each I is square, sized so
the product makes sense.)
3.6 Inverse and Transpose
Some matrices have inverses! If A is an m \times m matrix and
it has an inverse, then A A^{-1} = A^{-1} A = I. (Only square
matrices can have inverses.) Matrices without inverses are
called singular or degenerate.
Listing 2: Matrix Inverses in Octave
A = [3, 4; 2, 16];
inverseOfA = pinv(A);
A * inverseOfA    % = I, up to roundoff
inverseOfA * A    % = I, up to roundoff
We can also define the transpose of a matrix, where
(A^T)_{ij} = A_{ji}. Intuitively, it flips the elements of the
matrix across the diagonal.
(Many numerical examples were excluded from these
notes.)
4 Linear Regression with Multiple Variables
4.1 Multiple Features
Suppose that in addition to size, we also had the number
of bedrooms, number of floors, and age of home, and still want
to predict the price. Then introduce the notation:
1. m - number of training examples
2. n - number of features
3. x^{(i)} - input (features) of the i-th training example
4. x^{(i)}_j - value of feature j in the i-th training example
Previously, we had h_\theta(x) = \theta_0 + \theta_1 x. We'll need to instead
make this:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4    (20)

For convenience, we define x_0 = 1, and then take

h_\theta(x) = \theta^T x,    (21)

where x, \theta \in R^{n+1}.
This is called multivariate linear regression.
4.2 Gradient Descent for Multiple Variables
We'll start thinking of \theta as an (n + 1)-dimensional
vector, so that we can write

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (22)
  = \frac{1}{2m} \sum_{i=1}^{m} (\theta^T x^{(i)} - y^{(i)})^2    (23)
  = \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x^{(i)}_j - y^{(i)} \right)^2    (24)

Again, we have that gradient descent is of the form: repeat

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j    (25)

where we do simultaneous updates of \theta_j for j = 0, ..., n.
4.3 Gradient Descent in Practice I - Feature
Scaling
Gradient descent will converge more quickly if we scale
variables so that everything is on a similar scale. For
example, if x_1 is in the range 0-2000 and x_2 is in the range
1-5, we will have poor performance, due to oscillation. So
we scale so that steps can be much larger. For
example, we might define:
1. x_1 = size(feet^2) / 2000
2. x_2 = (number of bedrooms) / 5.
More generally, we hope to get every feature into approximately
-1 \leq x_i \leq 1. (We only care about scaling to
about order 1.) More generally, we might also do mean
normalization! There we replace x_i by x_i - \mu_i, or by
(x_i - \mu_i) / s_i, where \mu_i is the mean of feature i and s_i is
its range or standard deviation.
This will make gradient descent run much faster in
practice.
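A sketch of mean normalization in Octave; featureNormalize is my name,
and I normalize each column of X by its mean and standard deviation:

function [X_norm, mu, sigma] = featureNormalize(X)
  % Scale each feature (column) to zero mean and unit spread.
  m = size(X, 1);
  mu = mean(X);       % 1 x n row vector of column means
  sigma = std(X);     % 1 x n row vector of column spreads
  X_norm = (X - repmat(mu, m, 1)) ./ repmat(sigma, m, 1);
end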
4.4 Gradient Descent in Practice II - Learning
Rate
We will need to make sure that gradient descent is working
properly, at least in terms of the value of \alpha.
One key idea: plot J(\theta) vs. the number of iterations.
This should very clearly be a decreasing function! In fact,
it should decrease after every iteration. We can also tell
from this figure when the iteration has converged.
We can also use this idea for an automatic convergence
test! We might say that if J(\theta) decreases by less than 10^{-3}
in an iteration then we are probably done.
If we see an increasing or even nondecreasing function,
we might need to pick a smaller \alpha. It is possible to prove
that for sufficiently small \alpha, J(\theta) is guaranteed to decrease
on every iteration, but we also want to be careful that the
algorithm is not converging too slowly.
Practically, just run several values of \alpha like
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ... and make sure that we
found one that is too slow and one that is too large. Then
we know that we have found a good learning rate to use.
4.5 Features and Polynomial Regression
We might want to model housing prices via frontage and
depth. With linear regression, we can define new variables:
one example is to take x as the product of frontage and
depth. So by defining new features, we might get a better
fit for our data. For example, we might pick a quadratic or
cubic model. We can actually do this incredibly easily with
the framework we've created, just by changing our h_\theta(x)!
One easy way to do this for a cubic model is to create
new data where x_1 is the size, x_2 is the squared size of
the house, and x_3 is the size cubed. But then
scaling gets REALLY important!
We can also try totally crazy things, like h_\theta(x) =
\theta_0 + \theta_1 (size) + \theta_2 \sqrt{size}.
4.6 Normal Equation
Gradient descent might be subpar, since we can use the
normal equation method to solve for \theta analytically.
The intuition is that if we had something like J(\theta) =
a\theta^2 + b\theta + c and we want to minimize it, then we should
take the derivative, set it equal to zero, and solve for \theta:

\frac{d}{d\theta} J(\theta) = ... = 0    (28)

In the case of multiple dimensions, we will get many
equations. Suppose that \theta \in R^{n+1}, and that

J(\theta_0, \theta_1, ..., \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (29)

We can still take the derivative (except that it's now the
gradient) and set it equal to zero:

\frac{\partial}{\partial \theta_j} J(\theta) = ... = 0  for every j    (30)

and now solve for \theta_0, \theta_1, ..., \theta_n.
To start, we create the data matrix:

X = \begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}    y = \begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}    (31)

Then we can find that the analytic solution ends up being

\theta = (X^T X)^{-1} X^T y.    (32)
In the general case, consider that we have m examples
(x^{(i)}, y^{(i)}), each of which has n features. Since

x^{(i)} = \begin{bmatrix} x^{(i)}_0 \\ x^{(i)}_1 \\ \vdots \\ x^{(i)}_n \end{bmatrix}    (33)

we create a design matrix X as:

X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}    (34)
Gradient descent: need to choose \alpha; needs many iterations;
but works well even when n is large.
Normal equations: no need to choose \alpha; no need to iterate;
but we need to compute (X^T X)^{-1}, which is slow if n is large.
Table 1: Tradeoffs between gradient descent and the normal
equations.
which is an m \times (n + 1) matrix. As always, y is the vector
of target values (sale prices in our example). Then:

\theta = (X^T X)^{-1} X^T y    (35)

We could implement this (shittily!) in Octave:
Listing 3: Octave Normal Equations
pinv(X' * X) * X' * y
We talked last time about feature scaling; this is not
necessary for the normal equation method!
A quick, hard, 2012 rule-of-thumb: we would probably
use the normal equations up to 1000-10000 or so variables, and
then switch to gradient descent.
4.7 Normal Equation Noninvertibility
One important note about the normal equation: the matrix
being inverted can be noninvertible! What if X^T X is
non-invertible (singular/degenerate)?
When will this happen?
1. Redundant features (linearly dependent features)
2. Too many features (i.e., m \leq n: trying to fit 100 parameters
with 10 data points...). Delete some features or use
regularization.
5 Octave Tutorial
5.1 Basic Operations
Can use basic operations: +, -, *, /, and ^.
Comparisons are typically: ==, ~=, &&, ||, or xor.
Can overwrite the prompt with: PS1('>> ');
Semicolons suppress output.
Can use the disp command to display things, or use sprintf.
Can also use 1:0.1:2 notation, or the commands ones,
zeros, rand (uniform from 0 to 1), randn (normally
distributed), and eye (identity matrix). All of these can make
an arbitrary size.
5.2 Moving Data Around
How do you save and load data in Octave? Use commands
save (maybe with -ascii), load, size, length, pwd, who,
whos, clear.
Might use subset notation: A(3, 2) returns the (3, 2)
element of the matrix, A(2, :) returns every element along the
second row, A([1 3], :) returns every column of the first and third
rows. You can use this notation for assignments or accesses.
Can also put all elements into a single vector by A(:),
or concatenate matrices by C = [A B] (horizontal) or
C = [A; B] (vertical).
5.3 Computing on Data
Some important computational operations, as shown
in listing (4), are initializing matrices, multiplication,
elementwise multiplication,
Listing 4: Computing on Data
A=[1 2; 3 4; 5 6 ] ;
B=[11 12; 13 14; 15 1 6 ] ;
C = [ 1 1; 2 2 ] ;
AC
A . B
A . 2
v =[ 1 ; 2 ; 3 ] ;
1. / v
1. /A
log ( v)
exp( v)
abs ( v)
v % same as 1v
v + ones ( length( v ) , 1)
v + 1 % s i mpl i e r way
A % t r ans pos e
(A )
a=[1 14 2 0 . 5 ]
val = max( a )
[ val , i nd]=max( a )
a < 3 % bool ean vec
find ( a<3) % i ndi c e s
A = magic( 3)
[ r , c ] = find (A>=7)
help find
sum( a )
prod( a )
f l oor ( a )
cei l ( a )
rand( 3)
max( rand( 3) , rand( 3) )
max(A, [ ] , 1 ) % al ong f i r s t di mensi on
max(A, [ ] , 2 ) % al ong second di mensi on
max(max(A) )
max(A( : ) )
A=magic( 9)
sum(A, 1 )
sum(A, 2 )
sum(sum(A. eye ( 9 ) ) )
sum(sum(A. fl i pud ( eye ( 9 ) ) ) )
A=magic( 3)
Ainv=pinv(A)
AAinv % i d e nt i t y
5.4 Plotting Data
Plotting the cost function can help ensure that the learning
algorithm is converging.
5.5 Control Statements: for, while, and if statements
5.6 Vectorization
Typical languages have numerical linear algebra libraries
which are much more efficient than writing your own
routine for (e.g.) matrix multiplication.
Listing 5: Plotting
t = [0:0.01:0.98];
y1 = sin(2*pi*4*t);
y2 = cos(2*pi*4*t);
plot(t, y1);
plot(t, y2);
plot(t, y1); hold on;
plot(t, y2);
xlabel('time');
ylabel('value');
legend('sin', 'cos');
title('my plot');
print -dpng 'myPlot.png';
help plot
close
figure(1); plot(t, y1);
figure(2); plot(t, y2);
subplot(1, 2, 1);   % divides plot into 1x2 grid, access first element
plot(t, y1);
subplot(1, 2, 2);
plot(t, y2);
axis([0.5 1 -1 1]);
clf
A = magic(5);
imagesc(A);
imagesc(A), colorbar, colormap gray;
Listing 6: for loops
v = zeros(10, 1);
for i = 1:10
  v(i) = 2 * i;
end
Listing 7: while loop
i = 1;
while i <= 5,
  v(i) = 100;
  i = i + 1;
end;
Consider

h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x    (36)

We can implement this two ways, as shown in (13). What
would this look like in C++? See (14). Let's consider a
more sophisticated example. If we have

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j    (37)

we can implement the whole update as \theta := \theta - \alpha \delta,
where \delta = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}.
5.7 Working on and Submitting programming
exercises
The homework submission system here is pretty slick. Just
run the submit script.
Listing 8: break
i = 1;
while true,
  v(i) = 999;
  i = i + 1;
  if i == 6,
    break;
  end;
end;
Listing 9: if/elseif/else
if v(1) == 1,
  disp('the value is one');
elseif v(1) == 2,
  disp('the value is two');
else
  disp('the value is not one or two');
end
Listing 10: a function
function y = squareThisNumber(x)
  y = x^2;
Listing 11: multiple return values
function [y1, y2] = squareAndCubeThisNumber(x)
  y1 = x^2;
  y2 = x^3;
Listing 12: a cost function
function J = costFunctionJ(X, y, theta)
  m = size(X, 1);
  predictions = X * theta;
  sqrErrors = (predictions - y).^2;
  J = 1/(2*m) * sum(sqrErrors);
Listing 13: vectorized
%% unvectorized
prediction = 0.0;
for j = 1:n+1,
  prediction = prediction + theta(j) * x(j);
end
%% vectorized
prediction = theta' * x;
Listing 14: the C++ version
// unvectorized
double prediction = 0.0;
for (int j = 0; j <= n; j++)
  prediction += theta[j] * x[j];
// vectorized
double prediction = theta.transpose() * x;
6 Logistic Regression
6.1 Classification
We'll develop logistic regression. Some examples:
1. Email: spam/not spam
2. Online transactions: fraudulent (yes/no)
3. Tumor: malignant/benign.
We say here that

y \in \{0, 1\}    (38)

where 0 is the negative class and 1 is the positive class (e.g.,
benign and malignant, respectively).
We can also consider multi-class algorithms; that comes later for us.
Consider the malignant example. One idea: threshold the
classifier output h_\theta(x) at 0.5, so if h_\theta(x) \geq 0.5, predict
y = 1, and otherwise y = 0. Linear regression is kinda silly
here, because we would generate pretty bad predictions for
fairly simple cases. Applying linear regression is a bad idea
in general.
Here's another funny thing: the classification wants the
labels y = 0 and y = 1, but h_\theta(x) can output arbitrarily
positive/negative things. So it seems like linear regression
is a really bad idea.
So we'll develop logistic regression, which satisfies
0 \leq h_\theta(x) \leq 1. Don't be confused by the "regression" title; it's
just a name, and it is really classification!
6.2 Hypothesis Representation
Earlier we said that we wanted 0 \leq h_\theta(x) \leq 1. Now we
define

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \quad g(z) = \frac{1}{1 + e^{-z}}    (39)

which is the logistic function, which happens to be a
sigmoid function. See plot (2118).
We'll take h_\theta(x) to be the estimated probability that
y = 1 on input x. That is, h_\theta(x) = p(y = 1 | x; \theta), the
probability that y = 1, given x, parameterized by \theta. By
basic probability laws, P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1, so
we can just subtract from 1 to get the other probability.
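A small Octave sketch of (39); the function name sigmoid is mine:

function g = sigmoid(z)
  % Logistic function, applied elementwise: works for scalars,
  % vectors, and matrices alike.
  g = 1 ./ (1 + exp(-z));
end

With this, sigmoid(X * theta) evaluates the hypothesis on every
training example at once.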
6.3 Decision boundary
We decided that h_\theta(x) = p(y = 1 | x; \theta), so we can kind of
imagine predicting
1. y = 1 if h_\theta(x) \geq 0.5
2. y = 0 if h_\theta(x) < 0.5
Clearly, g(z) \geq 0.5 when z \geq 0, so h_\theta(x) = g(\theta^T x) \geq 0.5
whenever \theta^T x \geq 0. By a similar argument, we predict y = 0
when \theta^T x < 0.
(Plot: 2123.) If we have a nice enough classification
problem, we might be able to easily come up with \theta.
Suppose it might end up being \theta = (-3, 1, 1)^T in

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)    (40)

so we say that we predict y = 1 if -3 + x_1 + x_2 \geq 0, or
equivalently x_1 + x_2 \geq 3. We'll predict y = 0 if x_1 + x_2 < 3.
The line where x_1 + x_2 = 3 is therefore the decision
boundary for this problem.
It's clear that the decision boundary is determined by
the hypothesis and NOT the training set.
We can also have nonlinear decision boundaries! (Plot
2127.) We can introduce

h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)    (41)

Then (with \theta = (-1, 0, 0, 1, 1)^T) we would predict y = 1 if
-1 + x_1^2 + x_2^2 \geq 0, and therefore we should predict y = 1 if
x_1^2 + x_2^2 \geq 1, so the circle x_1^2 + x_2^2 = 1 is the decision
boundary.
Once again, the decision boundary is a property of the
hypothesis, NOT the dataset! The dataset is used to find
the hypothesis, but the decision boundary itself depends
only on the hypothesis.
We can come up with crazy nonlinear decision boundaries
to fit pretty much anything.
6.4 Cost Function
Okay, given a training set of (x^{(i)}, y^{(i)}), i = 1, 2, ..., m, we
have m examples x \in R^{n+1} with x_0 = 1 and y \in \{0, 1\}, and
the hypothesis function

h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)},    (42)

how would we choose the parameters \theta?
First, carry over the old cost function:

Cost(h_\theta(x), y) = \frac{1}{2} (h_\theta(x) - y)^2    (43)

and then

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})    (44)

But this is then a nonconvex function! The sigmoid ends
up messing us up. So we'll instead use the cost function

Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0. \end{cases}    (45)

Then we have a convex function again! Hooray!
This has some intuitive happiness: the cost is zero if y = 1
and h_\theta(x) = 1. But as h_\theta(x) \to 0 (still with y = 1),
Cost \to \infty. This captures the intuition that a wrong
prediction when y = 1 should have a huge cost, which is
important in certain applications!
Now plot the function for y = 0: -\log(1 - h_\theta(x)). This
cost function blows up as h_\theta(x) \to 1.
6.5 Simplified Cost Function and Gradient Descent
There is a bit simpler way of writing the cost function. By
the end, we'll have a full gradient descent implementation.
Because either y = 0 or y = 1, we can remove the cases by
writing it as one equation. In particular,

Cost(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))    (46)

This reduces to the previous cases if y = 0 or y = 1, so our
equation is now much simpler: no cases!
Then we can write the logistic regression cost function:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})    (47)
  = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]    (48)

(This can be derived by the principle of maximum likelihood
estimation, and it is convex!)
Then to fit the parameters \theta, we just solve

\min_{\theta} J(\theta)    (49)

and to make a prediction given a new x, we just output

h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}.    (50)

We can think of this as outputting the probability
p(y = 1 | x; \theta).
Gradient descent will keep working as usual: we repeat

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)    (51)

We will find again that

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j    (52)

where we simultaneously update the \theta_j. This algorithm looks
identical to linear regression! What's changed? Well, h_\theta(x) is
different.
We want to apply the same convergence-checking rule for
logistic regression. Ideally we want to use a vectorized
implementation.
We would probably use feature scaling again.
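A vectorized Octave sketch of (48) and the gradient in (52), reusing
the sigmoid function above; the name logisticCost and the argument
layout are mine:

function [J, grad] = logisticCost(theta, X, y)
  % Cross-entropy cost (48) and its gradient, all examples at once.
  m = size(X, 1);
  h = sigmoid(X * theta);                          % m x 1 predictions
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
  grad = (1/m) * X' * (h - y);                     % matches eq. (52)
end

A function of this shape can be handed to fminunc (next section) via an
anonymous wrapper such as @(t) logisticCost(t, X, y).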
6.6 Advanced Optimization
Here's an alternative view: given \theta, we have code that can
compute J(\theta) and \frac{\partial}{\partial \theta_j} J(\theta) for j = 0, 1, ..., n.
Algorithms that can use this: gradient descent, conjugate gradient,
BFGS, and L-BFGS. Some advantages of the latter three:
1. No need to manually pick \alpha.
2. Often faster than gradient descent.
One huge disadvantage: more complex.
Ng's advice: it is possible to use these algorithms without really
understanding them. We can implement them using library
calls.
One example: suppose that \theta = (\theta_1, \theta_2)^T with
J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2, and then

\frac{\partial}{\partial \theta_1} J(\theta) = 2(\theta_1 - 5), \quad \frac{\partial}{\partial \theta_2} J(\theta) = 2(\theta_2 - 5)    (53)

Then we can write the code as shown in (15) and (16).
Listing 15: Cost Function
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + ...
         (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
We'll need to change this code for it to work with logistic
regression (or even just linear regression!), but it will
just be a similar costFunction.
Listing 16: Minimization
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, funcVal, exitFlag] ...
  = fminunc(@costFunction, initialTheta, options)
6.7 Multiclass Classification: One-vs-all
Some multiclass classification examples:
1. Email foldering/tagging: work, friends, family, hobby
2. Medical diagnoses: not ill, cold, flu
3. Weather: sunny, cloudy, rain, snow.
For binary classification, we can hopefully separate with a
hyperplane. Multiclass classification is not really separable
by one hyperplane, so we use an idea called one-vs-all
classification. Here we make two new classes: one is the old first
class, the second is all other datapoints. We do this for
each class. For three classes, this gives us three new
hypothesis functions:
1. h^{(1)}_\theta(x), which tries to estimate P(y = 1 | x; \theta),
the probability of being in class 1.
2. h^{(2)}_\theta(x), which tries to estimate P(y = 2 | x; \theta),
the probability of being in class 2.
3. h^{(3)}_\theta(x), which tries to estimate P(y = 3 | x; \theta),
the probability of being in class 3.
So then we just train all three classifiers and evaluate each
of them on a new input; then pick the one with the highest
returned probability, since it is the most likely class.
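A sketch of the prediction step in Octave, assuming all_theta is a
K x (n+1) matrix whose k-th row holds the fitted parameters of h^{(k)},
X carries a leading column of ones, and sigmoid is as before; the
names are mine:

function p = predictOneVsAll(all_theta, X)
  % For each example, pick the class whose classifier is most confident.
  h = sigmoid(X * all_theta');      % m x K matrix of probabilities
  [maxval, p] = max(h, [], 2);      % p(i) = index of best class for row i
end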
7 Regularization
7.1 The Problem of Overfitting
1. Underfit - high bias - a high preconception that housing
prices will be linear (warning signs: high training error, high
generalization error)
2. Just right
3. Overfit - high variance - not enough data (warning signs:
low training error, high generalization error)
Overfitting fits the training set very well but fails to
generalize to other examples.
If overfitting:
1. Reduce the number of features: either manually select
which features to keep or use model selection algorithms
2. Regularization: keep all features, but reduce the
magnitude/values of the parameters \theta_j; works well when we
have lots of features.
7.2 Cost Function
Intuition: suppose we penalize \theta_3 and \theta_4 to be really small.
This gives us (again) a (basically) quadratic fit. Small
values of the \theta_i give essentially a simpler hypothesis, which is
less prone to overfitting.
In regularization, we will add a penalty term:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]    (54)

The regularization parameter \lambda balances the data fit against
the size of the parameters. Picking a too-large \lambda will
essentially give us a constant hypothesis: a massive underfit, or
too high a bias! So some care should be given to picking \lambda in
a good way.
7.3 Regularized Linear Regression
For regularized linear regression, gradient descent becomes:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_0    (55)
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j + \frac{\lambda}{m} \theta_j \right]    (56)

For the normal equation approach, we created the design
matrix X and target vector y, and then found that the
minimum of J(\theta) occurs at

\theta = \left( X^T X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^T y    (57)

Last time we talked about non-invertibility. If we have
m \leq n, then X^T X is non-invertible, or singular, and pinv
will probably not give a super sweet solution. But it turns
out that X^T X plus \lambda times the matrix above is never
singular for \lambda > 0.
7.4 Regularized Logistic Regression
Last time we saw that logistic regression can be prone to
overfitting, so we can modify it to have regularization:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2    (58)

Now we do the same trick for the gradient:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_0    (59)
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j + \frac{\lambda}{m} \theta_j \right]    (60)
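As a sketch, the logisticCost function above could be extended with the
penalty of (58); the delicate point is that theta(1) (which is \theta_0
in Octave's 1-based indexing) must not be penalized. The name
logisticCostReg is mine:

function [J, grad] = logisticCostReg(theta, X, y, lambda)
  % Regularized cross-entropy cost (58) and gradient (59)-(60).
  m = size(X, 1);
  h = sigmoid(X * theta);
  reg = theta; reg(1) = 0;        % do not penalize theta_0
  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
      + (lambda/(2*m)) * sum(reg.^2);
  grad = (1/m) * X' * (h - y) + (lambda/m) * reg;
end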
8 Neural Networks: Representation
8.1 Non-linear hypothesis
Currently neural nets are the state of the art. We will
use neural networks for nonlinear classification tasks,
particularly with a large number of features.
Including all quadratic terms is not a viable idea; the
features would scale like O(n^2), which would be prone to
overfitting (and expensive). Including all cubic terms would of
course be even worse!
Computer vision is one very hard case. Suppose we have
50x50 pixel images, so we have n = 2500 features in
grayscale, or 7500 for RGB. Quadratic features then would
be on the order of 3 million. Going to 100x100 pixels with
quadratic features gives 50 million features!
8.2 Neurons and the Brain
Neural networks are a pretty old algorithm. The origins were
in algorithms that tried to mimic the brain; popularity grew
in the 80s-90s, but diminished in the late 90s. We're
experiencing a resurgence, though!
We take the "one learning algorithm" hypothesis: the
brain can rewire itself to process sights or sounds or touch.
So instead of trying to come up with algorithms for every
task we would want to do, we want to come up with the
learning algorithm! Some evidence for this:
1. seeing with your tongue
2. human echolocation
3. haptic belt for direction sense
4. implanting a third eye
8.3 Model Representation I
Basically, neurons take input through the dendrites and
output via axons. They communicate with spikes of
electricity. Neurons move these spikes around.
We simulate this with a much simpler model: a logistic
unit which takes a bunch of inputs (figure 1245). Sometimes
we include a bias unit where x_0 = 1. We call the
sigmoid/logistic function an activation function. We also might
call the parameters \Theta weights. Neural networks just string
together bunches of these units.
Assembled as follows: Layer 1: input layer. Layer 2:
hidden layer. Layer 3: output layer. We denote
1. a^{(j)}_i - activation of unit i in layer j.
2. \Theta^{(j)} - matrix of weights controlling the function mapping
from layer j to layer j + 1.
For 3 nodes of input, 3 nodes of hidden, and one output,
we have the equations:

a^{(2)}_1 = g(\Theta^{(1)}_{10} x_0 + \Theta^{(1)}_{11} x_1 + \Theta^{(1)}_{12} x_2 + \Theta^{(1)}_{13} x_3)    (63)
a^{(2)}_2 = g(\Theta^{(1)}_{20} x_0 + \Theta^{(1)}_{21} x_1 + \Theta^{(1)}_{22} x_2 + \Theta^{(1)}_{23} x_3)    (64)
a^{(2)}_3 = g(\Theta^{(1)}_{30} x_0 + \Theta^{(1)}_{31} x_1 + \Theta^{(1)}_{32} x_2 + \Theta^{(1)}_{33} x_3)    (65)
a^{(3)}_1 = g(\Theta^{(2)}_{10} a^{(2)}_0 + \Theta^{(2)}_{11} a^{(2)}_1 + \Theta^{(2)}_{12} a^{(2)}_2 + \Theta^{(2)}_{13} a^{(2)}_3)    (66)

where we note that h_\Theta(x) = a^{(3)}_1.
Then \Theta^{(1)} \in R^{3 \times 4}. If the network has s_j units in
layer j and s_{j+1} units in layer j + 1, then \Theta^{(j)} will be
of dimensions s_{j+1} \times (s_j + 1).
8.4 Model Representation II
We need a vectorized implementation; as we've written it,
it's a little unwieldy.
We'll introduce some notation:

z^{(2)}_1 = \Theta^{(1)}_{10} x_0 + \Theta^{(1)}_{11} x_1 + \Theta^{(1)}_{12} x_2 + \Theta^{(1)}_{13} x_3    (67)
z^{(2)}_2 = \Theta^{(1)}_{20} x_0 + \Theta^{(1)}_{21} x_1 + \Theta^{(1)}_{22} x_2 + \Theta^{(1)}_{23} x_3    (68)
z^{(2)}_3 = \Theta^{(1)}_{30} x_0 + \Theta^{(1)}_{31} x_1 + \Theta^{(1)}_{32} x_2 + \Theta^{(1)}_{33} x_3    (69)

and also that

a^{(2)}_1 = g(z^{(2)}_1)    (70)
a^{(2)}_2 = g(z^{(2)}_2)    (71)
a^{(2)}_3 = g(z^{(2)}_3)    (72)
Table 2: Evaluation of a neural net for logical AND.
x_1  x_2  h_\Theta(x)
0    0    g(-30) ≈ 0
0    1    g(-10) ≈ 0
1    0    g(-10) ≈ 0
1    1    g(+10) ≈ 1

Table 3: Evaluation of a mystery neural net. Here the
weights are \Theta_{10} = -10, \Theta_{11} = +20, and \Theta_{12} = +20, so
h_\Theta(x) = g(-10 + 20x_1 + 20x_2). From the table, it is clear
that this is x_1 OR x_2.
x_1  x_2  h_\Theta(x)
0    0    g(-10) ≈ 0
0    1    g(+10) ≈ 1
1    0    g(+10) ≈ 1
1    1    g(+30) ≈ 1
Now notice that the z_i equations look like a matrix
multiplication! Define

x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad z^{(2)} = \begin{bmatrix} z^{(2)}_1 \\ z^{(2)}_2 \\ z^{(2)}_3 \end{bmatrix}    (73)

and then note that

z^{(2)} = \Theta^{(1)} x, \quad a^{(2)} = g(z^{(2)})    (74)

To add a bit of consistency, we define the inputs x = a^{(1)},
so that

z^{(2)} = \Theta^{(1)} a^{(1)}, \quad a^{(2)} = g(z^{(2)})    (75)

We also take a^{(2)}_0 = 1. Then we have that

z^{(3)} = \Theta^{(2)} a^{(2)}, \quad h_\Theta(x) = a^{(3)} = g(z^{(3)})    (76)

This is called forward propagation!
Intuitively, this looks a LOT like standard logistic
regression, but we sort of learn our own features, determined
by the weight matrix; so we're adding a lot of complexity.
We can have other network architectures. Anything that is
not an input layer or an output layer is called a hidden layer.
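A sketch of forward propagation (74)-(76) in Octave for one example x
(a column vector without the bias entry), assuming weight matrices
Theta1 and Theta2 and the sigmoid function from before; the variable
names are mine:

% Forward propagation for a 3-layer network, one example.
a1 = [1; x];              % input activations plus bias unit
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];    % hidden activations plus bias unit
z3 = Theta2 * a2;
h  = sigmoid(z3);         % h_Theta(x) = a3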
8.5 Examples and Intuitions
If we want to learn AND, then we could take the bias weight to
be -30, and each weight on x_1, x_2 to be +20. Then
h_\Theta(x) = g(-30 + 20x_1 + 20x_2). Then consider table (2).
8.6 Examples and Intuitions II
It is quite simple to find a logical NOT network: just take
+10 bias and -20 weight, so that h_\Theta(x) = g(10 - 20x_1).
To find a network that represents the boolean function
(NOT x_1) AND (NOT x_2), consider the possibilities in table
(4). Clearly the boolean function is represented best by
h^1_\Theta(x).
Finally, consider the XOR/XNOR problem (figure 1307).
We can express x_1 XNOR x_2 as the OR of x_1 AND x_2
with (NOT x_1) AND (NOT x_2). Putting together the pieces
we've already built makes this possible.
One fun example: Yann LeCun's handwriting recognition
code for reading zipcodes.
Table 4: Here we have: h^1_\Theta(x) = g(10 - 20x_1 - 20x_2),
h^2_\Theta(x) = g(30 - 20x_1 - 20x_2), h^3_\Theta(x) = g(-10 + 20x_1 + 20x_2),
and h^4_\Theta(x) = g(20 - 20x_1 - 30x_2).
x_1  x_2  h^1_\Theta(x)   h^2_\Theta(x)   h^3_\Theta(x)   h^4_\Theta(x)
0    0    g(10) ≈ 1    g(30) ≈ 1    g(-10) ≈ 0   g(20) ≈ 1
0    1    g(-10) ≈ 0   g(10) ≈ 1    g(10) ≈ 1    g(-10) ≈ 0
1    0    g(-10) ≈ 0   g(10) ≈ 1    g(10) ≈ 1    g(0) = 0.5
1    1    g(-30) ≈ 0   g(-10) ≈ 0   g(30) ≈ 1    g(-30) ≈ 0
8.7 Multiclass Classification
For multiclass classification with neural networks, we
basically build on the one-vs-all example by having an
output unit for each possible class; that is, we want
[h_\Theta(x)]_i = 1 if and only if x is from class i. E.g.,
picture classification might try to distinguish pedestrian vs.
car vs. truck vs. motorcycle.
9 Neural Networks: Learning
9.1 Cost Function
Now we'd like to find out how to automatically learn
the parameters of the network instead of building them
ourselves. To do this, we'll need a cost function, as always!
Our given data is of the form (x^{(i)}, y^{(i)}) for i = 1, ..., m.
Let L be the total number of layers in the network, and s_l
the number of units (not counting the bias unit) in layer l.
9.1.1 Binary Classification
Here it is easy: our classes are y = 0 or y = 1, and we have
1 output unit. Then clearly h_\Theta(x) \in R, s_L = 1, and K = 1.
9.1.2 Multi-class Classification
Suppose K classes. Then y \in R^K. For example:

\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}    (77)

might represent pedestrian, car, motorcycle, and truck,
respectively. This requires K output units! Also h_\Theta(x) \in R^K
and s_L = K, for K \geq 3.
The cost function for neural nets will be a generalization
of the logistic one, so we have

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y^{(i)}_k \log(h_\Theta(x^{(i)}))_k + (1 - y^{(i)}_k) \log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta^{(l)}_{ji})^2    (78)

We have defined h_\Theta(x) \in R^K, and (h_\Theta(x))_k is the k-th
output.
9.2 Backpropagation Algorithm
Since we'd like to compute \min_\Theta J(\Theta) using an advanced
optimization method, we will need to compute
1. J(\Theta)
2. \frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta).
Quick recap: given one training example (x, y), we first
apply forward propagation (for a 4-layer network):

a^{(1)} = x    (81)
z^{(2)} = \Theta^{(1)} a^{(1)}    (82)
a^{(2)} = g(z^{(2)})    (83)
z^{(3)} = \Theta^{(2)} a^{(2)}    (84)
a^{(3)} = g(z^{(3)})    (85)
z^{(4)} = \Theta^{(3)} a^{(3)}    (86)
a^{(4)} = g(z^{(4)})    (87)
This allows us to compute all the activations. We'll then
compute \delta^{(l)}_j, which is the error of node j in layer l. Then

\delta^{(4)}_j = a^{(4)}_j - y_j    (89)
\delta^{(3)} = (\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)})    (90)
\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})    (91)

where g'(z^{(i)}) = a^{(i)} .* (1 - a^{(i)}). It is possible to
prove that

\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = a^{(l)}_j \delta^{(l+1)}_i    (93)
So the backpropagation algorithm, given a training set
(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}): set \Delta^{(l)}_{ij} = 0
for all l, i, j. Then for i = 1 to i = m (for each training point):
1. Set a^{(1)} = x^{(i)}
2. Perform forward propagation to compute a^{(l)} for
l = 2, 3, ..., L
3. Using y^{(i)}, compute \delta^{(L)} = a^{(L)} - y^{(i)}.
4. Compute \delta^{(L-1)}, \delta^{(L-2)}, ..., \delta^{(2)}
5. Let \Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j \delta^{(l+1)}_i. (Vectorization possible:
\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T.)
Then finally, we calculate

D^{(l)}_{ij} := \frac{1}{m} \Delta^{(l)}_{ij} + \begin{cases} \lambda \Theta^{(l)}_{ij} & \text{if } j \neq 0 \\ 0 & \text{if } j = 0. \end{cases}    (94)

And it is possible to prove that \frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij}.
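A sketch of this accumulation loop in Octave for a 3-layer network;
Theta1, Theta2, X (m x (n+1), with a bias column) and Y (m x K, one-hot
rows) are my assumed names, and sigmoid is as before:

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  a1 = X(i, :)';                  % layer 1 activations (includes bias)
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;  a3 = sigmoid(z3);
  d3 = a3 - Y(i, :)';             % output-layer delta, eq. (89)
  t  = Theta2' * d3;              % backpropagate, eqs. (90)-(91)
  d2 = t(2:end) .* sigmoid(z2) .* (1 - sigmoid(z2));
  Delta2 = Delta2 + d3 * a2';     % accumulate, step 5
  Delta1 = Delta1 + d2 * a1';
end
D1 = Delta1 / m;   % plus the lambda*Theta terms for j ~= 0, as in (94)
D2 = Delta2 / m;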
9.3 Backpropagation Intuition
Backprop is a bit harder to really get intuition for. We'll
consider the mechanical steps.
Consider a 4-layer network, 2 inputs, one output.
Backpropagation is just the chain rule applied to forward
propagation. Formally, for example i,

\delta^{(l)}_j = \frac{\partial}{\partial z^{(l)}_j} \left[ -y^{(i)} \log h_\Theta(x^{(i)}) - (1 - y^{(i)}) \log(1 - h_\Theta(x^{(i)})) \right]    (95)
Listing 17: Reshaping
Theta1 = ones(10, 11);
Theta2 = ones(10, 11);
Theta3 = ones(1, 11);
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
reshape(thetaVec(1:110), 10, 11)
reshape(thetaVec(111:220), 10, 11)
reshape(thetaVec(221:231), 1, 11)
9.4 Implementation Note: Unrolling Parameters
For the advanced minimization algorithms, we will need to
unroll the matrices into vectors. We can do this with the
Theta1(:) idea, plus reshape to go back (listing 17).
In practice: have the initial matrices \Theta^{(1)}, etc. We unroll
them to get initialTheta to pass to fminunc. Then in our
cost function, we get \Theta^{(1)}, etc. back from thetaVec, use
forward prop/backprop to compute the D^{(l)} and J(\Theta). Then
we unroll the D^{(l)} to get gradientVec!
9.5 Gradient Checking
Backprop has a lot of subtle possible errors, so we want to use
gradient checking to make sure that our backprop is
properly implemented.
Gradient checking works by finite differences. By basic
calculus, we know that (for \theta \in R)

\frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}    (96)

This is the centered difference approximation. We can apply
this idea to each entry of a parameter vector \theta as well, which
will give us the full gradient! This is an approximation, though!
We can implement it as (18). We use the numerical gradient
approximation to ensure that DVec and gradApprox are
similar. BUT never use gradient checking in real runs; it
is too slow!
Listing 18: finite difference gradient check
for i = 1:n
  thetaP = theta;
  thetaP(i) = thetaP(i) + EPSILON;
  thetaM = theta;
  thetaM(i) = thetaM(i) - EPSILON;
  gradApprox(i) = (J(thetaP) - J(thetaM)) / ...
                  (2 * EPSILON);
end
9.6 Random Initialization
One last point: we need random initialization for
neural networks, since with zero (or any identical) initialization
the parameters feeding each hidden unit will stay identical. We
use symmetry breaking, where we initialize each entry in
[-\epsilon, \epsilon]. We might use something like (19).
Listing 19: Symmetry Breaking
Theta1 = (2*rand(10, 11) - 1) * INIT_EPSILON;
Theta2 = (2*rand(1, 11) - 1) * INIT_EPSILON;
9.7 Putting it Together
1. First pick a network architecture, the connectivity
between neurons. The number of input units is the
dimension of the features x^{(i)} and the number of output
units is the number of classes. The reasonable default
is 1 hidden layer; if more than one hidden layer, have
the same number of hidden units in each layer (usually
the more, the better). Usually, the number of hidden units
in each layer will be comparable to the number of input units.
2. Next, we need to implement:
(a) Randomly initialize weights
(b) Implement forward prop to get h_\Theta(x^{(i)}) for any x^{(i)}
(c) Implement code for the cost function J(\Theta).
(d) Implement backprop to compute the partial derivatives
\frac{\partial}{\partial \Theta^{(l)}_{jk}} J(\Theta):
i. Perform forward and back-propagation using
(x^{(i)}, y^{(i)})
ii. Get the activations a^{(l)} and delta terms \delta^{(l)} for
l = 2, ..., L.
iii. Accumulate \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T, and now
we have \frac{\partial}{\partial \Theta^{(l)}_{jk}} J(\Theta) as desired.
(e) Use gradient checking to compare the derivatives from
backprop against the numerical estimate. Then disable
gradient checking!
(f) Use gradient descent or an advanced optimization
method with backprop to minimize J(\Theta). This is
non-convex, so we might find local optima!
9.8 Autonomous Driving
We can use neural networks for autonomous driving. The system
basically uses neural nets to learn the steering based on
what it is seeing vs. what the human driver is doing. This is
done with just three hidden layers, and runs 12 times per
second. It uses multiple networks for multiple situations.
10 Advice for Applying Machine Learning
10.1 Deciding what to Try Next
Debugging a learning algorithm: suppose we use regularized
linear regression, and find unacceptably large errors in
predictions. Some things to try:
1. Get more training examples (but be careful here!)
2. Try a smaller set of features: reduces overfitting
3. Try getting additional features (but this can be expensive)
4. Adding polynomial features (x_1^2, x_2^2, x_1 x_2, ...)
5. Decreasing \lambda
6. Increasing \lambda.
These should be picked systematically, not at random!
Diagnostic: a test you can run to gain insight into
what is/isn't working with a learning algorithm and gain
guidance as to how best to improve performance.
10.2 Evaluating a Hypothesis
Really low error on the training set might not be a good
measure of performance. So we split the data into a training
set and a test set. Typically we choose 70% and 30% for
these sets, respectively. As always, we have (x^{(i)}, y^{(i)}),
i = 1, ..., m, but we also introduce m_{test} and
(x^{(i)}_{test}, y^{(i)}_{test}), i = 1, ..., m_{test} to evaluate the fit.
Then we can diagnose overfitting by looking for cases
where J(\theta) is low and J_{test}(\theta) is high.
We would want to pick a random 30%.
Here's how it would work in practice:
1. Learn the parameters \theta from the training data, minimizing
the training error J(\theta)
2. Compute the test set error J_{test}(\theta)
We can also use the misclassification error (0/1
misclassification error):

err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5, y = 0 \text{ or } h_\theta(x) < 0.5, y = 1 \\ 0 & \text{otherwise} \end{cases}    (97)

Then the test error is

\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test}).    (98)
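In Octave, assuming h is a vector of hypothesis outputs in [0, 1] on
the test set and ytest the true labels (both my names), (97)-(98)
might look like:

pred = (h >= 0.5);                 % 0/1 predictions
testError = mean(pred ~= ytest);   % fraction misclassified, eq. (98)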
10.3 Model Selection and Train/Validation/Test
Sets
We also might sometimes want to pick which or what
order of polynomial is used, etc. This is the model selection
problem. One example, where d is the degree of the polynomial:
1. d = 1: h_\theta(x) = \theta_0 + \theta_1 x
2. d = 2: h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2
3. d = 3: h_\theta(x) = \theta_0 + \theta_1 x + ... + \theta_3 x^3
4. etc.
5. d = 10: h_\theta(x) = \theta_0 + \theta_1 x + ... + \theta_{10} x^{10}
Denote the solution you get from degree d as \theta^{(d)}. Then we
could compute J_{test}(\theta^{(d)}) and pick the d which gives the
best value; suppose d = 5 gives the best fit. How well does the
model generalize? Report the test set error J_{test}(\theta^{(5)}).
But there's a problem: J_{test}(\theta^{(5)}) is likely to be an overly
optimistic estimate of the generalization error, because our
extra parameter d is fit to the test set.
How to get around it? Given the data, we split it into
three pieces:
1. Training set (typically 60%), denoted (x^{(i)}, y^{(i)}),
i = 1, ..., m
2. Cross-validation set (typically 20%), denoted
(x^{(i)}_{cv}, y^{(i)}_{cv}), i = 1, ..., m_{cv}
3. Test set (typically 20%), denoted (x^{(i)}_{test}, y^{(i)}_{test}),
i = 1, ..., m_{test}
Then we also have three different error functions:

J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2    (Training)
J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} (h_\theta(x^{(i)}_{cv}) - y^{(i)}_{cv})^2    (Cross-Validation)
J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_\theta(x^{(i)}_{test}) - y^{(i)}_{test})^2    (Test)

Then we do the following:
1. For each potential model d, pick \theta^{(d)} by \min_\theta J_{train}(\theta),
and evaluate J_{cv}(\theta^{(d)})
2. Pick the best model d
3. Estimate the generalization error by J_{test}(\theta^{(d)}).
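One way this selection loop might look in Octave; polyFeatures,
trainModel, and computeCost are hypothetical helpers (mapping x to
[1, x, ..., x^d], minimizing J_train, and evaluating squared error,
respectively):

bestD = 0; bestJcv = Inf; bestTheta = [];
for d = 1:10
  theta = trainModel(polyFeatures(Xtrain, d), ytrain);
  Jcv = computeCost(polyFeatures(Xcv, d), ycv, theta);
  if Jcv < bestJcv
    bestJcv = Jcv; bestD = d; bestTheta = theta;
  end
end
% Report generalization error once, on the held-out test set:
Jtest = computeCost(polyFeatures(Xtest, bestD), ytest, bestTheta);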
10.4 Diagnosing Bias vs Variance
If a learning algorithm doesn't work well, it's almost
always either high bias or high variance. With training,
cross-validation, and test data, we're much better prepared
to deal with this. In particular, J_{cv}(\theta) is usually
quadratic-shaped (in d), and J_{train}(\theta) is usually
1/d-shaped (in d).
Note here that d is a proxy for the complexity of the model,
but specifically stands for the degree of the polynomial. So if
d = 1 or so, then high error indicates high bias. If d is high,
and the J_{cv} error is high, then this indicates high variance.
Summarizing:
1. High bias is indicated by J_{train} high and J_{cv}
approximately equal to J_{train}.
2. High variance is indicated by J_{train} low and
J_{cv} >> J_{train}.
10.5 Regularization and Bias/Variance
Regularization can help prevent overfitting, as we've seen.
But we need to come up with a way to choose the regularization
parameter \lambda. We might try 12 different models, from
\lambda = 0 to \lambda = 10.24 (roughly doubling each time). Then we
get \theta^{(i)} for i = 1, ..., 12.
We evaluate each on the cross-validation set, J_{cv}(\theta^{(i)}),
and pick the best one, i^*. Then we report the test error
as J_{test}(\theta^{(i^*)}). When we do the plots against \lambda, it
comes out the opposite of d, for we get the high-bias regime on
the right and high variance on the left.
10.6 Learning Curves
Learning curves work as a sanity check and a diagnostic. To do
this, we plot J_{train} and J_{cv} as functions of m (figure
from 1350).
High bias case: J_{cv} and J_{train} converge to each other at
a high level of error.
High variance case: J_{train} looks about the same, and J_{cv}
is much higher. The gap is a primary indication of high
variance. In this case, we might suppose that adding much
more data would bring the gap tighter, and from this we
can say that getting more training data is likely to help.
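A sketch of generating learning-curve data in Octave, reusing the
hypothetical trainModel and computeCost helpers from the model
selection sketch:

% Train on the first i examples, evaluate on that subset and the CV set.
for i = 1:m
  theta = trainModel(Xtrain(1:i, :), ytrain(1:i));
  Jtrain(i) = computeCost(Xtrain(1:i, :), ytrain(1:i), theta);
  Jcv(i)    = computeCost(Xcv, ycv, theta);
end
plot(1:m, Jtrain, 1:m, Jcv);
xlabel('m (training set size)'); legend('J_{train}', 'J_{cv}');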
10.7 Deciding what to do next Revisited
1. Get more training examples: fixes high variance
2. Try a smaller set of features: fixes high variance
3. Try getting additional features: fixes high bias
4. Adding polynomial features: fixes high bias
5. Decreasing \lambda: fixes high bias
6. Increasing \lambda: fixes high variance.
Neural networks: small neural networks have fewer
parameters and are more prone to underfitting, but are
computationally cheaper. Large neural networks have more
parameters and are more prone to overfitting, while being
computationally more expensive. Larger neural networks usually
use regularization to address overfitting. You can also try
cross-validating across varying numbers of layers in the
neural network.
11 Machine Learning System Design
We'll consider spam classification.
11.1 Prioritizing What to Work On
Suppose we want to distinguish two classes:
1. Spam (y = 1)
2. Not spam (y = 0)
Suppose we want a supervised learning setup where x is a
bunch of features of the email and y is spam or not spam. One
approach: choose 100 words indicative of spam/not spam
(e.g.: deal, buy, discount, andrew, now, ...). That is, x_j is
associated with the j-th word in the list and

x_j = \begin{cases} 1 & \text{if word } j \text{ appears in the email} \\ 0 & \text{otherwise} \end{cases}    (99)

Note: in practice we would just take the most frequently
occurring n words (10k-50k) in the training set, rather than
100 hand-picked words.
What are some potential ways to get good data?
1. Collect lots of data (e.g., from a honeypot project)
2. Develop sophisticated features based on email routing
information (from email headers)
3. Sophisticated features for the message body, like: should
"discount" and "discounts" be the same word? "Deal" vs.
"Dealer"?
4. Develop sophisticated algorithms to detect misspellings.
11.2 Error Analysis
Start with a simple algorithm that you can implement quickly.
Implement it and test on the cross-validation data. Then
we can plot learning curves to decide if more data, more
features, etc. are likely to help.
Error analysis: manually examine the examples in the
cross-validation set that the algorithm made errors on. See if
you can spot any systematic trend in what type of examples
it is making errors on. This inspires new features.
E.g.: in the misclassified set, categorize the examples based on:
1. what type of email it is (e.g., pharma, replica, steal
passwords, etc.)
2. what cues you think would have helped the algorithm
classify them correctly.
Then we pick the biggest categories or the most common
flags.
This is why we want a quick and dirty algorithm: to give us
the hard examples.
Another idea, for a spam classifier: the Porter stemmer. But
this can also be problematic (e.g., universe vs. university?).
Only solution: try it and see if it works!
Note here that we use the cross-validation set for error
analysis, NOT the test data; otherwise, J_{test} is not a
good estimate of the generalization error. E.g.: if stemming
gives 3% error and not stemming gives 5% error, use
stemming. It is important to have a numerical test or single
real-number metric of performance.
Don't worry about something being too quick and
dirty; it's almost impossible. Just get a prototype!
11.3 Error Metrics for Skewed Classes
So, single error metrics are extremely important. But it's
also extremely tricky when we have very skewed classes.
For example: train a cancer classification example, and
we get 1% error on the test set, while 0.5% of patients have
cancer. One clear problem with this is shown in listing (20),
which actually has less error than our learning algorithm!
Listing 20: Skewed Predictor
function y = predictCancer(x)
  y = 0;
return
So we need something better in the case of skewed classes.
For very skewed classes, we want to use precision/recall. We
want a table like the following:

             actual 1         actual 0
predicted 1  true positive    false positive
predicted 0  false negative   true negative

We define two things:
1. Precision: Of all patients where we predicted y = 1,
what fraction actually has cancer?

precision = \frac{\text{true pos}}{\text{# predicted pos}} = \frac{\text{true pos}}{\text{true pos} + \text{false pos}}    (PRECISION)

2. Recall: Of all patients that actually have cancer, what
fraction did we correctly detect as having cancer?

recall = \frac{\text{true pos}}{\text{# actual pos}} = \frac{\text{true pos}}{\text{true pos} + \text{false neg}}    (RECALL)

There is no way for an algorithm to kinda cheat precision
and recall, so high values of both give us an indicator of a
good algorithm.
Typically we label the rarer class with y = 1.
11.4 Trading off Precision and Recall
We can actually trade off between precision and
recall. Suppose we have a logistic regression classifier with
0 \leq h_\theta(x) \leq 1, and we predict 1 if h_\theta(x) \geq threshold
and 0 if h_\theta(x) < threshold, with threshold = 0.5 (the
typical case!).
We can come up with a higher-precision algorithm by
increasing the threshold; this corresponds to saying we predict
cancer only if we are nearly 100% sure that they have cancer. But
this lowers recall! We miss some cancer patients by only
predicting when we are more confident.
We can also come up with a higher-recall algorithm by
decreasing the threshold; this corresponds to predicting cancer
even if we are less sure that they have it. This helps us avoid
false negatives, and thereby increases recall. Unfortunately,
this lowers precision.
Generally, you predict y = 1 if h_\theta(x) \geq threshold. You
can even make a precision-recall curve: make a scatterplot
of recall vs. precision for a bunch of threshold values and see
what shape it follows.
Now, what if we have a bunch of algorithms? Can we
compare on precision and recall automatically? Some ideas:
1. Average: \frac{P+R}{2}, but this is not very good.
2. F_1 score: 2\frac{PR}{P+R}. This is more typically used.
In summary: measure precision and recall on the cross-validation
set and choose the value of the threshold which
maximizes 2\frac{PR}{P+R}.
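A small Octave sketch computing these metrics from 0/1 prediction and
label vectors; the name f1Score is mine:

function [P, R, F1] = f1Score(pred, actual)
  % Precision, recall, and F1 from 0/1 vectors.
  tp = sum((pred == 1) & (actual == 1));
  fp = sum((pred == 1) & (actual == 0));
  fn = sum((pred == 0) & (actual == 1));
  P = tp / (tp + fp);
  R = tp / (tp + fn);
  F1 = 2 * P * R / (P + R);
end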
11.5 Data for Machine Learning
How much data should we train on? Well, under certain
conditions, we want as much as possible.
Banko and Brill, 2001. E.g.: classify between confusable
words.
"It's not who has the best algorithm that wins. It's
who has the most data." (Sometimes.)
The large-data rationale: assume the features x \in R^{n+1}
have sufficient information to predict y accurately.
1. Example: "For breakfast I ate ____ eggs"; a human
English expert could fill it in.
2. Counterexample: predict housing price from only size
and no other features.
Useful test: given the input x, can a human expert confidently
predict y?
The algorithm should be low bias (that is, J_{train}(\theta) is
small). If we use a very large training set, we are unlikely
to overfit, so J_{train}(\theta) \approx J_{test}(\theta). Then J_{test}(\theta)
should be small. This is a case where we suspect the large-data
rationale to hold, and we should throw as much data as possible
at it.
Put in the converse, a large training set is unlikely to
help when:
1. The features x do not contain enough information to
predict y accurately, and we are using a simple
learning algorithm such as logistic regression
2. The features x do not contain enough information to
predict y accurately, even if we are using a neural
network with a large number of hidden units.
12 Support Vector Machines
12.1 Optimization Objective
How about an alternative view of logistic regression?
Suppose we have

h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}    (100)

so if y = 1 we want h_\theta(x) \approx 1, so we want \theta^T x >> 0;
and if y = 0, we want h_\theta(x) \approx 0, so we want \theta^T x << 0.
Consider the cost function of one example (and introduce
z = \theta^T x):

-(y \log h_\theta(x) + (1 - y) \log(1 - h_\theta(x)))    (101)

So if y = 1, then only the first term matters: -\log \frac{1}{1+\exp(-z)}.
Since we're minimizing the cost function, the minimum
occurs as z gets larger and larger. The main idea of support
vector machines is to approximate -\log \frac{1}{1+\exp(-z)} by
a piecewise-linear function, roughly

\begin{cases} 0 & z \geq 1 \\ \frac{1}{2}(1 - z) & z < 1 \end{cases}    (102)

(see plot 1757). We can do an equivalent thing for the y = 0
case. We call these two cost functions cost_1(z) and cost_0(z).
So now we have a support vector machine optimization
problem which is similar to logistic regression:

\min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} cost_1(\theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=0}^{n} \theta_j^2    (103)

[nb: I think Ng made a mistake here; the second summation
should be j = 1 to n.] We have made a couple of changes:
1. We multiplied through by m; this changes the minimum
value, but we only care about the minimizer.
2. Instead of solving min A + \lambda B we solve min CA + B, which
again only changes the minimum value, not the minimizer.
(Here, C = \frac{1}{\lambda}.)
One important point: the hypothesis function DOES
NOT RETURN A PROBABILITY! We have the hypothesis:

h_\theta(x) = \begin{cases} 1 & \theta^T x \geq 0 \\ 0 & \text{otherwise.} \end{cases}    (104)
12.2 Large Margin Intuition
Support vector machines are also known as large margin
classifiers. Quick recap:
1. If y = 1, we want \theta^T x \geq 1 (not just \geq 0)
2. If y = 0, we want \theta^T x \leq -1 (not just < 0)
So this is saying we don't just want the logistic function
requirement, we want something kind of even stronger!
Suppose we take C huge. Then we really want each cost term
equal to zero. Then we're very motivated to have \theta^T x^{(i)} \geq 1
whenever y^{(i)} = 1 and \theta^T x^{(i)} \leq -1 whenever y^{(i)} = 0.
Then in the linearly separable case, we have something like
(ss 1810). This is basically trying to separate the positive and
negative examples with as large a margin as possible.
What happens with outliers? If C is very large, then
we are extremely sensitive to outliers, while if C is more
moderate then we accept a few errors to have a clearer and
simpler boundary.
12.3 Mathematics Behind Large Margin Classification
First off, u^T v is the inner product. We have ||u|| being the length of vector u, which is ||u|| = √(u_1² + u_2²) ∈ R. We can also define p to be the length of the projection of v onto u; then u^T v = p · ||u||. But it's usually formulated as

u^T v = u_1 v_1 + u_2 v_2    (105)
Furthermore, p is a signed number; it can be negative. If we simplify by making θ_0 = 0, then we can write the SVM problem as

min_θ (1/2) Σ_{j=1}^n θ_j²
s.t. θ^T x^(i) ≥ 1 if y^(i) = 1
     θ^T x^(i) ≤ −1 if y^(i) = 0    (106)
Then clearly the objective (1/2)(θ_1² + θ_2²) = (1/2)(√(θ_1² + θ_2²))² = (1/2)||θ||², so we are just minimizing the squared norm. Next, note that θ^T x^(i) = p^(i) · ||θ|| = θ_1 x_1^(i) + θ_2 x_2^(i). That means we can replace the constraints and have a problem like

min_θ (1/2)||θ||²
s.t. p^(i) · ||θ|| ≥ 1 if y^(i) = 1
     p^(i) · ||θ|| ≤ −1 if y^(i) = 0    (107)

It is possible to show that θ is perpendicular to the separating boundary. Then we can (straightforwardly) argue that ||θ|| can be made much smaller if the separating boundary makes the p^(i) as large as possible. Since we are trying to minimize ||θ|| and still satisfy the constraints, it is clear that the largest-margin decision boundary is picked by the SVM problem.
The same large-margin argument works when θ_0 ≠ 0.
12.4 Kernels I
To make nonlinear classifiers, we can use complex polynomial features. In particular, we can define a bunch of features f_i from the original ones x_j. Is there a different/better choice of features than the quadratic terms?
One idea: given x, compute new features depending on proximity to landmarks l^(1), l^(2), and l^(3). E.g., given x,

f_1 = similarity(x, l^(1)) = exp(−||x − l^(1)||² / (2σ²))    (108)

where these are called Gaussian kernels. We can denote them as k(x, l^(i)). Consider what happens when:
1. if x ≈ l^(1): f_1 ≈ exp(−0² / (2σ²)) ≈ 1
2. if x is far from l^(1): f_1 ≈ exp(−(large)² / (2σ²)) ≈ 0
It is clear that (relative to σ² = 1), σ² = 0.5 falls to zero much quicker and σ² = 3 falls away much more slowly.
12.5 Kernels II
Last time we talked about the similarity function. But where would we get the landmarks? One way: pick a landmark at each training point; then we end up with m landmarks. This is nice: we have a method for talking about how close a test point is to the training data. What does this gain us? Well, now instead of x^(i) ∈ R^(n+1), we have f^(i) ∈ R^(m+1), which can be a substantial reduction in the dimension of the problem! To make a prediction on a datapoint x, you compute the feature vector f ∈ R^(m+1) and predict y = 1 if θ^T f ≥ 0.
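A small Octave sketch of this feature construction (my own illustration; kernel_features is a hypothetical helper, and L holds one landmark per row, here the training examples themselves):

function f = kernel_features(x, L, sigma)
  % x: n x 1 example; L: m x n landmark matrix
  m = size(L, 1);
  f = zeros(m + 1, 1);
  f(1) = 1;                                        % intercept feature f_0
  for i = 1:m
    f(i + 1) = exp(-norm(x - L(i, :)')^2 / (2 * sigma^2));
  end
end

Prediction is then y = 1 whenever theta' * f >= 0.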
For training, we just compute f^(i) for each x^(i), and then just minimize

J(θ) = C Σ_{i=1}^m [ y^(i) cost_1(θ^T f^(i)) + (1 − y^(i)) cost_0(θ^T f^(i)) ] + (1/2) Σ_{j=1}^m θ_j².    (109)
One last bit of implementation detail: since Σ_j θ_j² = θ^T θ, implementations replace it with a rescaled version θ^T M θ, where M is a matrix that depends on the kernel. This gives rise to extremely fast algorithms for optimizing the cost function J(θ).
A few words about bias and variance: note that C = 1/λ.
1. Large C: lower bias, higher variance
2. Small C: higher bias, lower variance
The other parameter we need to choose is σ²:
1. Large σ² gives features that vary more smoothly, so higher bias, lower variance
2. Small σ² gives features that vary less smoothly, so lower bias, higher variance
12.6 Using an SVM
Some SVM software is very good. Use liblinear, libsvm, etc. to solve for θ. You still do need to specify:
1. Choice of parameter C
2. Choice of kernel
(a) No kernel/linear kernel (use if n is large and m is small)
(b) Gaussian kernel: need to choose σ². Use if n is small and/or m is large. Key point: do perform feature scaling before using the Gaussian kernel, because otherwise large-scale variables will swamp out everything else. Requires writing a kernel function such as listing (21).
(c) Other kernels: these need to satisfy Mercer's Theorem so SVM packages run correctly and do not diverge. They are much less common:
i. Polynomial kernel: (x^T l + c)^d, which has parameters c, d
ii. String kernel
Listing 21: Gaussian Kernel
function f = kernel(x1, x2)
  % sigma is assumed to be available in scope
  f = exp(-norm(x1 - x2)^2 / (2 * sigma^2));
end
iii. Chi-square kernel
iv. Histogram intersection kernel
Just to reiterate: Choose whatever performs best on the
cross-validation data!
Multiclass classification: most SVM packages already build this in. Otherwise, use one-vs-all classification.
When would you use logistic regression vs an SVM? Take n as the number of features, and m the number of training examples.
1. If n is large (relative to m): logistic regression or SVM without a kernel. (n ≥ m; e.g. n = 10,000, m = 10-1,000)
2. If n is small and m is intermediate: SVM with a Gaussian kernel. (n = 1-1,000, m = 10-10,000)
3. If n is small but m is large: create/add more features, then use logistic regression or an SVM without a kernel. (e.g. n = 1-1,000, m = 50,000+)
4. A neural network is likely to work well for most of these settings, but may be slower to train. (And we need to worry about nonconvexity.)
13 Clustering
13.1 Unsupervised Learning: Introduction
Here, we are given x^(i) without the y^(i) (still with i = 1...m). We ask an algorithm to just find some structure in the given data. The first type of algorithm we'll look at is the clustering algorithm. This is good for:
1. Market segmentation
2. Social network analysis
3. Organizing computing clusters
4. Astronomical data analysis
13.2 K-Means Algorithm
We are given a set x^(i) and ask the algorithm to find clusters. The basic idea: initialize random cluster centroids, then run two steps:
1. Cluster assignment step: assign each datapoint to a cluster centroid.
2. Move centroid step: move each centroid to the center of mass of the points assigned to it.
More formally: K-means takes as input the number of clusters K and the training set x^(1), ..., x^(m). Then (a short Octave sketch appears after the edge cases below):
1. Randomly initialize K cluster centroids μ_1, ..., μ_K ∈ R^n
2. Repeat:
(a) for i = 1, ..., m: let c^(i) be the index (from 1 to K) of the cluster centroid closest to x^(i) (this is a minimization problem).
(b) for k = 1, ..., K: let μ_k be the average (mean) of the points assigned to cluster k.
Some edge cases: If no points are assigned to a centroid,
we might delete it or randomly re-initialize it.
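A minimal Octave sketch of the loop above (my own illustration; X is m x n, K is given, and we run a fixed number of iterations rather than testing convergence):

m = size(X, 1);
idx = randperm(m);
mu = X(idx(1:K), :);                 % initialize from K random examples
c = zeros(m, 1);
for iter = 1:100
  for i = 1:m                        % cluster assignment step
    d = sum((mu - X(i, :)).^2, 2);   % squared distance to each centroid
    [~, c(i)] = min(d);
  end
  for k = 1:K                        % move centroid step
    mu(k, :) = mean(X(c == k, :), 1);
  end
end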
Next, K-means for non-separated clusters. Sometimes datasets aren't easily separated. E.g.: T-shirt sizing with height and weight. This is in fact a market segmentation problem.
13.3 Optimization Objective
Recall that c^(i) is the index of the cluster to which x^(i) is currently assigned, and μ_k is cluster centroid k. Introduce some new notation: μ_{c^(i)} is the cluster centroid of the cluster to which example x^(i) has been assigned.
Then the optimization problem is in terms of the function J:

J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^m ||x^(i) − μ_{c^(i)}||²    (110)

and then the minimization is

min_{c^(1), ..., c^(m), μ_1, ..., μ_K} J(c^(1), ..., c^(m), μ_1, ..., μ_K)    (111)
This J is also called the distortion.
In the above framework, the cluster assignment step minimizes J(...) with respect to the c^(i), holding the μ_j fixed. The move centroid step is equivalent to minimizing J(...) with respect to the μ_j, holding the c^(i) fixed.
13.4 Random Initialization
Now we need to discuss how to avoid local optima. We should have K < m; otherwise it's kinda weird. We randomly pick K training examples and set μ_1, ..., μ_K equal to these K examples. This is the truly recommended way to initialize.
K-means can end up at a local optimum. This kinda sucks. To get around it, we might try running K-means many times, e.g. with 50-1000 different random initializations, as sketched below. Then we can be more assured in the goodness of our optimum. This matters most when you have small numbers of clusters.
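A sketch of that restart loop (runKMeans is a hypothetical helper wrapping the loop from section 13.2 and returning the final distortion J):

bestJ = Inf;
for t = 1:100                           % 100 random initializations
  [c, mu, J] = runKMeans(X, K);         % hypothetical helper
  if J < bestJ
    bestJ = J; bestC = c; bestMu = mu;  % keep the lowest-distortion run
  end
end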
13.5 Choosing the Number of Clusters
What is the right value of K? There probably isn't a totally correct/perfect method. The most common approach is picking the number of clusters by hand.
One heuristic: the elbow method. Run K-means for varying numbers of clusters and plot the number of clusters vs the cost function. The point of diminishing returns is called an elbow. But it's not a great method, because frequently there is no clear elbow.
Another method: frequently, a downstream purpose will imply a number of clusters.
14 Dimensionality Reduction
14.1 Motivation I: Data Compression
One example: we might want to reduce data from 2D to 1D; for example, lengths recorded in both centimeters and meters, with roundoff errors. Another way to look at it is that we might have underlying variables, like pilot skill and pilot enjoyment, determining aptitude.
More formally, we might be able to represent x^(i) ∈ R² by projecting it to z^(i) ∈ R.
One really, really important application is making machine learning algorithms run faster.
One (contrived) example: if all data in some subset of R³ lies in a plane, then we can just use the plane representation instead. In fact, it's not so contrived, since we can project from 3D onto some 2D plane.
14.2 Motivation II: Visualization
Suppose we have something like 50 features, and we'd like to visualize them to try and get some insight. One option is to use a different feature representation to get it down to some (z_1, z_2); then we can plot it easily!
Country example: we might have z_1 corresponding to country size/total GDP and z_2 corresponding to per-person GDP. We'd typically do data reduction to get this down to 2- or 3-dimensional data.
14.3 Principal Component Analysis Problem Formulation
What exactly is PCA? PCA tries to find a lower-dimensional surface such that the projection error from the dataset onto the lower-dimensional surface is relatively small.
One important detail: you should do feature scaling/mean normalization!
There are definitely terrible PCA surfaces.
More formally: if we wanted to reduce from 2 dimensions to 1 dimension, we find a direction (a vector u^(1) ∈ R^n) onto which to project the data so as to minimize the projection error.
More generally: reduce from n dimensions to k dimensions by finding k vectors u^(1), u^(2), ..., u^(k) onto which to project the data so as to minimize the projection error. (We'll construct a subspace onto which to project...)
How does PCA relate to linear regression? It doesn't; they're totally different algorithms.
14.4 Principal Component Analysis Algorithm
Before doing principal component analysis, we really, really need to do data preprocessing. So we have the training set x^(i), and we do mean normalization/feature scaling:

x_j^(i) ← (x_j^(i) − μ_j) / s_j    (112)

The procedure is pretty simple. Here we go:
1. First compute the covariance matrix Σ = (1/m) Σ_{i=1}^m (x^(i))(x^(i))^T
2. Compute the singular value decomposition of Σ (listing (22)).
Listing 22: PCA Algorithm
[U, S, V] = svd(Sigma);
(You can also use eig for this type of matrix (positive semi-definite), but svd is a bit more stable.)
It turns out that the U returned from SVD gives us n column vectors u^(i) ∈ R^n; we can use the first k of them to get the k directions onto which to project our data.
Next, we need to find a way to get z ∈ R^k; to get there we first introduce the n × k matrix

U_reduce = [ u^(1)  u^(2)  ⋯  u^(k) ]    (113)

and then we can find

z = U_reduce^T x    (114)
We can compute the covariance matrix in vectorized form as in listing (23).

Listing 23: Covariance Matrix
Sigma = (1/m) * X' * X;

The proof that all of this works is beyond the scope of this class.
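Putting the pieces together, a sketch of the whole pipeline in Octave (my own summary; X is the m x n mean-normalized data matrix, and k is chosen as in the next section):

m = size(X, 1);
Sigma = (1/m) * X' * X;     % covariance matrix, n x n
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);        % first k principal directions, eq (113)
Z = X * Ureduce;            % projected data, m x k, eq (114)
Xapprox = Z * Ureduce';     % reconstruction, m x n (see section 14.6)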
14.5 Choosing the Number of Principal Components
We call k the number of principal components. The average squared projection error is

(1/m) Σ_{i=1}^m ||x^(i) − x_approx^(i)||²    (115)

and the total variation in the data is

(1/m) Σ_{i=1}^m ||x^(i)||²    (116)

Typically we choose k to be the smallest value so that

[ (1/m) Σ_{i=1}^m ||x^(i) − x_approx^(i)||² ] / [ (1/m) Σ_{i=1}^m ||x^(i)||² ] ≤ 0.01    (117)

which is saying that 99% of the variance is retained; the equivalent check for anywhere between 85% and 99% is also common.
To choose k, we can use the algorithm:
1. Try PCA with k = 1
2. Compute U_reduce, z^(1), etc.
3. Check if (117) holds
The check can be much more efficiently evaluated as

1 − (Σ_{i=1}^k S_ii) / (Σ_{i=1}^n S_ii) ≤ 0.01    (118)

So we can just slowly increase k until we have

(Σ_{i=1}^k S_ii) / (Σ_{i=1}^n S_ii) ≥ 0.99    (119)

Report the variance retained, not the dimensions.
We get away with this basically only because datasets tend to have very highly correlated data.
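Since svd already gave us S, this check takes a few lines of Octave (a sketch; S is the diagonal matrix from listing (22)):

s = diag(S);                    % spectrum of Sigma, largest first
retained = cumsum(s) / sum(s);  % variance retained for k = 1, 2, ...
k = find(retained >= 0.99, 1);  % smallest k retaining 99%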
14.6 Reconstruction from Compressed Representation
If we use PCA as a compression algorithm, then what can we do to get the original data back (approximately)? We take

x_approx = U_reduce z    (120)

which we hope will give approximately x. This process is called reconstruction.
14.7 Advice for Applying PCA
The most common use of PCA might be supervised learning speedup. E.g.: computer vision! If we had a 100x100 image, then we get a 10,000-dimensional feature vector, which is a lot. Alternative: given (x^(i), y^(i)),
1. Extract the inputs into an unlabeled dataset x^(i) ∈ R^10000 and (by PCA) get z^(i) ∈ R^1000.
2. Have a new dataset: (z^(i), y^(i)), i = 1, ..., m.
3. Now do logistic regression or neural nets or whatever.
Critical point: The mapping from x^(i) to z^(i) should only be found by running PCA on the training set, then reusing that mapping on the cross-validation and test sets, since the matrix U_reduce is in some sense a new parameter of the supervised learning setup!
Recap: Two applications
1. Compression
(a) Reduce memory/disk needed to store data
(b) Speed up learning algorithm
2. Visualization (mapping into k = 2 or k = 3)
One terrible application: to prevent overfitting. The idea is to use z^(i) instead of x^(i) to reduce the number of features to k < n (e.g., 1000 < 10000); thus, fewer features, less likely to overfit. BAD IDEA: it might work okay, but it doesn't actually address overfitting like regularization would. (It's bad because it throws away the label data.)
PCA is sometimes used where it should not be. One clear case is this design for an ML system:
1. Get training set (x^(i), y^(i))
2. Run PCA to reduce x^(i) in dimension to get z^(i)
3. Train logistic regression on (z^(i), y^(i))
4. Test on test set: map x_test^(i) to z_test^(i) and run h_θ(z) on (z_test^(i), y_test^(i))
The problem here is that PCA shouldn't be done until you've tried the ML system without PCA (on the raw data) and that didn't do what you wanted.
15 Anomaly Detection
15.1 Problem Motivation
Imagine that you design and manufacture aircraft engines. Then you have features x_i which represent things like heat generated, vibration intensity, etc. The problem is to determine if a new engine x_test is anomalous or not.
More formally: We estimate a density from the dataset x^(i) and want to see if x_test is anomalous. We hope to build a model p(x) where we say p(x_test) < ε is a flagged anomaly, and otherwise a typical point.
The most typical use might be fraud detection. For example:
1. x^(i) is the feature vector of user i's activity:
(a) how often they log in
(b) number of web pages visited
(c) number of posts on the forum
(d) typing speed
2. Model p(x) from this data.
3. Identify unusual users by checking which have p(x) < ε.
Unfortunately, this only finds strange users, not fraudulent ones.
We can also use this to monitor computers in a data center:
1. x^(i) is the feature vector of machine i:
(a) memory use
(b) number of disk accesses/sec
(c) CPU load
(d) ratio of CPU load vs network traffic
15.2 Gaussian Distribution
Say x ∈ R. If x is distributed Gaussian with mean μ and variance σ², then we write x ~ N(μ, σ²). This is the standard bell shape centered at μ with approximate width σ. For completeness:

p(x; μ, σ²) = (1/(√(2π) σ)) exp(−(x − μ)² / (2σ²)).    (121)

Sometimes it is easier to think in terms of the variance σ², sometimes easier in terms of the standard deviation σ.
Parameter estimation problem: Suppose we have a dataset x^(1), ..., x^(m) with x^(i) ∈ R. We'd like to find μ and σ, supposing that x^(i) ~ N(μ, σ²).
The mean is easy:

μ = (1/m) Σ_{i=1}^m x^(i)    (122)

and

σ² = (1/m) Σ_{i=1}^m (x^(i) − μ)²    (123)

These are the maximum likelihood estimates. Sometimes one uses 1/(m−1) instead of 1/m, but in machine learning it doesn't matter too much.
15.3 Algorithm
Consider if we had an unlabeled training set x^(i) for i = 1, ..., m. Each example is x ∈ R^n. We model

p(x) = p(x_1; μ_1, σ_1²) p(x_2; μ_2, σ_2²) ⋯ p(x_n; μ_n, σ_n²)    (124)

This corresponds to an independence assumption! More compactly:

p(x) = Π_{j=1}^n p(x_j; μ_j, σ_j²)    (125)
Algorithm (a short Octave sketch follows):
1. Choose features x_j that you think might be indicative of anomalous examples.
2. Fit the parameters μ_1, ..., μ_n, σ_1², ..., σ_n² by the standard statistics formulas:

μ_j = (1/m) Σ_{i=1}^m x_j^(i)    (126)

σ_j² = (1/m) Σ_{i=1}^m (x_j^(i) − μ_j)²    (127)

3. Given a new example x, compute p(x):

p(x) = Π_{j=1}^n p(x_j; μ_j, σ_j²) = Π_{j=1}^n (1/(√(2π) σ_j)) exp(−(x_j − μ_j)² / (2σ_j²))    (128)

and mark as an anomaly if p(x) < ε.
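A vectorized Octave sketch of steps 2 and 3 (my own illustration; X is the m x n training matrix, x is a 1 x n example, and epsilon is chosen on a cross-validation set as in the next section):

mu = mean(X, 1);                 % eq (126), 1 x n
sigma2 = mean((X - mu).^2, 1);   % eq (127), 1 x n
p = prod(1 ./ sqrt(2*pi*sigma2) .* exp(-(x - mu).^2 ./ (2*sigma2)));  % eq (128)
isAnomaly = (p < epsilon);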
15.4 Developing and Evaluating an Anomaly Detection System
How do you evaluate an anomaly detection algorithm? Ideally with a single real-valued measure of efficacy.
Assume that we have some labeled data of anomalous and non-anomalous examples, labeled y = 1 and y = 0 respectively. As before, we have x^(i) for i = 1..m. We also have a cross-validation set (x_cv^(i), y_cv^(i)) and a test set (x_test^(i), y_test^(i)).
Aircraft engines motivating example: Suppose we have 10,000 good engines and 20 flawed engines. Split this as:
1. training set: 6,000 good engines
2. cross validation: 2,000 good engines, 10 anomalous
3. test set: 2,000 good engines, 10 anomalous
There are alternatives, but they are not recommended.
For algorithm evaluation:
1. Fit the model p(x) on the training set x^(i).
2. On cross validation, predict

y = { 1 if p(x) < ε (anomaly); 0 if p(x) ≥ ε (normal)    (129)

3. Possible evaluation metrics:
(a) true positives, false positives, false negatives, true negatives
(b) precision/recall
(c) F_1-score
Can also use the cross-validation set to choose the parameter ε.
15.5 Anomaly Detection vs Supervised Learning
Some places to use anomaly detection:
1. Very small number of positive examples (y = 1) (0-20 is common).
2. Large number of negative examples.
3. Many different types of anomalies. It is hard for any algorithm to learn from positive examples what anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.
Some places to use supervised learning:
1. Large number of positive and negative examples.
2. Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.
Counterexample: spam, even though there are many types.
Use cases for anomaly detection:
1. Fraud detection
2. Manufacturing
3. Monitoring machines in a data center
Use cases for supervised learning:
1. Email spam classification
2. Weather prediction
3. Cancer prediction
15.6 Choosing what Features to Use
Features have a huge effect on anomaly detection. One implicit assumption was that the features are Gaussian-like, at least vaguely. You should try plotting histograms of the data.
If a feature doesn't look Gaussian, try to transform it. E.g.: if it is exponentially distributed, try taking the log; then it will look much more Gaussian. Transformations:
1. x ← log(x + c)
2. x ← x^c (0 < c < 1)
How do you come up with features? Well, we want p(x) large for normal examples, and p(x) small for anomalous examples x. One common problem is that p(x) is comparable for normal and anomalous examples. We hope to come up with features that distinguish the datapoints by getting the problem cases assigned an extremely low probability.
Case study: computers in a data center:
1. memory use of computer
2. number of disk accesses/sec
3. CPU load
4. network load
We notice that CPU load and network load tend to grow with each other. When would this not happen? When CPU load is high and network load is low. So we introduce a new variable x_5 = (CPU load)/(network traffic), or even x_6 = (CPU load)²/(network traffic).
15.7 Multivariate Gaussian Distribution
One possible extension: multivariate Gaussians. Consider monitoring machines in a data center, with CPU load and memory use statistics. Suppose in the test set we have a clear outlier that does not look too bad in any single dimension. The trouble is that the probability isosurfaces of the per-feature model are circles, not ellipses.
So we introduce multivariate Gaussian distributions: Suppose x ∈ R^n, and instead of modeling p(x_1), p(x_2), etc. separately, we model p(x) all at once, with parameters μ ∈ R^n and Σ ∈ R^(n×n). So we have

p(x; μ, Σ) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ))    (130)
Suppose we have

μ = [0; 0],  Σ = [1 0; 0 1]    (131)

This gives a standard distribution with circular isosurfaces. What happens if we take

μ = [0; 0],  Σ = [0.6 0; 0 0.6]?    (132)

This gives narrower Gaussians. We can also make the distribution wider by increasing the diagonal elements of Σ.
We can also get ellipsoidal isosurfaces by taking Σ with diagonal entries different from each other. Finally, you can get skewed Gaussian distributions by changing the off-diagonal elements (these say that, e.g., x_1 is correlated with x_2, which means when one rises, the other does as well, as long as the off-diagonals are positive). If the off-diagonals are negative, we get inverse correlation.
We can also shift around the mean by changing μ, of course.
15.8 Anomaly Detection using the Multivariate Gaussian Distribution
Last time we introduced the multivariate Gaussian. We saw

p(x; μ, Σ) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ))    (133)

and we do parameter fitting by

μ = (1/m) Σ_{i=1}^m x^(i),   Σ = (1/m) Σ_{i=1}^m (x^(i) − μ)(x^(i) − μ)^T    (134)

where our dataset is x^(1), ..., x^(m).
The general outline is:
1. Fit the model p(x) by setting μ and Σ as above.
2. Given a new example x, compute p(x).
3. Flag an anomaly if p(x) < ε.
It turns out that if the Gaussians are axis-aligned (the off-diagonals of Σ are zero), then the multivariate model ends up being the same as the per-feature model discussed above. This is the independence assumption!
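A short Octave sketch of the multivariate fit and density (my own illustration; X is m x n, x is an n x 1 example):

[m, n] = size(X);
mu = mean(X, 1)';                      % n x 1, eq (134)
Xc = X - mu';                          % centered data
Sigma = (1/m) * (Xc' * Xc);            % n x n, eq (134)
p = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * ...
    exp(-0.5 * (x - mu)' * (Sigma \ (x - mu)));   % eq (133)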
When would you use the original model vs the multivariate Gaussian model?
1. Original:
(a) Manually create features to capture anomalies where x_1 and x_2 take unusual combinations of values.
(b) Computationally cheaper, so it scales better to large n (e.g., n = 10,000 to n = 100,000).
(c) Works OK even if m is small.
2. Multivariate:
(a) Automatically captures correlations between features.
(b) Computationally much more expensive.
(c) Must have m > n, or else Σ is noninvertible.
Ng would only use the multivariate model when m ≫ n, say m ≥ 10n.
If Σ is noninvertible, you probably have:
1. too few examples (m ≤ n), or
2. redundant (linearly dependent) features.
16 Recommender Systems
16.1 Problem Formulation
Many websites try to recommend new products based on other products consumers have consumed. Recommender systems are a relatively small part of academia, but they are extremely common in industry.
Features have a huge effect on performance.
Suppose we have a table with movies down the rows, and different users rate movies 0-5 stars. One interesting thing is that we might be missing data. We introduce some notation:
1. n_u: number of users
2. n_m: number of movies
3. r(i, j) = 1 if user j has rated movie i
4. y^(i,j): the rating user j gave to movie i, defined only if r(i, j) = 1
The job of a recommender system is to predict the values for which r(i, j) = 0.
16.2 Content-Based Recommendations
We could define features like
1. x_1: romance
2. x_2: action
Then for each user j, we could learn a linear regression parameter vector θ^(j) ∈ R³. Predict user j as rating movie i with (θ^(j))^T x^(i) stars.
To learn θ^(j) we just calculate

min_{θ^(j)} (1/(2m^(j))) Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/(2m^(j))) Σ_{k=1}^n (θ_k^(j))²    (135)

where m^(j) is the number of movies rated by user j. But consider it without the m^(j) part:

min_{θ^(j)} (1/2) Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) Σ_{k=1}^n (θ_k^(j))²    (136)
Then to learn θ^(1), ..., θ^(n_u), minimize

min_{θ^(1),...,θ^(n_u)} (1/2) Σ_{j=1}^{n_u} Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^n (θ_k^(j))²    (137)
Then our optimization algorithm could be the typical gradient descent steps:

θ_k^(j) := θ_k^(j) − α Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j)) x_k^(i)    (for k = 0)    (138)

θ_k^(j) := θ_k^(j) − α ( Σ_{i:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j)) x_k^(i) + λ θ_k^(j) )    (for k ≠ 0)    (139)

This is basically just applied linear regression.
One caveat: it might be hard/impossible to have the features we described.
16.3 Collaborative Filtering
We had assumed that we had the features. How would we get values for the romance/action features?
One possible way: suppose you had the partial ratings as before, but also had users' subjective ratings of various kinds of movies (i.e., their θ^(j)). Then we can infer the content feature vectors of the movies. That is, we would like to learn x^(i) such that (θ^(j))^T x^(i) approximately matches user j's rating of movie i, for the cases where they have rated it.
, to
learn x
(i)
, we want to
min
x
(
i)
1
2

j:r(i,j)=1
((
(j)
)
T
x
(i)
y
(i,j)
)
2
+

2
n

k=1
(x
(i)
k
)
2
(140)
Of course, we also want to find the features for each movie 1, ..., n_m, so the actual optimization problem is

min_{x^(1),...,x^(n_m)} (1/2) Σ_{i=1}^{n_m} Σ_{j:r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^n (x_k^(i))²    (141)
So collaborative filtering says that if we have x^(1), ..., x^(n_m) and the movie ratings (r(i, j) and y^(i,j)), we can estimate θ^(1), ..., θ^(n_u). Given the parameters θ^(1), ..., θ^(n_u), we can estimate x^(1), ..., x^(n_m).
So we can guess θ, then use it to get x, then iterate this many times until we converge.
Key point here: every user is helping every other user.
16.4 Collaborative Filtering Algorithm
Turns out, though, this back-and-forth algorithm kinda sucks. We can come up with a better one.
We have our two steps as above:
1. Given x^(1), ..., x^(n_m), estimate θ^(1), ..., θ^(n_u)
2. Given θ^(1), ..., θ^(n_u), estimate x^(1), ..., x^(n_m)
(both with the same objective function, but minimizing over different variables and regularization terms)
One option: to go back and forth. A smarter idea: put both in the same objective function, and minimize over both simultaneously:
J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u)) = (1/2) Σ_{(i,j):r(i,j)=1} ((θ^(j))^T x^(i) − y^(i,j))² + (λ/2) Σ_{i=1}^{n_m} Σ_{k=1}^n (x_k^(i))² + (λ/2) Σ_{j=1}^{n_u} Σ_{k=1}^n (θ_k^(j))²    (142)
and then solve the minimization problem

min_{x^(1),...,x^(n_m), θ^(1),...,θ^(n_u)} J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u))    (143)

Previously, we used x_0 = 1 and x ∈ R^(n+1); here we take x ∈ R^n and θ ∈ R^n.
So the algorithm:
1. Initialize x^(i) and θ^(j) to small random values.
2. Minimize J(x^(1), ..., x^(n_m), θ^(1), ..., θ^(n_u)) using gradient descent or an advanced optimization algorithm (a cost/gradient sketch follows).
3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of θ^T x.
We need to initialize with small random values to do symmetry breaking, ensuring that we don't learn the same values multiple times.
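A vectorized Octave sketch of the cost (142) and its gradients (my own illustration; X is n_m x n, Theta is n_u x n, Y is n_m x n_u, R is the 0/1 indicator of rated entries, and lambda is the regularization parameter):

E = (X * Theta' - Y) .* R;                % errors, only where r(i,j) = 1
J = 0.5 * sum(sum(E.^2)) ...
    + (lambda/2) * (sum(sum(X.^2)) + sum(sum(Theta.^2)));
Xgrad     = E  * Theta + lambda * X;      % gradient with respect to X
ThetaGrad = E' * X     + lambda * Theta;  % gradient with respect to Theta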
16.5 Vectorization: Low rank Matrix Factorization
We can group all the ratings into a matrix Y, an n_m × n_u matrix. Then the predicted ratings are given by the matrix whose (i, j)-entry is (θ^(j))^T x^(i). That is, we can define

X = [ (x^(1))^T ; (x^(2))^T ; ⋮ ; (x^(n_m))^T ],   Θ = [ (θ^(1))^T ; (θ^(2))^T ; ⋮ ; (θ^(n_u))^T ]    (144)

(stacking the transposed vectors as rows). Then the matrix of predicted ratings can be calculated as XΘ^T!
The algorithm we are using is called low rank matrix factorization.
To find related movies: for each product i, we learn a feature vector x^(i) ∈ R^n. So we could find x_1 being romance, x_2 action, x_3 comedy, etc. The next question is: how do we find movies j related to movie i? One measure:

small ||x^(i) − x^(j)|| ⟹ movies i and j are similar.    (145)
16.6 Implementation Detail: Mean Normalization
One last implementation detail: we should probably normalize the ratings by their means (and possibly standard deviations), as sketched below. This is important since the algorithm will predict 0 for a user who has rated nothing; that is not useful unless we consider 0 to be a perfectly average rating for that movie.
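A sketch of per-movie mean normalization in Octave (my own illustration; Y is n_m x n_u and R is the indicator matrix):

mu = sum(Y .* R, 2) ./ max(sum(R, 2), 1);  % per-movie mean over rated entries
Ynorm = (Y - mu) .* R;                     % learn X and Theta on these
% then predict movie i for user j as theta(j)' * x(i) + mu(i)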
17 Large-Scale Machine Learning
Much larger datasets are responsible for a huge amount of improvement. We saw earlier that training set size wins almost regardless of which algorithm is used.
17.1 Learning with Large Datasets
Suppose the training set size is m = 100,000,000, and we want to do the gradient update

θ_j := θ_j − α (1/m) Σ_{i=1}^m (h_θ(x^(i)) − y^(i)) x_j^(i)    (146)

Before we do this, we should ask: why not just use 1000 examples? One way to check is to plot a learning curve for a range of values of m and verify that the algorithm has high variance when m is small. We know that if we saw a high bias case instead, we could add features or hidden nodes (with neural nets) to hopefully get to a low bias setting, and then add more data!
17.2 Stochastic Gradient Descent
Suppose you have linear regression with gradient descent. The problem with regular (that is, batch) gradient descent is that the sum is very expensive to calculate; it is very slow! We can reformulate by taking

cost(θ, (x^(i), y^(i))) = (1/2)(h_θ(x^(i)) − y^(i))²    (147)

and

J_train(θ) = (1/m) Σ_{i=1}^m cost(θ, (x^(i), y^(i))).    (148)
As noted before, the batch gradient step is (146), but stochastic gradient descent works like this:
1. Randomly shuffle (reorder) the training examples.
2. Repeat:
(a) for i = 1, ..., m:

θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)    (for every j)    (149)

Key point: this doesn't actually converge, but that's okay, since it ends up at a point near the optimum.
Typically the outer loop will only be run 1-10 times, in contrast to the great many batch gradient descent steps we'd otherwise have to take.
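A minimal Octave sketch of this loop for linear regression (my own illustration; X is m x n with examples as rows, theta is n x 1):

m = size(X, 1);
idx = randperm(m);
X = X(idx, :); y = y(idx);                   % 1. shuffle
for pass = 1:10                              % 2. a few passes over the data
  for i = 1:m
    err = X(i, :) * theta - y(i);            % error on a single example
    theta = theta - alpha * err * X(i, :)';  % eq (149) for all j at once
  end
end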
17.3 Mini-Batch Gradient Descent
So far:
1. Batch gradient descent: use all m examples in each iteration.
2. Stochastic gradient descent: use 1 example in each iteration.
There's also another option: mini-batch gradient descent, where we use b examples in each iteration. That is, instead of summing over the huge dataset, we only sum over b (say, b = 10). This looks like:
1. Say b = 10, m = 1000. Then
2. Repeat:
(a) for i = 1, 11, 21, ..., 991:

θ_j := θ_j − α (1/10) Σ_{k=i}^{i+9} (h_θ(x^(k)) − y^(k)) x_j^(k)    (for every j)    (150)

This will have a positive effect if you have a good vectorized implementation.
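The vectorization point, as an Octave sketch (my own illustration; each update touches one b-row slice of X instead of looping over single examples):

b = 10; m = size(X, 1);
for i = 1:b:m
  batch = i:min(i + b - 1, m);
  grad = (1/numel(batch)) * X(batch, :)' * (X(batch, :) * theta - y(batch));
  theta = theta - alpha * grad;    % eq (150), vectorized over the batch
end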
17.4 Stochastic Gradient Descent Convergence
How do you tune α, and ensure the algorithm is converging? During learning, compute cost(θ, (x^(i), y^(i))) before updating θ using (x^(i), y^(i)). Every (e.g.) 1000 iterations or so, plot cost(θ, (x^(i), y^(i))) averaged over the last 1000 examples processed by the algorithm.
Three potential behaviors:
1. Decreasing oscillation: okay!
2. Steady oscillation: decrease the learning rate, or look for a bug.
3. Increasing: decrease the learning rate.
If you want stochastic gradient descent to actually converge, you can slowly decay the learning rate so the iterates settle at the minimum, via:

α = c_1 / (iteration number + c_2)    (151)
17.5 Online Learning
Online learning works for applications where we have a continuous flood of data, say from a continuous stream of visitors.
One application: suppose you have a shipping service website where a user comes and specifies an origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1) and sometimes not (y = 0).
We might also suppose that we have some features x which capture properties of the user, origin/destination, and asking price. We want to learn p(y = 1 | x; θ) to optimize the price. Our website will run something like: repeat forever
1. Get (x, y) corresponding to a user.
2. Update θ using (x, y):

θ_j := θ_j − α (h_θ(x) − y) x_j    (for every j)    (152)
One interesting thing about this is that it can adapt to
changing user preferences.
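In Octave terms, the loop looks something like this sketch (my own illustration; getNextUserExample is a hypothetical stream source, and each example is used once and then discarded):

while true
  [x, y] = getNextUserExample();        % hypothetical: one (x, y) per visitor
  h = 1 / (1 + exp(-theta' * x));       % logistic hypothesis
  theta = theta - alpha * (h - y) * x;  % eq (152); no dataset is stored
end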
Another one: product search (learning to search). A user searches for "android phone 1080p camera". We have 100 phones in the store, and will return 10 results. In this case,
1. x is the features of the phone: how many words in the user query match the name of the phone, how many words in the query match the description of the phone, etc.
2. y = 1 if the user clicks on the link, y = 0 otherwise.
3. We learn p(y = 1 | x; θ).
4. We can use this to show the user the 10 phones they're most likely to click on.
Other examples: choosing special offers to show the user, customizing the selection of news articles, product recommendation, etc.
Alternative: you can run the website for a few days, save away the data, and learn from that fixed dataset instead.
17.6 Map Reduce and Data Parallelism
We might have problems too big to fit on a single computer. Here we extend our methods to fit.
Suppose we want to run batch gradient descent with m = 400 (in practice, think millions). We use the MapReduce idea from Jeff Dean and Sanjay Ghemawat: split the dataset into 4 pieces and send the pieces off to different machines. That is,
1. Machine 1 uses (x^(i), y^(i)), i = 1, ..., 100 to calculate temp_j^(1) = Σ_{i=1}^{100} (h_θ(x^(i)) − y^(i)) x_j^(i).
2. ... (machines 2-4 do the same for i = 101-200, 201-300, 301-400)
Then a central server combines them as

θ_j := θ_j − α (1/400) (temp_j^(1) + temp_j^(2) + temp_j^(3) + temp_j^(4))    (153)
The key question is: can the learning algorithm be expressed as computing sums of functions over the training set?
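The pattern in Octave terms (a sketch; the four "machines" are simulated as index ranges over a local X):

n = length(theta);
shards = {1:100, 101:200, 201:300, 301:400};
temp = zeros(n, numel(shards));
for s = 1:numel(shards)                      % map: one partial sum per shard
  I = shards{s};
  temp(:, s) = X(I, :)' * (X(I, :) * theta - y(I));
end
theta = theta - (alpha/400) * sum(temp, 2);  % reduce: combine, eq (153)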
We can, for example, use advanced optimization with logistic regression in a MapReduce framework.
One other application: we can use MapReduce across the cores of a multi-core machine. In this setting, we don't have to worry about network latency.
Some numerical linear algebra libraries can apply this kind of parallelization automatically.
18 Application Example: Photo OCR
18.1 Problem Description and Pipeline
What is the photo OCR problem? Optical Character Recognition. We'd like to get computers to understand the content of pictures, like printed words.
There are several parts:
1. Where is the text?
2. What does the text say?
OCR from scanned documents is relatively easy nowadays. Photo OCR is harder.
The pipeline looks like this:
1. Text detection: find where there is text in the image.
2. Character segmentation: try to segment the text region into distinct characters.
3. Character classification: try to turn segments from images into categories representing characters.
You can even get more sophisticated, e.g. with spelling correctors.
18.2 Sliding Windows
Pedestrian detection is easier, mostly because the aspect ratio is approximately the same for different pedestrians.
We would apply supervised learning for pedestrian detection by collecting many 82x36 pixel images, with positive and negative examples. We might use a neural network or other classifier to decide whether an image patch has a pedestrian in it or not.
Then the big idea is that we take a rectangular patch and run it through the image classifier. Then we slide the rectangle over a bit and run the next patch through the classifier. We do this over and over and over. Here "a bit" (the step size) might be 1-32 pixels, depending on the setting. So we've run all these different images through the neural net. Next we keep trying larger and larger patches (scaling them down to 82x36 each time). Eventually, this should give a list of pedestrians.
For text detection, we again use supervised learning. We take the positive set to be random image patches with text, and the negative set to be non-text examples.
Once we've done this, we get a heat-map: white where the text classifier fires, with gray levels showing the classifier's probability. Next we apply an expansion operator, e.g.: is a pixel within 5px or 10px of a white pixel? (This is necessary to ensure that our bounding boxes include spaces between contiguous characters.) Then the heat-map image is also white wherever there is text in the original image. Finally, we apply box-finding to the contiguous white regions, the connected regions. Then we select the ones with aspect ratios that look roughly correct; the aspect ratio is important because most text is wider than it is tall.
Back to character segmentation: take positive examples to be patches showing a split between two characters, and negative examples to be patches showing whole, distinct characters. Then we train a classifier (with a neural network, etc.). Next, we run classification on a sliding window of boxes in our text; the positively-labeled examples will be the splits, and the negatively-labeled examples will be the places we shouldn't split the text.
To recap, our photo OCR pipeline has three stages:
1. Text detection: sliding-window classifier, plus expansion and box-finding.
2. Character segmentation: sliding-window classifier for split points.
3. Character classification: typical neural network/other classifier.
18.3 Getting Lots of Data and Artificial Data
One of the best ways to get high performance: take a low-bias algorithm and train it on tons and tons of data. So how do we get the huge datasets? Two ideas:
1. Artificial data synthesis
2. Dataset augmentation
Consider character recognition. We can use huge font libraries and paste characters against random backgrounds. Maybe you want to apply some scaling/affine/etc. transformations to the training data. This will give us a basically unlimited supply of training data.
We can also synthesize by inducing distortions. For example, each training example might give rise to 16+ different distortions. But it's important to use distortions that would arise in practice. That is, DO NOT add purely random/meaningless noise; that won't help (usually).
Audio/speech recognition example, from the original data:
1. audio on a bad cellphone connection
2. noisy background: crowd
3. noisy background: machinery
Don't just throw in duplicates; you'll just end up training the same parameters, half as quickly.
A couple of final points:
1. Make sure you have a low-bias classifier! (E.g., plot learning curves.)
2. Ask how much work it would be to get 10x as much data as we currently have; usually this will make the machine learning algorithm do much better. Options:
(a) Artificial data synthesis
(b) Collect/label it yourself
(c) Crowd source, e.g. Amazon Mechanical Turk
18.4 Ceiling Analysis: What Part of the Pipeline to Work on Next
Your time is key! (Or your team's time.) Don't work on stuff that won't work! How do you pick what parts of the pipeline to work on?
Suppose the overall system has 72% accuracy (or another metric). Then consider what the later stages would do if they had perfect data from the prior stages:
1. Given perfect text detection: 89%
2. Given perfect text detection and character segmentation: 90%
3. Given perfect text detection, character segmentation, and character recognition: 100%
This tells us we have approximately 17%, 1%, and 10% to gain from working on text detection, character segmentation, and character recognition, respectively.
Another ceiling analysis example. Pipeline:
1. Camera image
2. Preprocess (remove background)
3. Face detection
(a) eye segmentation
(b) nose detection
(c) mouth segmentation
4. Logistic regression
Then we analyze the gains from each stage in the pipeline. Don't spend human-years on a segment which will not improve performance!
Don't trust your gut on which component to work on; do a ceiling analysis every time.
19 Conclusion
19.1 Summary and Thank You
Basic summary:
1. Supervised learning: linear regression, logistic regression, neural networks, SVMs (have (x^(i), y^(i)))
2. Unsupervised learning: K-means, PCA, anomaly detection (have x^(i))
3. Special applications/topics: recommender systems, large-scale machine learning
4. Advice on building a machine learning system: bias/variance, regularization, deciding what to work on next, evaluation of learning algorithms, learning curves, error analysis, ceiling analysis.
You are now an expert in Machine Learning! Hooray!