W 2

CS 3244
Machine Learning
Week 2 The Linear Model, Part I

How can machines learn?
Kan Min-Yen
Background Photo credits: Rafiq Mirza, Luan Ahn, Rosmarie Voegtli @ Flickr
Recap
Learning is used when
1. A pattern exists
2. We cannot pin it down mathematically
3. We have data on it
Focus on supervised learning

- Unknown target function $y = f(\mathbf{x})$
- Data set $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$
- Learning algorithm picks $g \approx f$ from a hypothesis set $\mathcal{H}$
Example: PLA
Learning an unknown function?

- Impossible. The function can
assume any value outside the
data we have.
- Return to this in learning theory.
NUS CS3244: Machine Learning
Three learning problems (credit analysis)
- CLASSIFICATION: approve or deny? $y = \pm 1$
- REGRESSION: amount of credit, $y \in \mathbb{R}$
- LOGISTIC REGRESSION (next week): probability of default, $y \in [0, 1]$
Linear models are the fundamental models

The linear model is the first model to try
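The linear signal
$s = \mathbf{w}^T \mathbf{x}$
- Classification: $\mathrm{sign}(\mathbf{w}^T \mathbf{x})$, $y = \pm 1$
- Regression: $\mathbf{w}^T \mathbf{x}$, $y \in \mathbb{R}$
- Logistic regression: $\theta(\mathbf{w}^T \mathbf{x})$, $y \in [0, 1]$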

Outline
>> Input representation
>> Linear classification
Linear regression
Nonlinear transformation
Error Measures
Noisy Targets
A real dataset
Each digit is a 16x16 pixel gray-intensity image.

[-1 -1 -1 -1 -1 -1 -1 -0.63 0.86 -0.17 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.99 0.3 1 0.31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.41 1 0.99 -0.57 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.68 0.83 1 0.56 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.94 0.54 1 0.78 -0.72 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0.1 1 0.92 -0.44 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -0.26 0.95 1 -0.16 -1 -1 -1 -0.99 -0.71 -0.83 -1 -1 -1 -1 -1 -0.8
0.91 1 0.3 -0.96 -1 -1 -0.55 0.49 1 0.88 0.09 -1 -1 -1 -1 0.28 1 0.88 -0.8 -1 -0.9 0.14 0.97 1 1 1 0.99 -0.74 -1 -1 -0.95 0.84 1 0.32 -1 -1 0.35 1 0.65 -0.10 -0.18 1 0.98 -0.72 -1 -1 -0.63 1 1
0.07 -0.92 0.11 0.96 0.30 -0.88 -1 -0.07 1 0.64 -0.99 -1 -1 -0.67 1 1 0.75 0.34 1 0.70 -0.94 -1 -1 0.54 1 0.02 -1 -1 -1 -0.90 0.79 1 1 1 1 0.53 0.18 0.81 0.83 0.97 0.86 -0.63 -1 -1 -1 -1 -0.45
0.82 1 1 1 1 1 1 1 1 0.13 -1 -1 -1 -1 -1 -1 -0.48 0.81 1 1 1 1 1 1 0.21 -0.94 -1 -1 -1 -1 -1 -1 -1 -0.97 -0.42 0.30 0.82 1 0.48 -0.47 -0.99 -1 -1 -1 -1]
Input representation
Raw input: $\mathbf{x} = (x_0, x_1, x_2, x_3, x_4, \dots, x_{256})$
Linear model: $(w_0, w_1, w_2, \dots, w_{256})$
Too many (257) parameters!
Features: extract useful information, e.g.,

Intensity and symmetry: $\mathbf{x} = (x_0, x_1, x_2)$
Linear model: $(w_0, w_1, w_2)$
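A sketch of how two such features might be computed from a 16x16 image. The exact definitions are assumptions, not taken from the slides: intensity is taken to be the mean pixel value, and symmetry the negative mean absolute difference between the image and its left-right mirror. The name `extract_features` is hypothetical.

```python
import numpy as np

def extract_features(pixels):
    """Map 256 gray values in [-1, 1] to (x0, x1, x2) = (1, intensity, symmetry)."""
    img = np.asarray(pixels, dtype=float).reshape(16, 16)
    intensity = img.mean()                       # assumed: average gray level
    mirrored = img[:, ::-1]                      # left-right flip
    symmetry = -np.abs(img - mirrored).mean()    # assumed: 0 means perfectly symmetric
    return np.array([1.0, intensity, symmetry])  # x0 = 1 is the bias coordinate

# A blank image with one vertical stroke in a single column (asymmetric).
demo = -np.ones((16, 16))
demo[:, 7] = 1.0
features = extract_features(demo.ravel())
```

The 257 raw coordinates collapse to 3, so the linear model needs only three weights.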
Illustration of features
$\mathbf{x} = (x_0, x_1, x_2)$
$x_1$ = intensity
$x_2$ = symmetry
Quick Question: which axis is which?
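Iterations of PLA
One iteration of PLA:
$\mathbf{w} \leftarrow \mathbf{w} + y\,\mathbf{x}$
where $(\mathbf{x}, y)$ is a misclassified training point.

At iteration $t = 1, 2, 3, \dots$, pick a misclassified point from
$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)$
and run a PLA iteration on it.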
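The loop above can be sketched as follows. This is a minimal illustration, not the course's reference code: the function name `pla`, the zero initial weights, the rule of picking the first misclassified point, and the toy data are all assumptions.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm.

    X: (N, d+1) array whose first column is all ones (x0 = 1).
    y: (N,) array of labels in {-1, +1}.
    Returns the weights and the number of updates performed.
    """
    w = np.zeros(X.shape[1])
    for t in range(max_iters):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w, t                 # no misclassified points remain
        n = misclassified[0]            # assumed rule: take the first one
        w = w + y[n] * X[n]             # the PLA update: w <- w + y x
    return w, max_iters

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, updates = pla(X, y)
```

On separable data the loop stops once every point is classified correctly; the `max_iters` cap only matters when the data are not separable.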
What PLA does

[Figure: evolution of $E_{in}$ and $E_{out}$ (error vs. iterations); final perceptron boundary (symmetry vs. average intensity)]
The pocket algorithm

Run PLA
- At each step keep the best $E_{in}$ (and $\mathbf{w}$) so far
(It's not rocket science, but it works!)
[Figure: PLA (for comparison) vs. Pocket. Panels: error (log scale) vs. iterations; symmetry vs. average intensity]
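A minimal sketch of the pocket idea, assuming a random choice among misclassified points and zero initial weights; the function name `pocket` and the toy data are made up for illustration.

```python
import numpy as np

def pocket(X, y, max_iters=200, seed=0):
    """PLA updates, but remember the weights with the lowest in-sample error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_iters):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break
        n = rng.choice(misclassified)   # assumed rule: random misclassified point
        w = w + y[n] * X[n]             # ordinary PLA step
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:              # keep the best E_in (and w) so far
            best_w, best_err = w.copy(), err
    return best_w, best_err

# Toy data that is not linearly separable: the last point, labelled -1,
# is the midpoint of the first two points, which are labelled +1.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0],
              [1.0, 1.5, 2.0]])
y = np.array([1, 1, -1, -1, -1])
w_pocket, e_in = pocket(X, y)
```

Plain PLA would keep cycling on this data; the pocket version returns the best weights it has seen when the iteration budget runs out.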

Outline
Linear classification
>> Linear regression (regression = real-valued output)
Error Measures
Noisy Targets
Credit Approval
How much credit do we extend this person?
Classification: Approve/Deny
Regression: Credit Line (dollar amount)
Input:

Criterion                    Value
Age                          32 years
Gender                       Male
Salary                       40 K
Debt                         26 K
Years in Job                 1 year
Years at Current Residence   3 years
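Linear regression output: $h(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x}$

Data set: Credit officers decide on credit lines
(historical data): $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)$
$y_n \in \mathbb{R}$ is the credit line for customer $\mathbf{x}_n$; regression tries
to replicate this.
Artwork credits: Project DebtRelief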
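Real-valued error measurement

How well does $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ approximate $f(\mathbf{x})$?
In linear regression, we use squared error: $(h(\mathbf{x}) - f(\mathbf{x}))^2$
In-sample error (the average over the data):
$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(\mathbf{x}_n) - y_n)^2$
How bad is bad?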
Illustration of linear regression
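The expression for $E_{in}$

$E_{in}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{w}^T \mathbf{x}_n - y_n)^2 = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2$

where
$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \mathbf{x}_3^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$

Quick Question:
What are the dimensions of X?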
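As a quick numerical check that the summation form and the matrix form of $E_{in}$ agree, here is a small sketch. The function names and the three data points are made up for illustration.

```python
import numpy as np

def e_in_sum(w, X, y):
    """E_in as an explicit average of squared errors over the N points."""
    N = len(y)
    return sum((w @ X[n] - y[n]) ** 2 for n in range(N)) / N

def e_in_matrix(w, X, y):
    """E_in as (1/N) * ||Xw - y||^2."""
    residual = X @ w - y
    return residual @ residual / len(y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # rows are x_n^T, with x0 = 1
y = np.array([1.0, 3.0, 4.0])
w = np.array([1.0, 1.0])
```

Here $X\mathbf{w} = (1, 2, 3)$, so the residuals are $(0, -1, -1)$ and both forms give $2/3$.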
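Minimizing $E_{in}$

$E_{in}(\mathbf{w}) = \frac{1}{N} \| X\mathbf{w} - \mathbf{y} \|^2$

$\nabla E_{in}(\mathbf{w}) = \frac{2}{N} X^T (X\mathbf{w} - \mathbf{y}) = \mathbf{0}$

$X^T X \mathbf{w} = X^T \mathbf{y}$

$\mathbf{w} = X^\dagger \mathbf{y}$ where $X^\dagger = (X^T X)^{-1} X^T$

$X^\dagger$ is the pseudo-inverse of $X$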
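The pseudo-inverse

$X^\dagger = (X^T X)^{-1} X^T$
Dimensions: $X$ is $N \times (d+1)$, $X^T X$ is $(d+1) \times (d+1)$, and $X^\dagger$ is $(d+1) \times N$.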
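Linear regression algorithm

1. Construct the matrix $X$ and the vector $\mathbf{y}$ from the data set $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ as follows:
   $X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}$ (input data matrix), $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$ (target vector)
2. Compute the pseudo-inverse $X^\dagger = (X^T X)^{-1} X^T$
3. Return $\mathbf{w} = X^\dagger \mathbf{y}$
One-step learning!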
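The three steps can be sketched in NumPy as below. This is an illustration under assumptions: the function name `linear_regression` and the data are made up, and `np.linalg.pinv` is used in place of the explicit formula (the two agree when $X^T X$ is invertible).

```python
import numpy as np

def linear_regression(X, y):
    """Return w = X^dagger y, the minimizer of (1/N) * ||Xw - y||^2."""
    X_dagger = np.linalg.pinv(X)    # equals (X^T X)^{-1} X^T when X^T X is invertible
    return X_dagger @ y

# Made-up data generated exactly by y = 2 + 3 * x1.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])     # step 1: input data matrix with x0 = 1
y = 2.0 + 3.0 * x1                              # step 1: target vector
w = linear_regression(X, y)                     # steps 2 and 3
w_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y  # the explicit formula, for comparison
```

Because the toy targets are noiseless, both routes recover the generating weights (2, 3) up to floating-point error.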
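Linear regression for classification

Linear regression learns a real-valued function $y = f(\mathbf{x}) \in \mathbb{R}$
Binary-valued functions are also real-valued! $\pm 1 \in \mathbb{R}$
Use linear regression to get $\mathbf{w}$ where $\mathbf{w}^T \mathbf{x}_n \approx y_n = \pm 1$
In this case, $\mathrm{sign}(\mathbf{w}^T \mathbf{x}_n)$ is likely to agree with $y_n = \pm 1$
Good initial weights for classification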
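A small sketch of this idea: fit the $\pm 1$ labels by regression, then take the sign of the output. The two Gaussian blobs are made-up data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up two-class data: two Gaussian blobs labelled -1 and +1.
neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
X = np.column_stack([np.ones(40), np.vstack([neg, pos])])
y = np.concatenate([-np.ones(20), np.ones(20)])

w = np.linalg.pinv(X) @ y             # regression on the +/-1 targets
predictions = np.sign(X @ w)          # threshold the real-valued output
agreement = np.mean(predictions == y)
```

The resulting `w` could then serve as the starting point for PLA or the pocket algorithm instead of a zero vector.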
Why linear regression doesn't set good weights for classification

What's wrong with this picture?
Hint: think squared error

Outline
Linear regression
>> Nonlinear transformation
Error Measures
Noisy Targets
Linear models are limited

[Figure: two panels, Data and Hypothesis]
Another example
Credit line is affected by years in current residence, but not in a linear way.
Nonlinear features $[[x_i < 1]]$ and $[[x_i > 5]]$ are better.

$[[\cdot]]$ means: return 1 if the enclosed condition is true, 0 otherwise.
Can we do that with linear models?
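A sketch of the two indicator features, assuming the input is "years at current residence"; the function name `residence_features` and the sample values are made up.

```python
import numpy as np

def residence_features(years):
    """Return the two indicator features [[years < 1]] and [[years > 5]]."""
    years = np.asarray(years, dtype=float)
    return np.column_stack([(years < 1).astype(float),
                            (years > 5).astype(float)])

years = np.array([0.5, 3.0, 7.0])        # made-up values
Z = residence_features(years)
```

Each row of `Z` can be fed to a linear model in place of the raw number of years.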
But linear in what?

Linear regression implements
$\sum_{i=0}^{d} w_i x_i$
Linear classification implements
$\mathrm{sign}\left( \sum_{i=0}^{d} w_i x_i \right)$
The algorithms work because of linearity in the weights $\mathbf{w}$;
that says nothing about the observed data $\mathbf{x}$.
Transform the data nonlinearly

$(x_1, x_2, \dots, x_N) \xrightarrow{\Phi} (x_1^2, x_2^2, \dots, x_N^2)$
Any $\mathbf{x} \xrightarrow{\Phi} \mathbf{z}$ preserves this linearity!
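To illustrate, the sketch below applies a squaring transform and then runs an ordinary linear method in the transformed space. The circular target, the sample size, and the use of regression followed by a sign are assumptions chosen for the demonstration; `phi` is a hypothetical name.

```python
import numpy as np

def phi(X):
    """Feature transform (x1, x2) -> (1, x1^2, x2^2); the leading 1 is z0."""
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))                   # made-up inputs
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.6, 1.0, -1.0)  # circular target (assumed)

Z = phi(X)
w_tilde = np.linalg.pinv(Z) @ y          # linear regression in Z space
g = np.sign(Z @ w_tilde)                 # g(x) = sign(w_tilde^T Phi(x))
accuracy = np.mean(g == y)
```

The learning step never changes: it is still linear in the weights, only the inputs it sees have been transformed.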
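What transforms to what

$\mathbf{x} = (x_0, x_1, \dots, x_d) \xrightarrow{\Phi} \mathbf{z} = (z_0, z_1, \dots, z_{\tilde{d}})$
$\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N \xrightarrow{\Phi} \mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N$
$y_1, y_2, \dots, y_N \xrightarrow{\Phi} y_1, y_2, \dots, y_N$
No weights in $\mathcal{X}$; in $\mathcal{Z}$: $\tilde{\mathbf{w}} = (w_0, w_1, \dots, w_{\tilde{d}})$
$g(\mathbf{x}) = \mathrm{sign}(\tilde{\mathbf{w}}^T \mathbf{z}) = \mathrm{sign}(\tilde{\mathbf{w}}^T \Phi(\mathbf{x}))$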

Outline
Linear regression
>> Error Measures
>> Noisy Targets
Recap: the learning diagram

- UNKNOWN TARGET FUNCTION $f$ (ideal credit approval function)
- DATA: $y_n = f(\mathbf{x}_n)$
- TRAINING EXAMPLES $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$ (historical records of credit customers)
- LEARNING ALGORITHM
- HYPOTHESIS SET (set of candidate functions)
- FINAL HYPOTHESIS $g \approx f$ (final credit approval function)
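Error measures
What does "$h \approx f$" mean?
Need an error measure $E(h, f)$
This is almost always a pointwise definition: $e(h(\mathbf{x}), f(\mathbf{x}))$
Examples we've seen:
Squared error: $e(h(\mathbf{x}), f(\mathbf{x})) = (h(\mathbf{x}) - f(\mathbf{x}))^2$
Binary error: $e(h(\mathbf{x}), f(\mathbf{x})) = [[h(\mathbf{x}) \neq f(\mathbf{x})]]$
Which is for classification?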
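From pointwise to overall

Overall error $E(h, f)$ = average of pointwise errors $e(h(\mathbf{x}), f(\mathbf{x}))$
In-sample error:
$E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} e(h(\mathbf{x}_n), f(\mathbf{x}_n))$
Why not a sum instead of an average?
Out-of-sample error:
$E_{out}(h) = \mathbb{E}_{\mathbf{x}}[e(h(\mathbf{x}), f(\mathbf{x}))]$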
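A short sketch of how a pointwise error becomes an in-sample error by averaging. The function names and the four sample values are made up for illustration.

```python
import numpy as np

def squared_error(h_x, f_x):
    return (h_x - f_x) ** 2

def binary_error(h_x, f_x):
    return (h_x != f_x).astype(float)    # [[h(x) != f(x)]]

def in_sample_error(pointwise, h_vals, f_vals):
    """Average the pointwise error e(h(x_n), f(x_n)) over the N samples."""
    return np.mean(pointwise(h_vals, f_vals))

h_vals = np.array([1.0, -1.0, 1.0, 1.0])    # made-up hypothesis outputs
f_vals = np.array([1.0, 1.0, 1.0, -1.0])    # made-up target values
e_bin = in_sample_error(binary_error, h_vals, f_vals)
e_sq = in_sample_error(squared_error, h_vals, f_vals)
```

Two of the four points disagree, so the binary error is 0.5; with $\pm 1$ values each disagreement costs 4 under squared error, giving an average of 2.0.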
The learning diagram with testing data, pointwise error

- DATA: $y_n = f(\mathbf{x}_n)$
- TRAINING EXAMPLES $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$
- LEARNING ALGORITHM
- HYPOTHESIS SET
- FINAL HYPOTHESIS: $g(\mathbf{x}) \approx f(\mathbf{x})$
Choosing your error measure

Are you sick? ($+1$ = sick, $-1$ = well)
Two types of error:
false accept (false positive) or
false reject (false negative)
How should we penalize each type?
During your last final exam before graduation
Are you sick? ($+1$ = sick, $-1$ = well)
False reject: you get better on your own, or come back to the clinic later. At least you graduate on time.
False accept: you take the exam next year, and possibly pay tuition fees. $$$
During SARS
Are you sick? ($+1$ = sick, $-1$ = well)
False reject: highly costly! An epidemic ensues!
False accept: requires the inconvenience of quarantine.
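One way to encode such asymmetric penalties is a cost-weighted error. The sketch below is illustrative only: the cost values are hypothetical numbers chosen to mirror the two scenarios, not figures from the slides, and `weighted_error` is a made-up name.

```python
import numpy as np

def weighted_error(h_vals, f_vals, cost_false_accept, cost_false_reject):
    """Average cost where +1 = sick and -1 = well.

    False accept: h says +1 (sick) but f is -1 (well).
    False reject: h says -1 (well) but f is +1 (sick).
    """
    false_accept = (h_vals == 1) & (f_vals == -1)
    false_reject = (h_vals == -1) & (f_vals == 1)
    costs = cost_false_accept * false_accept + cost_false_reject * false_reject
    return costs.mean()

h_vals = np.array([1, -1, 1, -1])    # made-up diagnoses
f_vals = np.array([-1, 1, 1, -1])    # made-up ground truth

# Hypothetical penalties: the numbers are illustrative only.
exam_cost = weighted_error(h_vals, f_vals, cost_false_accept=10, cost_false_reject=1)
sars_cost = weighted_error(h_vals, f_vals, cost_false_accept=1, cost_false_reject=1000)
```

The same predictions score very differently under the two cost settings, which is why the error measure should come from the application.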
How you measure matters

Where possible, we should use error measures that fit the task, specified by the user.
However, this isn't always possible. Then use:
Plausible measures: squared error $\equiv$ Gaussian noise
Convenient measures: closed-form solution, convex optimization

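The learning diagram with error measure

- DATA: $y_n = f(\mathbf{x}_n)$
- TRAINING EXAMPLES $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$
- ERROR MEASURE $e(\cdot)$
- LEARNING ALGORITHM
- HYPOTHESIS SET
- FINAL HYPOTHESIS: $g(\mathbf{x}) \approx f(\mathbf{x})$
Quick Question: where does the error measure go?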
Noisy targets
The "target function" isn't always a function:

Criterion                    Value
Age                          32 years
Gender                       Male
Salary                       40 K
Debt                         26 K
Years in Job                 1 year
Years at Current Residence   3 years

Two identical customers applying for loan approval could have two different outcomes!
Why? And how do we characterize these sources of noise?
Is misreporting salary also a cause of noisy targets?
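Target distribution
Instead of saying the target is a function $y = f(\mathbf{x})$, think of it as a
distribution: $P(y|\mathbf{x})$
Our data $(\mathbf{x}, y)$ is now generated by the joint distribution:
$P(\mathbf{x}, y) = P(\mathbf{x})\,P(y|\mathbf{x})$
We'll revisit the likelihood of the data again later.
Noisy target = deterministic target function $f(\mathbf{x}) = \mathbb{E}(y|\mathbf{x})$
+ noise $(y - f(\mathbf{x}))$
A deterministic target is just a special case:
$P(y|\mathbf{x}) = 0$, except for $y = f(\mathbf{x})$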
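A small simulation of a noisy target, assuming a linear $f$ and zero-mean Gaussian noise; both choices, and the names `f` and `sample_y`, are hypothetical and serve only to illustrate $y \sim P(y|\mathbf{x})$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical deterministic target f(x) = E[y | x]."""
    return 2.0 * x + 1.0

def sample_y(x, noise_std=0.5):
    """Draw y ~ P(y | x): the target f(x) plus zero-mean Gaussian noise (assumed)."""
    return f(x) + rng.normal(0.0, noise_std, size=np.shape(x))

x = np.full(10_000, 3.0)          # many "identical customers"
y = sample_y(x)                   # outcomes differ although x is the same
empirical_mean = y.mean()         # should be close to f(3) = 7
```

Identical inputs yield different outputs, yet their average stays close to the deterministic part $f(\mathbf{x})$.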

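The learning diagram including noisy target

- TARGET DISTRIBUTION $P(y|\mathbf{x})$: target function $f$, plus noise (noisy credit approval function)
- DATA / TRAINING EXAMPLES $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$
- ERROR MEASURE $e(\cdot)$
- LEARNING ALGORITHM
- HYPOTHESIS SET
- FINAL HYPOTHESIS: $g(\mathbf{x}) \approx f(\mathbf{x})$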
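Summary
Linear models use the signal: $s = \mathbf{w}^T \mathbf{x}$
Classification: $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$
Regression: $h(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$
Linear regression algorithm: $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$
Error measures
- Application specific, user should specify
- False accepts and rejects may differ in badness
Noisy targets
- $y = f(\mathbf{x})$ becomes $y \sim P(y|\mathbf{x})$
$\mathbf{w}^T \mathbf{x}$ is linear in $\mathbf{w}$
- Any $\mathbf{x} \xrightarrow{\Phi} \mathbf{z}$ preserves this linearity
- E.g., $(x_1, x_2) \xrightarrow{\Phi} (x_1^2, x_2^2)$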
