Nuno Vasconcelos
(Ken Kreutz-Delgado)
UCSD
set $y = y_{i^*}$, where

$$i^* = \arg\min_{i \in \{1,\ldots,n\}} d(x, x_i)$$
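As a quick illustration, here is a minimal sketch of this rule in Python (the function name and toy data are our own, not from the slides):

```python
import numpy as np

def nearest_neighbor(x, X_train, y_train, d):
    """Label x with the label of the closest training point under metric d."""
    i_star = np.argmin([d(x, xi) for xi in X_train])
    return y_train[i_star]

# Euclidean metric on a two-point training set
euclidean = lambda a, b: np.linalg.norm(a - b)
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = ["class A", "class B"]
print(nearest_neighbor(np.array([0.2, 0.1]), X_train, y_train, euclidean))  # class A
```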
Metrics
we have seen some examples:

$\mathbb{R}^d$:
Inner product: $\langle x, y \rangle = x^T y = \sum_{i=1}^{d} x_i y_i$
Euclidean norm: $\|x\| = \sqrt{x^T x} = \sqrt{\sum_{i=1}^{d} x_i^2}$
Euclidean distance: $d(x, y) = \|x - y\| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

Continuous functions:
Inner product: $\langle f, g \rangle = \int f(x)\, g(x)\, dx$
norm² = energy: $\|f\|^2 = \int f^2(x)\, dx$
Distance: $d(f, g) = \sqrt{\int \left[ f(x) - g(x) \right]^2 dx}$
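A quick numerical check of the $\mathbb{R}^d$ formulas (variable names are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                            # <x, y> = x^T y = sum_i x_i y_i
norm  = np.sqrt(x @ x)                   # ||x|| = sqrt(x^T x)
dist  = np.sqrt(np.sum((x - y) ** 2))    # Euclidean distance

assert np.isclose(norm, np.linalg.norm(x))
assert np.isclose(dist, np.linalg.norm(x - y))
```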
Euclidean distance
We considered in detail the Euclidean distance
$$d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
Equidistant points to x?
$$d(x, y) = r \iff \sum_{i=1}^{d} (x_i - y_i)^2 = r^2$$

E.g., in 2D: $(x_1 - y_1)^2 + (x_2 - y_2)^2 = r^2$, the circle of radius $r$ centered at $x$.
Inner Products
fish example:
features are L = fish length, W = scale width
measure L in meters and W in millimeters
typical L: 0.70m for salmon, 0.40m for sea-bass
typical W: 35mm for salmon, 40mm for sea-bass
Inner Product
Suppose the scale width is also measured in meters:
$$d'(x, y) = \sqrt{(x_1 - y_1)^2 + 1{,}000{,}000\,(x_2 - y_2)^2}$$
Often the right units are not clear (e.g., car speed vs. weight)
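The effect is easy to see numerically; a sketch with the fish numbers above (the test fish is a hypothetical example of ours):

```python
import numpy as np

# prototypes as (L in meters, W in millimeters)
salmon   = np.array([0.70, 35.0])
sea_bass = np.array([0.40, 40.0])
fish     = np.array([0.45, 36.0])           # hypothetical test fish

d = lambda a, b: np.linalg.norm(a - b)
# with W in mm, the width term dominates: the fish looks like a salmon
print(d(fish, salmon), d(fish, sea_bass))   # ~1.03 vs ~4.00

# convert W to meters: now the length term dominates and the decision flips
to_m = np.array([1.0, 1e-3])
print(d(fish * to_m, salmon * to_m), d(fish * to_m, sea_bass * to_m))  # ~0.25 vs ~0.05
```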
Inner Products
We need to work with the right, or at least better, units
Apply a transformation to get a better feature space
x' = Ax
examples:
Taking A = R, R proper and orthogonal, is
equivalent to a rotation
Another important special case is
scaling (A = S, for S diagonal)
$$\begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \lambda_1 x_1 \\ \vdots \\ \lambda_n x_n \end{bmatrix}$$
We can combine these two transformations by taking A = SR
With $A = SR$, the inner product after the transformation is

$$\langle x', y' \rangle = (Ax)^T (Ay) = x^T A^T A y = x^T M y, \qquad M = A^T A = (SR)^T (SR) = R^T S^2 R$$

More generally, a symmetric positive-definite matrix $M$ defines a weighted inner product

$$\langle x, y \rangle_M = x^T M y$$

Symmetry of $M$ gives $\langle y, x \rangle = y^T M x = (y^T M x)^T = x^T M y = \langle x, y \rangle$, and positive definiteness gives

$$\langle x, x \rangle = x^T M x > 0, \quad \forall x \neq 0$$
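A numerical sketch of this construction (the rotation angle and scales are arbitrary choices of ours): the weighted inner product with $M = A^T A$ reproduces the ordinary inner product of the transformed vectors.

```python
import numpy as np

theta = 0.3                                   # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([1.0, 1000.0])                    # arbitrary per-feature scaling
A = S @ R
M = A.T @ A                                   # symmetric and positive definite

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose((A @ x) @ (A @ y), x @ M @ y)   # <x', y'> = x^T M y
```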
Positive definiteness can be checked via the leading principal submatrices

$$A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}, \quad \ldots$$

a symmetric matrix is positive definite if and only if all of their determinants are positive (Sylvester's criterion).
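A minimal sketch of this test for a symmetric M (the function name is ours):

```python
import numpy as np

def is_positive_definite(M):
    """Sylvester's criterion: all leading principal minors must be positive."""
    M = np.asarray(M, dtype=float)
    return all(np.linalg.det(M[:k, :k]) > 0 for k in range(1, M.shape[0] + 1))

print(is_positive_definite([[2.0, 0.5], [0.5, 1.0]]))  # True
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))  # False: det = -3
```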
Metrics
What is a good weighting matrix $M$?
Let the data tell us!
Use the inverse of the covariance matrix, $M = \Sigma^{-1}$, where

$$\Sigma = E\left[ (x - \mu)(x - \mu)^T \right], \qquad \mu = E[x]$$

Mahalanobis distance:

$$d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$
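A sketch of this data-driven recipe (the synthetic data is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2)) @ np.array([[1.0, 0.8],
                                              [0.0, 0.5]])   # correlated data

mu        = X.mean(axis=0)               # mu = E[x]
Sigma     = np.cov(X, rowvar=False)      # Sigma = E[(x - mu)(x - mu)^T]
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, y):
    diff = x - y
    return np.sqrt(diff @ Sigma_inv @ diff)

print(mahalanobis(X[0], mu))             # distance of a sample from the mean
```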
This is the metric suggested by the multivariate Gaussian,

$$P_X(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

which can be written as

$$P_X(x) = \frac{1}{K} \exp\left\{ -\frac{1}{2} d^2(x, \mu) \right\}, \qquad K = \sqrt{(2\pi)^d |\Sigma|},$$

with $d(x, \mu)$ the Mahalanobis distance. In the two-dimensional case,

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$
$\sigma_i^2$ is the variance of $X_i$
$\sigma_{12} = \mathrm{cov}(X_1, X_2)$ controls how dependent $X_1$ and $X_2$ are
Note that, when $\sigma_{12} = 0$:

$$P_{X_1, X_2}(x_1, x_2) = P_{X_1}(x_1)\, P_{X_2}(x_2),$$

i.e., the $X_i$ are independent.
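A numerical sanity check of the rewriting above (a sketch; relies on scipy.stats for the reference density):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu    = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
d = len(mu)

x     = np.array([1.0, -0.5])
maha2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)     # squared Mahalanobis distance
K     = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

assert np.isclose(np.exp(-0.5 * maha2) / K,
                  multivariate_normal(mu, Sigma).pdf(x))
```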
The correlation coefficient is

$$\rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2},$$

but since you do not change $\sigma_1$ and $\sigma_2$ when you change $\rho$, this has the same impact as changing $\sigma_{12}$.
Optimal Classifiers
Some metrics are better than others
The meaning of "better" is connected to how well adapted the metric is to the properties of the data
Can we be more rigorous? Can we have an optimal metric?
What could we mean by optimal?
To talk about optimality we start by defining cost or loss
The classifier implements a function $\hat{y} = f(x)$, and each decision incurs a loss $L(y, \hat{y})$.
E.g., in face detection:
say $\hat{y}$ = face when the image is not a face: you have a false positive
say $\hat{y}$ = non-face when the image is a face: a false negative (miss)
Loss functions
are some errors more important than others?
For an ordinary snake, eating a dart frog is fatal, while passing up a regular frog only costs it a meal:

snake prediction   Frog = dart   Frog = regular
regular            ∞             0
dart               0             10
Loss functions
but not all snakes are the same
this one is a dart frog predator: it can still classify each frog it sees into Y = {dart, regular}
it actually prefers dart frogs, but the other ones are good to eat too

Dart-Snake Losses:

snake prediction   Frog = dart   Frog = regular
regular            10            0
dart               0             10
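For the computations that follow, the two loss tables can be stored as matrices (a sketch; the class ordering [dart, regular] and the use of np.inf for the fatal outcome are our conventions):

```python
import numpy as np

# L[i, j] = loss of predicting class i when the true class is j
# class order: 0 = dart, 1 = regular
L_ordinary = np.array([[0.0,    10.0],   # predict dart: misses a meal if regular
                       [np.inf,  0.0]])  # predict regular: fatal if frog is dart
L_dart     = np.array([[0.0,  10.0],     # the dart-snake can eat either frog,
                       [10.0,  0.0]])    # so both mistakes cost the same
```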
Write $L[j \to i]$ for the loss of deciding $i$ when the true class is $j$.
Conditioned on an observed data vector x, to measure how good the classifier is, on average, use the (conditional) expected value of the loss, aka the (conditional) Risk,

$$R(x, i) \;\stackrel{\mathrm{def}}{=}\; E\{L[Y \to i] \mid x\} = \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$$
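This definition translates directly into code (a sketch, reusing the loss matrices above; zero-probability terms are skipped so that an infinite loss times probability 0 contributes nothing):

```python
def risk(L, p_post, i):
    """Conditional risk R(x, i) = sum_j L[j -> i] * P_{Y|X}(j | x)."""
    mask = p_post > 0                    # skip terms with zero probability
    return float(np.sum(L[i, mask] * p_post[mask]))

print(risk(L_ordinary, np.array([0.0, 1.0]), 1))   # 0.0: safe to say "regular"
```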
(Conditional) Risk
Note that this immediately allows us to define an optimal classifier as the one that minimizes the (conditional) risk.
For a given observation x, the Optimal Decision is given by

$$i^*(x) = \arg\min_i R(x, i) = \arg\min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$$

and the associated Optimal (minimal) Risk is

$$R^*(x) = \min_i R(x, i) = \min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$$
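And the optimal decision is just a minimization over the candidate predictions (continuing the sketch above):

```python
def optimal_decision(L, p_post):
    """Return (i*, R*): the risk-minimizing prediction and its risk."""
    risks = [risk(L, p_post, i) for i in range(L.shape[0])]
    i_star = int(np.argmin(risks))
    return i_star, risks[i_star]
```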
(Conditional) Risk
Back to our example: a snake sees a frog x, for which

$$P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$
(Conditional) Risk
Info an ordinary snake is presumed to have: the class probabilities conditioned on x,

$$P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$

and its losses:

snake prediction   Frog = dart   Frog = regular
regular            ∞             0
dart               0             10

The risk of deciding "regular" is

$$\sum_j L[j \to \text{reg}]\, P_{Y|X}(j \mid x) = 0 \cdot 1 + \infty \cdot 0 = 0 + 0 = 0$$
(Conditional) Risk
With the same information, the risk of deciding "dart" is

$$\sum_j L[j \to \text{dart}]\, P_{Y|X}(j \mid x) = 10 \cdot 1 + 0 \cdot 0 = 10 + 0 = 10$$

so the ordinary snake decides "regular" and eats the frog (risk 0 vs 10).
(Conditional) Risk
The next time the ordinary snake goes foraging for food, it sees a new image x, for which

$$P_{Y|X}(j \mid x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$
(Conditional) Risk
Given the new measurement x (same losses, new probabilities), the risk of deciding "regular" is

$$\sum_j L[j \to \text{reg}]\, P_{Y|X}(j \mid x) = 0 \cdot 0.9 + \infty \cdot 0.1 = \infty$$
(Conditional) Risk
and the risk of deciding "dart" is

$$\sum_j L[j \to \text{dart}]\, P_{Y|X}(j \mid x) = 10 \cdot 0.9 + 0 \cdot 0.1 = 9$$

so the ordinary snake decides "dart" and does not eat: any nonzero chance of a dart frog makes eating infinitely risky.
(Conditional) Risk
What about the dart-snake that can safely eat dart frogs? The dart-snake sees a frog x for which

$$P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$
(Conditional) Risk
Info the dart-snake has given x: the probabilities above and the Dart-Snake Losses

snake prediction   Frog = dart   Frog = regular
regular            10            0
dart               0             10

The risk of deciding "regular" is

$$\sum_j L[j \to \text{reg}]\, P_{Y|X}(j \mid x) = 0 \cdot 1 + 10 \cdot 0 = 0$$
(Conditional) Risk
and the risk of deciding "dart" is

$$\sum_j L[j \to \text{dart}]\, P_{Y|X}(j \mid x) = 10 \cdot 1 + 0 \cdot 0 = 10$$

so, like the ordinary snake, the dart-snake decides "regular" and eats.
(Conditional) Risk
Now the dart-snake sees a new image x, for which

$$P_{Y|X}(j \mid x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$
(Conditional) Risk
Given the new x and the Dart-Snake Losses, the risk of deciding "regular" is

$$\sum_j L[j \to \text{reg}]\, P_{Y|X}(j \mid x) = 0 \cdot 0.9 + 10 \cdot 0.1 = 1$$
(Conditional) Risk
and the risk of deciding "dart" is

$$\sum_j L[j \to \text{dart}]\, P_{Y|X}(j \mid x) = 10 \cdot 0.9 + 0 \cdot 0.1 = 9$$

so the dart-snake decides "regular" and still eats the frog (risk 1 vs 9).
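The four comparisons above can be reproduced in a few lines (a sketch using the matrices and helpers from the earlier snippets):

```python
posteriors = {"certain regular": np.array([0.0, 1.0]),   # [P(dart|x), P(regular|x)]
              "10% dart":        np.array([0.1, 0.9])}
labels = ["dart", "regular"]

for snake, L in [("ordinary", L_ordinary), ("dart-snake", L_dart)]:
    for obs, p in posteriors.items():
        i_star, r_star = optimal_decision(L, p)
        print(f"{snake:10s} | {obs:15s} -> say {labels[i_star]:7s} (risk {r_star})")
# ordinary snake: eats only when certain; dart-snake: eats in both cases
```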
(Conditional) Risk
In summary: both snakes see the same posteriors, first

$$P_{Y|X}(j \mid x) = \begin{cases} 0, & j = \text{dart} \\ 1, & j = \text{regular} \end{cases}$$

and then

$$P_{Y|X}(j \mid x) = \begin{cases} 0.1, & j = \text{dart} \\ 0.9, & j = \text{regular} \end{cases}$$

but their risks

$$R(x, i) = \sum_j L[j \to i]\, P_{Y|X}(j \mid x)$$

are different, because the risk depends on the loss function $L[j \to i]$ as well as on $P_{Y|X}(j \mid x)$: with different losses, the two snakes make different optimal decisions.

An important special case is the "0/1" loss,

$$L[j \to i] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$
Under the 0/1 loss:

$$i^*(x) = \arg\min_i \sum_j L[j \to i]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x)$$
Since $\sum_{j \neq i} P_{Y|X}(j \mid x) = 1 - P_{Y|X}(i \mid x)$,

$$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\max_i P_{Y|X}(i \mid x)$$

This is the Bayes Decision Rule (BDR) for the 0/1 loss.
We will simplify our discussion by assuming this loss, but you should always be aware that other losses may be used.
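Under the 0/1 loss the whole machinery collapses to an arg max over the posterior (a one-line sketch, continuing the earlier snippets):

```python
def bayes_decision_rule(p_post):
    """BDR for the 0/1 loss: pick the class with the largest posterior."""
    return int(np.argmax(p_post))

print(bayes_decision_rule(np.array([0.1, 0.9])))   # 1, i.e. "regular"
```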
Its (conditional) risk is

$$R(x, i^*(x)) = \sum_j L\left[j \to i^*(x)\right] P_{Y|X}(j \mid x) = \sum_{j \neq i^*(x)} P_{Y|X}(j \mid x) = 1 - P_{Y|X}(i^*(x) \mid x)$$

i.e., under the 0/1 loss, the risk of the BDR is the probability that the most likely class is not the true one.
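A last sanity check (continuing the sketch): for the 0/1 loss, the conditional risk of the BDR equals one minus the largest posterior.

```python
p   = np.array([0.1, 0.9])
L01 = 1.0 - np.eye(2)                       # 0/1 loss: 0 on the diagonal, 1 off it
i_star = bayes_decision_rule(p)
assert np.isclose(risk(L01, p, i_star), 1.0 - p.max())   # R* = 1 - max_i P(i|x)
```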