Michael Importance Weighting

learning bounds for importance weighting
Tamas Madarasz & Michael Rabadi

April 15, 2015
Introduction
Often, training distribution does not match testing distribution

Want to utilize information about test distribution
Correct bias or discrepancy between training and testing
distributions
1
Importance Weighting
Labeled training data from source distribution Q

Unlabeled test data from target distribution P
Weight the cost of errors on training instances.
Common denition of weight for point x: w(x) = P(x)/Q(x)
2
Importance Weighting
Reasonable method, but sometimes doesnt work

Can we give generalization bounds for this method?
When does DA work? When does it not work?
How should we weight the costs?
3
Overview
Preliminaries
Learning guarantee when loss is bounded
Learning guarantee when loss is unbounded, but second moment
is bounded
Algorithm
4
Preliminaries: Rnyi divergence
For 0, D (P||Q) between distributions P and Q
( )1
1 P(x)
D (P||Q) = log2 P(x)
1 x
Q(x)
[ ] 1
1
P (x)
D (P||Q)
d (P||Q) = 2 =
x
Q1 (x)
Metric of info lost when Q is used to approximate P

D (P||Q) = 0 iff P = Q
5
Preliminaries: Importance weights
Lemma 1:
E[w] = 1 E[w2 ] = d2 (P||Q) 2 = d2 (P||Q) 1
Proof:
( P(x) )2
2 2
EQ [w ] = w (x)Q(x) = Q(x) = d2 (P||Q)
Q(x)
xX xX
Lemma 2: For all > 0 and x X,

1
EQ [w2 (x)L2h (x)] d+1 (P||Q)R(h)1
6
1 1
Hlders Inequality (Jin, Wilson, and Nobel, 2014): Let p + q = 1, then
( ) p1 ( ) q1
p q
|ax bx | |ax | |bx |
x x x
7
Proof for Lemma 2: Let the loss be bounded by B = 1, then
[ ]2 [ ]
2 P(x) 1 P(x) 1
ExQ [w (x)L2h (x)] = Q(x) L2h (x) = P(x) P(x) L2h (x)
x
Q(x) x
Q(x)
[ [ ] ] 1 [ ] 1
P(x) 2
P(x) P(x)Lh (x)
1
x
Q(x) x
[ +1
] 1

= d+1 (P||Q) P(x)Lh (x)Lh 1
(x)
x
1 1 1
d+1 (P||Q)R(h)1 B1+ = d+1 (P||Q)R(h)1
8
Learning Guarantees: Bounded case
P(x)
supx w(x) = supx Q(x) = d (P||Q) = M. Let d (P||Q) < +. Fix h H.
Then, for any > 0, with probability at least 1 ,

2
w (h)| M log
|R(h) R
2m
M can be very large, so we naturally want a more favorable bound...
9
Overview
Preliminaries
is bounded
Algorithm
10
Theorem 1: Fix h H. For any 1, for any > 0, with probability at

least 1 , the following bound holds:

1 1 1 R(h)2 ] log 1
R(h) R w (h) + 2M log + 2[d+1 (P||Q)R(h)
3m m
11
Bernsteins inequality (Bernstein 1946):

( n ) ( )
1 n2
Pr xi exp
n 2 2 + 2M/3
i=1
when |xi | M.
12
Proof of Theorem 1: Let Z be the random variable w(x)Lh (x) R(x).

Then |Z| M. Thus, by lemma 2, the variance of Z can be bounded in
terms of d+1 (P||Q):
1
2 (Z) = EQ [w2 (x)Lh (x)2 )] R(h)2 d+1 (P||Q)R(h)1 R(h)2
( )
m2 /2
Pr[R(h) R
w (h) > ] exp .
2 (Z) + M/3
13
Thus, setting to match upper bound, then with probability at least

1

1
2M log M2 log2 1 2 2 (Z) log 1
R(h) Rw (h) +
+ +
3m 9m2 m

1
w (h) + 2M log + 2 2 (Z) log 1
=R
3m m
14
Theorem 2: Let H be a nite hypothesis set. Then for any > 0, with
probability at least 1 , the following holds for the importance
weighting method:

w (h) + 2M(log|H| + log ) + 2d2 (P||Q)(log |H| = log
1 1
R(h) R
3m m
15
Theorem 2 holds when = 1. Note that theorem 1 can be simplied

in the case of = 1:

1 1
R(h) R w (h) + 2M log + 2d2 (P||Q) log
3m m
Thus, theorem 2 follows by including the cardinality of H
16
Proposition 2: Lower bound. Assume M < and 2 (w)/M2 1/m.

Assume there exists h0 H such that Lh0 (x) = 1 for all x. There exists
an absolute constant c, c = 2/412 , such that
[ ]
d2 (P||Q) 1
Pr sup |R(h) Rw (h)|
c>0
hH 4m
Proof from theorem 9 of Cortes, Mansour, and Mohri, 2010.
17
Overview
Preliminaries
is bounded
Algorithm
18
Learning Guarantees: Unbounded case
d (P||Q) < does not always hold... Assume P and Q follow a

Guassian distribution with P and Q with means and
[ ]
P(x) P Q2 (x )2 P2 (x )2
= exp
Q(x) Q 2P2 Q2
P(x)
Thus, even if P = Q and = , d (P||Q) = supx Q(x) = , thus
Theorem 1 is not informative.
19
However, the variance of the importance weights is bounded.

+ [ ]
Q 2 2 (x )2 P2 (x )2 2
dw (P||Q) = 2 exp Q Q dx
P 2 2P2
20
Intuition: if = and P >> Q
Q provides some useful information about P

But sample from Q only has few points far from
A few extreme sample points would have large weights
Likewise, if P = Q but >> , weights would be negligible.
21
Theorem 3: Let H be a hypothesis set such that

Pdim({Lh (x) : H H}) = p < . Assume that d2 (P||Q) < + and
w(x) = 0 for all x. Then for > 0, with probability at least 1 , the
following holds:

5/4
3 p log
2me
p + log
4
R(h) Rw (h) + 2
8
d2 (P||Q)
m
22
Proof outline (full proof in of Cortes, Mansour, Mohri, 2010):

[ ]
h]
Pr suphH E[Lh ] E[L
2]
> 2 + log
1
E[L
[ h ]

Pr suphH,t Pr[Lh>t]Pr[L h >t]
>
h >t]
Pr[L
[ ] ( )
2
Pr suphH
R(h)R(h)
> 2 + log 4H (2m) exp m4
1
R(h)
[ ]
h (x)]
Pr sup h H 2
E[Lh (x)]E[L
> 2 + log 1
E[Lh (x)]
( )
m2
p 4
4 exp p log 2em
[ ] ( )
h (x)] m8/3
Pr suphH E[Lh (x)]E[L
2
> 4 exp p log 2em
p 45/3
E[Lh (x)]
|E[Lh (x)] E[L

(x)]|

h 3 p log 2me +log 8
5/4 2 2
2 max{ E[Lh (x)], E[Lh (x)]} 8 p
m 23
Thus, we can show the following:

] ( )
[ R(h) R w (h) 2em m8/3
Pr sup > 4 exp p log 5/3 .
hH d2 (P||Q) p 4
Where p = Pdim({Lh (x) : h H} is the pseudo-dimension of

H = {w(x)Lh (x) : h H}. Note, any set shattered by H is shattered
by H , since there exists a subset B of a set A that is shattered by H ,
such that H shatters A with witnesses si = ri /w(xi ).
24
Overview
Preliminaries
is bounded
Algorithm
25
Alternative algorithms
We can generalize this analysis to an arbitrary function

u (h) = 1 m u(xi )Lh (xi ) and let Q
u : X 7 R, u > 0. Let R be the
m i=1
empirical distribution: Theorem 4: Let H be a hypothesis set such
that Pdim({Lh (x) : h H}) = p < . Assume that
0 < EQ [u2 (x)] < + and u(x) = 0 for all x. Then for any > 0 with
probability at least 1 ,
[ ]
|R(h) R
u (h) |EQ [w(x) u(x)]Lh (x) |

( ) p log 2me 4
p + log
3
+25/4 max
8
EQ [u2 (x)L2h (x)], EQ [u2 (x)L2h (x)]
m
26
Other functions u than w can be used to reweight cost of error

Minimize upper bound
( )
( )
max EQ [u ], EQ [u ] EQ [u2 ] 1 + O(1/ m) ,
2 2
[ ]
minuU E |w(x u(x)| + EQ [u2 ]
Trade-off between bias and variance minimization.
27
28
29

Michael Importance Weighting

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Michael Importance Weighting

Uploaded by

Copyright:

Available Formats

learning bounds for importance weighting

Tamas Madarasz & Michael Rabadi

Often, training distribution does not match testing distribution

Labeled training data from source distribution Q

Reasonable method, but sometimes doesnt work

For 0, D (P||Q) between distributions P and Q

Metric of info lost when Q is used to approximate P

E[w] = 1 E[w2 ] = d2 (P||Q) 2 = d2 (P||Q) 1

Lemma 2: For all > 0 and x X,

Proof for Lemma 2: Let the loss be bounded by B = 1, then

M can be very large, so we naturally want a more favorable bound...

Theorem 1: Fix h H. For any 1, for any > 0, with probability at

Bernsteins inequality (Bernstein 1946):

Proof of Theorem 1: Let Z be the random variable w(x)Lh (x) R(x).

Thus, setting to match upper bound, then with probability at least

Theorem 2 holds when = 1. Note that theorem 1 can be simplied

Thus, theorem 2 follows by including the cardinality of H

Proposition 2: Lower bound. Assume M < and 2 (w)/M2 1/m.

Proof from theorem 9 of Cortes, Mansour, and Mohri, 2010.

d (P||Q) < does not always hold... Assume P and Q follow a

However, the variance of the importance weights is bounded.

Intuition: if = and P >> Q

Q provides some useful information about P

Likewise, if P = Q but >> , weights would be negligible.

Theorem 3: Let H be a hypothesis set such that

Proof outline (full proof in of Cortes, Mansour, Mohri, 2010):

|E[Lh (x)] E[L

Thus, we can show the following:

Where p = Pdim({Lh (x) : h H} is the pseudo-dimension of

We can generalize this analysis to an arbitrary function

Other functions u than w can be used to reweight cost of error

You might also like