You are on page 1of 30

learning bounds for importance weighting

Tamas Madarasz & Michael Rabadi


April 15, 2015
Introduction

Often, training distribution does not match testing distribution


Want to utilize information about test distribution
Correct bias or discrepancy between training and testing
distributions

1
Importance Weighting

Labeled training data from source distribution Q


Unlabeled test data from target distribution P
Weight the cost of errors on training instances.
Common denition of weight for point x: w(x) = P(x)/Q(x)

2
Importance Weighting

Reasonable method, but sometimes doesnt work


Can we give generalization bounds for this method?
When does DA work? When does it not work?
How should we weight the costs?

3
Overview

Preliminaries
Learning guarantee when loss is bounded
Learning guarantee when loss is unbounded, but second moment
is bounded
Algorithm

4
Preliminaries: Rnyi divergence

For 0, D (P||Q) between distributions P and Q

( )1
1 P(x)
D (P||Q) = log2 P(x)
1 x
Q(x)

[ ] 1
1
P (x)
D (P||Q)
d (P||Q) = 2 =
x
Q1 (x)

Metric of info lost when Q is used to approximate P


D (P||Q) = 0 iff P = Q

5
Preliminaries: Importance weights

Lemma 1:

E[w] = 1 E[w2 ] = d2 (P||Q) 2 = d2 (P||Q) 1

Proof:
( P(x) )2
2 2
EQ [w ] = w (x)Q(x) = Q(x) = d2 (P||Q)
Q(x)
xX xX

Lemma 2: For all > 0 and x X,


1
EQ [w2 (x)L2h (x)] d+1 (P||Q)R(h)1

6
Preliminaries: Importance weights

1 1
Hlders Inequality (Jin, Wilson, and Nobel, 2014): Let p + q = 1, then

( ) p1 ( ) q1
p q
|ax bx | |ax | |bx |
x x x

7
Preliminaries: Importance weights

Proof for Lemma 2: Let the loss be bounded by B = 1, then

[ ]2 [ ]
2 P(x) 1 P(x) 1
ExQ [w (x)L2h (x)] = Q(x) L2h (x) = P(x) P(x) L2h (x)
x
Q(x) x
Q(x)

[ [ ] ] 1 [ ] 1
P(x) 2
P(x) P(x)Lh (x)
1

x
Q(x) x

[ +1
] 1

= d+1 (P||Q) P(x)Lh (x)Lh 1
(x)
x

1 1 1
d+1 (P||Q)R(h)1 B1+ = d+1 (P||Q)R(h)1

8
Learning Guarantees: Bounded case

P(x)
supx w(x) = supx Q(x) = d (P||Q) = M. Let d (P||Q) < +. Fix h H.
Then, for any > 0, with probability at least 1 ,

2
w (h)| M log
|R(h) R
2m

M can be very large, so we naturally want a more favorable bound...

9
Overview

Preliminaries
Learning guarantee when loss is bounded
Learning guarantee when loss is unbounded, but second moment
is bounded
Algorithm

10
Learning Guarantees: Bounded case

Theorem 1: Fix h H. For any 1, for any > 0, with probability at


least 1 , the following bound holds:

1 1 1 R(h)2 ] log 1
R(h) R w (h) + 2M log + 2[d+1 (P||Q)R(h)
3m m

11
Learning Guarantees: Bounded case

Bernsteins inequality (Bernstein 1946):


( n ) ( )
1 n2
Pr xi exp
n 2 2 + 2M/3
i=1

when |xi | M.

12
Learning Guarantees: Bounded case

Proof of Theorem 1: Let Z be the random variable w(x)Lh (x) R(x).


Then |Z| M. Thus, by lemma 2, the variance of Z can be bounded in
terms of d+1 (P||Q):
1
2 (Z) = EQ [w2 (x)Lh (x)2 )] R(h)2 d+1 (P||Q)R(h)1 R(h)2

( )
m2 /2
Pr[R(h) R
w (h) > ] exp .
2 (Z) + M/3

13
Learning Guarantees: Bounded case

Thus, setting to match upper bound, then with probability at least


1

1
2M log M2 log2 1 2 2 (Z) log 1
R(h) Rw (h) +
+ +
3m 9m2 m


1
w (h) + 2M log + 2 2 (Z) log 1
=R
3m m

14
Learning Guarantees: Bounded case

Theorem 2: Let H be a nite hypothesis set. Then for any > 0, with
probability at least 1 , the following holds for the importance
weighting method:

w (h) + 2M(log|H| + log ) + 2d2 (P||Q)(log |H| = log
1 1
R(h) R
3m m

15
Learning Guarantees: Bounded case

Theorem 2 holds when = 1. Note that theorem 1 can be simplied


in the case of = 1:

1 1
R(h) R w (h) + 2M log + 2d2 (P||Q) log
3m m

Thus, theorem 2 follows by including the cardinality of H

16
Learning Guarantees: Bounded case

Proposition 2: Lower bound. Assume M < and 2 (w)/M2 1/m.


Assume there exists h0 H such that Lh0 (x) = 1 for all x. There exists
an absolute constant c, c = 2/412 , such that
[ ]
d2 (P||Q) 1
Pr sup |R(h) Rw (h)|
c>0
hH 4m

Proof from theorem 9 of Cortes, Mansour, and Mohri, 2010.

17
Overview

Preliminaries
Learning guarantee when loss is bounded
Learning guarantee when loss is unbounded, but second moment
is bounded
Algorithm

18
Learning Guarantees: Unbounded case

d (P||Q) < does not always hold... Assume P and Q follow a


Guassian distribution with P and Q with means and
[ ]
P(x) P Q2 (x )2 P2 (x )2
= exp
Q(x) Q 2P2 Q2
P(x)
Thus, even if P = Q and = , d (P||Q) = supx Q(x) = , thus
Theorem 1 is not informative.

19
Learning Guarantees: Unbounded case

However, the variance of the importance weights is bounded.


+ [ ]
Q 2 2 (x )2 P2 (x )2 2
dw (P||Q) = 2 exp Q Q dx
P 2 2P2

20
Learning Guarantees: Unbounded case

Intuition: if = and P >> Q

Q provides some useful information about P


But sample from Q only has few points far from
A few extreme sample points would have large weights

Likewise, if P = Q but >> , weights would be negligible.

21
Learning Guarantees: Unbounded case

Theorem 3: Let H be a hypothesis set such that


Pdim({Lh (x) : H H}) = p < . Assume that d2 (P||Q) < + and
w(x) = 0 for all x. Then for > 0, with probability at least 1 , the
following holds:

5/4
3 p log
2me
p + log
4
R(h) Rw (h) + 2
8
d2 (P||Q)
m

22
Learning Guarantees: Unbounded case

Proof outline (full proof in of Cortes, Mansour, Mohri, 2010):


[ ]
h]
Pr suphH E[Lh ] E[L
2]
> 2 + log
1
E[L
[ h ]


Pr suphH,t Pr[Lh>t]Pr[L h >t]
>
h >t]
Pr[L
[ ] ( )
2
Pr suphH
R(h)R(h)
> 2 + log 4H (2m) exp m4
1
R(h)
[ ]
h (x)]
Pr sup h H 2
E[Lh (x)]E[L
> 2 + log 1
E[Lh (x)]
( )
m2
p 4
4 exp p log 2em
[ ] ( )
h (x)] m8/3
Pr suphH E[Lh (x)]E[L
2
> 4 exp p log 2em
p 45/3
E[Lh (x)]

|E[Lh (x)] E[L


(x)]|

h 3 p log 2me +log 8
5/4 2 2
2 max{ E[Lh (x)], E[Lh (x)]} 8 p
m 23
Learning Guarantees: Unbounded case

Thus, we can show the following:


] ( )
[ R(h) R w (h) 2em m8/3
Pr sup > 4 exp p log 5/3 .
hH d2 (P||Q) p 4

Where p = Pdim({Lh (x) : h H} is the pseudo-dimension of


H = {w(x)Lh (x) : h H}. Note, any set shattered by H is shattered
by H , since there exists a subset B of a set A that is shattered by H ,
such that H shatters A with witnesses si = ri /w(xi ).

24
Overview

Preliminaries
Learning guarantee when loss is bounded
Learning guarantee when loss is unbounded, but second moment
is bounded
Algorithm

25
Alternative algorithms

We can generalize this analysis to an arbitrary function



u (h) = 1 m u(xi )Lh (xi ) and let Q
u : X 7 R, u > 0. Let R be the
m i=1
empirical distribution: Theorem 4: Let H be a hypothesis set such
that Pdim({Lh (x) : h H}) = p < . Assume that
0 < EQ [u2 (x)] < + and u(x) = 0 for all x. Then for any > 0 with
probability at least 1 ,
[ ]
|R(h) R
u (h) |EQ [w(x) u(x)]Lh (x) |


( ) p log 2me 4
p + log
3
+25/4 max
8
EQ [u2 (x)L2h (x)], EQ [u2 (x)L2h (x)]
m

26
Alternative algorithms

Other functions u than w can be used to reweight cost of error


Minimize upper bound
( )
( )
max EQ [u ], EQ [u ] EQ [u2 ] 1 + O(1/ m) ,
2 2

[ ]
minuU E |w(x u(x)| + EQ [u2 ]
Trade-off between bias and variance minimization.

27
Alternative algorithms

28
Alternative algorithms

29