Yuchen Zhang
yuczhang@eecs.berkeley.edu
John Duchi
jduchi@eecs.berkeley.edu
Martin Wainwright
wainwrig@stat.berkeley.edu
Abstract
We study a decomposition-based scalable approach to performing kernel ridge regression.
The method is simple to describe: it randomly partitions a dataset of size N into m subsets
of equal size, computes an independent kernel ridge regression estimator for each subset,
then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel
ridge regression on all N samples. Our main theorem establishes that despite the computational speed-up, statistical optimality is retained: if m is not too large, the partition-based
estimate achieves optimal rates of convergence for the full sample size N . As concrete examples, our theory guarantees that m may grow polynomially in N for Sobolev spaces, and
nearly linearly for finite-rank kernels and Gaussian kernels. We conclude with simulations complementing our theoretical results and exhibiting the computational and statistical
benefits of our approach.
1. Introduction
In non-parametric regression, the statistician receives $N$ samples of the form $\{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i \in \mathcal{X}$ is a covariate and $y_i \in \mathbb{R}$ is a real-valued response, and the samples are drawn i.i.d. from some unknown joint distribution $\mathbb{P}$ over $\mathcal{X} \times \mathbb{R}$. The goal is to estimate a function $\hat{f} : \mathcal{X} \to \mathbb{R}$ that can be used to predict future responses based on observing only the covariates. Frequently, the quality of an estimate $\hat{f}$ is measured in terms of the mean-squared prediction error $\mathbb{E}[(\hat{f}(X) - Y)^2]$, in which case the conditional expectation $f^*(x) = \mathbb{E}[Y \mid X = x]$ is optimal. The problem of non-parametric regression
is a classical one, and researchers have studied a wide range of estimators (see, e.g., the
books by Györfi et al. (2002), Wasserman (2006), and van de Geer (2000)). One class of
methods, known as regularized M -estimators (van de Geer, 2000), are based on minimizing the sum of a data-dependent loss function with a regularization term. The focus of
this paper is a popular M -estimator that combines the least-squares loss with a squared
Hilbert norm penalty for regularization. When working in a reproducing kernel Hilbert
space (RKHS), the resulting method is known as kernel ridge regression, and is widely used
in practice (Hastie et al., 2001; Shawe-Taylor and Cristianini, 2004). Past work has established bounds on the estimation error for RKHS-based methods (e.g., Koltchinskii, 2006;
Mendelson, 2002; van de Geer, 2000; Zhang, 2005), which have been refined and extended
in more recent work (e.g., Steinwart et al., 2009).
Although the statistical aspects of kernel ridge regression (KRR) are well-understood,
the computation of the KRR estimate can be challenging for large datasets. In a standard
implementation (Saunders et al., 1998), the kernel matrix must be inverted, which requires $O(N^3)$ time and $O(N^2)$ memory. Such scalings are prohibitive
when the sample size N is large. As a consequence, approximations have been designed to
avoid the expense of finding an exact minimizer. One family of approaches is based on low-rank approximation of the kernel matrix; examples include kernel PCA (Schölkopf et al., 1998), the incomplete Cholesky decomposition (Fine and Scheinberg, 2002), and Nyström sampling (Williams and Seeger, 2001). These methods reduce the time complexity to $O(dN^2)$ or $O(d^2 N)$, where $d \ll N$ is the preserved rank. To our knowledge, however,
there are no results showing that such low-rank versions of KRR still achieve minimax-optimal rates in estimation error. A second line of research has considered early-stopping of iterative optimization algorithms for KRR, including gradient descent (Yao et al., 2007; Raskutti et al., 2011) and conjugate gradient methods (Blanchard and Krämer, 2010), where early-stopping provides regularization against over-fitting and improves run-time. If the algorithm stops after $t$ iterations, the aggregate time complexity is $O(tN^2)$.
In this work, we study a different decomposition-based approach. The algorithm is appealing in its simplicity: we partition the dataset of size $N$ randomly into $m$ equal-sized subsets, and we compute the kernel ridge regression estimate $\hat{f}_i$ for each of the $i = 1, \ldots, m$ subsets independently, with a careful choice of the regularization parameter. The estimates are then averaged via $\bar{f} = (1/m)\sum_{i=1}^{m}\hat{f}_i$. Our main theoretical result gives conditions under which the average $\bar{f}$ achieves the minimax rate of convergence over the underlying
Hilbert space. Even using naive implementations of KRR, this decomposition gives time and memory complexity scaling as $O(N^3/m^2)$ and $O(N^2/m^2)$, respectively. Moreover, our approach dovetails naturally with parallel and distributed computation: we are guaranteed a superlinear speedup with $m$ parallel processors (though we must still communicate the
function estimates from each processor). Divide-and-conquer approaches have been studied by several authors, including McDonald et al. (2010) for perceptron-based algorithms,
Kleiner et al. (2012) in distributed versions of the bootstrap, and Zhang et al. (2012) for
parametric smooth convex optimization objectives arising out of statistical estimation problems. This paper demonstrates the potential benefits of divide-and-conquer approaches for
nonparametric and infinite-dimensional regression problems.
One difficulty in solving each of the sub-problems independently is how to choose the
regularization parameter. Due to the infinite-dimensional nature of non-parametric problems, the choice of regularization parameter must be made with care (e.g., Hastie et al.,
2001). An interesting consequence of our theoretical analysis is in demonstrating that, even
though each partitioned sub-problem is based only on the fraction N/m of samples, it is
nonetheless essential to regularize the partitioned sub-problems as though they had all N
samples. Consequently, from a local point of view, each sub-problem is under-regularized.
This under-regularization allows the bias of each local estimate to be very small, but it
causes a detrimental blow-up in the variance. However, as we prove, the $m$-fold averaging underlying the method reduces variance enough that the resulting estimator $\bar{f}$ still attains the optimal convergence rate.
By Mercer's theorem, the kernel admits an expansion in terms of its eigenvalues $\{\mu_j\}_{j=1}^{\infty}$ and eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$, orthonormal in $L^2(\mathbb{P})$:

$$K(x, x') = \sum_{j=1}^{\infty} \mu_j \phi_j(x)\phi_j(x'),$$

and any $f = \sum_{j=1}^{\infty}\theta_j\phi_j \in \mathcal{H}$ satisfies

$$\|f\|_2^2 = \int_{\mathcal{X}} f^2(x)\, d\mathbb{P}(x) = \sum_{j=1}^{\infty} \theta_j^2 \quad \text{and} \quad \|f\|_{\mathcal{H}}^2 = \langle f, f\rangle_{\mathcal{H}} = \sum_{j=1}^{\infty} \frac{\theta_j^2}{\mu_j}.$$

Consequently, we see that the RKHS can be viewed as an elliptical subset of the sequence space $\ell^2(\mathbb{N})$ defined by the non-negative eigenvalues $\{\mu_j\}_{j=1}^{\infty}$.
The standard kernel ridge regression estimate based on all $N$ samples is

$$\hat{f} := \operatorname*{argmin}_{f\in\mathcal{H}} \left\{\frac{1}{N}\sum_{i=1}^{N}(f(x_i)-y_i)^2 + \lambda\|f\|_{\mathcal{H}}^2\right\}, \quad (2)$$

while Fast-KRR computes the analogous estimate on each subset $S_i$ of the random partition,

$$\hat{f}_i := \operatorname*{argmin}_{f\in\mathcal{H}} \left\{\frac{1}{|S_i|}\sum_{(x,y)\in S_i}(f(x)-y)^2 + \lambda\|f\|_{\mathcal{H}}^2\right\}, \quad (3)$$

and then averages: $\bar{f} = \frac{1}{m}\sum_{i=1}^{m}\hat{f}_i$.
This description actually provides a family of estimators, one for each choice of the regularization parameter $\lambda > 0$. Our main result applies to any choice of $\lambda$, while our corollaries for specific kernel classes optimize $\lambda$ as a function of the kernel.
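To make the procedure concrete, here is a minimal sketch of the estimator in Python/NumPy (an illustration only, not the authors' implementation; the Gaussian kernel and all function names are our own choices for the example). Note that every sub-problem receives the same regularization parameter, chosen as though for the full sample size $N$, with the ridge in each local linear system scaled by the local sample size as in standard KRR:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def fast_krr_fit(X, y, m, lam, rng=None):
    """Randomly split (X, y) into m subsets and solve one kernel ridge
    regression per subset: alpha_i = (K_i + lam * n_i * I)^{-1} y_i."""
    rng = np.random.default_rng(rng)
    models = []
    for part in np.array_split(rng.permutation(len(X)), m):
        Xi, yi = X[part], y[part]
        Ki = rbf_kernel(Xi, Xi)
        alpha = np.linalg.solve(Ki + lam * len(part) * np.eye(len(part)), yi)
        models.append((Xi, alpha))
    return models

def fast_krr_predict(models, Xnew):
    """Average the m local predictors: f_bar(x) = (1/m) sum_i f_hat_i(x)."""
    return np.mean([rbf_kernel(Xnew, Xi) @ a for Xi, a in models], axis=0)
```

With $m = 1$ this reduces to standard kernel ridge regression; each sub-problem costs $O((N/m)^3)$ time, giving the $O(N^3/m^2)$ total mentioned above.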
We now describe our main assumptions. Our first assumption, for which we have two variants, deals with the tail behavior of the basis functions $\{\phi_j\}_{j=1}^{\infty}$.
Assumption A For some $k \geq 2$, there is a constant $\rho < \infty$ such that $\mathbb{E}[\phi_j(X)^{2k}] \leq \rho^{2k}$ for all $j = 1, 2, \ldots$.

In certain cases, we show that sharper error guarantees can be obtained by enforcing a stronger condition of uniform boundedness:

Assumption A′ There is a constant $\rho < \infty$ such that $\sup_{x\in\mathcal{X}}|\phi_j(x)| \leq \rho$ for all $j = 1, 2, \ldots$.
Recalling that $f^*(x) := \mathbb{E}[Y \mid X = x]$, our second assumption involves the regression function $f^*$ and the deviations of the zero-mean noise variables $Y - f^*(x)$.¹

Assumption B The function $f^* \in \mathcal{H}$, and for $x \in \mathcal{X}$, we have $\mathbb{E}[(Y - f^*(x))^2 \mid x] \leq \sigma^2$.
3.2. Statement of main results
With these assumptions in place, we are now ready for the statement of our main result. Our result gives a bound on the mean-squared estimation error $\mathbb{E}[\|\bar{f} - f^*\|_2^2]$ associated with the averaged estimate $\bar{f}$ based on assigning $n = N/m$ samples to each of $m$ machines. The theorem statement involves the following three kernel-related quantities:
$$\operatorname{tr}(K) := \sum_{j=1}^{\infty}\mu_j, \quad \gamma(\lambda) := \sum_{j=1}^{\infty}\frac{1}{1+\lambda/\mu_j}, \quad \text{and} \quad \beta_d := \sum_{j=d+1}^{\infty}\mu_j. \quad (4)$$
The first quantity is the kernel trace, which serves as a crude estimate of the size of the kernel operator, and is assumed to be finite. The second quantity $\gamma(\lambda)$, familiar from previous work on kernel regression (Zhang, 2005), is known as the effective dimensionality of the kernel $K$ with respect to $L^2(\mathbb{P})$. Finally, the quantity $\beta_d$ is parameterized by a positive integer $d$ that we may choose in applying the bounds, and it describes the tail decay of the eigenvalues of $K$. For $d = 0$, note that $\beta_0$ reduces to the ordinary trace. Theorem 1 also involves
one further quantity that depends on the number of moments $k$ in Assumption A, namely

$$b(n, d, k) := \max\left\{\sqrt{\max\{k, \log(2d)\}},\ \frac{\max\{k, \log(2d)\}}{n^{1/2 - 1/k}}\right\}. \quad (5)$$

Here the parameter $d \in \mathbb{N}$ is a quantity that may be optimized to obtain the sharpest possible upper bound. (The algorithm's execution is independent of $d$.)
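To make these quantities concrete, the following sketch evaluates (truncated versions of) $\operatorname{tr}(K)$, $\gamma(\lambda)$, $\beta_d$, and $b(n, d, k)$ for the polynomial eigen-decay $\mu_j = j^{-2}$ that arises for the first-order Sobolev kernel discussed later; the helper names and the specific eigenvalue sequence are our own choices, not part of the paper:

```python
import math

def kernel_quantities(mu, lam, d):
    """Trace, effective dimension gamma(lam), and tail sum beta_d for a
    (finite prefix of a) kernel eigenvalue sequence mu, per display (4)."""
    trace = sum(mu)
    gamma = sum(1.0 / (1.0 + lam / m) for m in mu)
    beta_d = sum(mu[d:])
    return trace, gamma, beta_d

def b_quantity(n, d, k):
    """b(n, d, k) = max{sqrt(max{k, log 2d}), max{k, log 2d} / n^(1/2 - 1/k)}."""
    t = max(k, math.log(2 * d))
    return max(math.sqrt(t), t / n ** (0.5 - 1.0 / k))

mu = [j ** -2.0 for j in range(1, 10001)]  # polynomial decay with nu = 1
N = 8192
lam = N ** (-2.0 / 3.0)                    # the choice lam ~ N^(-2nu/(2nu+1))
trace, gamma, beta = kernel_quantities(mu, lam, d=100)
```

Here $\gamma(\lambda)$ comes out on the order of $N^{1/3}$, reflecting the effective dimension of this kernel class at the chosen regularization level.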
Theorem 1 With $f^* \in \mathcal{H}$ and under Assumptions A and B, the mean-squared error of the averaged estimate $\bar{f}$ is upper bounded as

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] \leq \left(4 + \frac{6}{m}\right)\lambda\|f^*\|_{\mathcal{H}}^2 + \frac{12\sigma^2\gamma(\lambda)}{N} + \inf_{d\in\mathbb{N}}\left\{T_1(d) + T_2(d) + T_3(d)\right\}, \quad (6)$$
1. We can extend our results to the more general setting in which $f^* \notin \mathcal{H}$. In this paper, we limit ourselves to the case $f^* \in \mathcal{H}$ to facilitate comparisons with minimax rates and for conciseness in presentation. The assumption $f^* \in \mathcal{H}$ is, in any case, fairly standard (Györfi et al., 2002; Wasserman, 2006).
where

$$T_1(d) = \frac{2\rho^4\|f^*\|_{\mathcal{H}}^2\operatorname{tr}(K)\,\beta_d}{\lambda}, \qquad T_2(d) = \frac{2\sigma^2/\lambda + 4\|f^*\|_{\mathcal{H}}^2}{m}\left(\mu_{d+1} + \frac{3\rho^4\operatorname{tr}(K)\,\beta_d}{\lambda}\right),$$

and

$$T_3(d) = C\,b(n,d,k)\left(\frac{\rho^2\gamma(\lambda)+1}{\sqrt{n}}\right)^{k}\left(\|f^*\|_2^2\left(1+\frac{1}{m}\right) + \frac{2\sigma^2/\lambda + 4\|f^*\|_{\mathcal{H}}^2}{m}\right). \quad (7)$$
Up to constants and the higher-order terms $T_1$, $T_2$, $T_3$, the bound (6) takes the form

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] \lesssim \underbrace{\lambda\|f^*\|_{\mathcal{H}}^2}_{\text{Bias}} + \underbrace{\frac{\sigma^2\gamma(\lambda)}{N}}_{\text{Variance}}.$$

This inequality reveals the usual bias-variance trade-off in non-parametric regression; choosing a smaller value of $\lambda > 0$ reduces the first squared bias term, but increases the second variance term. Consequently, the setting of $\lambda$ that minimizes the sum of these two terms is defined by the relationship

$$\lambda\|f^*\|_{\mathcal{H}}^2 \asymp \frac{\sigma^2\gamma(\lambda)}{N}. \quad (8)$$
This type of fixed point equation is familiar from work on oracle inequalities and local complexity measures in empirical process theory (Bartlett et al., 2005; Koltchinskii, 2006; van de Geer, 2000; Zhang, 2005), and when $\lambda$ is chosen so that the fixed point equation (8) holds, this (typically) yields minimax-optimal convergence rates (Bartlett et al., 2005; Koltchinskii, 2006; Zhang, 2005; Caponnetto and De Vito, 2007). In Section 3.3, we provide detailed examples in which the choice of $\lambda$ specified by equation (8), followed by application of Theorem 1, yields minimax-optimal prediction error (for the Fast-KRR algorithm) for many kernel classes.
3.3. Some consequences

The corollaries of this section follow the program suggested by the remarks following Theorem 1: we first choose the regularization parameter $\lambda$ to balance the bias and variance terms, and then show, by comparison to known minimax lower bounds, that the resulting upper bound is optimal. Finally, we derive an upper bound on the number of subsampled data sets $m$ for which the minimax optimal convergence rate can still be achieved.
Our first corollary applies to problems for which the kernel has finite rank $r$, meaning that its eigenvalues satisfy $\mu_j = 0$ for all $j > r$. Examples of such finite-rank kernels include the linear kernel $K(x, x') = \langle x, x'\rangle_{\mathbb{R}^d}$, which has rank at most $r = d$; and the kernel $K(x, x') = (1 + x x')^m$ generating polynomials of degree $m$, which has rank at most $r = m + 1$.
Corollary 2 For a kernel with rank $r$, consider the output of the Fast-KRR algorithm with $\lambda = r/N$. Suppose that Assumption B and Assumption A (or A′) hold, and that the number of processors $m$ satisfies the bound

$$m \leq c\,\frac{N^{\frac{k-4}{k-2}}}{r^2\,\rho^{\frac{4k}{k-2}}\log^{\frac{k}{k-2}}r} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \leq c\,\frac{N}{r^2\rho^4\log N} \quad \text{(Assumption A′)},$$

where $c$ is a universal constant. Then the mean-squared error is bounded as

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] = O\left(\frac{r}{N}\right). \quad (9)$$
For finite-rank kernels, the rate (9) is minimax-optimal: if $B_{\mathcal{H}}(1)$ denotes the ball of radius 1 in $\mathcal{H}$, there is a universal constant $c' > 0$ such that $\inf_{\hat{f}}\sup_{f^*\in B_{\mathcal{H}}(1)}\mathbb{E}[\|\hat{f} - f^*\|_2^2] \geq c'\,r/N$. This lower bound follows from Theorem 2(a) of Raskutti et al. (2012) with $s = d = 1$.
Our next corollary applies to kernel operators with eigenvalues that obey a bound of the form

$$\mu_j \leq C j^{-2\nu} \quad \text{for all } j = 1, 2, \ldots, \quad (10)$$

where $C$ is a universal constant, and $\nu > 1/2$ parameterizes the decay rate. Kernels with polynomially-decaying eigenvalues include those that underlie the Sobolev spaces with different smoothness orders (e.g. Birman and Solomjak, 1967; Gu, 2002). As a concrete example, the first-order Sobolev kernel $K(x, x') = 1 + \min\{x, x'\}$ generates an RKHS of Lipschitz functions with smoothness $\nu = 1$. Other higher-order Sobolev kernels also exhibit polynomial eigen-decay with larger values of the parameter $\nu$.
Corollary 3 For any kernel with $\nu$-polynomial eigen-decay (10), consider the output of the Fast-KRR algorithm with $\lambda = (1/N)^{\frac{2\nu}{2\nu+1}}$. Suppose that Assumption B and Assumption A (or A′) hold, and that the number of processors satisfies the bound

$$m \leq c\,\frac{N^{\frac{2\nu(k-4)-k}{(2\nu+1)(k-2)}}}{\rho^{\frac{4k}{k-2}}\log^{\frac{k}{k-2}}N} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \leq c\,\frac{N^{\frac{2\nu-1}{2\nu+1}}}{\rho^4\log N} \quad \text{(Assumption A′)},$$

where $c$ is a constant depending only on $\nu$. Then the mean-squared error is bounded as

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] = O\left(N^{-\frac{2\nu}{2\nu+1}}\right). \quad (11)$$
Finally, we consider kernels with exponentially-decaying eigenvalues, meaning that

$$\mu_j \leq c_1\exp(-c_2 j^2) \quad \text{for all } j = 1, 2, \ldots, \quad (12)$$

for strictly positive constants $(c_1, c_2)$. Such classes include the RKHS generated by the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|_2^2)$.
Corollary 4 For a kernel with exponential eigen-decay (12), consider the output of the Fast-KRR algorithm with $\lambda = 1/N$. Suppose that Assumption B and Assumption A (or A′) hold, and that the number of processors satisfies the bound

$$m \leq c\,\frac{N^{\frac{k-4}{k-2}}}{\rho^{\frac{4k}{k-2}}\log^{\frac{2k-1}{k-2}}N} \quad \text{(Assumption A)} \qquad \text{or} \qquad m \leq c\,\frac{N}{\rho^4\log^2 N} \quad \text{(Assumption A′)},$$

where $c$ is a universal constant. Then the mean-squared error is bounded as

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] = O\left(\frac{\sqrt{\log N}}{N}\right). \quad (13)$$
Summary: Each corollary gives a critical threshold for the number $m$ of data partitions: as long as $m$ is below this threshold, the decomposition-based Fast-KRR algorithm gives the optimal rate of convergence. It is interesting to note that the number of splits may be quite large: each threshold grows asymptotically with $N$ whenever the basis functions have sufficiently many moments (viz. Assumption A). Moreover, the Fast-KRR method can attain these optimal convergence rates while using substantially less computation than standard kernel ridge regression methods.
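For intuition about the scales involved, the following sketch tabulates the three partition thresholds under Assumption A′ as stated in the corollaries above, with all universal constants set to one. This is a rough illustration only, not a statement from the paper; the function and its defaults are hypothetical:

```python
import math

def max_partitions(N, kernel, r=1, nu=1.0, rho=1.0):
    """Thresholds on m (constants dropped) under Assumption A'."""
    if kernel == "finite_rank":   # Corollary 2: m <~ N / (r^2 rho^4 log N)
        return N / (r ** 2 * rho ** 4 * math.log(N))
    if kernel == "polynomial":    # Corollary 3: m <~ N^((2nu-1)/(2nu+1)) / (rho^4 log N)
        return N ** ((2 * nu - 1) / (2 * nu + 1)) / (rho ** 4 * math.log(N))
    if kernel == "gaussian":      # Corollary 4: m <~ N / (rho^4 log^2 N)
        return N / (rho ** 4 * math.log(N) ** 2)
    raise ValueError(kernel)

for N in (2 ** 12, 2 ** 17):
    row = [round(max_partitions(N, k), 1)
           for k in ("finite_rank", "polynomial", "gaussian")]
    print(N, row)
```

All three thresholds grow with $N$; the polynomial-decay (Sobolev) threshold is the smallest, reflecting the slower eigen-decay of that class.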
where we used the fact that $\mathbb{E}[\hat{f}_i] = \mathbb{E}[\bar{f}]$ for each $i \in [m]$. Using this unbiasedness once more, we bound the variance of the terms $\hat{f}_i - \mathbb{E}[\bar{f}]$ to see that

$$\mathbb{E}\left[\|\bar{f} - f^*\|_2^2\right] \leq \frac{1}{m}\mathbb{E}\left[\|\hat{f}_1 - \mathbb{E}[\hat{f}_1]\|_2^2\right] + \|\mathbb{E}[\hat{f}_1] - f^*\|_2^2 \leq \frac{1}{m}\mathbb{E}\left[\|\hat{f}_1 - f^*\|_2^2\right] + \|\mathbb{E}[\hat{f}_1] - f^*\|_2^2, \quad (14)$$

where we have used the fact that $\mathbb{E}[\hat{f}_1]$ minimizes $\mathbb{E}[\|\hat{f}_1 - f\|_2^2]$ over $f \in \mathcal{H}$.
The error bound (14) suggests our strategy: we bound $\mathbb{E}[\|\hat{f}_1 - f^*\|_2^2]$ and $\|\mathbb{E}[\hat{f}_1] - f^*\|_2^2$ respectively. Based on equation (3), the estimate $\hat{f}_1$ is obtained from a standard kernel ridge regression with sample size $n = N/m$ and ridge parameter $\lambda$. Accordingly, the following two auxiliary results provide bounds on these two terms, where the reader should recall the definitions of $b(n, d, k)$ and $\beta_d$ from equations (5) and (4). In each lemma, $C$ represents a universal (numerical) constant.
Lemma 5 (Bias bound) Under Assumptions A and B, for $d = 1, 2, \ldots$, we have

$$\|\mathbb{E}[\hat{f}] - f^*\|_2^2 \leq 4\lambda\|f^*\|_{\mathcal{H}}^2 + \frac{2\rho^4\|f^*\|_{\mathcal{H}}^2\operatorname{tr}(K)\,\beta_d}{\lambda} + C\,b(n, d, k)\left(\frac{\rho^2\gamma(\lambda)+1}{\sqrt{n}}\right)^{k}\|f^*\|_2^2. \quad (15)$$
Lemma 6 (Variance bound) Under Assumptions A and B, for $d = 1, 2, \ldots$, we have

$$\mathbb{E}[\|\hat{f} - f^*\|_2^2] \leq 6\lambda\|f^*\|_{\mathcal{H}}^2 + \frac{12\sigma^2\gamma(\lambda)}{n} + \left(\frac{2\sigma^2}{\lambda} + 4\|f^*\|_{\mathcal{H}}^2\right)\left(\mu_{d+1} + \frac{3\rho^4\operatorname{tr}(K)\,\beta_d}{\lambda}\right) + C\,b(n, d, k)\left(\frac{\rho^2\gamma(\lambda)+1}{\sqrt{n}}\right)^{k}\left(\|f^*\|_2^2 + \frac{2\sigma^2}{\lambda} + 4\|f^*\|_{\mathcal{H}}^2\right). \quad (16)$$
The proofs of these lemmas, contained in Appendices A and B respectively, constitute one
main technical contribution of this paper.
Given these two lemmas, the remainder of the theorem proof is straightforward. Combining the inequality (14) with Lemmas 5 and 6 yields the claim of Theorem 1.
4.2. Proof of Corollary 2
We first present a general inequality bounding the size of $m$ for which optimal convergence rates are possible. We assume that $d$ is chosen large enough that $c\log(2d) \leq k$ for some constant $c$ in Theorem 1, and that the regularization parameter $\lambda$ has been chosen. In this case, inspection of Theorem 1 shows that if $m$ is small enough that

$$\sqrt{\log d}\left(\frac{\rho^2(\gamma(\lambda)+1)}{\sqrt{N/m}}\right)^{k}\frac{1}{m} \leq \frac{\gamma(\lambda)+1}{N},$$

then the term $T_3(d)$ is of the same order as the target convergence rate $(\gamma(\lambda)+1)/N$. Thus, solving the expression above for $m$, we find that it suffices to take

$$m \leq c\left(\frac{N^{k-4}}{\rho^{4k}\,(\gamma(\lambda)+1)^{2(k-1)}\log^{k}d}\right)^{\frac{1}{k-2}}. \quad (17)$$

In the case of Corollary 2, the kernel has rank $r$, so that $\gamma(\lambda) + 1 \leq r + 1$ for any $\lambda > 0$ and $\log d = \log r$. Substituting into the bound (17) and absorbing constants (using $c\log(2d) \leq k$ so that $(r+1)^{2/(k-2)}$ remains bounded), we obtain the condition

$$m \leq c\,\frac{N^{\frac{k-4}{k-2}}}{\rho^{\frac{4k}{k-2}}\,r^2\log^{\frac{k}{k-2}}r}$$

given in the corollary statement. If we have sufficiently many moments that $k \geq \log N$, and $N \geq r$ (for example, if the basis functions $\phi_j$ have a uniform bound $\rho$), then we may take $k = \log N$, which implies that $N^{\frac{k-4}{k-2}} = \Omega(N)$, and we may replace $\log d = \log r$ with $\log N$ (we assume $N \geq r$) by recalling Theorem 1. Then so long as

$$m \leq c\,\frac{N}{r^2\rho^4\log N}$$

for some constant $c > 0$, we obtain an identical result.
4.3. Proof of Corollary 3
We follow the program outlined in our remarks following Theorem 1. We must first choose $\lambda$ so that $\lambda \asymp \gamma(\lambda)/N$. To that end, we note that setting $\lambda = N^{-\frac{2\nu}{2\nu+1}}$ gives

$$\gamma(\lambda) = \sum_{j=1}^{\infty}\frac{1}{1 + j^{2\nu}N^{-\frac{2\nu}{2\nu+1}}} \leq N^{\frac{1}{2\nu+1}} + \sum_{j > N^{\frac{1}{2\nu+1}}}\frac{1}{1 + j^{2\nu}N^{-\frac{2\nu}{2\nu+1}}} \leq N^{\frac{1}{2\nu+1}} + N^{\frac{2\nu}{2\nu+1}}\int_{N^{\frac{1}{2\nu+1}}}^{\infty}u^{-2\nu}\,du = N^{\frac{1}{2\nu+1}} + \frac{1}{2\nu-1}\,N^{\frac{1}{2\nu+1}} \lesssim N^{\frac{1}{2\nu+1}}.$$

Substituting $\gamma(\lambda)+1 \lesssim N^{\frac{1}{2\nu+1}}$ and $\log d \lesssim \log N$ into the general bound (17) from the proof of Corollary 2 yields the condition

$$m \leq c\,\frac{N^{\frac{2\nu(k-4)-k}{(2\nu+1)(k-2)}}}{\rho^{\frac{4k}{k-2}}\log^{\frac{k}{k-2}}N}.$$

Whenever this holds, we have convergence rate $\lambda = N^{-\frac{2\nu}{2\nu+1}}$. Now, let Assumption A′ hold, and take $k = \log N$. Then the above bound becomes (up to a multiplicative constant factor) $N^{\frac{2\nu-1}{2\nu+1}}/(\rho^4\log N)$, as claimed.
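The scaling $\gamma(\lambda) \lesssim N^{1/(2\nu+1)}$ derived above is easy to check numerically; a small sketch (with $\nu = 1$ and the infinite sum truncated; the helper name is our own, illustrative only):

```python
def gamma_poly(lam, nu=1.0, terms=200000):
    """Effective dimension for mu_j = j^(-2 nu): sum_j 1/(1 + lam * j^(2 nu))."""
    return sum(1.0 / (1.0 + lam * j ** (2 * nu)) for j in range(1, terms + 1))

for N in (10 ** 3, 10 ** 4, 10 ** 5):
    lam = N ** (-2.0 / 3.0)          # lam = N^(-2nu/(2nu+1)) with nu = 1
    ratio = gamma_poly(lam) / N ** (1.0 / 3.0)
    print(N, round(ratio, 3))        # the ratio stays bounded as N grows
```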
Figure 1: The squared $L^2(\mathbb{P})$-norm between the averaged estimate $\bar{f}$ and the optimal solution $f^*$, plotted against the total number of samples $N \in \{256, 512, 1024, 2048, 4096, 8192\}$ for $m \in \{1, 4, 16, 64\}$. (a) These plots correspond to the output of the Fast-KRR algorithm: each sub-problem is under-regularized by using $\lambda \approx N^{-2/3}$. (b) Analogous plots when each sub-problem is not under-regularized, that is, with $\lambda \approx n^{-2/3}$ chosen as usual.
5. Simulation studies
In this section, we explore the empirical performance of our subsample-and-average methods
for a non-parametric regression problem on simulated datasets. For all experiments in
this section, we simulate data from the regression model $y = f^*(x) + \varepsilon$ for $x \in [0, 1]$, where $f^*(x) := \min\{x, 1-x\}$ is 1-Lipschitz, the noise variables $\varepsilon \sim N(0, \sigma^2)$ are normally distributed with variance $\sigma^2 = 1/5$, and the samples $x_i \sim \mathrm{Uni}[0, 1]$. The Sobolev space
of Lipschitz functions on $[0, 1]$ has reproducing kernel $K(x, x') = 1 + \min\{x, x'\}$ and norm $\|f\|_{\mathcal{H}}^2 = f^2(0) + \int_0^1 (f'(z))^2\,dz$. By construction, the function $f^*(x) = \min\{x, 1-x\}$ satisfies $\|f^*\|_{\mathcal{H}} = 1$. The kernel ridge regression estimator $\hat{f}$ takes the form

$$\hat{f} = \sum_{i=1}^{N}\alpha_i K(x_i, \cdot), \quad \text{where} \quad \alpha = (K + \lambda N I)^{-1}y, \quad (18)$$

and $K$ is the $N \times N$ Gram matrix and $I$ is the $N \times N$ identity matrix. Since the first-order Sobolev kernel has eigenvalues (Gu, 2002) that scale as $\mu_j \asymp (1/j)^2$, the minimax convergence rate in terms of squared $L^2(\mathbb{P})$-error is $N^{-2/3}$ (e.g. Tsybakov, 2009; Stone, 1982; Caponnetto and De Vito, 2007).
By Corollary 3 with $\nu = 1$, this optimal rate of convergence can be achieved by Fast-KRR with regularization parameter $\lambda \approx N^{-2/3}$ as long as the number of partitions $m$ satisfies $m \lesssim N^{1/3}$. In each of our experiments, we begin with a dataset of size $N = mn$, which we partition uniformly at random into $m$ disjoint subsets. We compute the local estimator $\hat{f}_i$ for each of the $m$ subsets using $n$ samples via (18), where the Gram matrix is constructed using the $i$th batch of samples (and $n$ replaces $N$). We then compute $\bar{f} = (1/m)\sum_{i=1}^{m}\hat{f}_i$. Our experiments compare the error of $\bar{f}$ as a function of sample size $N$, the number of partitions $m$, and the regularization $\lambda$.
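The simulation just described can be sketched as follows (a simplified illustration in Python rather than the authors' Matlab code; the discretized $L^2$-error and all helper names are our own choices):

```python
import numpy as np

def sobolev_kernel(a, b):
    """First-order Sobolev kernel K(x, x') = 1 + min{x, x'} on [0, 1]."""
    return 1.0 + np.minimum(a[:, None], b[None, :])

def fast_krr_error(N, m, rng):
    """One run: fit Fast-KRR on y = min(x, 1-x) + eps with sigma^2 = 1/5
    and return a grid approximation of ||f_bar - f*||_2^2."""
    x = rng.uniform(0, 1, N)
    y = np.minimum(x, 1 - x) + rng.normal(0, np.sqrt(0.2), N)
    lam = N ** (-2 / 3)                  # under-regularized local problems
    grid = np.linspace(0, 1, 500)
    preds = []
    for p in np.array_split(rng.permutation(N), m):
        n = len(p)
        K = sobolev_kernel(x[p], x[p])
        alpha = np.linalg.solve(K + lam * n * np.eye(n), y[p])  # equation (18)
        preds.append(sobolev_kernel(grid, x[p]) @ alpha)
    fbar = np.mean(preds, axis=0)
    return np.mean((fbar - np.minimum(grid, 1 - grid)) ** 2)

err = fast_krr_error(N=1024, m=4, rng=np.random.default_rng(0))
```

Looping this over $N$ and $m$ (and averaging over repetitions) reproduces the qualitative shape of the curves in Figure 1(a).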
In Figure 1(a), we plot the error $\|\bar{f} - f^*\|_2^2$ versus the total number of samples $N$, where $N \in \{2^8, 2^9, \ldots, 2^{13}\}$, using four different data partitions $m \in \{1, 4, 16, 64\}$. We execute each simulation 20 times to obtain standard errors for the plot. The black circled curve ($m = 1$) gives the baseline KRR error; if the number of partitions $m \leq 16$, Fast-KRR has accuracy comparable to the baseline algorithm. Even with $m = 64$, Fast-KRR's performance closely matches the full estimator for larger sample sizes ($N \geq 2^{11}$). In the right plot, Figure 1(b), we perform an identical experiment, but we over-regularize by choosing $\lambda = n^{-2/3}$ rather than $\lambda = N^{-2/3}$ in each of the $m$ sub-problems, combining the local estimates by averaging as usual. In contrast to Figure 1(a), there is an obvious gap between the performance of the algorithms when $m = 1$ and $m > 1$, as our theory predicts.
It is also interesting to understand the number of partitions m into which a dataset
of size N may be divided while maintaining good statistical performance. According to
Corollary 3 with $\nu = 1$, for the first-order Sobolev kernel, performance degradation should
be limited as long as $m \lesssim N^{1/3}$. In order to test this prediction, Figure 2 plots the mean-square error $\|\bar{f} - f^*\|_2^2$ versus the ratio $\log(m)/\log(N)$.

Table 1: Mean-squared error and runtime of Fast-KRR as a function of the sample size $N$ and number of partitions $m$. Each cell gives the error $\|\bar{f} - f^*\|_2^2$ and the time in seconds (time standard deviation over 10 simulations in parentheses).

 N     | m = 1                 | m = 16                | m = 64                | m = 256               | m = 1024
 2^12  | 1.26e-4, 1.12 (0.03)  | 1.33e-4, 0.03 (0.01)  | 1.38e-4, 0.02 (0.00)  | N/A                   | N/A
 2^13  | 6.40e-5, 5.47 (0.22)  | 6.29e-5, 0.12 (0.03)  | 6.72e-5, 0.04 (0.00)  | N/A                   | N/A
 2^14  | 3.95e-5, 30.16 (0.87) | 4.06e-5, 0.59 (0.11)  | 4.03e-5, 0.11 (0.00)  | 3.89e-5, 0.06 (0.00)  | N/A
 2^15  | Fail                  | 2.90e-5, 2.65 (0.04)  | 2.84e-5, 0.43 (0.02)  | 2.78e-5, 0.15 (0.01)  | N/A
 2^16  | Fail                  | 1.75e-5, 16.65 (0.30) | 1.73e-5, 2.21 (0.06)  | 1.71e-5, 0.41 (0.01)  | 1.67e-5, 0.23 (0.01)
 2^17  | Fail                  | 1.19e-5, 90.80 (3.71) | 1.21e-5, 10.87 (0.19) | 1.25e-5, 1.88 (0.08)  | 1.24e-5, 0.60 (0.02)

Our theory predicts that even as the
number of partitions $m$ may grow polynomially in $N$, the error should grow only above some constant value of $\log(m)/\log(N)$. As Figure 2 shows, the point at which $\|\bar{f} - f^*\|_2^2$ begins to increase appears to be around $\log(m) \approx 0.45\log(N)$ for reasonably large $N$. This empirical performance is somewhat better than the $1/3$ threshold predicted by Corollary 3, but it does confirm that the number of partitions $m$ can scale polynomially with $N$ while retaining minimax optimality.
Our final experiment gives evidence for the improved time complexity that partitioning provides. Here we compare the amount of time required to solve the KRR problem using the naive matrix inversion (18) for different partition sizes $m$, and provide the resulting squared errors $\|\bar{f} - f^*\|_2^2$. Although there are more sophisticated solution strategies, we believe this is a reasonable proxy to exhibit Fast-KRR's potential. In Table 1, we present the results of this simulation, which we performed in Matlab using a Windows machine with 16GB of memory and a single-threaded 3.4GHz processor. In each entry of the table, we give the mean error of Fast-KRR and the mean amount of time it took to run (with standard deviation over 10 simulations in parentheses; the error standard deviations are an order of magnitude smaller than the errors, so we do not report them). The entries "Fail" correspond to out-of-memory failures caused by the large matrix inversion, while entries "N/A" indicate that $\|\bar{f} - f^*\|_2^2$ was significantly larger than the optimal value (rendering time improvements meaningless). The table shows that, without sacrificing accuracy, decomposition via Fast-KRR can yield substantial computational improvements.
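In the same spirit, the speedup pattern in Table 1 can be reproduced qualitatively with a naive dense solver; a rough sketch (absolute timings depend on the machine and BLAS library; this is not the authors' Matlab code, and the helper name is our own):

```python
import time
import numpy as np

def solve_seconds(N, m, rng):
    """Wall-clock seconds to solve m independent (N/m)-sized KRR systems
    via naive dense factorization, as in equation (18)."""
    x = rng.uniform(0, 1, N)
    y = np.minimum(x, 1 - x) + rng.normal(0, np.sqrt(0.2), N)
    lam = N ** (-2 / 3)
    start = time.perf_counter()
    for p in np.array_split(rng.permutation(N), m):
        n = len(p)
        K = 1.0 + np.minimum(x[p][:, None], x[p][None, :])
        np.linalg.solve(K + lam * n * np.eye(n), y[p])
    return time.perf_counter() - start

rng = np.random.default_rng(0)
t_full = solve_seconds(2048, 1, rng)
t_split = solve_seconds(2048, 16, rng)
# cubic-cost solves mean total work drops by roughly a factor of m^2
```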
References
P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and
Statistics. Kluwer Academic, 2004.
R. Bhatia. Matrix Analysis. Springer, 1997.
M. Birman and M. Solomjak. Piecewise-polynomial approximations of functions of the classes $W_p^\alpha$. Sbornik: Mathematics, 2(3):295–317, 1967.
G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems 24, 2010.
A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
R. Chen, A. Gittens, and J. A. Tropp. The masked sample covariance estimator: an analysis
using matrix concentration inequalities. Information and Inference, to appear, 2012.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.
C. Gu. Smoothing spline ANOVA models. Springer, 2002.
L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
2001.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. Bootstrapping big data. In Proceedings
of the 29th International Conference on Machine Learning, 2012.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.
D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.
R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured
perceptron. In North American Chapter of the Association for Computational Linguistics
(NAACL), 2010.
S. Mendelson. Geometric parameters of kernel machines. In Proceedings of COLT, pages 29–43, 2002.
Table 2: Notation used in proofs

$\hat{f}$ : the kernel ridge regression estimate
$\Delta$ : the error $\hat{f} - f^*$, with expansion $\Delta = \sum_{j=1}^{\infty}\delta_j\phi_j$ in the basis $\{\phi_j\}$
$d$ : integer-valued truncation point
$M$ : diagonal matrix with $M = \operatorname{diag}(\mu_1, \ldots, \mu_d)$
$Q$ : diagonal matrix with $Q = I_{d\times d} + \lambda M^{-1}$
$v^{\downarrow}$ : truncation of vector $v$; for $v = \sum_j \nu_j\phi_j \in \mathcal{H}$, defined as $v^{\downarrow} = \sum_{j=1}^{d}\nu_j\phi_j$; for $v \in \ell^2(\mathbb{N})$, defined as $v^{\downarrow} = (v_1, \ldots, v_d)$
$v^{\uparrow}$ : untruncated part of vector $v$, defined as $v^{\uparrow} = (v_{d+1}, v_{d+2}, \ldots)$
$\beta_d$ : the tail sum $\sum_{j>d}\mu_j$
$\gamma(\lambda)$ : the sum $\sum_{j=1}^{\infty}1/(1+\lambda/\mu_j)$
$b(n,d,k)$ : the maximum $\max\{\sqrt{\max\{k,\log(2d)\}},\ \max\{k,\log(2d)\}/n^{1/2-1/k}\}$
Throughout this appendix, via the reproducing property we let $x_i$ denote the element $K(x_i, \cdot) \in \mathcal{H}$, so that $f(x_i) = \langle x_i, f\rangle_{\mathcal{H}}$, and the kernel ridge regression objective for a sample of size $n$ is

$$\frac{1}{n}\sum_{i=1}^{n}\left(\langle x_i, f\rangle_{\mathcal{H}} - y_i\right)^2 + \lambda\|f\|_{\mathcal{H}}^2. \quad (19)$$
This objective is Fréchet differentiable, and as a consequence, the necessary and sufficient conditions for optimality (Luenberger, 1969) of $\hat{f}$ are that

$$\frac{1}{n}\sum_{i=1}^{n}x_i\left(\langle x_i, \hat{f} - f^*\rangle_{\mathcal{H}} - \varepsilon_i\right) + \lambda\hat{f} = \frac{1}{n}\sum_{i=1}^{n}x_i\left(\langle x_i, \hat{f}\rangle_{\mathcal{H}} - y_i\right) + \lambda\hat{f} = 0. \quad (20)$$
Taking conditional expectations over the noise variables $\{\varepsilon_i\}_{i=1}^{n}$ with the design $X = \{x_i\}_{i=1}^{n}$ fixed, we find that

$$\frac{1}{n}\sum_{i=1}^{n}x_i\left\langle x_i, \mathbb{E}[\Delta \mid X]\right\rangle + \lambda\,\mathbb{E}[\hat{f} \mid X] = 0,$$

where $\Delta := \hat{f} - f^*$.
Define the sample covariance operator $\hat{\Sigma} := \frac{1}{n}\sum_{i=1}^{n}x_i \otimes x_i$; then rearranging the above equation yields

$$(\hat{\Sigma} + \lambda I)\,\mathbb{E}[\Delta \mid X] = -\lambda f^*. \quad (21)$$

Consequently, we see that $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}} \leq \|f^*\|_{\mathcal{H}}$, since $\hat{\Sigma} \succeq 0$.
We now use a truncation argument to reduce the problem to a finite-dimensional problem. To do so, we let $\delta \in \ell^2(\mathbb{N})$ denote the coefficients of $\mathbb{E}[\Delta \mid X]$ when expanded in the basis $\{\phi_j\}_{j=1}^{\infty}$:

$$\mathbb{E}[\Delta \mid X] = \sum_{j=1}^{\infty}\delta_j\phi_j, \quad (22)$$

and we write $\delta = \delta^{\downarrow} + \delta^{\uparrow}$ for its truncated part $\delta^{\downarrow} = (\delta_1, \ldots, \delta_d)$ and tail $\delta^{\uparrow} = (\delta_{d+1}, \delta_{d+2}, \ldots)$, as in Table 2. By definition, we have

$$\|\delta^{\uparrow}\|_2^2 = \sum_{j=d+1}^{\infty}\delta_j^2 \stackrel{(i)}{\leq} \mu_{d+1}\sum_{j=d+1}^{\infty}\frac{\delta_j^2}{\mu_j} \leq \mu_{d+1}\,\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}}^2 \stackrel{(ii)}{\leq} \mu_{d+1}\,\|f^*\|_{\mathcal{H}}^2, \quad (24)$$

where inequality (i) follows since $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}}^2 = \sum_{j=1}^{\infty}\delta_j^2/\mu_j$; and inequality (ii) follows from the bound $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}} \leq \|f^*\|_{\mathcal{H}}$, which is a consequence of equality (21).
Control of the term $\|\delta^{\downarrow}\|_2^2$: Let $(\theta_1, \theta_2, \ldots)$ be the coefficients of $f^*$ in the basis $\{\phi_j\}$. In addition, define the matrix $\Phi \in \mathbb{R}^{n\times d}$ by

$$\Phi_{ij} = \phi_j(x_i) \quad \text{for } i \in \{1, \ldots, n\} \text{ and } j \in \{1, \ldots, d\},$$

and $M = \operatorname{diag}(\mu_1, \ldots, \mu_d) \succ 0 \in \mathbb{R}^{d\times d}$. Lastly, define the tail error vector $v \in \mathbb{R}^{n}$ by

$$v_i := \sum_{j>d}\delta_j\phi_j(x_i) = \mathbb{E}[\Delta^{\uparrow}(x_i) \mid X].$$
Let $l \in \mathbb{N}$ be arbitrary. Computing the (Hilbert) inner product of the terms in equation (21) with $\phi_l$, we obtain

$$-\lambda\frac{\theta_l}{\mu_l} = \left\langle\phi_l, -\lambda f^*\right\rangle_{\mathcal{H}} = \left\langle\phi_l, (\hat{\Sigma}+\lambda I)\,\mathbb{E}[\Delta\mid X]\right\rangle_{\mathcal{H}} = \frac{1}{n}\sum_{i=1}^{n}\langle\phi_l, x_i\rangle_{\mathcal{H}}\langle x_i, \mathbb{E}[\Delta\mid X]\rangle_{\mathcal{H}} + \lambda\langle\phi_l, \mathbb{E}[\Delta\mid X]\rangle_{\mathcal{H}} = \frac{1}{n}\sum_{i=1}^{n}\phi_l(x_i)\,\mathbb{E}[\Delta(x_i)\mid X] + \lambda\frac{\delta_l}{\mu_l}.$$
We can rewrite the final sum above using the fact that $\Delta = \Delta^{\downarrow} + \Delta^{\uparrow}$, which implies

$$\frac{1}{n}\sum_{i=1}^{n}\phi_l(x_i)\,\mathbb{E}[\Delta(x_i)\mid X] = \frac{1}{n}\sum_{i=1}^{n}\phi_l(x_i)\left(\sum_{j=1}^{d}\phi_j(x_i)\delta_j + \sum_{j>d}\phi_j(x_i)\delta_j\right). \quad (25)$$
We now show how the expression (25) gives us the desired bound in the lemma. By defining the shorthand matrix $Q = (I + \lambda M^{-1})$, we have

$$\frac{1}{n}\Phi^T\Phi + \lambda M^{-1} = I + \lambda M^{-1} + \frac{1}{n}\Phi^T\Phi - I = Q\left(I + Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right)\right).$$

As a consequence, we can rewrite the expression (25) as

$$\left(I + Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right)\right)\delta^{\downarrow} = -\lambda Q^{-1}M^{-1}\theta^{\downarrow} - \frac{1}{n}Q^{-1}\Phi^T v. \quad (26)$$
We control the two terms on the right-hand side of equation (26) via the bounds

$$\left\|\lambda Q^{-1}M^{-1}\theta^{\downarrow}\right\|_2^2 \leq \frac{\lambda}{2}\|f^*\|_{\mathcal{H}}^2 \quad (27a)$$

and

$$\mathbb{E}\left[\left\|\frac{1}{n}Q^{-1}\Phi^T v\right\|_2^2\right] \leq \frac{\rho^4\|f^*\|_{\mathcal{H}}^2\,\beta_d\operatorname{tr}(K)}{4\lambda}, \quad (27b)$$

both of which we establish below.
Define the event $\mathcal{E} := \left\{\left|\left|\left|Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right)\right|\right|\right| \leq \frac{1}{2}\right\}$. Under Assumption A with the moment bound $\mathbb{E}[\phi_j(X)^{2k}] \leq \rho^{2k}$, there exists a universal constant $C$ such that

$$\mathbb{P}(\mathcal{E}^c) \leq \left(\max\left\{\sqrt{\max\{k,\log(2d)\}},\ \frac{\max\{k,\log(2d)\}}{n^{1/2-1/k}}\right\}\frac{C(\rho^2\gamma(\lambda)+1)}{\sqrt{n}}\right)^{k}. \quad (28)$$
On the event $\mathcal{E}$, the operator $I + Q^{-1}((1/n)\Phi^T\Phi - I)$ is invertible with inverse of norm at most 2. Applying the bounds (27a) and (27b), along with the elementary inequality $(a+b)^2 \leq 2a^2 + 2b^2$, we thus have

$$\mathbb{E}\left[\|\delta^{\downarrow}\|_2^2\right] \leq 4\lambda\|f^*\|_{\mathcal{H}}^2 + \frac{2\rho^4\|f^*\|_{\mathcal{H}}^2\operatorname{tr}(K)\,\beta_d}{\lambda} + \mathbb{E}\left[\mathbf{1}(\mathcal{E}^c)\,\|\delta^{\downarrow}\|_2^2\right]. \quad (29)$$
Now we use the fact that, by the gradient optimality condition (21), $\|\mathbb{E}[\Delta\mid X]\|_2 \leq \|f^*\|_2$. Recalling the shorthand (5) for $b(n, d, k)$, we apply the bound (28) to see that

$$\mathbb{E}\left[\mathbf{1}(\mathcal{E}^c)\,\|\delta^{\downarrow}\|_2^2\right] \leq \|f^*\|_2^2\,\mathbb{P}(\mathcal{E}^c) \leq C\,b(n,d,k)\left(\frac{\rho^2\gamma(\lambda)+1}{\sqrt{n}}\right)^{k}\|f^*\|_2^2.$$

It remains to establish the bounds (27a) and (27b). For the bound (27a), note that $Q^{-1}M^{-1} = (M + \lambda I)^{-1}$, whence

$$\left\|\lambda Q^{-1}M^{-1}\theta^{\downarrow}\right\|_2^2 = \lambda^2(\theta^{\downarrow})^T(M+\lambda I)^{-2}\theta^{\downarrow} = \lambda^2(\theta^{\downarrow})^T\left(M^2 + \lambda^2 I + 2\lambda M\right)^{-1}\theta^{\downarrow} \leq \lambda^2(\theta^{\downarrow})^T(2\lambda M)^{-1}\theta^{\downarrow} = \frac{\lambda}{2}(\theta^{\downarrow})^T M^{-1}\theta^{\downarrow} \leq \frac{\lambda}{2}\|f^*\|_{\mathcal{H}}^2.$$

For the bound (27b), we begin with the expansion

$$\left\|\frac{1}{n}Q^{-1}\Phi^T v\right\|_2 \leq \left|\left|\left|(M+\lambda I)^{-1}M^{1/2}\right|\right|\right|\,\left\|\frac{1}{n}M^{1/2}\Phi^T v\right\|_2, \quad (30)$$

and note that

$$\left|\left|\left|(M+\lambda I)^{-1}M^{1/2}\right|\right|\right| = \max_{j\in[d]}\frac{\sqrt{\mu_j}}{\mu_j+\lambda} \leq \frac{1}{2\sqrt{\lambda}}, \quad (31)$$

which follows since $2\sqrt{\lambda\mu_j} \leq \mu_j + \lambda$. To control the remaining term, we expand coordinates and apply the Cauchy-Schwarz inequality:

$$\left\|\frac{1}{n}M^{1/2}\Phi^T v\right\|_2^2 = \frac{1}{n^2}\sum_{l=1}^{d}\mu_l\left(\sum_{i=1}^{n}\phi_l(x_i)v_i\right)^2 \leq \frac{1}{n^2}\sum_{l=1}^{d}\mu_l\,\|\Phi_l\|_2^2\,\|v\|_2^2, \quad (32)$$

where $\Phi_l = (\phi_l(x_1), \ldots, \phi_l(x_n))$ denotes the $l$-th column of $\Phi$. Taking expectations with respect to the design $\{x_i\}_{i=1}^{n}$ and applying Hölder's inequality yields

$$\mathbb{E}\left[\|\Phi_l\|_2^2\,\|v\|_2^2\right] \leq \sqrt{\mathbb{E}[\|\Phi_l\|_2^4]}\sqrt{\mathbb{E}[\|v\|_2^4]}.$$
We bound each of the terms in this product in turn. For the first, we have

$$\mathbb{E}\left[\|\Phi_l\|_2^4\right] = \mathbb{E}\left[\left(\sum_{i=1}^{n}\phi_l^2(X_i)\right)^2\right] = \mathbb{E}\left[\sum_{i,j=1}^{n}\phi_l^2(X_i)\,\phi_l^2(X_j)\right] \leq n^2\,\mathbb{E}[\phi_l^4(X_1)] \leq n^2\rho^4,$$

since the $X_i$ are i.i.d., $\mathbb{E}[\phi_l^2(X_1)] \leq \sqrt{\mathbb{E}[\phi_l^4(X_1)]}$, and $\mathbb{E}[\phi_l^4(X_1)] \leq \rho^4$ by assumption. Turning to the term involving $v$, we have
$$v_i^2 = \left(\sum_{j>d}\delta_j\phi_j(x_i)\right)^2 \leq \left(\sum_{j>d}\frac{\delta_j^2}{\mu_j}\right)\left(\sum_{j>d}\mu_j\phi_j^2(x_i)\right)$$

by the Cauchy-Schwarz inequality, and consequently

$$\mathbb{E}\left[\|v\|_2^4\right] = \mathbb{E}\left[\left(\sum_{i=1}^{n}v_i^2\right)^2\right] \leq n\sum_{i=1}^{n}\mathbb{E}[v_i^4] \leq n^2\,\mathbb{E}\left[\|\mathbb{E}[\Delta\mid X]\|_{\mathcal{H}}^4\left(\sum_{j>d}\mu_j\phi_j^2(X_1)\right)^2\right],$$

since the $X_i$ are i.i.d. Using the fact that $\|\mathbb{E}[\Delta\mid X]\|_{\mathcal{H}} \leq \|f^*\|_{\mathcal{H}}$, we expand the second square to find
$$\frac{1}{n^2}\mathbb{E}\left[\|v\|_2^4\right] \leq \|f^*\|_{\mathcal{H}}^4\sum_{j,k>d}\mathbb{E}\left[\mu_j\mu_k\,\phi_j^2(X_1)\,\phi_k^2(X_1)\right] \leq \rho^4\|f^*\|_{\mathcal{H}}^4\sum_{j,k>d}\mu_j\mu_k = \rho^4\|f^*\|_{\mathcal{H}}^4\left(\sum_{j>d}\mu_j\right)^2.$$
Combining our bounds on $\|\Phi_l\|_2$ and $\|v\|_2$ with our initial bound (32), we obtain

$$\mathbb{E}\left[\left\|M^{1/2}\Phi^T v\right\|_2^2\right] \leq \sum_{l=1}^{d}\mu_l\sqrt{\mathbb{E}[\|\Phi_l\|_2^4]}\sqrt{\mathbb{E}[\|v\|_2^4]} \leq \sum_{l=1}^{d}\mu_l\sqrt{n^2\rho^4}\sqrt{n^2\rho^4\|f^*\|_{\mathcal{H}}^4\beta_d^2} = n^2\rho^4\|f^*\|_{\mathcal{H}}^2\,\beta_d\sum_{l=1}^{d}\mu_l.$$

Dividing by $n^2$, recalling the definition of $\beta_d = \sum_{j>d}\mu_j$, and noting that $\operatorname{tr}(K) \geq \sum_{l=1}^{d}\mu_l$, this shows that

$$\mathbb{E}\left[\left\|\frac{1}{n}M^{1/2}\Phi^T v\right\|_2^2\right] \leq \rho^4\|f^*\|_{\mathcal{H}}^2\,\beta_d\operatorname{tr}(K).$$
Combining this inequality with our expansion (30) and the bound (31) yields the claim (27b).
Proof of bound (28): Let us consider the expectation of the norm of the matrix $Q^{-1}((1/n)\Phi^T\Phi - I)$. For each $i \in [n]$, let $\phi_i = (\phi_1(x_i), \ldots, \phi_d(x_i)) \in \mathbb{R}^{d}$ denote the $i$th row of the matrix $\Phi \in \mathbb{R}^{n\times d}$. Then we know that

$$Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right) = \frac{1}{n}\sum_{i=1}^{n}Q^{-1}\left(\phi_i\phi_i^T - I\right).$$
Define the symmetric dilations

$$A_i := \begin{bmatrix} 0 & Q^{-1}(\phi_i\phi_i^T - I) \\ (\phi_i\phi_i^T - I)Q^{-1} & 0 \end{bmatrix}.$$

Then the matrices $A_i = A_i^T \in \mathbb{R}^{2d\times 2d}$, and moreover $\lambda_{\max}(A_i) = |||A_i||| = |||Q^{-1}(\phi_i\phi_i^T - I)|||$, and similarly for their averages (Bhatia, 1997). Note that $\mathbb{E}[A_i] = 0$, and let $\varepsilon_i$ be i.i.d. $\{-1, +1\}$-valued Rademacher random variables. Applying a standard symmetrization argument (Ledoux and Talagrand, 1991), we find that for any $k \geq 1$, we have

$$\mathbb{E}\left[\left|\left|\left|Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right)\right|\right|\right|^{k}\right] = \mathbb{E}\left[\left|\left|\left|\frac{1}{n}\sum_{i=1}^{n}A_i\right|\right|\right|^{k}\right] \leq 2^{k}\,\mathbb{E}\left[\left|\left|\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i A_i\right|\right|\right|^{k}\right]. \quad (33)$$
Lemma 8 The quantity $\mathbb{E}\left[\left|\left|\left|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i A_i\right|\right|\right|^{k}\right]^{1/k}$ is upper bounded by

$$\sqrt{\frac{e\,\max\{k, 2\log(2d)\}\left(\rho^2\sum_{j=1}^{d}\frac{1}{1+\lambda/\mu_j} + 1\right)}{n}} + \frac{4e\,\max\{k, 2\log(2d)\}}{n^{1-1/k}}\left(\rho^2\sum_{j=1}^{d}\frac{1}{1+\lambda/\mu_j} + 1\right). \quad (34)$$
Deferring the proof of Lemma 8 momentarily, recall the definition of the constant $\gamma(\lambda) = \sum_{j=1}^{\infty}1/(1+\lambda/\mu_j) \geq \sum_{j=1}^{d}1/(1+\lambda/\mu_j)$. Then using our symmetrization inequality (33) and Lemma 8, we have

$$\mathbb{E}\left[\left|\left|\left|Q^{-1}\left(\frac{1}{n}\Phi^T\Phi - I\right)\right|\right|\right|^{k}\right] \leq 2^{k}\left(\sqrt{\frac{e\,\max\{k,2\log(2d)\}\,(\rho^2\gamma(\lambda)+1)}{n}} + \frac{4e\,\max\{k,2\log(2d)\}}{n^{1-1/k}}\left(\rho^2\gamma(\lambda)+1\right)\right)^{k}, \quad (35)$$

so that by Markov's inequality,

$$\mathbb{P}(\mathcal{E}^c) = \mathbb{P}\left(\left|\left|\left|Q^{-1}\left(\tfrac{1}{n}\Phi^T\Phi - I\right)\right|\right|\right| > \tfrac{1}{2}\right) \leq 2^{k}\,\mathbb{E}\left[\left|\left|\left|Q^{-1}\left(\tfrac{1}{n}\Phi^T\Phi - I\right)\right|\right|\right|^{k}\right] \leq \left(\max\left\{\sqrt{\max\{k,\log(2d)\}},\ \frac{\max\{k,\log(2d)\}}{n^{1/2-1/k}}\right\}\frac{C(\rho^2\gamma(\lambda)+1)}{\sqrt{n}}\right)^{k}$$

for a universal constant $C$.
This completes the proof of the bound (28).
It remains to prove Lemma 8, for which we make use of the following result, due to Chen et al. (2012, Theorem A.1(2)).

Lemma 9 Let $X_i \in \mathbb{R}^{d\times d}$ be independent symmetrically distributed Hermitian matrices. Then

$$\mathbb{E}\left[\left|\left|\left|\sum_{i=1}^{n}X_i\right|\right|\right|^{k}\right]^{1/k} \leq \sqrt{e\,\max\{k, 2\log d\}}\,\left|\left|\left|\sum_{i=1}^{n}\mathbb{E}[X_i^2]\right|\right|\right|^{1/2} + 2e\,\max\{k, 2\log d\}\,\mathbb{E}\left[\max_{i}|||X_i|||^{k}\right]^{1/k}. \quad (36)$$
The proof of Lemma 8 is based on applying this inequality with $X_i = \varepsilon_i A_i/n$, and then bounding the two terms on the right-hand side of inequality (36).

We begin with the first term. Note that

$$\mathbb{E}[A_i^2] = \mathbb{E}\,\mathrm{diag}\left(Q^{-1}(\phi_i\phi_i^T - I)^2Q^{-1},\ (\phi_i\phi_i^T - I)Q^{-2}(\phi_i\phi_i^T - I)\right).$$

Moreover, we have $\mathbb{E}[\phi_i\phi_i^T] = I_{d\times d}$, which leaves us needing to compute $\mathbb{E}[\phi_i\phi_i^T Q^{-2}\phi_i\phi_i^T]$ and $\mathbb{E}[Q^{-1}\phi_i\phi_i^T\phi_i\phi_i^T Q^{-1}]$. Instead of computing them directly, we provide bounds on their norms. Since $\phi_i\phi_i^T$ is rank one and $Q$ is diagonal, we have

$$\left|\left|\left|Q^{-1}\phi_i\phi_i^T\right|\right|\right| = \left|\left|\left|\phi_i\phi_i^T Q^{-1}\right|\right|\right| = \phi_i^T(I + \lambda M^{-1})^{-1}\phi_i = \sum_{j=1}^{d}\frac{\phi_j(x_i)^2}{1+\lambda/\mu_j}.$$
Consequently, for any $k \geq 1$,

$$\bigg(\sum_{j=1}^d \frac{\phi_j(x_i)^2}{1+\lambda/\mu_j}\bigg)^k = \bigg(\sum_{l=1}^d \frac{1}{1+\lambda/\mu_l}\bigg)^k \bigg(\sum_{j=1}^d \frac{1/(1+\lambda/\mu_j)}{\sum_{l=1}^d 1/(1+\lambda/\mu_l)}\,\phi_j(x_i)^2\bigg)^k \leq \bigg(\sum_{l=1}^d \frac{1}{1+\lambda/\mu_l}\bigg)^{k-1} \sum_{j=1}^d \frac{\phi_j(x_i)^{2k}}{1+\lambda/\mu_j},$$

where the inequality is Jensen's, applied to the convex function $t \mapsto t^k$ with the normalized weights $(1+\lambda/\mu_j)^{-1}/\sum_{l=1}^d (1+\lambda/\mu_l)^{-1}$. Taking expectations and using the moment bound $\mathbb{E}[\phi_j(X)^{2k}] \leq \rho^{2k}$, we obtain

$$\mathbb{E}\bigg[\bigg(\sum_{j=1}^d \frac{\phi_j(x_i)^2}{1+\lambda/\mu_j}\bigg)^k\bigg] \leq \bigg(\sum_{j=1}^d \frac{1}{1+\lambda/\mu_j}\bigg)^k \rho^{2k}. \tag{37}$$
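The normalized-weights Jensen step above is elementary to verify numerically. A minimal sketch, with hypothetical weights $w_j = 1/(1+\lambda/\mu_j)$ and stand-in values $a_j$ for $\phi_j(x_i)^2$:

```python
import numpy as np

# Weighted Jensen step behind (37):
# (sum_j w_j a_j)^k <= (sum_j w_j)^(k-1) * sum_j w_j a_j^k  for w_j, a_j >= 0, k >= 1.
rng = np.random.default_rng(2)
d, lam, k = 8, 0.5, 3.0
mu = 1.0 / np.arange(1, d + 1) ** 2     # assumed eigenvalue decay mu_j
w = 1.0 / (1.0 + lam / mu)              # weights 1/(1 + lam/mu_j)
a = rng.uniform(size=d) ** 2            # stand-in for phi_j(x_i)^2

lhs = np.dot(w, a) ** k
rhs = w.sum() ** (k - 1) * np.dot(w, a ** k)
```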
The sub-multiplicativity of the matrix norm implies $|||Q^{-1}\phi_i\phi_i^T\phi_i\phi_i^T Q^{-1}||| \leq |||Q^{-1}\phi_i\phi_i^T|||^2$, and consequently we have

$$\mathbb{E}\Big[|||Q^{-1}\phi_i\phi_i^T\phi_i\phi_i^T Q^{-1}|||\Big] \leq \mathbb{E}\Big[\big(\phi_i^T (I+\lambda M^{-1})^{-1}\phi_i\big)^2\Big] \leq \rho^4\bigg(\sum_{j=1}^d \frac{1}{1+\lambda/\mu_j}\bigg)^2,$$

where the final step follows from inequality (37) with $k = 2$; the same bound holds for $\mathbb{E}[|||\phi_i\phi_i^T Q^{-2}\phi_i\phi_i^T|||]$. Consequently,

$$\bigg|\bigg|\bigg|\frac{1}{n^2}\sum_{i=1}^n \mathbb{E}[A_i^2]\bigg|\bigg|\bigg|^{1/2} = \frac{1}{n}\bigg|\bigg|\bigg|\sum_{i=1}^n \mathbb{E}[A_i^2]\bigg|\bigg|\bigg|^{1/2} \leq \frac{1}{\sqrt{n}}\sqrt{\rho^4\bigg(\sum_{j=1}^d \frac{1}{1+\lambda/\mu_j}\bigg)^2 + 1} \leq \frac{1}{\sqrt{n}}\bigg(\rho^2\sum_{j=1}^d \frac{1}{1+\lambda/\mu_j} + 1\bigg).$$

We have thus obtained the first term on the right-hand side of expression (34).
We now turn to the second term in expression (34). For real $k \geq 1$, we have

$$\mathbb{E}\big[\max_i |||\varepsilon_i A_i/n|||^k\big] = \frac{1}{n^k}\,\mathbb{E}\big[\max_i |||A_i|||^k\big] \leq \frac{1}{n^k}\sum_{i=1}^n \mathbb{E}\big[|||A_i|||^k\big].$$

Since $|||A_i||| \leq |||Q^{-1}\phi_i\phi_i^T||| + |||Q^{-1}||| \leq \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1+\lambda/\mu_j} + 1$, the elementary inequality $(a+b)^k \leq 2^{k-1}(a^k + b^k)$ gives

$$|||A_i|||^k \leq 2^{k-1}\bigg(\sum_{j=1}^d \frac{\phi_j(x_i)^2}{1+\lambda/\mu_j}\bigg)^k + 2^{k-1},$$

and hence, applying inequality (37),

$$\mathbb{E}\big[\max_i |||\varepsilon_i A_i/n|||^k\big] \leq \frac{2^{k-1}}{n^{k-1}}\bigg(\rho^{2k}\bigg(\sum_{j=1}^d \frac{1}{1+\lambda/\mu_j}\bigg)^k + 1\bigg).$$

Taking $k$th roots yields the second term in the expression (34).
We now prove Lemma 10, recalling that $\Delta = \hat f - f^*$. Since the empirical loss term in the objective is non-negative,

$$\mathbb{E}\big[\lambda\|\hat f\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n\big] \leq \mathbb{E}\bigg[\frac{1}{n}\sum_{i=1}^n \big(\hat f(x_i) - f^*(x_i) - \varepsilon_i\big)^2 + \lambda\|\hat f\|_{\mathcal{H}}^2 \,\Big|\, \{x_i\}_{i=1}^n\bigg] \stackrel{(i)}{\leq} \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\varepsilon_i^2 \mid x_i] + \lambda\|f^*\|_{\mathcal{H}}^2 \stackrel{(ii)}{\leq} \sigma^2 + \lambda\|f^*\|_{\mathcal{H}}^2,$$

where inequality (i) follows since $\hat f$ minimizes the objective function (2), so its objective value is at most that of $f^*$; and inequality (ii) uses the fact that $\mathbb{E}[\varepsilon_i^2 \mid x_i] \leq \sigma^2$. Applying the triangle inequality to $\|\cdot\|_{\mathcal{H}}$ along with the elementary inequality $(a+b)^2 \leq 2a^2 + 2b^2$, we find that

$$\mathbb{E}\big[\|\Delta\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n\big] \leq 2\|f^*\|_{\mathcal{H}}^2 + 2\,\mathbb{E}\big[\|\hat f\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n\big] \leq \frac{2\sigma^2}{\lambda} + 4\|f^*\|_{\mathcal{H}}^2,$$

which is the claimed bound.
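The "basic inequality" step above, in which evaluating the objective at the truth bounds the penalty of the minimizer, can be illustrated in a finite-dimensional ridge problem. A minimal sketch under hypothetical dimensions and noise level (the exact solve stands in for the abstract minimizer):

```python
import numpy as np

# Basic-inequality step: for the ridge minimizer w_hat of
# (1/n)||y - Xw||^2 + lam*||w||^2, the penalty lam*||w_hat||^2 is bounded by
# the objective evaluated at the true parameter, (1/n)||eps||^2 + lam*||w_star||^2.
rng = np.random.default_rng(3)
n, d, lam, sigma = 200, 5, 0.1, 0.5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
eps = sigma * rng.normal(size=n)
y = X @ w_star + eps

# Exact minimizer via the normal equations (X^T X / n + lam I) w = X^T y / n.
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

obj = lambda w: np.mean((y - X @ w) ** 2) + lam * np.dot(w, w)
penalty_bound = np.mean(eps ** 2) + lam * np.dot(w_star, w_star)
```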
With Lemma 10 in place, we now proceed to the proof of the theorem proper. Recall from Lemma 5 the optimality condition

$$\frac{1}{n}\sum_{i=1}^n \xi_{x_i}\big(\langle \xi_{x_i}, \hat f - f^*\rangle - \varepsilon_i\big) + \lambda \hat f = 0. \tag{38}$$
Now, let $\delta \in \ell^2(\mathbb{N})$ be the expansion of the error $\Delta$ in the basis $\{\phi_j\}$, so that $\Delta = \sum_{j=1}^\infty \delta_j \phi_j$, and (again, as in Lemma 5), we choose $d \in \mathbb{N}$ and truncate via

$$\Delta^{\downarrow} := \sum_{j=1}^d \delta_j \phi_j \quad\text{and}\quad \Delta^{\uparrow} := \Delta - \Delta^{\downarrow} = \sum_{j>d} \delta_j \phi_j.$$

Let $\delta^{\downarrow} \in \mathbb{R}^d$ and $\delta^{\uparrow}$ denote the corresponding coefficient vectors. As a consequence of the orthonormality of the basis functions, we have

$$\mathbb{E}[\|\Delta\|_2^2] = \mathbb{E}[\|\Delta^{\downarrow}\|_2^2] + \mathbb{E}[\|\Delta^{\uparrow}\|_2^2] = \mathbb{E}[\|\delta^{\downarrow}\|_2^2] + \sum_{j>d}\mathbb{E}[\delta_j^2], \tag{39}$$

and since $\delta_j^2 \leq \mu_{d+1}\,\delta_j^2/\mu_j$ for each $j > d$,

$$\mathbb{E}[\|\Delta^{\uparrow}\|_2^2] = \sum_{j>d}\mathbb{E}[\delta_j^2] \leq \mu_{d+1}\,\mathbb{E}[\|\Delta\|_{\mathcal{H}}^2]. \tag{40}$$
The remainder of the proof is devoted to bounding the term $\mathbb{E}[\|\Delta^{\downarrow}\|_2^2]$ in the decomposition (39). By taking the Hilbert inner product of $\phi_k$ with the optimality condition (38), we find as in our derivation of the matrix equation (25) that for each $k \in \{1, \ldots, d\}$,

$$\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^d \phi_k(x_i)\phi_j(x_i)\delta_j + \frac{1}{n}\sum_{i=1}^n \phi_k(x_i)\big(\Delta^{\uparrow}(x_i) - \varepsilon_i\big) + \frac{\lambda(\delta_k + \theta_k)}{\mu_k} = 0.$$

Given the expansion $f^* = \sum_{j=1}^\infty \theta_j \phi_j$ (with $\theta^{\downarrow} \in \mathbb{R}^d$ its first $d$ coefficients), define the tail error vector $v \in \mathbb{R}^n$ by $v_i = \sum_{j>d}\delta_j\phi_j(x_i)$, and recall the definition of the eigenvalue matrix $M = \mathrm{diag}(\mu_1, \ldots, \mu_d) \in \mathbb{R}^{d\times d}$. Given the matrix $\Phi$ defined by its coordinates $\Phi_{ij} = \phi_j(x_i)$, we have

$$\Big(\frac{1}{n}\Phi^T\Phi + \lambda M^{-1}\Big)\delta^{\downarrow} = -\lambda M^{-1}\theta^{\downarrow} - \frac{1}{n}\Phi^T v + \frac{1}{n}\Phi^T\varepsilon, \tag{41}$$

or equivalently,

$$\Big(I + Q^{-1}\Big(\frac{1}{n}\Phi^T\Phi - I\Big)\Big)\delta^{\downarrow} = Q^{-1}\Big(-\lambda M^{-1}\theta^{\downarrow} - \frac{1}{n}\Phi^T v + \frac{1}{n}\Phi^T\varepsilon\Big), \tag{42}$$

where $Q := (I + \lambda M^{-1})$.

We now recall the bounds (27a) and (28) from Lemma 7, as well as the previously defined event $\mathcal{E} := \{|||Q^{-1}((1/n)\Phi^T\Phi - I)||| \leq 1/2\}$. When $\mathcal{E}$ occurs, the expression (42) implies the inequality

$$\|\delta^{\downarrow}\|_2^2 \leq 4\,\Big\|{-\lambda} Q^{-1}M^{-1}\theta^{\downarrow} - \frac{1}{n}Q^{-1}\Phi^T v + \frac{1}{n}Q^{-1}\Phi^T\varepsilon\Big\|_2^2.$$
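The step from (42) to the displayed inequality rests on a Neumann-series fact: if $|||E||| \leq 1/2$ then $I + E$ is invertible with $|||(I+E)^{-1}||| \leq 2$, so the solution of $(I+E)x = b$ satisfies $\|x\|_2^2 \leq 4\|b\|_2^2$. A minimal numerical sketch with a hypothetical perturbation matrix:

```python
import numpy as np

# If |||E||| <= 1/2, then |||(I + E)^{-1}||| <= 1/(1 - 1/2) = 2, so the solution
# of (I + E) x = b obeys ||x||^2 <= 4 ||b||^2.
rng = np.random.default_rng(4)
d = 6
G = rng.normal(size=(d, d))
E = 0.5 * G / np.linalg.norm(G, 2)      # scaled so that |||E||| = 1/2 exactly
b = rng.normal(size=d)

x = np.linalg.solve(np.eye(d) + E, b)
lhs = np.dot(x, x)
rhs = 4 * np.dot(b, b)
```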
When $\mathcal{E}$ fails to hold, Lemma 10 may still be applied, since $\mathcal{E}$ is measurable with respect to $\{x_i\}_{i=1}^n$. Doing so yields

$$\mathbb{E}[\|\delta^{\downarrow}\|_2^2] = \mathbb{E}\big[\mathbf{1}(\mathcal{E})\,\|\delta^{\downarrow}\|_2^2\big] + \mathbb{E}\big[\mathbf{1}(\mathcal{E}^c)\,\|\delta^{\downarrow}\|_2^2\big]$$
$$\leq 4\,\mathbb{E}\bigg[\Big\|{-\lambda} Q^{-1}M^{-1}\theta^{\downarrow} - \frac{1}{n}Q^{-1}\Phi^T v + \frac{1}{n}Q^{-1}\Phi^T\varepsilon\Big\|_2^2\bigg] + \mathbb{E}\Big[\mathbf{1}(\mathcal{E}^c)\,\mathbb{E}\big[\|\delta^{\downarrow}\|_2^2 \mid \{x_i\}_{i=1}^n\big]\Big]$$
$$\leq 4\,\mathbb{E}\bigg[\Big\|{-\lambda} Q^{-1}M^{-1}\theta^{\downarrow} - \frac{1}{n}Q^{-1}\Phi^T v + \frac{1}{n}Q^{-1}\Phi^T\varepsilon\Big\|_2^2\bigg] + \mathbb{P}(\mathcal{E}^c)\Big(\frac{2\sigma^2}{\lambda} + 4\|f^*\|_{\mathcal{H}}^2\Big). \tag{43}$$
Since the bound (28) still holds, it remains to provide a bound on the first term in the expression (43). As in the proof of Lemma 5, we have $\|\lambda Q^{-1}M^{-1}\theta^{\downarrow}\|_2^2 \leq \frac{\lambda}{2}\|f^*\|_{\mathcal{H}}^2$ via the bound (27a).
Turning to the second term inside the norm, we claim that, under the conditions of Lemma 6, the following bound holds:

$$\mathbb{E}\bigg[\Big\|\frac{1}{n}Q^{-1}\Phi^T v\Big\|_2^2\bigg] \leq \frac{\rho^4\,\mathrm{tr}(K)\big(\sum_{j>d}\mu_j\big)\big(2\sigma^2/\lambda + 4\|f^*\|_{\mathcal{H}}^2\big)}{4\lambda}. \tag{44}$$
This claim is an analogue of our earlier bound (27b), and we prove it shortly. Lastly, we bound the norm of $Q^{-1}\Phi^T\varepsilon/n$. Noting that the diagonal entries of $Q^{-1}$ are $1/(1+\lambda/\mu_j)$, we have

$$\mathbb{E}\Big[\big\|Q^{-1}\Phi^T\varepsilon\big\|_2^2\Big] = \sum_{j=1}^d \sum_{i=1}^n \frac{\mathbb{E}[\phi_j(X_i)^2\varepsilon_i^2]}{(1+\lambda/\mu_j)^2}.$$

Since $\mathbb{E}[\phi_j(X_i)^2\varepsilon_i^2] = \mathbb{E}\big[\phi_j(X_i)^2\,\mathbb{E}[\varepsilon_i^2 \mid X_i]\big] \leq \sigma^2$ by assumption, we have the inequality

$$\mathbb{E}\Big[\big\|(1/n)Q^{-1}\Phi^T\varepsilon\big\|_2^2\Big] \leq \frac{\sigma^2}{n}\sum_{j=1}^d \frac{1}{(1+\lambda/\mu_j)^2}.$$
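For fixed covariates, the noise-term computation above reduces to a Frobenius-norm identity, which can be checked exactly. A minimal sketch of the fixed-design analogue, with hypothetical sizes (here $\mathbb{E}[\varepsilon\varepsilon^T] = \sigma^2 I$, so no Monte Carlo sampling is needed):

```python
import numpy as np

# Fixed-design analogue of the noise-term computation: with E[eps eps^T] = sigma^2 I,
# E||(1/n) Q^{-1} Phi^T eps||_2^2
#   = (sigma^2/n^2) * sum_{i,j} phi_j(x_i)^2 / (1 + lam/mu_j)^2,
# i.e. sigma^2 times the squared Frobenius norm of (1/n) Q^{-1} Phi^T.
rng = np.random.default_rng(5)
n, d, lam, sigma = 50, 4, 0.2, 1.3
mu = 1.0 / np.arange(1, d + 1)          # assumed eigenvalue decay mu_j
Phi = rng.normal(size=(n, d))
A = np.diag(1.0 / (1.0 + lam / mu)) @ Phi.T / n   # (1/n) Q^{-1} Phi^T

exact = sigma ** 2 * np.sum(A ** 2)
closed_form = sigma ** 2 / n ** 2 * np.sum(Phi ** 2 / (1.0 + lam / mu) ** 2)
```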
Applying the bound (28) to control $\mathbb{P}(\mathcal{E}^c)$ and bounding $\mathbb{E}[\|\Delta^{\uparrow}\|_2^2]$ using inequality (40) completes the proof of the lemma.
It remains to prove the bound (44). Recalling the inequality (31), we see that

$$\Big\|\frac{1}{n}Q^{-1}\Phi^T v\Big\|_2^2 \leq \big|\big|\big|Q^{-1}M^{-1/2}\big|\big|\big|^2\,\Big\|\frac{1}{n}M^{1/2}\Phi^T v\Big\|_2^2 \leq \frac{1}{4\lambda}\Big\|\frac{1}{n}M^{1/2}\Phi^T v\Big\|_2^2, \tag{45}$$

since each diagonal entry of $Q^{-1}M^{-1/2}$ satisfies $\sqrt{\mu_j}/(\mu_j + \lambda) \leq 1/(2\sqrt{\lambda})$.
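The scalar inequality behind the $1/(4\lambda)$ factor is the AM–GM bound $\mu + \lambda \geq 2\sqrt{\mu\lambda}$. A minimal numerical sketch over a hypothetical grid of eigenvalues:

```python
import numpy as np

# AM-GM bound behind (45): sqrt(mu)/(mu + lam) <= 1/(2*sqrt(lam)) for all mu > 0,
# with equality only at mu = lam.  These are the diagonal entries of Q^{-1} M^{-1/2}.
mu = np.logspace(-6, 3, 1000)           # hypothetical grid of eigenvalues mu_j
lam = 0.37                              # arbitrary positive regularization
entries = np.sqrt(mu) / (mu + lam)      # sqrt(mu_j)/(mu_j + lam)
bound = 1.0 / (2.0 * np.sqrt(lam))
```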
Letting $\phi_l = (\phi_l(X_1), \ldots, \phi_l(X_n)) \in \mathbb{R}^n$ denote the $l$th column of $\Phi$, we have

$$\mathbb{E}\Big[\big\|M^{1/2}\Phi^T v\big\|_2^2\Big] = \sum_{l=1}^d \mu_l\,\mathbb{E}\big[\langle\phi_l, v\rangle^2\big] \leq \sum_{l=1}^d \mu_l\,\mathbb{E}\big[\|\phi_l\|_2^2\,\|v\|_2^2\big] = \sum_{l=1}^d \mu_l\,\mathbb{E}\Big[\|\phi_l\|_2^2\,\mathbb{E}\big[\|v\|_2^2 \mid X\big]\Big].$$
Now consider the inner expectation. Applying the Cauchy–Schwarz inequality as in the proof of the bound (27b), we have

$$\|v\|_2^2 = \sum_{i=1}^n v_i^2 \leq \sum_{i=1}^n \bigg(\sum_{j>d}\frac{\delta_j^2}{\mu_j}\bigg)\bigg(\sum_{j>d}\mu_j\,\phi_j(X_i)^2\bigg).$$
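The weighted Cauchy–Schwarz split used above can be illustrated numerically on assumed tail data (the decay of $\mu_j$ and the coefficients below are hypothetical):

```python
import numpy as np

# Weighted Cauchy-Schwarz for one coordinate v_i = sum_j delta_j * phi_j(x_i):
# v_i^2 = (sum_j (delta_j/sqrt(mu_j)) * (sqrt(mu_j)*phi_j))^2
#       <= (sum_j delta_j^2/mu_j) * (sum_j mu_j * phi_j^2).
rng = np.random.default_rng(6)
m = 30                                   # number of tail coordinates j > d
mu = 1.0 / np.arange(1, m + 1) ** 2      # assumed tail eigenvalues mu_j
delta = rng.normal(size=m) * np.sqrt(mu) # hypothetical tail coefficients
phi = rng.normal(size=m)                 # phi_j(x_i) for one fixed i

v_i = np.dot(delta, phi)
lhs = v_i ** 2
rhs = np.sum(delta ** 2 / mu) * np.sum(mu * phi ** 2)
```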
Notably, the second term is $X$-measurable, and the first is bounded by $\|\Delta^{\uparrow}\|_{\mathcal{H}}^2 \leq \|\Delta\|_{\mathcal{H}}^2$. We thus obtain

$$\mathbb{E}\Big[\big\|M^{1/2}\Phi^T v\big\|_2^2\Big] \leq \sum_{l=1}^d \mu_l\,\mathbb{E}\bigg[\|\phi_l\|_2^2\,\bigg(\sum_{i=1}^n\sum_{j>d}\mu_j\,\phi_j^2(X_i)\bigg)\,\mathbb{E}\big[\|\Delta\|_{\mathcal{H}}^2 \mid X\big]\bigg]. \tag{46}$$