
Low Rank Matrix-valued Chernoff Bounds and Approximate Matrix Multiplication

Avner Magen

Anastasios Zouzias
Abstract

In this paper we develop algorithms for approximating matrix multiplication with respect to the spectral norm. Let A ∈ R^{n×m} and B ∈ R^{n×p} be two matrices and ε > 0. We approximate the product A⊤B using two sketches Ã ∈ R^{t×m} and B̃ ∈ R^{t×p}, where t ≪ n, such that

‖Ã⊤B̃ − A⊤B‖₂ ≤ ε ‖A‖₂ ‖B‖₂

with high probability. We analyze two different sampling procedures for constructing Ã and B̃; one of them is done by i.i.d. non-uniform sampling of rows from A and B, and the other by taking random linear combinations of their rows. We prove bounds on t that depend only on the intrinsic dimensionality of A and B, that is, their rank and their stable rank.

For achieving bounds that depend on rank when taking random linear combinations, we employ standard tools from high-dimensional geometry such as concentration of measure arguments combined with elaborate ε-net constructions. For bounds that depend on the smaller parameter of stable rank this technology itself seems weak. However, we show that in combination with a simple truncation argument it is amenable to provide such bounds. To handle similar bounds for row sampling, we develop a novel matrix-valued Chernoff bound inequality which we call the low rank matrix-valued Chernoff bound. Thanks to this inequality, we are able to give bounds that depend only on the stable rank of the input matrices.

We highlight the usefulness of our approximate matrix multiplication bounds by supplying two applications. First we give an approximation algorithm for the ℓ2-regression problem that returns an approximate solution by randomly projecting the initial problem to dimensions linear in the rank of the constraint matrix. Second we give improved approximation algorithms for the low rank matrix approximation problem with respect to the spectral norm.

1 Introduction

In many scientific applications, data is often naturally expressed as a matrix, and computational problems on such data are reduced to standard matrix operations including matrix multiplication, ℓ2-regression, and low rank matrix approximation.

∗ University of Toronto, Department of Computer Science. Email: avner@cs.toronto.edu.
† University of Toronto, Department of Computer Science. Email: zouzias@cs.toronto.edu.

In this paper we analyze several approximation algorithms with respect to these operations. All of our algorithms share a common underlying framework which can be described as follows: Let A be an input matrix on which we may want to apply a matrix computation in order to infer some useful information about the data that it represents. The main idea is to work with a sample of A (a.k.a. a sketch), call it Ã, and hope that the information obtained from Ã will be in some sense close to the information that would have been extracted from A.

In this generality, the above approach (sometimes called the Monte-Carlo method for linear algebraic problems) is ubiquitous, and is responsible for much of the development in fast matrix computations [FKV04, DKM06a, Sar06, DMM06, AM07, CW09, DR10].

As we sample A to create a sketch Ã, our goal is twofold: (i) guarantee that Ã resembles A in the relevant measure, and (ii) achieve such an Ã using as few samples as possible. The standard tool that provides a handle on these requirements when the objects are real numbers is the Chernoff bound inequality. However, since we deal with matrices, we would like to have an analogous probabilistic tool suitable for matrices. Quite recently a non-trivial generalization of Chernoff-type inequalities for matrix-valued random variables was introduced by Ahlswede and Winter [AW02]. Such inequalities are suitable for the type of problems that we will consider here. However, this type of inequalities and their variants that have been proposed in the literature [GLF+09, Rec09, Gro09, Tro10] all suffer from the fact that their bounds depend on the dimensionality of the samples. We argue that in a wide range of applications, this dependency can be quite detrimental.

Specifically, whenever the following two conditions hold we typically provide stronger bounds compared with the existing tools: (a) the input matrix has low intrinsic dimensionality, such as rank or stable rank; (b) the matrix samples themselves have low rank. The validity of condition (a) is very common in applications, for the simple reason that viewing data using matrices typically leads to redundant representations. Typical sampling methods tend to rely on extremely simple

Copyright by SIAM.
Unauthorized reproduction of this article is prohibited.

sampling matrices, i.e., samples that are supported on only one entry [AHK06, AM07, DZ10] or samples that are obtained as the outer product of the sampled rows or columns [DKM06a, RV07]; therefore condition (b) is often natural to assume. By incorporating the rank assumption on the matrix samples into the above matrix-valued inequalities, we are able to develop a dimension-free matrix-valued Chernoff bound. See Theorem 1.1 for more details.
Fundamental to the applications we derive are two probabilistic tools that provide concentration bounds for certain random matrices. These tools are inherently different, where each pertains to a different sampling procedure. In the first, we multiply the input matrix by a random sign matrix, whereas in the second we sample rows according to a distribution that depends on the input matrix. In particular, the first method is oblivious (the probability space does not depend on the input matrix), while the second is not.

The first tool is the so-called subspace Johnson-Lindenstrauss lemma. Such a result was obtained in [Sar06] (see also [Cla08, Theorem 1.3]), although it appears implicitly in results extending the original Johnson-Lindenstrauss lemma (see [Mag07]). The techniques for proving such a result with a possibly worse bound are not new and can be traced back even to Milman's proof of Dvoretsky's theorem [Mil71].
Lemma 1.1. (Subspace JL lemma [Sar06]) Let W ⊆ R^d be a linear subspace of dimension k and ε ∈ (0, 1/3). Let R be a t × d random sign matrix rescaled by 1/√t, namely R_ij = ±1/√t with equal probability. Then

(1.1)    P( ∀ w ∈ W : (1 − ε) ‖w‖₂² ≤ ‖Rw‖₂² ≤ (1 + ε) ‖w‖₂² ) ≥ 1 − c₂^k exp(−c₁ ε² t),

where c₁ > 0, c₂ > 1 are constants.
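The content of the lemma is easy to check numerically. The following is a minimal NumPy sketch (the dimensions, sketch size, and random seed are arbitrary illustrative choices, not taken from the paper): it draws a random k-dimensional subspace of R^d, forms a rescaled random sign matrix R, and verifies that ‖Rw‖₂² stays close to ‖w‖₂² for a batch of unit vectors w in the subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, t = 1000, 5, 400      # ambient dimension, subspace dimension, sketch size

# Orthonormal basis of a random k-dimensional subspace W of R^d.
W, _ = np.linalg.qr(rng.standard_normal((d, k)))

# t x d random sign matrix rescaled by 1/sqrt(t), as in Lemma 1.1.
R = rng.choice([-1.0, 1.0], size=(t, d)) / np.sqrt(t)

# Evaluate ||Rw||^2 for a batch of random unit vectors w in W;
# the lemma says these stay within [1 - eps, 1 + eps] simultaneously.
ws = W @ rng.standard_normal((k, 200))
ws /= np.linalg.norm(ws, axis=0)
ratios = np.linalg.norm(R @ ws, axis=0) ** 2
print(ratios.min(), ratios.max())
```

With these parameters the squared norms concentrate near 1, consistent with the multiplicative guarantee of the lemma.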

The importance of such a tool is that it allows us to get bounds on the necessary dimensions of the random sign matrix in terms of the rank of the input matrices; see Theorem 3.2 (i.a).

While the assumption that the input matrices have low rank is a fairly reasonable assumption, one should be a little cautious, as the property of having low rank is not robust. Indeed, if random noise is added to a matrix, even one of low rank, the matrix obtained will have full rank almost surely. On the other hand, it can be shown that the added noise cannot distort the Frobenius and operator norms significantly, which makes the notion of stable rank robust, and so the assumption of low stable rank on the input is more applicable than the low rank assumption.

Given the above discussion, we resort to a different methodology, called matrix-valued Chernoff bounds. These are non-trivial generalizations of the standard Chernoff bounds over the reals and were first introduced in [AW02]. Part of the contribution of the current work is to show that such inequalities, similarly to their real-valued ancestors, provide powerful tools to analyze randomized algorithms. There is a rapidly growing line of research exploiting the power of such inequalities, including matrix approximation by sparsification [AM07, DZ10]; analysis of algorithms for matrix completion and decomposition of low rank matrices [CR07, Gro09, Rec09]; and semi-definite relaxation and rounding of quadratic maximization problems [Nem07, So09a, So09b].

The quality of these bounds can be measured by the number of samples needed in order to obtain small error probability. The original result of [AW02, Theorem 19] shows that¹ if M is distributed according to some distribution over n × n matrices with zero mean², and if M₁, ..., M_t are independent copies of M, then for any ε > 0,

(1.2)    P( ‖(1/t) Σ_{i=1}^t M_i‖₂ > ε ) ≤ n exp(−C ε² t / γ²),

where ‖M‖₂ ≤ γ holds almost surely and C > 0 is an absolute constant.

¹ For ease of presentation we actually provide the restatement presented in [WX08, Theorem 2.6], which is more suitable for this discussion.
² Zero mean means that the (matrix-valued) expectation is the zero n × n matrix.

Notice that the number of samples in Ineq. (1.2) depends logarithmically on n. In general, unfortunately, such a dependency is inevitable: take for example a diagonal random sign matrix of dimension n. The operator norm of the sum of t independent samples is precisely the maximum deviation among n independent random walks of length t. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that t should grow logarithmically with n in this scenario.

In their seminal paper, Rudelson and Vershynin provide a matrix-valued Chernoff bound that avoids the dependency on the dimensions by assuming that the matrix samples are the outer product x ⊗ x of a randomly distributed vector x [RV07]. It turns out that this assumption is too strong in most applications, such as the ones we study in this work, and so we wish to relax it without increasing the bound significantly. In the following theorem we replace this assumption with that of having low rank. We should note that we


are not aware of a simple way to extend Theorem 3.1 of [RV07] to the low rank case, even constant rank. The main technical obstacle is the use of the powerful Rudelson selection lemma, see [Rud99] or Lemma 3.5 of [RV07], which applies only to Rademacher sums of outer products of vectors. We bypass this obstacle by proving a more general lemma, see Lemma 5.2. The proof of Lemma 5.2 relies on the non-commutative Khintchine moment inequality [LP86, Buc01], which is also the backbone of the proof of Rudelson's selection lemma. With Lemma 5.2 at our disposal, the proof techniques of [RV07] can be adapted to support our more general condition.
Theorem 1.1. Let 0 < ε < 1 and let M be a random symmetric real matrix with ‖E M‖₂ ≤ 1 and ‖M‖₂ ≤ γ almost surely. Assume that each element in the support of M has rank at most r. Set t = Ω(γ log(γ/ε²)/ε²). If r ≤ t holds almost surely, then

P( ‖(1/t) Σ_{i=1}^t M_i − E M‖₂ > ε ) ≤ 1/poly(t),

where M₁, M₂, ..., M_t are i.i.d. copies of M.

Proof. See Appendix, page 12.
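The regime of the theorem can be illustrated empirically. The sketch below is an illustration with arbitrary parameters (it is not the paper's construction): it averages t i.i.d. rank-2 symmetric samples whose expectation is (2/n)·I, and reports the spectral-norm deviation of the average from the mean, which is small even though the ambient dimension n is not tiny.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 200, 2000     # ambient dimension and number of matrix samples

def sample():
    # A rank-2 symmetric sample built from two random vectors;
    # E[x x^T] = I/n when x ~ N(0, I/n), so E[sample] = 2 I / n.
    x = rng.standard_normal(n) / np.sqrt(n)
    y = rng.standard_normal(n) / np.sqrt(n)
    return np.outer(x, x) + np.outer(y, y)

avg = sum(sample() for _ in range(t)) / t
EM = 2.0 * np.eye(n) / n
err = np.linalg.norm(avg - EM, 2)     # spectral-norm deviation
print(err)
```

Here each sample has rank 2 and bounded spectral norm, matching the low-rank hypothesis of the theorem; the observed deviation is governed by t rather than by n.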

Remark 1. (Optimality) The above theorem cannot be improved in terms of the number of samples required without changing its form, since in the special case where the rank of the samples is one it is exactly the statement of Theorem 3.1 of [RV07], see [RV07, Remark 3.4].

We highlight the usefulness of the above main tools by first proving a dimension-free approximation algorithm for matrix multiplication with respect to the spectral norm (Section 3.1). Utilizing this matrix multiplication bound we get an approximation algorithm for the ℓ2-regression problem which returns an approximate solution by randomly projecting the initial problem to dimensions linear in the rank of the constraint matrix (Section 3.2). Finally, in Section 3.3 we give improved approximation algorithms for the low rank matrix approximation problem with respect to the spectral norm, and moreover answer in the affirmative a question left open by the authors of [NDT09].

2 Preliminaries and Definitions

The next discussion reviews several definitions and facts from linear algebra; for more details, see [SS90, GV96, Bha96]. We abbreviate the terms "independently and identically distributed" and "almost surely" with i.i.d. and a.s., respectively. We let S^{n−1} := {x ∈ R^n | ‖x‖₂ = 1} be the (n − 1)-dimensional sphere. A random Gaussian matrix is a matrix whose entries are i.i.d. standard Gaussians, and a random sign matrix is a matrix whose entries are independent Bernoulli random variables, that is, they take values from {±1} with equal probability. For a matrix A ∈ R^{n×m}, A_(i) and A^(j) denote the i-th row and j-th column, respectively. For a matrix with rank r, the singular value decomposition (SVD) of A is the decomposition of A as UΣV⊤, where U ∈ R^{n×r} and V ∈ R^{m×r} have orthonormal columns, and Σ = diag(σ₁(A), ..., σ_r(A)) is an r × r diagonal matrix. We further assume σ₁ ≥ ... ≥ σ_r > 0 and call these real numbers the singular values of A. By A_k = U_k Σ_k V_k⊤ we denote the best rank-k approximation to A, where U_k and V_k are the matrices formed by the first k columns of U and V, respectively. We denote by ‖A‖₂ = max{‖Ax‖₂ | ‖x‖₂ = 1} the spectral norm of A, and by ‖A‖_F = √(Σ_{i,j} A_ij²) the Frobenius norm of A. We denote by A† the Moore-Penrose pseudo-inverse of A, i.e., A† = VΣ⁻¹U⊤. Notice that σ₁(A) = ‖A‖₂. Also we define sr(A) := ‖A‖_F² / ‖A‖₂², the stable rank of A. Notice that the inequality sr(A) ≤ rank(A) always holds. The orthogonal projection of a matrix A onto the row-space of a matrix C is denoted by P_C(A) = AC†C. By P_{C,k}(A) we denote the best rank-k approximation of the matrix P_C(A).

3 Applications

All the proofs of this section have been deferred to Section 4.

3.1 Matrix Multiplication. The seminal research of [FKV04] focuses on using non-uniform row sampling to speed up the running time of several matrix computations. The subsequent developments of [DKM06a, DKM06b, DKM06c] also study the performance of Monte-Carlo algorithms on primitive matrix algorithms, including the matrix multiplication problem with respect to the Frobenius norm. Sarlos [Sar06] extended (and improved) this line of research using random projections. Most of the bounds for approximating matrix multiplication in the literature are with respect to the Frobenius norm [DKM06a, Sar06, CW09]. In some cases, the techniques that are utilized for bounding the Frobenius norm also imply weak bounds for the spectral norm, see [DKM06a, Theorem 4] or [Sar06, Corollary 11], which is similar to part (i.a) of Theorem 3.2.

In this section we develop approximation algorithms for matrix multiplication with respect to the spectral norm. The algorithms that will be presented in this section are based on the tools mentioned in Section 1.
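The oblivious sketching scheme studied in this section can be exercised directly in NumPy. The snippet below is a hedged illustration (all dimensions and the sketch size are arbitrary choices for demonstration): it forms the sketches Ã = RA and B̃ = RB with a rescaled random sign matrix and measures the spectral-norm error of Ã⊤B̃ relative to ‖A‖₂‖B‖₂.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, r = 5000, 60, 40, 8

# Rank-r inputs sharing the "long" dimension n.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
B = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))

t = 400                                             # sketch size, t << n
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
A_sk, B_sk = R @ A, R @ B                           # the sketches

err = np.linalg.norm(A_sk.T @ B_sk - A.T @ B, 2)
scale = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
print(err / scale)                                  # relative spectral-norm error
```

Note that only t = 400 rows replace n = 5000, yet the product is recovered to small relative error in the spectral norm, in line with a bound of the form ε‖A‖₂‖B‖₂ for t growing with the rank.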


Table 1: Summary of matrix-valued Chernoff bounds. M is a probability distribution over symmetric n × n matrices; M₁, ..., M_t are i.i.d. copies of M.

Assumption on the sample M                          | # of samples (t)            | Failure Prob.             | References  | Comments
‖M‖₂ ≤ γ a.s.                                       | Ω(γ² log(n)/ε²)             | 1/poly(n)                 | [WX08]      | Hoeffding
‖M‖₂ ≤ γ a.s., ‖E M²‖₂ ≤ ρ²                         | Ω((ρ² + γε/3) log(n)/ε²)    | 1/poly(n)                 | [Rec09]     | Bernstein
‖M‖₂ ≤ γ a.s., M = x ⊗ x, ‖E M‖₂ ≤ 1                | Ω(γ log(γ/ε²)/ε²)           | exp(−Ω(ε²t/(γ log t)))    | [RV07]      | Rank one
‖M‖₂ ≤ γ, rank(M) ≤ t a.s., ‖E M‖₂ ≤ 1              | Ω(γ log(γ/ε²)/ε²)           | 1/poly(t)                 | Theorem 1.1 | Low rank
Before stating our main dimension-free matrix multiplication theorem (Theorem 3.2), we discuss the best possible bound that can be achieved using the currently known matrix-valued inequalities (to the best of our knowledge). Consider a direct application of Ineq. (1.2), where a similar analysis to that in the proof of Theorem 3.2 (ii) would allow us to achieve a bound of Ω(r̃² log(m + p)/ε²) on the number of samples (details omitted). However, as the next theorem indicates (proof omitted), we can get a linear dependency on the stable rank of the input matrices by gaining from the variance information of the samples; more precisely, this can be achieved by applying the matrix-valued Bernstein inequality, see e.g. [GLF+09], [Rec09, Theorem 3.2] or [Tro10, Theorem 2.10].

Theorem 3.1. Let 0 < ε < 1/2 and let A ∈ R^{n×m}, B ∈ R^{n×p} both having stable rank at most r̃. The following hold:

(i) Let R be a t × n random sign matrix rescaled by 1/√t. Denote Ã = RA and B̃ = RB. If t = Ω(r̃ log(m + p)/ε²), then

P( ‖Ã⊤B̃ − A⊤B‖₂ ≤ ε ‖A‖₂ ‖B‖₂ ) ≥ 1 − 1/poly(r̃).

(ii) Let p_i = ‖A_(i)‖₂ ‖B_(i)‖₂ / S, where S = Σ_{i=1}^n ‖A_(i)‖₂ ‖B_(i)‖₂, be a probability distribution over [n]. If we form a t × m matrix Ã and a t × p matrix B̃ by taking t = Ω(r̃ log(m + p)/ε²) i.i.d. (row indices) samples from p_i, then

P( ‖Ã⊤B̃ − A⊤B‖₂ ≤ ε ‖A‖₂ ‖B‖₂ ) ≥ 1 − 1/poly(r̃).

Notice that the above bounds depend linearly on the stable rank of the matrices and logarithmically on their dimensions. As we will see in the next theorem, we can remove the dependency on the dimensions and replace it with the stable rank. Recall that in most cases matrices do have low stable rank, which is much smaller than their dimensionality.

Theorem 3.2. Let 0 < ε < 1/2 and let A ∈ R^{n×m}, B ∈ R^{n×p} both having rank at most r and stable rank at most r̃. The following hold:

(i) Let R be a t × n random sign matrix rescaled by 1/√t. Denote Ã = RA and B̃ = RB.

(a) If t = Ω(r/ε²), then

P( ∀ x ∈ R^m, y ∈ R^p : |x⊤(Ã⊤B̃ − A⊤B)y| ≤ ε ‖Ax‖₂ ‖By‖₂ ) ≥ 1 − e^{−Ω(r)}.

(b) If t = Ω(r̃/ε⁴), then

P( ‖Ã⊤B̃ − A⊤B‖₂ ≤ ε ‖A‖₂ ‖B‖₂ ) ≥ 1 − e^{−Ω(r̃/ε²)}.

(ii) Let p_i = ‖A_(i)‖₂ ‖B_(i)‖₂ / S, where S = Σ_{i=1}^n ‖A_(i)‖₂ ‖B_(i)‖₂, be a probability distribution over [n]. If we form a t × m matrix Ã and a t × p matrix B̃ by taking t = Ω(r̃ log(r̃/ε²)/ε²) i.i.d. (row indices) samples from p_i, then

P( ‖Ã⊤B̃ − A⊤B‖₂ ≤ ε ‖A‖₂ ‖B‖₂ ) ≥ 1 − 1/poly(r̃).

Remark 2. In part (ii), we can actually achieve the stronger bound of t = Ω(√(sr(A) sr(B)) log(sr(A) sr(B)/ε⁴)/ε²) (see proof). However, for ease of presentation and comparison we give the above displayed bound.
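The non-uniform row sampling of part (ii) can be sketched as follows (a minimal illustration with arbitrary parameters; the constants are not the ones from the theorem): rows are drawn i.i.d. with probability p_i proportional to ‖A_(i)‖₂‖B_(i)‖₂ and rescaled by 1/√(t·p_i), so that Ã⊤B̃ is an unbiased estimator of A⊤B.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 20000, 30, 20
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, p))

# Row-norm probabilities p_i proportional to ||A_(i)||_2 * ||B_(i)||_2.
w = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
prob = w / w.sum()

t = 2000
idx = rng.choice(n, size=t, p=prob)
scale = 1.0 / np.sqrt(t * prob[idx])      # importance-sampling rescaling
A_sk = A[idx] * scale[:, None]            # t x m sketch
B_sk = B[idx] * scale[:, None]            # t x p sketch

# A_sk.T @ B_sk is an unbiased estimator of A.T @ B.
rel = np.linalg.norm(A_sk.T @ B_sk - A.T @ B, 2) / (
    np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
print(rel)
```

Unlike the sign-matrix sketch, this procedure is not oblivious: the distribution p_i must be computed from the input matrices before sampling.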
Part (i.b) follows from (i.a) via a simple truncation argument. This was pointed out to us by Mark Rudelson (personal communication). To understand the significance of, and the differences between, the different components of this theorem, we first note that the probabilistic event of part (i.a) is superior to the probabilistic event of (i.b) and (ii). Indeed, when B = A the former implies that |x⊤(Ã⊤Ã − A⊤A)x| < ε x⊤A⊤Ax for every x, which is stronger than ‖Ã⊤Ã − A⊤A‖₂ ≤ ε ‖A‖₂². We will heavily exploit this fact in Section 4.3 to prove Theorem 3.4 (i.a) and (ii). Also notice that part (i.b) is

essentially computationally inferior to (ii), as it gives the same bound while it is more expensive computationally to multiply the matrices by random sign matrices than to just sample their rows. However, the advantage of part (i) is that the sampling process is oblivious, i.e., it does not depend on the input matrices.

3.2 ℓ2-regression. In this section we present an approximation algorithm for the least-squares regression problem: given an n × m, n > m, real matrix A of rank r and a real vector b ∈ R^n, we want to compute x_opt = A†b, which minimizes ‖Ax − b‖₂ over all x ∈ R^m. In their seminal paper [DMM06], Drineas et al. show that if we non-uniformly sample t = Ω(m²/ε²) rows from A and b, then with high probability the optimum solution of the t × d sampled problem will be within (1 + ε) close to the original problem. The main drawback of their approach is that finding, or even approximating, the sampling probabilities is computationally intractable. Sarlos [Sar06] improved the above to t = Ω(m log m/ε²) and gave the first o(nm²) relative error approximation algorithm for this problem.

In the next theorem we eliminate the extra log m factor from Sarlos' bounds and, more importantly, replace the dimension (number of variables) m with the rank r of the constraint matrix A. We should point out that, independently, the same bound as our Theorem 3.3 was recently obtained by Clarkson and Woodruff [CW09] (see also [DMMS09]). The proof of Clarkson and Woodruff uses heavy machinery and a completely different approach. In a nutshell, they manage to improve the matrix multiplication bound with respect to the Frobenius norm. They achieve this by bounding higher moments of the Frobenius norm of the approximation, viewed as a random variable, instead of bounding the local differences for each coordinate of the product. To do so, they rely on intricate moment calculations spanning over four pages, see [CW09] for more. On the other hand, the proof of the present ℓ2-regression bound uses only basic matrix analysis, elementary deviation bounds and ε-net arguments. More precisely, we argue that Theorem 3.2 (i.a) immediately implies that randomly projecting to dimensions linear in the intrinsic dimensionality of the constraints, i.e., the rank of A, is sufficient, as the following theorem indicates.

Theorem 3.3. Let A ∈ R^{n×m} be a real matrix of rank r and b ∈ R^n. Let min_{x∈R^m} ‖b − Ax‖₂ be the ℓ2-regression problem, where the minimum is achieved with x_opt = A†b. Let 0 < ε < 1/3, let R be a t × n random sign matrix rescaled by 1/√t, and let x̃_opt = (RA)†Rb. If t = Ω(r/ε), then with high probability,

(3.3)    ‖b − Ax̃_opt‖₂ ≤ (1 + ε) ‖b − Ax_opt‖₂.

If t = Ω(r/ε²), then with high probability,

(3.4)    ‖x_opt − x̃_opt‖₂ ≤ (ε / σ_min(A)) ‖b − Ax_opt‖₂.

Remark 3. The above result can be easily generalized to the case where b is an n × p matrix B of rank at most r (see proof). This is known in the literature as the generalized ℓ2-regression problem, i.e., arg min_{X∈R^{m×p}} ‖AX − B‖₂, where B is an n × p rank-r matrix.

3.3 Spectral Low Rank Matrix Approximation. A large body of work on low rank matrix approximations [DK03, FKV04, DRVW06, Sar06, RV07, AM07, RST09, CW09, NDT09, HMT09] has recently been developed with the main objective being more efficient algorithms for this task. Most of these results study approximation algorithms with respect to the Frobenius norm, except for [RV07, NDT09], which handle the spectral norm.

In this section we present two (1 + ε)-relative-error approximation algorithms for this problem with respect to the spectral norm: given an n × m, n > m, real matrix A of rank r, we wish to compute A_k = U_k Σ_k V_k⊤, which minimizes ‖A − X_k‖₂ over the set of n × m matrices X_k of rank k. The first additive bound for this problem was obtained in [RV07]. To the best of our knowledge, the best relative bound was recently achieved in [NDT09, Theorem 1]. The latter result is not directly comparable with ours, since it uses a more restricted projection methodology, and so their bound is weaker compared to our results. The first algorithm randomly projects the rows of the input matrix onto t dimensions. Here, we set t to be either Ω(r/ε²), in which case we get a (1 + ε) error guarantee, or Ω(k/ε²), in which case we show a (2 + ε √((r − k)/k)) error approximation. In both cases the algorithm succeeds with high probability. The second approximation algorithm samples non-uniformly Ω(r log(r/ε²)/ε²) rows from A in order to satisfy the (1 + ε) guarantee with high probability.

The following lemma (Lemma 3.1) is essential for proving both relative error bounds of Theorem 3.4. It gives a sufficient condition that any matrix Ã should satisfy in order to get a (1 + ε) spectral low rank matrix approximation of A for every k, 1 ≤ k ≤ rank(A).

Lemma 3.1. Let A be an n × m matrix and ε > 0. If there exists a t × m matrix Ã such that for every x ∈ R^m, (1 − ε) x⊤A⊤Ax ≤ x⊤Ã⊤Ãx ≤ (1 + ε) x⊤A⊤Ax, then

‖A − P_{Ã,k}(A)‖₂ ≤ (1 + ε) ‖A − A_k‖₂,

for every k = 1, ..., rank(A).
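The randomized ℓ2-regression of Theorem 3.3 above is straightforward to exercise numerically. The sketch below is an illustration with arbitrary dimensions and sketch size (the constants are not the theorem's): it compares the residual of x̃_opt = (RA)†Rb against the optimal residual of x_opt = A†b.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 8000, 50, 10

# Rank-r constraint matrix and an arbitrary target vector.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
b = rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)         # x_opt = A^+ b

t = 600                                               # t << n, growing with r only
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
x_sk, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)  # (RA)^+ (Rb)

ratio = np.linalg.norm(b - A @ x_sk) / np.linalg.norm(b - A @ x_opt)
print(ratio)    # at least 1 by optimality of x_opt, and close to 1 for t >> r
```

Note that the sketched problem has only t rows, so solving it is much cheaper than the original n-row problem; the residual ratio plays the role of the (1 + ε) factor in Ineq. (3.3).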


The theorem below shows that it is possible to satisfy the conditions of Lemma 3.1 by randomly projecting A onto Ω(r/ε²) dimensions or by non-uniformly sampling Ω(r log(r/ε²)/ε²) i.i.d. rows of A, as described in parts (i.a) and (ii), respectively.

Theorem 3.4. Let 0 < ε < 1/3 and let A = UΣV⊤ be a real n × m matrix of rank r with n ≥ m.

(i) (a) Let R be a t × n random sign matrix rescaled by 1/√t and set Ã = RA. If t = Ω(r/ε²), then with high probability

‖A − P_{Ã,k}(A)‖₂ ≤ (1 + ε) ‖A − A_k‖₂,

for every k = 1, ..., r.

(b) Let R be a t × n random Gaussian matrix rescaled by 1/√t and set Ã = RA. If t = Ω(k/ε²), then with high probability

‖A − P_{Ã,k}(A)‖₂ ≤ (2 + ε √((r − k)/k)) ‖A − A_k‖₂.

(ii) Let p_i = ‖U_(i)‖₂² / r be a probability distribution over [n]. Let Ã be a t × m matrix that is formed (row-by-row) by taking t i.i.d. samples from p_i, rescaled appropriately. If t = Ω(r log(r/ε²)/ε²), then with high probability

‖A − P_{Ã,k}(A)‖₂ ≤ (1 + ε) ‖A − A_k‖₂,

for every k = 1, ..., r.

We should highlight that in part (ii) the probability distribution p_i is in general hard to compute. Indeed, computing ‖U_(i)‖₂² requires computing the SVD of A. In general, these values are known as statistical leverage scores [DM10]. In the special case where A is an edge-vertex incidence matrix of an undirected weighted graph, p_i, the probability distribution over edges (rows), corresponds to the effective resistance of the i-th edge [SS08].

Theorem 3.4 gives a (1 + ε) approximation algorithm for the special case of low rank matrices. However, as discussed in Section 1, such an assumption is too restrictive for most applications. In the following theorem, we make a step further and relax the rank condition to a condition that depends on the stable rank of the residual matrix A − A_k. More formally, for an integer k ≥ 1, we say that a matrix A has a k-low stable rank tail iff εk ≥ sr(A − A_k).

Notice that the above definition is useful since it contains the set of matrices whose spectrum follows a power-law distribution and those with exponentially decaying spectrum. Therefore the following theorem, combined with the remark below, (partially) answers in the affirmative the question posed by [NDT09]: is there a relative error approximation algorithm with respect to the spectral norm when the spectrum of the input matrix decays in a power law?

Theorem 3.5. Let 0 < ε < 1/3 and let A be a real n × m matrix with a k-low stable rank tail. Let R be a t × n random sign matrix rescaled by 1/√t and set Ã = RA. If t = Ω(k/ε⁴), then with high probability

‖A − P_{Ã,k}(A)‖₂ ≤ (2 + ε) ‖A − A_k‖₂.

Remark 4. The (2 + ε) bound can be improved to a relative (1 + ε) error bound if we return as the approximate solution a slightly higher rank matrix, i.e., by returning the matrix P_Ã(A), which has rank at most t = Ω(k/ε⁴) (see [HMT09, Theorem 9.1]).

4 Proofs

4.1 Proof of Theorem 3.2 (Matrix Multiplication)

Random Projections - Part (i)

Part (a): In this section we show the first, to the best of our knowledge, non-trivial spectral bound for matrix multiplication. Although the proof is an immediate corollary of the subspace Johnson-Lindenstrauss lemma (Lemma 1.1), this result is powerful enough to give, for example, tight bounds for the ℓ2-regression problem. We prove the following more general theorem, from which Theorem 3.2 (i.a) follows by plugging in t = Ω(r/ε²).

Theorem 4.1. Let A ∈ R^{n×m} and B ∈ R^{n×p}. Assume that the ranks of A and B are at most r. Let R be a t × n random sign matrix rescaled by 1/√t. Denote Ã = RA and B̃ = RB. The following inequality holds:

P( ∀ x ∈ R^m, y ∈ R^p : |x⊤(Ã⊤B̃ − A⊤B)y| ≤ ε ‖Ax‖₂ ‖By‖₂ ) ≥ 1 − c₂^{2r} exp(−c₁ ε² t),

where c₁ > 0, c₂ > 1 are constants.

Proof. (of Theorem 4.1) Let A = U_A Σ_A V_A⊤ and B = U_B Σ_B V_B⊤ be the singular value decompositions of A and B, respectively. Notice that U_A ∈ R^{n×r_A} and U_B ∈ R^{n×r_B}, where r_A and r_B are the ranks of A and B, respectively. Let x₁ ∈ R^m and x₂ ∈ R^p be two arbitrary unit vectors, and let w₁ = Ax₁ and w₂ = Bx₂. Recall that

‖A⊤R⊤RB − A⊤B‖₂ =

 

|x
1 (A R RB A B)x2 |.

The
Gaussian distribution is symmetric, so Gij and
tRij |Gij |, where Gij is a Gaussian random variWe will bound the last term for any arbitrary vector. able have the same distribution. By
Jensens inequality and the fact that E |Gij | =
2/, we get that
Denote with V the subspace3 colspan(UA )colspan(UB )

of Rn . Notice that the size of dim(V) rA + rB 2r.


2/ E RA2 E GA2 / t. Dene
 the function
Applying Lemma 1.1 to V, we get that with probability f : {1}tn R by f (S) =  1 SA . The calcu t

at least 1 cr2 exp(c1 2 t) that
2
lation above shows
that median(f ) 2. Since f is
2
2
2
(4.5)
v V : | Rv2 v2 | v2 .
convex and (1/ t)-Lipschitz as a function of the entries
of S, Talagrands measure concentration inequality for
Therefore we get that for any unit vectors v1 , v2 V:
convex functions yields
sup

x1 Sm1 ,x2 Sp1

(Rv1 ) Rv2

=
+
=

Rv1 + Rv2 2 Rv1 Rv2 2


4
2
2
(1 + ) v1 + v2 2 (1 ) v1 v2 2
4
2
2
v1 + v2 2 v1 v2 2
4
v1 + v2 22 + v1 v2 22

4
2
2
v

1
2 + v2 2
= v1 v2 + ,
v1 v2 +
2

where the rst equality follows from the Parallelogram


law, the rst inequality follows from Equation (4.5),
and the last inequality since v1 , v2 are unit vectors.
By similar considerations we get that (Rv1 ) Rv2
v1 v2 . By linearity of R, we get that
v1 , v2 V : |(Rv1 ) Rv2 v1 v2 | v1 2 v2 2 .
Notice that w1 , w2 V, hence |w1 R Rw2 w1 w2 |
w1 2 w2 2 = Ax1 2 Bx2 2 .
Part (b): We start with a technical lemma that
bounds the spectral norm of any matrix A whenits
multiplied by a random sign matrix rescaled by 1/ t.
Lemma 4.1. Let A be an n m real matrix, and
let
R be a t n random sign matrix rescaled by 1/ t. If
t sr (A), then

P (RA2 median(f ) + ) 2 exp( 2 t/2).


Setting = 1 in the above inequality implies the lemma.
Now using the above Lemma together with Theorem 3.2
(i.a) and a simple truncation argument we can prove
part (i.b).
Proof. (of Theorem 3.2 (i.b)) Without loss of generality assume that A2 = B2 = 1. Set r =
(A),sr(B)}
 = A Ar , B
 = B Br .
1600 max{sr
. Set A
2
rank(A)
2
j (A)2 ,
Since AF = j=1
 
 
A

 
AF
B

 

, and B
.
 F
40
40
r
2
r

By triangle inequality, it follows that




  

A B A B 
2





(4.7)
A
r R RBr Ar Br 2



  
+ A
R RBr 

2 

   
    
(4.8)
+ Ar R RB  + A
R RB 
2


2



  

   

+ A
(4.9)
Br  + A
+
A
B
B


 .
r
2

Choose a constant in Theorem 3.2 (i.a) so that the


failure probability of the right hand side of (4.7) does
not exceed exp(c2 t), where c = c1 /32. The same
(4.6)
P (RA2 4 A2 ) 2et/2 .
argument shows that P (RAr 2 1 + ) exp(c2 t)
Proof. Without
loss of generality assume that A2 = 1. and P (RBr 2 1 + ) exp(c2 t). This combined
 and B
 yields that the
Then AF =
sr (A). Let G be a t n Gaussian with Lemma 4.1 applied on A
4
matrix. Then by the Gordon-Chev`et inequality
sum in (4.8) is less than 2(1 + )/10 + 2 /100. Also,
since Ar 2 , Br 2 1, the sum in (4.9) is less that
E GA2 It 2 AF + It F A2
2/10 + 2 /100. Combining the bounds for (4.7), (4.8)

and (4.9) concludes the claim.


= AF + t 2 t.
3 We

denote by colspan(A) the subspace generated by the


columns of A, and rowspan(A) the subspace generated by the
rows of A.
4 For example, set S = I , T = A in [HMT09, Proposit
tion 10.1, p. 54].

Row Sampling - Part (ii ): By homogeneity


normalize A and B such
n that A2 = B2 = 1.
Notice that A B =
Dene pi =
i=1 A(i) B(i) .


n 
A
  
(i) 2 B(i) 2
, where S = i=1 A(i)  B(i) 2 . Also
S


define a distribution over matrices in R^{(m+p)×(m+p)} with n elements by

P( M = (1/p_i) [0, A_{(i)}^T B_{(i)}; B_{(i)}^T A_{(i)}, 0] ) = p_i,   i = 1, ..., n,

where [0, X; X^T, 0] denotes the symmetric block matrix with off-diagonal blocks X and X^T. First notice that

E M = Σ_{i=1}^n p_i (1/p_i) [0, A_{(i)}^T B_{(i)}; B_{(i)}^T A_{(i)}, 0] = [0, A^T B; B^T A, 0].

This implies that ‖E M‖_2 = ‖A^T B‖_2 ≤ 1. Next notice that the spectral norm of the random matrix M is upper bounded by √(sr(A) sr(B)) almost surely. Indeed,

‖M‖_2 ≤ sup_{i∈[n]} ‖A_{(i)}^T B_{(i)}‖_2 / p_i = S · sup_{i∈[n]} ‖A_{(i)}^T B_{(i)}‖_2 / (‖A_{(i)}‖_2 ‖B_{(i)}‖_2) = S = Σ_{i=1}^n ‖A_{(i)}‖_2 ‖B_{(i)}‖_2 ≤ ‖A‖_F ‖B‖_F = √(sr(A) sr(B)) ≤ (sr(A) + sr(B))/2,

by the definition of p_i, properties of norms, the Cauchy–Schwarz inequality, and the arithmetic/geometric mean inequality. Notice that this quantity (since the spectral norms of both A and B are one) is at most r̃ by assumption. Also notice that every element in the support of the random variable M has rank at most two. It is easy to see that, by setting γ = r̃, all the conditions of Theorem 1.1 are satisfied, and hence we get indices i_1, i_2, ..., i_t from [n], with t = Ω(r̃ log(r̃/ε²)/ε²), such that with high probability

‖ (1/t) Σ_{j=1}^t (1/p_{i_j}) [0, A_{(i_j)}^T B_{(i_j)}; B_{(i_j)}^T A_{(i_j)}, 0] − [0, A^T B; B^T A, 0] ‖_2 ≤ ε.

The first sum can be rewritten as [0, Ã^T B̃; B̃^T Ã, 0], where

Ã^T = (1/√t) [ A_{(i_1)}^T/√p_{i_1}, A_{(i_2)}^T/√p_{i_2}, ..., A_{(i_t)}^T/√p_{i_t} ]   and   B̃^T = (1/√t) [ B_{(i_1)}^T/√p_{i_1}, B_{(i_2)}^T/√p_{i_2}, ..., B_{(i_t)}^T/√p_{i_t} ],

and since the spectral norm of [0, X; X^T, 0] equals ‖X‖_2, this concludes part (ii).

4.2 Proof of Theorem 3.3 (ℓ2-regression)

Proof. (of Theorem 3.3) Similarly as the proof in [Sar06]. Let A = UΣV^T be the SVD of A. Let b = Ax_opt + w, where w ∈ R^n and w ⊥ colspan(A). Also let A(x̂_opt − x_opt) = Uy, where y ∈ R^{rank(A)}. Our goal is to bound the quantity

(4.10)   ‖b − Ax̂_opt‖_2² = ‖b − Ax_opt − A(x̂_opt − x_opt)‖_2² = ‖w − Uy‖_2² = ‖w‖_2² + ‖Uy‖_2² = ‖w‖_2² + ‖y‖_2²,

since w ⊥ colspan(U) and U^T U = I. It suffices to bound the norm of y, i.e., to show that ‖y‖_2 ≤ 4ε′‖w‖_2. Recall that, given A and b, the vector w is uniquely defined. On the other hand, the vector y depends on the random projection R. Next we show the connection between y and w through the normal equations:

(4.11)   U^T R^T RUy = U^T R^T RA(x̂_opt − x_opt) = U^T R^T (Rb − RAx_opt) = U^T R^T (R(Ax_opt + w) − RAx_opt) = U^T R^T Rw,

where we used the normal equations (RA)^T RA x̂_opt = (RA)^T Rb of the projected problem to derive Eq. (4.11). A crucial observation is that colspan(U) is perpendicular to w. Set A = B = U in Theorem 3.2, set ε′ = √ε, and t = Ω(r/ε′²). Notice that rank(A) + rank(B) ≤ 2r; hence with constant probability we know that 1 − ε′ ≤ σ_i(RU) ≤ 1 + ε′. It follows that ‖U^T R^T RUy‖_2 ≥ (1 − ε′)² ‖y‖_2. A similar argument (set A = U and B = w in Theorem 3.2) guarantees that ‖U^T R^T Rw‖_2 = ‖U^T R^T Rw − U^T w‖_2 ≤ ε′ ‖U‖_2 ‖w‖_2 = ε′ ‖w‖_2, since U^T w = 0 and ‖U‖_2 = 1 (because U^T U = I). Therefore, taking Euclidean norms on both sides of Equation (4.11), we get with constant probability that

‖y‖_2 ≤ ε′/(1 − ε′)² ‖w‖_2 ≤ 4ε′ ‖w‖_2.

Summing up, it follows from Equation (4.10) that, with constant probability, ‖b − Ax̂_opt‖_2² ≤ (1 + 16ε′²) ‖b − Ax_opt‖_2² = (1 + 16ε) ‖b − Ax_opt‖_2². This proves Ineq. (3.3).

Ineq. (3.4) follows directly from the bound on the norm of y, repeating the above proof for ε′. First recall that x_opt is in the row span of A, since x_opt = VΣ^{−1}U^T b and the columns of V span the row space of A. Similarly for x̂_opt, since the row span of RA is contained in the row span of A. Indeed,

‖x̂_opt − x_opt‖_2 ≤ ‖A(x̂_opt − x_opt)‖_2 / σ_min(A) = ‖Uy‖_2 / σ_min(A) = ‖y‖_2 / σ_min(A) ≤ 4ε′ ‖w‖_2 / σ_min(A).

This concludes the theorem.
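The proof above is constructive: x̂_opt is obtained by solving the much smaller t × m sketched problem min_x ‖RAx − Rb‖₂ instead of the full n × m one. A minimal numpy sketch of this sketch-and-solve scheme (an illustration under our own choice of sizes and of a Gaussian R; the constants are not the theorem's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 2000, 5, 200           # tall least-squares problem, sketch size t << n
A = rng.standard_normal((n, m))
b = A @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)          # exact solution
R = rng.standard_normal((t, n)) / np.sqrt(t)           # random projection R
x_hat, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)  # sketched solution

# the sketched residual should be a (1 + eps)-approximation of the optimum
res_opt = np.linalg.norm(b - A @ x_opt)
res_hat = np.linalg.norm(b - A @ x_hat)
assert res_opt <= res_hat <= 1.5 * res_opt
```

The oversampling factor t/m ≈ 40 here is generous, so the residual inflation is far below the crude 1.5 check used above.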
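The row-sampling procedure of Part (ii) above is simple to state operationally: draw t row indices i.i.d. with probability p_i proportional to ‖A_(i)‖₂‖B_(i)‖₂, rescale, and stack them into the sketches Ã and B̃. A small numpy sketch (our own illustration; the sizes, seed, and the crude error check are arbitrary choices, not the theorem's constants):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, t = 1000, 8, 6, 2000
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, p))

# sampling probabilities p_i proportional to ||A_(i)||_2 * ||B_(i)||_2
row_a = np.linalg.norm(A, axis=1)
row_b = np.linalg.norm(B, axis=1)
probs = row_a * row_b / np.sum(row_a * row_b)

idx = rng.choice(n, size=t, p=probs)
scale = 1.0 / np.sqrt(t * probs[idx])   # rescaling makes tilde-A^T tilde-B unbiased
A_t = scale[:, None] * A[idx]           # sketch tilde-A, t x m
B_t = scale[:, None] * B[idx]           # sketch tilde-B, t x p

err = np.linalg.norm(A_t.T @ B_t - A.T @ B, 2)
bound = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)
assert err < bound   # loose sanity check; the theorem gives err <= eps * ||A||_2 ||B||_2
```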

4.3 Proof of Theorems 3.4, 3.5 (Spectral Low Rank Matrix Approximation)

Proof. (of Lemma 3.1) By the assumption and using Lemma 5.1 we get that

(4.12)   (1 − ε) λ_i(A^T A) ≤ λ_i(Ã^T Ã) ≤ (1 + ε) λ_i(A^T A)

for all i = 1, ..., rank(A). Let Π̃_k be the projection matrix onto the first k right singular vectors of Ã. It follows that for every k = 1, ..., rank(A),

‖A − P_{Ã,k}(A)‖_2²
  = sup_{x∈R^m, ‖x‖_2=1} ‖A(I − Π̃_k)x‖_2²
  = sup_{x∈ker Π̃_k, ‖x‖_2=1} ‖Ax‖_2²
  = sup_{x∈ker Π̃_k, ‖x‖_2=1} x^T A^T A x
  ≤ (1/(1 − ε)) sup_{x∈ker Π̃_k, ‖x‖_2=1} x^T Ã^T Ã x
  = (1/(1 − ε)) λ_{k+1}(Ã^T Ã)
  ≤ ((1 + ε)/(1 − ε)) λ_{k+1}(A^T A)
  = ((1 + ε)/(1 − ε)) ‖A − A_k‖_2²
  ≤ (1 + 3ε) ‖A − A_k‖_2²,

using that x ∈ ker Π̃_k implies (I − Π̃_k)x = x, the left side of the hypothesis, Courant–Fischer on Ã^T Ã (see Eqn. (5.17)), Eqn. (4.12), and properties of singular values, respectively, together with ε < 1/3.

Proof of Theorem 3.4 (i):

Part (a): Now we are ready to prove the first corollary of our matrix multiplication result for the problem of computing an approximate low rank approximation of a matrix with respect to the spectral norm (Theorem 3.4). Set Ã = (1/√t) RA, where R is an Ω(r/ε²) × n random sign matrix. Applying Theorem 3.2 (i.a) on A, we have with high probability that

(4.13)   ∀x ∈ R^m,   (1 − ε) x^T A^T A x ≤ x^T Ã^T Ã x ≤ (1 + ε) x^T A^T A x.

Combining Lemma 3.1 with Ineq. (4.13) concludes the proof.

Part (b): The proof is based on the following lemma, which reduces the problem of low rank matrix approximation to the problem of bounding the norm of a random matrix. We restate it here for the reader's convenience and for completeness [NDT09, Lemma 8] (see also [HMT09, Theorem 9.1] or [BMD09]).

Lemma 4.2. Let A = A_k + U_{r−k} Σ_{r−k} V_{r−k}^T, let H_k = U_{r−k} Σ_{r−k}, and let R be any t × n matrix. If the matrix (RU_k) has full column rank, then the following inequality holds:

(4.14)   ‖A − P_{(RA),k}(A)‖_2 ≤ 2 ‖A − A_k‖_2 + ‖(RU_k)† RH_k‖_2.

Notice that the above lemma reduces the problem of spectral low rank matrix approximation to the problem of approximating the spectral norm of the random matrix (RU_k)† RH_k. First notice that by setting t = Ω(k/ε²) we can guarantee that the matrix (RU_k) has full column rank with high probability. Actually, we can say something much stronger: applying Theorem 3.2 (i.a) with A = U_k, we can guarantee that all the singular values of RU_k are within 1 ± ε with high probability. Now, conditioning on this event ((RU_k) has full column rank), it follows from Lemma 4.2 that

‖A − P_{(RA),k}(A)‖_2 ≤ 2 ‖A − A_k‖_2 + ‖(RU_k)†‖_2 ‖RH_k‖_2 ≤ 2 ‖A − A_k‖_2 + (1/(1 − ε)) ‖RH_k‖_2 ≤ 2 ‖A − A_k‖_2 + (3/2) ‖RU_{r−k}‖_2 ‖Σ_{r−k}‖_2,

using the sub-multiplicative property of matrix norms and that ε < 1/3. Now it suffices to bound the norm of W := RU_{r−k}. Recall that R = (1/√t) G, where G is a t × n random Gaussian matrix. It is well known that, by rotational invariance of the Gaussian distribution, the entries of GU_{r−k} are also i.i.d. Gaussian random variables. Now we can use the following fact about random Gaussian matrices to bound the spectral norm of W. Indeed, we have the following

Theorem 4.2. [RV09, Proposition 2.3] Let W be a t × (r − k) random matrix whose entries are independent mean zero Gaussian random variables, and assume that r − k ≥ t. Then

(4.15)   P( ‖W‖_2 ≥ Δ √(r − k) ) ≤ e^{−c₀ Δ² (r−k)}

for any Δ > Δ₀, where Δ₀ is a positive constant.

Applying a union bound to the above theorem, with Δ a sufficiently large constant, and to the conditions of Lemma 4.2, we get that with high probability ‖GU_{r−k}‖_2 ≤ C₃ √(r − k) and ‖(RU_k)†‖_2 ≤ 1/(1 − ε). Hence, Lemma 4.2 combined with the above discussion implies that


‖A − P_{(RA),k}(A)‖_2 ≤ 2 ‖A − A_k‖_2 + (3/2) ‖RU_{r−k}‖_2 ‖Σ_{r−k}‖_2 = 2 ‖A − A_k‖_2 + (3/(2√t)) ‖GU_{r−k}‖_2 ‖A − A_k‖_2 ≤ (2 + c₄ ε √((r − k)/k)) ‖A − A_k‖_2,

where c₄ > 0 is an absolute constant, using that ‖Σ_{r−k}‖_2 = ‖A − A_k‖_2 and that t = Ω(k/ε²). Rescaling ε by c₄ concludes Theorem 3.4 (i.b).

Proof of Theorem 3.4 (ii): Here we prove that we can achieve the same relative error bound as with random projections by just sampling rows of A through a judiciously selected distribution. However, there is a price to pay, and that is an extra logarithmic factor in the number of samples, as stated in Theorem 3.4, part (ii).

Proof. (of Theorem 3.4 (ii)) The proof follows closely the proof of [SS08] and is similar to the proof of part (a). Let A = UΣV^T be the singular value decomposition of A. Define the projector matrix Π = UU^T of size n × n. Clearly, the rank of Π is equal to the rank of A, and Π has the same image as A, since every element in the image of A and of Π is a linear combination of the columns of U. Recall that for any projection matrix Π² = Π, and hence sr(Π) = rank(A) = r. Moreover, ‖Π‖_F² = tr(Π) = Σ_{i=1}^n ‖U_{(i)}‖_2² = tr(U^T U) = r. Let p_i = Π(i,i)/r = ‖U_{(i)}‖_2²/r be a probability distribution on [n], where U_{(i)} is the i-th row of U.

Define a t × n random matrix S as follows: pick t i.i.d. samples from the distribution p; if the i-th sample is equal to j (∈ [n]), then set S_{ij} = 1/√(t p_j). Notice that S has exactly one non-zero entry in each row; hence it has t non-zero entries. Define Ã = SA. It is easy to verify that E_S Π S^T S Π = Π² = Π. Apply Theorem 1.1 (alternatively we can use [RV07, Theorem 3.1], since the matrix samples are rank one) to the matrix Π: notice that ‖Π‖_F² = r and ‖Π‖_2 = 1, and that ‖E_S Π S^T S Π‖_2 ≤ 1; hence the stable rank of Π is r. Therefore, if t = Ω(r log(r/ε²)/ε²), then with high probability

(4.16)   ‖Π S^T S Π − Π‖_2 ≤ ε.

It suffices to show that Ineq. (4.16) is equivalent with the condition of Lemma 3.1. Indeed,

ε ≥ sup_{x∈R^n, x≠0} |x^T (Π S^T S Π − Π) x| / (x^T x)
  ≥ sup_{y∈Im(A), y≠0} |y^T (S^T S − I) y| / (y^T y)
  = sup_{x∈R^m, Ax≠0} |x^T A^T (S^T S − I) A x| / (x^T A^T A x)
  = sup_{x∈R^m, Ax≠0} |x^T (Ã^T Ã − A^T A) x| / (x^T A^T A x),

since y ∈ Im(A) implies Πy = y, Im(A) = Im(Π), and ΠA = A. By re-arranging terms we get Equation (4.13), and so the claim follows.

Proof of Theorem 3.5: Similarly with the proof of Theorem 3.4 (i.b). Following the proof of part (i.b), and conditioning on the event that (RU_k) has full column rank in Lemma 4.2, we get with high probability that

‖A − P_{Ã,k}(A)‖_2 ≤ 2 ‖A − A_k‖_2 + ‖U_k^T R^T R H_k‖_2 / (1 − ε)²,

using the fact that if (RU_k) has full column rank then (RU_k)† = ((RU_k)^T RU_k)^{−1} U_k^T R^T and ‖((RU_k)^T RU_k)^{−1}‖_2 ≤ 1/(1 − ε)². Now observe that U_k^T H_k = 0. Since sr(H_k) ≤ k, using Theorem 3.2 (i.b) with t = Ω(k/ε⁴) we get that ‖U_k^T R^T R H_k‖_2 = ‖U_k^T R^T R H_k − U_k^T H_k‖_2 ≤ ε ‖U_k‖_2 ‖H_k‖_2 = ε ‖A − A_k‖_2 with high probability. Rescaling ε concludes the proof.

5 Acknowledgments

Many thanks go to Petros Drineas for many helpful discussions and for pointing out the connection of Theorem 3.2 with the ℓ2-regression problem. The second author would like to thank Mark Rudelson for his valuable comments on an earlier draft and also for sharing with us the proof of Theorem 3.2 (i.b).


References

[AHK06] S. Arora, E. Hazan, and S. Kale. A Fast Random Sampling Algorithm for Sparsifying Matrices. In Proceedings of the International Workshop on Randomization and Approximation Techniques (RANDOM), pages 272–279, 2006. (Cited on page 2)
[AM07] D. Achlioptas and F. McSherry. Fast Computation of Low-rank Matrix Approximations. Journal of the ACM (JACM), 54(2):9, 2007. (Cited on pages 1, 2 and 5)
[AW02] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002. (Cited on pages 1 and 2)
[Bha96] R. Bhatia. Matrix Analysis, volume 169 of Graduate Texts in Mathematics. Springer, first edition, 1996. (Cited on pages 3 and 13)
[BMD09] C. Boutsidis, M. W. Mahoney, and P. Drineas. An Improved Approximation Algorithm for the Column Subset Selection Problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 968–977, 2009. (Cited on page 9)
[Buc01] A. Buchholz. Operator Khintchine Inequality in Non-commutative Probability. Mathematische Annalen, 319(1):1–16, January 2001. (Cited on pages 3 and 13)
[Cla08] K. L. Clarkson. Tighter Bounds for Random Projections of Manifolds. In Proceedings of the ACM Symposium on Computational Geometry (SoCG), pages 39–48, 2008. (Cited on page 2)
[CR07] E. Candès and J. Romberg. Sparsity and Incoherence in Compressive Sampling. Inverse Problems, 23(3):969, 2007. (Cited on page 2)
[CW09] K. L. Clarkson and D. P. Woodruff. Numerical Linear Algebra in the Streaming Model. In Proceedings of the Symposium on Theory of Computing (STOC), pages 205–214, 2009. (Cited on pages 1, 3 and 5)
[DK03] P. Drineas and R. Kannan. Pass Efficient Algorithms for Approximating Large Matrices. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 223–232, 2003. (Cited on page 5)
[DKM06a] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM Journal on Computing, 36(1):132–157, 2006. (Cited on pages 1, 2 and 3)
[DKM06b] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. SIAM Journal on Computing, 36(1):158–183, 2006. (Cited on page 3)
[DKM06c] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition. SIAM Journal on Computing, 36(1):184–206, 2006. (Cited on page 3)
[DM10] P. Drineas and M. W. Mahoney. Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving. Available at arxiv:1005.3097, May 2010. (Cited on page 6)
[DMM06] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling Algorithms for ℓ2-regression and Applications. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006. (Cited on pages 1 and 5)
[DMMS09] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster Least Squares Approximation. Available at arxiv:0710.1435, May 2009. (Cited on page 5)
[DR10] A. Deshpande and L. Rademacher. Efficient Volume Sampling for Row/column Subset Selection. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), 2010. (Cited on page 1)
[DRVW06] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix Approximation and Projective Clustering via Volume Sampling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1117–1126, 2006. (Cited on page 5)
[DZ10] P. Drineas and A. Zouzias. A Note on Element-wise Matrix Sparsification via Matrix-valued Chernoff Bounds. Available at arxiv:1006.0407, June 2010. (Cited on page 2)
[FKV04] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo Algorithms for Finding Low-rank Approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004. (Cited on pages 1, 3 and 5)
[GLF+09] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert. Quantum State Tomography via Compressed Sensing. Available at arxiv:0909.3304, September 2009. (Cited on pages 1 and 4)
[Gro09] D. Gross. Recovering Low-rank Matrices from Few Coefficients in any Basis. Available at arxiv:0910.1879, December 2009. (Cited on pages 1 and 2)
[GV96] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, October 1996. (Cited on pages 3 and 12)
[HMT09] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions. Available at arxiv:0909.4061, September 2009. (Cited on pages 5, 6, 7 and 9)
[LP86] F. Lust-Piquard. Inégalités de Khintchine dans C_p (1 < p < ∞). C. R. Acad. Sci. Paris Sér. I Math., 303(7):289–292, 1986. (Cited on pages 3 and 13)
[LPP91] F. Lust-Piquard and G. Pisier. Non Commutative Khintchine and Paley Inequalities. Arkiv för Matematik, 29(1-2):241–260, December 1991. (Cited on page 13)
[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3). Springer-Verlag, 1991. Isoperimetry and Processes. (Cited on page 13)
[Mag07] A. Magen. Dimensionality Reductions in ℓ2 that Preserve Volumes and Distance to Affine Spaces. Discrete & Computational Geometry, 38(1):139–153, 2007. (Cited on page 2)
[Mil71] V. D. Milman. A New Proof of A. Dvoretzky's Theorem on Cross-sections of Convex Bodies. Funkcional. Anal. i Prilozhen., 5(4):28–37, 1971. (Cited on page 2)
[NDT09] N. H. Nguyen, T. T. Do, and T. D. Tran. A Fast and Efficient Algorithm for Low-rank Approximation of a Matrix. In Proceedings of the Symposium on Theory of Computing (STOC), pages 215–224, 2009. (Cited on pages 3, 5, 6, 9 and 13)
[Nem07] A. Nemirovski. Sums of Random Symmetric Matrices and Quadratic Optimization under Orthogonality Constraints. Mathematical Programming, 109(2):283–317, 2007. (Cited on page 2)
[Rec09] B. Recht. A Simpler Approach to Matrix Completion. Available at arxiv:0910.0651, October 2009. (Cited on pages 1, 2 and 4)
[RST09] V. Rokhlin, A. Szlam, and M. Tygert. A Randomized Algorithm for Principal Component Analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009. (Cited on page 5)
[Rud99] M. Rudelson. Random Vectors in the Isotropic Position. J. Funct. Anal., 164(1):60–72, 1999. (Cited on pages 3 and 13)
[RV07] M. Rudelson and R. Vershynin. Sampling from Large Matrices: An Approach through Geometric Functional Analysis. Journal of the ACM (JACM), 54(4):21, 2007. (Cited on pages 2, 3, 4, 5, 10 and 13)
[RV09] M. Rudelson and R. Vershynin. The Smallest Singular Value of a Random Rectangular Matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009. (Cited on page 9)
[Sar06] T. Sarlós. Improved Approximation Algorithms for Large Matrices via Random Projections. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 143–152, 2006. (Cited on pages 1, 2, 3, 5 and 8)
[So09a] A. Man-Cho So. Improved Approximation Bound for Quadratic Optimization Problems with Orthogonality Constraints. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1201–1209, 2009. (Cited on page 2)
[So09b] A. Man-Cho So. Moment Inequalities for Sums of Random Matrices and their Applications in Optimization. Mathematical Programming, December 2009. (Cited on page 2)
[SS90] G. W. Stewart and J. G. Sun. Matrix Perturbation Theory (Computer Science and Scientific Computing). Academic Press, June 1990. (Cited on page 3)
[SS08] D. A. Spielman and N. Srivastava. Graph Sparsification by Effective Resistances. In Proceedings of the Symposium on Theory of Computing (STOC), pages 563–568, 2008. (Cited on pages 6 and 10)
[Tro10] J. A. Tropp. User-Friendly Tail Bounds for Sums of Random Matrices. Available at arxiv:1004.4389, April 2010. (Cited on pages 1 and 4)
[WX08] A. Wigderson and D. Xiao. Derandomizing the Ahlswede-Winter Matrix-valued Chernoff Bound using Pessimistic Estimators, and Applications. Theory of Computing, 4(1):53–76, 2008. (Cited on pages 2 and 4)

Appendix

The next lemma states that if a symmetric positive semi-definite matrix Ã approximates the Rayleigh quotient of a symmetric positive semi-definite matrix A, then the eigenvalues of Ã also approximate the eigenvalues of A.

Lemma 5.1. Let 0 < ε < 1. Assume A, Ã are n × n symmetric positive semi-definite matrices such that the following inequality holds:

(1 − ε) x^T A x ≤ x^T Ã x ≤ (1 + ε) x^T A x,   ∀x ∈ R^n.

Then, for i = 1, ..., n, the eigenvalues of A and Ã are the same up to an error factor ε, i.e.,

(1 − ε) λ_i(A) ≤ λ_i(Ã) ≤ (1 + ε) λ_i(A).

Proof. The proof is an immediate consequence of the Courant–Fischer characterization of the eigenvalues. First notice that, by the hypothesis, A and Ã have the same null space. Hence we can assume, without loss of generality, that λ_i(A), λ_i(Ã) > 0 for all i = 1, ..., n. Let λ_i(A) and λ_i(Ã) be the eigenvalues (in non-decreasing order) of A and Ã, respectively. The Courant–Fischer min-max theorem [GV96, p. 394] expresses the eigenvalues as

(5.17)   λ_i(A) = min_{S^i} max_{x∈S^i} (x^T A x) / (x^T x),

where the minimum is over all i-dimensional subspaces S^i. Let S₀^i and S₁^i be the subspaces where the minimum is achieved for the i-th eigenvalue of A and Ã, respectively. Then it follows that

λ_i(Ã) = min_{S^i} max_{x∈S^i} (x^T Ã x)/(x^T x) ≤ max_{x∈S₀^i} (x^T Ã x)/(x^T A x) · (x^T A x)/(x^T x) ≤ (1 + ε) λ_i(A),

and similarly,

λ_i(A) = min_{S^i} max_{x∈S^i} (x^T A x)/(x^T x) ≤ max_{x∈S₁^i} (x^T A x)/(x^T Ã x) · (x^T Ã x)/(x^T x) ≤ λ_i(Ã)/(1 − ε).

Therefore, it follows that for i = 1, ..., n,

(1 − ε) λ_i(A) ≤ λ_i(Ã) ≤ (1 + ε) λ_i(A).
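Lemma 5.1 can be checked numerically: the extreme eigenvalues of A^{−1/2} Ã A^{−1/2} are the best possible Rayleigh-quotient distortion factors, and the two spectra must then be sandwiched accordingly. A small numpy sketch (our own illustration; the test matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)              # symmetric positive definite
H = rng.standard_normal((n, n))
At = A + 0.05 * (H @ H.T)            # PSD perturbation of A, playing tilde-A

# best distortion factors: extreme eigenvalues of A^{-1/2} At A^{-1/2}
w, V = np.linalg.eigh(A)
A_inv_half = V @ np.diag(w ** -0.5) @ V.T
ratios = np.linalg.eigvalsh(A_inv_half @ At @ A_inv_half)
lo, hi = ratios.min(), ratios.max()  # lo * x'Ax <= x'At x <= hi * x'Ax for all x

ev_A = np.linalg.eigvalsh(A)
ev_At = np.linalg.eigvalsh(At)
# eigenvalues are sandwiched exactly as Lemma 5.1 predicts
assert np.all(lo * ev_A <= ev_At + 1e-9)
assert np.all(ev_At <= hi * ev_A + 1e-9)
```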

Proof of Theorem 1.1: For notational convenience, let Z = ‖(1/t) Σ_{i=1}^t M_i − E M‖_2 and define E_p := E_{M₁,M₂,...,M_t} Z^p. Moreover, if X₁, X₂, ..., X_n are copies of a (matrix-valued) random variable X, we will denote E_{X₁,X₂,...,X_n} by E_{X[n]}. Our goal is to give sharp bounds on the moments of the non-negative random


variable Z and then, using the moment method, to give a concentration result for Z.

First we give a technical lemma of independent interest that bounds the p-th moments of Z as a function of p, of r (the rank of the samples), and of the p/2-th moment of the random variable ‖Σ_{j=1}^t M_j²‖_2. More formally, we have the following

Lemma 5.2. Let M₁, ..., M_t be i.i.d. copies of M, where M is a symmetric matrix-valued random variable that has rank at most r almost surely. Then for every p ≥ 2,

(5.18)   E_p ≤ r t^{1−p} (2B_p)^p E_{M[t]} ‖Σ_{j=1}^t M_j²‖_2^{p/2},

where B_p is a constant that depends on p.

We need a non-commutative version of the Khintchine inequality due to F. Lust-Piquard [LP86]; see also [LPP91] and [Buc01, Theorem 5]. We start with some preliminaries. Let A ∈ R^{n×n} and denote by C_p^n the p-th Schatten norm space, i.e., the Banach space of linear operators (or matrices in our setting) on R^n equipped with the norm

(5.19)   ‖A‖_{C_p^n} := ( Σ_{i=1}^n σ_i(A)^p )^{1/p},

where σ_i(A) are the singular values of A; see [Bha96, Chapter IV, p. 92] for a discussion of Schatten norms. Notice that ‖A‖_2 = σ₁(A); hence we have the following inequality:

(5.20)   ‖A‖_2 ≤ ‖A‖_{C_p^n} ≤ (rank(A))^{1/p} ‖A‖_2,

for any p ≥ 1. Notice that when p = log₂(rank(A)), then (rank(A))^{1/log₂(rank(A))} = 2. Therefore, in this case, the Schatten norm is essentially the spectral norm. We are now ready to state the matrix-valued Khintchine inequality; see e.g. [Rud99] or [NDT09, Lemma 8].

Theorem 5.1. Assume 2 ≤ p < ∞. Then there exists a constant B_p such that for any sequence of t symmetric matrices M₁, ..., M_t with M_i ∈ C_p^n the following inequality holds:

(5.21)   ( E_{ε[t]} ‖Σ_{i=1}^t ε_i M_i‖_{C_p^n}^p )^{1/p} ≤ B_p ‖( Σ_{i=1}^t M_i² )^{1/2}‖_{C_p^n},

where for every i ∈ [t], ε_i is a Bernoulli random variable. Moreover, B_p is at most 2^{5/4} e^{−1/2} √p.⁵

⁵ See Eqn. (17) in [NDT09] or [Buc01].

Now we are ready to prove Lemma 5.2.

Proof. (of Lemma 5.2) The proof is inspired by [RV07, Theorem 3.1]. Let p ≥ 2. First, apply a standard symmetrization argument (see [LT91]), which gives that

( E_{M[t]} ‖(1/t) Σ_{i=1}^t M_i − E M‖_2^p )^{1/p} ≤ 2 ( E_{M[t]} E_{ε[t]} ‖(1/t) Σ_{i=1}^t ε_i M_i‖_2^p )^{1/p}.

Indeed, let ε₁, ε₂, ..., ε_t denote independent Bernoulli variables, and let M₁, ..., M_t, M̃₁, ..., M̃_t be independent copies of M. We essentially estimate the p-th root of E_p,

(5.22)   E_p^{1/p} = ( E_{M[t]} ‖(1/t) Σ_{i=1}^t M_i − E M‖_2^p )^{1/p}.

Notice that E M = E_{M̃[t]} (1/t) Σ_{i=1}^t M̃_i. We plug this into (5.22) and apply Jensen's inequality:

E_p^{1/p} = ( E_{M[t]} ‖(1/t) Σ_{i=1}^t M_i − E_{M̃[t]} (1/t) Σ_{i=1}^t M̃_i‖_2^p )^{1/p} ≤ ( E_{M[t]} E_{M̃[t]} ‖(1/t) Σ_{i=1}^t (M_i − M̃_i)‖_2^p )^{1/p}.

Now notice that M_i − M̃_i is a symmetric matrix-valued random variable for every i ∈ [t]; i.e., it is distributed identically with ε_i (M_i − M̃_i). Thus

E_p^{1/p} ≤ ( E_{M[t]} E_{M̃[t]} E_{ε[t]} ‖(1/t) Σ_{i=1}^t ε_i (M_i − M̃_i)‖_2^p )^{1/p}.

Denote Y = (1/t) Σ_{i=1}^t ε_i M_i and Ỹ = (1/t) Σ_{i=1}^t ε_i M̃_i. Then ‖Y − Ỹ‖^p ≤ (‖Y‖ + ‖Ỹ‖)^p ≤ 2^{p−1} (‖Y‖^p + ‖Ỹ‖^p), and E ‖Y‖^p = E ‖Ỹ‖^p. Thus we obtain that

(5.23)   E_p^{1/p} ≤ 2 ( E_{M[t]} E_{ε[t]} ‖(1/t) Σ_{i=1}^t ε_i M_i‖_2^p )^{1/p}.

Now by the Khintchine inequality the following holds


for any fixed symmetric matrices M₁, M₂, ..., M_t:

( E_{ε[t]} ‖(1/t) Σ_{j=1}^t ε_j M_j‖_2^p )^{1/p}
  ≤ (1/t) ( E_{ε[t]} ‖Σ_{j=1}^t ε_j M_j‖_{C_p^n}^p )^{1/p}
  ≤ (B_p/t) ‖( Σ_{j=1}^t M_j² )^{1/2}‖_{C_p^n}
  ≤ (rt)^{1/p} (B_p/t) ‖( Σ_{j=1}^t M_j² )^{1/2}‖_2
(5.24)   = (rt)^{1/p} (B_p/t) ‖Σ_{j=1}^t M_j²‖_2^{1/2},

taking 1/t outside the expectation and using the left part of Ineq. (5.20), Ineq. (5.21), the right part of Ineq. (5.20), and the fact that the matrix Σ_{j=1}^t M_j² has rank at most rt.

Now, raising Ineq. (5.24) to the p-th power on both sides and then taking expectation with respect to M₁, ..., M_t, it follows from Ineq. (5.23) that

E_p ≤ r t^{1−p} (2B_p)^p E_{M[t]} ‖Σ_{j=1}^t M_j²‖_2^{p/2}.

This concludes the proof of Lemma 5.2.

Now we are ready to prove Theorem 1.1. First, we can assume without loss of generality that M ⪰ 0 almost surely, losing only a constant factor in our bounds. Indeed, by the spectral decomposition theorem any symmetric matrix can be written as M = Σ_j λ_j u_j u_j^T. Set M₊ = Σ_{λ_j ≥ 0} λ_j u_j u_j^T and M₋ = M − M₊. It is clear that ‖M₊‖_2, ‖M₋‖_2 ≤ ‖M‖_2, that ‖M₊‖_F, ‖M₋‖_F ≤ ‖M‖_F, and that rank(M₊), rank(M₋) ≤ rank(M). The triangle inequality tells us that

‖(1/t) Σ_{j=1}^t M_j − E M‖_2 ≤ ‖(1/t) Σ_{j=1}^t (M_j)₊ − E M₊‖_2 + ‖(1/t) Σ_{j=1}^t (M_j)₋ − E M₋‖_2,

and one can bound each term of the right hand side separately. Hence, from now on, we assume that M ⪰ 0 a.s. Now use the fact that, for every j ∈ [t], M_j² ⪯ γ M_j, since the M_j's are positive semi-definite and ‖M‖_2 ≤ γ almost surely. Summing up all these inequalities we get that

(5.25)   ‖Σ_{j=1}^t M_j²‖_2 ≤ γ ‖Σ_{j=1}^t M_j‖_2.

It follows that

E_p ≤ r t^{1−p} (2B_p)^p E_{M[t]} ‖Σ_{j=1}^t M_j²‖_2^{p/2}
    ≤ r t^{1−p} (2B_p)^p γ^{p/2} E_{M[t]} ‖Σ_{j=1}^t M_j‖_2^{p/2}
    = ( rt (2B_p)^p γ^{p/2} / t^{p/2} ) E_{M[t]} ‖(1/t) Σ_{j=1}^t M_j‖_2^{p/2}
    ≤ ( rt (2B_p)^p γ^{p/2} / t^{p/2} ) E_{M[t]} ( ‖(1/t) Σ_{j=1}^t M_j − E M‖_2 + ‖E M‖_2 )^{p/2}
    ≤ ( rt (2B_p)^p γ^{p/2} / t^{p/2} ) ( E_p^{1/p} + 1 )^{p/2},

using Lemma 5.2, Ineq. (5.25), the triangle inequality, Minkowski's and Jensen's inequalities, the definition of E_p, and the assumption ‖E M‖_2 ≤ 1. This implies the inequality

(5.26)   E_p^{1/p} ≤ ( 2B_p √γ (rt)^{1/p} / √t ) ( E_p^{1/(2p)} + 1 ),

using √(1+x) ≤ 1 + √x, x ≥ 0. Let a_p = 4B_p √γ (rt)^{1/p} / √t. Then it follows from the above inequality that E_p^{1/p} ≤ (a_p/2)(E_p^{1/(2p)} + 1), and hence⁶ min{E_p^{1/p}, 1} ≤ a_p. Also notice that

(5.27)   ( E min{Z, 1}^p )^{1/p} ≤ min( E_p^{1/p}, 1 ).

Now, for any 0 < ε < 1,

P(Z > ε) = P(min{Z, 1} > ε).

⁶ Indeed, if E_p^{1/p} < 1, then E_p^{1/p} < a_p. Otherwise 1 ≤ a_p.

By the moment method we have that

P(min{Z, 1} > ε) = P(min{Z, 1}^p > ε^p)
  ≤ inf_{p≥2} E min{Z, 1}^p / ε^p
  ≤ inf_{p≥2} ( min{E_p^{1/p}, 1} )^p / ε^p        (by (5.27))
  ≤ inf_{p≥2} (a_p / ε)^p
  = inf_{p≥2} ( 4B_p √γ (rt)^{1/p} / (ε√t) )^p
  = inf_{p≥2} ( C₂ √(γp) (rt)^{1/p} / (ε√t) )^p,

where C₂ > 0 is an absolute constant (using that B_p ≤ 2^{5/4} e^{−1/2} √p). Now assume that r ≤ t and set p = c₂ log t, where c₂ > 0 is a sufficiently large constant, in the infimum expression of the above inequality; it follows that

P( ‖(1/t) Σ_{i=1}^t M_i − E M‖_2 > ε ) ≤ ( C √(γ log t) (rt)^{1/log t} / (ε√t) )^{c₂ log t}.

We want to make the base of the above exponent smaller than one. It is easy to see that this is possible if we set t = C₀ (γ/ε²) log(C₀ γ/ε²), where C₀ is a sufficiently large absolute constant. Hence the above probability is at most 1/poly(t). This concludes the proof.
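To illustrate the statement just proved with rank-one samples (our own numerical sketch, in the spirit of the leverage-score sampling used in Theorem 3.4 (ii)): sample rows of a rank-r projection Π = UU^T with probabilities p_i = ‖U_{(i)}‖_2²/r. Each rescaled sample then has spectral norm exactly r (playing the role of γ), and the empirical average concentrates around E M = Π in spectral norm.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, t = 50, 3, 3000
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))  # U: orthonormal n x r basis
P = Q @ Q.T                                       # projection Pi, rank r
probs = np.sum(Q ** 2, axis=1) / r                # leverage scores p_i, sum to 1

idx = rng.choice(n, size=t, p=probs)
rows = P[idx]                                     # sampled rows Pi_(i) of Pi
est = (rows / probs[idx, None]).T @ rows / t      # (1/t) sum Pi_(i)^T Pi_(i) / p_i

err = np.linalg.norm(est - P, 2)                  # small for t >> gamma log(...)
assert err < 0.5
```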

