Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication
Avner Magen
Anastasios Zouzias
Abstract
In this paper we develop algorithms for approximating matrix multiplication with respect to the spectral norm. Let $A \in \mathbb{R}^{n\times m}$ and $B \in \mathbb{R}^{n\times p}$ be two matrices and $\varepsilon > 0$. We approximate the product $A^\top B$ using two sketches $\widetilde{A} \in \mathbb{R}^{t\times m}$ and $\widetilde{B} \in \mathbb{R}^{t\times p}$, where $t \ll n$, such that
\[ \|\widetilde{A}^\top \widetilde{B} - A^\top B\|_2 \le \varepsilon\, \|A\|_2 \|B\|_2 \]
with high probability.
1 Introduction
In many scientific applications, data is often naturally expressed as a matrix, and computational problems on such data are reduced to standard matrix operations, including matrix multiplication, $\ell_2$-regression, and low-rank matrix approximation.
University of Toronto, Department of Computer Science. Email: avner@cs.toronto.edu.
University of Toronto, Department of Computer Science. Email: zouzias@cs.toronto.edu.
Let $R$ be a $t \times d$ random sign matrix rescaled by $1/\sqrt{t}$.
Given the above discussion, we resort to a different methodology, called matrix-valued Chernoff bounds. These are non-trivial generalizations of the standard Chernoff bounds over the reals and were first introduced in [AW02]. Part of the contribution of the current work is to show that such inequalities, similarly to their real-valued ancestors, provide powerful tools to analyze randomized algorithms. There is a rapidly growing line of research exploiting the power of such inequalities, including matrix approximation by sparsification [AM07, DZ10]; analysis of algorithms for matrix completion and decomposition of low rank matrices [CR07, Gro09, Rec09]; and semi-definite relaxation and rounding of quadratic maximization problems [Nem07, So09a, So09b].
The quality of these bounds can be measured by the number of samples needed in order to obtain small error probability. The original result of [AW02, Theorem 19] shows that if $M$ is distributed according to some distribution over $n \times n$ matrices with zero mean and $\|M\|_2 \le 1$ almost surely, and if $M_1, \ldots, M_t$ are independent copies of $M$, then for every $\varepsilon > 0$,
\[ (1.2) \qquad \mathbb{P}\left(\Big\|\frac{1}{t}\sum_{i=1}^{t} M_i\Big\|_2 > \varepsilon\right) \le n \exp\left(-C\varepsilon^2 t\right), \]
where $C > 0$ is an absolute constant.
Our main result (Theorem 1.1) shows that when, in addition, every matrix on the support of $M$ has rank at most $r$, the dimension $n$ in such bounds can be replaced by a quantity depending only on the rank and the number of samples: for $t$ sufficiently large,
\[ \mathbb{P}\left(\Big\|\frac{1}{t}\sum_{i=1}^{t} M_i - \mathbb{E}M\Big\|_2 > \varepsilon\right) \le \frac{1}{\mathrm{poly}(t)}. \]
Table 1: Comparison of matrix-valued probabilistic inequalities ([WX08], [Rec09], [RV07], and Theorem 1.1) by type (Hoeffding or Bernstein) and by the rank of the samples (rank one or low rank).
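As a sanity check on the role of the sample count $t$, the following NumPy sketch (ours, not from the paper) averages $t$ i.i.d. rank-one matrices with identity expectation and reports the spectral deviation; the deviation shrinks as $t$ grows, as (1.2) and Theorem 1.1 predict.

```python
import numpy as np

# Monte Carlo sketch (illustrative only): M = n * e_i e_i^T with i uniform
# over [n], so E[M] = I_n and every sample has rank one.  We average t
# i.i.d. copies and measure || (1/t) sum_i M_i - E[M] ||_2.
rng = np.random.default_rng(0)
n, trials = 50, 20
for t in (200, 2000, 20000):
    devs = []
    for _ in range(trials):
        counts = np.bincount(rng.integers(0, n, size=t), minlength=n)
        avg = np.diag(n * counts / t)   # (1/t) * sum of the sampled n*e_i e_i^T
        devs.append(np.linalg.norm(avg - np.eye(n), 2))
    print(t, np.median(devs))           # deviation decays roughly like 1/sqrt(t)
```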
Theorem 3.2. Let $A \in \mathbb{R}^{n\times m}$ and $B \in \mathbb{R}^{n\times p}$ be matrices of rank at most $r$, let $\widetilde{r}$ bound the stable ranks $\operatorname{sr}(A)$ and $\operatorname{sr}(B)$, and let $\varepsilon > 0$.

(i) Let $R$ be a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$. Denote by $\widetilde{A} = RA$ and $\widetilde{B} = RB$.

(a) If $t = \Omega(r/\varepsilon^2)$ then
\[ \mathbb{P}\left(\forall x \in \mathbb{R}^m, y \in \mathbb{R}^p,\; |x^\top(\widetilde{A}^\top\widetilde{B} - A^\top B)y| \le \varepsilon\|Ax\|_2\|By\|_2\right) \ge 1 - e^{-\Omega(r)}. \]

(b) If $t = \Omega(\widetilde{r}/\varepsilon^4)$ then
\[ \mathbb{P}\left(\|\widetilde{A}^\top\widetilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2\right) \ge 1 - e^{-\Omega(\widetilde{r}/\varepsilon^2)}. \]

(ii) Let $p_i = \|A_{(i)}\|_2\|B_{(i)}\|_2/S$, where $S = \sum_{i=1}^n \|A_{(i)}\|_2\|B_{(i)}\|_2$, be a probability distribution over $[n]$. If we form a $t \times m$ matrix $\widetilde{A}$ and a $t \times p$ matrix $\widetilde{B}$ by taking $t = \Omega(\widetilde{r}\log(\widetilde{r}/\varepsilon^2)/\varepsilon^2)$ i.i.d. (row indices) samples from $p_i$, then
\[ \mathbb{P}\left(\|\widetilde{A}^\top\widetilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2\right) \ge 1 - \frac{1}{\mathrm{poly}(\widetilde{r})}. \]

A similar analysis to that in the proof of Theorem 3.2 (ii) would allow us to achieve a bound of $\Omega(\widetilde{r}^2\log(m+p)/\varepsilon^2)$ on the number of samples (details omitted). However, as the next theorem indicates (proof omitted), we can get linear dependency on the stable rank of the input matrices, gaining from the variance information of the samples; more precisely, this can be achieved by applying the matrix-valued Bernstein inequality, see e.g. [GLF+09], [Rec09, Theorem 3.2], or [Tro10, Theorem 2.10].

Theorem 3.3. With the notation of Theorem 3.2:

(i) Let $R$ be a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$, and denote $\widetilde{A} = RA$ and $\widetilde{B} = RB$. If $t = \Omega(\widetilde{r}\log(m+p)/\varepsilon^2)$ then
\[ \mathbb{P}\left(\|\widetilde{A}^\top\widetilde{B} - A^\top B\|_2 \le \varepsilon\|A\|_2\|B\|_2\right) \ge 1 - \frac{1}{\mathrm{poly}(\widetilde{r})}. \]

(ii) If we form $\widetilde{A}$ and $\widetilde{B}$ by taking $t = \Omega(\widetilde{r}\log(m+p)/\varepsilon^2)$ i.i.d. (row indices) samples from the distribution $p_i$ above, then the same bound holds.
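A minimal NumPy sketch of the two estimators in Theorem 3.2 may help fix ideas; the function names and test sizes are ours, and the $1/\sqrt{t\,p_i}$ row rescaling in the sampling variant follows the construction used in the proof of part (ii).

```python
import numpy as np

def amm_sign_sketch(A, B, t, rng):
    """Oblivious sketch (Theorem 3.2 (i)): R is a t x n random sign
    matrix rescaled by 1/sqrt(t); returns (RA)^T (RB)."""
    n = A.shape[0]
    R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)
    return (R @ A).T @ (R @ B)

def amm_row_sampling(A, B, t, rng):
    """Non-oblivious sketch (Theorem 3.2 (ii)): sample t row indices
    i.i.d. with p_i proportional to ||A_(i)|| * ||B_(i)||, rescale rows
    by 1/sqrt(t p_i)."""
    p = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(A.shape[0], size=t, p=p)
    scale = 1.0 / np.sqrt(t * p[idx])
    return (A[idx] * scale[:, None]).T @ (B[idx] * scale[:, None])

rng = np.random.default_rng(1)
n, m, p, t = 5000, 40, 30, 400
A, B = rng.standard_normal((n, m)), rng.standard_normal((n, p))
exact = A.T @ B
for approx in (amm_sign_sketch(A, B, t, rng), amm_row_sampling(A, B, t, rng)):
    err = np.linalg.norm(approx - exact, 2)
    print(err / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2)))  # the epsilon achieved
```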
It is computationally more expensive to multiply the matrices by random sign matrices than to just sample their rows. However, the advantage of part (i) is that the sampling process is oblivious, i.e., it does not depend on the input matrices.

3.2 $\ell_2$-regression In this section we present an approximation algorithm for the least-squares regression problem: given an $n \times m$ ($n > m$) real matrix $A$ of rank $r$ and a vector $b \in \mathbb{R}^n$, compute $x_{\mathrm{opt}} = \arg\min_{x\in\mathbb{R}^m}\|b - Ax\|_2$. Let $R$ be a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$, and let $\widetilde{x}_{\mathrm{opt}} = \arg\min_{x\in\mathbb{R}^m}\|R(b - Ax)\|_2$ be the solution of the sketched problem. If $t = \Omega(r/\varepsilon)$, then with high probability,
\[ (3.3) \qquad \|b - A\widetilde{x}_{\mathrm{opt}}\|_2 \le (1+\varepsilon)\,\|b - Ax_{\mathrm{opt}}\|_2 \]
and
\[ (3.4) \qquad \|x_{\mathrm{opt}} - \widetilde{x}_{\mathrm{opt}}\|_2 \le \frac{\sqrt{\varepsilon}}{\sigma_{\min}(A)}\,\|b - Ax_{\mathrm{opt}}\|_2, \]
where $\sigma_{\min}(A)$ is the smallest non-zero singular value of $A$.

Remark 3. The above result can be easily generalized to the case where $b$ is an $n \times p$ matrix $B$ of rank at most $r$ (see proof). This is known as the generalized $\ell_2$-regression problem in the literature, i.e., $\arg\min_{X\in\mathbb{R}^{m\times p}}\|AX - B\|_2$, where $B$ is an $n \times p$ rank-$r$ matrix.
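The following sketch (our naming and sizes) illustrates the scheme analyzed above: solve the $t \times m$ sketched problem $\arg\min_x \|R(b - Ax)\|_2$ in place of the $n \times m$ one and compare residuals.

```python
import numpy as np

# Sketched least squares: solve argmin_x ||R(b - Ax)||_2 with R a rescaled
# random sign matrix.  The theorem suggests t = Omega(r / eps) rows suffice
# for a (1 + eps) relative residual.
rng = np.random.default_rng(2)
n, m, t = 10000, 50, 600
A = rng.standard_normal((n, m))
b = A @ rng.standard_normal(m) + 0.1 * rng.standard_normal(n)

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)           # exact solution
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)   # sign sketch
x_skt, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)   # sketched solution

res_opt = np.linalg.norm(b - A @ x_opt)
res_skt = np.linalg.norm(b - A @ x_skt)
print("relative residual inflation:", res_skt / res_opt)  # about 1 + eps
```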
3.3 Low rank matrix approximation The theorem below shows that it is possible to satisfy the conditions of Lemma 3.1 by randomly projecting $A$ onto $t = \Omega(r/\varepsilon^2)$ dimensions or by non-uniform i.i.d. sampling of $\Omega(r\log(r/\varepsilon^2)/\varepsilon^2)$ rows of $A$, as described in parts (i.a) and (ii), respectively.
Theorem 3.4. Let $0 < \varepsilon < 1/3$ and let $A = U\Sigma V^\top$ be a real $n \times m$ matrix of rank $r$ with $n \ge m$.

(i) Let $R$ be a $t \times n$ random Gaussian matrix rescaled by $1/\sqrt{t}$ and set $\widetilde{A} = RA$.

(a) If $t = \Omega(r/\varepsilon^2)$, then with high probability
\[ \|A - P_{\widetilde{A},k}(A)\|_2 \le (1+\varepsilon)\,\|A - A_k\|_2, \]
for every $k = 1, \ldots, r$.

(b) If $t = \Omega(k/\varepsilon^2)$, then with high probability
\[ \|A - P_{\widetilde{A},k}(A)\|_2 \le \left(2 + \varepsilon\sqrt{\frac{r-k}{k}}\right)\|A - A_k\|_2. \]

(ii) Let $p_i = \|U_{(i)}\|_2^2/r$ be a probability distribution over $[n]$, and let $\widetilde{A}$ be a $t \times m$ matrix that is formed (row-by-row) by taking $t$ i.i.d. samples from $p_i$, rescaled appropriately. If $t = \Omega(r\log(r/\varepsilon^2)/\varepsilon^2)$, then with high probability
\[ \|A - P_{\widetilde{A},k}(A)\|_2 \le (1+\varepsilon)\,\|A - A_k\|_2, \]
for every $k = 1, \ldots, r$.

Here $A_k$ denotes the best rank-$k$ approximation of $A$, and $P_{\widetilde{A},k}(A)$ denotes the projection of $A$ onto the span of the top $k$ right singular vectors of $\widetilde{A}$.
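In code, $P_{\widetilde{A},k}(A)$ amounts to an SVD of the small sketch $\widetilde{A} = RA$ followed by a projection of $A$; the sketch below (our naming, illustrative parameters) compares the resulting spectral error against $\|A - A_k\|_2$.

```python
import numpy as np

def low_rank_via_sketch(A, k, t, rng):
    """P_{Atilde,k}(A): project A onto the top-k right singular
    directions of the sketch Atilde = RA (Theorem 3.4 (i), our naming)."""
    R = rng.standard_normal((t, A.shape[0])) / np.sqrt(t)   # Gaussian R
    _, _, Vt = np.linalg.svd(R @ A, full_matrices=False)
    Vk = Vt[:k].T                                  # top-k right singular vectors
    return A @ Vk @ Vk.T

rng = np.random.default_rng(3)
n, m, r, k = 3000, 300, 60, 10
# Synthetic rank-r matrix with geometrically decaying spectrum.
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = U @ np.diag(0.8 ** np.arange(r)) @ V.T

_, s, Vt = np.linalg.svd(A, full_matrices=False)
best_k = A @ Vt[:k].T @ Vt[:k]                    # the optimal rank-k projection
err_best = np.linalg.norm(A - best_k, 2)          # equals sigma_{k+1}
err_sketch = np.linalg.norm(A - low_rank_via_sketch(A, k, t=200, rng=rng), 2)
print(err_sketch / err_best)                      # modest factor, per the theorem
```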
We should highlight that in part (ii) the probability distribution $p_i$ is in general hard to compute: computing $\|U_{(i)}\|_2^2$ requires computing the SVD of $A$. In general, these values are known as statistical leverage scores [DM10]. In the special case where $A$ is an edge-vertex incidence matrix of an undirected weighted graph, $p_i$, the probability distribution over the edges (rows), corresponds to the effective resistance of the $i$-th edge [SS08].

Theorem 3.4 gives a $(1+\varepsilon)$-approximation algorithm for the special case of low rank matrices. However, as discussed in Section 1, such an assumption is too restrictive for most applications. In the following theorem, we make a step further and relax the rank condition to a condition that depends on the stable rank of the residual matrix $A - A_k$. More formally, for an integer $k \ge 1$, we say that a matrix $A$ has a $k$-low stable rank tail iff $k \ge \operatorname{sr}(A - A_k)$. Notice that the above definition is useful since it contains the set of matrices whose spectrum decays quickly, e.g., follows a power law.

4 Proofs

4.1 Proof of Theorem 3.2 (Matrix Multiplication)

Random Projections - Part (i)

Part (a): In this section we show the first, to the best of our knowledge, non-trivial spectral bound for matrix multiplication. Although the proof is an immediate corollary of the subspace Johnson-Lindenstrauss lemma (Lemma 1.1), this result is powerful enough to give, for example, tight bounds for the $\ell_2$-regression problem. We prove the following more general theorem, from which Theorem 3.2 (i.a) follows by plugging in $t = \Omega(r/\varepsilon^2)$.

Theorem 4.1. Let $A \in \mathbb{R}^{n\times m}$ and $B \in \mathbb{R}^{n\times p}$. Assume that the ranks of $A$ and $B$ are at most $r$. Let $R$ be a $t \times n$ random sign matrix rescaled by $1/\sqrt{t}$. Denote by $\widetilde{A} = RA$ and $\widetilde{B} = RB$. The following inequality holds:
\[ \mathbb{P}\left(\forall x\in\mathbb{R}^m, y\in\mathbb{R}^p,\; |x^\top(\widetilde{A}^\top\widetilde{B} - A^\top B)y| \le \varepsilon\|Ax\|_2\|By\|_2\right) \ge 1 - c_2^{\,r}\exp(-c_1\varepsilon^2 t), \]
where $c_1 > 0$, $c_2 > 1$ are constants.

Proof. (of Theorem 4.1) Let $A = U_A\Sigma_A V_A^\top$ and $B = U_B\Sigma_B V_B^\top$ be the singular value decompositions of $A$ and $B$, respectively. Notice that $U_A \in \mathbb{R}^{n\times r_A}$ and $U_B \in \mathbb{R}^{n\times r_B}$, where $r_A$ and $r_B$ are the ranks of $A$ and $B$, respectively. Let $x_1 \in \mathbb{R}^m$ and $x_2 \in \mathbb{R}^p$ be two arbitrary unit vectors, and let $w_1 = Ax_1$ and $w_2 = Bx_2$. Recall that
\[ \|\widetilde{A}^\top\widetilde{B} - A^\top B\|_2 = \sup_{\|x_1\|_2 = \|x_2\|_2 = 1} |x_1^\top(A^\top R^\top R B - A^\top B)x_2|. \]
We will bound the term $|x_1^\top(A^\top R^\top RB - A^\top B)x_2| = |(Rw_1)^\top(Rw_2) - w_1^\top w_2|$ for arbitrary unit vectors. Denote by $V$ the subspace spanned by $\operatorname{colspan}(U_A) \cup \operatorname{colspan}(U_B)$; its dimension is at most $2r$. The Gaussian distribution is symmetric, so $G_{ij}$ and $\sqrt{t}\,R_{ij}|G_{ij}|$, where $G_{ij}$ is a standard Gaussian random variable, have the same distribution; by Jensen's inequality and the fact that $\mathbb{E}|G_{ij}| = \sqrt{2/\pi}$, the sign case reduces to the Gaussian case. Applying the subspace Johnson-Lindenstrauss lemma to $V$, with high probability, for all $v_1, v_2 \in V$,
\[ (Rv_1)^\top(Rv_2) = v_1^\top v_2 \pm \varepsilon\,\frac{\|v_1\|_2^2 + \|v_2\|_2^2}{2}. \]
Since $w_1, w_2 \in V$, applying this with $v_1 = w_1/\|w_1\|_2$ and $v_2 = w_2/\|w_2\|_2$ gives $|(Rw_1)^\top(Rw_2) - w_1^\top w_2| \le \varepsilon\|w_1\|_2\|w_2\|_2 = \varepsilon\|Ax_1\|_2\|Bx_2\|_2$, which proves the theorem.

Part (b): Write $A = A_1 + A_2$ and $B = B_1 + B_2$, where $A_1$ (resp. $B_1$) retains the singular values of $A$ (resp. $B$) that are at least $\frac{\varepsilon\|A\|_F}{40\sqrt{\widetilde{r}}}$ (resp. $\frac{\varepsilon\|B\|_F}{40\sqrt{\widetilde{r}}}$). The head parts have rank $O(\widetilde{r}/\varepsilon^2)$ and the tail parts have small spectral norm, so the claim follows by applying part (a) to the head pair and the triangle inequality to the remaining terms.
Sampling - Part (ii): Assume without loss of generality that $\|A\|_2 = \|B\|_2 = 1$. Define the symmetric matrix-valued random variable $M$ that takes the value
\[ M = \frac{1}{p_i}\begin{pmatrix} 0 & A_{(i)}^\top B_{(i)} \\ B_{(i)}^\top A_{(i)} & 0 \end{pmatrix} \quad\text{with probability } p_i, \qquad i \in [n]. \]
Since $\sum_{i=1}^n A_{(i)}^\top B_{(i)} = A^\top B$,
\[ (4.10)\qquad \mathbb{E}M = \sum_{i=1}^n \begin{pmatrix} 0 & A_{(i)}^\top B_{(i)} \\ B_{(i)}^\top A_{(i)} & 0\end{pmatrix} = \begin{pmatrix} 0 & A^\top B \\ B^\top A & 0 \end{pmatrix}. \]
This implies that $\|\mathbb{E}M\|_2 = \|A^\top B\|_2 \le 1$. Next, notice that the spectral norm of the random matrix $M$ is upper bounded by $\sqrt{\operatorname{sr}(A)\operatorname{sr}(B)}$ almost surely. Indeed,
\[ \|M\|_2 \le \sup_{i\in[n]} \frac{\|A_{(i)}^\top B_{(i)}\|_2}{p_i} \le S\,\sup_{i\in[n]}\frac{\|A_{(i)}\|_2\|B_{(i)}\|_2}{\|A_{(i)}\|_2\|B_{(i)}\|_2} = S \le \|A\|_F\|B\|_F = \sqrt{\operatorname{sr}(A)}\sqrt{\operatorname{sr}(B)}, \]
by the definition of $p_i$, properties of norms, the Cauchy-Schwarz inequality, and the arithmetic/geometric mean inequality. Notice that this quantity (since the spectral norms of both $A$ and $B$ are one) is at most $\widetilde{r}$ by assumption. Also notice that every element on the support of the random variable $M$ has rank at most two. It is easy to see that, by setting $\gamma = \widetilde{r}$, all the conditions in Theorem 1.1 are satisfied, and hence we get indices $i_1, i_2, \ldots, i_t$ from $[n]$, $t = \Omega(\widetilde{r}\log(\widetilde{r}/\varepsilon^2)/\varepsilon^2)$, such that with high probability
\[ \left\|\frac{1}{t}\sum_{j=1}^t \frac{1}{p_{i_j}}\begin{pmatrix} 0 & A_{(i_j)}^\top B_{(i_j)} \\ B_{(i_j)}^\top A_{(i_j)} & 0\end{pmatrix} - \begin{pmatrix} 0 & A^\top B \\ B^\top A & 0\end{pmatrix}\right\|_2 \le \varepsilon. \]
Now define
\[ \widetilde{A}^\top = \frac{1}{\sqrt{t}}\left(\frac{A_{(i_1)}^\top}{\sqrt{p_{i_1}}}\;\; \frac{A_{(i_2)}^\top}{\sqrt{p_{i_2}}}\;\cdots\;\frac{A_{(i_t)}^\top}{\sqrt{p_{i_t}}}\right) \quad\text{and}\quad \widetilde{B}^\top = \frac{1}{\sqrt{t}}\left(\frac{B_{(i_1)}^\top}{\sqrt{p_{i_1}}}\;\cdots\;\frac{B_{(i_t)}^\top}{\sqrt{p_{i_t}}}\right), \]
so that, in particular, the row-span of $\widetilde{A}$ is contained in the row-span of $A$. The top-right block of the averaged matrix above equals $\widetilde{A}^\top\widetilde{B} - A^\top B$, and the spectral norm of a block anti-diagonal matrix equals the spectral norm of its off-diagonal block; hence $\|\widetilde{A}^\top\widetilde{B} - A^\top B\|_2 \le \varepsilon$, which concludes part (ii).

4.2 Proof of the $\ell_2$-regression theorem Let $A = U\Sigma V^\top$ be the singular value decomposition of $A$, and set $w = b - Ax_{\mathrm{opt}}$, so that $b = Ax_{\mathrm{opt}} + w$ and $w \perp \operatorname{colspan}(U)$. Let $y$ be such that $Uy = A(\widetilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}})$. The normal equations of the sketched problem give $(RA)^\top RA\,\widetilde{x}_{\mathrm{opt}} = (RA)^\top Rb$, and since $Rb = RAx_{\mathrm{opt}} + Rw$ and $U^\top w = 0$, this yields
\[ (4.11)\qquad U^\top R^\top R U y = U^\top R^\top R w. \]
A crucial observation is that $\operatorname{colspan}(U)$ is perpendicular to $w$. Set $A = B = U$ in Theorem 3.2, with accuracy $\sqrt{\varepsilon}$ and $t = \Omega(r/\varepsilon)$; notice that $\operatorname{rank}(A) + \operatorname{rank}(B) \le 2r$, hence with constant probability we know that $1 - \sqrt{\varepsilon} \le \sigma_i(RU) \le 1 + \sqrt{\varepsilon}$ for all $i$. It follows that $\|U^\top R^\top R U y\|_2 \ge (1-\sqrt{\varepsilon})^2\|y\|_2$. A similar argument (set $A = U$ and $B = w$ in Theorem 3.2) guarantees that $\|U^\top R^\top R w\|_2 = \|U^\top R^\top R w - U^\top w\|_2 \le \sqrt{\varepsilon}\,\|U\|_2\|w\|_2 = \sqrt{\varepsilon}\,\|w\|_2$. Recall that $\|U\|_2 = 1$, since $U^\top U = I_r$. Therefore, taking Euclidean norms on both sides of Equation (4.11), we get
\[ \|y\|_2 \le \frac{\sqrt{\varepsilon}}{(1-\sqrt{\varepsilon})^2}\,\|w\|_2 \le 4\sqrt{\varepsilon}\,\|w\|_2. \]
For Ineq. (3.3), since $w \perp \operatorname{colspan}(U)$ and $U^\top U = I_r$,
\[ \|b - A\widetilde{x}_{\mathrm{opt}}\|_2^2 = \|b - Ax_{\mathrm{opt}} - A(\widetilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}})\|_2^2 = \|w - Uy\|_2^2 = \|w\|_2^2 + \|Uy\|_2^2 \le (1 + 16\varepsilon)\,\|w\|_2^2, \]
and (3.3) follows by rescaling $\varepsilon$. For Ineq. (3.4),
\[ \|x_{\mathrm{opt}} - \widetilde{x}_{\mathrm{opt}}\|_2 \le \frac{\|A(\widetilde{x}_{\mathrm{opt}} - x_{\mathrm{opt}})\|_2}{\sigma_{\min}(A)} = \frac{\|Uy\|_2}{\sigma_{\min}(A)} \le \frac{4\sqrt{\varepsilon}\,\|w\|_2}{\sigma_{\min}(A)}. \]
This concludes the theorem.
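The dilation trick above hinges on the identity $\|\begin{pmatrix}0 & X\\ X^\top & 0\end{pmatrix}\|_2 = \|X\|_2$, which the following throwaway check confirms numerically (sizes are arbitrary).

```python
import numpy as np

# The eigenvalues of the symmetric dilation [[0, X], [X^T, 0]] are exactly
# +/- the singular values of X, so the two spectral norms coincide.
rng = np.random.default_rng(4)
X = rng.standard_normal((7, 5))
D = np.block([[np.zeros((7, 7)), X], [X.T, np.zeros((5, 5))]])
print(np.linalg.norm(D, 2), np.linalg.norm(X, 2))  # the two values agree
```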
The following lemma reduces spectral low rank matrix approximation to the problem of bounding the norm $\|(RU_k)^{+}RH_k\|_2$.

Lemma 4.2. Let $A = A_k + U_{r-k}\Sigma_{r-k}V_{r-k}^\top$, let $H_k = U_{r-k}\Sigma_{r-k}$, and let $R$ be any $t \times n$ matrix. If the matrix $RU_k$ has full column rank, then the following inequality holds:
\[ (4.14)\qquad \|A - P_{(RA),k}(A)\|_2 \le 2\,\|A - A_k\|_2 + \|(RU_k)^{+}RH_k\|_2. \]

Proof of Theorem 3.4 (i):

Part (a): Now we are ready to prove the first corollary of our matrix multiplication result, for the problem of computing an approximate low rank matrix approximation of a matrix with respect to the spectral norm (Theorem 3.4). Applying Theorem 3.2 (i.a) with $A = B$ and $t = \Omega(r/\varepsilon^2)$ guarantees that, with high probability,
\[ (4.12)\qquad (1-\varepsilon)\,\lambda_i(A^\top A) \le \lambda_i(\widetilde{A}^\top\widetilde{A}) \le (1+\varepsilon)\,\lambda_i(A^\top A) \]
for all $i = 1,\ldots,\operatorname{rank}(A)$. Let $\widetilde{\Pi}_k$ be the projection matrix onto the first $k$ right singular vectors of $\widetilde{A}$. Then
\[ \|A - P_{\widetilde{A},k}(A)\|_2^2 = \sup_{x\in\mathbb{R}^m,\,\|x\|_2=1} \|A(I - \widetilde{\Pi}_k)x\|_2^2 = \sup_{x\in\ker\widetilde{\Pi}_k,\,\|x\|_2=1}\|Ax\|_2^2 = \sup_{x\in\ker\widetilde{\Pi}_k,\,\|x\|_2=1} x^\top A^\top Ax \]
\[ \le (1+\varepsilon)\sup_{x\in\ker\widetilde{\Pi}_k,\,\|x\|_2=1} x^\top \widetilde{A}^\top\widetilde{A}x = (1+\varepsilon)\,\lambda_{k+1}(\widetilde{A}^\top\widetilde{A}) \le (1+\varepsilon)^2\,\lambda_{k+1}(A^\top A) = (1+\varepsilon)^2\,\|A - A_k\|_2^2, \]
using that $x \in \ker\widetilde{\Pi}_k$ implies $(I-\widetilde{\Pi}_k)x = x$, the hypothesis (left side of Ineq. (4.12), see also Eqn. (5.17)), Courant-Fischer on $\widetilde{A}^\top\widetilde{A}$, Ineq. (4.12), and properties of singular values, respectively. Rescaling $\varepsilon$ concludes part (a).

Part (b): First notice that by setting $t = \Omega(k/\varepsilon^2)$ we can guarantee that the matrix $RU_k$ has full column rank with high probability. Actually, we can say something much stronger: applying Theorem 3.2 (i.a) with $A = B = U_k$, we can guarantee that all the singular values of $RU_k$ are within $1 \pm \varepsilon$ with high probability. Now, conditioning on the above event ($RU_k$ has full column rank), it follows from Lemma 4.2 that
\[ \|A - P_{(RA),k}(A)\|_2 \le 2\,\|A - A_k\|_2 + \|(RU_k)^{+}RH_k\|_2 \le 2\,\|A - A_k\|_2 + \|(RU_k)^{+}\|_2\,\|RH_k\|_2 \]
\[ \le 2\,\|A - A_k\|_2 + \frac{1}{1-\varepsilon}\,\|RH_k\|_2 \le 2\,\|A - A_k\|_2 + \frac{3}{2}\,\|RU_{r-k}\|_2\,\|\Sigma_{r-k}\|_2, \]
using Eqn. (4.14), the sub-multiplicative property of matrix norms, the singular value bound on $RU_k$, and that $\varepsilon < 1/3$. Now, it suffices to bound the norm of $W := RU_{r-k}$. Recall that $R = \frac{1}{\sqrt{t}}G$, where $G$ is a $t \times n$ random Gaussian matrix. It is well-known that the distribution of the random matrix $GU_{r-k}$ (by rotational invariance of the Gaussian distribution) has entries which are also i.i.d. Gaussian random variables. Now, we can use the following fact about random sub-Gaussian matrices to give a bound on the spectral norm of $W$:
\[ (4.15)\qquad \mathbb{P}\left(\|W\|_2 \ge 2\sqrt{\frac{r-k}{t}}\right) \le e^{-c_0(r-k)}. \]
It follows that, with high probability,
\[ \|A - P_{(RA),k}(A)\|_2 \le \left(2 + c_4\,\varepsilon\sqrt{\frac{r-k}{k}}\right)\|A - A_k\|_2, \]
since $\|\Sigma_{r-k}\|_2 = \|A - A_k\|_2$ and $t = \Omega(k/\varepsilon^2)$, where $c_4 > 0$ is an absolute constant. Rescaling $\varepsilon$ by $c_4$ concludes Theorem 3.4 (i.b).

Proof of Theorem 3.4 (ii): Here we prove that we can achieve the same relative error bound as with random projections by just sampling rows of $A$ through a judiciously selected distribution. However, there is a price to pay, and that is an extra logarithmic factor on the number of samples, as stated in Theorem 3.4, part (ii).

Proof. (of Theorem 3.4 (ii)) The proof follows closely the proof of [SS08] and is similar to that of part (a). Let $A = U\Sigma V^\top$ be the singular value decomposition of $A$. Define the projector matrix $\Pi = UU^\top$ of size $n \times n$. Clearly, the rank of $\Pi$ is equal to the rank of $A$, and $\Pi$ has the same image as $A$, since every element in the image of $A$ and of $\Pi$ is a linear combination of columns of $U$. Recall that for any projection matrix $\Pi^2 = \Pi$, and hence $\operatorname{sr}(\Pi) = \operatorname{rank}(A) = r$. Moreover, $\|\Pi\|_F^2 = \operatorname{tr}(\Pi) = \sum_{i=1}^n \|U_{(i)}\|_2^2 = \operatorname{tr}(UU^\top) = \operatorname{tr}(U^\top U) = r$. Let $p_i = \Pi(i,i)/r = \|U_{(i)}\|_2^2/r$ be a probability distribution on $[n]$, where $U_{(i)}$ is the $i$-th row of $U$.

Define a $t \times n$ random matrix $\widetilde{S}$ as follows: pick $t$ samples from $p_i$; if the $i$-th sample is equal to $j \in [n]$, set $\widetilde{S}_{ij} = 1/\sqrt{t\,p_j}$. An argument as in [SS08], using Theorem 1.1, shows that with high probability $\|\Pi\widetilde{S}^\top\widetilde{S}\Pi - \Pi\|_2 \le \varepsilon$. Consequently,
\[ \sup_{x\in\mathbb{R}^m,\;Ax\ne 0} \frac{|x^\top(A^\top\widetilde{S}^\top\widetilde{S}A - A^\top A)x|}{x^\top A^\top Ax} \le \|\Pi\widetilde{S}^\top\widetilde{S}\Pi - \Pi\|_2 \le \varepsilon, \]
since $\operatorname{Im}(A) \subseteq \operatorname{Im}(\Pi)$ and $\Pi A = A$. By re-arranging terms we get Equation (4.13), and so the claim follows by Lemma 5.1 and the argument of part (a).

Acknowledgments
References
Appendix

The next lemma states that if a symmetric positive semi-definite matrix $\widetilde{A}$ approximates the Rayleigh quotient of a symmetric positive semi-definite matrix $A$, then the eigenvalues of $\widetilde{A}$ also approximate the eigenvalues of $A$.

Lemma 5.1. Let $0 < \varepsilon < 1$. Assume $A, \widetilde{A}$ are $n \times n$ symmetric positive semi-definite matrices such that the following inequality holds:
\[ (1-\varepsilon)\,x^\top Ax \le x^\top\widetilde{A}x \le (1+\varepsilon)\,x^\top Ax, \qquad \forall x \in \mathbb{R}^n. \]
Then, for $i = 1,\ldots,n$, the eigenvalues of $A$ and $\widetilde{A}$ are the same up to an error factor $\varepsilon$, i.e.,
\[ (5.17)\qquad (1-\varepsilon)\,\lambda_i(A) \le \lambda_i(\widetilde{A}) \le (1+\varepsilon)\,\lambda_i(A). \]
Proof. By the Courant-Fischer characterization,
\[ \lambda_i(\widetilde{A}) = \max_{S:\,\dim S = i}\;\min_{x\in S\setminus\{0\}} \frac{x^\top\widetilde{A}x}{x^\top x} \le \max_{S:\,\dim S = i}\;\min_{x\in S\setminus\{0\}} (1+\varepsilon)\,\frac{x^\top Ax}{x^\top x} = (1+\varepsilon)\,\lambda_i(A), \]
and similarly,
\[ \lambda_i(\widetilde{A}) \ge \max_{S:\,\dim S = i}\;\min_{x\in S\setminus\{0\}} (1-\varepsilon)\,\frac{x^\top Ax}{x^\top x} = (1-\varepsilon)\,\lambda_i(A). \]
Together, $(1-\varepsilon)\,\lambda_i(A) \le \lambda_i(\widetilde{A}) \le (1+\varepsilon)\,\lambda_i(A)$.
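A quick numerical check of Lemma 5.1 in the sketching setting of Section 4: with $A = X^\top X$ and $\widetilde{A} = (RX)^\top RX$, the Rayleigh-quotient hypothesis holds with high probability for $t$ large enough, and the eigenvalues indeed match up to a small relative factor. (Our construction; the lemma itself is deterministic.)

```python
import numpy as np

# A = X^T X and Atilde = (RX)^T (RX): the sign sketch R approximately
# preserves Rayleigh quotients, so by Lemma 5.1 the sorted eigenvalues
# should agree up to a (1 +/- eps) factor.
rng = np.random.default_rng(5)
n, d, t = 2000, 20, 800
X = rng.standard_normal((n, d))
R = rng.choice([-1.0, 1.0], size=(t, n)) / np.sqrt(t)

eig_A = np.sort(np.linalg.eigvalsh(X.T @ X))
eig_At = np.sort(np.linalg.eigvalsh((R @ X).T @ (R @ X)))
print(np.max(np.abs(eig_At - eig_A) / eig_A))   # small relative perturbation
```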
Proof of Theorem 1.1: For notational convenience, let $Z = \|\frac{1}{t}\sum_{i=1}^t M_i - \mathbb{E}M\|_2$ and define $\mathrm{E}_p := \mathbb{E}_{M_1,\ldots,M_t} Z^p$. Moreover, if $X_1, X_2, \ldots, X_n$ are copies of a (matrix-valued) random variable $X$, we will denote $\mathbb{E}_{X_1,\ldots,X_n}$ by $\mathbb{E}_{X_{[n]}}$. Our goal is to give sharp bounds on the moments of the non-negative random variable $Z$.
Lemma 5.2. Let $M_1,\ldots,M_t$ be i.i.d. copies of $M$, where $M$ is a symmetric matrix-valued random variable that has rank at most $r$ almost surely. Then for every $p \ge 2$,
\[ (5.18)\qquad \mathrm{E}_p \le r\,t^{1-p}\,(2B_p)^p\;\mathbb{E}_{M_{[t]}}\Big\|\sum_{j=1}^t M_j^2\Big\|_2^{p/2}. \]

We need a non-commutative version of the Khintchine inequality due to F. Lust-Piquard [LP86]; see also [LPP91] and [Buc01, Theorem 5]. We start with some preliminaries. Let $A \in \mathbb{R}^{n\times n}$ and denote by $C_p^n$ the $p$-th Schatten norm space, that is, the Banach space of linear operators (or matrices in our setting) on $\mathbb{R}^n$ equipped with the norm
\[ (5.19)\qquad \|A\|_{C_p^n} := \Big(\sum_{i=1}^n \sigma_i(A)^p\Big)^{1/p}, \]
where $\sigma_i(A)$ are the singular values of $A$; see [Bha96, Chapter IV, p. 92] for a discussion on Schatten norms. Notice that $\|A\|_2 = \sigma_1(A)$, hence we have the following inequality:
\[ (5.20)\qquad \|A\|_2 \le \|A\|_{C_p^n} \le n^{1/p}\,\|A\|_2. \]
The non-commutative Khintchine inequality states that
\[ (5.21)\qquad \mathbb{E}_{\epsilon_{[t]}}\Big\|\sum_{i=1}^t \epsilon_i M_i\Big\|_{C_p^n}^p \le B_p^p\;\Big\|\Big(\sum_{i=1}^t M_i^2\Big)^{1/2}\Big\|_{C_p^n}^p, \]
where for every $i \in [t]$, $\epsilon_i$ is a Bernoulli $\pm 1$ random variable. Moreover, $B_p$ is at most $2^{5/4}\sqrt{p/e}$.

Proof. (of Lemma 5.2) Let $\epsilon_1,\ldots,\epsilon_t$ denote independent Bernoulli $\pm 1$ variables, and let $\widetilde{M}_1,\ldots,\widetilde{M}_t$ be independent copies of $M_1,\ldots,M_t$. We essentially estimate the $p$-th root of $\mathrm{E}_p$:
\[ (5.22)\qquad \mathrm{E}_p^{1/p} = \Big(\mathbb{E}_{M_{[t]}}\Big\|\frac{1}{t}\sum_{i=1}^t M_i - \mathbb{E}M\Big\|_2^p\Big)^{1/p}. \]
Notice that $\mathbb{E}M = \mathbb{E}_{\widetilde{M}_{[t]}}\frac{1}{t}\sum_{i=1}^t \widetilde{M}_i$. We plug this into (5.22) and apply Jensen's inequality:
\[ \mathrm{E}_p^{1/p} \le \Big(\mathbb{E}_{M_{[t]}}\mathbb{E}_{\widetilde{M}_{[t]}}\Big\|\frac{1}{t}\sum_{i=1}^t (M_i - \widetilde{M}_i)\Big\|_2^p\Big)^{1/p}. \]
Now, notice that $M_i - \widetilde{M}_i$ is a symmetric matrix-valued random variable for every $i \in [t]$, i.e., it is distributed identically with $\epsilon_i(M_i - \widetilde{M}_i)$. Thus, using the triangle inequality and the fact that $\mathbb{E}\|-Y\|_2^p = \mathbb{E}\|Y\|_2^p$, we obtain
\[ (5.23)\qquad \mathrm{E}_p^{1/p} \le \Big(\mathbb{E}_{M_{[t]}}\mathbb{E}_{\widetilde{M}_{[t]}}\mathbb{E}_{\epsilon_{[t]}}\Big\|\frac{1}{t}\sum_{i=1}^t \epsilon_i(M_i - \widetilde{M}_i)\Big\|_2^p\Big)^{1/p} \le 2\,\Big(\mathbb{E}_{M_{[t]}}\mathbb{E}_{\epsilon_{[t]}}\Big\|\frac{1}{t}\sum_{i=1}^t \epsilon_i M_i\Big\|_2^p\Big)^{1/p}. \]
It follows that
\[ \mathbb{E}_{\epsilon_{[t]}}\Big\|\frac{1}{t}\sum_{j=1}^t \epsilon_j M_j\Big\|_2^p \le \mathbb{E}_{\epsilon_{[t]}}\Big\|\frac{1}{t}\sum_{j=1}^t \epsilon_j M_j\Big\|_{C_p^n}^p \le \frac{B_p^p}{t^p}\,\Big\|\Big(\sum_{j=1}^t M_j^2\Big)^{1/2}\Big\|_{C_p^n}^p \]
\[ (5.24)\qquad \le rt\;\frac{B_p^p}{t^p}\,\Big\|\Big(\sum_{j=1}^t M_j^2\Big)^{1/2}\Big\|_2^p = rt\,\frac{B_p^p}{t^p}\,\Big\|\sum_{j=1}^t M_j^2\Big\|_2^{p/2}, \]
taking $1/t$ outside the expectation and using the left part of Ineq. (5.20), Ineq. (5.21), the right part of Ineq. (5.20), and the fact that the matrix $\sum_{j=1}^t M_j^2$ has rank at most $rt$. Taking expectation with respect to $M_1,\ldots,M_t$, Ineq. (5.18) follows from Ineq. (5.23) and Ineq. (5.24).

We now conclude the proof of Theorem 1.1. Each $M_i$ can be written as a difference of two positive semi-definite matrices, and one can bound each term of the right hand side separately; hence, from now on we assume that $M \succeq 0$ almost surely. Now use the fact that, for every $j \in [t]$, $M_j^2 \preceq \gamma M_j$, since the $M_j$'s are positive semi-definite and $\|M\|_2 \le \gamma$ almost surely. Hence
\[ (5.25)\qquad \Big\|\sum_{j=1}^t M_j^2\Big\|_2 \le \gamma\,\Big\|\sum_{j=1}^t M_j\Big\|_2, \]
and therefore, by Lemma 5.2,
\[ \mathrm{E}_p \le rt\,(2B_p)^p\,\frac{\gamma^{p/2}}{t^{p/2}}\;\mathbb{E}_{M_{[t]}}\Big\|\frac{1}{t}\sum_{j=1}^t M_j\Big\|_2^{p/2} \le rt\,(2B_p)^p\,\frac{\gamma^{p/2}}{t^{p/2}}\;\mathbb{E}_{M_{[t]}}\Big(\Big\|\frac{1}{t}\sum_{j=1}^t M_j - \mathbb{E}M\Big\|_2 + \|\mathbb{E}M\|_2\Big)^{p/2} \]
\[ (5.26)\qquad \le rt\,(2B_p)^p\,\Big(\frac{\gamma}{t}\Big)^{p/2}\,\big(\mathrm{E}_p^{1/p} + 1\big)^{p/2}, \]
using the triangle inequality, $\|\mathbb{E}M\|_2 \le 1$, and Jensen's inequality. Let $a_p = 4B_p\sqrt{\gamma}\,(rt)^{1/p}/\sqrt{t}$. Taking $p$-th roots in (5.26) and using $\sqrt{1+x} \le 1 + \sqrt{x}$ for $x \ge 0$, it follows that if $\mathrm{E}_p^{1/p} < 1$ then $\mathrm{E}_p^{1/p} < a_p$; otherwise $1 \le a_p$. Hence
\[ (5.27)\qquad \big(\mathbb{E}\min\{Z,1\}^p\big)^{1/p} \le \min\big(\mathrm{E}_p^{1/p},\,1\big) \le a_p. \]
Therefore,
\[ \mathbb{E}\min\{Z,1\}^p \le \inf_{p\ge 2}\,a_p^p = \inf_{p\ge 2}\Big(\frac{4B_p\sqrt{\gamma}\,(rt)^{1/p}}{\sqrt{t}}\Big)^p = \inf_{p\ge 2}\Big(C_2\,\frac{\sqrt{\gamma p}\,(rt)^{1/p}}{\sqrt{t}}\Big)^p, \]
where $C_2 > 0$ is an absolute constant, using $B_p \le 2^{5/4}\sqrt{p/e}$. Now assume that $r \le t$, and set $p = c_2\log t$, where $c_2 > 0$ is a sufficiently large constant, in the infimum expression of the above inequality. Since $(rt)^{1/\log t} = O(1)$ for $r \le t$, Markov's inequality yields
\[ \mathbb{P}\left(\Big\|\frac{1}{t}\sum_{i=1}^t M_i - \mathbb{E}M\Big\|_2 > C\,\sqrt{\frac{\gamma\log t}{t}}\;(rt)^{\frac{1}{\log t}}\right) \le \frac{1}{\mathrm{poly}(t)}, \]
which gives Theorem 1.1.
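The rank-$rt$ refinement of (5.20) is the step that trades the dimension $n$ for the rank: for a rank-$r$ matrix only $r$ singular values are non-zero, so $\|A\|_{C_p^n} \le r^{1/p}\|A\|_2$. A small numerical illustration (ours):

```python
import numpy as np

def schatten_norm(A, p):
    """p-th Schatten norm (5.19): the l_p norm of the singular values."""
    return np.sum(np.linalg.svd(A, compute_uv=False) ** p) ** (1.0 / p)

# For a rank-r matrix the upper bound in (5.20) improves from
# n^{1/p} ||A||_2 to r^{1/p} ||A||_2, the observation behind the
# rank (rather than dimension) dependence above.
rng = np.random.default_rng(6)
n, r, p = 200, 5, 8
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank r
s2 = np.linalg.norm(A, 2)
print(s2 <= schatten_norm(A, p) <= r ** (1.0 / p) * s2)        # True
```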