Stephen Wright
University of Wisconsin-Madison
Sparse Optimization
Motivation for Sparse Optimization
Applications of Sparse Optimization
Formulating Sparse Optimization Problems
Compressed Sensing
Matrix Completion
Conclusions
Total-variation image denoising: given a noisy image f, solve

min_u  P(u) := ||u − f||_2^2 + λ Σ_{i,j} || ( u_{i+1,j} − u_{i,j},  u_{i,j+1} − u_{i,j} ) ||_2
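For concreteness, a minimal NumPy sketch (not from the slides; the names u, f, and lam for the weight λ are mine) that evaluates this objective, using forward differences with zero padding for the discrete gradient:

import numpy as np

def tv_denoise_objective(u, f, lam):
    """Evaluate P(u) = ||u - f||_2^2 + lam * sum_{i,j} ||spatial gradient of u at (i,j)||_2."""
    dx = np.diff(u, axis=0)                    # u_{i+1,j} - u_{i,j}
    dy = np.diff(u, axis=1)                    # u_{i,j+1} - u_{i,j}
    dx = np.pad(dx, ((0, 1), (0, 0)))          # zero-pad so both fields share u's shape
    dy = np.pad(dy, ((0, 0), (0, 1)))
    tv = np.sum(np.sqrt(dx**2 + dy**2))        # sum of 2-norms of the per-pixel gradients
    return np.sum((u - f)**2) + lam * tv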
X = Σ_{f=1}^{F} a_f^(1) ∘ a_f^(2) ∘ ⋯ ∘ a_f^(N)      (∘ = vector outer product)
Rank of a tensor is the smallest F for which exact equality holds. However, things are much more complicated than in the matrix case (N = 2):
F may be different over R and C.
Finding F is NP-hard.
Maximum and typical ranks of random tensors may be different.
Minimum-rank decompositions are nonunique for matrices, but often unique for tensors.
Can have a sequence of rank-F tensors approaching a rank-(F + 1) tensor.
There is interest in solving tensor completion problems where we find a rank-F tensor that closely approximates the observations in a given tensor.
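To make the sum-of-outer-products form concrete, a small NumPy sketch (illustrative names and sizes, not from the slides) that assembles an order-N tensor of rank at most F from factor matrices whose columns are the vectors a_f^(n):

import numpy as np

def cp_tensor(factors):
    """Build X = sum_{f=1}^F a_f^(1) o a_f^(2) o ... o a_f^(N).

    factors[n] has shape (I_n, F); its f-th column is the vector a_f^(n).
    """
    F = factors[0].shape[1]
    X = np.zeros(tuple(A.shape[0] for A in factors))
    for f in range(F):
        term = factors[0][:, f]
        for A in factors[1:]:
            term = np.multiply.outer(term, A[:, f])   # outer product, one mode at a time
        X += term
    return X

# Example: a random tensor of size 4 x 5 x 6 with rank at most 3
X = cp_tensor([np.random.randn(4, 3), np.random.randn(5, 3), np.random.randn(6, 3)])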
Support vector machines: seek a classifier (w, b) with

w^T x_i − b ≥ 1 when y_i = 1,        w^T x_i − b ≤ −1 when y_i = −1.

min_{w,b}  (1/2) w^T w + C Σ_{i=1}^{m} max( 1 − y_i [w^T x_i − b], 0 ).

Dual formulation:

max_α  e^T α − (1/2) α^T Y^T K Y α   subject to   α^T y = 0,  0 ≤ α ≤ C 1,

where y = (y_1, y_2, ..., y_m)^T, Y = diag(y), K_ij = x_i^T x_j is the kernel.
More generally, can replace the inner product by a kernel, K_ij = k(x_i, x_j). (Can get a primal formulation too.)
Where does sparsity come in? Can formulate approximate versions of these problems in which few of the α_i are allowed to be nonzero. (In fact, these are more tractable when m is very large.)
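To make the connection to sparsity concrete, a short NumPy sketch (mine; names are illustrative) that evaluates the dual objective and feasibility for a candidate α. At a dual solution, only the support vectors carry nonzero α_i:

import numpy as np

def svm_dual_objective(alpha, K, y):
    """Dual objective e^T alpha - 0.5 * alpha^T Y K Y alpha with Y = diag(y)."""
    Ya = y * alpha                       # Y @ alpha, componentwise since Y is diagonal
    return alpha.sum() - 0.5 * Ya @ K @ Ya

def dual_feasible(alpha, y, C, tol=1e-8):
    """Check alpha^T y = 0 and 0 <= alpha <= C (componentwise)."""
    return abs(alpha @ y) < tol and np.all(alpha >= -tol) and np.all(alpha <= C + tol)

# The sparsity pattern of a dual solution identifies the support vectors:
# support = np.flatnonzero(alpha > tol)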
Logistic regression: model the log-odds as a linear combination of basis functions B_l,

log [ p(x) / (1 − p(x)) ] = Σ_{l=0}^{N} a_l B_l(x),

with the coefficients a_l chosen to maximize the (scaled) log-likelihood

(1/m) Σ_{i=1}^{m} [ y_i log p(x_i) + (1 − y_i) log(1 − p(x_i)) ].
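A minimal NumPy sketch (assumed names, not from the slides) of this average log-likelihood, with B holding the basis-function values B_l(x_i); a sparse fit would add a constraint or penalty on the coefficients a, as discussed next:

import numpy as np

def avg_log_likelihood(a, B, y):
    """(1/m) * sum_i [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ].

    a : coefficients (length N+1),  B : m x (N+1) matrix with B[i, l] = B_l(x_i),
    y : 0/1 labels of length m.
    """
    t = B @ a                           # log-odds at each sample
    log_p = -np.logaddexp(0.0, -t)      # log p = -log(1 + exp(-t)), computed stably
    log_1mp = -t + log_p                # log(1 - p) = -t + log p
    return np.mean(y * log_p + (1 - y) * log_1mp)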
Impose a constraint ||x||_1 ≤ T for some parameter T > 0. In fact, can trace the solution x as a function of T. Generally, higher T leads to less sparse x.
Can extend to group lasso, where x is broken into disjoint subvectors x[l] ,
l = 1, 2, . . . , K , and we impose the constraint:
Σ_{l=1}^{K} ||x_[l]|| ≤ T      or      Σ_{l=1}^{K} ||x_[l]||_2 ≤ T.
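A small sketch (illustrative) of the second, group-ℓ2 penalty; groups is an assumed list of disjoint index arrays describing the subvectors x_[l]:

import numpy as np

def group_norm_sum(x, groups):
    """Sum over groups of ||x_[l]||_2 for disjoint index sets in `groups`."""
    return sum(np.linalg.norm(x[idx]) for idx in groups)

x = np.array([0.0, 0.0, 1.0, -2.0, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
feasible = group_norm_sum(x, groups) <= 6.0      # constraint with T = 6, for example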
In one dimension, adding a penalty τ|x| to a smooth function f gives the subdifferential

∂( f(x) + τ|x| ) =  f'(x) − τ           if x < 0
                    f'(0) + τ[−1, 1]    if x = 0
                    f'(x) + τ           if x > 0.

Introduces nonsmoothness at the kink x = 0, making it more likely that 0 will be chosen as the solution.
The likelihood increases as τ increases.
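A 1-D illustration (mine, not from the slides): for f(x) = (1/2)(x − a)^2 the minimizer of f(x) + τ|x| is the soft-thresholding of a, and it sits exactly at the kink whenever τ ≥ |a|:

import numpy as np

def soft_threshold(a, tau):
    """Minimizer of 0.5*(x - a)^2 + tau*|x|: shrink a toward zero by tau."""
    return np.sign(a) * max(abs(a) - tau, 0.0)

a, tau = 0.8, 1.0
grid = np.linspace(-3, 3, 20001)
obj = 0.5 * (grid - a)**2 + tau * np.abs(grid)
print(soft_threshold(a, tau), grid[np.argmin(obj)])   # both are 0: the kink wins since tau >= |a|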
[Figure: Effect of τ. Three plots showing f(x), f(x) + |x|, and f(x) + 2|x|.]
Higher Dimensions
In higher dimensions, if we design nonsmooth functions c(x) that have
their kinks at points that are sparse according to our definition, then
they are suitable regularizers for our problem.
Examples:
c(x) = ||x||_1 will tend to produce x with few nonzeros.
c(x) = ||x||_2 is less interesting: kink only when all components are zero (all or nothing).
c(x) = ||x||_∞ has kinks where components of x are equal, so it also may not be interesting for sparsity.
c(x) = Σ_{l=1}^{K} ||x_[l]||_2 has kinks where x_[l] = 0 for some l: suitable for group sparsity.
Total Variation norm: has kinks where u_{i,j} = u_{i+1,j} = u_{i,j+1} for some i, j, i.e. where the spatial gradient is zero.
Compressed Sensing
Ax ≈ y,  with A ∈ R^{n×k}, n ≪ k.
Restricted Isometry Property (RIP): for some δ_S ∈ (0, 1),

(1 − δ_S) ||c||_2^2  ≤  ||A_S c||_2^2  ≤  (1 + δ_S) ||c||_2^2   for every n × S column submatrix A_S of A and for all c ∈ R^S.
Common formulations:

min_x  (1/2) ||Ax − y||_2^2   subject to  ||x||_1 ≤ T;

min_x  ||x||_1   subject to  ||Ax − y||_2 ≤ ε;

min_x  (1/2) ||Ax − y||_2^2 + τ ||x||_1.
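To experiment with any of these formulations, a minimal sketch (assumed names and sizes) that generates a synthetic instance with a random Gaussian A, n ≪ k, and a sparse ground truth:

import numpy as np

def cs_instance(n=64, k=256, nnz=8, noise=0.01, seed=0):
    """Random compressed-sensing instance: y = A x_true + noise, with x_true sparse."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, k)) / np.sqrt(n)       # roughly unit-norm columns
    x_true = np.zeros(k)
    support = rng.choice(k, size=nnz, replace=False)
    x_true[support] = rng.standard_normal(nnz)
    y = A @ x_true + noise * rng.standard_normal(n)
    return A, y, x_true

A, y, x_true = cs_instance()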
Algorithms
Interior-point:
Primal-dual: SparseLab / PDCO [Saunders et al., 98, 02], l1_ls [Kim et al., 07]
SOCP: ℓ1-magic [Candès, Romberg 05]
Orthogonal Matching Pursuit (OMP). Define

q(x) := (1/2) ||Ax − y||_2^2,      ∇q(x) = A^T r,  where r := Ax − y.

OMP chooses elements of ∇q one at a time, allowing the corresponding components of x to move away from 0, and adjusts r accordingly.

Given A, y, set t = 1, r_0 = 0, and Γ_0 = ∅.
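A compact NumPy sketch of an OMP loop in the spirit of the description above (the fixed iteration count and other details are simplifications of mine): each pass picks the largest component of |∇q| = |A^T r| and then re-fits x by least squares on the chosen columns:

import numpy as np

def omp(A, y, nnz):
    """Orthogonal Matching Pursuit: greedily grow the support of x."""
    n, k = A.shape
    x = np.zeros(k)
    support = []
    r = A @ x - y                                   # residual, initially -y
    for _ in range(nnz):
        j = int(np.argmax(np.abs(A.T @ r)))         # largest gradient component
        if j not in support:
            support.append(j)
        xs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x = np.zeros(k)
        x[support] = xs                             # refit on the active columns
        r = A @ x - y                               # update the residual
    return x, support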
SpaRSA
min_x  φ(x) := (1/2) ||Ax − y||_2^2 + τ ||x||_1.

Define q(x) := (1/2) ||Ax − y||_2^2. From iterate x^k, get step d by solving

min_d  ∇q(x^k)^T d + (α_k/2) d^T d + τ ||x^k + d||_1.

Can view the α_k term as an approximation to the Hessian: α_k I ≈ ∇^2 q = A^T A. (When RIP holds, this approximation is good, for small principal submatrices of A^T A.)

Subproblem is trivial to solve in O(n) operations, since it is separable in the components of d. Equivalent to the shrinkage operator:

min_z  (1/2) ||z − u^k||_2^2 + (τ/α_k) ||z||_1,      with u^k := x^k − (1/α_k) ∇q(x^k).
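A minimal NumPy sketch (illustrative) of the resulting step: apply the shrinkage operator to u^k = x^k − (1/α_k) ∇q(x^k); τ and α are inputs here, with α updated between steps (e.g. by a Barzilai-Borwein rule) in a full implementation:

import numpy as np

def shrink(u, t):
    """Componentwise soft threshold: argmin_z 0.5*||z - u||_2^2 + t*||z||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def sparsa_step(x, A, y, tau, alpha):
    """One IST/SpaRSA step: gradient step on q, then shrinkage with threshold tau/alpha."""
    grad = A.T @ (A @ x - y)            # grad q(x) = A^T (A x - y)
    u = x - grad / alpha
    return shrink(u, tau / alpha)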
Choosing α_k
Matrix Completion
Seek a low-rank matrix X ∈ R^{n1 × n2} such that A(X) ≈ b, where A is a linear mapping on the elements of X and b is the vector of observations.

In some sense, matrix completion is compressed sensing extended to matrix variables. The linear algebra is more complicated.

Can formulate as

min_X  rank(X)   s.t.  A(X) = b
Convex relaxations replace the rank by the nuclear norm ||X||_*:

min_X  ||X||_*   s.t.  A(X) = b                                   (1)

and

min_X  τ ||X||_* + (1/2) ||A(X) − b||_2^2.
Obtain X^{k+1} from the shrinkage subproblem

min_Z  τ ||Z||_* + (α_k/2) ||Z − Y^k||_F^2,      where      Y^k := X^k − (1/α_k) A*( A(X^k) − b ),

e.g. [Ma, Goldfarb, Chen, 08]. Can prove convergence for α_k sufficiently large (uniformly greater than λ_max(A*A)/2).
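The subproblem above has a closed form: soft-threshold the singular values of Y^k at level τ/α_k. A minimal sketch (A_op and A_adj, standing for the map A and its adjoint A*, are assumed callables):

import numpy as np

def svt_step(X, A_op, A_adj, b, tau, alpha):
    """One iteration: form Y = X - (1/alpha) A*(A(X) - b), then shrink its singular values."""
    Y = X - A_adj(A_op(X) - b) / alpha
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s - tau / alpha, 0.0)     # singular-value soft thresholding
    return (U * s) @ Vt                      # scales column j of U by s[j]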
Can enhance by similar strategies as in compressed sensing:
Continuation
Barzilai-Borwein choice of α_k, nonmonotonicity
Debiasing.
Explicit Parametrization of X
From [Recht, Fazel, Parrilo 07] and earlier work of Burer, Monteiro, Choi in the SDP setting.

Choose target rank r and approximate X by L R^T, where L ∈ R^{n1 × r} and R ∈ R^{n2 × r} are the unknowns in the problem. For the formulation

min_X  ||X||_*   s.t.  A(X) = b,

solve instead

min_{L,R}  (1/2) ( ||L||_F^2 + ||R||_F^2 )   s.t.  A(L R^T) = b.
A penalized version is

min_{L,R}  (1/2) ( ||L||_F^2 + ||R||_F^2 ) + (μ/2) || A(L R^T) − b ||_2^2.

Again, can use nonlinear conjugate gradient with exact line search, and can do continuation on μ.
No need for SVD. Implementations are easy.
Local minima? [Burer 06] shows that (in an SDP setting) provided r is chosen large enough, the method should not converge to a local solution, only to the global solution.
Performance degrades when rank is overestimated, probably because
of degeneracy.
Investigations of this approach are ongoing.
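For the common special case in which A simply samples entries on an index set Ω (an assumption; the slides allow a general linear map), a minimal sketch of the penalized objective and its gradients in L and R, which is all a gradient or nonlinear-CG method needs:

import numpy as np

def lr_objective_and_grads(L, R, mask, M_obs, mu):
    """0.5*(||L||_F^2 + ||R||_F^2) + (mu/2)*||P_Omega(L R^T) - b||^2 and its gradients.

    mask  : boolean n1 x n2 array, True on the observed entries Omega
    M_obs : n1 x n2 array holding the observed values on Omega (0 elsewhere)
    """
    residual = (L @ R.T - M_obs) * mask           # zero outside Omega
    obj = 0.5 * (np.sum(L**2) + np.sum(R**2)) + 0.5 * mu * np.sum(residual**2)
    grad_L = L + mu * residual @ R
    grad_R = R + mu * residual.T @ L
    return obj, grad_L, grad_R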
Optimization Tools:
Operator splitting (the basis of IST)
Gradient projection
(Optimal) gradient and subgradient methods
Augmented Lagrangian
Algorithms for large-scale nonlinear unconstrained problems (nonlinear
CG, L-BFGS)
Semidefinite programming
Handling of degeneracy
Examples
Compressed Sensing
min_x  (1/2) ||Ax − b||_2^2 + τ ||x||_1
ℓ1 Penalty Function:

min  f(x)   s.t.  r(x) ≤ 0,  x ∈ C

ℓ1 penalty is

min_{x ∈ C}  f(x) + ν || r(x)_+ ||_1

Set

c(x) = ( f(x), r(x), x ),       h(c) = c_1 + ν Σ_{j=2}^{m+1} (c_j)_+

(with the constraint x ∈ C absorbed into the domain of h).
Nonlinear Approximation
min_x  || c(x) ||,

where || · || is the ℓ1, ℓ2, or ℓ∞ norm, or the Huber function, for example.
Nonconvex Examples
Alternative to ℓ1 regularization where the penalty for large |x_i| is attenuated:

min_x  f(x) + τ |x|,      where  |x| := Σ_{i=1}^{n} ( 1 − e^{−|x_i|} ).
Prox-linear subproblem: for a parameter μ > 0,

PLS(x, μ):    min_d  h( c(x) + ∇c(x)^T d ) + (μ/2) |d|^2.
Special case (gradient projection for min_{x ∈ C} f(x)):

min_{d : x + d ∈ C}  ∇f(x)^T d + (μ/2) ||d||_2^2
Assumption: Prox-Regularity of h
Related Work
SLQP. An approach that uses a similar first-order step (with a different trust region, e.g. box-shaped) has been proposed for nonlinear programming / composite nonsmooth minimization [Fletcher, Sainz de la Maza, 1989], [Byrd et al., 2004], [Yuan, 1980s].
Proximal Point. Obtain step from
min_d  h( c(x_k + d) ) + (μ/2) ||d||_2^2.
The criticality condition

∂h(c̄) ∩ null( ∇c(x̄)* ) ≠ ∅

is strengthened to strict criticality:

ri ∂h(c̄) ∩ null( ∇c(x̄)* ) ≠ ∅,
At iteration k:

Find a local solution of PLS at x_k and the current μ that improves on d = 0;

efficiently project x_k + d onto the domain of h to get x_k^+ (require (x_k^+ − x_k) ≈ d).

Increase μ as necessary until a sufficient decrease test is satisfied:

h(c(x_k)) − h(c(x_k^+))  ≥  .01 [ h(c(x_k)) − h( c(x_k) + ∇c(x_k)^T d ) ]

Decrease μ (but enforce μ ≥ μ_min) in preparation for the next iteration.
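A rough sketch (mine) of this loop; solve_pls is an assumed user-supplied solver for the prox-linear subproblem PLS(x, μ), and the projection of x_k + d onto dom h is omitted for brevity:

def prox_descent(x, h, c, jac_c, solve_pls, mu=1.0, mu_min=1e-4, iters=50):
    """Sketch of the iteration above: solve PLS, test sufficient decrease, adjust mu.

    h        : outer prox-regular function, evaluated as h(c_vec)
    c, jac_c : smooth inner map and its Jacobian (linearization is c(x) + jac_c(x) @ d)
    solve_pls: returns d minimizing h(c(x) + jac_c(x) @ d) + (mu/2)*||d||^2
    """
    for _ in range(iters):
        cx, J = c(x), jac_c(x)
        while True:
            d = solve_pls(x, mu)
            model_decrease = h(cx) - h(cx + J @ d)
            actual_decrease = h(cx) - h(c(x + d))
            if actual_decrease >= 0.01 * model_decrease:
                break                      # sufficient decrease achieved
            mu *= 10.0                     # otherwise increase mu and re-solve
        x = x + d
        mu = max(mu / 2.0, mu_min)         # relax mu before the next iteration
    return x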
Convergence
Roughly:
The algorithm can step away from a non-stationary point: the solution of PLS(x, μ) is accepted for μ large enough.

Cannot have accumulation at noncritical points that are nice (i.e. where h is subdifferentially regular and transversality holds).
See paper for details.
Extensions
Conclusions