
Subgradient and Bundle Methods

for optimization of convex non-smooth functions

April 1, 2009



Motivation

Many naturally occurring problems are nonsmooth


Hinge loss
Feasible region of a convex minimization problem
Piecewise linear functions
A function that approximates a non-smooth function may be
analytically smooth, yet “numerically nonsmooth”



Methods for nonsmooth optimization

Approximate by a series of smooth functions


Reformulate the problem by adding constraints so that
the objective is smooth
Subgradient Methods
Cutting Plane Methods
Moreau-Yosida Regularization
Bundle Methods
UV decomposition



Definition
An extension of gradients

For a convex differentiable function f(x), ∀ x, y

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    (1)

So, a subgradient of f at x is defined as any g ∈ ℝⁿ such that ∀ y

f(y) ≥ f(x) + gᵀ(y − x)    (2)

The set of all subgradients of f at x is denoted ∂f(x)
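
As a quick sanity check (an illustration added here, not from the original slides), inequality (2) can be verified numerically for f(x) = |x|, whose subdifferential at 0 is the whole interval [−1, 1]:

import numpy as np

def f(x):
    return abs(x)

def a_subgradient(x):
    # sign(x) is the unique subgradient away from 0; at x = 0 any value in [-1, 1] works
    return np.sign(x) if x != 0 else 0.3   # 0.3 is an arbitrary valid choice at x = 0

ys = np.linspace(-2.0, 2.0, 401)
for x in (-1.0, 0.0, 0.5):
    g = a_subgradient(x)
    # inequality (2): f(y) >= f(x) + g*(y - x) must hold for every y
    assert all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)
print("subgradient inequality verified on a grid")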



Some Facts
From Convex Analysis

A convex function is always subdifferentiable, i.e. a
subgradient of a convex function exists at every point.
Directional derivatives also exist at every point.
If a convex function f is differentiable at x, its only subgradient
is the gradient at that point, i.e. ∂f(x) = {∇f(x)}
Subgradients give lower bounds on directional derivatives:
f′(x; d) = sup_{g ∈ ∂f(x)} ⟨g, d⟩
Further, d is a descent direction iff gᵀd < 0 ∀ g ∈ ∂f(x)



Properties
Without Proof

∂(f₁ + f₂)(x) = ∂f₁(x) + ∂f₂(x)
∂(αf)(x) = α ∂f(x) for α ≥ 0
g(x) = f(Ax + b) ⇒ ∂g(x) = Aᵀ ∂f(Ax + b)
x is a minimizer ⇔ 0 ∈ ∂f(x)
However, for f(x) = |x| an oracle returns the subgradient 0
only at x = 0, so this condition alone is not a practical way to find minima
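
As an illustration of the affine composition rule (my own example, not from the slides), one can check numerically that Aᵀs is a subgradient of g(x) = ‖Ax + b‖₁ whenever s is a subgradient of ‖·‖₁ at Ax + b:

import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(3, 2)), rng.normal(size=3)

f = lambda z: np.sum(np.abs(z))      # f(z) = ||z||_1
g = lambda x: f(A @ x + b)           # g(x) = f(Ax + b)

x = rng.normal(size=2)
s = np.sign(A @ x + b)               # s is a subgradient of f at Ax + b
h = A.T @ s                          # claimed subgradient of g at x (rule above)

# check g(y) >= g(x) + h^T (y - x) on random test points
ys = rng.normal(size=(1000, 2))
assert all(g(y) >= g(x) + h @ (y - x) - 1e-10 for y in ys)
print("chain-rule subgradient verified on random points")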



Subgradient Method
Algorithm

Subgradient Method is NOT a descent method!


x^(k+1) = x^(k) − α_k g^(k), where α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))
f_best^(k) = min{ f_best^(k−1), f(x^(k)) }
No line search is performed; the step sizes α_k are usually fixed
ahead of time
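
The update above is easy to state in code. A minimal Python sketch follows (an illustration, not part of the original slides); the objective ‖x‖₁, its subgradient oracle, and the 1/√k step-size schedule are choices made for the example:

import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=500):
    # Basic (non-descent) subgradient method: x <- x - alpha_k * g_k,
    # keeping track of the best value seen so far.
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = x - step(k) * g
        fx = f(x)
        if fx < f_best:                      # f_best^(k) = min{f_best^(k-1), f(x^(k))}
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize the nonsmooth f(x) = ||x||_1, whose minimum value is 0 at the origin
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)               # a valid subgradient of ||x||_1
x_best, f_best = subgradient_method(f, subgrad, x0=[3.0, -2.0],
                                    step=lambda k: 1.0 / np.sqrt(k))
print(x_best, f_best)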



Step Lengths

Commonly used step lengths


Constant step size: α_k = α
Constant step length: α_k = γ/‖g^(k)‖₂ (each step moves a distance γ)
Square summable but not summable step size:
α_k ≥ 0,  Σ_{k=1}^∞ α_k² < ∞,  Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing step size:
α_k ≥ 0,  lim_{k→∞} α_k = 0,  Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing step lengths: α_k = γ_k/‖g^(k)‖₂ with
γ_k ≥ 0,  lim_{k→∞} γ_k = 0,  Σ_{k=1}^∞ γ_k = ∞
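
To make the schedules concrete, here is a small sketch (my own illustration; the constants are arbitrary) expressing each rule as a function of the iteration counter k and the current subgradient g:

import numpy as np

def constant_step_size(k, g, alpha=0.1):
    return alpha                                   # alpha_k = alpha

def constant_step_length(k, g, gamma=0.1):
    return gamma / np.linalg.norm(g)               # moves a fixed distance gamma per step

def square_summable_not_summable(k, g, a=1.0):
    return a / k                                   # sum alpha_k^2 < inf, sum alpha_k = inf

def nonsummable_diminishing(k, g, a=1.0):
    return a / np.sqrt(k)                          # alpha_k -> 0, sum alpha_k = inf

def nonsummable_diminishing_length(k, g, a=1.0):
    return (a / np.sqrt(k)) / np.linalg.norm(g)    # gamma_k = a/sqrt(k)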



Convergence Result

Assume that ∃ G such that the norm of the subgradients is
bounded, i.e. ‖g^(k)‖₂ ≤ G
(for example, this holds if f is Lipschitz continuous)

Result:
f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

The proof works by bounding how ‖x^(k) − x*‖ changes from one iteration to the next
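
For completeness, the key step of the standard argument (written out here in LaTeX; this is the usual textbook derivation, not copied from the slides):

\[
\begin{aligned}
\|x^{(k+1)} - x^*\|_2^2
  &= \|x^{(k)} - \alpha_k g^{(k)} - x^*\|_2^2 \\
  &= \|x^{(k)} - x^*\|_2^2 - 2\alpha_k\, g^{(k)\top}(x^{(k)} - x^*) + \alpha_k^2 \|g^{(k)}\|_2^2 \\
  &\le \|x^{(k)} - x^*\|_2^2 - 2\alpha_k \bigl(f(x^{(k)}) - f^*\bigr) + \alpha_k^2 G^2 ,
\end{aligned}
\]
where the inequality uses $f^* = f(x^*) \ge f(x^{(k)}) + g^{(k)\top}(x^* - x^{(k)})$. Applying this
recursively for $i = 1, \dots, k$, dropping $\|x^{(k+1)} - x^*\|_2^2 \ge 0$, and using
$f(x^{(i)}) - f^* \ge f^{(k)}_{\mathrm{best}} - f^*$ yields the bound above.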



Convergence for Commonly used Step lengths

Constant step size: f_best^(k) converges to within G²h/2 of optimal
Constant step length: f_best^(k) converges to within Gh of optimal
Square summable but not summable step size: f_best^(k) → f*
Nonsummable diminishing step size: f_best^(k) → f*
Nonsummable diminishing step lengths: f_best^(k) → f*

With R = dist(x^(1), X*), the bound from the previous slide reads

f_best^(k) − f* ≤ ( R² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

So the optimal α_i are (R/G)/√k, and reaching accuracy ε takes (RG/ε)² steps



Variations

If the optimal value f* is known (e.g. known to be 0) but the minimizer is not, use the Polyak step size:
α_k = ( f(x^(k)) − f* ) / ‖g^(k)‖₂²
Projected subgradient: minimize f(x) s.t. x ∈ C
x^(k+1) = P( x^(k) − α_k g^(k) ), where P is the projection onto C
Alternating projections: find a point in the intersection of 2
convex sets
Heavy ball method:
x^(k+1) = x^(k) − α_k g^(k) + β_k ( x^(k) − x^(k−1) )
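
A minimal sketch of the projected subgradient variant, assuming the feasible set C is a box so that the projection P is just a componentwise clip (the objective and the box are my own choices for illustration, not from the slides):

import numpy as np

def projected_subgradient(f, subgrad, project, x0, step, iters=500):
    # Projected subgradient method: x <- P(x - alpha_k * g_k)
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = project(x - step(k) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: minimize ||x||_1 over the box C = [1, 2] x [-3, -1]
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)
project = lambda x: np.clip(x, [1.0, -3.0], [2.0, -1.0])       # projection onto the box
print(projected_subgradient(f, subgrad, project, x0=[2.0, -3.0],
                            step=lambda k: 1.0 / np.sqrt(k)))  # expect roughly (1, -1), value 2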



Pros
Can immediately be applied to a wide variety of problems,
especially when the required accuracy is not very high
Low memory usage
Often possible to design distributed methods if the objective
is decomposable
Cons
Slower than second-order methods



Cutting Plane Method

Again, consider the problem: minimize f(x) subject to x ∈ C
Construct an approximate model from the linearizations collected so far:
f̂(x) = max_{i∈I} ( f(x_i) + g_iᵀ(x − x_i) )
Minimize the model over x ∈ C to get the next iterate, then evaluate f and a subgradient g there
Update the model and repeat until the desired accuracy is reached
Numerically unstable
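
When C is a box, the model minimization is a small linear program. The sketch below is an illustration under my own choices of test function, box, and tolerance (it relies on scipy.optimize.linprog), implementing the loop just described:

import numpy as np
from scipy.optimize import linprog

def cutting_plane(f, subgrad, lo, hi, x0, tol=1e-6, max_iters=100):
    # Kelley's cutting-plane method over the box C = [lo, hi]^n.
    # The model minimization  min_{x in C} max_i f(x_i) + g_i^T (x - x_i)
    # becomes an LP in (x, t): minimize t s.t. g_i^T x - t <= g_i^T x_i - f(x_i).
    n = len(x0)
    pts, vals, grads = [np.asarray(x0, float)], [f(x0)], [np.asarray(subgrad(x0), float)]
    for _ in range(max_iters):
        c = np.concatenate([np.zeros(n), [1.0]])
        A = np.array([np.concatenate([g, [-1.0]]) for g in grads])
        b = np.array([g @ p - fv for g, p, fv in zip(grads, pts, vals)])
        res = linprog(c, A_ub=A, b_ub=b, bounds=[(lo, hi)] * n + [(None, None)])
        x_new, model_val = res.x[:n], res.x[-1]       # model_val is a lower bound on min f
        f_new = f(x_new)
        if f_new - model_val <= tol:                  # true value meets the lower bound
            return x_new, f_new
        pts.append(x_new); vals.append(f_new); grads.append(np.asarray(subgrad(x_new), float))
    return pts[int(np.argmin(vals))], min(vals)

# Example: minimize f(x) = ||x - (0.3, -0.2)||_1 over the box [-1, 1]^2
target = np.array([0.3, -0.2])
f = lambda x: np.sum(np.abs(np.asarray(x) - target))
subgrad = lambda x: np.sign(np.asarray(x) - target)
print(cutting_plane(f, subgrad, lo=-1.0, hi=1.0, x0=[1.0, 1.0]))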



Moreau-Yosida Regularization

Idea: solve a series of smooth convex problems to minimize f(x)

F(x) = min_{y ∈ ℝⁿ} [ f(y) + (λ/2) ‖y − x‖² ]

p(x) = argmin_{y ∈ ℝⁿ} [ f(y) + (λ/2) ‖y − x‖² ]

F(x) is differentiable!
∇F(x) = λ( x − p(x) )
The minimization is done using the dual
Cutting Plane Method + Moreau-Yosida Regularization =
Bundle Methods
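
As a concrete illustration (not from the slides): for f(x) = |x| the proximal point p(x) has a closed form, the soft-threshold, and the gradient formula ∇F(x) = λ(x − p(x)) can be checked against a finite difference:

import numpy as np

lam = 2.0                                    # regularization parameter lambda (arbitrary choice)

def f(y):
    return np.abs(y)

def prox(x):
    # p(x) = argmin_y |y| + (lam/2)(y - x)^2  is soft-thresholding at 1/lam
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def F(x):
    # Moreau-Yosida envelope F(x) = min_y f(y) + (lam/2)(y - x)^2
    p = prox(x)
    return f(p) + 0.5 * lam * (p - x) ** 2

x, h = 0.7, 1e-6
numeric_grad = (F(x + h) - F(x - h)) / (2 * h)
print(numeric_grad, lam * (x - prox(x)))     # the two values agree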



Elementary Bundle Method

As before, f is assumed to be Lipschitz continuous

At a generic iteration we maintain a “bundle” of tuples
⟨ y_i, f(y_i), s_i, α_i ⟩ (typically the trial points, their values, subgradients, and linearization errors)



Elementary Bundle Method

Follow the Cutting Plane Method, but add the Moreau-Yosida proximal term when minimizing the model

y^(k+1) = argmin_{y ∈ ℝⁿ} f̂_k(y) + (μ_k/2) ‖y − x̂^(k)‖²

δ_k = f(x̂^(k)) − [ f̂_k(y^(k+1)) + (μ_k/2) ‖y^(k+1) − x̂^(k)‖² ] ≥ 0

If δ_k < δ, stop
If f(x̂^(k)) − f(y^(k+1)) ≥ m δ_k:  Serious Step  x̂^(k+1) = y^(k+1)
else:  Null Step  x̂^(k+1) = x̂^(k)
f̂_{k+1}(y) = max{ f̂_k(y), f(y^(k+1)) + ⟨ s^(k+1), y − y^(k+1) ⟩ }
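
A toy Python sketch of this loop (my own illustration under simplifying assumptions: a one-dimensional problem so the proximal subproblem can be solved by bounded scalar minimization, a constant μ_k, and arbitrary constants m and δ):

import numpy as np
from scipy.optimize import minimize_scalar

def bundle_method(f, subgrad, x0, mu=1.0, m=0.1, tol=1e-6, max_iters=200, radius=10.0):
    # Elementary proximal bundle method for a convex function of one variable.
    ys, fs, ss = [x0], [f(x0)], [subgrad(x0)]     # the bundle: points, values, subgradients
    x_hat = x0
    for _ in range(max_iters):
        model = lambda y: max(fv + s * (y - yp) for yp, fv, s in zip(ys, fs, ss))
        prox_obj = lambda y: model(y) + 0.5 * mu * (y - x_hat) ** 2
        y_new = minimize_scalar(prox_obj, bounds=(x_hat - radius, x_hat + radius),
                                method='bounded').x
        delta = f(x_hat) - prox_obj(y_new)        # predicted decrease (>= 0 in exact arithmetic)
        if delta < tol:
            return x_hat, f(x_hat)
        if f(x_hat) - f(y_new) >= m * delta:
            x_hat = y_new                         # serious step: move the stability center
        # otherwise a null step: keep x_hat and only enrich the model
        ys.append(y_new); fs.append(f(y_new)); ss.append(subgrad(y_new))
    return x_hat, f(x_hat)

# Example: minimize f(x) = |x - 1| + 0.5|x + 2|, whose minimizer is x = 1
f = lambda x: abs(x - 1.0) + 0.5 * abs(x + 2.0)
subgrad = lambda x: np.sign(x - 1.0) + 0.5 * np.sign(x + 2.0)
print(bundle_method(f, subgrad, x0=5.0))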



Convergence

The algorithm either makes a finite number of Serious Steps and then only makes Null Steps:
then, if k₀ is the last Serious Step and μ_k is nondecreasing, δ_k → 0
Or it makes an infinite number of Serious Steps:
then Σ_{k ∈ K_s} δ_k ≤ ( f(x̂^(0)) − f* ) / m, so δ_k → 0



Variations

Replace ‖y − x‖² by (y − x)ᵀ M_k (y − x): F is still differentiable

Conjugate Gradient methods are obtained as a slight
modification of the algorithm (refer [5])
Variable Metric Methods [10]
M_k = u_k I for Diagonal Variable Metric Methods
Bundle-Newton Methods



Summary

Nonsmooth convex optimization has been explored since the
1960s. The original subgradient methods were introduced
by Naum Shor. Bundle methods have been developed
more recently.
Subgradient methods are simple but slow; their predominant
current application is in distributed settings.
Bundle methods solve a bounded QP at each iteration, which makes
each step slower, but they need fewer iterations. They are preferred
for applications where the oracle cost is high.



For Further Reading I

Naum Z. Shor
Minimization Methods for non-differentiable functions.
Springer-Verlag, 1985.
Boyd and Vandenberghe
Convex Optimization.
Cambridge University Press
A. Ruszczyński
Nonlinear Optimization
Princeton University Press
Wikipedia
en.wikipedia.org/wiki/Subgradient_method



For Further Reading II

Marko Makela
Survey of Bundle Methods, 2009
http://www.informaworld.com/smpp/
content~db=all~content=a713741700
Alexandre Belloni
An Introduction to Bundle Methods
http://web.mit.edu/belloni/www/
LecturesIntroBundle.pdf
John E. Mitchell
Cutting Plane and Subgradient Methods, 2005
http://www.optimization-online.org/DB_HTML/
2009/05/2298.html



For Further Reading III

Lecture Notes on Subgradient methods by Stephen Boyd


http://www.stanford.edu/class/ee392o/
subgrad_method.pdf
Alexander J. Smola, S.V. N. Vishwanathan, Quoc V. Le
Bundle Methods for Machine Learning, 2007
http://books.nips.cc/papers/files/nips20/
NIPS2007_0470.pdf
C. Lemaréchal, Variable metric bundle methods, 1997.
http://www.springerlink.com/index/
3515WK428153171N.pdf
Quoc Le, Alexander Smola
Direct Optimization of Ranking Measures, 2007
http://arxiv.org/abs/0704.3359



For Further Reading IV

S.V.N. Vishwanathan, A. Smola


Quasi-Newton Methods for Efficient Large-Scale Machine
Learning
http://portal.acm.org/ft_gateway.cfm?id=
1390309&type=pdf
and
www.stat.purdue.edu/~vishy/talks/LBFGS.pdf
