
Subgradient and Bundle Methods

for optimization of convex non-smooth functions

April 1, 2009



Motivation

Many naturally occurring problems are nonsmooth


Hinge loss
Feasible region of a convex minimization problem
Piecewise linear functions
A function that approximates a non-smooth function may be
analytically smooth, yet “numerically nonsmooth”



Methods for nonsmooth optimization

Approximate by a series of smooth functions


Reformulate the problem by adding constraints so that
the objective is smooth
Subgradient Methods
Cutting Plane Methods
Moreau-Yosida Regularization
Bundle Methods
UV decomposition



Definition
An extension of gradients

For a convex differentiable function f(x), ∀ x, y

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    (1)

So, a subgradient of f at x is defined as any g ∈ ℝⁿ such that ∀ y

f(y) ≥ f(x) + gᵀ(y − x)    (2)

The set of all subgradients of f at x is denoted ∂f(x)
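
As a quick sanity check (an illustration added here, not from the original slides), inequality (2) can be verified numerically for f(x) = |x|, whose subdifferential at 0 is the whole interval [−1, 1]:

import numpy as np

def f(x):
    return abs(x)

def a_subgradient(x):
    # sign(x) is the unique subgradient away from 0; at x = 0 any value in [-1, 1] works
    return np.sign(x) if x != 0 else 0.3   # 0.3 is an arbitrary valid choice at x = 0

ys = np.linspace(-2.0, 2.0, 401)
for x in (-1.0, 0.0, 0.5):
    g = a_subgradient(x)
    # inequality (2): f(y) >= f(x) + g*(y - x) must hold for every y
    assert all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in ys)
print("subgradient inequality verified on a grid")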



Some Facts
From Convex Analysis

A convex function is always subdifferentiable, i.e. a
subgradient of a convex function exists at every point.
Directional derivatives also exist at every point.
If a convex function f is differentiable at x, its only subgradient
is the gradient at that point, i.e. ∂f(x) = {∇f(x)}
Subgradients give lower bounds on directional derivatives:
f′(x; d) = sup_{g ∈ ∂f(x)} ⟨g, d⟩
Further, d is a descent direction iff gᵀd < 0 ∀ g ∈ ∂f(x)



Properties
Without Proof

∂(f₁ + f₂)(x) = ∂f₁(x) + ∂f₂(x)
∂(αf)(x) = α ∂f(x) for α ≥ 0
g(x) = f(Ax + b) ⇒ ∂g(x) = Aᵀ ∂f(Ax + b)
x is a minimizer ⇔ 0 ∈ ∂f(x)
However, for f(x) = |x| an oracle returns the subgradient 0
only at x = 0, so this condition alone is not a practical way to find minima
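
As an illustration of the affine composition rule (my own example, not from the slides), one can check numerically that Aᵀs is a subgradient of g(x) = ‖Ax + b‖₁ whenever s is a subgradient of ‖·‖₁ at Ax + b:

import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(3, 2)), rng.normal(size=3)

f = lambda z: np.sum(np.abs(z))      # f(z) = ||z||_1
g = lambda x: f(A @ x + b)           # g(x) = f(Ax + b)

x = rng.normal(size=2)
s = np.sign(A @ x + b)               # s is a subgradient of f at Ax + b
h = A.T @ s                          # claimed subgradient of g at x (rule above)

# check g(y) >= g(x) + h^T (y - x) on random test points
ys = rng.normal(size=(1000, 2))
assert all(g(y) >= g(x) + h @ (y - x) - 1e-10 for y in ys)
print("chain-rule subgradient verified on random points")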



Subgradient Method
Algorithm

Subgradient Method is NOT a descent method!


x^(k+1) = x^(k) − α_k g^(k), where α_k ≥ 0 and g^(k) ∈ ∂f(x^(k))
f_best^(k) = min{ f_best^(k−1), f(x^(k)) }
No line search is performed; the step sizes α_k are usually fixed
ahead of time
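
The update above is easy to state in code. A minimal Python sketch follows (an illustration, not part of the original slides); the objective ‖x‖₁, its subgradient oracle, and the 1/√k step-size schedule are choices made for the example:

import numpy as np

def subgradient_method(f, subgrad, x0, step, iters=500):
    # Basic (non-descent) subgradient method: x <- x - alpha_k * g_k,
    # keeping track of the best value seen so far.
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        x = x - step(k) * g
        fx = f(x)
        if fx < f_best:                      # f_best^(k) = min{f_best^(k-1), f(x^(k))}
            x_best, f_best = x.copy(), fx
    return x_best, f_best

# Example: minimize the nonsmooth f(x) = ||x||_1, whose minimum value is 0 at the origin
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)               # a valid subgradient of ||x||_1
x_best, f_best = subgradient_method(f, subgrad, x0=[3.0, -2.0],
                                    step=lambda k: 1.0 / np.sqrt(k))
print(x_best, f_best)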



Step Lengths

Commonly used step lengths


Constant step size: α_k = α
Constant step length: α_k = γ/‖g^(k)‖₂ (each step moves a distance γ)
Square summable but not summable step size:
α_k ≥ 0,  Σ_{k=1}^∞ α_k² < ∞,  Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing step size:
α_k ≥ 0,  lim_{k→∞} α_k = 0,  Σ_{k=1}^∞ α_k = ∞
Nonsummable diminishing step lengths: α_k = γ_k/‖g^(k)‖₂ with
γ_k ≥ 0,  lim_{k→∞} γ_k = 0,  Σ_{k=1}^∞ γ_k = ∞
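
To make the schedules concrete, here is a small sketch (my own illustration; the constants are arbitrary) expressing each rule as a function of the iteration counter k and the current subgradient g:

import numpy as np

def constant_step_size(k, g, alpha=0.1):
    return alpha                                   # alpha_k = alpha

def constant_step_length(k, g, gamma=0.1):
    return gamma / np.linalg.norm(g)               # moves a fixed distance gamma per step

def square_summable_not_summable(k, g, a=1.0):
    return a / k                                   # sum alpha_k^2 < inf, sum alpha_k = inf

def nonsummable_diminishing(k, g, a=1.0):
    return a / np.sqrt(k)                          # alpha_k -> 0, sum alpha_k = inf

def nonsummable_diminishing_length(k, g, a=1.0):
    return (a / np.sqrt(k)) / np.linalg.norm(g)    # gamma_k = a/sqrt(k)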



Convergence Result

Assume that ∃ G such that the norm of the subgradients is
bounded, i.e. ‖g^(k)‖₂ ≤ G
(for example, this holds if f is Lipschitz continuous)

Result:
f_best^(k) − f* ≤ ( dist(x^(1), X*)² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

The proof works by bounding how ‖x^(k) − x*‖ changes from one iteration to the next
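
For completeness, the key step of the standard argument (written out here in LaTeX; this is the usual textbook derivation, not copied from the slides):

\[
\begin{aligned}
\|x^{(k+1)} - x^*\|_2^2
  &= \|x^{(k)} - \alpha_k g^{(k)} - x^*\|_2^2 \\
  &= \|x^{(k)} - x^*\|_2^2 - 2\alpha_k\, g^{(k)\top}(x^{(k)} - x^*) + \alpha_k^2 \|g^{(k)}\|_2^2 \\
  &\le \|x^{(k)} - x^*\|_2^2 - 2\alpha_k \bigl(f(x^{(k)}) - f^*\bigr) + \alpha_k^2 G^2 ,
\end{aligned}
\]
where the inequality uses $f^* = f(x^*) \ge f(x^{(k)}) + g^{(k)\top}(x^* - x^{(k)})$. Applying this
recursively for $i = 1, \dots, k$, dropping $\|x^{(k+1)} - x^*\|_2^2 \ge 0$, and using
$f(x^{(i)}) - f^* \ge f^{(k)}_{\mathrm{best}} - f^*$ yields the bound above.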



Convergence for Commonly used Step lengths

Constant step size: f_best^(k) converges to within G²h/2 of optimal
Constant step length: f_best^(k) converges to within Gh of optimal
Square summable but not summable step size: f_best^(k) → f*
Nonsummable diminishing step size: f_best^(k) → f*
Nonsummable diminishing step lengths: f_best^(k) → f*

With R = dist(x^(1), X*), the bound from the previous slide reads

f_best^(k) − f* ≤ ( R² + G² Σ_{i=1}^k α_i² ) / ( 2 Σ_{i=1}^k α_i )

So the optimal α_i are (R/G)/√k, and reaching accuracy ε takes (RG/ε)² steps



Variations

If the optimal value f* is known (e.g. known to be 0) but the minimizer is not, use the Polyak step size:
α_k = ( f(x^(k)) − f* ) / ‖g^(k)‖₂²
Projected subgradient: minimize f(x) s.t. x ∈ C
x^(k+1) = P( x^(k) − α_k g^(k) ), where P is the projection onto C
Alternating projections: find a point in the intersection of 2
convex sets
Heavy ball method:
x^(k+1) = x^(k) − α_k g^(k) + β_k ( x^(k) − x^(k−1) )
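
A minimal sketch of the projected subgradient variant, assuming the feasible set C is a box so that the projection P is just a componentwise clip (the objective and the box are my own choices for illustration, not from the slides):

import numpy as np

def projected_subgradient(f, subgrad, project, x0, step, iters=500):
    # Projected subgradient method: x <- P(x - alpha_k * g_k)
    x = np.asarray(x0, dtype=float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, iters + 1):
        x = project(x - step(k) * subgrad(x))
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

# Example: minimize ||x||_1 over the box C = [1, 2] x [-3, -1]
f = lambda x: np.sum(np.abs(x))
subgrad = lambda x: np.sign(x)
project = lambda x: np.clip(x, [1.0, -3.0], [2.0, -1.0])       # projection onto the box
print(projected_subgradient(f, subgrad, project, x0=[2.0, -3.0],
                            step=lambda k: 1.0 / np.sqrt(k)))  # expect roughly (1, -1), value 2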



Pros
Can immediately be applied to a wide variety of problems,
especially when the required accuracy is not very high
Low memory usage
Often possible to design distributed methods if the objective
is decomposable
Cons
Slower than second-order methods



Cutting Plane Method

Again, consider the problem: minimize f(x) subject to x ∈ C
Construct an approximate model from the linearizations collected so far:
f̂(x) = max_{i∈I} ( f(x_i) + g_iᵀ(x − x_i) )
Minimize the model over x ∈ C to get the next iterate, then evaluate f and a subgradient g there
Update the model and repeat until the desired accuracy is reached
Numerically unstable
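
When C is a box, the model minimization is a small linear program. The sketch below is an illustration under my own choices of test function, box, and tolerance (it relies on scipy.optimize.linprog), implementing the loop just described:

import numpy as np
from scipy.optimize import linprog

def cutting_plane(f, subgrad, lo, hi, x0, tol=1e-6, max_iters=100):
    # Kelley's cutting-plane method over the box C = [lo, hi]^n.
    # The model minimization  min_{x in C} max_i f(x_i) + g_i^T (x - x_i)
    # becomes an LP in (x, t): minimize t s.t. g_i^T x - t <= g_i^T x_i - f(x_i).
    n = len(x0)
    pts, vals, grads = [np.asarray(x0, float)], [f(x0)], [np.asarray(subgrad(x0), float)]
    for _ in range(max_iters):
        c = np.concatenate([np.zeros(n), [1.0]])
        A = np.array([np.concatenate([g, [-1.0]]) for g in grads])
        b = np.array([g @ p - fv for g, p, fv in zip(grads, pts, vals)])
        res = linprog(c, A_ub=A, b_ub=b, bounds=[(lo, hi)] * n + [(None, None)])
        x_new, model_val = res.x[:n], res.x[-1]       # model_val is a lower bound on min f
        f_new = f(x_new)
        if f_new - model_val <= tol:                  # true value meets the lower bound
            return x_new, f_new
        pts.append(x_new); vals.append(f_new); grads.append(np.asarray(subgrad(x_new), float))
    return pts[int(np.argmin(vals))], min(vals)

# Example: minimize f(x) = ||x - (0.3, -0.2)||_1 over the box [-1, 1]^2
target = np.array([0.3, -0.2])
f = lambda x: np.sum(np.abs(np.asarray(x) - target))
subgrad = lambda x: np.sign(np.asarray(x) - target)
print(cutting_plane(f, subgrad, lo=-1.0, hi=1.0, x0=[1.0, 1.0]))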



Moreau-Yosida Regularization

Idea: solve a series of smooth convex problems to minimize f(x)

F(x) = min_{y ∈ ℝⁿ} [ f(y) + (λ/2) ‖y − x‖² ]

p(x) = argmin_{y ∈ ℝⁿ} [ f(y) + (λ/2) ‖y − x‖² ]

F(x) is differentiable!
∇F(x) = λ( x − p(x) )
The minimization is done using the dual
Cutting Plane Method + Moreau-Yosida Regularization =
Bundle Methods
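
As a concrete illustration (not from the slides): for f(x) = |x| the proximal point p(x) has a closed form, the soft-threshold, and the gradient formula ∇F(x) = λ(x − p(x)) can be checked against a finite difference:

import numpy as np

lam = 2.0                                    # regularization parameter lambda (arbitrary choice)

def f(y):
    return np.abs(y)

def prox(x):
    # p(x) = argmin_y |y| + (lam/2)(y - x)^2  is soft-thresholding at 1/lam
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def F(x):
    # Moreau-Yosida envelope F(x) = min_y f(y) + (lam/2)(y - x)^2
    p = prox(x)
    return f(p) + 0.5 * lam * (p - x) ** 2

x, h = 0.7, 1e-6
numeric_grad = (F(x + h) - F(x - h)) / (2 * h)
print(numeric_grad, lam * (x - prox(x)))     # the two values agree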



Elementary Bundle Method

As before, f is assumed to be Lipschitz continuous

At a generic iteration we maintain a “bundle” of tuples
⟨ y_i, f(y_i), s_i, α_i ⟩ (typically the trial points, their values, subgradients, and linearization errors)



Elementary Bundle Method

Follow the Cutting Plane Method, but add the Moreau-Yosida proximal term when minimizing the model

y^(k+1) = argmin_{y ∈ ℝⁿ} f̂_k(y) + (μ_k/2) ‖y − x̂^(k)‖²

δ_k = f(x̂^(k)) − [ f̂_k(y^(k+1)) + (μ_k/2) ‖y^(k+1) − x̂^(k)‖² ] ≥ 0

If δ_k < δ, stop
If f(x̂^(k)) − f(y^(k+1)) ≥ m δ_k:  Serious Step  x̂^(k+1) = y^(k+1)
else:  Null Step  x̂^(k+1) = x̂^(k)
f̂_{k+1}(y) = max{ f̂_k(y), f(y^(k+1)) + ⟨ s^(k+1), y − y^(k+1) ⟩ }
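
A toy Python sketch of this loop (my own illustration under simplifying assumptions: a one-dimensional problem so the proximal subproblem can be solved by bounded scalar minimization, a constant μ_k, and arbitrary constants m and δ):

import numpy as np
from scipy.optimize import minimize_scalar

def bundle_method(f, subgrad, x0, mu=1.0, m=0.1, tol=1e-6, max_iters=200, radius=10.0):
    # Elementary proximal bundle method for a convex function of one variable.
    ys, fs, ss = [x0], [f(x0)], [subgrad(x0)]     # the bundle: points, values, subgradients
    x_hat = x0
    for _ in range(max_iters):
        model = lambda y: max(fv + s * (y - yp) for yp, fv, s in zip(ys, fs, ss))
        prox_obj = lambda y: model(y) + 0.5 * mu * (y - x_hat) ** 2
        y_new = minimize_scalar(prox_obj, bounds=(x_hat - radius, x_hat + radius),
                                method='bounded').x
        delta = f(x_hat) - prox_obj(y_new)        # predicted decrease (>= 0 in exact arithmetic)
        if delta < tol:
            return x_hat, f(x_hat)
        if f(x_hat) - f(y_new) >= m * delta:
            x_hat = y_new                         # serious step: move the stability center
        # otherwise a null step: keep x_hat and only enrich the model
        ys.append(y_new); fs.append(f(y_new)); ss.append(subgrad(y_new))
    return x_hat, f(x_hat)

# Example: minimize f(x) = |x - 1| + 0.5|x + 2|, whose minimizer is x = 1
f = lambda x: abs(x - 1.0) + 0.5 * abs(x + 2.0)
subgrad = lambda x: np.sign(x - 1.0) + 0.5 * np.sign(x + 2.0)
print(bundle_method(f, subgrad, x0=5.0))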



Convergence

The algorithm either makes a finite number of Serious Steps and then only makes Null Steps:
then, if k₀ is the last Serious Step and μ_k is nondecreasing, δ_k → 0
Or it makes an infinite number of Serious Steps:
then Σ_{k ∈ K_s} δ_k ≤ ( f(x̂^(0)) − f* ) / m, so δ_k → 0



Variations

Replace ‖y − x‖² by (y − x)ᵀ M_k (y − x): F is still differentiable

Conjugate Gradient methods are obtained as a slight
modification of the algorithm (refer [5])
Variable Metric Methods [10]
M_k = u_k I for Diagonal Variable Metric Methods
Bundle-Newton Methods



Summary

Nonsmooth convex optimization has been explored since the
1960s. The original subgradient methods were introduced
by Naum Shor. Bundle methods have been developed
more recently.
Subgradient methods are simple but slow; their predominant
current application is in distributed settings.
Bundle methods solve a bounded QP at each iteration, which makes
each step slower, but they need fewer iterations. They are preferred
for applications where the oracle cost is high.



For Further Reading I

Naum Z. Shor
Minimization Methods for non-differentiable functions.
Springer-Verlag, 1985.
Boyd and Vandenberghe
Convex Optimization.
Cambridge University Press
A. Ruszczyński
Nonlinear Optimization
Princeton University Press
Wikipedia
en.wikipedia.org/wiki/Subgradient_method



For Further Reading II

Marko Makela
Survey of Bundle Methods, 2009
http://www.informaworld.com/smpp/
content~db=all~content=a713741700
Alexandre Belloni
An Introduction to Bundle Methods
http://web.mit.edu/belloni/www/
LecturesIntroBundle.pdf
John E. Mitchell
Cutting Plane and Subgradient Methods, 2005
http://www.optimization-online.org/DB_HTML/
2009/05/2298.html



For Further Reading III

Lecture Notes on Subgradient methods by Stephen Boyd


http://www.stanford.edu/class/ee392o/
subgrad_method.pdf
Alexander J. Smola, S.V. N. Vishwanathan, Quoc V. Le
Bundle Methods for Machine Learning, 2007
http://books.nips.cc/papers/files/nips20/
NIPS2007_0470.pdf
C. Lemaréchal, Variable metric bundle methods, 1997.
http://www.springerlink.com/index/
3515WK428153171N.pdf
Quoc Le, Alexander Smola
Direct Optimization of Ranking Measures, 2007
http://arxiv.org/abs/0704.3359



For Further Reading IV

S.V.N. Vishwanathan, A. Smola


Quasi-Newton Methods for Efficient Large-Scale Machine
Learning
http://portal.acm.org/ft_gateway.cfm?id=
1390309&type=pdf
and
www.stat.purdue.edu/~vishy/talks/LBFGS.pdf
