
Newton’s Method and Gradient Descent Method

As already stated, one of the basic tasks of optimization is to solve


\[
\min_{x \in S} f(x)
\]

where $S$ is a closed set in $\mathbb{R}^d$ and $f\colon \mathbb{R}^d \to \mathbb{R}$ a nice function. There are two general algorithms
we can use:
(1) Since the minimum will most likely also be a local minimum and thus a critical point, we
can look for the points where ∇ f = 0. This can be done by way of Newton’s method.
(2) We can address the minimization directly by “crawling” through S along a trajectory of points at
which f is gradually smaller. This is the basis of the gradient descent method.
We will now describe these methods in some detail.

NEWTON’S METHOD

Algorithm: This is a method for finding roots of a function f , i.e., points x̂ where f (x̂) = 0.
The best way to describe its algorithm is as follows: We pick a point x1 , find the tangent line
to y = f (x) at x = x1 , look for the intersection of the tangent with the x axis and call this point x2 .
Then we repeat this starting from x2 instead of x1 and so on.
This results in a sequence of points $\{x_n\}$ such that $x_{n+1}$ is the intersection of the tangent line
\[
y = f(x_n) + f'(x_n)(x - x_n)
\]
with the $x$-axis $y = 0$. A bit of algebra gives
\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},
\]

which is well defined whenever $f'(x_n) \neq 0$. Since we assume that we start close to a root $\hat x$, we may
guarantee this by requiring that $f'(\hat x) \neq 0$ and assuming that $f'$ is continuous (and thus non-zero
even in a neighborhood of $\hat x$).
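In code, the iteration is only a few lines. The following is a minimal Python sketch (the function name, tolerance, and iteration cap are choices made for this illustration, not part of the method):

```python
def newton_1d(f, fprime, x1, tol=1e-12, max_iter=50):
    """Run the Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x1
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to a root
            return x
        dfx = fprime(x)
        if dfx == 0:               # tangent is horizontal; the step is undefined
            raise ZeroDivisionError("f'(x_n) = 0; Newton step undefined")
        x = x - fx / dfx
    return x

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton_1d(lambda x: x**2 - 2, lambda x: 2*x, x1=1.0)
print(root)
```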
Convergence rate: It is worthwhile to study the convergence rate of the method. For this we
subtract x̂ on both sides of the above equation to get
\[
x_{n+1} - \hat x = x_n - \hat x - \frac{f(x_n)}{f'(x_n)} = -\,\frac{f(x_n) + (\hat x - x_n)\, f'(x_n)}{f'(x_n)}
\]
and so, assuming that $|f'(x_n)| \ge c_1 > 0$, we get
\[
|x_{n+1} - \hat x| \le \frac{1}{c_1}\,\bigl| f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr|.
\]
Taylor’s theorem with remainder gives us
\[
f(\hat x) - \bigl[ f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr] = \int_{x_n}^{\hat x} (\hat x - s)\, f''(s)\, ds.
\]
Assuming that $|f''(s)| \le c_2$ for $s$ between $\hat x$ and $x_n$, and using that $f(\hat x) = 0$, we get
\[
\bigl| f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr| \le c_2 \left| \int_{x_n}^{\hat x} (\hat x - s)\, ds \right| = \frac{c_2}{2}\, |x_n - \hat x|^2.
\]
Using this in the above equation we obtain
\[
|x_{n+1} - \hat x| \le C\, |x_n - \hat x|^2 \qquad \text{where } C := \frac{c_2}{2 c_1}.
\]
This is referred to as quadratic convergence, although as we will show next, the decay of $|x_n - \hat x|$ is
in fact doubly exponential.
Decay estimate: In analyzing the consequence of this recursive bound, note that
\[
C\,|x_1 - \hat x| < 1 \;\Rightarrow\; |x_{n+1} - \hat x| < |x_n - \hat x| \text{ and so } C\,|x_n - \hat x| < 1 \text{ for all } n.
\]
So starting the iterations anywhere in the set $\{x \colon |x - \hat x| < 1/C\}$, the sequence will never leave
this set. Now if $|x_1 - \hat x| < 1/C$, then there is $\kappa > 0$ such that $|x_1 - \hat x| = \frac{1}{C} e^{-2\kappa}$. By induction we
then show
\[
|x_1 - \hat x| = \frac{1}{C}\, e^{-2\kappa} \;\Rightarrow\; |x_n - \hat x| \le \frac{1}{C}\, e^{-2^n \kappa}.
\]
This is an extremely fast approach to the limit. Indeed, if, for instance, $C = 1$ and $\kappa = 1/2$, then
$|x_2 - \hat x| \le e^{-2} \approx 0.1$, $|x_4 - \hat x| \le e^{-8} \approx 0.0003$ and $|x_6 - \hat x| \le e^{-32} \approx 10^{-14}$, etc.
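This doubling of the number of correct digits is easy to observe numerically. The sketch below (again for $f(x) = x^2 - 2$, with the root $\sqrt 2$ assumed known only so that the error can be printed) shows the error roughly squaring at each step:

```python
import math

f = lambda x: x**2 - 2
fprime = lambda x: 2*x
root = math.sqrt(2)

x = 1.0
for n in range(1, 7):
    print(f"n = {n}, |x_n - root| = {abs(x - root):.3e}")
    x = x - f(x) / fprime(x)   # Newton step
```

The printed errors decrease roughly as $0.4,\ 0.09,\ 0.002,\ 2\cdot 10^{-6},\ 2\cdot 10^{-12}$, i.e., each is about the square of the previous one, consistent with the bound above.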
Failure if started too far: The method understandably fails if $x_1$ is too far from the root $\hat x$. An
example for this is $f(x) = \tanh(x)$, which has a unique root at $x = 0$. Since $f'(x) = \cosh(x)^{-2}$, the
iterations then correspond to
\[
x_{n+1} = x_n - \tfrac{1}{2}\sinh(2 x_n).
\]
As $\tfrac{1}{2}\sinh(2 x_n) \ge x_n$ for $x_n > 0$ (and the opposite inequality holds for $x_n < 0$), the signs of $x_n$ alternate.
For the sequence of absolute values we then get
\[
|x_{n+1}| = \tfrac{1}{2}\sinh(2 |x_n|) - |x_n|.
\]
Letting $h(a) := \tfrac{1}{2}\sinh(2a) - a$, we have $|x_{n+1}| = h(|x_n|)$. The equation $h(a) = a$ has two non-negative
solutions: $a = 0$ and $a = a_\star > 0$ such that $\sinh(2 a_\star) = 4 a_\star$. The graphical analysis of the
trajectory shows that
\[
|x_1| < a_\star \;\Rightarrow\; x_n \to 0 \quad \text{(the method does find the root)},
\]
\[
|x_1| > a_\star \;\Rightarrow\; |x_n| \to \infty \quad \text{(the method fails)}.
\]
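The following sketch illustrates this dichotomy numerically; $a_\star$ is located by a simple bisection (the helper names, step counts, and the overflow guard at $|x| > 20$ are ad hoc choices made for this illustration):

```python
import math

def newton_tanh(x1, n_steps=8):
    """Newton iteration for f(x) = tanh(x): x_{n+1} = x_n - sinh(2*x_n)/2."""
    x = x1
    for _ in range(n_steps):
        if abs(x) > 20:           # sinh(2x) would overflow double precision here
            return float("inf")   # the iteration has escaped to infinity
        x = x - 0.5 * math.sinh(2 * x)
    return x

# Locate a_star, the positive solution of sinh(2a) = 4a, by bisection.
g = lambda a: math.sinh(2 * a) - 4 * a
lo, hi = 0.5, 2.0                 # g(0.5) < 0 < g(2.0)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
a_star = 0.5 * (lo + hi)

print(newton_tanh(0.9 * a_star))  # converges toward the root x = 0
print(newton_tanh(1.1 * a_star))  # |x_n| blows up; the method fails
```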

Multivariate version: We have so far only addressed one function of one variable. The application
(finding critical points in unconstrained optimization) will require finding a common root
of $d$ functions of $d$ variables,
\[
f_i(\hat x) = 0, \qquad i = 1, \dots, d.
\]

(These functions will themselves be components of the gradient of the function we are minimizing.)
Starting from a point $x_n \in \mathbb{R}^d$, we thus get $d$ tangent hyperplanes,
\[
y_i = f_i(x_n) + (x - x_n) \cdot \nabla f_i(x_n),
\]
that intersect the set where $y_i = 0$ for all $i = 1, \dots, d$ at the point $x_{n+1} \in \mathbb{R}^d$ with equations
\[
0 = f_i(x_n) + (x_{n+1} - x_n) \cdot \nabla f_i(x_n), \qquad i = 1, \dots, d,
\]
or
\[
(x_{n+1} - x_n) \cdot \nabla f_i(x_n) = - f_i(x_n), \qquad i = 1, \dots, d.
\]
We can write these as a matrix equation: Let $\nabla \vec f(x)$ be the matrix whose $i$-th column is the
vector $\nabla f_i(x)$ and let $\vec f(x)$ be the column vector whose $i$-th entry is $f_i(x)$. Then
\[
(x_{n+1} - x_n)^T\, \nabla \vec f(x_n) = -\vec f(x_n)^T.
\]
Multiplying on the right by the inverse of $\nabla \vec f(x_n)$, we get
\[
(x_{n+1} - x_n)^T = -\vec f(x_n)^T \bigl( \nabla \vec f(x_n) \bigr)^{-1}.
\]
Taking transposes on both sides, this reads
\[
x_{n+1} = x_n - \bigl( \nabla \vec f(x_n) \bigr)^{-T} \vec f(x_n).
\]
This is the recursion corresponding to the multivariate Newton’s method. Word of advice: Derive
this in each problem again to make sure all transposes are done right.
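In code it is often safest to work with the Jacobian matrix $J(x)$, whose $(i,j)$ entry is $\partial f_i / \partial x_j$ (i.e., the transpose of $\nabla \vec f(x)$ above), and to solve a linear system rather than form an inverse. A minimal numpy sketch, with an example system chosen purely for illustration:

```python
import numpy as np

def newton_multivariate(f, jacobian, x1, tol=1e-12, max_iter=50):
    """Solve f(x) = 0 in R^d: at each step solve J(x_n) * delta = -f(x_n)."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            return x
        delta = np.linalg.solve(jacobian(x), -fx)   # Newton step
        x = x + delta
    return x

# Example: f1(x, y) = x^2 + y^2 - 1, f2(x, y) = x - y (circle intersected with a line)
f = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])
print(newton_multivariate(f, J, [1.0, 0.5]))   # approx (1/sqrt(2), 1/sqrt(2))
```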

GRADIENT DESCENT METHOD

Algorithm: The goal here is to address directly the process of minimizing function f . We
will only discuss the unconstrained version, where all directions are feasible. As −∇ f (x) is the
direction of steepest descent of f at x, we set
\[
x_{n+1} = x_n - h\, \nabla f(x_n)
\]
for some h > 0.
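A minimal sketch of the iteration (the stopping rule is simply a fixed number of steps, and the example function, step length, and starting point are arbitrary choices made for this illustration):

```python
import numpy as np

def gradient_descent(grad_f, x1, h, n_steps=100):
    """Iterate x_{n+1} = x_n - h * grad f(x_n) with a fixed step length h."""
    x = np.asarray(x1, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_f(x)
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 3*(y + 2)^2; the minimizer is (1, -2)
grad_f = lambda v: np.array([2*(v[0] - 1.0), 6*(v[1] + 2.0)])
print(gradient_descent(grad_f, x1=[0.0, 0.0], h=0.1))
```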

Picking the step length: We chose the step length h to be independent of n, although one can
play with other choices as well. The question is how to select h so that the method makes the best
gain per step. Let us discuss this first in the case of a function of a single variable. Then
\[
x_{n+1} = x_n - h\, f'(x_n)
\]
and so $f(x_{n+1}) = f\bigl(x_n - h f'(x_n)\bigr)$. To turn the right-hand side into a more manageable form, we
again invoke Taylor’s theorem:
\[
f(x + t) = f(x) + t f'(x) + \int_x^{x+t} (x + t - s)\, f''(s)\, ds.
\]
Assuming that $f''(s) \le L$, this gives us
\[
f(x + t) \le f(x) + t f'(x) + L\, \frac{t^2}{2}.
\]
Using this for $x = x_n$ and $t = -h f'(x_n)$, we thus get
\[
f(x_{n+1}) = f\bigl(x_n - h f'(x_n)\bigr)
\le f(x_n) - h f'(x_n)\, f'(x_n) + \frac{L}{2} \bigl( h f'(x_n) \bigr)^2
= f(x_n) - \bigl[ f'(x_n) \bigr]^2 \Bigl( h - \frac{L}{2} h^2 \Bigr).
\]
The gain from the method will be best if $h - \frac{L}{2} h^2$ is maximal. This happens at the point
\[
h = \frac{1}{L}.
\]
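As a quick sanity check of this choice, for the one-dimensional quadratic $f(x) = \frac{L}{2} x^2$ we have $f'(x) = Lx$ and $f'' \equiv L$, and the step $h = 1/L$ lands on the minimizer in a single update (a special feature of quadratics, shown here only to illustrate the rule; the constants are arbitrary):

```python
# f(x) = (L/2) * x**2 has f'(x) = L*x and constant second derivative L,
# so the optimal fixed step is h = 1/L; one update already reaches x = 0.
L = 5.0
fprime = lambda x: L * x
x = 3.7
x = x - (1.0 / L) * fprime(x)
print(x)   # 0.0
```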
In the situation when $f$ is a function of many variables, the same derivation applies except that
$f'(x_n)$ has to be replaced by $\nabla f(x_n)$ and $L$ by
\[
L := \sup_{x \colon f(x) \le f(x_1)} \;\; \sup_{v \in \mathbb{R}^d \setminus \{0\}} \frac{v^T\, \mathrm{Hess} f(x)\, v}{|v|^2},
\]
where $|v|$ is the Euclidean length of the vector $v$. The supremum over $v$ equals the largest
eigenvalue of the Hessian. The supremum over $x$ can be restricted to the set where $f(x) \le f(x_1)$
because $f$ decreases along each linear segment connecting $x_n$ to $x_{n+1}$, so the iterates never leave this set.
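For a quadratic $f(x) = \frac{1}{2} x^T A x$ with $A$ symmetric positive definite, the Hessian equals $A$ everywhere, so $L$ is just the largest eigenvalue of $A$. A small numpy sketch of gradient descent with $h = 1/L$ (the matrix, starting point, and step count are arbitrary choices made for this illustration):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])            # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x          # f(x) = (1/2) x^T A x, minimized at x = 0
grad_f = lambda x: A @ x               # for this quadratic, Hess f(x) = A everywhere

L = np.linalg.eigvalsh(A).max()        # largest eigenvalue of the Hessian
h = 1.0 / L

x = np.array([2.0, -1.5])
for _ in range(50):
    x = x - h * grad_f(x)
print(x, f(x))                         # x is close to the minimizer (0, 0)
```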

Remarks on convergence: It is obvious that the method defines a sequence of points $\{x_n\}$ along
which $f(x_n)$ decreases. If $f$ is bounded from below, then $f(x_n)$, being non-increasing, converges.
But the above derivation (we use for convenience only the one-dimensional version)
shows
\[
f(x_{n+1}) \le f(x_n) - \frac{1}{2L} \bigl[ f'(x_n) \bigr]^2
\]
or, after some algebra,
\[
\bigl[ f'(x_n) \bigr]^2 \le 2L \bigl( f(x_n) - f(x_{n+1}) \bigr).
\]
Since $f(x_n) - f(x_{n+1}) \to 0$, also $f'(x_n) \to 0$. Now if the level sets of $f$ are also bounded, $\{x_n\}$
contains a subsequence that converges to a point $\hat x$. By continuity of $f'$, we then have $f'(\hat x) = 0$,
i.e., $\hat x$ is a critical point. The method thus generally finds a critical point, but that could still be
merely a local minimum, or even a saddle point, rather than the global minimum. Which it is cannot be decided at this level of generality.
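As a small illustration of this last point, the following sketch runs gradient descent on $f(x, y) = x^2 - y^2$, which has a saddle at the origin and no minimum at all; started on the stable axis of the saddle, the iterates converge to the critical point $(0, 0)$. (The function, step length, and starting point are deliberately artificial choices, and the bounded-level-set assumption above does not even hold here; the example only shows that the limit of the iterates need not be a minimum.)

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle point at the origin and no minimum at all.
grad_f = lambda v: np.array([2*v[0], -2*v[1]])

x = np.array([1.0, 0.0])    # starting exactly on the "stable" axis of the saddle
for _ in range(200):
    x = x - 0.25 * grad_f(x)
print(x)                     # approaches (0, 0), where grad f = 0 but f has no minimum
```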
