
Newton’s Method and Gradient Descent Method

As already stated, one of the basic tasks of optimization is to solve


\[
\min_{x \in S} f(x)
\]

where $S$ is a closed set in $\mathbb{R}^d$ and $f\colon \mathbb{R}^d \to \mathbb{R}$ a nice function. There are two general algorithms
we can use:
(1) Since the minimum will most likely also be a local minimum and thus a critical point, we
can look for the points where ∇ f = 0. This can be done by way of Newton’s method.
(2) We can address the minimization directly by “crawling” through S along a trajectory of points at
which f is gradually smaller. This is the basis of the gradient descent method.
We will now describe these methods in some detail.

NEWTON’S METHOD

Algorithm: This is a method for finding roots of a function f , i.e., points x̂ where f (x̂) = 0.
The best way to describe its algorithm is as follows: We pick a point x1 , find the tangent line
to y = f (x) at x = x1 , look for the intersection of the tangent with the x axis and call this point x2 .
Then we repeat this starting from x2 instead of x1 and so on.
This results in a sequence of points $\{x_n\}$ such that $x_{n+1}$ is the intersection of the tangent line
\[
y = f(x_n) + f'(x_n)(x - x_n)
\]
with the $x$-axis $y = 0$. A bit of algebra gives
\[
x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},
\]

which is well defined whenever $f'(x_n) \neq 0$. Since we assume that we start close to a root $\hat x$, we may
guarantee this by requiring that $f'(\hat x) \neq 0$ and assuming that $f'$ is continuous (and thus non-zero
even in a neighborhood of $\hat x$).
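In code, the iteration is only a few lines. The following is a minimal Python sketch (the function name, tolerance, and iteration cap are choices made for this illustration, not part of the method):

```python
def newton_1d(f, fprime, x1, tol=1e-12, max_iter=50):
    """Run the Newton iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x1
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to a root
            return x
        dfx = fprime(x)
        if dfx == 0:               # tangent is horizontal; the step is undefined
            raise ZeroDivisionError("f'(x_n) = 0; Newton step undefined")
        x = x - fx / dfx
    return x

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton_1d(lambda x: x**2 - 2, lambda x: 2*x, x1=1.0)
print(root)
```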
Convergence rate: It is worthwhile to study the convergence rate of the method. For this we
subtract x̂ on both sides of the above equation to get
\[
x_{n+1} - \hat x = x_n - \hat x - \frac{f(x_n)}{f'(x_n)} = -\,\frac{f(x_n) + (\hat x - x_n)\, f'(x_n)}{f'(x_n)}
\]
and so, assuming that $|f'(x_n)| \ge c_1 > 0$, we get
\[
|x_{n+1} - \hat x| \le \frac{1}{c_1}\,\bigl| f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr|.
\]
Taylor’s theorem with remainder gives us
\[
f(\hat x) - \bigl[ f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr] = \int_{x_n}^{\hat x} (\hat x - s)\, f''(s)\, ds.
\]
Assuming that $|f''(s)| \le c_2$ for $s$ between $\hat x$ and $x_n$, and using that $f(\hat x) = 0$, we get
\[
\bigl| f(x_n) + (\hat x - x_n)\, f'(x_n) \bigr| \le c_2 \left| \int_{x_n}^{\hat x} (\hat x - s)\, ds \right| = \frac{c_2}{2}\, |x_n - \hat x|^2.
\]
Using this in the above equation we obtain
\[
|x_{n+1} - \hat x| \le C\, |x_n - \hat x|^2 \qquad \text{where } C := \frac{c_2}{2 c_1}.
\]
This is referred to as quadratic convergence, although as we will show next, the decay of $|x_n - \hat x|$ is
in fact doubly exponential.
Decay estimate: In analyzing the consequence of this recursive bound, note that
\[
C\,|x_1 - \hat x| < 1 \;\Rightarrow\; |x_{n+1} - \hat x| < |x_n - \hat x| \text{ and so } C\,|x_n - \hat x| < 1 \text{ for all } n.
\]
So starting the iterations anywhere in the set $\{x \colon |x - \hat x| < 1/C\}$, the sequence will never leave
this set. Now if $|x_1 - \hat x| < 1/C$, then there is $\kappa > 0$ such that $|x_1 - \hat x| = \frac{1}{C} e^{-2\kappa}$. By induction we
then show
\[
|x_1 - \hat x| = \frac{1}{C}\, e^{-2\kappa} \;\Rightarrow\; |x_n - \hat x| \le \frac{1}{C}\, e^{-2^n \kappa}.
\]
This is an extremely fast approach to the limit. Indeed, if, for instance, $C = 1$ and $\kappa = 1/2$, then
$|x_2 - \hat x| \le e^{-2} \approx 0.1$, $|x_4 - \hat x| \le e^{-8} \approx 0.0003$ and $|x_6 - \hat x| \le e^{-32} \approx 10^{-14}$, etc.
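This doubling of the number of correct digits is easy to observe numerically. The sketch below (again for $f(x) = x^2 - 2$, with the root $\sqrt 2$ assumed known only so that the error can be printed) shows the error roughly squaring at each step:

```python
import math

f = lambda x: x**2 - 2
fprime = lambda x: 2*x
root = math.sqrt(2)

x = 1.0
for n in range(1, 7):
    print(f"n = {n}, |x_n - root| = {abs(x - root):.3e}")
    x = x - f(x) / fprime(x)   # Newton step
```

The printed errors decrease roughly as $0.4,\ 0.09,\ 0.002,\ 2\cdot 10^{-6},\ 2\cdot 10^{-12}$, i.e., each is about the square of the previous one, consistent with the bound above.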
Failure if started too far: The method understandably fails if $x_1$ is too far from the root $\hat x$. An
example for this is $f(x) = \tanh(x)$, which has a unique root at $x = 0$. Since $f'(x) = \cosh(x)^{-2}$, the
iterations then correspond to
\[
x_{n+1} = x_n - \tfrac{1}{2}\sinh(2 x_n).
\]
As $\tfrac{1}{2}\sinh(2 x_n) \ge x_n$ for $x_n > 0$ (and the opposite inequality holds for $x_n < 0$), the signs of $x_n$ alternate.
For the sequence of absolute values we then get
\[
|x_{n+1}| = \tfrac{1}{2}\sinh(2 |x_n|) - |x_n|.
\]
Letting $h(a) := \tfrac{1}{2}\sinh(2a) - a$, we have $|x_{n+1}| = h(|x_n|)$. The equation $h(a) = a$ has two non-negative
solutions: $a = 0$ and $a = a_\star > 0$ such that $\sinh(2 a_\star) = 4 a_\star$. The graphical analysis of the
trajectory shows that
\[
|x_1| < a_\star \;\Rightarrow\; x_n \to 0 \quad \text{(the method does find the root)},
\]
\[
|x_1| > a_\star \;\Rightarrow\; |x_n| \to \infty \quad \text{(the method fails)}.
\]
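The following sketch illustrates this dichotomy numerically; $a_\star$ is located by a simple bisection (the helper names, step counts, and the overflow guard at $|x| > 20$ are ad hoc choices made for this illustration):

```python
import math

def newton_tanh(x1, n_steps=8):
    """Newton iteration for f(x) = tanh(x): x_{n+1} = x_n - sinh(2*x_n)/2."""
    x = x1
    for _ in range(n_steps):
        if abs(x) > 20:           # sinh(2x) would overflow double precision here
            return float("inf")   # the iteration has escaped to infinity
        x = x - 0.5 * math.sinh(2 * x)
    return x

# Locate a_star, the positive solution of sinh(2a) = 4a, by bisection.
g = lambda a: math.sinh(2 * a) - 4 * a
lo, hi = 0.5, 2.0                 # g(0.5) < 0 < g(2.0)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
a_star = 0.5 * (lo + hi)

print(newton_tanh(0.9 * a_star))  # converges toward the root x = 0
print(newton_tanh(1.1 * a_star))  # |x_n| blows up; the method fails
```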

Multivariate version: We have so far only addressed one function of one variable. The application
(finding critical points in unconstrained optimization) will require finding a common root
of $d$ functions of $d$ variables,
\[
f_i(\hat x) = 0, \qquad i = 1, \dots, d.
\]

(These functions will themselves be components of the gradient of the function we are minimizing.)
Starting from a point $x_n \in \mathbb{R}^d$, we thus get $d$ tangent hyperplanes,
\[
y_i = f_i(x_n) + (x - x_n) \cdot \nabla f_i(x_n),
\]
that intersect the set where $y_i = 0$ for all $i = 1, \dots, d$ at the point $x_{n+1} \in \mathbb{R}^d$ with equations
\[
0 = f_i(x_n) + (x_{n+1} - x_n) \cdot \nabla f_i(x_n), \qquad i = 1, \dots, d,
\]
or
\[
(x_{n+1} - x_n) \cdot \nabla f_i(x_n) = - f_i(x_n), \qquad i = 1, \dots, d.
\]
We can write these as a matrix equation: Let $\nabla \vec f(x)$ be the matrix whose $i$-th column is the
vector $\nabla f_i(x)$ and let $\vec f(x)$ be the column vector whose $i$-th entry is $f_i(x)$. Then
\[
(x_{n+1} - x_n)^T\, \nabla \vec f(x_n) = -\vec f(x_n)^T.
\]
Multiplying on the right by the inverse of $\nabla \vec f(x_n)$, we get
\[
(x_{n+1} - x_n)^T = -\vec f(x_n)^T \bigl( \nabla \vec f(x_n) \bigr)^{-1}.
\]
Taking transposes on both sides, this reads
\[
x_{n+1} = x_n - \bigl( \nabla \vec f(x_n) \bigr)^{-T} \vec f(x_n).
\]
This is the recursion corresponding to the multivariate Newton’s method. Word of advice: Derive
this in each problem again to make sure all transposes are done right.
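In code it is often safest to work with the Jacobian matrix $J(x)$, whose $(i,j)$ entry is $\partial f_i / \partial x_j$ (i.e., the transpose of $\nabla \vec f(x)$ above), and to solve a linear system rather than form an inverse. A minimal numpy sketch, with an example system chosen purely for illustration:

```python
import numpy as np

def newton_multivariate(f, jacobian, x1, tol=1e-12, max_iter=50):
    """Solve f(x) = 0 in R^d: at each step solve J(x_n) * delta = -f(x_n)."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            return x
        delta = np.linalg.solve(jacobian(x), -fx)   # Newton step
        x = x + delta
    return x

# Example: f1(x, y) = x^2 + y^2 - 1, f2(x, y) = x - y (circle intersected with a line)
f = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])
print(newton_multivariate(f, J, [1.0, 0.5]))   # approx (1/sqrt(2), 1/sqrt(2))
```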

GRADIENT DESCENT METHOD

Algorithm: The goal here is to address directly the process of minimizing function f . We
will only discuss the unconstrained version, where all directions are feasible. As −∇ f (x) is the
direction of steepest descent of f at x, we set
\[
x_{n+1} = x_n - h\, \nabla f(x_n)
\]
for some h > 0.
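A minimal sketch of the iteration (the stopping rule is simply a fixed number of steps, and the example function, step length, and starting point are arbitrary choices made for this illustration):

```python
import numpy as np

def gradient_descent(grad_f, x1, h, n_steps=100):
    """Iterate x_{n+1} = x_n - h * grad f(x_n) with a fixed step length h."""
    x = np.asarray(x1, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_f(x)
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 3*(y + 2)^2; the minimizer is (1, -2)
grad_f = lambda v: np.array([2*(v[0] - 1.0), 6*(v[1] + 2.0)])
print(gradient_descent(grad_f, x1=[0.0, 0.0], h=0.1))
```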

Picking the step length: We chose the step length h to be independent of n, although one can
play with other choices as well. The question is how to select h so that the method makes the best
gain per step. Let us discuss this first in the case of a function of a single variable. Then
\[
x_{n+1} = x_n - h\, f'(x_n)
\]
and so $f(x_{n+1}) = f\bigl(x_n - h f'(x_n)\bigr)$. To turn the right-hand side into a more manageable form, we
again invoke Taylor’s theorem:
\[
f(x + t) = f(x) + t f'(x) + \int_x^{x+t} (x + t - s)\, f''(s)\, ds.
\]
Assuming that $f''(s) \le L$, this gives us
\[
f(x + t) \le f(x) + t f'(x) + L\, \frac{t^2}{2}.
\]
Using this for $x = x_n$ and $t = -h f'(x_n)$, we thus get
\[
f(x_{n+1}) = f\bigl(x_n - h f'(x_n)\bigr)
\le f(x_n) - h f'(x_n)\, f'(x_n) + \frac{L}{2} \bigl( h f'(x_n) \bigr)^2
= f(x_n) - \bigl[ f'(x_n) \bigr]^2 \Bigl( h - \frac{L}{2} h^2 \Bigr).
\]
The gain from the method will be best if $h - \frac{L}{2} h^2$ is maximal. This happens at the point
\[
h = \frac{1}{L}.
\]
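As a quick sanity check of this choice, for the one-dimensional quadratic $f(x) = \frac{L}{2} x^2$ we have $f'(x) = Lx$ and $f'' \equiv L$, and the step $h = 1/L$ lands on the minimizer in a single update (a special feature of quadratics, shown here only to illustrate the rule; the constants are arbitrary):

```python
# f(x) = (L/2) * x**2 has f'(x) = L*x and constant second derivative L,
# so the optimal fixed step is h = 1/L; one update already reaches x = 0.
L = 5.0
fprime = lambda x: L * x
x = 3.7
x = x - (1.0 / L) * fprime(x)
print(x)   # 0.0
```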
In the situation when $f$ is a function of many variables, the same derivation applies except that
$f'(x_n)$ has to be replaced by $\nabla f(x_n)$ and $L$ by
\[
L := \sup_{x \colon f(x) \le f(x_1)} \;\; \sup_{v \in \mathbb{R}^d \setminus \{0\}} \frac{v^T\, \mathrm{Hess} f(x)\, v}{|v|^2},
\]
where $|v|$ is the Euclidean length of the vector $v$. The supremum over $v$ equals the largest
eigenvalue of the Hessian. The supremum over $x$ can be restricted to the set where $f(x) \le f(x_1)$
because $f$ decreases along each linear segment connecting $x_n$ to $x_{n+1}$, so the iterates never leave this set.
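For a quadratic $f(x) = \frac{1}{2} x^T A x$ with $A$ symmetric positive definite, the Hessian equals $A$ everywhere, so $L$ is just the largest eigenvalue of $A$. A small numpy sketch of gradient descent with $h = 1/L$ (the matrix, starting point, and step count are arbitrary choices made for this illustration):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])            # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x          # f(x) = (1/2) x^T A x, minimized at x = 0
grad_f = lambda x: A @ x               # for this quadratic, Hess f(x) = A everywhere

L = np.linalg.eigvalsh(A).max()        # largest eigenvalue of the Hessian
h = 1.0 / L

x = np.array([2.0, -1.5])
for _ in range(50):
    x = x - h * grad_f(x)
print(x, f(x))                         # x is close to the minimizer (0, 0)
```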

Remarks on convergence: It is obvious that the method defines a sequence of points $\{x_n\}$ along
which $f(x_n)$ decreases. If $f$ is bounded from below, then $f(x_n)$, being non-increasing, converges.
But the above derivation (we use for convenience only the one-dimensional version)
shows
\[
f(x_{n+1}) \le f(x_n) - \frac{1}{2L} \bigl[ f'(x_n) \bigr]^2
\]
or, after some algebra,
\[
\bigl[ f'(x_n) \bigr]^2 \le 2L \bigl( f(x_n) - f(x_{n+1}) \bigr).
\]
Since $f(x_n) - f(x_{n+1}) \to 0$, also $f'(x_n) \to 0$. Now if the level sets of $f$ are also bounded, $\{x_n\}$
contains a subsequence that converges to a point $\hat x$. By continuity of $f'$, we then have $f'(\hat x) = 0$,
i.e., $\hat x$ is a critical point. The method thus generally finds a critical point, but that could still be
merely a local minimum, or even a saddle point, rather than the global minimum. Which it is cannot be decided at this level of generality.
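As a small illustration of this last point, the following sketch runs gradient descent on $f(x, y) = x^2 - y^2$, which has a saddle at the origin and no minimum at all; started on the stable axis of the saddle, the iterates converge to the critical point $(0, 0)$. (The function, step length, and starting point are deliberately artificial choices, and the bounded-level-set assumption above does not even hold here; the example only shows that the limit of the iterates need not be a minimum.)

```python
import numpy as np

# f(x, y) = x**2 - y**2 has a saddle point at the origin and no minimum at all.
grad_f = lambda v: np.array([2*v[0], -2*v[1]])

x = np.array([1.0, 0.0])    # starting exactly on the "stable" axis of the saddle
for _ in range(200):
    x = x - 0.25 * grad_f(x)
print(x)                     # approaches (0, 0), where grad f = 0 but f has no minimum
```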
