
Chapter 2

The Basic Concepts of Algorithms


The synthesis and analysis of algorithms is a rather complicated field in computer science. Since we shall model every problem in molecular biology as an algorithmic problem in this book, we give a brief introduction to some basic concepts of algorithms in this chapter, in the hope that the reader can gain a basic understanding of the issues involved. Because it is quite difficult to present all of the terms in the field of algorithms formally, we shall take an informal approach: all of the basic concepts will be presented in an informal way.
2.1 The Minimal Spanning Tree Problem
Every computer program is based upon some kind of algorithm, and efficient algorithms produce efficient programs. Let us consider the minimal spanning tree problem. We believe that the reader will quickly understand why it is so important to study algorithm design.
There are two versions of the spanning tree problem. We will introduce one version of them first. Consider Figure 2.1.
Figure 2.1: A Set of Planar Points
A tree is a graph without cycles, and a spanning tree of a set of points is a tree connecting all of the points. In Figure 2.2, we show three spanning trees of the set of points in Figure 2.1.
Figure 2.2: Three Spanning Trees of the Set of Points in Figure 2.1
Among the three spanning trees, the tree in Figure 2.2(a) is the shortest, and this is the one we are interested in. Thus the minimal spanning tree problem is defined as follows: We are given a set of points and we are asked to find a spanning tree with the shortest total length.
How can we find a minimal spanning tree? A very straightforward algorithm is to enumerate all possible spanning trees; one of them must be what we are looking for. Figure 2.3 shows all of the possible spanning trees for three points. As can be seen, there are only three of them.
Figure 2.3: All Possible Spanning Trees for Three Points
For four points, as shown in Figure 2.4, there are sixteen (16) possible spanning trees. In general, it can be shown that given $n$ points, there are $n^{n-2}$ possible spanning trees for them. Thus if we have 5 points, there are already $5^3 = 125$ possible spanning trees. If $n = 100$, there will be $100^{98}$ possible spanning trees. Even if we have an algorithm to generate all spanning trees, time will not allow us to do so. No computer can finish this enumeration within any reasonable amount of time.
Figure 2.4: All Possible Spanning Trees for Four Points
Yet there is an efficient algorithm to solve this minimal spanning tree problem. Let us first introduce Prim's algorithm.
2.1.1 Prim's Algorithm
Consider the points in Figure 2.1 again. Suppose we start with any point, say point b. The nearest neighbor of point b is point a. We now connect point a with point b, as shown in Figure 2.5(a). Let us denote the set {a, b} by X and the set of the remaining points by Y. We now find the shortest distance between the points in X and the points in Y, which is that between b and e. We add e to the minimal spanning tree by connecting b with e, as shown in Figure 2.5(b). Now X = {a, b, e} and Y = {c, d, f}. Throughout the whole process, we maintain two sets: X consists of all of the points in the partially created minimal spanning tree, and Y consists of all of the remaining points. In each step of Prim's algorithm, we find a shortest distance between X and Y and add a new point to the tree, until Y is empty. For the points in Figure 2.1, the process of constructing a minimal spanning tree through this method is shown in Figure 2.5.
Figure 2.5: The Process of Constructing a Minimal Spanning Tree Based upon Prim's Algorithm
In the above, we assumed that the input is a set of planar points. We can generalize the problem so that the input is a connected graph in which each edge is associated with a positive weight. It can easily be seen that a set of planar points corresponds to a graph in which there is an edge between every two vertices and the weight associated with each edge is simply the Euclidean distance between the two points. If there is an edge between every pair of vertices, we call the graph a complete graph. Thus a set of planar points corresponds to a complete graph. Note that in a general graph, there may be no edge between two vertices. A typical graph is shown in Figure 2.6. Throughout the entire book, we shall denote a graph by G = (V, E), where V is the set of vertices in G and E is the set of edges in G.
Figure 2.6: A General Graph
We now present Prim's algorithm as follows:

Algorithm 2.1 Prim's Algorithm to Construct a Minimal Spanning Tree
Input: A weighted, connected and undirected graph G = (V, E).
Output: A minimal spanning tree of G.
Step 1: Let x be any vertex in V. Let X = {x} and Y = V − {x}.
Step 2: Select an edge (u, v) from E such that u ∈ X, v ∈ Y and (u, v) has the smallest weight among the edges between X and Y.
Step 3: Connect u to v. Let X = X ∪ {v} and Y = Y − {v}.
Step 4: If Y is empty, terminate; the resulting tree is a minimal spanning tree. Otherwise, go to Step 2.
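The following short Python sketch mirrors Algorithm 2.1. It is an illustration only: the adjacency-dictionary graph format and all identifiers are our own assumptions, not part of the text.

```python
# A sketch of Prim's algorithm (Algorithm 2.1).
def prim(graph):
    """graph: dict mapping each vertex to a dict {neighbor: weight}."""
    vertices = list(graph)
    x = {vertices[0]}              # X: vertices already in the tree
    y = set(vertices[1:])          # Y: the remaining vertices
    tree = []                      # edges of the partially built tree
    while y:                       # Step 4: stop when Y is empty
        # Step 2: cheapest edge (u, v) with u in X and v in Y
        u, v = min(((a, b) for a in x for b in graph[a] if b in y),
                   key=lambda e: graph[e[0]][e[1]])
        tree.append((u, v, graph[u][v]))   # Step 3: connect u to v
        x.add(v)
        y.remove(v)
    return tree

# A small made-up example (not the graph of Figure 2.6):
g = {'a': {'b': 1, 'c': 4}, 'b': {'a': 1, 'c': 2}, 'c': {'a': 4, 'b': 2}}
print(prim(g))                     # [('a', 'b', 1), ('b', 'c', 2)]
```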
Let us consider the graph in Figure 2.7. The process of applying Prim's algorithm to this graph is illustrated in Figure 2.8.
Figure 2.7: A General Graph
Figure 2.8: The Process of Applying Prim's Algorithm to the Graph in Figure 2.7
We will not formally prove the correctness of Prim's algorithm; the reader can find the proof in almost every textbook on algorithms. Yet we can already see the importance of algorithms: if one does not know of the existence of such an efficient algorithm for constructing minimal spanning trees, one can never construct minimal spanning trees when the input size is large. It would be disastrous for anyone to use an exhaustive search method in this case.
In the following, we will introduce another algorithm to construct a minimal spanning tree. This algorithm is called Kruskal's algorithm.
2.1.2 Kruskal's Algorithm
Kruskal's algorithm to construct a minimal spanning tree is quite similar to Prim's algorithm. It first sorts all of the edges in the graph into an ascending sequence. Then edges are added into a partially constructed minimal spanning tree one by one. Each time an edge is added, we check whether a cycle is formed; if a cycle is formed, we discard this edge. The algorithm terminates when the tree contains n − 1 edges.
Algorithm 2.2 Kruskal's Algorithm to Construct a Minimal Spanning Tree
Input: A weighted, connected and undirected graph G = (V, E).
Output: A minimal spanning tree of G.
Step 1: T = ∅.
Step 2: While T contains fewer than n − 1 edges do
    Choose an edge (v, w) from E of the smallest weight.
    Delete (v, w) from E.
    If the adding of (v, w) does not create a cycle in T then add (v, w) to T;
    else discard (v, w).
end While
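Again purely as an illustration, here is a minimal Python sketch of Algorithm 2.2. The cycle test uses a union-find (disjoint set) structure, a standard implementation device that the text does not prescribe; the edge-list format and identifiers are assumptions.

```python
# A sketch of Kruskal's algorithm (Algorithm 2.2).
def kruskal(n, edges):
    """n: number of vertices 0..n-1; edges: list of (weight, u, v)."""
    parent = list(range(n))        # union-find forest

    def find(a):
        while parent[a] != a:      # walk up to the root, compressing
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(edges):  # ascending order of weight
        ru, rv = find(u), find(v)
        if ru != rv:               # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
            if len(tree) == n - 1: # the tree is complete
                break
    return tree

edges = [(1, 0, 1), (2, 1, 2), (4, 0, 2)]
print(kruskal(3, edges))           # [(0, 1, 1), (1, 2, 2)]
```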
Let us consider the graph in Figure 2.7 again. The process of applying Kruskal's algorithm to this graph is illustrated in Figure 2.9. Both algorithms presented above are efficient.
Figure 2.9: The Process of Applying Kruskal's Algorithm to the Graph in Figure 2.7
2.2 The Longest Common Subsequence Problem
In this section, we will introduce another seemingly difficult problem and again show an efficient
algorithm to solve this problem.
Consider the following sequence:
S: ABDDRGTY
The following sequences are all subsequences of S:
ABDDRGT, DGTY, DDRGT, BDR, ABDTY, DDRT, GY, ADRG,
ABTY, TY, ABG, DDY, AR, BTY, TY, Y
Suppose that we are given two sequences as follows:
S1: ABDDRGTY
S2: CDEDRGRT
Then we can see that the following sequences are subsequences of both S1 and S2:
DRT
DRGT
DDRGT
All of the above sequences are called common subsequences of S1 and S2. The longest common subsequence problem is defined as follows: Given two sequences, find a longest common subsequence of them. If one is not familiar with algorithm design, one may be totally lost with this problem. Algorithm 2.3 is a naive algorithm to solve it.
Algorithm 2.3 A Naive Algorithm for the Longest Common Subsequence Problem
Step 1: Generate all of the subsequences of, say, S1.
Step 2: Starting with the longest one, check whether it is a subsequence of S2.
Step 3: If it is not, delete this subsequence and return to Step 2.
Step 4: Otherwise, return this subsequence as a longest common subsequence of S1 and S2.
The trouble is that it is exceedingly time-consuming to generate all of the subsequences of a sequence. A much better and much more efficient algorithm employs a strategy called the dynamic programming strategy. Since this strategy is used very often to solve problems in molecular biology, we shall describe it in the next section.
2.3 The Dynamic Programming Strategy
The dynamic programming strategy can be explained by considering the graph in Figure 2.10. Our problem is to find a shortest route from vertex S to vertex T. As can be seen, there are three branches out of S. Thus we have at least three choices: going through A, going through B and going through C. We have no idea which one leads to the shortest route, but we have the following principle: if we go through a vertex X, we should follow a shortest route from X to T.
Figure 2.10: A Graph
Let d(x, y) denote the length of a shortest route between vertices x and y. We have the following equation:
$$d(S,T) = \min\{\, d(S,A)+d(A,T),\; d(S,B)+d(B,T),\; d(S,C)+d(C,T) \,\}$$
The question now is: how do we find the shortest route from, say, vertex A to vertex T? Note that we can use the same principle: the problem of finding a shortest route from A to T is the same as the problem of finding a shortest route from S to T, except that the size of the problem is smaller. The shortest route finding problem can now be solved systematically as follows:

$$d(S,T) = \min\{\, d(S,A)+d(A,T),\; d(S,B)+d(B,T),\; d(S,C)+d(C,T) \,\}$$
$$= \min\{\, 15+d(A,T),\; 18+d(B,T),\; 3+d(C,T) \,\} \qquad (2.1)$$

$$d(A,T) = \min\{\, d(A,D)+d(D,T),\; d(A,E)+d(E,T) \,\} = \min\{\, 11+d(D,T),\; 10+d(E,T) \,\}$$
$$= \min\{\, 11+41,\; 10+21 \,\} = 31 \qquad (2.2)$$

$$d(B,T) = \min\{\, d(B,E)+d(E,T),\; d(B,F)+d(F,T),\; d(B,G)+d(G,T) \,\}$$
$$= \min\{\, 9+d(E,T),\; 1+d(F,T),\; 2+d(G,T) \,\} = \min\{\, 9+21,\; 1+3,\; 2+21 \,\} = 4 \qquad (2.3)$$

$$d(C,T) = \min\{\, d(C,G)+d(G,T),\; d(C,H)+d(H,T) \,\} = \min\{\, 14+21,\; 16+27 \,\} = 35 \qquad (2.4)$$
Substituting (2.2), (2.3) and (2.4) into (2.1), we obtain d(S,T) = min{15+31, 18+4, 3+35} = 22, which implies that the shortest route from S to T is S → B → F → T. As shown above, the basic idea of the dynamic programming strategy is to decompose a large problem into several sub-problems, each of which is identical to the original problem except that its size is smaller. The dynamic programming strategy thus always solves a problem recursively.
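The recursion above can be written down directly, as in the Python sketch below. This is illustrative only: Figure 2.10 itself is not reproduced in this text, so the edge weights are chosen to reproduce the numbers in equations (2.1) through (2.4), with the distances from D, E, F, G and H to T treated as single edges for simplicity.

```python
# d(x, T) = min over successors y of { w(x, y) + d(y, T) }, memoized so
# that every subproblem is solved only once.
from functools import lru_cache

successors = {                      # assumed weights, see note above
    'S': {'A': 15, 'B': 18, 'C': 3},
    'A': {'D': 11, 'E': 10},
    'B': {'E': 9, 'F': 1, 'G': 2},
    'C': {'G': 14, 'H': 16},
    'D': {'T': 41}, 'E': {'T': 21}, 'F': {'T': 3},
    'G': {'T': 21}, 'H': {'T': 27},
}

@lru_cache(maxsize=None)
def d(x):
    """Length of a shortest route from vertex x to T."""
    if x == 'T':
        return 0
    return min(w + d(y) for y, w in successors[x].items())

print(d('S'))    # 22, realized by the route S -> B -> F -> T
```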
In the next section, we go back to the longest common subsequence problem and show how the dynamic programming strategy can be applied to solve it.
2.4 Application of the Dynamic Programming Strategy to Solve the Longest Common Subsequence Problem
The longest common subsequence problem was presented in Section 2.2, where it was pointed out that we cannot solve the problem in a naive and unsophisticated way. In this section, we shall show that this problem can be solved elegantly by using the dynamic programming strategy.
We are given two sequences $S_1 = a_1 a_2 \cdots a_m$ and $S_2 = b_1 b_2 \cdots b_n$. Consider $a_m$ and $b_n$. There are two cases:

Case 1: $a_m = b_n$. In this case, $a_m$, which is equal to $b_n$, must be included in the longest common subsequence. The longest common subsequence of $S_1$ and $S_2$ is the longest common subsequence of $a_1 a_2 \cdots a_{m-1}$ and $b_1 b_2 \cdots b_{n-1}$, plus $a_m$.

Case 2: $a_m \neq b_n$. Then we find two longest common subsequences: that of $a_1 a_2 \cdots a_m$ and $b_1 b_2 \cdots b_{n-1}$, and that of $a_1 a_2 \cdots a_{m-1}$ and $b_1 b_2 \cdots b_n$. Among these two, we choose the longer one; the longest common subsequence of $S_1$ and $S_2$ must be this longer one.

To summarize, the dynamic programming strategy decomposes the longest common subsequence problem into three identical sub-problems, each of smaller size. Each sub-problem can now be solved recursively.
In the following, to simplify the discussion, let us concentrate on finding the length of a longest common subsequence. It will be obvious that our algorithm can easily be extended to find a longest common subsequence itself.
Let LCS(i, j) denote the length of a longest common subsequence of $a_1 a_2 \cdots a_i$ and $b_1 b_2 \cdots b_j$. LCS(i, j) can be found by the following formula:

$$LCS(i,j) = \begin{cases} LCS(i-1,\,j-1) + 1 & \text{if } a_i = b_j \\ \max\{LCS(i-1,\,j),\; LCS(i,\,j-1)\} & \text{if } a_i \neq b_j \end{cases}$$

with the boundary condition LCS(i, 0) = LCS(0, j) = 0 for all i and j.
The following is an algorithm to find the length of a longest common subsequence based upon the dynamic programming strategy.

Algorithm 2.4 An Algorithm to Find the Length of a Longest Common Subsequence Based upon the Dynamic Programming Strategy
Input: Two sequences $A = a_1 a_2 \cdots a_m$ and $B = b_1 b_2 \cdots b_n$.
Output: The length of a longest common subsequence of A and B, denoted as LCS(m, n).
Step 1: LCS(i, 0) = LCS(0, j) = 0 for all i and j.
Step 2: for i = 1 to m do
    for j = 1 to n do
        if a_i = b_j then LCS(i, j) = LCS(i−1, j−1) + 1
        else LCS(i, j) = max{LCS(i−1, j), LCS(i, j−1)}
    end for
end for
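A minimal Python version of Algorithm 2.4 is sketched below; the function name and the 0-indexed strings are implementation choices of ours, with table[i][j] corresponding to LCS(i, j).

```python
# A sketch of Algorithm 2.4 (length of a longest common subsequence).
def lcs_length(a, b):
    m, n = len(a), len(b)
    # Step 1: LCS(i, 0) = LCS(0, j) = 0 for all i and j.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: fill the table row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:                 # a_i = b_j
                table[i][j] = table[i - 1][j - 1] + 1
            else:                                    # a_i != b_j
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

print(lcs_length('AGCT', 'CGT'))        # 2  (CT or GT)
print(lcs_length('aabcdec', 'badea'))   # 3  (bde or ade)
```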
Let us consider an example: A = AGCT and B = CGT. The entire process of finding the length of a longest common subsequence of A and B is illustrated in Table 2.1. By tracing back through the table, we can find the two longest common subsequences, CT and GT.
Let us consider another example: A = aabcdec and B = badea. Table 2.2 illustrates the process. Again, it can be seen that we have two longest common subsequences, namely bde and ade.
Table 2.1: The Process of Finding the Length of a Longest Common Subsequence of AGCT and CGT
Table 2.2: The Process of Finding the Length of a Longest Common Subsequence of A = aabcdec and B = badea
2.5 The Time-Complexity of Algorithms
In the above sections, we showed that it is important to be able to design efficient algorithms; to put it another way, many problems can hardly be solved unless we can design efficient algorithms. Therefore, we now come to a critical question: how do we measure the efficiency of algorithms?
We usually say that an algorithm is efficient if the program based upon it runs very fast. But whether a program runs fast or not sometimes depends on the hardware and on the skill of the programmers, which are irrelevant to the algorithm itself.
In algorithm analysis, we instead choose a particular step of the algorithm and try to determine how many such steps are needed to complete the computation. For instance, in all sorting algorithms, the comparison of data cannot be avoided. Therefore, we often use the number of comparisons of data as the time-complexity of a sorting algorithm.
Let us consider the straight insertion sort algorithm. We are given a sequence of numbers $x_1, x_2, \ldots, x_n$. The straight insertion sort algorithm scans this sequence from left to right. If $x_i$ is found to be smaller than $x_{i-1}$, we move $x_i$ to the left of $x_{i-1}$, and this is repeated until the number to the left of $x_i$ is not smaller than $x_i$.
Algorithm 2.5 The Straight Insertion Sort Algorithm
Input: A sequence of numbers $x_1, x_2, \ldots, x_n$.
Output: The sorted sequence of $x_1, x_2, \ldots, x_n$.
for j = 2 to n do
    i = j − 1
    x = x_j
    while x < x_i and i > 0 do
        x_{i+1} = x_i
        i = i − 1
    end while
    x_{i+1} = x
end for
Suppose that the input sequence is 9, 17, 1, 5, 10. The straight insertion sort sorts this sequence into a sorted sequence as follows:
9
9, 17
1, 9, 17
1, 5, 9, 17
1, 5, 9, 10, 17
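The following Python sketch of Algorithm 2.5 also counts the three kinds of data movements, so that the count can be compared with the analysis below; the movement counter and all identifiers are illustrative additions of ours.

```python
# A sketch of Algorithm 2.5 with a data-movement counter.
def straight_insertion_sort(seq):
    x = list(seq)
    moves = 0
    for j in range(1, len(x)):
        key = x[j]                     # movement: x = x_j
        moves += 1
        i = j - 1
        while i >= 0 and key < x[i]:
            x[i + 1] = x[i]            # movement: x_{i+1} = x_i
            moves += 1
            i -= 1
        x[i + 1] = key                 # movement: x_{i+1} = x
        moves += 1
    return x, moves

print(straight_insertion_sort([9, 17, 1, 5, 10]))
# ([1, 5, 9, 10, 17], 13): here 2(n-1) = 8 and the d_i sum to 5
```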
In this sorting algorithm, the dominating steps are data movements. There are three data movement operations, namely $x = x_j$, $x_{i+1} = x_i$ and $x_{i+1} = x$. We can use the number of data movements to measure the time-complexity of the algorithm. In the algorithm, there are one outer loop and one inner loop. For the outer loop, the data movement operations $x = x_j$ and $x_{i+1} = x$ are always executed, no matter what the input is. For the inner loop, there is only one data movement, namely $x_{i+1} = x_i$, and it is executed only if the inner loop is entered; in other words, whether this operation is executed depends on the input data. Let us denote the number of data movements executed in the inner loop for $x_i$ by $d_i$. The total number of data movements of the straight insertion sort is

$$X = \sum_{i=2}^{n}(2 + d_i) = 2(n-1) + \sum_{i=2}^{n} d_i.$$
Best Case: The best case occurs when the input sequence is already sorted. The inner loop is then never executed, every $d_i$ is 0, and $X = 2(n-1)$.

Worst Case: The worst case occurs when the input sequence is reversely sorted. In such a case,
$$d_2 = 1,\; d_3 = 2,\; \ldots,\; d_n = n-1.$$
Thus,
$$X = 2(n-1) + \sum_{i=2}^{n} d_i = 2(n-1) + \frac{n(n-1)}{2} = \frac{(n-1)(n+4)}{2}.$$
Average Case: To conduct an analysis of the average case, note that when $x_i$ is being considered, the $(i-1)$ numbers to its left have already been sorted. Suppose $x_i$ is the largest among the first $i$ numbers; then the inner loop is not executed at all. If $x_i$ is the $j$th largest number among the first $i$ numbers, there will be $j-1$ data movements executed in the inner loop. The probability that $x_i$ is the $j$th largest among the $i$ numbers is $1/i$ for $1 \le j \le i$. Therefore the average number of data movements for $x_i$ is

$$\sum_{j=1}^{i} \frac{1}{i}(j-1) = \frac{1}{i}\bigl(0 + 1 + 2 + \cdots + (i-1)\bigr) = \frac{i-1}{2}.$$

The average case time-complexity of the straight insertion sort is therefore

$$\sum_{i=2}^{n}\Bigl(2 + \frac{i-1}{2}\Bigr) = 2(n-1) + \frac{1}{2}\cdot\frac{n(n-1)}{2} = \frac{(n+8)(n-1)}{4}.$$
In summary, the time-complexities of the straight insertion sort are as follows:

Straight Insertion Sort Algorithm
Best Case: $2(n-1)$
Average Case: $\frac{1}{4}(n+8)(n-1)$
Worst Case: $\frac{1}{2}(n-1)(n+4)$
In the above, we showed how an algorithm can be analyzed. For each algorithm, there are three time-complexities: the worst case time-complexity, the average case time-complexity and the best case time-complexity. For practical purposes, the average case time-complexity is the most important one. Unfortunately, it is usually very difficult to obtain an average case time-complexity. In this book, unless we clearly specify otherwise, whenever we mention the time-complexity, we mean the worst case time-complexity.
Now, suppose we have a time-complexity equal to $n^2 + n$. It can easily be seen that as $n$ becomes very large, $n^2$ dominates. That is, we may ignore the term $n$ when $n$ is large enough. We now present a formal definition of this idea.
Definition 2.1: $f(n) = O(g(n))$ if and only if there exist two positive constants $c$ and $n_0$ such that $f(n) \le c\,g(n)$ for all $n \ge n_0$.

This function is called the big-O function. If $f(n)$ is $O(g(n))$, then $f(n)$ is bounded by $g(n)$, in a certain sense, when $n$ is large enough. If the time-complexity of an algorithm is $O(g(n))$, it takes no more than $c\,g(n)$ steps to run this algorithm for some constant $c$ when $n$ is large enough. Note that $n$ is the size of the input data.
Assume that the time-complexity of an algorithm is $f(n) = n^2 + n$. Then
$$f(n) = n^2 + n \le n^2\Bigl(1 + \frac{1}{n}\Bigr) \le 2n^2 \quad \text{for } n \ge 1.$$
Thus, we say that the time-complexity is $O(n^2)$, because we can take $c$ and $n_0$ to be 2 and 1 respectively.
It is customary to use $O(1)$ to represent a constant. Let us go back to the time-complexities of the straight insertion sort algorithm. Using the big-O function, we now have:

Straight Insertion Sort Algorithm
Best Case: $O(n)$
Average Case: $O(n^2)$
Worst Case: $O(n^2)$
Many other algorithms have been analyzed. Their time-complexities are as follows:

The Binary Search Algorithm
Best Case: $O(1)$
Average Case: $O(\log n)$
Worst Case: $O(\log n)$

The Straight Selection Sort
Best Case: $O(1)$
Average Case: $O(n \log n)$
Worst Case: $O(n^2)$

Quicksort
Best Case: $O(n \log n)$
Average Case: $O(n \log n)$
Worst Case: $O(n^2)$

Heapsort
Best Case: $O(n)$
Average Case: $O(n \log n)$
Worst Case: $O(n \log n)$

The Dynamic Programming Approach to Find a Longest Common Subsequence
Best Case: $O(n^2)$
Average Case: $O(n^2)$
Worst Case: $O(n^2)$
In Table 2.3, we list different time-complexity functions in terms of the input size.
Table 2.3: Time-Complexity Functions
As can be seen in this table, it is quite important to be able to design algorithms with low time-complexities. For instance, suppose that one algorithm has $O(n)$ time-complexity and another one has $O(\log n)$ time-complexity. When $n = 10000$, the algorithm with $O(\log n)$ time-complexity takes a much smaller number of steps than the algorithm with $O(n)$ time-complexity.
2.6 The 2-Dimensional Maxima Finding Problem and the Divide and Conquer Strategy
In the above sections, we showed the importance of designing efficient algorithms. In this section, we shall introduce another problem, called the 2-dimensional maxima finding problem. A straightforward algorithm for this problem has $O(n^2)$ time-complexity, yet there is an algorithm that solves the same problem with $O(n \log n)$ time-complexity. This algorithm is based upon the divide and conquer strategy.
In 2-dimensional space, a point $(x_1, y_1)$ is said to dominate another point $(x_2, y_2)$ if $x_1 > x_2$ and $y_1 > y_2$. If a point is not dominated by any other point, it is called a maxima. For example, in Figure 2.11, all of the circled points are maximas. The maxima finding problem is defined as follows: given a set of $n$ points, find all of the maximas of these points.
Figure 2.11: A Set of 2-Dimensional Points
A straightforward algorithm to find all of these maximas is to conduct an exhaustive search. That is, for each point, we compare it with all of the other points to see whether it is dominated by any of them. The total number of comparisons needed is therefore $n(n-1)/2$. This means that the time-complexity of the straightforward algorithm is $O(n^2)$.
Let us consider Figure 2.12, in which the set S of input points is divided into two sets: those to the left of a line L and those to the right of L. This line L, which is perpendicular to the X-axis, is a median line; that is, L is determined by the median of the x-values of all of the points. Denote the set of points to the left of L by $S_L$ and the set of points to the right of L by $S_R$. The divide and conquer approach finds the maximas of $S_L$ and $S_R$ separately. In Figure 2.12, $P_1$, $P_2$ and $P_3$ are maximas for $S_L$, and $P_6$, $P_8$ and $P_{10}$ are maximas for $S_R$.
Figure 2.12: The Set of Points in Figure 2.11 Divided by a Median Line
We observe that the maximas of $S_R$ must be maximas of the whole set, but some of the maximas of $S_L$ may not be maximas of S. For instance, $P_3$ is not. To determine whether a maxima of $S_L$ is a maxima of S, let us observe Figure 2.13, in which the maximas of $S_L$ and $S_R$ are all projected onto the Y-axis. A maxima u of $S_L$ is a maxima of S if and only if there is no maxima v of $S_R$ whose y-value is higher than that of u. In Figure 2.13, it can be seen that $P_3$ is not a maxima of S because the y-values of $P_6$ and $P_8$ are both higher than the y-value of $P_3$. We may now conclude that the set of maximas of S is $\{P_1, P_2, P_6, P_8, P_{10}\}$.
Figure 2.13: The Projection of Maximas onto the Y-axis
We can see that the divide and conquer approach consists of two stages. First, it divides the set of input data into two subsets $S_1$ and $S_2$ and solves the two sub-problems defined on $S_1$ and $S_2$ separately. The second stage is the merging scheme, which merges the two sub-solutions into the final solution.
The reader may be puzzled by one problem: how are we going to find the maximas of, say, $S_L$? The answer is: we find the maximas of $S_L$ and $S_R$ by using this divide and conquer algorithm recursively. That is, $S_L$, for example, is divided into two subsets again, and maximas are found for both of them. The following is the algorithm to find maximas based upon the divide and conquer strategy.
Algorithm 2.6 An Algorithm to Find Maximas Based upon the Divide and Conquer Strategy
Input: A set S of 2-dimensional points.
Output: The maximas of S.
Step 1: If S contains only one point, return it as a maxima. Otherwise, find a line L perpendicular to the X-axis which separates S into $S_L$ and $S_R$, each consisting of n/2 points.
Step 2: Recursively find the maximas of $S_L$ and $S_R$.
Step 3: Project the maximas of $S_L$ and $S_R$ onto L and sort these projections according to their y-values. Conduct a linear scan on the projections and discard each maxima of $S_L$ whose y-value is less than the y-value of some maxima of $S_R$.
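A compact Python sketch of the divide and conquer idea follows. It is a simplification of Algorithm 2.6: instead of a linear-time median routine it presorts the points by x-coordinate once, and instead of sorting the projections in the merge it uses only the largest y-value of the right half, which is all the linear scan really needs. All identifiers are our own assumptions, and the coordinates are assumed distinct.

```python
# A simplified sketch of Algorithm 2.6 (2-dimensional maxima finding).
def maximas(points):
    pts = sorted(points)               # sort once by x-coordinate

    def solve(lo, hi):                 # maximas of pts[lo:hi]
        if hi - lo == 1:               # Step 1: a single point
            return [pts[lo]]
        mid = (lo + hi) // 2           # the median line L
        left, right = solve(lo, mid), solve(mid, hi)
        # Step 3: every maxima of S_R survives; a maxima of S_L
        # survives only if its y beats every y among S_R's maximas.
        top = max(p[1] for p in right)
        return [p for p in left if p[1] > top] + right

    return solve(0, len(pts))

pts = [(1, 4), (2, 3), (3, 5), (4, 1), (5, 2)]
print(maximas(pts))                    # [(3, 5), (5, 2)]
```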
Although in our algorithm we divide the set of points into subsets so small that each contains only one point, it is not necessary to do so. In practice, we can divide the set into subsets containing any small constant number of points and solve the problem on such a small subset by a naive and straightforward method. Imagine that the final subsets contain only two points each; the maximas of two points can be determined by just one comparison.
To give the reader some feeling for the "recursiveness" of the divide and conquer strategy, let us consider the set of 8 points in Figure 2.14. To simplify our discussion, let us divide our set of 8 points into 4 subsets, each containing 2 points, as shown in Figure 2.14.
Figure 2.14: The Dividing of the Points into 4 Subsets
Our algorithm will proceed as follows:
(1) The sets of maximas for $S_{11}$, $S_{12}$, $S_{21}$ and $S_{22}$ are separately found to be $\{P_1\}$, $\{P_3, P_4\}$, $\{P_5, P_6\}$ and $\{P_7, P_8\}$ respectively.
(2) Merging $S_{11}$ and $S_{12}$ eliminates $P_1$; the set of maximas of the left half is thus $\{P_3, P_4\}$.
(3) Merging $S_{21}$ and $S_{22}$ eliminates $P_5$ and $P_6$; the set of maximas of the right half is thus $\{P_7, P_8\}$.
(4) Merging the two halves eliminates $P_4$. Thus the set of maximas of the whole set is $\{P_3, P_7, P_8\}$.
We now conduct an analysis of Algorithm 2.6. In the algorithm, there is a step which finds the median of a set of numbers. For example, consider the set {17, 5, 7, 26, 10, 21, 13, 31, 4}. The median is 13. Note that there are four numbers, namely 5, 7, 10 and 4, which are smaller than 13, and four numbers, namely 17, 26, 21 and 31, which are greater than 13. To find such a median, it appears that we have to sort the numbers. For this case, the sorted sequence is 4, 5, 7, 10, 13, 17, 21, 26, 31, and the median 13 can now easily be determined. The trouble is that the time-complexity of sorting is at least $O(n \log n)$, so a median finding algorithm which employs sorting has time-complexity at least $O(n \log n)$. In the following section, we shall show that there is a median finding algorithm whose time-complexity is $O(n)$. This algorithm is based upon the prune and search strategy. Meanwhile, in the analysis of Algorithm 2.6, we shall use this fact.
The time-complexity of Algorithm 2.6 depends on the time-complexities of the following steps:
(1) In Step 1, the splitting step, there is a median finding operation. As pointed out above, this can be accomplished in $O(n)$ steps. This means that the time-complexity of the splitting step is $O(n)$.
(2) In Step 2, there are two sub-problems. Each of them is of size n/2 and will be solved recursively by the same algorithm.
(3) In Step 3, the merging step, there are a sorting step and a linear scan step. The sorting takes $O(n \log n)$ steps and the linear scan takes $O(n)$ steps, so it takes $O(n \log n)$ steps to complete Step 3. Therefore the time-complexity of the merging step is $O(n \log n)$.
Let $T(n)$ denote the time-complexity of the algorithm, and let $S(n)$ and $M(n)$ denote the time-complexities of the splitting step and the merging step respectively. Then

$$T(n) = 2T(n/2) + S(n) + M(n) \quad\text{for } n > 1, \qquad T(n) = b \quad\text{for } n = 1.$$

Using the definition of the big-O function, we have

$$T(n) \le 2T(n/2) + c_1 n + c_2 n\log n \le 2T(n/2) + c\,n\log n$$

where $c$ is a constant. Let us assume that $n = 2^k$ for some $k$. We now have

$$T(n) \le 2T(n/2) + cn\log n \le 2\bigl(2T(n/4) + c(n/2)\log(n/2)\bigr) + cn\log n$$
$$\le \cdots \le nT(1) + cn\bigl(\log n + \log(n/2) + \log(n/4) + \cdots + \log 2\bigr)$$
$$= bn + \frac{1}{2}cn\log^2 n + \frac{1}{2}cn\log n.$$

Therefore, $T(n) = O(n\log^2 n)$.
Although $O(n \log^2 n)$ is much better than $O(n^2)$, we can still do better; in the following, we show how to improve the algorithm so that its time-complexity becomes $O(n \log n)$. In Algorithm 2.6, we have to perform sorting in every merging step, and this is the main reason why the time-complexity is $O(n \log^2 n)$. Suppose we perform a preprocessing step which sorts the points according to their y-values before we start the divide and conquer maxima finding algorithm. This preprocessing takes $O(n \log n)$ time, and the total time-complexity is now $O(n \log n) + T(n)$, where

$$T(n) = 2T(n/2) + O(n) \quad\text{for } n > 1, \qquad T(n) = b \quad\text{for } n = 1.$$

It can easily be shown that $T(n) = O(n \log n)$, and the total time-complexity, including the preprocessing step, is $O(n \log n)$.
2.7 The Selection Problem and the Prune and Search Strategy
In the maxima finding algorithm based upon the divide and conquer strategy, we need to find the median of a set of numbers. This median finding problem can be generalized to the selection problem, which is defined as follows: we are given a set S of n numbers and a number k, and we are asked to find the kth smallest, or the kth largest, number in S. It is obvious that the median finding problem is a special case of the more general selection problem.
To solve the selection problem, an easy method is to sort the numbers; after sorting, the kth smallest, or the kth largest, number can be found immediately. In this section, we show that we can avoid sorting by using the prune and search strategy. The prune and search strategy is again recursive in some sense; that is, an algorithm based upon this strategy is always recursive.
We are given $n$ data items. Suppose that we have a mechanism by which, after each iteration, a constant fraction, say $f$, of the input data is eliminated, and the problem is solved when its size has been reduced to a reasonably small number. Let $T(n)$ be the time needed to solve a problem of size $n$ by the prune and search strategy, and assume that eliminating the fraction $f$ of the data items takes $O(n^k)$ time. Then

$$T(n) = T((1-f)n) + O(n^k) \quad\text{for sufficiently large } n, \qquad T(n) = O(1) \quad\text{otherwise.}$$

For sufficiently large $n$, we have

$$T(n) \le T((1-f)n) + cn^k \le T((1-f)^2 n) + cn^k + c(1-f)^k n^k \le \cdots$$
$$\le c' + cn^k\bigl(1 + (1-f)^k + (1-f)^{2k} + \cdots + (1-f)^{pk}\bigr).$$

Since $(1-f)^k < 1$, the sum in the parentheses is bounded by a constant as $n \to \infty$, and therefore

$$T(n) = O(n^k).$$
We now explain why the prune and search strategy can be applied to solve the selection problem. Given a set S of n numbers, suppose that there is a number p which divides S into three subsets $S_1$, $S_2$ and $S_3$: $S_1$ contains all the numbers smaller than p, $S_2$ contains all the numbers equal to p, and $S_3$ contains all the numbers greater than p. Then we have the following cases:

Case 1: The size of $S_1$ is at least k. In this case, the kth smallest number of S must be located in $S_1$, and we can prune away $S_2$ and $S_3$.

Case 2: The condition of Case 1 does not hold, but the size of $S_1$ plus the size of $S_2$ is at least k. In this case, the kth smallest number of S must be equal to p.

Case 3: Neither the condition of Case 1 nor that of Case 2 holds. In this case, the kth smallest number of S must be located in $S_3$, and we can prune away $S_1$ and $S_2$; it remains to find the $(k - |S_1| - |S_2|)$th smallest number of $S_3$.
The problem is to determine an appropriate p. This number p must guarantee that a constant fraction of the numbers can be eliminated. Algorithm 2.7 can be used to find such a p.
Algorithm 2.7 A Subroutine to Find p from n Numbers for the Selection Problem
Input: A set S of n numbers.
Output: The number p which is to be used in the algorithm that finds the kth smallest number based upon the prune and search strategy.
Step 1: Divide S into $\lceil n/5 \rceil$ subsets of 5 numbers each, adding dummy elements to the last subset if necessary.
Step 2: Sort each of the 5-number subsets.
Step 3: Find the median $m_i$ of the $i$th subset. Recursively find the median of $m_1, m_2, \ldots, m_{\lceil n/5 \rceil}$ by using the selection algorithm. Let p be this median.

Figure 2.15: The Execution of Algorithm 2.7
That the p selected by Algorithm 2.7 guarantees that 1/4 of the input data can be eliminated is illustrated in Figure 2.15.
Some points about Algorithm 2.7 are in order. First, it is not essential that the input set be divided into subsets containing 5 numbers; we may divide it into subsets each containing, say, 7 numbers. Our algorithm works as long as each subset contains a constant number of numbers, because as long as the input size is a constant, it takes O(1), that is, a constant number of, steps to process it. Thus, each sorting performed in Step 2 takes a constant number of steps. In Step 3, p is found by using the selection algorithm itself recursively.
The following is the algorithm, based upon the prune and search strategy, to find the kth smallest number.
Algorithm 2.8 A Prune and Search Algorithm to Find the kth Smallest Number
Input: A set S of n numbers.
Output: The kth smallest number of S.
Step 1: Divide S into $\lceil n/5 \rceil$ subsets of 5 numbers each. If n is not a multiple of 5, add some dummy elements to the last subset so that it also contains five elements.
Step 2: Sort each subset.
Step 3: Use Algorithm 2.7 to determine p.
Step 4: Partition S into three subsets $S_1$, $S_2$ and $S_3$, containing the numbers less than p, equal to p and larger than p respectively.
Step 5: If $|S_1| \ge k$, discard $S_2$ and $S_3$ and select the kth smallest number of $S_1$ in the next iteration; else if $|S_1| + |S_2| \ge k$, p is the kth smallest number of S; otherwise, let $k' = k - |S_1| - |S_2|$ and select the $k'$th smallest number of $S_3$ in the next iteration.
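Algorithms 2.7 and 2.8 can be combined into one recursive routine, sketched below in Python. Dummy elements are unnecessary here because the last group is simply allowed to be shorter; the identifiers are our own.

```python
# A sketch combining Algorithms 2.7 and 2.8 (prune and search selection).
def select(s, k):
    """Return the k-th smallest (k >= 1) number of the list s."""
    if len(s) <= 5:                    # small problems: solve directly
        return sorted(s)[k - 1]
    # Algorithm 2.7: medians of the groups of 5, then their median as p.
    groups = [sorted(s[i:i + 5]) for i in range(0, len(s), 5)]
    medians = [g[len(g) // 2] for g in groups]
    p = select(medians, (len(medians) + 1) // 2)
    # Algorithm 2.8, Steps 4-5: partition S around p and recurse.
    s1 = [v for v in s if v < p]
    s2 = [v for v in s if v == p]
    if k <= len(s1):                   # Case 1: the answer lies in S1
        return select(s1, k)
    if k <= len(s1) + len(s2):         # Case 2: the answer is p
        return p
    s3 = [v for v in s if v > p]       # Case 3: the answer lies in S3
    return select(s3, k - len(s1) - len(s2))

data = [17, 5, 7, 26, 10, 21, 13, 31, 4]
print(select(data, 5))                 # 13, the median of the nine numbers
```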
Let $T(n)$ denote the time-complexity of the algorithm. Then

$$T(n) = T(3n/4) + T(n/5) + O(n).$$

The first term, $T(3n/4)$, is due to the fact that at least 1/4 of the input data are eliminated in each iteration. The second term, $T(n/5)$, is due to the fact that during the execution of Algorithm 2.7, we have to solve a selection problem involving n/5 numbers. The third term, $O(n)$, is due to the fact that dividing n numbers into n/5 subsets takes $O(n)$ steps. It can be proved that $T(n) = O(n)$; the proof is rather complicated and is omitted here.
Although we may dislike time-complexities such as $n^3$ and $n^5$, they are not so bad compared with time-complexities such as $2^n$ or $n!$. When $n = 10000$, $2^n$ is an exceedingly large number; if an algorithm has such a time-complexity, then the problem can never be solved by any computer when n is large. An algorithm is a polynomial algorithm if its time-complexity is $O(p(n))$ where $p(n)$ is a polynomial function, such as $n^2$ or $n^4$. An algorithm is an exponential algorithm if its time-complexity cannot be bounded by a polynomial function. There are many problems which have polynomial algorithms: the sorting problem, the minimal spanning tree problem and the longest common subsequence problem all have polynomial algorithms. A problem is called a polynomial problem if there exist polynomial algorithms to solve it. Unfortunately, there are many problems which, up to now, have no polynomial algorithms to solve them. We are interested in one question: is it possible that in the future, some polynomial algorithms will be found for them? This question will be answered in the next section.
2.8 The NP-Complete Problems
The concept of NP-completeness is perhaps the most difficult one in the field of design and analysis of algorithms. It is impossible to present this idea formally here; we shall instead present an informal discussion of these concepts.
Let us first define some problems.
The partition problem: We are given a set S of numbers and we are asked to determine whether S can be partitioned into two subsets $S_1$ and $S_2$ such that the sum of the elements in $S_1$ is equal to the sum of the elements in $S_2$.
For example, let S = {13, 2, 17, 20, 8}. The answer to this problem instance is "yes" because we can partition S into $S_1$ = {13, 17} and $S_2$ = {2, 20, 8}, both of which sum to 30.
The sum of subset problem: We are given a set S of numbers and a constant c, and we are asked to determine whether there exists a subset S′ of S such that the sum of the elements in S′ is equal to c.
For example, let S = {12, 9, 33, 42, 7, 10, 5} and c = 24. The answer for this problem instance is "yes", as there exists S′ = {9, 10, 5}, whose elements sum to 24. If c is 6, the answer will be "no".
The satisfiability problem: We are given a Boolean formula X and we are asked whether there exists an assignment of true or false to the variables in X which makes X true.
For example, let X be $(x_1 \vee x_2 \vee x_3) \wedge (\neg x_1 \vee x_2) \wedge (\neg x_2 \vee x_3)$. Then the assignment $x_1 = F$, $x_2 = F$, $x_3 = T$ makes X true, and the answer is "yes". If X is $x_1 \wedge \neg x_1$, there is no assignment which can make X true, and the answer is "no".
The minimal spanning tree problem: Given a graph G, find a spanning tree T of G with the minimum total length.
The traveling salesperson problem: Given a graph G = (V, E), find a cycle which visits every vertex of the graph exactly once and whose total length is minimum.
For example, consider Figure 2.16. There are two cycles satisfying our condition, namely C1: a → b → e → d → c → f → a and C2: a → c → b → e → d → f → a. C1 is shorter and is the solution of this problem instance.
Figure 2.16: A Graph
For the partition problem, the sum of subset problem and the satisfiability problem, the solutions are either "yes" or "no"; such problems are called decision problems. The minimal spanning tree problem and the traveling salesperson problem are called optimization problems.
For an optimization problem, there is always a corresponding decision problem. For instance, for the minimal spanning tree problem we can define a decision version as follows: given a graph G, determine whether there exists a spanning tree of G whose total length is no greater than a given constant c. This decision version can be solved as soon as the minimal spanning tree problem, which is an optimization problem, is solved: suppose the total length of a minimal spanning tree is a; if a ≤ c, the answer is "yes", otherwise the answer is "no". This decision version is called the minimal spanning tree decision problem. Similarly, we can define the longest common subsequence decision problem as follows: given two sequences, determine whether there exists a common subsequence of them whose length is greater than a given constant c. Again, the decision version can be solved as soon as the optimization problem is solved.
In general, optimization problems are more difficult than decision problems. To investigate whether an optimization problem is difficult to solve, we merely have to see whether its decision version is difficult: if the decision version is already difficult, the optimization version must be difficult.
Before discussing NP-complete problems, note that there is a term called NP problem. We cannot formally define NP problems here, as it is too complicated to do so. The reader may just remember the following: (1) NP problems are all decision problems. (2) Nearly all of the decision problems we encounter are NP problems. Among the NP problems, there are many problems which have polynomial algorithms; they are called P problems. For instance, the minimal spanning tree decision problem and the longest common subsequence decision problem are both P problems. There is also a large set of problems which, up to now, have no polynomial algorithms.
Figure 2.17: NP Problems
NP-complete problems constitute a subset of the NP problems, as shown in Figure 2.17. A precise and formal definition of NP-complete problems cannot be given in this book, but some important properties of NP-complete problems can be stated as follows:
(1) Up to now, no NP-complete problem has a worst case polynomial algorithm.
(2) If any NP-complete problem can be solved in polynomial time in the worst case, then all NP problems, including all NP-complete problems, can be solved in polynomial time in the worst case.
(3) Whether a problem is NP-complete or not has to be formally proved, and thousands of problems have been proved to be NP-complete.
(4) If the decision version of an optimization problem is NP-complete, this optimization problem is called NP-hard.
Based upon the above facts, we can conclude that all NP-complete and NP-hard problems must be difficult problems. Not only do they have no polynomial algorithms at present, it is quite unlikely that they will have polynomial algorithms in the future, because of the second property stated above. The satisfiability problem is a famous NP-complete problem, and the traveling salesperson problem is an NP-hard problem. Many other problems, such as the chromatic number problem, the vertex cover problem, the bin packing problem, the 0/1 knapsack problem and the art museum problem, are all NP-hard.
Later in the book, we will often claim that a certain problem is NP-complete without giving a formal proof. Once a problem is said to be NP-complete, it means that it is quite unlikely that a polynomial algorithm can be designed for it; in fact, the reader should not even try to find one. But the reader must understand that we cannot say that there exist no polynomial algorithms for NP-complete problems. We are merely saying that the chance of finding such algorithms is very small.
It should be noted here that NP-completeness refers to worst cases. Thus, it is still possible to find an algorithm for an NP-complete problem which has polynomial time-complexity in the average case. In our experience this is also quite difficult, as average case analysis is usually quite difficult to begin with. It is also possible to design algorithms which perform rather well in practice even though we do not have an average case analysis of them.
Should we give up hope once we have proved that a problem is NP-hard? No, we should not. Whenever we have proved a problem to be NP-complete or NP-hard, we should try to design an approximation algorithm for it. This is the topic of the next section.
2.9 Approximation Algorithms
As indicated in the previous section, many optimization problems are NP-hard, which means that it is quite unlikely that polynomial algorithms can be designed for them. It is therefore desirable to have approximation algorithms which produce approximate solutions with polynomial time-complexities.
Figure 2.18: A Graph
Let us consider the vertex cover problem. Given a graph G = (V, E), the vertex cover problem requires us to find a minimum number of vertices of V which cover all edges in E. For instance, for the graph in Figure 2.18, vertex a covers all edges, so the solution is {a}. For the graph in Figure 2.19, the solution is {b, d}.
It has been proved that the vertex cover problem is NP-complete. Algorithm 2.9 below is an approximation algorithm for this problem.
Figure 2.19: A Graph
Let us apply this approximation algorithm to the graph in Figure 2.18. Suppose we pick edge (a, d). We can see that all other edges are incident to a or d, so {a, d} is the approximate solution. Note that the optimum solution is {a}; the size of the approximate solution is twice that of the optimum solution.
Now we apply the algorithm to the graph in Figure 2.19. Suppose we pick (c, d). S will be {c, d}, and edges (b, c) and (d, e) will be eliminated. Edge (a, b) still remains, so we pick (a, b). The final S will be {a, b, c, d}. It was pointed out above that the optimum solution is {b, d}. Thus the approximation algorithm has again produced an approximate solution whose size is twice that of the optimum solution.
Algorithm 2.9 An Approximation Algorithm to Solve the Vertex Cover Problem
Input: A graph G = (V, E).
Output: An approximate solution for the vertex cover problem with performance ratio 2.
Step 1: Pick any edge e. Put the two end vertices u and v of e into S.
Step 2: Eliminate all edges which are incident to u or v.
Step 3: If there is no edge left, output S as the approximate solution. Otherwise, go to Step 1.
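A minimal Python sketch of Algorithm 2.9 follows. The edge set in the example is consistent with the edges of Figure 2.19 mentioned above, but since the figure itself is not reproduced here, it is an assumption; note also that which edge is "picked" first is arbitrary, so the returned cover may vary between runs.

```python
# A sketch of Algorithm 2.9 (2-approximation for vertex cover).
def approx_vertex_cover(edges):
    edges = set(edges)
    cover = set()
    while edges:                       # Step 3: repeat until no edge left
        u, v = next(iter(edges))       # Step 1: pick any edge
        cover.update((u, v))           # put both endpoints into S
        # Step 2: eliminate every edge incident to u or v
        edges = {e for e in edges if u not in e and v not in e}
    return cover                       # at most twice the optimum size

print(approx_vertex_cover({('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')}))
# e.g. {'a', 'b', 'c', 'd'} if (c, d) happens to be picked first
```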
It is by no means accidental that in each case the size of the approximate solution is twice that of the optimum solution. We shall prove below that Algorithm 2.9 always performs within this bound.
Let App be the solution produced by an approximation algorithm and let Opt be an optimal solution. The performance ratio of the approximation algorithm, denoted as $r$, is defined as

$$r = \frac{|App|}{|Opt|}.$$

For some approximation algorithms, the performance ratio is a function of the input size $n$; for instance, it may be $O(\log n)$ or $O(n)$. For other approximation algorithms, the performance ratio is a constant. In general, we like an approximation algorithm to have a constant performance ratio, and the constant should be as small as possible. For Algorithm 2.9, we shall prove that the performance ratio is less than or equal to 2.
Let $k$ edges be chosen by our approximation algorithm. Then $|App| = 2k$. Since the optimal solution Opt must be a vertex cover, every edge must be covered by at least one vertex in Opt. Due to the special property of our approximation algorithm, no two of the $k$ chosen edges share a vertex, so they must be covered by $k$ distinct vertices of Opt. Thus we have

$$|Opt| \ge k.$$

This means that

$$|App| = 2k \le 2\,|Opt|.$$

We conclude that $r \le 2$ for our approximation algorithm.