
Lecture Notes for CMPSCI 611

Advanced Algorithms

Micah Adler
Department of Computer Science
University of Massachusetts, Amherst

Draft Copy: Do Not Distribute


Acknowledgments
These lecture notes are an edited version of the scribe notes for CMPSCI 611 that were compiled by students
taking the course at the University of Massachusetts, Amherst, in the fall of 2000, the fall of 2001, and the
fall of 2002. In addition to the efforts of these students, these notes benefitted greatly from the editing work
of John Ridgway during the fall of 2000.
Chapter 1

Preliminaries

In this course we will discuss the design and analysis of algorithms. We will focus less on teaching a bag
of algorithms and more on the broadly-applicable techniques that will guide one in finding an appropriate
algorithm given a problem. We need a formalization for evaluating what these algorithms can and cannot
do; we need a model of computation.

1.1 A formal model of computation: the RAM model

In this course, we will be primarily concerned with the RAM (Random Access Memory) model of computation
because it is considered to be simple (so that theorems are easier to prove), realistic (representative of real
computations), and machine-independent. In the RAM model of computation, the CPU can access any
location in memory at unit cost.

Figure 1.1: RAM model of computation

1.1.1 Other possible models of computation

In the Turing Machine model, only a memory location's immediate neighbors can be accessed from the
current location. While this simplification makes it easier to prove theorems concerning the limits of
computability, it is unrealistic for the computers we use today, which have random access memory. This
makes the model less useful for discussing algorithms.

The Hierarchical Memory Model, which includes cache and disk, has a more realistic representation
of memory than the RAM model, because not all memory accesses are unit cost. In this model, proving
theorems, particularly concerning the analysis of running time, is much more difficult.
Parallel Processing models take into account the aspects of parallelizing computation.
We'll never mention these models again in the context of this course.

1.2 Running time

We analyze algorithms in order to determine whether or not it is feasible to find the solution to a given
problem with current resource constraints. One of the metrics we use to determine feasibility is the
running time of the algorithm, which is the product of the number of operations that the algorithm requires
to solve a given problem and the unit cost of one operation.
Allowable, unit-cost operations are:

- pairwise arithmetic operations
- access to memory
- logical operations
- comparisons (such as "larger than" or "same")

1.2.1 Asymptotic Behavior: Worst-Case Running Time

Once we have a formal model for computation, how many operations do we need to compute the solution to
a problem? We are interested in the rate at which the number of operations grows as a function of the input
size. For each particular input size, we measure the worst-case running time: on inputs of size n we denote
this worst-case running time T(n). Table 1.1 shows the actual (wall clock) time required to perform a number
of operations on a machine capable of executing one billion operations per second. The running times for
this machine are listed for various functions T(n) and input sizes n ranging from ten through one million.
This table illustrates the dramatic effect on running time of the asymptotic growth rate. In particular, for
sufficiently large inputs, leading constant factors in the expression T (n) do not have much of an effect on
whether a running time is realistic or not. As a result, we are much more concerned with the asymptotic
growth rate than with constant factors. Thus, we use asymptotic notation to hide constant factors and
focus on the asymptotic growth rate.

1.2.2 Review of asymptotic notation

Asymptotic notation allows us to formally state which details in T(n) we are ignoring. Let's first review
upper bounds. If the worst-case running time, T(n), is described as O(f(n)) (pronounced "big oh of f(n)"
or "order f(n)") then the growth rate of T(n) is asymptotically less than or equal to that of f(n). Figure
1.2 illustrates this concept, showing an example where T(n) ≤ c·f(n) provided that n ≥ n0. There must be
some constant c (independent of n) for which this is true. A formal description of the big-O upper bound is:

T(n) = O(f(n)) ⇔ ∃c, n0 such that ∀n ≥ n0, T(n) ≤ c·f(n)

                                        Input size n
T(n)        n = 10    n = 20    n = 50     n = 100        n = 1000     n = 10^6
n log n     30 ns     90 ns     0.2 μs     0.7 μs         10 μs        20 ms
n^2         100 ns    400 ns    2.5 μs     10 μs          1 ms         17 min
n^5         0.1 ms    3.2 ms    0.3 s      10.8 s         11.6 days    3·10^13 yrs
2^n         1 μs      1 ms      13 days    4·10^13 yrs
n!          4 ms      77.1 yrs

Table 1.1: Time required to perform T(n) operations (10^9 operations/second)

Figure 1.2: T (n) = O(f (n))

If T(n) is described as o(f(n)), then the asymptotic growth rate of T(n) is strictly less than that of f(n).
Figure 1.3 illustrates this concept, with T(n) < c·f(n) provided that n ≥ n0. For any c, there must be some
n0 for which this is true. The exact description of little o is as follows:

T(n) = o(f(n)) ⇔ ∀c > 0, ∃n0 such that ∀n ≥ n0, T(n) < c·f(n)

We also use the lower bounds big and little Omega. Big Omega denotes that the asymptotic growth rate of
T(n) is greater than or equal to that of f(n) and is described as follows:

T(n) = Ω(f(n)) ⇔ ∃c, n0 such that ∀n ≥ n0, T(n) ≥ c·f(n)

Little omega denotes that the growth rate of T(n) is strictly greater than that of f(n), and is described
below:

T(n) = ω(f(n)) ⇔ ∀c > 0, ∃n0 such that ∀n ≥ n0, T(n) > c·f(n)

The final type of asymptotic notation we use is Theta, the case where the asymptotic growth rate of T(n)
is equal to that of f(n). This case is described as follows:

Figure 1.3: T(n) = o(f(n))

T(n) = Θ(f(n)) ⇔ T(n) = O(f(n)) and T(n) = Ω(f(n))

This information is summarized in the following table:

≤   big O          T(n) = O(f(n)) ⇔ ∃c, n0 such that ∀n ≥ n0, T(n) ≤ c·f(n)       Figure 1.2
<   little o       T(n) = o(f(n)) ⇔ ∀c > 0, ∃n0 such that ∀n ≥ n0, T(n) < c·f(n)  Figure 1.3
≥   big Omega      T(n) = Ω(f(n)) ⇔ ∃c, n0 such that ∀n ≥ n0, T(n) ≥ c·f(n)
>   little omega   T(n) = ω(f(n)) ⇔ ∀c > 0, ∃n0 such that ∀n ≥ n0, T(n) > c·f(n)
=   Theta          T(n) = Θ(f(n)) ⇔ T(n) = O(f(n)) and T(n) = Ω(f(n))
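These definitions can also be probed numerically: given candidate constants c and n0, we can test T(n) ≤ c·f(n) over a range of n. The helper and the example functions below are illustrative choices of ours, not from the notes, and a finite check like this only suggests a bound rather than proving it.

```python
# Spot-check a big-O witness (c, n0): test T(n) <= c*f(n) at sampled n >= n0.
# A finite numerical check like this suggests, but does not prove, the bound.
def witnesses_big_o(T, f, c, n0, limit=10**6):
    step = max(1, limit // 1000)
    return all(T(n) <= c * f(n) for n in range(n0, limit, step))

T = lambda n: 5 * n * n + 100 * n    # a sample worst-case running time
f = lambda n: n * n                  # candidate f(n)

# 5n^2 + 100n <= 6n^2 exactly when n >= 100, so c = 6, n0 = 100 is a witness.
print(witnesses_big_o(T, f, c=6, n0=100))   # True
# c = 5 can never work: the 100n term always pushes T(n) above 5n^2.
print(witnesses_big_o(T, f, c=5, n0=100))   # False
```

Note that the leading constant 5 in T(n) forces any witness to use c > 5, but never changes the fact that T(n) = O(n²), which is exactly the sense in which asymptotic notation hides constant factors.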
Chapter 2

First Algorithmic Paradigm: Divide


and Conquer

In this course, we study broadly applicable algorithmic paradigms. The first such paradigm we study is
called divide and conquer. Divide and conquer algorithms work by breaking the problem down into smaller
and smaller subproblems (the divide step) until the subproblems are of a size that is trivial to solve.
The solutions to the subproblems are then recombined to get the solution to the original problem. Usually,
determining an efficient process for performing the recombination is the most difficult part of designing a
divide and conquer algorithm.

2.1 Merge sort

Merge sort starts with an unsorted list of numbers. We first divide the list into two halves (the divide step),
and then recursively sort the two halves (the recurse step). Finally, the results of these subproblems are
combined (the merge step) into a single sorted list. We can perform the merge in linear time by setting a pointer
to the beginning of each list and adding the elements one by one in sorted order to the new merged list,
always choosing the smaller element from the two lists.
We express the running time using a recurrence relation:

T(n) = 2T(n/2) + Θ(n)   for n > 1     (2 subproblems, plus the Θ(n) merge)
T(1) = Θ(1)                            (sort a list of length 1 in constant time)

We assume n is a power of 2 (although a similar analysis works if n is not a power of 2). We've broken the
problem into two subproblems, each of size n/2. After recursively solving the subproblems, the merge step
takes time linear in n. As we recurse, the subproblems get smaller until they reach a size of 1. At this point,
the algorithm has bottomed out. The recurrence relation describes the running time for inputs of size n > 1. We
must also express what happens at the base case, which is T(1) above.

Figure 2.1: Example of merge sort
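A sketch of merge sort as just described; the function names are ours.

```python
# Merge sort: divide the list, recursively sort each half, then merge in
# linear time by repeatedly taking the smaller front element of the halves.
def merge_sort(a):
    if len(a) <= 1:                      # base case: T(1) = Theta(1)
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])           # recurse on the two halves
    right = merge_sort(a[mid:])
    return merge(left, right)            # Theta(n) merge step

def merge(left, right):
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # always choose the smaller element
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])                 # one list may have leftover elements
    out.extend(right[j:])
    return out

print(merge_sort([5, 2, 9, 1, 5, 6]))    # [1, 2, 5, 5, 6, 9]
```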

2.2 General formula for solving recurrence relations

In divide and conquer algorithms we divide a problem of size n into a pieces, where each piece is of size n/b.
After recursively solving each subproblem, it takes Θ(n^β) operations to combine the subproblems. At each
level of the recursion, the problem size is reduced by a factor of b.
The general form for the recurrence relation for the divide and conquer algorithmic paradigm is:

T(n) = a·T(n/b) + Θ(n^β)   for n > 1
T(1) = Θ(1)                for n = 1

If we let α = log_b a, then the solution to the recurrence is dependent upon the relationship of β to α:

         Θ(n^β)         if β > α
T(n) =   Θ(n^α)         if β < α        (2.1)
         Θ(n^α log n)   if β = α

In the case of merge sort, a = b = 2, α = log₂ 2 = 1, and β = 1 = α. Since α = β, we know from Equation 2.1
that the solution must be of the form Θ(n^α log n). Then substituting the value α = 1 into this general
form, we obtain T(n) = Θ(n log n), as expected.
A recursion tree is a good representation for showing the intuition behind Equation 2.1. Consider the recursion
tree shown in Figure 2.2, where each node has a possible children. The running time for each level of the tree
is listed down the left side of the figure. The size of the subproblems for each level is listed down the right
side. To tally the total running time, we must account for the contribution of the subproblems at each level.
At each level we spend Θ(m^β) time per subproblem for the divide and merge steps, where m is the size of the
subproblem. From one level to the next, the size of the input is reduced by a factor of b, and this happens
log_b n times before the algorithm bottoms out. Given a level in the computation tree at depth d from the root,
there are a^d subproblems. The cost of a level in the tree is simply the sum of the cost across its subproblems.
We assume that the size of the original input n is a power of b. Thus, the height of the tree is log_b n.
Note: In this course, all references to log assume log base 2, unless it is explicitly stated otherwise.

Figure 2.2: Recursion tree for merge sort

By adding the running times of each level of the tree, we get the total running time of the algorithm to be:

T(n) = n^β + a·(n/b)^β + a²·(n/b²)^β + ⋯ + a^(log_b n)

If we then let r = a/b^β = b^(α−β), the following results from factoring out n^β:

T(n) = n^β · (1 + r + r² + ⋯ + r^(log_b n))

Recall that if r ≠ 1:

Σ_{i=0}^{k} r^i = (r^(k+1) − 1) / (r − 1)

We now consider the three cases of Equation 2.1.

Case 1: for r < 1 (β > α) we have:

T(n) = n^β · (1 − r^(log_b n + 1)) / (1 − r)

Thus,

n^β ≤ T(n) ≤ n^β · (1 / (1 − r)).

Since r is a constant independent of n, and in asymptotic notation we ignore constants, we get T(n) = Θ(n^β).
In this case, the terms in our summation for the total running time become smaller and smaller as we
proceed down the recursion tree, which corresponds to the fact that the time spent at the root of the tree
dominates the running time.

Figure 2.3: Matrix multiplication
Case 2: for r > 1 (β < α) we have:

T(n) = n^β · (r^(log_b n + 1) − 1) / (r − 1)
     = Θ(n^β · r^(log_b n + 1))              recall that r = b^(α−β)
     = Θ(n^β · (b^(α−β))^(log_b n + 1))
     = Θ(n^β · n^(α−β))                      since b^(α−β) is a constant
     = Θ(n^α)

In this case, the terms of the summation get larger and larger as we proceed down the tree. This case
corresponds to the leaves of the recursion tree dominating the running time.
Case 3: for r = 1 (α = β) we have:

T(n) = n^β · (1 + 1 + ⋯ + 1)    (log_b n + 1 terms)
     = n^β · log_b n + n^β
     = Θ(n^β log n)

In this case, the time spent at each level of the tree is the same, so the total running time is the time at each
level, Θ(n^β), multiplied by the number of levels in the tree, log_b n + 1.
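The three cases of Equation 2.1 can be wrapped in a small helper that reports which case applies; this sketch, its function name, and its output format are ours.

```python
import math

# Report the solution of T(n) = a*T(n/b) + Theta(n^beta) per Equation 2.1.
def divide_and_conquer_bound(a, b, beta):
    alpha = math.log(a, b)               # alpha = log_b(a)
    if beta > alpha:
        return f"Theta(n^{beta})"        # case 1: the root dominates
    if beta < alpha:
        return f"Theta(n^{alpha:.2f})"   # case 2: the leaves dominate
    return f"Theta(n^{beta} log n)"      # case 3: every level contributes equally

print(divide_and_conquer_bound(2, 2, 1))   # merge sort: Theta(n^1 log n)
print(divide_and_conquer_bound(8, 2, 2))   # Theta(n^3.00)
print(divide_and_conquer_bound(7, 2, 2))   # Theta(n^2.81)
```

The last two calls correspond to the naive and improved matrix multiplication recurrences analyzed later in this chapter.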

2.3 Matrix multiplication

2.3.1 Introduction

Input: Two n × n matrices A and B, having entries from any ring (e.g., the real numbers)
Output: C = A · B

Recall the definition of matrix multiplication [ALG97]:

In general, let B be an m × n matrix and A an n × p matrix. Let C be the product BA. The ijth entry of
C is the dot product of the ith row of B and the jth column of A. In other words, C is the m × p matrix
such that:
Figure 2.4: n-bit number multiplication

C_ij = Σ_{k=1}^{n} b_ik · a_kj

This definition lends itself to an algorithm. We refer to the algorithm that is derived from this definition as
the standard algorithm. The runtime of this standard algorithm for an n × n matrix is Θ(n³), since there are
n² entries and each can be computed in time O(n), assuming we can multiply any pair of matrix elements in
constant time.
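The standard algorithm is a direct transcription of the definition; the sketch and names below are ours.

```python
# The standard algorithm from the definition: C[i][j] = sum_k A[i][k]*B[k][j].
# Three nested loops over an n x n input give Theta(n^3) operations.
def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):          # Theta(n) work per entry
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))   # [[19, 22], [43, 50]]
```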
For centuries it seemed we could do no better, but in 1969 Strassen shocked the world, and inspired a new
direction in research, by coming up with a faster way, using a completely different sequence of operations.
To understand Strassen's approach, we start by looking at a simpler problem: multiplying two n-bit numbers.

2.3.2 Warm up: Multiplying two n-bit numbers

Input: A and B, n-bit integers
Output: C = A · B

The naive approach requires T(n) = Θ(n²), where n is the number of bits (assuming bit operations), because
each bit in B has to be multiplied by every bit in A. By taking the divide and conquer approach, we can
improve on this integer multiplication algorithm, because it turns out we don't have to do every pairwise
multiplication.
Divide each integer into two halves of n/2 bits each (see Figure 2.4).

A = A1·2^(n/2) + A2
B = B1·2^(n/2) + B2
A·B = (A1·B1)·2^n + (A1·B2 + A2·B1)·2^(n/2) + A2·B2        (2.2)

Thus we have divided the original n-bit size problem into four n/2-bit size subproblems. Combining the
subproblems takes time Θ(n) because this can be done with bit-shift operations and addition. So the running
time is:

T(n) = 4T(n/2) + Θ(n)   for n > 1
T(1) = Θ(1)
β = 1
α = log₂ 4 = 2
T(n) = Θ(n²)
Compared to the standard algorithm there is no improvement; we have designed a more complicated
algorithm that runs in the same time. The culprit is the number of subproblems. We have divided the
multiplication into 4 subproblems, which resulted in α = 2. We can actually do this with a recursive
formulation using 3 subproblems, by recovering A1·B2 + A2·B1 from the results of the other multiplications.
We use the following 3 intermediate subproblems:

P1 = A1 · B1
P2 = A2 · B2
P3 = (A1 + A2) · (B1 + B2)

Note that we can construct the above problems in linear time. Then

A · B = P1·2^n + (P3 − P1 − P2)·2^(n/2) + P2        (2.3)

The above combining can also be done in linear time. Let's show Equation 2.2 is equivalent to Equation 2.3.

P3 = (A1 + A2)(B1 + B2) = A1·B1 + A1·B2 + A2·B1 + A2·B2
P3 − P1 − P2 = A1·B2 + A2·B1
P1·2^n + (P3 − P1 − P2)·2^(n/2) + P2 = (A1·B1)·2^n + (A1·B2 + A2·B1)·2^(n/2) + A2·B2 = A·B

Now we have reduced the number of subproblems from 4 to 3. Let's take another look at the runtime:

T(n) = 3T(n/2) + Θ(n)   for n > 1
T(1) = Θ(1)

As we can see, we have increased the number of additions, causing the time required to recombine the
solutions to be larger than the original Θ(n). However, asymptotically it remains Θ(n), as the larger
constant is hidden in the Θ term.
Therefore,

β = 1
α = log₂ 3 ≈ 1.59
T(n) = Θ(n^(log₂ 3)) ≈ Θ(n^1.59)

(A nice property of this method is that although we described it in terms of bit operations, you could use the
same method by hand, in terms of digits, to do something faster than what you have been doing since elementary
school.)
Note: this is not the optimal solution. There is an algorithm based on FFTs that has a running time of
T(n) = Θ(n log n log log n), which is the fastest known solution to this problem. Although we will not
discuss that algorithm, we will later talk about the FFT algorithm that serves as its basis.
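The three-subproblem scheme above is known as Karatsuba's algorithm. A sketch over Python integers follows; the function name and the handling of the extra carry bit in P3 are ours.

```python
# The three-multiplication scheme of Equation 2.3, applied recursively:
# split each number into high and low halves, form P1, P2, P3, and
# recombine with shifts and additions.
def fast_mult(a, b, n):
    if n <= 2:
        return a * b                             # tiny base case: multiply directly
    half = n // 2
    a1, a2 = a >> half, a & ((1 << half) - 1)    # high / low halves of a
    b1, b2 = b >> half, b & ((1 << half) - 1)    # high / low halves of b
    p1 = fast_mult(a1, b1, half)
    p2 = fast_mult(a2, b2, half)
    p3 = fast_mult(a1 + a2, b1 + b2, half + 1)   # the sums may carry an extra bit
    # A*B = P1*2^(2*half) + (P3 - P1 - P2)*2^half + P2, as in Equation 2.3
    return (p1 << (2 * half)) + ((p3 - p1 - p2) << half) + p2

print(fast_mult(1234, 5678, 16) == 1234 * 5678)   # True
```

Since the identity behind the recombination holds exactly for any split, the result is correct for any input that fits in n bits.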

2.3.3 Strassens Amazing Algorithm using Divide and Conquer


Input: Two n × n matrices A, B
Output: C = A · B

Divide: Whereas before we divided integers of size n bits into sub-integers of n/2 bits, with matrices we
subdivide the matrices into 4 quadrants of submatrices, each of size (n/2 × n/2).

A = | A11  A12 |        B = | B11  B12 |
    | A21  A22 |            | B21  B22 |

Treating A and B as two 2 × 2 matrices, we can multiply them and write down the result:

C = | A11·B11 + A12·B21    A11·B12 + A12·B22 |        (2.4)
    | A21·B11 + A22·B21    A21·B12 + A22·B22 |
It is easy to verify that this is still valid even if A and B are submatrices rather than single values. We have
divided the original n × n problem into eight (n/2 × n/2) problems. The input size has been divided by 4, but
here n is the dimension of the matrix, so each problem is of size n/2 rather than n/4. The runtime is:

T(n) = 8T(n/2) + Θ(n²)   for n > 1
T(1) = Θ(1)
β = 2
α = log₂ 8 = 3
T(n) = Θ(n³)

Note: the recombination portion of the recurrence equation is n² because of the work in adding (n/2 × n/2)
submatrices.
This naive divide and conquer algorithm did not improve our runtime, and again the problem is the number of
recursive subproblems, which in this case is 8. However, like we did in the warm-up problem of multiplying two
n-bit numbers, we can reduce the number of multiplications by defining the following 7 sub-multiplications:

P1 = (A11 + A22) · (B11 + B22)
P2 = (A21 + A22) · B11
P3 = A11 · (B12 − B22)
P4 = A22 · (B21 − B11)
P5 = (A11 + A12) · B22
P6 = (A21 − A11) · (B11 + B12)
P7 = (A12 − A22) · (B21 + B22)        (2.5)
Each of the above sub-multiplications is a product of (n/2 × n/2) matrices. Hence the recombination will be
Θ(n²). We claim we can compute the product A · B with these combinations:

A · B = | P1 + P4 − P5 + P7    P3 + P5           |        (2.6)
        | P2 + P4              P1 − P2 + P3 + P6 |

If we substitute P1 to P7 with what we have defined, we will be able to see that each term in Equation 2.6
is equivalent to each term in Equation 2.4. Let's pick one term, P2 + P4, to demonstrate that this works.

P2 + P4 = ((A21 + A22) · B11) + (A22 · (B21 − B11))
        = A21·B11 + A22·B11 − A22·B11 + A22·B21
        = A21·B11 + A22·B21
The runtime of this algorithm is:

T(n) = 7T(n/2) + Θ(n²)   for n > 1
T(1) = Θ(1)
β = 2
α = log₂ 7 ≈ 2.81
T(n) = Θ(n^(log₂ 7)) ≈ Θ(n^2.81)

Note that although Strassen did improve the asymptotic bound of matrix multiplication, the number of
additions has also increased. In the obvious algorithm there were 4 additions; in Strassen's algorithm there
are 18 additions. As a result, in practice, people don't actually use Strassen's algorithm very often. It is
asymptotically faster, but the constant factor is much larger, so the asymptotic behavior does not take over
until fairly large values of n. In practice you would use Strassen's algorithm only if you were multiplying two
very large matrices.
The point where Strassen's algorithm is more efficient than a standard approach depends on a number of
architectural aspects of the specific machine the algorithm is being run on. This includes things like the
properties of the memory hierarchy (i.e., caches), as well as the relative speeds of multiplication and addition.
Different sources quote quite different numbers for this tradeoff point, ranging from at least as small as n = 8
to at least as large as n = 100. If you were writing a program to accomplish this, you would keep subdividing
the matrices. Once you reached the cutoff point for your particular machine, you would switch to the more
straightforward approach.
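Putting this together, a sketch of Strassen's algorithm for n a power of 2, with a machine-dependent cutoff below which we fall back to the standard algorithm; the helper names and the cutoff value are ours, and the seven products follow the standard statement of the scheme.

```python
# Strassen's seven-product scheme, with a cutoff for small subproblems.
def add(A, B):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(A, B)]

def sub(A, B):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(A, B)]

def standard(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def quadrant(M, r, c, h):
    return [row[c:c + h] for row in M[r:r + h]]

def strassen(A, B, cutoff=2):
    n = len(A)
    if n <= cutoff:                       # below the cutoff, use the standard algorithm
        return standard(A, B)
    h = n // 2
    A11, A12 = quadrant(A, 0, 0, h), quadrant(A, 0, h, h)
    A21, A22 = quadrant(A, h, 0, h), quadrant(A, h, h, h)
    B11, B12 = quadrant(B, 0, 0, h), quadrant(B, 0, h, h)
    B21, B22 = quadrant(B, h, 0, h), quadrant(B, h, h, h)
    P1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    P2 = strassen(add(A21, A22), B11, cutoff)
    P3 = strassen(A11, sub(B12, B22), cutoff)
    P4 = strassen(A22, sub(B21, B11), cutoff)
    P5 = strassen(add(A11, A12), B22, cutoff)
    P6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    P7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(P1, P4), P5), P7)   # P1 + P4 - P5 + P7
    C12 = add(P3, P5)
    C21 = add(P2, P4)
    C22 = add(add(sub(P1, P2), P3), P6)   # P1 - P2 + P3 + P6
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
I = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(strassen(A, I) == standard(A, I))   # True
```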
It turns out that Strassen's is not the fastest known algorithm for matrix multiplication. While Strassen's
algorithm multiplies two 2 × 2 matrices using 7 multiplies, Pan extended this [P84]. Instead of dividing the
problem into recursive subproblems 1/4 the size of the original problem, he divided them up into much smaller
pieces.
Pan came up with a way to multiply two:

- 68 × 68 matrices using 132,464 multiplies
- 70 × 70 matrices using 143,640 multiplies
- 72 × 72 matrices using 155,424 multiplies

We leave as an exercise: which of these has the fastest runtime?


Currently, the asymptotically fastest known algorithm for matrix multiplication is O(n^2.376) [CW90], not
using divide and conquer. There are also randomized algorithms that are faster if you don't need an exact
result.
The best known lower bound for this problem is Ω(n²). This is because we must at least write down an
answer for every element in the n × n matrix, which will take Ω(n²) time. It should be noted that the
above-mentioned results are based on the assumption that we are dealing with dense matrices. If we are
dealing with sparse matrices, there are different algorithms that are faster.

2.4 Closest Pair of Points in a Plane

The problem statement: we have a set of points in a Euclidean plane and we want to find the pair of points
that have minimum distance between them.
Figure 2.5: Points in the Plane

Let: d(p, q) be the (Euclidean) distance between points p and q.

Input: A set P of n points in the plane.
Output: The pair p, q ∈ P (p ≠ q) such that d(p, q) is minimal.
Actually: We shall only describe the algorithm for finding the minimum distance instead of the pair of
points which achieves it. Minor modifications to this algorithm allow us to actually find the pair that
achieves this minimum as well.

The obvious algorithm is to compare all pairs, and choose the minimum distance. This requires time Θ(n²).
We will use divide and conquer to do the same task in Θ(n log n).

2.4.1 Divide and Conquer Approach to Find Minimum Distance Between Pairs
of Points in the Plane

Algorithm:

1. Base Case: If n < 3, try all pairs. This takes constant time because there can be at most 1 pair to
   consider.
2.a Otherwise divide P into two halves with a vertical line such that there are equal numbers of points
    on either side of the line. Points which lie on the dividing line will be assigned to either side of the
    dividing line so that we have equal numbers of points assigned to each side. We get n/2 points in PL
    and n/2 points in PR, where PL are points on the left and PR are points on the right.
2.b Recursively find
    δL = min d(p, q) where p, q ∈ PL (p ≠ q).
    δR = min d(p, q) where p, q ∈ PR (p ≠ q).
    The answer is not to just take the minimum of these two values, because we may have divided the
    closest pair.
2.c (Combining) Test every pair which crosses the dividing line:
    δM = min d(p, q) where p ∈ PL and q ∈ PR.
    A straightforward implementation of this step requires Θ(n²) time (n/2 · n/2 pairs), since there are n/2
    points on either side of the line.
3. Return min(δL, δR, δM)
Figure 2.6:

Figure 2.7:

The algorithm as stated has a running time of Θ(n²), since the combining step requires time Θ(n²). We
can improve on this. The key here is that when we get to step 2.c, we have already solved the recursive
subproblems, and we know that the final value returned, δ, must be no greater than min(δL, δR). Thus, if
the minimum distance pair goes across the median, it must be within δ of that median, since points that
are separated by distance greater than δ in their x-values cannot be candidates for the pair with minimum
distance. Likewise, the minimum distance pair cannot be separated by a distance greater than δ in their
y-values either. We will use this fact to create a third subset of points called PM.
Note that it is possible that two points lie at the same location on the dividing line and during partitioning
one is put into PL and one into PR. However, if both points are put into the same partition (which must be
the case if there are more than two points at the same location) then either δL = 0 or δR = 0.

2.c.i. Let PM be the set of points from both PL and PR that are within horizontal distance δ of the dividing
line. Figure 2.6 depicts two strips of width δ on either side of the dividing line that form the
set PM.

2.c.ii. For each point p ∈ PM that is to the left of the dividing line, check the distance to points q ∈ PM to
the right of the dividing line that are in the rectangle depicted in Figure 2.7 (points to the right of the
dividing line are treated similarly).

If we compare p to all points in the rectangle, then we are guaranteed to find a minimum distance pair,
provided that the pair has distance less than δ (otherwise, the minimum distance pair does not go between
the two halves). Note that if a closer point lies below the rectangle, when we do the check for that point,
we will find that shorter distance. So, because we are checking points on both sides of the center dividing
line, we do not need to check both above and below the point p.
We start by analyzing C(n) = the total number of pairwise comparisons over n points. The total number of
pairwise comparisons for step 2.c is Θ(n), since for each point in PM, we only need to calculate the distance
to 4 points across the dividing line. If there were more than 4 then some pair would be closer than δ, and
the only way you can get 4 comparisons is if they are all at distance δ (at the four corners of the rectangle).
We see that

C(n) = 2C(n/2) + Θ(n)
β = 1
α = log₂ 2 = 1
α = β
C(n) = Θ(n log n)

We still need to be able to

A. divide the plane into sets of points on the left and right, and
B. for any point p ∈ PM, quickly find the points in its special rectangle.

How do we do this?

A. Sort the points by their x-value. This allows us to find PL, PR, and PM in linear time. To find the points
   in PL, pick the first n/2 points. The points in PR are the last n/2 points. Points within distance δ of the
   dividing line can also be found easily using this sorted order once we know the value of δ.
B. Sort the points in PM by their y-value. The points in the rectangle of a given point p are among the next
   seven points in the sorted order of PM. If we only had to consider points on the right side of the dividing
   line, looking at the next four points in the sorted order would be sufficient, since at most we would find
   the four corners of p's box. But PM also contains points to the left of the dividing line, where there
   can be at most three points (other than p) at the corners of another box. See Figure 2.8. There
   can be overlapping points at the same location that were arbitrarily assigned to PL and PR, whose
   distance is zero (thus it is necessary to check all seven points).

We next analyze the runtime. If we are sorting the points at each recursive step, we get the following
recurrence relation:

T(n) = 2T(n/2) + Θ(n log n)

Although this does not fit into the model of the generic formula for recurrence relations, we state here that
it gives us a runtime of T(n) = Θ(n log² n).
A faster way to do this is to sort the points once at the start of the algorithm and then maintain two sorted
lists: one sorted by x-values (let us call it X), one sorted by y-values (let us call it Y ). So, if the array X
received by a recursive call is already sorted then division of P into PL and PR requires splitting X into XL
and XR , which is easily accomplished in linear time. We also need to split Y into YL and YR . This can be
done in linear time by simply examining the points in array Y in order. If a point in Y is in PL we append
it to the end of YL, otherwise to the end of YR. In linear time, we can also create the array YM, which is the
points in PM sorted by y-value. Again, we examine the points in Y in order, placing into YM those points
that are also in the set PM, that is, those that are within δ of the central dividing line.

Figure 2.8: The seven candidate points surrounding p in PM
Thus, we sort the arrays X and Y only once at the beginning. This takes time Θ(n log n). Then, the
additional work required at each recursive step is linear. To analyze the runtime, we note that the recursive
portion of our algorithm requires time as described by the following recurrence relation:

T(n) = 2T(n/2) + Θ(n)

This gives us a runtime of T(n) = Θ(n log n) for the recursive portion of the algorithm, for a total runtime
of Θ(n log n).
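The complete algorithm can be sketched as follows; the names are ours, and for simplicity this sketch assumes the input points are distinct (sidestepping the duplicate-point case discussed above) and re-sorts at the top level only.

```python
import math

# Closest pair by divide and conquer: pre-sort by x and y once, split,
# recurse, then scan the delta-strip in y-order, checking each point
# against the next 7 points in that order.
def closest_pair_distance(points):
    X = sorted(points)                         # sorted by x (then y)
    Y = sorted(points, key=lambda p: p[1])     # sorted by y
    return _rec(X, Y)

def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def _rec(X, Y):
    n = len(X)
    if n <= 3:                                 # base case: try all pairs
        return min((_dist(X[i], X[j]) for i in range(n) for j in range(i + 1, n)),
                   default=float("inf"))
    mid = n // 2
    XL, XR = X[:mid], X[mid:]
    left_set = set(XL)                         # assumes distinct points
    YL = [p for p in Y if p in left_set]       # split Y in linear time
    YR = [p for p in Y if p not in left_set]
    delta = min(_rec(XL, YL), _rec(XR, YR))
    x_mid = X[mid][0]
    YM = [p for p in Y if abs(p[0] - x_mid) < delta]   # the strip, in y-order
    for i, p in enumerate(YM):
        for q in YM[i + 1:i + 8]:              # at most 7 candidates ahead
            delta = min(delta, _dist(p, q))
    return delta

pts = [(0, 0), (5, 5), (1, 1), (9, 3), (5, 5.5)]
print(closest_pair_distance(pts))   # 0.5, between (5, 5) and (5, 5.5)
```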

2.5 Fast Fourier Transform and Fast Polynomial Multiplication

We next study the fast Fourier transform (FFT), the last divide and conquer algorithm we shall consider.
The FFT has many applications; our motivating application will be polynomial multiplication. Using the
FFT for polynomial evaluation and interpolation makes it possible to multiply two degree n − 1 polynomials
in time Θ(n log n), whereas a more obvious algorithm takes time Θ(n²).

2.5.1 Polynomial Arithmetic using Coefficient Representation

Suppose A(x) and B(x) are polynomials of degree n − 1:

A(x) = a0 + a1·x + a2·x² + … + a_{n−1}·x^{n−1}
B(x) = b0 + b1·x + b2·x² + … + b_{n−1}·x^{n−1}

Assume that A(x) and B(x) are represented as coefficient vectors a and b (this is what is known as
coefficient representation) [CLRS, p. 824]:

a = (a0, a1, a2, …, a_{n−1})
b = (b0, b1, b2, …, b_{n−1})

Addition of two polynomials can be done in time Θ(n) by adding the corresponding terms:

A(x) + B(x) = ((a0 + b0), (a1 + b1), (a2 + b2), …, (a_{n−1} + b_{n−1}))

Multiplication is not quite as simple as addition. The obvious algorithm is to pairwise multiply each pair
of terms and combine like terms. This takes time Θ(n²).
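The two coefficient-representation operations just described can be sketched as follows; the function names are ours.

```python
# Coefficient-vector arithmetic: addition is Theta(n); the obvious product
# multiplies every pair of terms and combines like powers, Theta(n^2).
def poly_add(a, b):
    return [x + y for x, y in zip(a, b)]

def poly_mult_naive(a, b):
    c = [0] * (len(a) + len(b) - 1)       # product has degree (n-1) + (n-1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj           # x^i * x^j contributes to x^(i+j)
    return c

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(poly_mult_naive([1, 2], [3, 4]))    # [3, 10, 8]
print(poly_add([1, 2], [3, 4]))           # [4, 6]
```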

2.5.2 Polynomial Arithmetic using Point-Value Representation

Point-value representation (PVR) is an alternate way in which to represent polynomials. Using this
representation, a degree n − 1 polynomial is evaluated at n distinct points. For example, we would represent
A(x) as follows:

A(x) = {(x0, y0), (x1, y1), …, (x_{n−1}, y_{n−1})}

where y_i = A(x_i), and if i ≠ j then x_i ≠ x_j.

Claim 1 A point-value representation with n distinct points defines a unique degree n − 1 polynomial.

Proof: See [CLRS, pp. 825826]


Addition in PVR can be done by using the same x values to specify A(x) and B(x), and adding the
corresponding y values. This takes time Θ(n) for two degree n − 1 polynomials.

A(x) = {(x0, y0), (x1, y1), …, (x_{n−1}, y_{n−1})}
B(x) = {(x0, y′0), (x1, y′1), …, (x_{n−1}, y′_{n−1})}
A(x) + B(x) = {(x0, y0 + y′0), (x1, y1 + y′1), …, (x_{n−1}, y_{n−1} + y′_{n−1})}

Multiplication in PVR is similar to addition except that we multiply corresponding terms instead of adding
them. This also takes time Θ(n) for two degree n − 1 polynomials.

A(x) = {(x0, y0), (x1, y1), …}
B(x) = {(x0, y′0), (x1, y′1), …}
A(x) · B(x) = {(x0, y0·y′0), (x1, y1·y′1), …}

Note that A(x) · B(x) is a degree 2n − 2 polynomial, which needs 2n − 1 points in order to be specified.
Therefore, we should start with point-value representations of A(x) and B(x) that specify 2n − 1 points.

2.5.3 A Framework for Fast Polynomial Multiplication

We have seen that multiplication using point-value representation takes linear time. Unfortunately, for
most problems, polynomials are given in coefficient representation and we would like our answers to be
in coefficient representation as well. We will now look at a new approach that will enable us to multiply
polynomials given in coefficient representation and obtain an answer in coefficient representation in time
Θ(n log n), rather than the time Θ(n²) required by the obvious algorithm.
The general framework we will use for fast polynomial multiplication involves three stages:
Figure 2.9: A graphical outline of polynomial multiplication using FFT.

1. Evaluation: Convert the coefficient representations of A(x) and B(x) to PVR by evaluating each
   polynomial at 2n − 1 points.

2. Multiplication: Multiply the polynomials in PVR.

3. Interpolation: Convert the result back to coefficient representation.

This approach is shown graphically in Figure 2.9. We will use the FFT to do both evaluation and interpolation
in time (n log n).

2.5.4 Fast Polynomial Evaluation



Initially it may seem that it takes time Θ(n²) to evaluate a polynomial at n points. (Evaluation at one point
takes time Θ(n) using Horner's rule [CLRS, p. 824], so evaluation at n points using this approach takes
time n · Θ(n) = Θ(n²).) However, because we can choose the points at which we evaluate the polynomials
A(x) and B(x), we can select special points that make evaluation faster. The basic idea of the FFT is to
choose the complex nth roots of unity as the evaluation points because these have properties that can
lead to faster evaluation.
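Horner's rule referenced above can be sketched as follows; the function name is ours.

```python
# Horner's rule: evaluate a degree n-1 polynomial at one point in Theta(n)
# by nesting: a0 + x*(a1 + x*(a2 + ...)). Evaluating at n arbitrary points
# this way therefore costs Theta(n^2).
def horner(coeffs, x):
    result = 0
    for a in reversed(coeffs):    # coeffs = (a0, a1, ..., a_{n-1})
        result = a + x * result
    return result

# A(x) = 1 + 2x + 3x^2 at x = 2: 1 + 4 + 12 = 17
print(horner([1, 2, 3], 2))   # 17
```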

Definition 2 The complex nth roots of unity are the solutions to xⁿ = 1. These are n equally spaced points
around the unit circle in the complex plane.

Definition 3 The principal nth root of unity (ω_n) is the first complex nth root of unity encountered
starting at 1 and going counter-clockwise around the unit circle. (It is also the number that makes an angle
of 360/n degrees with the real axis on the complex plane.)

Figure 2.10 shows the 8th roots of unity on the complex plane, also identifying the principal 8th root. Note
that the x-axis represents the real components of the complex plane and the y-axis represents the imaginary
components of the complex plane.
One property of the complex plane that will be very useful to us is that if a complex number ω makes an
angle of θ degrees with the real axis, then the complex number ω^k makes an angle of kθ degrees with the
real axis. This property is illustrated in Figure 2.11.
From this property arises another important property, known as the Halving lemma.
Figure 2.10: Complex 8th roots of unity (n = 8).

Figure 2.11: A property of the complex plane.


Lemma 4 (Halving lemma) If n is even, then the squares of the n complex nth roots of unity are the n/2
complex (n/2)nd roots of unity, each repeated twice.

e.g.

ω_n^0, ω_n^1, ω_n^2, . . . , ω_n^{n−1}   (start with the n complex nth roots of unity)

(ω_n^0)², (ω_n^1)², (ω_n^2)², . . . , (ω_n^{n−1})²   (then square each one)

ω_{n/2}^0, ω_{n/2}^1, . . . , ω_{n/2}^{n/2−1}, ω_{n/2}^0, ω_{n/2}^1, . . . , ω_{n/2}^{n/2−1}   (result: the complex (n/2)nd roots of unity repeated twice)

Because we are doubling the angle of each of the n points by squaring it, we end up with half as
many points around the unit circle, each repeated twice.
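The Halving Lemma is easy to check numerically for a small n (an illustration, not from the notes; rounding guards against floating-point noise):

```python
import cmath

def nth_roots(n):
    """Return the n complex nth roots of unity w_n^0, ..., w_n^{n-1}."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def as_multiset(zs):
    # Round to avoid floating-point noise when comparing complex values.
    return sorted((round(z.real, 9), round(z.imag, 9)) for z in zs)

n = 8
squares = as_multiset(w ** 2 for w in nth_roots(n))
halved = as_multiset(nth_roots(n // 2) * 2)  # (n/2)nd roots, each twice
print(squares == halved)  # True: the Halving Lemma holds for n = 8
```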

2.5.5 The Fast Fourier Transform for Evaluation

Before looking at the more general case, we'll look at a small example to demonstrate how using the complex
nth roots of unity can make evaluation more efficient.
Example:
Let us consider the case when n = 4, and A(x) is a degree 3 polynomial:
A(x) = a0 + a1·x + a2·x² + a3·x³
Let us evaluate A(x) at the complex 4th roots of unity (i.e., ω_4^0, ω_4^1, ω_4^2, ω_4^3), which are
ω_4^0 = 1
ω_4^1 = i
ω_4^2 = −1
ω_4^3 = −i
Thus, A can be written in point-value representation as follows:

A(x) = { (ω_4^0, a0 + a1 + a2 + a3),
         (ω_4^1, a0 + a1·ω_4 + a2·ω_4² + a3·ω_4³),
         (ω_4^2, a0 + a1·ω_4² + a2·ω_4⁴ + a3·ω_4⁶),
         (ω_4^3, a0 + a1·ω_4³ + a2·ω_4⁶ + a3·ω_4⁹) }

As a very rough measure of efficiency, we shall here count the number of additions that need to be performed
for a computation. This evaluation requires 12 additions. But the number of additions can be reduced
by substituting the complex values from above and reordering to get a point-value representation as shown
below:

A(x) = { (1, (a0 + a2) + (a1 + a3)),
         (ω_4, (a0 + a2·ω_4²) + ω_4·(a1 + a3·ω_4²)),
         (ω_4², (a0 + a2) − (a1 + a3)),
         (ω_4³, (a0 + a2·ω_4²) − ω_4·(a1 + a3·ω_4²)) }

We see that, in the above representation, certain terms are repeated. By taking advantage of these repetitions,
the work can be brought down to 8 additions.
Let us see how this approach can be generalized to evaluate a degree n − 1 polynomial at the complex nth
roots of unity. Let A(x) be a degree n − 1 polynomial. By grouping the even-indexed coefficients and the
odd-indexed coefficients together we get the following (note that we are assuming that n − 1 is odd):

A(x) = a0 + a1·x + a2·x² + . . . + a_{n−1}·x^{n−1}
     = (a0 + a2·x² + a4·x⁴ + . . . + a_{n−2}·x^{n−2}) + x·(a1 + a3·x² + a5·x⁴ + . . . + a_{n−1}·x^{n−2})
     = A_even(x²) + x·A_odd(x²)

where A_even is a polynomial of degree n/2 − 1, with coefficients (a0, a2, a4, . . . , a_{n−2}), and
A_odd is a polynomial of degree n/2 − 1, with coefficients (a1, a3, a5, . . . , a_{n−1}).

From the Halving Lemma, evaluation of A_even at (ω_n^0)², (ω_n^1)², (ω_n^2)², . . . , (ω_n^{n−1})² is equivalent to the
evaluation of A_even at ω_{n/2}^0, ω_{n/2}^1, . . . , ω_{n/2}^{n/2−1}, ω_{n/2}^0, ω_{n/2}^1, . . . , ω_{n/2}^{n/2−1}. The same holds for A_odd. Thus, the problem
of evaluating a degree n − 1 polynomial at the nth roots of unity reduces to the evaluation of two degree
n/2 − 1 polynomials at the (n/2)nd roots of unity. So now we have reduced the problem to two smaller recursive
sub-problems.
Before we see the actual algorithm, let us define the following arrays:

y : array such that y_k = A(ω_n^k)
y^even : array such that y^even_k = A_even(ω_{n/2}^k)
y^odd : array such that y^odd_k = A_odd(ω_{n/2}^k)

Now let's look at the algorithm for fast polynomial evaluation.

Algorithm: FFT(a0, a1, a2, . . . , a_{n−1})

if n = 1, return a0
else
    y^even := FFT(a0, a2, a4, . . . , a_{n−2})
    y^odd := FFT(a1, a3, a5, . . . , a_{n−1})
    for k = 0 to n/2 − 1
        y_k := y^even_k + ω_n^k · y^odd_k
        y_{k+n/2} := y^even_k − ω_n^k · y^odd_k
    return y

Running time: The FFT has the following recurrence relation, since we solve 2 sub-problems, each of size
n/2, and the cost of combining the results is linear in n:

T(n) = 2T(n/2) + Θ(n)
T(1) = Θ(1)

Solving this recurrence relation yields:

T(n) = Θ(n log n)
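The recursive algorithm above can be sketched in Python (a minimal illustration, not from the notes; n is assumed to be a power of two, and complex arithmetic comes from the standard cmath module):

```python
import cmath

def fft(a):
    """Evaluate the polynomial with coefficients a = (a0, ..., a_{n-1}) at
    the n complex nth roots of unity, following the recursive algorithm
    above. n = len(a) is assumed to be a power of two."""
    n = len(a)
    if n == 1:
        return [a[0]]
    y_even = fft(a[0::2])   # recurse on the even-indexed coefficients
    y_odd = fft(a[1::2])    # recurse on the odd-indexed coefficients
    y = [0] * n
    for k in range(n // 2):
        w = cmath.exp(2j * cmath.pi * k / n)   # w_n^k
        y[k] = y_even[k] + w * y_odd[k]
        y[k + n // 2] = y_even[k] - w * y_odd[k]
    return y

# A(x) = 1 + x + x^2 + x^3 evaluates to 4 at x = 1 and to 0 at i, -1, -i.
print(fft([1, 1, 1, 1]))
```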

2.5.6 Fast Interpolation at the Complex nth Roots of Unity

So far we have seen how to perform evaluation of a polynomial at the complex nth roots of unity using the
FFT. However, for the polynomial multiplication to be complete we must have an algorithm that can be
used to convert the PVR of the polynomial back to coefficient form. This is called interpolation. It turns
out that the FFT can also be used to do interpolation in time Θ(n log n).
In the interpolation step, we are given the PVR of the polynomial:

{(ω_n^0, y0), (ω_n^1, y1), . . . , (ω_n^{n−1}, y_{n−1})}

where y_i = A(ω_n^i), and we wish to compute a0, a1, . . . , a_{n−1}.
Note that this is the opposite of the evaluation step, where we were given a0, a1, . . . , a_{n−1} and we computed
y0, y1, . . . , y_{n−1}.
In fact, let's take a moment to look back at the evaluation step in a different way. We can write the evaluation
of the polynomial at the complex nth roots of unity as the matrix product:

V_n · (a0, a1, a2, . . . , a_{n−1})^T = (y0, y1, y2, . . . , y_{n−1})^T

where V_n is a matrix of the following form (this is an example of a type of matrix known as a
Vandermonde matrix):

      [ 1   1             1               . . .   1                    ]
      [ 1   ω_n           ω_n²            . . .   ω_n^{n−1}            ]
V_n = [ 1   ω_n²          ω_n⁴            . . .   ω_n^{2(n−1)}         ]
      [ .   .             .               .       .                    ]
      [ 1   ω_n^{n−1}     ω_n^{2(n−1)}    . . .   ω_n^{(n−1)(n−1)}     ]

Interpolation, on the other hand, involves the following multiplication:

(a0, a1, a2, . . . , a_{n−1})^T = V_n^{−1} · (y0, y1, y2, . . . , y_{n−1})^T
We can show that V_n^{−1} has the following form:

Figure 2.12: Another property of the complex plane.

                  [ 1   1               1                 . . .   1                      ]
                  [ 1   ω_n^{−1}        ω_n^{−2}          . . .   ω_n^{−(n−1)}           ]
V_n^{−1} = (1/n) ·[ 1   ω_n^{−2}        ω_n^{−4}          . . .   ω_n^{−2(n−1)}          ]
                  [ .   .               .                 .       .                      ]
                  [ 1   ω_n^{−(n−1)}    ω_n^{−2(n−1)}     . . .   ω_n^{−(n−1)(n−1)}      ]

As an exercise, you can verify that V_n^{−1} is indeed the inverse by showing that V_n^{−1} · V_n = I.
Thus, not counting the 1/n factor, V_n^{−1} is the same as V_n, except that we substitute ω_n^{−1} for ω_n. Looking
back over the FFT algorithm, we can see that the reason this algorithm works is based on the Halving
Lemma. However, if we substitute ω_n^0, ω_n^{−1}, ω_n^{−2}, . . . , ω_n^{−(n−1)} for ω_n^0, ω_n^1, ω_n^2, . . . , ω_n^{n−1}, then we still get the
analogous result to the Halving Lemma (see Figure 2.12; basically we are still going around the unit circle
on the complex plane, only now we are going clockwise around the circle, whereas before we were going
counter-clockwise). Thus, roughly the same algorithm works. In particular, we can use the FFT algorithm
with three modifications:

1. switch y and a

2. replace ω_n with ω_n^{−1}

3. divide each final result by n.

The recurrence for this algorithm is the same as the recurrence for the FFT evaluation algorithm, so we can
also do interpolation in time Θ(n log n). This interpolation is also called the inverse F.F.T. Thus, by using
the F.F.T. and the inverse F.F.T., we can transform a polynomial of degree bound n back and forth between
its coefficient and point-value representations in time Θ(n log n). Thus we have shown that we can multiply
two polynomials in Θ(n log n) time.
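Putting the three stages together, the whole pipeline can be sketched as follows (a self-contained illustration, not from the notes: the helper names are ours, zero-padding to a power of two is an assumed implementation detail, and the final rounding is appropriate only for integer coefficients):

```python
import cmath

def _transform(a, sign):
    """Shared recursive core: sign=+1 gives the FFT, sign=-1 the
    (unscaled) inverse FFT with w_n replaced by w_n^{-1}."""
    n = len(a)
    if n == 1:
        return [a[0]]
    y_even = _transform(a[0::2], sign)
    y_odd = _transform(a[1::2], sign)
    y = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        y[k] = y_even[k] + w * y_odd[k]
        y[k + n // 2] = y_even[k] - w * y_odd[k]
    return y

def multiply(a, b):
    """Multiply two coefficient-representation polynomials in Theta(n log n):
    evaluate, multiply pointwise, interpolate, and divide by n."""
    n = 1
    while n < len(a) + len(b) - 1:   # need at least 2n-1 points; pad to a power of 2
        n *= 2
    fa = _transform(a + [0] * (n - len(a)), +1)
    fb = _transform(b + [0] * (n - len(b)), +1)
    fc = _transform([x * y for x, y in zip(fa, fb)], -1)
    return [round((v / n).real) for v in fc[:len(a) + len(b) - 1]]

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(multiply([1, 2], [3, 4]))
```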
Figure 2.13: A periodic function f(t).

Figure 2.14: f(t) as a sum of elementary periodic functions.

2.5.7 What is a Fourier Transform?

Let f(t) be any periodic function with period length T, as shown in Figure 2.13. The function f(t) can be
expressed directly (i.e., by describing its value at every point within a period). Furthermore, it can also be
expressed as a linear sum of elementary periodic functions with periods T, T/2, T/3, . . .. The elementary
periodic functions are shown in Figure 2.14. The coefficients of the elementary periodic functions, denoted
y0, y1, y2, . . ., are known as the Fourier coefficients. The change from representing the function f(t) directly
to representing it using its Fourier coefficients is known as a Fourier Transform.
In contrast to the Fourier Transform, the Discrete Fourier Transform starts with a periodic function f(t)
that is known only at n evenly spaced discrete points per period. Let a = (a0, a1, . . . , a_{n−1}) be the
values of f(t) at the n evenly spaced values of t, and let y = (y0, y1, . . . , y_{n−1}) be the first n Fourier
coefficients of this discrete approximation of f(t). The conversion of the vector a to the vector
y is the Discrete Fourier Transform (the DFT), and the reverse conversion is its inverse. It turns out that
this transformation is exactly what we have been computing: if y = DFT(a), then y consists of the n values
produced when we evaluate the polynomial with coefficients defined by a at the n nth roots of unity.
Thus, we have two equivalent interpretations of the DFT: it is a transformation from a representation of the
periodic function f(t) at n discrete points to a representation of f(t) using its Fourier coefficients; it is also a
transformation from the coefficient representation of a polynomial to its point-value representation, where
the n values used are the nth roots of unity. The FFT, on the other hand, is simply the name given to the
algorithm we have described for computing the DFT.
Chapter 3

Greedy Algorithms and Matroids

Definition 5 A greedy algorithm is an algorithm that finds a solution by adding elements to the solu-
tion one by one, where each element that is added is the best current choice without regard to the future
consequences of this choice.

A greedy algorithm is usually easy to program, and it usually runs quickly. However, it is not always the
case that the greedy algorithm finds the optimal solution. Our goal is to develop general techniques for
determining if the greedy algorithm does return the best solution, or if more sophisticated methods are
required.

3.1 First Application of Greedy Algorithms: Minimum Spanning Tree

We start off with a specific problem for which the greedy algorithm finds the best solution: the Minimum
Spanning Tree (MST) Problem.
Input: An undirected, connected graph G = (V, E) with edge weights.
Output: The minimum-weight subset of edges E′ ⊆ E such that the graph G′ = (V, E′) is acyclic and
connected.
In the following algorithm (called Kruskal's algorithm), we maintain F ⊆ E, which is always a forest (a
possibly unconnected, acyclic subset of the edges of our graph).
Kruskal's Algorithm:
Sort edges by non-decreasing weight
F = ∅
Do until F is a spanning tree of G:
    get the next edge e
    if F + e is acyclic then F = F + e
Return F

Note that this is in fact a greedy algorithm: every edge added to the forest is always the best possible edge
(i.e., the edge of minimum weight) that can be added to the forest, without regard to the future implications
of adding this edge.

3.1.1 The running time of Kruskal's algorithm
1. Sorting: Θ(|E| log |E|).

2. Determining if F + e is cyclic: We can implement the algorithm by using an array with an entry
for each vertex. Each such entry contains a label that denotes which connected component the corre-
sponding vertex belongs to. Now, to check whether F + e is cyclic, we just need to check the entries in the
array for the two vertices that e connects. If they are the same then F + e is cyclic; if they are different
then F + e is acyclic. This check takes Θ(1) time and is performed Θ(|E|) times,
for a total of O(|E|).

3. Adding e to F: To do this we switch the labels for the smaller connected component to be the same
as the labels for the larger connected component, which takes O(|V|) time. This step is performed
exactly |V| − 1 times, for a total of O(|V|²) time to add the edges to the forest.

Thus the total running time of the algorithm is O(|V|² + |E| log |E|). Later, we shall study a more efficient
technique for checking whether an edge can be added to a graph without causing a cycle. This improves the
running time to O(|E| log |E|).
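The component-label implementation described above can be sketched as follows (a minimal illustration; the function name and edge representation are ours, and for brevity it relabels a fixed side rather than always the smaller component, which affects only the constant factors here):

```python
def kruskal(num_vertices, edges):
    """Kruskal's algorithm with the component-label array described above.
    edges is a list of (weight, u, v) tuples over vertices 0..num_vertices-1."""
    label = list(range(num_vertices))   # label[v] = component containing v
    forest = []
    for w, u, v in sorted(edges):       # non-decreasing weight
        if label[u] != label[v]:        # F + e stays acyclic
            forest.append((w, u, v))
            old, new = label[u], label[v]
            for x in range(num_vertices):   # merge the two components
                if label[x] == old:
                    label[x] = new
    return forest

# Triangle with edge weights 1, 2, 3: the MST keeps the two lightest edges.
print(kruskal(3, [(1, 0, 1), (2, 1, 2), (3, 0, 2)]))
```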

3.1.2 Proof of Correctness

Given any unconnected forest F, let S(F) be the set of spanning trees that include F.

Lemma 6 For any unconnected forest F, there is a minimum cost spanning tree T ∈ S(F) that includes
the edge e, where e is the minimum weight edge that (a) is not in F, and (b) does not cause a cycle with F.

Proof: Assume that the claim is not true. Then there must exist some T′ ∈ S(F) such
that e ∉ T′, and the cost of T′ is less than the cost of any tree in S(F) containing e. If we add e to T′, this must
cause a cycle (see Figure 3.1). Furthermore, since e did not cause a cycle with F, there must be some edge
e′ in this cycle that was not in the original forest F. Furthermore, since e′ is not in F and does not
cause a cycle with F, it must be the case that the weight of e′ is at least the weight of e.
Note that if we add e to T′ and remove e′ from T′ we obtain a new tree, which we refer to as T′ + e − e′.
We see that the weight of T′ + e − e′ is at most the weight of T′. This contradicts our original assumption,
and thus the lemma is true.
We are now ready to show that Kruskal's algorithm finds the minimum spanning tree. We actually prove
that every forest F produced during the algorithm can be extended to a minimum spanning tree. This
implies that the algorithm works correctly, since the last forest must be a spanning tree, and thus it is also
a minimum spanning tree.

Theorem 7 Every forest F produced by Kruskal's algorithm is such that S(F) contains a minimum spanning
tree.

Proof: We prove this by induction on the size of F :

1. Base Case: F is empty. In this case S(F ) contains all spanning trees, and thus it also contains the
minimum spanning tree.
Figure 3.1: Depiction of forest F in S(F).

2. Inductive Step: Assume that before adding an edge e, the set F is such that S(F) contains a
minimum spanning tree. An edge e added by Kruskal's algorithm must be the minimum weight edge
that is not in F and that does not cause a cycle with F. Lemma 6 tells us that after this edge is
added to F, S(F) must still contain a minimum spanning tree.

3.2 Subset Systems

We next examine how to generalize our results for the minimum spanning tree to a broad class of optimization
problems.

Definition 8 A subset system S = (E, I) is a finite set E with a collection I of subsets of E such that:

if i ∈ I and i′ ⊆ i, then i′ ∈ I.

The above property of I is commonly stated as "I is closed under inclusion." We refer to the elements of I
as independent sets. The term "independent sets" is perhaps confusing: it does not mean that the subsets in
I are disjoint; rather, it comes from one of the applications of subset systems, which we shall examine later.
The right way to think of the collection I is as a description of which subsets of E are valid. The subsets in
the collection I are valid, and all other subsets are not valid.
Example 1:
E = {e1 , e2 , e3 }
I = {{e1 , e2 }, {e2 , e3 }, {e1 }, {e2 }, {e3 }, {}}

It is easy to verify that this I is closed under inclusion.


Example 2:
E : edges of graph
I : acyclic subsets of edges
Since it is not possible to introduce a cycle by removing edges, any subset of an acyclic subset of edges will
remain acyclic and hence be part of I. Thus, this I is closed under inclusion.
Example 3:

E : edges of graph
I : subsets of edges such that no two edges share a vertex.

We call such subsets of the edges a matching. Since it is not possible to increase the degree of a vertex by
removing edges, any subset of a matching will still be a matching. Thus, this I is closed under inclusion.

3.2.1 The generic optimization problem for subset systems

For any subset system, we can define an optimization problem for that subset system as follows:

Input: (E, I), and a weight function w : E → R⁺.

Output: i ∈ I such that w(i) is maximum, where w(i) = Σ_{e∈i} w(e).

For Example 2, the resulting optimization problem is called the Maximum Weight Forest (MWF) problem,
and is very similar to the MST problem. For Example 3, the resulting optimization problem is called the
Maximum Weight Matching problem.

3.2.2 Greedy Algorithm

We can now define the greedy algorithm for the generic optimization problem on the subset system (E, I)
as follows:

i = ∅
Sort the elements of E by non-increasing weight
For each e ∈ E
    if i + e ∈ I then i = i + e
return i

Note that this is simply a generalization of Kruskal's MST algorithm. This is a very simple, and usually very
fast, algorithm. It turns out that for some subset systems, it produces the optimal output, but for others, it
does not. For which of our example subset systems does the greedy algorithm work?
Example 1:
It does work. To give one specific example (which does not prove that it works in general) assume that

w(e1 ) = 3
w(e2 ) = 1
w(e3 ) = 4

The greedy algorithm will first select e3 and, because {e1, e3} ∉ I, returns {e2, e3}, the optimal solution.
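The generic greedy algorithm, run on Example 1 with these weights, can be sketched as follows (a minimal illustration; representing the oracle for I as an explicit list of sets is our choice):

```python
def greedy(E, weight, independent):
    """Generic greedy algorithm for a subset system (E, I): consider the
    elements in non-increasing weight order, keeping each one whenever the
    result is still an independent set. independent is an oracle for I."""
    i = set()
    for e in sorted(E, key=weight, reverse=True):
        if independent(i | {e}):
            i = i | {e}
    return i

# Example 1 with w(e1) = 3, w(e2) = 1, w(e3) = 4.
I = [set(), {'e1'}, {'e2'}, {'e3'}, {'e1', 'e2'}, {'e2', 'e3'}]
w = {'e1': 3, 'e2': 1, 'e3': 4}
result = greedy({'e1', 'e2', 'e3'}, w.get, lambda s: s in I)
print(sorted(result))  # ['e2', 'e3'], the optimal solution
```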
Example 2: Maximum Weight Forest (MWF)
Figure 3.2: The matching on the left depicts the greedy solution (weight = 6). The matching on the right
depicts a maximum matching (weight = 7).

It does work. To see this, note that the MWF and the MST problem are the same, except that (a) MST
is looking for the minimum weight and MWF is looking for the maximum weight, and (b) MWF does not
require the resulting forest to be a tree. However, if we have a connected graph, a maximum weight forest
will always be a tree - otherwise we could always add an edge. Thus, we can convert a MWF problem to a
MST problem by defining a new weight function over E. Specifically,

w′(e) = max_{e′∈E} w(e′) − w(e)

where w : E → R⁺ is the weight function for MWF. We see that the greedy algorithm for the optimization
problem for Example 2 produces exactly the same result as Kruskal's algorithm would return on the converted
instance of the MST problem. Since we already showed that Kruskal's algorithm returns the correct answer
for the MST problem, it follows that the greedy algorithm always returns the correct answer for the MWF
problem.
Example 3:
In this case, the greedy algorithm does not work. Consider the example depicted in Figure 3.2.

3.3 Matroids

Definition 9 A matroid is a subset system M = (E, I) that satisfies the exchange property:
if i, i′ ∈ I are such that |i| < |i′|, then there is some e ∈ i′ − i such that i + e ∈ I.

Example 1:
This can be seen to satisfy the exchange property by inspection. For example, if we have i = {e1 } and
i0 = {e2 , e3 }, then e = e2 could be added to i.
Example 2: Maximum Weight Forest (MWF)
We shall demonstrate that this subset system is a matroid in the next lecture.
Example 3: Maximum Weight Matching
Although the optimal matching in Figure 3.2 contains more edges than the greedy matching, none of these
edges can be added to the greedy solution and still yield a valid matching. Thus, the exchange property
fails, and this is not a matroid.
It turns out that matroids give us a technique for determining when the greedy algorithm will work, and
when it will not work. In particular, in the next lecture we shall prove the following Theorem:
Theorem 10 For any subset system (E, I), the greedy algorithm solves the optimization problem for (E, I)
if and only if (E, I) is a matroid.

Before proceeding with the proof of this Theorem, we clarify the difference between the terms maximal and
maximum. In particular:

Definition 11 A maximal solution is one to which nothing can be added and still yield a valid subset.

Note that this is a local property. Also note that the greedy algorithm always outputs maximal solutions.

Definition 12 A maximum solution has the property that no other solution is better.

Note that this is a global property.


Proof: (Of Theorem) We need to prove this in both directions:
(⇒): If (E, I) is a matroid, then the greedy algorithm produces the optimal solution.
To show a contradiction, let's assume (E, I) is a matroid, but the greedy algorithm does not find the optimal
solution. Since the greedy algorithm does not find the optimal solution, there must be the following two
solutions. Let the greedy solution be:

i = {e1, e2, . . . , ek}

and the optimal solution be:

j = {e′1, e′2, . . . , e′k′}

such that w(j) > w(i).
We first show that k = k′; that is, both i and j have the same number of elements. We know that i and j
must be maximal: otherwise j would not be optimal, and the greedy algorithm would not have halted with i.
Additionally, since (E, I) is a matroid, I must obey the exchange property. If i and j were of different sizes
we could take an element from the larger one and add it to the smaller one. But, since we are not able to
make either i or j larger, they must have the same size. So |i| = |j|, and thus k = k′.
Now assume that our numbering of the elements in i and j is in non-increasing order, as follows:

w(e1) ≥ w(e2) ≥ . . . ≥ w(ek)
w(e′1) ≥ w(e′2) ≥ . . . ≥ w(e′k)

Since the two sets have the same size and w(j) > w(i), there must be some element in j that is larger than
its corresponding element in i. Let s be the smallest number such that w(e′s) > w(es). Consider the subset
of the greedy solution

α = {e1, e2, . . . , e_{s−1}}

and the subset of the optimal solution

β = {e′1, e′2, . . . , e′s}.

Because we have a matroid, and because the above sets have different sizes (β is larger than α), we can use
the exchange property as follows:

∃ e′t ∈ β − α such that α + e′t ∈ I.

Because the elements are ordered by non-increasing weight: w(e′t) ≥ w(e′s) > w(es). More simply, w(e′t) > w(es).
So e′t should have been added to i by the greedy algorithm instead of es, but was not. This is a contradiction.
Figure 3.3: An example of maximal independent subsets i and i′.

(⇐): If the greedy algorithm gives the optimal solution, then (E, I) is a matroid.
We will prove the contrapositive of the above statement: if (E, I) is not a matroid, then for some
weight function, the greedy algorithm fails. Since (E, I) is not a matroid, ∃ i, i′ ∈ I such that |i| < |i′| but
there is no element e ∈ i′ − i such that i + e ∈ I. Exploiting this fact, we choose an explicitly failing weight
function. Let m = |i| and

w(e) =
    m + 2   if e ∈ i
    m + 1   if e ∈ i′ − i
    0       otherwise

The greedy algorithm would return i, where w(i) = m(m + 2) = m² + 2m. But we know that |i′| > |i|, so
|i′| ≥ m + 1, and every element of i′ has weight at least m + 1. So the optimal solution is at least
w(i′) ≥ (m + 1)(m + 1) = m² + 2m + 1. Thus, the greedy algorithm has not found the optimal solution.

3.3.1 Examples of Matroids

We first reexamine the matroid for the Maximum Weight Forest problem, which is also known as the graphic
matroid. Here, we have subset system (E, I) where:
E: Edges of a graph.
I: Acyclic subsets of edges.

We know from our previous proof of Kruskal's algorithm [FW02] that the greedy algorithm works on (E, I).
Since we do have a subset system, it follows from Theorem 10 that we have a matroid. However, we will
typically start with an optimization problem, and want to know whether or not a greedy algorithm works
for that problem. In order to decide this question, we first must determine whether or not this greedy
algorithm can be defined in terms of a subset system and our generic greedy algorithm for that subset
system. If so, we must then determine whether that subset system is a matroid. This can be done by
showing that the subset system obeys the exchange property. However, it is sometimes easier to show that
the subset system obeys another property, provided in the following theorem:

Theorem 13 (Cardinality Theorem) A subset system (E, I) is a matroid if and only if for every A ⊆ E,
if i, i′ ∈ I are maximal independent subsets of A, then |i| = |i′|.

For example, for the subset system for the Maximum Weight Forest problem, we might have the instantiations
of E, A, i, and i′ represented in Figure 3.3.
Proof: (⇒): If (E, I) is a matroid, and i, i′ are maximal independent subsets of A ⊆ E, then |i| = |i′|.
(Note that this is similar to the proof that k = k′ in the proof of Theorem 10.) Suppose |i| ≠ |i′|.
Then, because (E, I) is a matroid, by the exchange property we can add an element from the larger set to the
smaller set. For example, if |i′| > |i| then some element from i′ − i could be added to i. But i is supposed
to be maximal, a contradiction.
Figure 3.4: A set of connected components C_i in A. Let V_i be the set of vertices in C_i, and let E_i be the
set of edges of a maximal acyclic subset of C_i.

(⇐): We will prove the contrapositive of this: if (E, I) is not a matroid, then there is some A such
that there are two maximal independent subsets of A, i and i′, with |i| ≠ |i′|. Assume (E, I) is not a
matroid. This implies that the exchange property doesn't hold. Then ∃ i, i′ ∈ I such that |i| < |i′| but there
is no e ∈ i′ − i such that i + e ∈ I. Let A = i ∪ i′. Note that i is maximal in A, but i′ is not necessarily
maximal in A. However, ∃ i″ such that i′ ⊆ i″ and i″ is maximal in A. This makes |i″| ≥ |i′| > |i|, so |i″| ≠ |i|.
We then have two maximal independent subsets of A of different sizes.
Using the Cardinality Theorem, we can prove that the MWF subset system is a matroid.
Proof: Consider any A ⊆ E. Let C be the set of connected components of A, and let V be the set of vertices
in the graph. We show that any maximal acyclic subset of A has |V| − |C| edges. From this, the Cardinality
Theorem implies that this subset system is a matroid.
Number the connected components in C, and let V_i denote the set of vertices in connected component i. In
any given connected component i, any maximal acyclic subset of the edges of i has |V_i| − 1 edges.
If we sum over all connected components, we see that the total number of edges in any maximal acyclic
subset of A is |V| − |C|.
We next demonstrate that the following problem is also a matroid:

Input: A directed graph G = (V, E), and a weight function w : E → R⁺.

Output: The maximum weight subset of E such that no two edges point to the same node.

Define a subset system (E, I) as follows:

E: Edges of a graph.
I: Set of valid subsets of the edges.

We claim that (E, I) is a matroid.


Proof: Consider any A ⊆ E. The number of edges in any maximal independent subset of A is equal to the
number of vertices pointed to by edges in A. From this, the Cardinality Theorem gives us that (E, I) is a matroid.
We consider one final example: Linearly Independent Columns.

Input: An n × m matrix M (with entries from R), and a weight for each column.

Output: The maximum weight set of linearly independent columns.
Figure 3.5: Example of a bipartite graph.

Proof: From linear algebra, we know that all maximal linearly independent subsets of a set of vectors have
the same cardinality. Therefore, by the Cardinality Theorem, this is a matroid. (More details can be found
in [PS82].) Note that it does not matter what the weighting function is. This example is the source of the
term "independent" to refer to the valid subsets. It is also the origin of the term "matroid."

3.4 Bipartite Matching and the Intersection of Matroids

We have thus far examined how matroids are useful for defining polynomial time solutions via greedy algo-
rithms. Some problems, however, are not matroids, but can be expressed as the intersection of matroids.
This leads to a polynomial time algorithm for solving those problems. We first look at a specific example of
this, bipartite matching, and then conclude with a brief overview of how the idea extends to any subset
system that is the intersection of matroids.

3.4.1 Bipartite Matching

The bipartite matching problem is defined as follows:

Input: A bipartite graph B = (U, V, E), where U and V are disjoint sets of vertices, not necessarily of the
same size, and E is a set of edges such that all edges go between a vertex in U and a vertex in V .

Output: A matching (no two edges share a vertex) of maximum size.

Bipartite matching is useful, for example, in an assignment problem where we wish to maximize the
number of tasks completed by assigning individuals to tasks, subject to the following constraints:

1. Each person may be assigned to at most one task.

2. At most one person may be assigned to a task.

3. Not every person can do every task.

In this example U is the set of individuals, V is the set of tasks to be done, and E specifies who is able to
do what tasks. The goal is to find a matching, i.e. a legal assignment of tasks.
We have learned that a greedy algorithm does not necessarily return a maximum matching because, while
matchings are a subset system, they are not a matroid. However, by using two matroids we can find the
largest matching.
In particular, we want to break the problem down into two matroids whose intersection gives the solution. In
a matching, at most one edge touches any vertex. This constraint may be broken down into two requirements
by first considering U independently of V and then considering V independently of U. We define a matroid
(E, I) such that there is at most one edge per node of U for all i ∈ I. Similarly, we define a matroid (E, I′)
such that there is at most one edge per node of V for all i ∈ I′.
It is straightforward to prove that both (E, I) and (E, I′) are matroids. Furthermore, notice that valid
matchings are exactly the subsets of E in the collection I ∩ I′. Thus, the problem of finding the maximum
sized matching is equivalent to finding the maximum cardinality element of I ∩ I′.

Theorem 14 Let M = (E, I) and N = (E, I′) be any two matroids. The largest set in I ∩ I′ can be found
in time O(|E|³ · C(I, I′)), where C(I, I′) is the time to test whether i ∈ I or i ∈ I′.

While the running time given in this theorem appears large, it is polynomial, provided that testing whether i ∈ I
and whether i ∈ I′ are polynomial procedures. Furthermore, for specific problems (such as bipartite matching),
faster algorithms are known. Rather than proving the general theorem, we consider the bipartite matching
problem as a specific case.

3.4.2 Augmenting Paths

Augmenting paths are a key concept for finding a maximum matching. Let M be a matching of a bipartite
graph, B = (U, V, E).

Definition 15 A free vertex is a node not incident to any edge in M .

Definition 16 An augmenting path is a sequence of edges that begins and ends at a free vertex, alternating
between matching edges e ∈ M and non-matching edges e ∈ E − M.

See Figure 3.6. Augmenting paths allow us to build matchings one edge at a time (Section 3.4.4 generalizes
this to any intersection of matroids problem). We use augmenting paths to help us define an algorithm in
the next section for finding a maximum matching.

Definition 17 If P is an augmenting path for the matching M, the symmetric difference of M and P is
defined to be M ⊕ P := (M ∪ P) − (M ∩ P).

Similar in nature to XOR (exclusive or), the symmetric difference operator gives edges that are in M or P ,
but not both.
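Python's built-in set type happens to provide this operator directly, which gives a quick way to check the definition on a small hypothetical matching and augmenting path (the vertex names are ours):

```python
# Hypothetical matching and augmenting path on a small bipartite graph:
# vertices u3 and v3 are free; the path alternates non-matching and
# matching edges from u3 to v3.
M = {('u1', 'v1'), ('u2', 'v2')}                       # matching edges
P = {('u3', 'v1'), ('u1', 'v1'), ('u1', 'v2'),
     ('u2', 'v2'), ('u2', 'v3')}                       # augmenting path
# Python's ^ operator is exactly the symmetric difference (M u P) - (M n P).
flipped = M ^ P
print(sorted(flipped))  # three edges: one more than M
```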

3.4.3 Algorithm for Bipartite Matching

The following is an algorithm for finding a maximum bipartite matching:


Figure 3.6: Example of an augmenting path.

M ← ∅
while there exists an augmenting path P
    M ← M ⊕ P
return M

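A standard augmenting-path implementation for bipartite matching can be sketched as follows (an illustration, not the notes' own code; it uses a depth-first search from each free vertex of U, and re-matching along the path during the search plays the role of M ← M ⊕ P):

```python
def max_bipartite_matching(adj):
    """Maximum bipartite matching via augmenting paths. adj maps each
    vertex of U to the vertices of V it is connected to. For each free
    vertex of U we search (DFS) for an augmenting path; flipping the
    path is exactly the update M <- M xor P."""
    match_v = {}  # partner in U of each currently matched vertex of V

    def augment(u, visited):
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                # v is free, or its current partner can be re-matched elsewhere
                if v not in match_v or augment(match_v[v], visited):
                    match_v[v] = u
                    return True
        return False

    size = 0
    for u in adj:                  # try to augment from each vertex of U
        if augment(u, set()):
            size += 1
    return size, match_v

# Hypothetical instance: workers 0-2, tasks 'a' and 'b'.
size, matching = max_bipartite_matching({0: ['a', 'b'], 1: ['a'], 2: ['b']})
print(size)  # 2: e.g. worker 1 gets task a and worker 0 or 2 gets task b
```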
To prove the algorithm's correctness, we must show three properties: the symmetric difference maintains
the matching property at each iteration, the algorithm terminates, and the result is maximum.

Claim 18 If M is a matching and P is an augmenting path for M , then M P is also a matching.

Proof: The claim states that for all e1, e2 ∈ M ⊕ P , e1 and e2 do not share a vertex. Since M ⊕ P =
(M − P ) ∪ (P − M ), we consider the three cases of sets that e1 and e2 may belong to.

1. e1, e2 ∈ M − P . Since M is a matching, e1 and e2 cannot share a vertex.

2. e1, e2 ∈ P − M . If a vertex is free, there is at most one non-matching edge incident to it in the
augmenting path. Thus, e1 and e2 cannot share a free vertex. All other vertices on the path have two incident
path edges: one matching edge and one non-matching edge. Since e1 and e2 are both in the augmenting
path and thus non-matching, they cannot share a non-free vertex either.

3. e1 ∈ M − P , e2 ∈ P − M . If e2 is incident to a free vertex, it cannot share that vertex with any edge
e1 ∈ M , by the definition of a free vertex. Now assume e2 and e1 are incident to the same (non-free) vertex; then e1 would
necessarily be in both M and P . This contradicts the assumption e1 ∈ M − P ; e2 and e1 therefore
do not share a non-free vertex either.

No edge in M ⊕ P shares a vertex with any other edge, so M ⊕ P is a matching.
The assignment M ← M ⊕ P at each iteration thus maintains the matching property.

Claim 19 For any matching M and augmenting path P , |M ⊕ P | = |M | + 1.

Proof: The augmenting path P begins and ends at free vertices, and thus begins and ends with non-matching edges.
Edges along the path alternate between matching and non-matching, so the path has odd length
and contains one more non-matching edge than matching edge. As a result, the number of edges added to
the matching M is one larger than the number of edges removed from it.
Thus, the algorithm adds one edge to the matching at each iteration. Since the number of edges is finite,
the algorithm must terminate.

Lemma 20 If a matching M is not maximum, then there exists an augmenting path for M .

Proof: If M is not a maximum matching, then there exists a maximum matching M′ such that |M′| > |M |.
Let E′ be the edges in M ⊕ M′ and G′ = (U, V, E′). For every vertex v ∈ U ∪ V , at most two edges in E′ are
incident to v: at most one edge from M and at most one edge from M′. Consider
the connected components of G′: by virtue of the symmetric difference, paths must alternate between edges
of M and M′, and each component is either a simple path or a cycle. Any such cycle must have
even length, implying that the numbers of edges from M and M′ in the component are equal. However,
since |M′| > |M |, there must exist some alternating path with more edges from M′ than from M . Such a path begins and ends with M′-edges, so its endpoints are free with respect to M , and the path is an
augmenting path for M .
We've proven there is always an augmenting path for any non-maximum matching, so the algorithm will
continue until there is no augmenting path, at which point the matching must be maximum.
Claims 18 and 19 along with Lemma 20 prove the correctness of the algorithm for finding a maximum
bipartite matching.
While we now have an algorithm guaranteed to find a maximum bipartite matching, it relies on being able to
find augmenting paths. We might use generic Breadth First Search (BFS) starting at every free vertex in U
to find augmenting paths, but we must guarantee that paths alternate between matching and non-matching
edges.
To meet the alternation requirement, we modify the graph so that matching edges are directed from V to U
and non-matching edges are directed from U to V . Directing the graph guarantees that every path we find
is augmenting. Performing BFS from every free vertex in U on the resulting graph will find an augmenting
path if one exists.
A direct implementation of BFS for finding an augmenting path requires O(|U | · |E|) time, since we may need to
try every free vertex (and thus might require O(|U |) iterations of BFS). However, a more careful version
requires only O(|E|) time: if the algorithm finds that a free vertex cannot be reached from some vertex, that vertex
is marked and no additional searches are conducted from it. In this way each vertex is visited only
a number of times proportional to the number of edges incident to it.
The number of augmenting paths found is O(min(|U |, |V |)): one edge is added to the matching for every augmenting path, so
there are exactly as many augmenting paths as edges in the final maximum matching, and there
cannot be more edges in a matching than there are vertices on either side.
The total running time is therefore O(|E| · min(|U |, |V |)).
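The directed-BFS search described above can be sketched as follows. This is an illustrative implementation, not the notes' own code; the function and variable names (max_bipartite_matching, match_right, and so on) are ours.

```python
from collections import deque

def max_bipartite_matching(adj, n_left, n_right):
    """Maximum bipartite matching via augmenting paths.

    adj[u] lists the right-side neighbors of left vertex u.  Non-matching
    edges are followed from U to V and matching edges from V back to U,
    which guarantees every discovered path alternates.  Returns
    match_right, where match_right[v] is the left partner of v (or None).
    """
    match_left = [None] * n_left
    match_right = [None] * n_right

    def augment_from(root):
        # BFS in the implicitly directed graph described above.
        parent = {("L", root): None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if ("R", v) in parent:
                    continue
                parent[("R", v)] = ("L", u)
                if match_right[v] is None:      # free vertex: path found
                    node = ("R", v)             # flip the path: M <- M xor P
                    while node is not None:
                        _, vv = node
                        _, uu = parent[node]
                        match_right[vv], match_left[uu] = uu, vv
                        node = parent[parent[node]]
                    return True
                u2 = match_right[v]             # follow the matching edge back
                if ("L", u2) not in parent:
                    parent[("L", u2)] = ("R", v)
                    queue.append(u2)
        return False

    for u in range(n_left):                     # try each free vertex once
        if match_left[u] is None:
            augment_from(u)
    return match_right
```

The single pass over free vertices suffices because if no augmenting path exists from a vertex at some point, none exists after later augmentations.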
Figure 3.7: Example of a directed graph (input) and its maximum branching.

3.4.4 Extension to General Intersection of Matroids

The methods in Section 3.4.1 can be extended to any subset system that is an intersection of two matroids.
Given two matroids, (E, I) and (E, J), find an alternating sequence e1, e2, ..., e_odd, where odd is an odd integer
and each ei ∈ E. The criteria for the ei in the general algorithm, and the relation to the bipartite matching problem,
are as follows, where M ∈ I ∩ J:

General Algorithm                                        Bipartite Matching
M + e1 ∈ I and M + e1 ∉ J.                               Add an edge incident to a free vertex.
M + e1 − e2 ∈ I ∩ J.                                     Remove a matching edge.
M + e1 − e2 + e3 ∈ I and M + e1 − e2 + e3 ∉ J.           Add the next non-matching edge.
  ...                                                      ...
M + e1 − e2 + e3 − ... + e_odd ∈ I ∩ J.                  Add the last non-matching edge.
If such a sequence exists, we can find it using a version of BFS on an auxiliary graph (not defined here). The
result is that a largest-cardinality element of the intersection of two matroids can be found in
polynomial time. For details, see [PS98].
Branchings This general extension can be applied to a directed version of the minimum spanning tree
problem, called branchings.

Definition 21 A branching for a graph G = (V, E) is a set of edges E′ ⊆ E such that the corresponding
undirected graph is a tree, and all the edges in the graph point away from some vertex, called the root.

The problem of finding a branching is formalized as follows (see Figure 3.8).

Input: A directed graph G = (V, E) and a root vertex r V .


Output: A valid branching rooted at r (if one exists).

Exercise: Show that a set of edges is a branching if and only if the undirected edges form a tree and every
vertex has exactly one incoming edge, except the root, which has no incoming edges.
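The characterization in the exercise suggests a simple way to verify a candidate branching. The following checker is an illustrative sketch; the function name and representation are assumptions, not from the notes.

```python
def is_branching(n, edges, root):
    """Check the characterization from the exercise: every vertex except
    the root has exactly one incoming edge, the root has none, and
    following parent pointers from any vertex reaches the root (so the
    underlying undirected edges form a tree).  Vertices are 0..n-1 and
    edges is a list of directed (u, v) pairs."""
    indegree = [0] * n
    for _, v in edges:
        indegree[v] += 1
    if indegree[root] != 0:
        return False
    if any(indegree[v] != 1 for v in range(n) if v != root):
        return False
    # Each non-root vertex now has a unique parent; a component that
    # cannot reach the root must contain a cycle.
    parent = {v: u for u, v in edges}
    for start in range(n):
        v, seen = start, set()
        while v != root:
            if v in seen:
                return False        # cycle avoiding the root
            seen.add(v)
            v = parent[v]
    return True
```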

Although the greedy algorithm does not work for this problem (see Figure 3.8), the problem may be formulated
as the intersection of two matroids if we let

E ← Edges of the graph
Figure 3.8: An example of where the greedy algorithm does not work.

I ← Acyclic subsets of edges (undirected)

J ← Subsets with at most one incoming edge per vertex (zero at the root)

and search for a maximum-sized set E′ ⊆ E such that E′ ∈ I ∩ J. If |E′| = |V | − 1, then there is a branching.
We note that there is a faster algorithm, due to Tarjan, that runs in time O(|E| + |V | log |V |).

3.5 The Union-Find data structure

We next examine a useful data structure problem known as Union-Find. The motivation we see for this data
structure comes from matroids: specifically, it can be used to improve the performance of Kruskal's algorithm.
However, there are a number of other important uses for Union-Find.

3.5.1 Review of Kruskal's algorithm

Recall that Kruskal's algorithm is used to find the Minimum Weight Spanning Tree (MST) for an undirected
graph G. The algorithm proceeds as follows:
Given an undirected graph G with a set of vertices V and a set of weighted edges E, do the following:

sort the edges in E by increasing weight
F = ∅
for each edge e = (u, v) in E
    if F + e is acyclic
        F = F + e
return F

When we introduced Kruskal's algorithm, we described a naive data structure (based on an array with an
entry for each vertex) to allow us to perform the test of whether F + e is acyclic. Using this data structure,
we see that:

The initial step of Kruskal's algorithm, sorting the edges, takes O(|E| log |E|) time.

There are |E| tests for an acyclic graph, each of which takes O(1) time.

There are at most |V | − 1 additions of an edge, each of which takes O(|V |) time.


Thus, if we use the naive data structure to test for acyclic graphs, the total running time of Kruskal's
algorithm is O(|E| log |E| + |V |²). To understand this running time better, note that a connected graph will
have between |V | − 1 and (|V | choose 2) = Θ(|V |²) edges. For a dense graph, |E| is close to |V |²; for a sparse
graph, |E| is close to |V | − 1. We see that for sparse graphs the term |V |² can in fact be much larger than
the |E| log |E| term, and thus we focus on reducing the |V |² term. To do so, we use a more efficient data
structure to test whether or not adding an edge to the forest F will result in a cycle.
We require a data structure capable of performing the following operations:

Keep track of the connected components of the graph.


Merge the connected components, whenever an edge is added to F .

We study a data structure that can perform these operations, called the Data Structure for Disjoint Sets or
the Union-Find Data Structure. This data structure is not only useful for Kruskal's algorithm, but also for
other problems.

3.5.2 Union-Find Data Structure

The Union-Find data structure contains a number of disjoint sets. Each of these sets contains an element
designated as the label of the set. The Union-Find data structure also allows three operations on these
sets. One example of the Union-Find data structure would be

{a, b, c} with label a


{d, e, f } with label e

Make-Set(v) The Make-Set function creates a set containing just v, sets the label of the set to v, and adds
the resulting set {v} to the sets stored by the Union-Find data structure. Starting with the example
above, Make-Set(v) results in

{a, b, c} with label a

{d, e, f } with label e

{v} with label v

Union(u, v) The Union function takes the two disjoint sets containing u and v, respectively, replaces
them with the union of the two sets, and sets the label of this new set to an arbitrary element of the set.
Using the example above, Union(f, v) results in

{a, b, c} with label a


{d, e, f, v} with label e

Find(v) The Find function returns the label of the set containing v. Using the example above, Find(v) = e
and Find(a) = a.

3.5.3 Kruskal's algorithm using a Union-Find Data Structure

We next describe Kruskal's algorithm using a Union-Find data structure. Note that if the vertices u and v of an
edge e are in different connected components of a forest F , then we can add edge e to the forest F without
creating a cycle. Using a Union-Find data structure to represent the connected components of the forest
F , it is safe to add the edge e if Find(u) ≠ Find(v). The pseudo-code for Kruskal's algorithm with the
Union-Find data structure is:

sort the edges in E by increasing weight
for each vertex v in V
    Make-Set(v)
F = ∅
for each edge e = (u, v) in E
    if Find(u) ≠ Find(v)
        Union(u, v)
        F = F + e
    endif
return F
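The pseudo-code above can be sketched in Python together with a simple list-based Union-Find (a label per element, with the shorter member list appended to the longer, as discussed below). All names here are our own, not the notes'.

```python
def kruskal(n, weighted_edges):
    """Kruskal's algorithm, following the pseudo-code above.
    Vertices are 0..n-1; weighted_edges is a list of (weight, u, v)
    tuples.  Returns the list of MST edges."""
    label = list(range(n))             # Make-Set(v) for every vertex
    members = [[v] for v in range(n)]  # the elements of each set

    def find(v):                       # Find(v): the label of v's set
        return label[v]

    def union(u, v):                   # append the shorter list to the longer
        ru, rv = find(u), find(v)
        if len(members[ru]) < len(members[rv]):
            ru, rv = rv, ru
        for x in members[rv]:
            label[x] = ru              # relabel the appended elements
        members[ru].extend(members[rv])
        members[rv] = []

    forest = []
    for w, u, v in sorted(weighted_edges):
        if find(u) != find(v):         # adding (u, v) keeps F acyclic
            union(u, v)
            forest.append((w, u, v))
    return forest
```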

The running time of Kruskal's algorithm varies depending on which specific implementation of the Union-Find
data structure is used. In what follows, several data structures and their running times will be discussed.

3.5.4 Simple Implementation of Union-Find

In this implementation each set is represented as a linked list of nodes. Each node contains three data
elements: a name for the node, a pointer to the label of the set, and a pointer to the next element of the
linked list. At the end of the list the next pointer of the node is set to Null. Below is a depiction of the case
where we have the set {a, b} with label a, the set {c, d, e} with label c, and the set {f } with label f .

a → b        c → d → e        f

(Each node also stores a back pointer, not drawn here, to the label node at the head of its list.)

The running times of the different operations of Union-Find, with this implementation, are as follows:

Make-Set(v): O(1) because it simply involves the creation of a single node.

Find(v): O(1) because each node of the linked list has a back pointer to the label of the list. Note that
we assume that the application using the Union-Find data structure maintains a pointer to the node
containing any given element. Thus, we can always find the element itself in O(1) time.

Union(u,v): O(length of the appended list). When we perform a Union operation, we append one list to
the other. We need to find the end of the first list, but we can maintain a pointer to the end of each
list for this purpose. The next pointer of the last node of the first list will now point
to the beginning of the appended list. These two operations can be done in constant time. Next,
we have to modify the label pointers of each node in the appended list. The time required to do this
is Θ(length of the appended list).

To analyze the performance of this Union-Find data structure, we consider the case of n Make-Set operations
and m total operations. The reason we consider n separately is that it represents the size of the input.
Since there are initially n sets, and each Union operation reduces the number of disjoint sets by 1, the number
of Unions is at most n − 1. Each Union requires time at most O(n), and thus the total time required is
O(n² + m).
To see that the time can also be as much as Ω(n² + m), consider what happens when there are n Make-Set
operations, followed by n − 1 Union operations, followed by m − 2n Find operations. Also, assume that for each
Union, the longer list is always appended to the shorter list. The ith Union operation will update pointers
on i nodes. Therefore the total time required by all the Union operations is Σ_{i=1}^{n−1} i = Θ(n²). Thus, the
total time required is Θ(n² + m).
We can improve the performance of this strategy by always appending the shorter list to the longer one. This
requires each list to keep track of its length so that we can choose the shorter list in constant time.
Adding and maintaining this information does not affect the asymptotic running time of the operations.
Again, consider n Make-Set operations and m total operations. If the shorter set is appended to
the longer set, the size of the set containing each appended element at least doubles. Thus, whenever the back pointer
of a node is updated, the length of the list it is in has at least doubled. Since we have n total elements, the
size of the set resulting from a Union is at most n. Thus each node's back pointer can be updated a
maximum of log₂ n times. Since there are n elements, the worst-case running time is now O(n log n + m).
We next analyze the running time of Kruskal's algorithm using this data structure. As before, all the edges
in E are sorted by weight; this is unchanged from our previous analysis and takes O(|E| log |E|) time.
There are |V | Make-Sets, each of which takes Θ(1) time. There are |E| Finds, each of which takes Θ(1) time. There
are |V | − 1 Unions, which (from our earlier analysis) take a total of O(|V | log |V |) time. |E| is at least as large
as |V | − 1 because the graph is connected, and thus the total running time of Kruskal's algorithm with the
described simple implementation of a Union-Find data structure is O(|E| log |E|).
Although the Union-Find data structure allows us to solve the MST problem in O(|E| log |E|) time, it is not the
fastest approach. Prim's algorithm [CLRS, p. 570], utilizing Fibonacci heaps [CLRS, p. 476], can find the
Minimum Spanning Tree of a graph in O(|E| + |V | log |V |) time. Also, faster techniques are known for sparse
graphs; see [CLRS].

3.5.5 Faster Implementation of Union-Find

While further improvements to the running time of Union-Find will not further improve the running time
of Kruskal's algorithm, there are other applications of Union-Find, and thus we next explore more efficient
solutions to this problem.
We next consider representing each set as a rooted tree. Each node of the tree holds one element of the set
and a pointer to its parent. The element at the root of the tree is the label of the set, and the root's parent pointer
is self-referencing.

{a, b} with label a: a is the root, with child b.

{c, d, e} with label c: c is the root, with children d and e.

{f } with label f : f is the only node, and the root.

The operations of Union-Find are implemented as follows:

Make-Set This is simply the creation of a single node, which takes Θ(1) time.

Find We chain up the tree to the root to find the appropriate label. This takes Θ(depth of the node in
the tree) time.

Union To perform a Union of the elements u and v, we first perform a Find operation on both elements.
One of the root nodes of the two trees then has its parent pointer updated to point to the other tree's
root. This runs in O(time for the Find operations) + Θ(1) time. For example, using the sets above, the
union of f and e will produce the following rooted tree:

f is the root, with child c; c has children d and e.

The worst case for this data structure occurs if, for every Union, we append the larger tree to the
smaller tree. This could create a chain of n nodes. If this is followed by a sequence of m − 2n Find operations,
all referencing the node at the end of the chain, we have a running time of at least Ω(n · m). It is also easy
to show that it is always no worse than O(n · m). Thus, if we only use the rooted tree implementation, the
performance is actually worse than the performance of the linked-list implementation.
3.5.6 Improvements

The following improvements can be made to the way we use the rooted tree structure:

Union by size: Rather than arbitrarily choosing the new root, always add the smaller tree to the
larger tree. The worst-case running time of n Make-Set operations and m total operations is now
O(m log n). The proof is left as an exercise.

Path compression: Every time we perform a Find operation, we flatten the tree by adjusting all
pointers on the path to the root so that they point directly to the root. Since we are already
chaining up the path to the root during the Find operation, changing these pointers only slows the
Find operation by a constant factor.
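The two improvements can be combined in a small rooted-tree implementation. This is an illustrative sketch; the class and method names are our own.

```python
class UnionFind:
    """Rooted-tree Union-Find with both improvements described above:
    union by size and path compression."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def make_set(self, v):
        self.parent[v] = v      # the root's parent pointer is self-referencing
        self.size[v] = 1

    def find(self, v):
        root = v
        while self.parent[root] != root:   # chain up to the root
            root = self.parent[root]
        while self.parent[v] != root:      # path compression: repoint the path
            self.parent[v], v = root, self.parent[v]
        return root

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        if self.size[ru] < self.size[rv]:  # union by size: smaller under larger
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
```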

For example, consider the following rooted tree: a is the root, with children b and d; b has child c; d has
children e and f ; and f has child g.

If we perform a Find operation on the element g, then the rooted tree will be changed so that every node on
the path from g to the root points directly to a: a is the root, with children b, d, f , and g; b has child c;
d has child e. Note that g and f now directly point to a.
The running time when we use the union-by-size and path-compression improvements to the rooted tree
implementation is O(m · α(n)), where α(n) is the inverse of Ackermann's function. Ackermann's function
grows very quickly, and so its inverse grows very slowly. For example, α(n) ≤ 4 for all n ≤ BIG. To gain an
idea of just how large the number BIG is, we use towers of twos. For example:

2^2 = 4
2^(2^2) = 2^4 = 16
2^(2^(2^2)) = 2^16 = 64K
2^(2^(2^(2^2))) = 2^64K, a number with 19729 digits in base 10
...
a tower of twos of height 2048 = BIG

So for α(n) to be greater than 4, n must be greater than a tower of twos of height 2048.

3.5.7 Ackermann's Function

We define an infinite sequence of functions A_0, A_1, A_2, ..., such that:

A_0(x) = 1 + x
A_k(x) = A_{k−1}(A_{k−1}(... A_{k−1}(x) ...))   (A_{k−1} applied x times)

The first few terms in the sequence of functions can be calculated easily:

A_1(x) = (1 + (1 + (... (1 + x) ...))) = 2x       (adding 1, x times)
A_2(x) = (2 · (2 · (... (2 · x) ...))) = 2^x · x ≥ 2^x
A_3(x) ≥ a tower of twos of height x

Since 2^x ≤ 2^x · x, we write the value of A_3(x) as an inequality. As an example, we take the value x = 2 and calculate the
first few numbers in the sequence A_0(2), A_1(2), ...:

A_0(2) = 1 + 2 = 3
A_1(2) = A_0(A_0(2)) = A_0(3) = 4
A_2(2) = A_1(A_1(2)) = A_1(4) = 8
A_3(2) = A_2(A_2(2)) = A_2(8) = 2^8 · 8 = 2048
A_4(2) = A_3(A_3(2)) = A_3(2048) ≥ a tower of twos of height 2048

Ackermann's function takes k as input and is defined as A(k) = A_k(2). The inverse of Ackermann's function
is α(n), which is equal to the smallest k such that A(k) ≥ n.
The original proof that the running time of n Make-Set and m total operations is O(m · α(n)) can be found
in [T75]. For an easier-to-read presentation of the proof, see [K92] or [CLRS].
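The recursive definition of A_k translates directly into code. This sketch (the function name is ours) reproduces the values A_0(2) through A_3(2) computed above; A_4(2) is far too large to compute.

```python
def ackermann(k, x):
    """A_0(x) = 1 + x; for k >= 1, A_k(x) applies A_{k-1} to x, x times over."""
    if k == 0:
        return 1 + x
    result = x
    for _ in range(x):
        result = ackermann(k - 1, result)
    return result

# The sequence A_0(2), A_1(2), A_2(2), A_3(2) from the notes:
assert [ackermann(k, 2) for k in range(4)] == [3, 4, 8, 2048]
```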
Chapter 4

Dynamic Programming and Shortest Paths

We next consider a powerful algorithm design technique known as dynamic programming. We apply this
technique to the knapsack problem, as well as the problem of finding shortest paths in graphs. We then
consider other techniques for finding shortest paths.

4.1 The Knapsack Problem

We introduce dynamic programming by studying the knapsack problem. This problem is motivated by the
following scenario: a thief with a knapsack breaks into a house and has to choose from n items, where each
item i has weight wi and value vi . His goal is to take items worth as much as possible while not exceeding
W , the total weight that his knapsack can hold.
We here consider a simpler version of this problem, where the weight of each item equals its corresponding
value. That is, for all items i, w_i = v_i. The same techniques that are used for this simplified
version of the problem can also be used to solve the general problem.

Formal Problem Description


Input: n items, each with weight/value w_i; a capacity W .
[Assume the weights w_i and the capacity W are integers.]
Output: A subset S of items such that Σ_{i∈S} w_i ≤ W and Σ_{i∈S} w_i is maximized.

Example 1
Input: {6,5,5}, W = 10
To start with, we will concentrate on finding only the optimal total value (not the subset of items that yields
this total value). We then shall demonstrate that the technique we develop for finding the optimal value can
also be used to find the solution that achieves this optimal.
First, consider using the greedy approach. Note that this does not always return the best solution. In
Example 1, above, the greedy algorithm returns 6, but the optimal is 10.
Next, consider the divide and conquer approach. In order to do so, we must consider smaller sub-problems.

We do so with the following definition:

Definition 22 knap(i, j) = optimal solution obtained by using only items 1 to i with capacity j.

Example 1 (continued)
knap(2, 5) = 5
knap(2, 8) = 6
knap(1, 5) = 0
We could divide the problem in half (as we often do with divide and conquer problems) by considering
knap(n/2, W ). However, it is not clear what the other half should be, nor how to obtain the optimal solution to
the original (full-sized) problem by efficiently combining this solution with a solution to another subproblem.
Thus, we do not know how to divide the problem into two recursive subproblems of half the size. On the
other hand, we can divide it so that each recursive sub-problem has size one less than the problem that
creates it, as follows:

i ≥ 1 : knap(i, j) = max{knap(i − 1, j), knap(i − 1, j − w_i) + w_i}   if w_i ≤ j
        knap(i, j) = knap(i − 1, j)                                     if w_i > j
i = 0 : knap(i, j) = 0

This defines the solution to knap(i, j) in terms of subproblems that have one item less to consider. To see
where these equations come from, we explain the case where i ≥ 1. Consider the effect of item i. An optimal
solution either uses item i or it does not. If the optimal solution does not use item i, then knap(i − 1, j) is the
best we can do. Otherwise, the optimal solution does use item i. In this case, we need to allocate weight w_i
to item i and use the remaining capacity as efficiently as possible. Thus, we obtain the value of item i (which
is w_i) plus the optimal use of the remaining capacity (which is knap(i − 1, j − w_i)). Since one of the two
solutions must be optimal, we can simply take their maximum. Note that we can only use item i if w_i ≤ j.
Time Analysis. This algorithm leads to a correct answer, and thus we have a valid divide and conquer
approach. However, the running time is exponential. To see this, note that most recursive subproblems
produce two new recursive subproblems. Thus, we produce a binary tree of subproblems. The height of the
tree is n + 1, and thus the tree may have as many as 2n leaves (see Figure 4.1).
However, it turns out that this approach is somewhat wasteful in terms of computation. To see why, consider
the topology of the execution tree (see Figure 4.2).
Note that any node at level i must be of the form knap(i, j), where 0 ≤ j ≤ W . Thus, there can be only
W + 1 distinct nodes at level i, leading to a total of (n + 1)(W + 1) distinct nodes in the entire tree.
Therefore, there may be some duplication of nodes in the tree. In particular, when W is much smaller than
2^n, there will be considerable duplication of nodes in the tree.

4.1.1 The Dynamic Programming Approach

Realizing that the tree structure is wasteful, we improve our approach by using a table to keep track of which
subproblems we have solved. This table will have an entry knap(i, j) for each i between 0 and n and each j
between 0 and W . This table ensures that we only solve each sub-problem once.
We begin by filling in the first row of the table with all zeros, recalling that knap(0, j) = 0. For each entry
knap(i, j), we consider two entries from the previous row: knap(i 1, j wi ) and knap(i 1, j). We then
take the maximum of knap(i 1, j wi ) + wi and knap(i 1, j). Since each row of the table can be filled in
using only data from the row above, we can fill in the table row by row. Our final output is the entry stored
in the bottom right entry of the table, knap(n, W ). See Figure 4.3.
Figure 4.1: Tree of height n + 1 generated by the divide and conquer algorithm; knap(n, W ) spawns knap(n − 1, W ) and knap(n − 1, W − w_n).

Figure 4.2: Wasteful tree structure; the nodes at level i are knap(i, j) for 0 ≤ j ≤ W .


Figure 4.3: Table representation of the knapsack problem. Each entry is knap(i, j) = max(knap(i − 1, j − w_i) + w_i, knap(i − 1, j)), corresponding to including or not including item i; retracing these choices along the solution path gives the items that were picked.


We see that the total running time is Θ(n · W ), since computing each table entry requires Θ(1) time and
the table contains Θ(n · W ) entries. As far as space goes, we only need to keep track of two rows at any
time, and each row has W + 1 entries. Therefore, we use Θ(W ) space.
Note that we have only been computing the optimal total weight of the knapsack and not the subset of
items. It is easy to modify our algorithm to return the subset as well. For each entry knap(i, j), we use an
additional bit to record whether we took the left or the right value from the row above. If we used the left value,
item i was included in the set; otherwise, item i was not included. Using these bits, we can retrace our way
up the table to obtain the subset of items taken. The time required for this modification is asymptotically
the same. However, since this modification must store the one bit for each entry of the table until the end
of the algorithm, the space requirement becomes Θ(n · W ).
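The row-by-row table computation, together with the retracing step, can be sketched as follows. The function name and representation are ours, not the notes'.

```python
def knapsack(weights, W):
    """Dynamic program for the simplified knapsack (value = weight).
    Returns the optimal total and one optimal subset of item indices,
    recovered by retracing the table as described above."""
    n = len(weights)
    # knap[i][j] = best total using items 1..i with capacity j
    knap = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wi = weights[i - 1]
        for j in range(W + 1):
            knap[i][j] = knap[i - 1][j]              # skip item i
            if wi <= j:                              # or include it
                knap[i][j] = max(knap[i][j], knap[i - 1][j - wi] + wi)
    # Retrace the choices to recover the chosen subset.
    chosen, j = [], W
    for i in range(n, 0, -1):
        if knap[i][j] != knap[i - 1][j]:             # item i was included
            chosen.append(i - 1)
            j -= weights[i - 1]
    return knap[n][W], chosen
```

On Example 1 (weights {6, 5, 5}, W = 10), this returns the optimal total 10 by picking the two items of weight 5.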

NP-Completeness

The Knapsack Problem is known to be NP-Complete. In fact, even the simplified variant of the problem we
describe here is NP-Complete. However, it seems that we have a polynomial time solution to the problem.
How can this be? It turns out that we have not really defined a polynomial time algorithm for an NP-
Complete problem. The reason is that polynomial time is defined in terms of the input size. However, the
input size is not W , as it only takes log₂ W bits to express the number W (or item sizes up to W ). In
the worst case, then, our running time could be exponential in the size of the input (W = 2^{log₂ W}). We shall
discuss this more when we study NP-Completeness.

4.2 When To Use Dynamic Programming

If a problem has the following properties, it is worth trying to use dynamic programming:

optimal substructure: The solution to the problem can be solved using solutions to smaller sub-
problems. This is a property dynamic programming has in common with divide and conquer.
overlap of sub-problems: By taking advantage of the fact that many identical sub-problems are created,
a dynamic programming algorithm may be a more efficient solution than what would be achieved using
divide and conquer.

Typically, a dynamic programming approach uses a table representation to ensure that sub-problems are
solved exactly once.

4.3 Shortest Path Problems

There are several variants on the shortest path problem. One such variant, the all-pairs shortest path
problem, can be solved using dynamic programming (the Floyd-Warshall Algorithm), which is why we study
the shortest paths problems together with dynamic programming.

4.3.1 Variants of shortest paths

All variants of the shortest-paths problem have in common a weighted, directed graph G = (V, E), with
weight function w : E → ℝ, as input.

Definition 23 Let p = (v_1, v_2, ..., v_k) be a path. Then w(p) = Σ_{i=1}^{k−1} w(v_i, v_{i+1}).

Definition 24 δ(u, v) = min{w(p) : p is a path from u to v} if there is a path from u to v, and δ(u, v) = ∞
otherwise.
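Definition 23 is just a sum over the path's consecutive pairs; a tiny sketch, with the weight function stored in a hypothetical dictionary w:

```python
def path_weight(w, p):
    """w(p) = sum of w(v_i, v_{i+1}) over consecutive vertices of path p."""
    return sum(w[(p[i], p[i + 1])] for i in range(len(p) - 1))
```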

Single Source:
Input: Source vertex s ∈ V
Output: ∀v ∈ V , δ(s, v)

Single Destination:
Input: Destination vertex t ∈ V
Output: ∀u ∈ V , δ(u, t)
Algorithm: Solve the single-source problem with t as the source and the edges of G reversed.

Single Pair:
Input: Source vertex u ∈ V , destination vertex v ∈ V
Output: δ(u, v)
Algorithm: Solve the single-source problem with u as the source.
An asymptotically faster algorithm is not known.

All Pairs:
Input: No additional input
Output: ∀u, v ∈ V , δ(u, v)
Algorithm: One option is to run the single-source algorithm once for each vertex as the source.
A faster solution is the Floyd-Warshall Algorithm (to be described below).

4.3.2 Breadth First Search

If w(e) = 1 for all e ∈ E, then the single-source problem can be solved efficiently using breadth-first search. The
graph G is searched in layers, starting from the source s. Then δ(s, v) is equal to the layer at which v is
discovered. (See Figure 4.4.) The running time of breadth-first search is O(|V | + |E|).
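A sketch of this layered search (names ours):

```python
from collections import deque

def bfs_distances(adj, s):
    """Unit-weight single-source shortest paths by breadth-first search.
    dist[v] is the layer at which v is discovered; vertices that are
    never discovered are absent from the returned dictionary."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first discovery = shortest layer
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```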

4.4 Floyd-Warshall Algorithm

This algorithm is based on dynamic programming, and thus we define smaller subproblems. We do this
using the following concept:

Definition 25 d_ij^(k) = the length of the shortest path from i to j for which all intermediate vertices are in the set
{v_1, v_2, ..., v_k}.

Note that d_ij^(k) can be larger than the shortest path distance from i to j. An example of this is given in Figure
4.5. Also note that d_ij^(n) = δ(i, j), since d_ij^(n) allows any intermediate vertices to be used. We see that this
definition allows for optimal substructure by using the following formula to compute d_ij^(k) recursively:
Figure 4.4: The graph is discovered layer by layer.

Figure 4.5: An example of d_ij^(k). Note that d_14^(1) = 5, d_14^(2) = 4, and d_14^(3) = 3.
Figure 4.6: v_k will either factor into d_ij^(k), or not.
Figure 4.7: Three-dimensional table D used to calculate all δ(u, v); the top level D^(0) holds the input weights.

d_ij^(k) = min{d_ij^(k−1), d_ik^(k−1) + d_kj^(k−1)}   for k ≥ 1
d_ij^(0) = w_ij

where

w_ij = w((i, j)) if (i, j) ∈ E, and w_ij = ∞ if (i, j) ∉ E.

To see why this formula is correct, note that the shortest path from vertex i to vertex j using {v_1, ..., v_k} is either
the shortest path from vertex i to vertex j without using v_k, or the shortest path from vertex i to vertex
v_k followed by the shortest path from vertex v_k to vertex j, each using only {v_1, ..., v_{k−1}} (see Figure 4.6). This formula takes the shorter of
these two quantities.
To use this formula, we construct a three-dimensional table D whose top layer D^(0) is initialized to (w_ij);
see Figure 4.7. From the discussion above, we see that D^(k) (the kth layer of the table) can be calculated
using D^(k−1). Thus, we can compute the values in the table one layer at a time, until we reach D^(|V|). Note
that this will be the final solution to the problem. In other words, we iterate through values of k, i, and j
until D^(|V|) = (δ(i, j)) is calculated. This algorithm can be described by the following pseudo-code:

1. D^(0) = (w_ij)
2. for k = 1 to |V |
3.     for i = 1 to |V |
4.         for j = 1 to |V |
5.             D_ij^(k) = min(D_ij^(k−1), D_ik^(k−1) + D_kj^(k−1))
6. return D^(|V|)

It is easy to see that the running time of the Floyd-Warshall Algorithm is Θ(|V |³).
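The pseudo-code translates directly into Python. This sketch (names ours) overwrites a single matrix in place, which is safe because d_ik^(k) = d_ik^(k−1) and d_kj^(k) = d_kj^(k−1); it also initializes the diagonal to 0, the standard convention.

```python
INF = float("inf")

def floyd_warshall(n, w):
    """Floyd-Warshall over vertices 0..n-1, where w maps directed edges
    (i, j) to weights.  Returns the matrix of shortest-path distances."""
    # D^(0): edge weights, infinity for non-edges, 0 on the diagonal.
    d = [[0 if i == j else w.get((i, j), INF) for j in range(n)]
         for i in range(n)]
    for k in range(n):          # allow v_k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

On the graph of Figure 4.5 this computes the distance 3 from v_1 to v_4.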
4.5 Dijkstra's Algorithm

Dijkstra's algorithm solves a special case of the single-source shortest-paths problem, where all of the edge
weights are non-negative.

Figure 4.8: Dijkstra's Algorithm.

In a manner reminiscent of the Floyd-Warshall algorithm of the previous section, Dijkstra's algorithm maintains
a set R of processed vertices, as well as a vector d, where d[v] is the minimum path distance from s
to v using only intermediate vertices in R. Vertices will be added to R one at a time, until they are all in that set.
When R = V , the vector d will contain the output.
A difference between Dijkstra's algorithm and the Floyd-Warshall algorithm is that in the Floyd-Warshall
algorithm, we added the vertices to the processed set in an arbitrary order. We could do that for the single-
source shortest-paths problem as well, but it turns out that if we do so, updating d[v] is not
particularly efficient. Instead, in Dijkstra's algorithm, the vertex not in R that has the smallest value of d[v]
is always chosen to be added to R. This makes the process of updating d[v] considerably faster. Dijkstra's
algorithm proceeds as follows:

Pseudocode for Dijkstra's Algorithm

Input: Directed graph G = (V, E), a weight function w : E → R, a source vertex s.
Output: ∀v ∈ V, δ(s, v).

1. d[s] = 0
2. R = {s}, Q = V − {s}
3. ∀v ≠ s, d[v] = w(s, v)
4. while |Q| ≥ 1
5.   u = the v ∈ Q that minimizes d[v]
6.   Q = Q − {u}, R = R ∪ {u}
7.   for each v ∈ Adj[u], d[v] = min(d[u] + w(u, v), d[v])
     (Note: Adj[u] denotes the set of vertices adjacent to u.)
8. return d
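The pseudocode above can be sketched as follows (a hypothetical adjacency-list input format; instead of a decrease-key operation, this version pushes duplicate heap entries and skips stale ones, which also achieves the O(|E| log |V|) bound with a binary heap):

```python
import heapq

def dijkstra(adj, s):
    """adj: dict mapping each vertex to a list of (neighbor, weight) pairs,
    with non-negative weights. Returns a dict of shortest distances from s."""
    d = {v: float('inf') for v in adj}
    d[s] = 0
    pq = [(0, s)]                     # binary heap playing the role of Q
    done = set()                      # the set R of processed vertices
    while pq:
        du, u = heapq.heappop(pq)     # extract-min (Step 5)
        if u in done:
            continue                  # stale entry: lazy deletion
        done.add(u)
        for v, w in adj[u]:           # update step (Step 7)
            if du + w < d[v]:
                d[v] = du + w
                heapq.heappush(pq, (d[v], v))
    return d
```

Lazy deletion trades a slightly larger heap for not needing a decrease-key operation, which Python's heapq does not provide.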

Step 7 of this algorithm updates the vector d when a new vertex is added to R. Note that this step is quite efficient, perhaps more so than we would expect. In fact, it is not immediately obvious that the operations performed in Step 7 are sufficient. The key (and we shall prove this) is that the following invariant holds throughout the execution of the algorithm: ∀v ∈ R, d[v] = δ(s, v). This implies that vertices already in R do not need to be updated when we add a new vertex u to R. Thus, the only way that d[v] can change for a vertex v when u is added to R is if the shortest path to v that only uses intermediate vertices in R has u as its last vertex in R.

Running time of Dijkstra's Algorithm

Steps 1 and 2 take constant time. Step 3 takes time O(|V|). The while loop spanning Steps 4 to 7 dominates the running time of this algorithm.
Throughout the algorithm, the extract-min step, Step 5, is executed O(|V|) times, once for each vertex. Step 7 is executed O(|E|) times, once for each edge. The total running time of Dijkstra's algorithm depends on the data structure that is used to maintain the distance vector.

By keeping d[v] in a binary heap, both the extract-min and the update take time O(log |V|). This makes the total running time when using a binary heap O(|E| log |V|).
By keeping d[v] in a Fibonacci heap, the extract-min still takes time O(log |V|), but updates become constant time, O(1). This makes the total running time when using a Fibonacci heap O(|E| + |V| log |V|).

The catch with achieving these running times is that Dijkstra's algorithm assumes that all edge-weights are non-negative. Note that there is an alternative technique, called the Bellman-Ford algorithm, that allows for negative edge-weights. However, the runtime of that algorithm is O(|V||E|).

4.5.1 Proof of Correctness for Dijkstra's algorithm

To prove the correctness of Dijkstra's algorithm we need to prove the following theorem:

Theorem 26 If ∀e ∈ E, w(e) ≥ 0, then at the end of Dijkstra's algorithm, ∀v ∈ V, d[v] = δ(s, v).

The proof of this theorem consists of two lemmas:

Lemma 27 Throughout the algorithm, ∀v, d[v] ≥ δ(s, v).

Let d_u[v] denote d[v] at the step of the algorithm where u is chosen as the minimum.

Lemma 28 ∀u, d_u[u] = δ(s, u).

These two lemmas together prove the theorem as follows: After the initialization, the only time that d[v] can change is at an update step, and thus it can only decrease. Each vertex v is placed in R at some point in the algorithm, and at that point, d[v] = δ(s, v), by Lemma 28. Since d[v] can only decrease but cannot decrease below its minimum (by Lemma 27), once it reaches the value δ(s, v), it will stay unchanged throughout the rest of the algorithm. So, at the end of the algorithm, d[v] = δ(s, v).

Figure 4.9: Depiction of vertices and path for the proof of Lemma 28. P: shortest path from source s to u; y: first vertex on P that is not in R; x: predecessor of y on P.
Proof: (Of Lemma 27) By contradiction. Let v be the first vertex such that at some point in the algorithm, d[v] < δ(s, v). The only step that could have changed d[v] to such a value is an update step of the form d[v] = d[u] + w(u, v). Since we assume that d[v] < δ(s, v), and δ(s, v) ≤ δ(s, u) + w(u, v), it must be that d[u] < δ(s, u) at that point. But the only way that d[u] could have gotten such a value was in an earlier step of the algorithm, which contradicts the assumption that v was the first such vertex.
Proof: (Of Lemma 28) We also prove this by contradiction. Let u be the first vertex placed in R such that d_u[u] ≠ δ(s, u). Define P, x, and y as shown in Figure 4.9. Note that s may be the same as x and/or u may be the same as y, but it is not possible that x is the same as y.

Claim 29 d_u[y] = δ(s, y).

Proof: Since we assume that u is the first vertex to get the wrong value, the vertices placed in R before u, including the vertex x, must have satisfied the invariant. Hence, we have d_x[x] = δ(s, x). After x is placed in R and after all of the corresponding updates are done, d[y] ≤ d_x[x] + w(x, y). Since P is a shortest path, a shortest path to y must pass through x. Thus, it must be the case that d_u[y] = δ(s, y).
Since y lies on the shortest path to u and there are no negative weights, δ(s, y) ≤ δ(s, u). Thus, d_u[y] = δ(s, y) ≤ δ(s, u) ≤ d_u[u], where the last inequality is Lemma 27. But d_u[u] ≤ d_u[y], because we picked u, not y, as the vertex with the minimum value of d[v] over v ∈ Q. For both of these to be true, d_u[u] = d_u[y] = δ(s, u). This contradicts the assumption that d_u[u] ≠ δ(s, u).

4.6 Seidel's Algorithm

The Floyd-Warshall algorithm demonstrates that we can use dynamic programming to solve the all-pairs shortest-paths problem in time O(|V|^3). Now we are going to give a faster algorithm, but one that only works for a special case: a connected, undirected and unweighted graph. This algorithm is called Seidel's Algorithm [S92], and it uses matrix multiplication to compute the all-pairs shortest paths.

4.6.1 Warm-up: A problem that uses matrix multiplication

We start by showing, via an easier problem, that matrix multiplication can in fact be useful for shortest-path computation. This problem will also be used as a component of Seidel's shortest-path algorithm.
Figure 4.10: A graph G

Figure 4.11: The graph G^2, showing both the original edges of G and the augmented edges

Input: A graph G = (V, E).

Output: The graph G^2, defined to be G augmented by edges (i, j) for every pair of vertices i and j such that the graph G has a path of length 2 from i to j.

Given an adjacency matrix for the graph G, our objective is to compute the adjacency matrix for the graph G^2. An example is provided in Figures 4.10 and 4.11. One way to do this would be to compute the all-pairs shortest paths on G and add an edge between vertices i and j if the shortest path between them has length 2. This would take O(|V|^3) time. A second algorithm is to look at all possible pairs of vertices (i, j) and, for every pair, look for an intermediate vertex k. Since there are O(|V|^2) possible pairs and O(|V|) intermediate vertices to check for each pair, the total running time of this algorithm is again O(|V|^3).
It turns out that this running time can be improved by a technique using matrix multiplication. Define M[G] as the adjacency matrix of graph G. For the graph depicted in Figure 4.10, this looks like:

         0 1 0 0 0
         1 0 1 1 0
M[G] =   0 1 0 0 0
         0 1 0 0 1
         0 0 0 1 0

Consider what we need to do for a vertex-pair (i, j): we need to check all intermediate vertices k to see if both (i, k) and (k, j) are equal to 1, which means that there is an edge between i and k and an edge between k and j. This is closely related to the value computed for entry (i, j) of the matrix M[G]^2 = M[G] · M[G]. In particular, note that M[G]^2(i, j) = Σ_{k=1}^{|V|} M[G](i, k) · M[G](k, j). The product M[G](i, k) · M[G](k, j) is equal to 1 only when both factors are 1. Thus, M[G]^2(i, j) is the number of paths of length exactly 2 from i to j. This gives us the following algorithm for computing M[G^2] given M[G]:

1. Compute M[G]^2 = M[G] · M[G]

2. Set M[G^2](i, j) = 1 if i ≠ j and (M[G](i, j) = 1 or M[G]^2(i, j) > 0); 0 otherwise

Let μ(n) be the time it takes to compute an n × n matrix multiplication. The total running time is O(μ(|V|) + |V|^2) = O(μ(|V|)), since μ(n) is Ω(n^2). Therefore, the running time for computing G^2 with this algorithm is asymptotically as fast as the fastest matrix multiplication algorithm.
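The two steps above can be sketched concretely as follows (plain cubic multiplication is used for clarity; substituting a fast matrix-multiplication routine gives the O(μ(|V|)) bound):

```python
def square_graph(M):
    """Given the 0/1 adjacency matrix M of an undirected graph
    (a list of lists), return the adjacency matrix of G^2."""
    n = len(M)
    # Step 1: M2[i][j] = number of paths of length exactly 2 from i to j
    M2 = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # Step 2: edge in G^2 iff an original edge or a length-2 path exists
    return [[1 if i != j and (M[i][j] == 1 or M2[i][j] > 0) else 0
             for j in range(n)] for i in range(n)]
```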

4.6.2 Seidel's algorithm

Seidel's algorithm will use the graph G^2. In addition, we use P[G], a matrix based on the graph G, where:

P[G](i, j) = 1 if the shortest path from i to j has odd length; 0 otherwise

Seidel's algorithm for all-pairs shortest paths (APSP) works as follows:

Input: An undirected and unweighted graph G = (V, E).
Run: APSP(M[G])
Output: D[G]: the shortest-path distance matrix for graph G.

The algorithm APSP(M[G]) is defined as follows:

APSP(M[G]):
1. compute M[G^2]
2. if M[G^2](i, j) = 1 ∀ i ≠ j
   then
3.   return D[G](i, j) = 0 if i = j; 1 if M[G](i, j) = 1; 2 otherwise
   else
4.   D'[G] = APSP(M[G^2])
5.   compute P[G]
6.   return D[G] = 2 · D'[G] − P[G]
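Putting the pieces together, here is a compact sketch of APSP. It computes P[G] via Claim 30 of Section 4.6.3, and uses naive cubic matrix multiplication for clarity; in practice a fast multiplication routine would be substituted:

```python
def seidel_apsp(M):
    """All-pairs shortest-path distances for a connected, undirected,
    unweighted graph with 0/1 adjacency matrix M (a list of lists).
    A sketch of Seidel's algorithm."""
    n = len(M)

    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    M2 = matmul(M, M)
    # Step 1: adjacency matrix of G^2
    A = [[1 if i != j and (M[i][j] == 1 or M2[i][j] > 0) else 0
          for j in range(n)] for i in range(n)]
    # Steps 2-3: base case, diameter at most 2
    if all(A[i][j] == 1 for i in range(n) for j in range(n) if i != j):
        return [[0 if i == j else (1 if M[i][j] == 1 else 2)
                 for j in range(n)] for i in range(n)]
    D2 = seidel_apsp(A)                       # Step 4: D'[G] = D[G^2]
    # Step 5: P[G] via Claim 30, using X = D'[G] * M[G]
    X = matmul(D2, M)
    deg = [sum(M[k][j] for k in range(n)) for j in range(n)]
    P = [[0 if X[i][j] >= D2[i][j] * deg[j] else 1
          for j in range(n)] for i in range(n)]
    # Step 6: D[G] = 2 D'[G] - P[G]
    return [[2 * D2[i][j] - P[i][j] for j in range(n)] for i in range(n)]
```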

Proof of Correctness

Proof: Define the diameter of a graph as the largest of all shortest-path distances in the graph. The correctness of Seidel's algorithm can be shown using induction on the diameter of G.
Base case: the diameter is at most 2, thus ∀i ≠ j, M[G^2](i, j) = 1, so APSP stops at Step 3. In this case, Step 3 can be seen by inspection to return the correct answer.
Inductive case: Assume D'[G] = D[G^2] is correctly computed.
Case 1: If P[G](i, j) = 0, then the shortest-path distance between i and j is even. This immediately implies (and is illustrated in Figure 4.12) that the path distance between i and j in G^2 is half of that in G, or

D[G^2](i, j) = (1/2) D[G](i, j)
2
Figure 4.12: Path of even length

Figure 4.13: Path of odd length

Case 2: If P[G](i, j) = 1, then the shortest-path distance between i and j is odd. If this is the case, we must take at least one of the edges from the original graph G (since otherwise we will always be an even number of hops away from i). Thus, if the shortest path in G has length 2m + 1, then the shortest path in G^2 will take m of the augmented edges and 1 of the original edges, for a path length of m + 1. This is illustrated in Figure 4.13. Thus,

D[G^2](i, j) = (1/2)[D[G](i, j) − 1] + 1
             = (1/2) D[G](i, j) + 1/2
             = (1/2)[D[G](i, j) + 1]

Thus, D[G] = 2 · D'[G] − P[G] in both cases.

Runtime of Seidel's algorithm

Step 1 and Step 5 take O(μ(|V|)) time, where μ(|V|) is the runtime of the fastest algorithm for multiplying two |V| × |V| matrices. We have already seen how to do this for Step 1. We describe how to achieve this runtime for Step 5 in the next subsection; it requires a clever technique.
Steps 2 and 3 take O(|V|^2) time, since they look at all entries of their respective matrices. Step 6 also requires O(|V|^2) time. Step 4 results in O(log |V|) recursive calls to the algorithm. To see this, note that we are always dividing the diameter of the graph by 2 (ignoring parity), and the diameter is at worst |V|.
Thus, the total running time is O(μ(|V|) log |V|), where μ(|V|) = Ω(|V|^2), since matrix multiplication must at least examine every matrix entry once.
4.6.3 Computing the matrix P[G] in time O(μ(|V|))

Let X = D'[G] · M[G]. Remember that D'[G] is the shortest-path matrix of G^2 and M[G] is the adjacency matrix of G.

Claim 30

P[G](i, j) = 0 ⟺ X(i, j) ≥ D'[G](i, j) · degree_G(j),

where degree_G(j) is the number of edges that are incident to node j in graph G.

Given this claim, we can use the matrix D'[G] to compute the matrix P[G] in time O(μ(|V|)). To do so, we first compute the matrix X as well as the degree in G of every vertex. Then, we compute each entry of the matrix P[G] individually by comparing X(i, j) with D'[G](i, j) · degree_G(j).
Proof: In order to prove this claim, we need to show that both directions of the implication are true. To do so, we break the proof into two parts:

P[G](i, j) = 0 ⟹ X(i, j) ≥ D'[G](i, j) · degree_G(j)

P[G](i, j) = 1 ⟹ X(i, j) < D'[G](i, j) · degree_G(j)

Entry (i, j) of the matrix X is equal to the following:

X(i, j) = Σ_{k=1}^{|V|} D'[G](i, k) · M[G](k, j)
        = Σ_{k ∈ Adj_G(j)} D'[G](i, k)

From this representation it is apparent that X(i, j) is the sum of the shortest-path lengths in G^2 from i to the neighbors of j. Note that the number of terms in the sum is degree_G(j). First, we prove that if P[G](i, j) = 0, then X(i, j) ≥ D'[G](i, j) · degree_G(j). If each of the terms of X(i, j) is at least D'[G](i, j), then their sum is at least D'[G](i, j) · degree_G(j). Hence we only need to prove that if P[G](i, j) = 0, then

∀k ∈ Adj_G(j), D'[G](i, k) ≥ D'[G](i, j)

In other words, we shall show that (in G^2) the distance to k cannot be less than the distance to j whenever k is one hop away from j in G and P[G](i, j) = 0. To show this, we look at two cases:
Case 1: k is not on any shortest path from i to j in the graph G (see Figure 4.14). If this is the case, then in the graph G, the length of the shortest path from i to k must be at least as long as the length of the shortest path from i to j. Thus, in the graph G^2, the length of the shortest path from i to k cannot be shorter than the length of the shortest path from i to j.
Case 2: k is on a shortest path from i to j in the graph G (see Figure 4.15). In this case, the shortest path in the graph G from i to k has odd length, and thus the shortest path from i to k in the graph G^2 must use at least one edge that is in G. On the other hand, the shortest path from i to j in G has even length, and thus the shortest path from i to j in G^2 requires only edges of G^2 that are not in G. Each edge of the shortest path in G^2 from i to j makes two steps of progress in G, whereas at least one edge of the shortest path in G^2 from i to k makes only one step of progress in G. Hence the length of the shortest path from i to j in G^2 is equal to that of the shortest path from i to k in G^2.

Figure 4.14: Example for case 1

Figure 4.15: Example for case 2
Thus, P[G](i, j) = 0 ⟹ X(i, j) ≥ D'[G](i, j) · degree_G(j). We now move on to the second half of the proof, where the parity of the shortest path from i to j is odd. We must prove that if P[G](i, j) = 1, then X(i, j) < D'[G](i, j) · degree_G(j). To do so, we demonstrate that if P[G](i, j) = 1, then

∀k ∈ Adj_G(j), D'[G](i, k) ≤ D'[G](i, j)

and

∃k ∈ Adj_G(j), D'[G](i, k) < D'[G](i, j)
Again we look at two cases.

Case 1: k is not on a shortest path from i to j in G (see Figure 4.16). This is similar to Case 2 above. Because the length of the shortest path from i to j in G is odd, the shortest path from i to j in G^2 must use at least one edge from G. However, to reach k in G^2, we can use only edges that are in G^2 but not in G. Hence D'[G](i, k) ≤ D'[G](i, j).
Case 2: k is on a shortest path from i to j (see Figure 4.17). We see that D'[G](i, k) < D'[G](i, j), since k can be reached from i in G^2 using only edges that are in G^2 but not in G, while we need to take an extra step along an edge in G to get to j. Therefore, D'[G](i, k) < D'[G](i, j).
This concludes the proof.

Figure 4.16: Example for case 1

Figure 4.17: Example for case 2


Chapter 5

Network Flow

We next look at the problem of network flow. Variations of network flow come up in a number of different
settings. Perhaps more importantly, we shall see that we can use algorithms for network flow to solve a large
number of other problems as well. Thus, network flow is a useful technique for designing algorithms.

5.1 Definitions

The input to a flow problem is a flow network. A flow network consists of a graph, a capacity for each edge of the graph, and two designated vertices, called the source s and the sink t. The idea behind network flow is that we must transport as much as possible of some commodity from s to t. We can think of this commodity as a fluid such as oil or water, or any other substance (such as hockey pucks) that is neither created nor destroyed at the intermediate nodes. The constraint is that the capacity associated with each directed edge limits the amount of the commodity that we can transport along that edge. A solution to the network flow problem is an assignment of flows to edges such that (a) flow is conserved at each intermediate node of the graph, and (b) no edge capacity is exceeded. An optimal solution is a solution that transports the maximum amount of the commodity from s to t.
The graph depicted in Figure 5.1 is an example of a flow network. It shows the capacity of each edge and the values (in parentheses) of a sample flow.
We next provide a more formal description of network flow.
Input:

Directed graph G = (V, E).

Capacities C(u, v) > 0 for (u, v) ∈ E. Note that we assume C(u, v) = 0 for (u, v) ∉ E.

A source node s, and a sink node t.

Output: The maximum flow f from s to t, where f : V × V → R has the following properties:

1. Skew-symmetry: ∀u, v ∈ V, f(u, v) = −f(v, u).

2. Conservation of Flow: ∀v ∈ (V − {s, t}), Σ_{u∈V} f(u, v) = 0.

Figure 5.1: Example Flow Network

3. Capacity Constraint: ∀u, v ∈ V, f(u, v) ≤ C(u, v).

We define the size of a flow as the total flow coming out of the source:

|f| = Σ_{v∈V} f(s, v)

We will show that the size of the flow is also always equal to the flow going into the sink:

|f| = Σ_{v∈V} f(v, t)

In other words, the total flow leaving the source is equal to the total flow coming into the sink. Our objective is to maximize |f|. In order to measure the largest possible flow, the notion of a cut in the flow network is crucial.

Definition 31 An s–t cut of G is a partition of the vertices V into two sets A and B such that s ∈ A and t ∈ B.

The capacity of cut (A, B) is defined as the sum of the capacities of all edges that cross the cut:

C(A, B) = Σ_{u∈A, v∈B} C(u, v).

See Figure 5.2. The flow across a cut is defined similarly, as the total flow over all edges that cross the cut:

f(A, B) = Σ_{u∈A, v∈B} f(u, v).

For any cut, by summing over the edges that cross that cut, we see that the flow across it is at most the capacity across the cut. In other words, f(A, B) ≤ C(A, B). Furthermore, we can show that for any flow f, the flow across all s–t cuts is the same:

Claim 32 For any flow f and s–t cut (A, B), f(A, B) = |f|, where |f| is the total flow out of s.
Figure 5.2: An example s–t cut

Proof: We prove this by induction on the size of A.

Base Case: |A| = 1. In this case, the claim follows from the definition of the size of a flow, since A = {s}.
Inductive Step: We assume that f(A, B) = |f| for all A such that |A| = k. We then show that f(A', B') = |f| for all A' such that |A'| = k + 1.
Consider some cut (A', B') such that |A'| = k + 1. Consider the new cut formed by moving any u ∈ A' − {s} to the set B'. This gives us the cut (A' − {u}, B' ∪ {u}). The flow over the cut (A' − {u}, B' ∪ {u}) is the flow over the original cut, plus the flow over the edges from v to u for v ∈ A', minus the flow over the edges from u to v for v ∈ B'. Thus,

f(A' − {u}, B' ∪ {u}) = f(A', B') + Σ_{v∈A'} f(v, u) − Σ_{v∈B'} f(u, v)

From the Conservation of Flow, as well as Skew-symmetry, we know that

Σ_{v∈A'} f(v, u) − Σ_{v∈B'} f(u, v) = Σ_{v∈V} f(v, u) = 0.

Therefore, f(A' − {u}, B' ∪ {u}) = f(A', B'). Since the size of the cut (A' − {u}, B' ∪ {u}) is k, the Inductive Hypothesis tells us that f(A' − {u}, B' ∪ {u}) = |f|, and so we have shown that f(A', B') = |f| for all A' such that |A'| = k + 1.
The consequence of the above proof is that the total flow into t equals the total flow out of s. It also lets us give an upper bound on |f|. If we consider all s–t cuts in the graph, the size of the flow through the graph must be less than or equal to the capacity of each of them, including the capacity of the minimum cut. Thus, the minimum cut capacity is an upper bound on the maximum achievable flow. We show in the next section that, perhaps surprisingly, we can always achieve this upper bound: there is always a flow whose value is equal to the capacity of the cut with the minimum capacity.

5.2 The Max-Flow Min-Cut Theorem

Theorem 33 For any flow network and flow f for that network, the following two statements are equivalent:

1. f is a maximum flow.
Figure 5.3: Residual Network for the Example Flow Network

2. ∃ an s–t cut (A, B) such that |f| = C(A, B).

In other words, the size of the maximum flow is always equal to the capacity of the minimum cut. To prove this, we need to introduce the concept of a residual network.

5.2.1 Residual Networks

Residual networks measure the capacity of a network that a flow is not using.

Definition 34 Given a flow network G and flow f in G, the Residual Network G_f is defined as

G_f = (V, E_f), where E_f = {(u, v) : C(u, v) − f(u, v) > 0}

C_f(u, v) = C(u, v) − f(u, v).

In other words, E_f is the set of edges over which we could send more flow. The capacity of an edge in E_f is the amount of additional flow we could send. Figure 5.3 shows the residual network of Figure 5.1.
The residual capacity for an edge is the original capacity minus the flow along that edge. For example, the edge from v1 to v3 has residual capacity 10 − (−1) = 11, and the edge from v2 to v1 has residual capacity 0 − (−12) = 12.
Note that edges that were not in the original network can appear in the residual network. However, for an
edge to be in the residual network, either it or the reverse edge must be in the original network. Thus, the
number of edges in the residual network is at most twice the number in the original network.

5.2.2 Augmenting Paths

In order to prove the Max-Flow Min-Cut Theorem, we need one more concept: that of an augmenting path.

Definition 35 An augmenting path p for flow f is a path from s to t in graph Gf .


Figure 5.4: Cut (A, B)

If there is an augmenting path from s to t then we can increase the flow in the original network. To do
so, we increase the flow along every edge of the augmenting path by some fixed amount. The limit on how
much we can increase the flow is the minimum capacity edge along the augmenting path. The resulting flow
will not exceed the capacity of any edge. Also, since each intermediate node on the augmenting path has an
incoming edge as well as an outgoing edge, the resulting flow will also obey flow conservation.

Definition 36 The bottleneck capacity b(p) for an augmenting path p is the minimum capacity in Gf of any
edge on p.

If we find an augmenting path p from s to t then we can increase the flow by b(p) units along that path.
This will be useful when we define algorithms for finding the maximum flow; these algorithms will be based
on finding augmenting paths.

5.2.3 Proof of Max-Flow Min-Cut Theorem

To prove the Max-Flow Min-Cut theorem, we use a third statement involving the existence of augmenting paths. In particular, we show that, given a flow network G and flow f, the following three statements are equivalent:

1. f is a maximum flow in G.

2. ∃ an s–t cut (A, B) of G such that |f| = C(A, B).

3. There is no augmenting path in G_f.

Proof:

2 ⟹ 1: Assume there is a cut such that the value of the flow is equal to the capacity of the cut. If we were to increase the flow, then we would also increase the flow across this cut and exceed the capacity of the cut. Therefore, the flow f cannot be increased, and f is a maximum flow.

1 ⟹ 3: We prove the contrapositive: if there is an augmenting path, then the flow can be increased. To see that this is true, note that if there is an augmenting path p, then the flow can be increased by b(p) units along that path.

3 ⟹ 2: Suppose that G_f has no augmenting path, that is, that G_f contains no path from s to t. Define:
A = {v : v is reachable from s in G_f}
B = V − A.
Consider the cut (A, B): we have s ∈ A and t ∉ A, because there is no path from s to t in G_f. For any u, v such that u ∈ A and v ∈ B, we have f(u, v) = C(u, v), since otherwise (u, v) ∈ E_f and then v would be in the set A. Therefore the flow between any pair of vertices that crosses the cut must be equal to the capacity of the edge between that pair of vertices. By summing over all pairs of vertices that cross the cut (A, B), we see that C(A, B) = f(A, B) = |f|.

5.3 The Ford-Fulkerson Algorithm [FF56]

The first algorithm for network flow we consider actually dates back to the 1950s. It proceeds as follows:

1. flow f = 0
2. while there exists an augmenting path p for f
3. find p
4. augment f by b(p) units along p
5. return f

Note that with this algorithm, it is possible for an augmentation to decrease the flow on a specific edge. However, the overall flow always increases. We also know that if we can't find any more augmenting paths, then we must have a maximum flow. Thus, if the algorithm returns a flow, then it must be a maximum flow.
We first analyze this algorithm for the case where all capacities are integers. Note that this implies that every flow obtained during the Ford-Fulkerson algorithm must have an integer value. This in turn tells us that |f*|, where f* is the maximum flow, is also an integer.
With integer capacities, the runtime of the algorithm is O(|E| · |f*|). To see this, note that we require O(|E|) time for finding each residual network and its corresponding augmenting path. Each augmenting path increases the flow by at least 1, and so the maximum number of augmenting paths required is O(|f*|). Note that this is not a polynomial solution, since (similar to the knapsack problem) |f*| may not be polynomial in the size of the input.

5.3.1 Problems with Ford-Fulkerson


1. As long as the optimal flow value |f*| is small, the running time of Ford-Fulkerson is good. However, when |f*| becomes very large, the running time grows unacceptably. In Figures 5.5 and 5.6, we can see that the single edge with capacity one can become a tremendous bottleneck, forcing a ridiculous 2,000,000,000 augmentations.
2. If the network has irrational capacities, then the Ford-Fulkerson algorithm is not guaranteed to halt, or even to converge to the correct answer.

5.4 Ford-Fulkerson Algorithm with the Edmonds-Karp Heuristic [EK72]

This algorithm runs in polynomial time and works with irrational capacities, correcting the main problems with the basic Ford-Fulkerson Algorithm.
Figure 5.5: An augmenting path with residual capacity 1

Figure 5.6: The resulting residual network and another augmenting path with residual capacity 1

5.4.1 The Algorithm

1. flow f = 0
2. while there exists an augmenting path p for f
3. find shortest (unweighted) augmenting path p
4. augment f by b(p) units along p
5. return f

Note the change in step three, where we are more selective with the augmenting path we choose. The shortest
augmenting path p can be found with a breadth-first search.
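The resulting algorithm can be sketched as follows (an adjacency-matrix residual representation keeps the code short; the vertex numbering and input format are our choice):

```python
from collections import deque

def max_flow(n, edges, s, t):
    """Edmonds-Karp: Ford-Fulkerson with BFS-shortest augmenting paths.
    edges is a list of (u, v, capacity) triples over vertices 0..n-1."""
    cap = [[0] * n for _ in range(n)]   # residual capacities C_f(u, v)
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        # BFS in the residual network G_f for a shortest augmenting path
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return flow                 # no augmenting path: flow is maximum
        # bottleneck capacity b(p) along the path found
        b, v = float('inf'), t
        while v != s:
            b = min(b, cap[parent[v]][v])
            v = parent[v]
        # augment by b units; pushing flow adds residual capacity backwards
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= b
            cap[v][u] += b
            v = u
        flow += b
```

The matrix representation makes each BFS cost O(|V|^2) rather than O(|E|); an adjacency-list version recovers the O(|E|^2 |V|) bound of Theorem 37.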

Theorem 37 The Ford-Fulkerson algorithm, when run with the Edmonds-Karp heuristic, finds a maximum
flow in time O(|E|2 |V |).

To prove this theorem, we first prove a useful lemma.

Definition 38 δ_f(s, u) is the shortest unweighted path distance from s to u in the residual network G_f.

Lemma 39 δ_f(s, u) is non-decreasing as f changes.

In other words, as we keep augmenting the flow, the shortest-path distance may change, but it never decreases.
Proof: We prove this by contradiction. Thus, assume that we have an augmentation from f to f' and a vertex v such that the shortest-path distance decreases: δ_f'(s, v) < δ_f(s, v). (*)
Figure 5.7: G_f'

Figure 5.8: G_f

If there is more than one such vertex, let v be the closest one to s in G_f' for which (*) holds; that is, the distance δ_f'(s, v) is smallest for this vertex.
Let u be the second-to-last vertex on the shortest path from s to v in G_f'. See Figure 5.7.

Claim 40 (u, v) ∉ E_f, i.e., the edge (u, v) is not an edge in the residual network for the flow f.

Proof: Otherwise, δ_f(s, v) ≤ δ_f(s, u) + 1. However, since v is the closest vertex in G_f' for which the shortest-path distance decreases, it must be the case that δ_f(s, u) ≤ δ_f'(s, u). Thus, δ_f(s, v) ≤ δ_f'(s, u) + 1 = δ_f'(s, v). Since this contradicts (*), the claim must be true.
Since (u, v) ∈ E_f', Claim 40 implies that the augmenting path for the augmentation from f to f' contains the edge (v, u). Since this augmentation uses a shortest path, it must be the case that δ_f(s, u) = δ_f(s, v) + 1. This gives us the following: δ_f(s, v) = δ_f(s, u) − 1 ≤ δ_f'(s, u) − 1 = δ_f'(s, v) − 2. However, this again contradicts (*), and so Lemma 39 is proven.

Definition 41 Edge (u, v) is critical for flow f if (u, v) is on the augmenting path p for f and C_f(u, v) = b(p).

To prove that Theorem 37 is true, we demonstrate the following:

Lemma 42 Between occasions when (u, v) is critical, δ_f(s, u) increases by at least 2.

Note that Lemma 42 is all that is required: the maximum possible distance to any node is |V|, and so each edge can be critical at most |V|/2 times. There are at most 2|E| edges in the residual network, and thus the total number of augmentations is at most |V||E|. Finding the shortest path with breadth-first search requires time O(|E|). Therefore, the total running time is O(|E|^2 |V|).
Proof: (of Lemma 42) At least one edge on any augmenting path must be critical. Let (u, v) be critical for f. Since we use a shortest path for the augmentation, δ_f(s, u) = δ_f(s, v) − 1. See Figure 5.8. After we have augmented the flow along an augmenting path, any critical edge on the path disappears from the residual network. Let f'' be the next flow where (u, v) appears in G_f'', and let f' be the flow right before f''. See Figure 5.9.
Since (u, v) is in G_f'' and not in G_f', the augmenting path for f' must contain (v, u). Thus δ_f'(s, v) = δ_f'(s, u) − 1. Combining this with Lemma 39, we see that δ_f(s, u) = δ_f(s, v) − 1 ≤ δ_f'(s, v) − 1 = δ_f'(s, u) − 2. Therefore, δ_f(s, u) increases by at least two between occasions where (u, v) is critical.
Figure 5.9: G_f'

Figure 5.10: G is a bipartite graph, with thick edges denoting a matching from U to V. G' is the corresponding flow network. The figure also depicts a maximum flow, as well as the corresponding matching in G'.

5.5 Application of Network Flow: Bipartite Matching

We next demonstrate how to use network flow to solve other problems. In particular, we show how to find a maximum-cardinality bipartite matching using network flow. This is an alternative (but in fact closely related) approach to the matroid-intersection algorithm we looked at previously.
Given a bipartite graph G = (U, V, E), we define a corresponding flow network G'. To construct G', we add two vertices to G: a source s and a sink t. Then we add edges from s to every vertex in U, and edges from every vertex in V to t, with the edges of E directed from U to V. The capacity of every edge is set to 1. An example is given in Figure 5.10.

Claim 43 G has a matching of size k ⟺ G' has a flow of size k.

Thus, by finding the maximum flow, we also find the size of the maximum matching. In fact, we can use any algorithm that finds integral maximum flows to actually construct the maximum matching.
Proof: We show that (a) a matching of size k can be used to construct a flow of size k, and (b) a flow of
size k can be used to construct a matching of size k. For (a), note that we can deliver one unit of flow from
s to every vertex in U that is incident to an edge of the matching, one unit of flow along every matching
edge, and one unit of flow to t from every vertex in V incident to a matching edge. This is a valid flow of
the same size as the original matching.
For (b), we show that any integral flow has the form shown in network G′ of Figure 5.10. In particular, that
the edges that carry flow between vertices of U and vertices of V form a matching. To see this, recall that
the maximum flow in a network with integer capacities can always be achieved using integer flows for each
edge. Thus, we can assume that we have an integral flow. Since the total capacity of flow that can arrive at
any vertex u ∈ U is 1, the fact that flows are integral implies that there can be at most one outgoing edge
carrying flow from u. Similarly, since the capacity of the edge from any vertex v ∈ V to t is 1, there can
be at most one incoming edge carrying flow to v. Thus, the edges that carry flow from U to V must form
a matching. Since ({s} ∪ U, {t} ∪ V ) forms an s-t cut, the flow across that cut must be k. Thus, we can
construct a matching of size k.
Chapter 6

Randomized Algorithms

The next algorithmic paradigm we study is the use of randomness. Randomized algorithms introduce a
nondeterministic factor into the behavior of an algorithm, so that two runs of the algorithm with the same
input may produce different behaviors. Although it may not seem intuitive at first, this technique can offer
algorithmic approaches which are either faster or simpler than a deterministic algorithm for the same task.
There are also problems with known polynomial time randomized algorithms, but no known deterministic
algorithms. In our study of randomization, we shall examine some specific algorithms, and also provide some
general techniques for studying randomization in computer science.

6.1 Quicksort

Quicksort is a sorting algorithm based on the divide and conquer strategy. To sort an array of elements,
one element is selected to be the pivot. Then the input array is split into two subarrays, one containing
the elements that are less than the pivot and the other containing elements greater than the pivot. Each
subarray is sorted recursively and the results (now sorted) are joined.

The running time of this algorithm depends on how the pivot element is chosen. There are several possible
strategies:

1. Use first element: Easy, but very bad worst-case behavior: if the list is already sorted the running time
will be Θ(n²).

2. Find median: This would produce an O(n log n) version of quicksort, provided we can find the median in
linear time. This is possible. However, finding the median in linear time requires somewhat complicated
algorithms, and leads to a version of quicksort with high constant factors.

3. Use random element: A simple solution that, for any input, yields a good running time most of the
time.

In this lecture, we study the behavior of this third choice: choosing the pivot element randomly. However,
before we can analyze this algorithm, we must define how we shall measure running time when dealing with
randomized algorithms.
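Before the analysis, here is a minimal sketch of quicksort with a random pivot. The three-way split (separating elements equal to the pivot) is our own implementation choice; the notes describe only the two-subarray version.

```python
import random

def quicksort(a):
    """Sort a list by recursively partitioning around a uniformly random pivot."""
    if len(a) <= 1:
        return a
    pivot = random.choice(a)               # the random pivot rule being analyzed
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    # sort each side recursively; joining the results costs nothing extra
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([11, 16, 3, 20, 8, 6, 22, 12]))  # [3, 6, 8, 11, 12, 16, 20, 22]
```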

Figure 6.1: An example of Quicksort. The array [11, 16, 3, 20, 8, 6, 22, 12] is partitioned around the
pivot 11 into [6, 3, 8] and [20, 16, 22, 12]; the subarrays are sorted recursively and joined at no cost to
produce [3, 6, 8, 11, 12, 16, 20, 22].

Figure 6.2: The positions of i, j and pivots P1 and P2 in the final sorted list.

Definition 44 Average runtime: T(n) = max_{|x|=n} E[T(x)], where E denotes the expected value with respect
to random choices made by the algorithm.

In other words, for every input, we measure the average running time for that input (where the probabilistic
nature of an algorithm can lead to different running times for a given input), and T (n) is the worst case
average running time over all inputs of size n.

Theorem 45 For quicksort with a random pivot, T (n) = O(n log n).

Proof: We measure the running time of quicksort in terms of the number of pairwise comparisons it
performs, and leave it as an exercise to prove that the actual running time of quicksort is Θ(number of
pairwise comparisons).
In order to do so, we ask the following question: What is the probability that the ith smallest element is
compared to the jth smallest element? In studying this question, we shall assume i < j.
A comparison between i and j occurs if and only if either i or j is the first pivot chosen in the range from i
to j. To see this, note that if the first pivot in this range is p such that i < p < j, then i and j are placed
in different subarrays having only been compared to previously chosen pivots, and are subsequently never
compared to each other. On the other hand, if the first pivot in this range is either i or j, then when that
pivot is chosen, the two are in the same subarray, and will be compared to each other. See Figure 6.2.
There are j − i + 1 elements between the ith smallest element and the jth smallest element (inclusive), and the
probability for any of these elements to be chosen as the pivot is the same, so the probability that the ith
smallest element is compared to the jth smallest element is:

Pr[i compared to j] = |{i, j}| / |range from i to j| = 2 / (j − i + 1)

We use the following indicator random variable:

Z_ij = 1 if the ith smallest element is compared with the jth smallest element, and 0 otherwise,

where we just argued that

Pr[Z_ij = 1] = 2 / (j − i + 1)

We are interested in the total number of comparisons performed:

Z = Σ_{1≤i<j≤n} Z_ij

and in particular, we are interested in the expected number of comparisons performed, which is the expectation
of Z:

E[Z] = E[Σ_{1≤i<j≤n} Z_ij] = Σ_{1≤i<j≤n} E[Z_ij]

The second equality above follows from an important property of expectations called linearity of expectation
(which states that the expectation of a sum is the sum of the expectations). We saw above that

Σ_{1≤i<j≤n} E[Z_ij] = Σ_{1≤i<j≤n} 2/(j − i + 1) = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1/(j − i + 1)

Looking at the last sum on the right hand side, we observe that this is a harmonic series, and the result is
a Harmonic number, which is defined by:

H_n = 1 + 1/2 + 1/3 + 1/4 + ... + 1/n = Σ_{k=1}^{n} 1/k

For harmonic numbers, H_k ≈ ln k, so the expected number of comparisons is:

2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1/(j − i + 1) ≤ 2 Σ_{i=1}^{n−1} H_n ≤ 2n ln n

Thus the expected running time of quicksort is T (n) = O(n ln n) for any input.
6.1.1 Las Vegas and Monte Carlo Algorithms

Quicksort is a Las Vegas algorithm. This means that the output is deterministic (and correct - it will always
return a sorted list) but that the runtime is variable and affected by the randomization. In particular, it
could have a running time as bad as Θ(n²), although it turns out that for this particular algorithm, it is
unlikely that the running time is far from its expectation. A second type of randomized algorithm is a Monte
Carlo algorithm. The runtime of this algorithm is deterministic, but it is not guaranteed to always return
the correct answer. Instead, it has a (hopefully small) probability of returning an incorrect answer.

6.2 The Min-Cut Problem

We next consider an algorithm for finding the minimum cut in a graph. The randomized algorithm which
we present for this problem is a Monte Carlo algorithm: it may sometimes produce an incorrect solution.
However, we are able to bound the probability of obtaining such an incorrect solution.
The problem we define is as follows:
Input: A connected, undirected graph G = (V, E).
Output: A minimum cut in G, i.e., a minimal set of edges whose removal breaks G into two or more
components.
Note that this problem is different from finding an s-t cut, since we are not specifying two vertices that must
fall on either side of the cut.

6.2.1 Obvious Algorithms

The most obvious technique would be to try all possible cuts. This requires exponential time (for example,
note that there are Θ(2^|E|) subsets of the edges). Thus, such an approach is completely unreasonable.
Another straightforward algorithm for the Min-Cut problem is to find a max-flow between every pair of
vertices (s, t) in the graph. Then, for each pair of vertices, find the minimum s-t cut from the corresponding
max-flow, and return the minimum of these s-t cuts. This algorithm has a running time of O(|V|^5) because
finding the max-flow takes time O(|V|^3) (using the Karzanov algorithm [K74]), and there are O(|V|^2) pairs
of vertices.
An improvement can be made by fixing a single source s. Since in the final min-cut, this particular source
has to be on one side of the cut, it is sufficient to find the minimum s-t cut for a single source, resulting in
an algorithm that runs in time O(|V|^4). However, using a randomized algorithm, there is a much simpler
way to find the minimum cut.

6.2.2 Karger's Randomized Algorithm

1. while the number of vertices > 2
2.     select an edge e = (u, v) uniformly at random;
3.     merge u and v into a single vertex, preserving all edges in the graph except those between u and v;
4. return the remaining edges of the resulting graph.

Figure 6.3 shows an example of how this algorithm works. The input graph contains 10 edges labeled from
a to k. Suppose that, at first, the edge c is picked, randomly, to be eliminated. The two vertices joined by
Figure 6.3: An example run of Karger's Algorithm

c merge together into a single vertex. Now edges a and d connect the same pair of vertices. Next, edge h
is chosen, and then edge e. Note that when merging the vertices of edge e, edge f also disappears. This
process continues until two vertices are left. At this point, the remaining edges are b and k.
Luckily, these two edges indeed form a minimum cut of the original graph. In fact, the only two minimum
cuts of this graph are {a, c} and {b, k}. If we had had a step that chose an edge from the {a, c} cut as well as
a step that chose an edge from the {b, k} cut, then the result of Karger's algorithm would be incorrect.
Although we cannot prevent this from happening, we can guarantee a certain probability that a correct cut
is returned by Karger's algorithm, as seen in the following theorem:
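A minimal sketch of the contraction procedure in Python. The union-find bookkeeping and the rejection of self-loop edges are implementation choices of ours, not from the notes; rejecting self-loops makes picking uniformly from the original edge list equivalent to picking uniformly among the surviving multigraph edges.

```python
import random

def karger_min_cut(edges, trials=1):
    """Karger's contraction algorithm on a multigraph given as a list of
    (u, v) edges. Runs `trials` independent contractions and keeps the
    smallest cut seen."""
    best = None
    for _ in range(trials):
        vertices = {v for e in edges for v in e}
        parent = {v: v for v in vertices}   # union-find over super-vertices

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        remaining = len(vertices)
        while remaining > 2:
            u, v = random.choice(edges)     # uniform over original edges...
            ru, rv = find(u), find(v)
            if ru != rv:                    # ...skipping self-loops keeps the
                parent[ru] = rv             # choice uniform over surviving edges
                remaining -= 1
        # the cut is the set of edges crossing the two final super-vertices
        cut = [(u, v) for (u, v) in edges if find(u) != find(v)]
        if best is None or len(cut) < len(best):
            best = cut
    return best
```

Repeating the contraction many times and keeping the smallest cut found boosts the success probability; that is the role of the `trials` parameter here.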

Theorem 46 Let C be a minimum cut of graph G = (V, E), and let n = |V|.

Pr[C is returned by Karger's algorithm] ≥ 2/n².

Proof: Let k be the number of edges in C. Every vertex in G must have degree at least k, since if some
vertex v has degree less than k, then all of the edges of v form a cut which is smaller than C. Since there
are n vertices in G, there are at least nk/2 edges in G. Given this, we have

Pr[first randomly chosen edge ∈ C] ≤ k/(nk/2) = 2/n.

Suppose at some step of the algorithm, there are l vertices left. We call the graph at this step G_l. Since
a cut in G_l is necessarily a cut in G, the size of the minimum cut in G_l is at least k. Following the same
argument as before, there are at least kl/2 edges in G_l. We say C is hit if some edge in C is randomly
chosen. Thus:

Pr[C is hit when there are l vertices left | C is not hit before] ≤ k/(kl/2) = 2/l.

We define the following two types of events:

H_l: the event that C is not hit in G_l.

S_l: the event that C survives up to G_l.

Using the fact that

Pr[A ∩ B] = Pr[A | B] · Pr[B]

(where A and B are not necessarily independent and Pr[A | B] denotes the conditional probability of A given
B), we get:

Pr[C is returned by algorithm] = Pr[C survives up to G_2]
  = Pr[S_2]
  = Pr[H_3 ∩ S_3]
  = Pr[H_3 | S_3] · Pr[S_3]
  = Pr[H_3 | S_3] · Pr[H_4 ∩ S_4]
  = Pr[H_3 | S_3] · Pr[H_4 | S_4] · Pr[S_4]
  = ...
  = ∏_{l=3}^{n} Pr[H_l | S_l]
  ≥ (1 − 2/n)(1 − 2/(n−1))(1 − 2/(n−2)) ··· (1 − 2/4)(1 − 2/3)
  = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · (n−5)/(n−3) ··· 2/4 · 1/3
  = (2 · 1)/(n(n−1)) ≥ 2/n²

Hence, Karger's algorithm is guaranteed to return the minimum cut with probability at least 2/n².

6.2.3 Full version of Karger's algorithm

The above probability is too small for the algorithm to be of much use. However, we can boost this probability
by repeating Karger's Algorithm m times, and returning the smallest cut found. This gives an algorithm
which is quite useful. We do so as follows:

n = |V|
repeat m times
    run Karger's Algorithm
    check size of cut, keep current minimum
return smallest cut found

This method only fails if every run of Karger's Algorithm does not find a minimum cut. What value of m
should we use to minimize the probability of failure? Given ε > 0, we can ensure that the probability that
this algorithm fails is at most ε by adjusting m accordingly. In particular, if we let

m = (n²/2) ln(1/ε),

then

Pr[C is never returned] ≤ (1 − 2/n²)^m
  = (1 − 2/n²)^{(n²/2) ln(1/ε)}
  ≤ (1/e)^{ln(1/ε)}
  = ε.

The second inequality is from the fact that for all k > 0,

(1 − 1/k)^k ≤ 1/e.   (6.1)

In order to be confident that the revised Karger's Algorithm returns a minimum cut, we can set ε to a small
value such as ε = e^{−100}. Then, we will want to run Karger's Algorithm

m = (n²/2) ln(e^{100}) = 50n²

times.
How small is this probability ε = e^{−100}? Well, the probability of a person getting hit by a meteor while
running this algorithm is at least e^{−60} (by a rather conservative estimate), which is much greater than
e^{−100}.
The running time of Karger's algorithm as described is O(n⁴ log(1/ε)). However, using a more involved
technique, the running time of the algorithm can be improved to

O((n² log n) log(1/ε)),

where ε is the probability of obtaining an incorrect result.

6.3 Verifying Polynomial Identities

We next consider the following problem:


Input: An n-variable polynomial Q(X1, X2, ..., Xn) of degree d, with a representation that is easy to evaluate
at any point.
Output: The answer to the question "Is Q(X1, X2, ..., Xn) ≡ 0?"
The sign ≡ means "equivalent", i.e., no matter what values the variables take, the polynomial is always zero.
For example,

Q = X1²X2³ + X3 ≢ 0.   (6.2)

In this example n = 3, d = 5. Clearly, this polynomial is not equivalent to 0. On the other hand, consider

Q = (X1 + X2)(X1 − X2) + (X2 + X3)(X2 − X3) + (X3 + X1)(X3 − X1) ≡ 0.   (6.3)

This polynomial is equivalent to zero. To see this we can simply expand the polynomial, and all the terms
cancel each other out.
Applications:

Q might be written in simple form, but the expansion of Q is exponential in this simple form. We shall
see an example of this shortly.

"Is Q ≡ Q′?" is equivalent to the question "Is Q − Q′ ≡ 0?"

In general, there is no known deterministic polynomial time algorithm for these problems. However, we next
introduce a Monte Carlo type of algorithm that works quite well for them.

6.3.1 Schwartz's Algorithm

The following simple algorithm for this problem is actually used in practice (for example by Maple).

1. Let S be a set of distinct real numbers
2. for i = 1 to n do
3.     pick z_i ∈ S uniformly at random
4. if Q(z1, z2, ..., zn) = 0 then return YES
5. else return NO
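A minimal sketch of Schwartz's algorithm in Python, treating Q as a black-box evaluator. The choice S = {0, ..., 2d − 1} anticipates the revised version discussed shortly; it is our choice, not forced by the algorithm.

```python
import random

def schwartz_test(Q, n, d):
    """One run of Schwartz's algorithm. Returns "YES" (Q is probably
    identically zero) or "NO" (Q is certainly not identically zero).
    Q is an n-variable evaluator of degree d."""
    S = range(2 * d)                   # |S| = 2d makes the error prob <= 1/2
    z = [random.choice(S) for _ in range(n)]
    return "YES" if Q(*z) == 0 else "NO"
```

For the identically-zero polynomial of equation (6.3) the answer is always YES; a NO answer, once seen on any input, is conclusive.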

We observe that a NO answer from Schwartz's algorithm is always correct, since we know for sure there is
a set of numbers that makes the polynomial not equal to zero. The YES answer from the algorithm might
be wrong. However, the following theorem states that the probability that it's wrong can be bounded.

Theorem 47 If Q ≡ 0, then the output of Schwartz's algorithm is always correct. If Q ≢ 0, then the
probability that the output is wrong is at most d/|S|, i.e.,

Pr[Q(z1, z2, ..., zn) = 0] ≤ d/|S|.   (6.4)

Proof:
The first half of the theorem is obvious. The second half of the theorem is proved by induction on n.
Base case: If n = 1, Q is a single variable polynomial of degree d. Then there are at most d real numbers
that make Q evaluate to 0. The worst case is that S contains all d roots. Thus,

Pr[Q(z1) = 0] ≤ d/|S|.
Inductive step: Suppose that for all polynomials with fewer than n variables, the claim of the theorem is
true. Now consider a degree d polynomial with n variables, Q(X1, X2, ..., Xn). We can rewrite Q as a
polynomial in X1 (treating the other variables as constants):

Q(X1, X2, ..., Xn) = Σ_{i=0}^{k} Q_i(X2, ..., Xn) X1^i,

where k is the maximum degree of X1. Obviously Q_k(X2, ..., Xn) ≢ 0, because otherwise there would be no
term of X1 with degree k. Also note that the degree of Q_k can be no larger than d − k. Now pick z2, z3, ..., zn
randomly from S. Since Q_k(X2, ..., Xn) is an (n − 1)-variable polynomial, by induction,

Pr[Q_k(z2, ..., zn) = 0] ≤ (d − k)/|S|.

On the other hand, if Q_k(z2, ..., zn) ≠ 0, then

Q(X1, z2, z3, ..., zn) = Σ_{i=0}^{k} Q_i(z2, ..., zn) X1^i

is a one variable (X1) polynomial with degree k. Thus given that Q_k(z2, ..., zn) ≠ 0,

Pr[Q(z1, z2, z3, ..., zn) = 0] ≤ k/|S|,

which is equivalent to saying that

Pr[Q(z1, z2, z3, ..., zn) = 0 | Q_k(z2, ..., zn) ≠ 0] ≤ k/|S|.
For any two probabilistic events E1, E2,

Pr(E1) ≤ Pr(E2) + Pr(E1 | ¬E2),

which can be derived from the law of total probability:

Pr(E1) = Pr(E1 | E2) Pr(E2) + Pr(E1 | ¬E2) Pr(¬E2).

Let

E1: Q(z1, ..., zn) = 0
E2: Q_k(z2, ..., zn) = 0

Thus,

Pr[Q(z1, ..., zn) = 0]
  ≤ Pr[Q_k(z2, ..., zn) = 0] + Pr[Q(z1, ..., zn) = 0 | Q_k(z2, ..., zn) ≠ 0]
  ≤ (d − k)/|S| + k/|S|
  = d/|S|.
6.3.2 Revised Schwartz's Algorithm

choose S with |S| = 2d
repeat Schwartz's Algorithm many times
    if any run returns NO (an evaluation shows Q ≢ 0), return NO
return YES

If we set |S| = 2d, then the probability that Schwartz's algorithm returns the wrong answer is at most
1/2, given that Q ≢ 0. If we are again given ε, and want to ensure that the probability we return an
incorrect answer is at most ε, then we can repeat the algorithm log₂(1/ε) times. This procedure can only
produce an incorrect result if each run of Schwartz's Algorithm returns the wrong answer, which happens
with probability at most

(1/2)^{log₂(1/ε)} = ε.

6.3.3 Boolean Satisfiability

It would be tempting to say that any randomized algorithm with a small probability of error can be run
multiple times and the success rate boosted. An example that demonstrates the failure of such an approach
is the Negative Tautology problem:
Input: Boolean formula φ(X1, X2, ..., Xn).
Output: Is φ ≡ FALSE?
This problem looks similar to the polynomial identity verification problem, but actually it behaves quite
differently. In particular, consider the following randomized algorithm, which is similar in spirit to Schwartz's
algorithm.
Algorithm:
choose a random assignment for φ
if φ evaluates to FALSE return YES
else return NO
Consider the behavior of this algorithm on the input φ = X1 ∧ X2 ∧ ··· ∧ Xn. We know that φ ≢ FALSE.
However,

Pr[φ(z1, z2, ..., zn) = TRUE] = 1/2^n,

thus,

Pr[φ(z1, z2, ..., zn) = FALSE] = 1 − 1/2^n.

If φ ≢ FALSE, the probability that the algorithm gives a wrong answer is 1 − 1/2^n. As a result, we would
need to run the algorithm an exponential number of times in order to produce a result we have any confidence
in.

6.3.4 Application of Schwartz's Algorithm to the Perfect Matching Problem

Definition 48 A perfect matching is a matching of size n.

We will use Schwartz's Algorithm to solve the following perfect matching problem:
Figure 6.4: Example of a Bipartite Graph

Input: A bipartite graph G = (U, V, E), where |U | = |V | = n.


Question: Does G contain a perfect matching?

Note that this is different from actually finding a perfect matching, and in fact, our algorithm will not find
one if it exists. It will only tell us whether or not it exists. Note first that we could use network flow instead
to solve the problem in time O(m√n), where m is the number of edges. However, the following algorithm is
simpler, and, in the case of dense graphs, asymptotically faster.
Application of Schwartz's Algorithm
Let M[G] be the n × n adjacency matrix for G = (U, V, E), defined as follows:

M[G](i, j) = 1 if (u_i, v_j) ∈ E, and 0 otherwise.

Let A[G] be an n × n matrix of indeterminates such that

A[G](i, j) = x_ij if (u_i, v_j) ∈ E, and 0 otherwise.

Example: In Figure 6.4, we see a bipartite graph G = (U, V, E), where |U| = |V| = 4. In this graph G, we
see that E = {(u1, v1), (u2, v4), (u3, v1), (u3, v3), (u4, v2)}, so:

A[G] = | x11  0    0    0   |
       | 0    0    0    x24 |
       | x31  0    x33  0   |
       | 0    x42  0    0   |

Definition 49 Let A be a square matrix. The determinant of A, det(A), is defined as follows:

det(A) = Σ_{σ ∈ S_n} sgn(σ) ∏_{i=1}^{n} A(i, σ(i))

where S_n is the set of permutations of {1, 2, ..., n}, and sgn(σ) = (−1)^k where k is the number of pairwise
exchanges required to transform σ to the identity permutation.

For example, let us find the determinant of this 3 × 3 matrix:

M = | a  b  c |
    | d  e  f |
    | g  h  i |

For n = 3, S_3 = {(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)}, so det(M) = aei − afh − bdi + bfg +
cdh − ceg.

Theorem 50 [Tutte],[Edmonds] A bipartite graph G contains a perfect matching iff det(A[G]) ≢ 0.

Proof: Let A[G] be the n × n matrix derived from a bipartite graph G = (U, V, E), where |U| = |V| = n.
Since each variable x_ij occurs at most once in A[G], there cannot be any cancellation of terms in the
summation. Therefore the determinant is not identically zero if and only if there is a permutation σ for
which the corresponding term in the summation is non-zero. The latter happens if and only if each of the
entries A[G](i, σ(i)), 1 ≤ i ≤ n, is non-zero. This is equivalent to having a perfect matching (corresponding
to σ) in G.
Thus, if there is no perfect matching, then every term is 0. If there is a perfect matching, then there is at
least 1 non-zero term, and so det(A[G]) ≢ 0 (because no terms can cancel).
We can then use Schwartz's algorithm to determine whether or not det(A[G]) ≡ 0. Each run of Schwartz's
algorithm involves the evaluation of one determinant. However, evaluating the determinant of a matrix can
be shown to be equivalent to matrix multiplication. Thus, for dense graphs we have a run time of O(n^{5/2})
using flow techniques, but since matrix multiplication can be performed in time o(n^{5/2}), Schwartz's algorithm
leads to an asymptotically faster algorithm.
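As a sketch, the test can be run over a large prime field rather than the reals, which keeps the determinant computation exact. The prime p, the helper `det_mod_p`, and the 0-indexed interface are our own choices, not from the notes.

```python
import random

def det_mod_p(A, p):
    """Determinant of a square integer matrix mod a prime p, by Gaussian
    elimination (inverses computed via Fermat's little theorem)."""
    A = [[x % p for x in row] for row in A]
    n, det = len(A), 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det                      # a row swap flips the sign
        det = det * A[c][c] % p
        inv = pow(A[c][c], p - 2, p)
        for r in range(c + 1, n):
            f = A[r][c] * inv % p
            for k in range(c, n):
                A[r][k] = (A[r][k] - f * A[c][k]) % p
    return det % p

def has_perfect_matching(n, edges, trials=10, p=(1 << 31) - 1):
    """Schwartz's test on det(A[G]): substitute random field elements for
    the x_ij and evaluate. A True answer is always correct; a False answer
    is wrong with probability at most about n/p per trial."""
    edge_set = set(edges)                  # pairs (i, j), 0-indexed
    for _ in range(trials):
        A = [[random.randrange(1, p) if (i, j) in edge_set else 0
              for j in range(n)] for i in range(n)]
        if det_mod_p(A, p) != 0:
            return True                    # det(A[G]) is not identically zero
    return False                           # probably no perfect matching
```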

6.4 Randomness for Primality Testing

We next consider applying randomization to primality testing:


Input: n.
Question: Is n prime?
We point out that until recently, primality testing was another problem where there was no known deterministic
polynomial time algorithm, but a polynomial time randomized algorithm has been known for quite some
time. Recently, however, a polynomial time deterministic algorithm was introduced for this problem.
The run time of that technique (at least as has been shown thus far) is Ω(log^12 n), which does not
give us a practical way to do primality testing. Thus, it is still interesting to look at randomized techniques.

6.4.1 Simple algorithms



Algorithm 1: Test every integer in [2, √n] to see if it divides n.
If the answers are NO for all tests, n is prime.

Problem: The running time is exponential in the input size, since the number n can be represented using only log n bits.
Before we discuss a second algorithm, we state a theorem that serves as the basis of Algorithm 2:

Theorem 51 (Fermat's Little Theorem) If n is prime then for all a ∈ [1, n), a^{n−1} ≡ 1 (mod n).

This gives us the following algorithm:

Algorithm 2:

Choose a uniformly at random from [2, n − 1].
If a^{n−1} ≢ 1 (mod n), then n is composite.
Otherwise n is prime.

Run time: It runs in polynomial time, because a^{n−1} mod n can be computed using O(log³ n) bit operations.
Problem: Carmichael numbers.
There exists an infinite set of composites called Carmichael numbers, which have the property that a^{n−1} ≡ 1
(mod n) for all a ∈ [1, n) with gcd(a, n) = 1. Thus, the algorithm returns "prime" for such numbers. Fortunately,
Carmichael numbers are rare. The first 5 Carmichael numbers are 561, 1105, 1729, 2465, and 2821. However, for these
bad inputs, no matter how many times we run the algorithm, the algorithm always returns "prime".
Thus, we describe a better algorithm that is based on the Miller-Rabin test.

6.4.2 Algorithm 3 - The Miller-Rabin test

Miller-Rabin test: [CLRS, Sec 31.8]

If n is composite, then for at least half of the a ∈ (1, n), either a^{n−1} ≢ 1 (mod n) or there exists k such
that r = (n − 1)/2^k is an integer and 1 < gcd(a^r − 1, n) < n.

Algorithm 3:

repeat t times
    pick a ∈ (1, n) uniformly at random.
    if the Miller-Rabin test says n is composite, return composite and stop
return prime.

After this algorithm is repeated t times, we see that if n is prime, then Pr[return prime] = 1, but if n is
composite, then Pr[return prime] ≤ 2^{−t}.
Running time: We already saw that we can check whether a^{n−1} ≢ 1 (mod n) in O((log n)³) bit operations.
We only need to try at most log n different values of k, and as for gcd, Euclid invented an algorithm for gcd
with a running time of O((log n)²) in 300 B.C. Thus, repeating this test t times requires O(t(log n)³) bit
operations.
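A minimal sketch of the Miller-Rabin test in the standard witness form of [CLRS, Sec 31.8], which is equivalent to (though stated differently from) the gcd formulation above; the small-prime pre-check is our own shortcut.

```python
import random

def is_probably_prime(n, t=20):
    """Miller-Rabin primality test. If n is prime the answer is always True;
    if n is composite, True is returned with probability at most 2**-t."""
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 11, 13):
        if n % q == 0:
            return n == q                  # handles small n and easy factors
    r, s = n - 1, 0
    while r % 2 == 0:                      # write n - 1 = 2**s * r with r odd
        r //= 2
        s += 1
    for _ in range(t):
        a = random.randrange(2, n - 1)
        x = pow(a, r, n)                   # fast modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False                   # a is a witness that n is composite
    return True
```

Note that unlike the Fermat test of Algorithm 2, this test is not fooled by Carmichael numbers such as 561.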

6.4.3 Algorithm 4 (The Adleman-Huang test)

In addition to the Miller-Rabin test, there is another test which also runs in polynomial time. This is not as
efficient as the Miller-Rabin test. It is the Adleman-Huang test (not described here). It is in some ways
the opposite of the Miller-Rabin test in terms of performance, that is:
If n is composite: Pr[returns composite] = 1.
If n is prime: Pr[returns composite] ≤ 1/2.

6.4.4 Algorithm 5

Algorithms 3 and 4 can be combined into an algorithm that is always correct and has polynomial expected
running time. This is an example of a Las Vegas Algorithm.

while (true)
    choose a ∈ (1, n) uniformly at random.
    if the Miller-Rabin test says n is composite, return composite and STOP.
    if the Adleman-Huang test says n is prime, return prime and STOP.

This Las Vegas algorithm is correct because it only stops and gives an answer when a test returns an answer
that is guaranteed correct.
It does not have an upper bound on running time, but it is guaranteed to eventually return the correct
answer and stop with probability 1. This should be contrasted with Karger's Min-Cut algorithm, which
has an upper bound on its running time, but is not guaranteed to return the correct answer every time.

6.4.5 Finding large primes and the Prime Number Theorem

We next quickly consider how to find large prime numbers. A very simple way to find prime numbers is to
pick a random number, and test if it is prime. This process can be repeated until a prime number is found.
The efficiency of this procedure depends on how many random numbers we need to choose before we find a
prime. Fortunately, prime numbers are relatively frequent, and thus this procedure is relatively efficient.
The following prime number theorem roughly tells us that the first n numbers contain about n/ln n primes.

Theorem 52 (The Prime Number Theorem) Let π(n) be the number of primes in (1, n].

lim_{n→∞} π(n) / (n/ln n) = 1.

Because of this property of prime numbers, the strategy of picking random numbers works well for finding
primes. Note that to find VERY large primes, more complicated techniques are required.

6.5 Tail Inequalities

Definition: A Tail Inequality is a bound on the probability that a random variable is far from its expec-
tation.
Example: for quicksort, we saw a bound on the expected number of comparisons of 2n ln n. Recall that
c(n) is the number of pairwise comparisons performed by quicksort on inputs of size n, and E[c(n)] ≤ 2n ln n.
It has also been shown that:

Pr[|c(n) − E[c(n)]| > ε·E[c(n)]] < n^{−ε ln ln n}

This is a tail inequality, since it means that the probability that c(n) deviates very far from E[c(n)] is
small. This is depicted in Figure 6.5. The shaded region under the solid line is the tail of the probability
distribution. In this region, the actual number of comparisons is far worse than the expected value. Our
focus will be on general tools that can be used to bound the tail of probability distributions. However, to
start with, we give a specific example.
Figure 6.5: Tail Inequality

6.5.1 Example: An Occupancy Problem

Toss m balls independently and uniformly at random into n bins. Each ball is equally likely to land in any
bin. We are interested in the following question: How many balls are in the fullest bin?
Questions of this sort actually show up quite often. For example, consider the problem of placing m data
items independently into n hash table locations, where each data item is equally likely to land in any location.
The balls into bins process is equivalent to the following question: What is the maximum number of items
that are placed in the same hash table location?
We shall consider the specific case where m = n. Let Z_i be the number of balls in bin i. Let Z be the random
variable that denotes the number of balls in the fullest bin. Thus, we see that Z = max_{i=1..n} Z_i. We analyze
Z by finding a bound for the probability that a particular bin has more than k balls in it. Let us take Z_1 as
an example. Due to symmetry, every Z_i behaves identically. We first look at the expectation of Z_1. Since
there are n balls and n bins, for every ball, the probability of occupying bin 1 is 1/n, and there are n such
balls. Thus, E[Z_1] = 1. We shall be interested in the probability that Z_1 is very far away from 1.
Pr[Z_1 ≥ k] = Pr[∃ i_1, i_2, ..., i_k s.t. balls i_1, ..., i_k all land in bin 1]. Call the event that this happens for balls
i_1, ..., i_k the event W_{i_1,i_2,...,i_k}. Thus, since we have a specific set of k balls, Pr[W_{i_1,...,i_k}] = 1/n^k. Note that if there
are more than k balls in bin 1, then multiple events of the form W_{i_1,i_2,...,i_k} happen.
The number of choices for i_1, ..., i_k is (n choose k). We now use a technique known as a union bound: Pr[A ∪ B] ≤
Pr[A] + Pr[B]. This gives us that

Pr[Z_1 ≥ k] ≤ Σ_{i_1,...,i_k} n^{−k} = (n choose k) · n^{−k}.
We next use the inequality

(n/k)^k ≤ (n choose k) ≤ (ne/k)^k.

This gives us that

Pr[Z_1 ≥ k] ≤ (ne/k)^k / n^k = (e/k)^k.

Claim: For k* = 2e ln n / ln ln n, (e/k*)^{k*} < 1/n².

Therefore, by a union bound over the n bins, Pr[Z ≥ k*] ≤ n · (1/n²) = 1/n. It also turns out that we can prove
(using different techniques) that it is quite likely that Z will not be much less than k* either.
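A quick simulation, assuming m = n as in the analysis; the threshold k* printed for comparison uses the constant 2e from the claim above.

```python
import math
import random
from collections import Counter

def max_load(n):
    """Throw n balls into n bins uniformly at random; return the maximum load."""
    counts = Counter(random.randrange(n) for _ in range(n))
    return max(counts.values())

n = 10**4
k_star = 2 * math.e * math.log(n) / math.log(math.log(n))
print("fullest bin:", max_load(n), " k* =", round(k_star, 1))
```

Typical runs land well below k*, consistent with the bound Pr[Z ≥ k*] ≤ 1/n.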
6.5.2 Markov's Inequality

We next examine our first general purpose tail inequality, called Markov's inequality. Since this inequality
only uses the expectation of the random variable (and not the variance of that random variable) it is usually
not strong enough to provide a useful bound on probability. However, it is useful in proving other tail
inequalities, such as Chebyshev's Inequality and Chernoff Bounds.

Theorem 53 [Markov's Inequality] Let X be a non-negative random variable with expectation E[X]. For any
λ > 0, Pr[X ≥ λ] ≤ E[X]/λ.

Proof:

Let Z = λ if X ≥ λ, and Z = 0 otherwise.

We see that E[X] ≥ E[Z] = λ · Pr[X ≥ λ], and thus Pr[X ≥ λ] ≤ E[X]/λ.

6.5.3 Chebyshev's Inequality

Theorem 54 (Chebyshev's Inequality)
Let X be a random variable with E[X] = μ and Var[X] = E[(X − μ)²] = σ². For any positive value t,

Pr[|X − μ| ≥ t] ≤ σ²/t².

Proof: Let Y = (X − μ)². We see that E[Y] = σ². Thus, by Markov's Inequality, Pr[Y ≥ t²] ≤ σ²/t².
However, Y ≥ t² if and only if |X − μ| ≥ t. Therefore, Pr[|X − μ| ≥ t] ≤ σ²/t².
Chebyshev's Inequality usually gives us tighter bounds than Markov's Inequality, and it works for random
variables that can take on negative values. Therefore, if we know the variance of a random variable, it gives
us a more effective way to bound the probability of the random variable being far from its expectation.
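A small numeric comparison; the coin-flip setting is our example, not from the notes. Let X be the number of heads in n fair coin flips, so E[X] = n/2 and Var[X] = n/4, and bound Pr[X ≥ 3n/4] both ways.

```python
def markov_bound(mean, a):
    """Markov: Pr[X >= a] <= E[X] / a, for a non-negative X."""
    return mean / a

def chebyshev_bound(var, t):
    """Chebyshev: Pr[|X - mu| >= t] <= Var[X] / t**2."""
    return var / (t * t)

n = 100
# Markov on X directly: Pr[X >= 3n/4] <= (n/2)/(3n/4) = 2/3, independent of n
print(markov_bound(n / 2, 3 * n / 4))
# Chebyshev: Pr[X >= 3n/4] <= Pr[|X - n/2| >= n/4] <= (n/4)/(n/4)**2 = 4/n
print(chebyshev_bound(n / 4, n / 4))
```

Chebyshev's bound of 4/n shrinks with n while Markov's stays at 2/3, illustrating why knowing the variance helps.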

6.5.4 An Application of Chebyshev's Inequality: Median Finding

We provide an example of using Chebyshevs Inequality by considering the following problem:


Input: An unsorted but ordered set S of size n.
Output: The median element m of S.
Goal: An algorithm that finds m in time O(n).
The input is "unsorted but ordered" in the sense that there is some underlying ordering for S, but S is given to
us in the form of a list where the order of the elements is arbitrary. In particular, the position of the elements
in the unsorted list is not related to the position of the elements in the underlying ordering. Examples of
such orderings are real numbers with respect to the usual "less than" relation, or words with respect to
alphabetical order. The median m is defined to be the element of S that lies exactly in the middle of the
underlying order.
Clearly we can find the median by sorting S in time Θ(n log n), but our goal is a linear time algorithm. We
will use a randomized algorithm to do this. There are non-random algorithms that find the median in linear
time, but they are complicated and have larger constants. See [CLRS01] for details.
Linear Time Algorithm Idea

Our algorithm will still find the median by using sorting, but since sorting is expensive, our algorithm will
only sort a subset of S small enough to ensure that it does not break the linear time bound.
The algorithm works by sorting a small subset R of S, where each element of R is selected uniformly at
random. We ensure that R is sufficiently small that we can sort it in time O(n). Once R is sorted, we select
elements a and b from R in such a way that the median of S lies between a and b, but that the number of
elements in S between a and b is not too large.
We can then find in linear time the ranks of a and b in S: rank_S(a) and rank_S(b) respectively. We can also
find in linear time the subset S′ of S, where S′ contains the elements of S that lie between a and b. We
choose a and b in such a way that it is likely that S′ is small enough to sort in linear time. Once S′ is sorted,
we can use the rank of a in S to find the rank of the median m of S in S′. Specifically, if rank_S(a) = x, then
m is the element of S′ with rank_{S′}(m) = n/2 − (x − 1).
The figure below illustrates the relationship between S, R, and S′. Note that while the input S is unsorted,
this figure shows the elements of S in their sorted order.

[Figure: the sorted order of S with median m; the sample R with a, the median m_R of R, and b; and S′, the
elements of S between a and b, containing m.]

The Randomized Median Finding Algorithm

The randomized median finding algorithm is defined in more detail as follows.

1. Take a sample R of size n^{3/4} from the set S uniformly at random, with replacement.

2. Sort R.

3. Find elements a and b in R such that rank_R(a) = (1/2)n^{3/4} − √n and rank_R(b) = (1/2)n^{3/4} + √n.

4. Find the set S′ = {x ∈ S : a ≤ x ≤ b}.

5. If rank_S(a) ≤ n/2 and rank_S(b) ≥ n/2 and rank_S(a) ≥ n/2 − 2n^{3/4} and rank_S(b) ≤ n/2 + 2n^{3/4},

6. Then, sort S′. Return the element m of S′ such that rank_{S′}(m) = n/2 − rank_S(a) + 1.

7. Else, return failure.
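The steps above can be sketched in Python. This is an illustrative Las Vegas variant, not part of the notes: it assumes distinct elements, uses the constants from the lecture without tuning, and simply retries on failure.

```python
import math
import random

def randomized_median(S):
    """One attempt of the randomized median-finding algorithm.
    Returns the median of S (distinct elements assumed), or None
    on the low-probability 'failure' outcome."""
    n = len(S)
    # Step 1: sample n^(3/4) elements uniformly at random, with replacement.
    r_size = max(1, int(n ** 0.75))
    # Step 2: sort the sample R.
    R = sorted(random.choice(S) for _ in range(r_size))
    # Step 3: pick a and b straddling the middle of R, sqrt(n) ranks apart.
    d = int(math.sqrt(n))
    a = R[max(0, r_size // 2 - d)]
    b = R[min(r_size - 1, r_size // 2 + d)]
    # Step 4: one linear pass gives S' and the ranks of a and b in S.
    rank_a = sum(1 for x in S if x < a)    # number of elements below a
    rank_b = sum(1 for x in S if x <= b)   # number of elements up to b
    S_prime = [x for x in S if a <= x <= b]
    half = n // 2                          # index of the median in sorted S
    # Step 5: the median must be bracketed by a and b, and S' must be small.
    if rank_a > half or rank_b <= half or len(S_prime) > 4 * r_size:
        return None                        # Step 7: failure
    # Step 6: sort S' and read the median off by rank.
    S_prime.sort()
    return S_prime[half - rank_a]

def median(S):
    """Las Vegas wrapper: retry until an attempt succeeds; each attempt
    fails with probability O(n^(-1/4)), as shown below."""
    while True:
        m = randomized_median(S)
        if m is not None:
            return m
```

Because the checks in step 5 are verified before anything is returned, a successful attempt always reports the exact median; randomness affects only the running time.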

Note that the first two tests in step 5 ensure that the median of S lies between a and b, and that the second
two tests ensure that S 0 is small enough to maintain the linear time bound. We next analyze the running
time of each step. Note that we assume the elements of S are stored in an array, and thus we can access any
element of that array in constant time.

1. O(n^{3/4}).

2. O(n^{3/4} log n^{3/4}).

3. O(1).

4. O(n).

5. O(n).

6. O(n^{3/4} log n^{3/4}).

Thus, the total running time is O(n^{3/4} log n^{3/4} + n) = O(n), since n^{3/4} log n^{3/4} = o(n). We now consider the probability
that the algorithm fails to find the median.

Theorem 55 Pr[algorithm fails] ≤ 1/n^{1/4}.

Proof: The algorithm fails only if one of the following is true.

(i) rank_S(a) > n/2.

(ii) rank_S(a) < n/2 − 2n^{3/4}.

(iii) rank_S(b) < n/2.

(iv) rank_S(b) > n/2 + 2n^{3/4}.

We will address only the first two cases, since the other two are analogous. (i) occurs only if fewer than
(1/2)n^{3/4} − √n elements of R are to the left of the median m. Let X be the number of elements in R that are
to the left of m. Then X = Σ_{i=1}^{n^{3/4}} X_i, where

X_i = 1 if the ith sample is to the left of m, and X_i = 0 otherwise.

Clearly, E[X_i] = 1/2 and by linearity of expectation, E[X] = n^{3/4}/2. Also, from the definition of variance, we
see that Var[X_i] = 1/4. We use this to find the variance of X. While we always have linearity of expectation
(i.e., the expectation of a sum is the sum of the expectations), we do NOT always have linearity of variance.
However, when the random variables that are being summed are independent, then variance does sum
linearly. Actually, a weaker condition, called pairwise independence, is sufficient. In this case, the X_i's are
independent since we chose R with replacement, and thus Var[X] = n^{3/4}/4.

We can now apply Chebyshev's inequality to bound the probability of case (i) occurring. Since (i) only
occurs if X ≤ n^{3/4}/2 − √n, we can bound the probability of case (i) using the bound

Pr[X ≤ n^{3/4}/2 − √n] ≤ Pr[|X − E[X]| ≥ √n] ≤ Var[X]/n = (n^{3/4}/4)/n = 1/(4n^{1/4}).

For case (ii), let v be the element in S such that rank_S(v) = n/2 − 2n^{3/4}. (ii) occurs only if there are more than
(1/2)n^{3/4} − √n elements in R that are to the left of v. We define Y to be the number of elements in R that are
to the left of v. By a similar argument to that used for (i) we can show that

Pr[Y ≥ (n^{3/4}/2 − 2√n) + √n] ≤ 1/(4n^{1/4}).

By symmetry, cases (iii) and (iv) each also only occur with probability at most 1/(4n^{1/4}). Thus, by a union
bound, the probability the algorithm fails is at most 1/n^{1/4}.
6.5.5 Chernoff Bounds

We now turn our attention to another tail inequality, the Chernoff bounds. These bounds can often give us
stronger bounds than Chebyshev's inequality, but they only apply to specific types of random variables. A
more complete treatment of Chernoff bounds is given in [MR95].
Consider tossing a coin n times, where for each toss, Pr[heads] = p. The number of heads in such a sequence
of independent tosses has a binomial distribution. If n is the number of tosses and p is the probability of heads on each
toss, then let B(n, p) be the random variable that denotes the number of heads seen.
For any δ > 0, a Chernoff bound for this process is as follows:

Pr[B(n, p) ≥ (1 + δ)np] ≤ (e^δ / (1 + δ)^{(1+δ)})^{np}

In the special case where 0 < δ ≤ 1, there are other (and easier to work with) versions of the Chernoff bound,
as follows:

1. Pr[B(n, p) ≤ (1 − δ)np] ≤ e^{−δ² np/2}

2. Pr[B(n, p) ≥ (1 + δ)np] ≤ e^{−δ² np/3}

Example 1: The Red Sox Have a Winning Season

Assume that the Red Sox win each game with a probability of 1/4, independently. If we define a winning
season as one in which the Sox win more than half their games, what is Pr[Sox have a winning season]?

We can apply Chernoff bounds to this problem. Here p = 1/4, n is the number of games in the season, and
we are trying to find Pr[B(n, p) ≥ n/2]. Using the second special case with δ = 1, we have

Pr[B(n, p) ≥ n/2] ≤ e^{−n/12}.

If we had used Chebyshev's inequality to analyze this problem, our bound would have been 3/n. In this
example, if n = 166 the bound that the Chernoff bound gives us on the probability of a winning season is
approximately one in a million.
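As a quick numerical sanity check (illustrative, not part of the notes), we can compare the Chernoff bound e^{−n/12} against the exact binomial tail for n = 166:

```python
import math

def chernoff_bound(n):
    """Second special-case Chernoff bound with delta = 1 and p = 1/4:
    Pr[B(n, 1/4) >= n/2] <= exp(-n/12)."""
    return math.exp(-n / 12)

def exact_tail(n, p=0.25):
    """Exact Pr[B(n, p) > n/2] (a winning season), summed directly
    from the binomial probability mass function."""
    k0 = n // 2 + 1                      # smallest integer k with k > n/2
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k0, n + 1))

n = 166
print(chernoff_bound(n))                  # roughly 1e-6: "one in a million"
print(exact_tail(n) <= chernoff_bound(n))  # the true tail is even smaller
```

The Chernoff bound is loose here, but it is exponentially small in n, whereas the Chebyshev bound 3/n decays only polynomially.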

Example 2: The Coupon Collector's Problem

In this problem we have n types of coupons. At each trial we get a random type of coupon, uniformly
and independently of how many of each type we have so far. We want to know how many trials it will
take until all types are collected. We will show here that after 4n ln n trials, the probability that we are
missing any type of coupon is at most 1/n. Stronger results on this problem are known. In particular, using
other techniques, we can show that it is quite likely that we need at least n ln n − o(n ln n), and at most
n ln n + o(n ln n) trials.
Proof: Let X be the number of coupons of type 1 after N = 4n ln n trials. Then X = Σ_{i=1}^{N} X_i, where

X_i = 1 if the ith coupon is of type 1, and X_i = 0 otherwise.

Then Pr[X_i = 1] = 1/n, so Np = 4 ln n. Using the first special case of the Chernoff bounds with δ = 1, we
get

Pr[no coupon of type 1] ≤ e^{−2 ln n} = 1/n².

So Pr[missing any type of coupon] ≤ n · (1/n²) = 1/n, since there are n types of coupons.
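A small simulation (illustrative, not part of the notes) shows that 4n ln n trials are comfortably enough, while the average number needed is close to the n ln n threshold:

```python
import math
import random

def trials_to_collect(n, rng):
    """Draw uniformly random coupon types until all n have been seen;
    return the number of trials used."""
    seen = set()
    trials = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        trials += 1
    return trials

rng = random.Random(0)
n = 200
runs = [trials_to_collect(n, rng) for _ in range(100)]
print(max(runs) <= 4 * n * math.log(n))           # 4n ln n sufficed every time
print(sum(runs) / len(runs) / (n * math.log(n)))  # ratio close to 1: about n ln n
```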
Chapter 7

NP-Completeness

We next study the concept of NP-Completeness. This is somewhat different from any of the techniques we
have examined thus far. In particular, up to now, we have focused on studying algorithms and techniques
for algorithms, and have demonstrated several results implying that efficient algorithms exist for certain
problems. We shall next consider the question of trying to show that, for certain problems, efficient algorithms
do not exist.

7.1 Preliminaries

If we want to show that efficient algorithms do not exist, we need to say precisely what we mean by an
efficient algorithm. In the context of NP-Completeness, we define an efficient algorithm to be an algorithm
which runs in polynomial time. Note that this would include a running time of O(n^{1000}), which, in terms of
an algorithm that someone would consider using, would not be considered efficient. Why then, are we using
this as our definition?

People have observed over the years that practical problems either have an algorithm with a low degree
polynomial running time, or do not have a polynomial time algorithm at all.

A polynomial running time algorithm typically requires some kind of insight into the structure of the
problem. For example, for the MST problem, we could enumerate all possible trees, and then choose
the minimum weight tree, which is an exponential time algorithm. However, by using our understanding
of the structure of the MST problem, we were able to design a polynomial time (and actually very
efficient) algorithm for this problem.

It turns out that it is often difficult to tell the difference between a problem that has an efficient algorithm
and one that does not. Consider the following examples:

Min-Cut: We can solve this problem in polynomial time.

Max-Cut: (Find a cut of maximum total weight.) There is no known polynomial time algorithm for
this problem.

Shortest Path: We can solve this problem in polynomial time.

Longest Path: (Find the longest path from vertex a to vertex b in a weighted, directed graph.) There
is no known polynomial time algorithm.

7.1.1 Decision Problems

In the context of NP-Completeness, it is more convenient to look at decision problems, which are problems
where the answer is either a yes or a no (instead of, for example, a value to be computed). Looking at
these types of problems does not restrict us in the type of problems we can describe. Algorithms for solving
decision problems can be used to provide algorithms for solving optimization problems, and vice versa.
For example, the Max-Cut problem can be stated as a decision problem:

Input: unweighted, undirected graph G, integer k.

Question: Does G have a cut of size ≥ k?

So can the Longest Path problem:

Input: weighted, directed graph G, vertices a, b, integer k.

Question: Does G have a path from a to b of length ≥ k?

It is easy to see that if we can find the optimal solution to a problem, we can also answer the related decision
problem. However, it turns out that in many cases the converse is also true: a solution to the decision version
of many problems also gives us a solution to the optimization version. We demonstrate this by describing it
for the specific example of the Max-Cut problem:

Find k* = size of Max-Cut. This can be found using the algorithm for the decision version and binary
search.

For each edge e of G do

  G′ = G − e
  If Max-Cut(G′, k*) = yes then G = G′

Return cut defined by G

It is clear that if we can answer the decision problem in polynomial time, then this algorithm also runs in
polynomial time.
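The self-reduction above can be sketched in code. The decision oracle is the assumed ingredient here: no polynomial-time implementation is known, so for illustration we plug in an exponential-time brute-force stand-in that works on small graphs.

```python
from itertools import combinations

def brute_oracle(edges, k):
    """Exponential-time stand-in for the Max-Cut decision oracle
    (small test graphs only): is there a cut of size >= k?"""
    verts = sorted({v for e in edges for v in e})
    for r in range(len(verts) + 1):
        for side in combinations(verts, r):
            s = set(side)
            if sum(1 for (u, v) in edges if (u in s) != (v in s)) >= k:
                return True
    return False

def max_cut_from_decision(G, decide):
    """Self-reduction from the notes. G is a set of edges (u, v).
    Using only the decision oracle decide(edges, k): binary-search for
    k*, then delete every edge whose removal keeps a cut of size k*."""
    lo, hi = 0, len(G)
    while lo < hi:                       # binary search for k*
        mid = (lo + hi + 1) // 2
        if decide(G, mid):
            lo = mid
        else:
            hi = mid - 1
    k_star = lo
    edges = set(G)
    for e in list(edges):                # greedy edge deletion
        if decide(edges - {e}, k_star):
            edges.discard(e)
    return edges, k_star

triangle = {(0, 1), (1, 2), (0, 2)}
edges, k = max_cut_from_decision(triangle, brute_oracle)
print(k)           # 2: the maximum cut of a triangle
print(len(edges))  # 2: the surviving edges define such a cut
```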

Claim 56 The cut defined by the graph G returned by this process will be a Max-Cut in the original graph.

Proof: The proof is by induction on the number of Max-Cuts.

Base Case: Assume there is a unique Max-Cut. No edge in that cut can be removed, and every edge
not in that cut can be removed.

Inductive Step: If we examine an edge that is in all max cuts, that edge will not be deleted. An edge
that is not in any max cut will be deleted. The first edge encountered that is in some max cuts but not
in all will be deleted, after which the number of max cuts has decreased. By the inductive hypothesis, the
algorithm will then return a max cut.
7.2 The Classes P and NP

7.2.1 Definition

We now define the classes P and NP. Intuitively, P is the set of problems where we can answer the decision
problem efficiently. NP is the set of problems where we can efficiently verify a proof that we have a "YES" instance
of the problem.

Definition 57 A decision problem Π ∈ P if and only if ∃ a polynomial time algorithm A such that
(X is a YES instance of Π) if and only if (A(X) = YES).

Definition 58 A decision problem Π ∈ NP if and only if ∃ a polynomial time algorithm A such that (X is a
YES instance of Π) if and only if (∃ Y with |Y| = O(|X|^k), and A(X, Y) = YES).

Y is a "witness" or a "proof" that somehow convinces us that X is a YES instance of the problem. This
proof can be verified in polynomial time using the algorithm A. Note that if a problem is in NP, it is not
necessarily the case that we can actually find the witness Y in polynomial time. Instead, if we are given the
witness Y, then we can verify it quickly. Thus, even though we define NP in terms of an efficient algorithm,
it does not imply that we can solve the problem efficiently.

Example: Max-Cut ∈ NP

For this problem, think of Y as the description of a specific cut of size ≥ k. The algorithm A is as follows:
Return YES if and only if Y is a cut in the graph given in X and |Y| ≥ k. It is easy to design a polynomial
time algorithm that does this. Furthermore, every graph G that is a YES instance has a Y that allows
such an algorithm A to return YES, and every graph that is a NO instance has no such Y. Note that
the existence of such an algorithm A does not necessarily mean that we can efficiently find the cut Y (or
even determine whether or not such a cut exists). Instead, the algorithm A verifies that Y is in fact a valid
witness to the fact that the graph has a large cut.

Example: Longest Path ∈ NP

Think of Y as the description of a specific path from a to b with length ≥ k. The algorithm A is as follows:
Return YES if and only if Y is a path from a to b and |Y| ≥ k.
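A verifier A(X, Y) for the Max-Cut witness can be written directly. This is a sketch; the instance encoding (an edge list and a vertex subset) is our own choice, not fixed by the notes.

```python
def verify_cut(edges, k, witness):
    """A(X, Y) for the Max-Cut decision problem: X = (edges, k),
    Y = one side of a proposed cut. Polynomial time: count the
    crossing edges and compare against k."""
    side = set(witness)
    crossing = sum(1 for (u, v) in edges if (u in side) != (v in side))
    return crossing >= k

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-cycle
print(verify_cut(edges, 4, [0, 2]))          # True: all 4 edges cross
print(verify_cut(edges, 4, [0, 1]))          # False: only 2 edges cross
```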

7.2.2 Properties of P and NP


1. If Π ∈ NP, then Π has an exponential time solution. This is the case, since we could always enumerate
through all possible witnesses Y, and check each one.

2. P ⊆ NP. To see why, note that if a problem Π ∈ P, then we can use the algorithm A(X) implied
by the definition of P to give the algorithm A(X, Y) required by the definition of NP. The algorithm
A(X, Y) simply ignores the input Y, and runs the algorithm A(X) on the input X.

There are some problems in NP (such as the Max-Cut problem) for which there is no known polynomial
time solution. We would like to be able to show that such problems are not in P, but no one has thus far
been able to do this. This leads us to the following question, which some people think is the most important
unsolved question in mathematics: Is P = NP? Most people think that P ≠ NP. If we could prove this,
then we could also show that the Max-Cut problem is not in P.

Note: NP stands for "nondeterministic polynomial time", which comes from an alternative but equivalent
definition of NP that uses something called a Non-Deterministic Turing Machine.
Figure 7.1: P and NP. We would like to show that there exists a problem that is in NP, but not in P, as
the Max-Cut problem is depicted here.

Figure 7.2: Polynomial time reduction.

7.2.3 Polynomial Time Reduction

While we have not been able to prove that P ≠ NP, we can give very strong evidence that certain problems
are not in P. In particular, we can show that if these problems are in P, then it is actually the case that P
= NP. Since we do not believe that this is true, it means that we also do not believe that these problems
are in P.

In order to do so, we want to be able to show that for two problems Π₁ and Π₂, if Π₁ ∈ P then Π₂ ∈ P.
Note that this also implies that if Π₂ ∉ P then Π₁ ∉ P.

Definition 59 Π₂ is polynomial time reducible to Π₁, (Π₂ ≤_p Π₁), if and only if ∃ a polynomial time algorithm
f that transforms any instance X of Π₂ to an instance f(X) of Π₁ such that (X is a YES instance of
Π₂) ⟺ (f(X) is a YES instance of Π₁).

Note: The function f above maps any YES instance of Π₂ to a YES instance of Π₁ and any NO
instance of Π₂ to a NO instance of Π₁. It is not necessarily either a one-to-one or an onto mapping. See
Figure 7.2.
To show that the implication "if Π₁ ∈ P then Π₂ ∈ P" holds, all that we need to show is that Π₂ ≤_p Π₁. In
order to do this, all that we need to do is to demonstrate that f exists. To see why, note that if f exists,
and Π₁ ∈ P, then we can solve problem Π₂ as follows: take an input to Π₂, and use f to map it to an input
to Π₁. Then solve Π₁, and return the resulting answer. This is guaranteed to be the correct answer to the
original question. This is depicted in Figure 7.3.

Figure 7.3: Diagrammatic proof that if Π₂ ≤_p Π₁ and Π₁ ∈ P, then Π₂ ∈ P.

7.2.4 NP-Completeness

Definition 60 A decision problem Π is NP-Hard if and only if ∀Π′ ∈ NP, Π′ ≤_p Π.

Definition 61 A decision problem Π is NP-Complete if and only if

Π is NP-Hard

Π ∈ NP

Note that if problem Π is NP-Complete and Π ∈ P, then P = NP. This follows from the fact that P ⊆ NP,
and that if everything in NP can be reduced to a problem that is in P, then NP ⊆ P. Since we believe that
P ≠ NP, showing that a problem is NP-Complete gives us strong evidence that the problem is not in P.

7.3 Examples of NP-Complete problems

7.3.1 The SAT Problem

The satisfiability problem, SAT, is as follows:

Input: A Boolean formula φ (a formula of Boolean variables with operations ∧, ∨, ¬).

Question: Is φ satisfiable?

Here are two examples of Boolean formulas:

φ = (X₁ ∨ X₂) ∧ (¬X₂), which is satisfiable.

φ = X₁ ∧ (¬X₁) ∧ X₂, which is not satisfiable.


Figure 7.4: Reduction from Π′′ to Π.

SAT was the first problem shown to be NP-Complete. In particular, the following is one of the most
important theorems in all of Computer Science:

Theorem 62 (Cook 1971) SAT is NP-Complete.

The proof of the above theorem is beyond the scope of this course. Please refer to [CLRS], Chapter 34 for
details. The implications of Cook's Theorem have had an enormous impact on all of Computer Science.
Instrumental in this was a paper from 1972 by Karp, showing that the NP-Completeness of SAT can be used
to show that many other important problems are NP-Complete. In the time between 1972 and the present,
literally thousands of problems have been shown to be NP-Complete.

How to show that a problem is NP-Complete

To show that a problem Π is NP-Complete:

1. Show that Π ∈ NP.

2. Choose some Π′ already known to be NP-Complete. Show that Π′ ≤_p Π.

Step 2 demonstrates that ∀Π′′ ∈ NP, Π′′ ≤_p Π. This is the case, since we know that Π′′ ≤_p Π′, ∀Π′′ ∈ NP.
Note that Π′′ ≤_p Π′ and Π′ ≤_p Π implies that Π′′ ≤_p Π, as depicted in Figure 7.4.

7.3.2 3-SAT

Definition 63 The 3-SAT problem is defined as follows:

Input: A Boolean formula φ in Conjunctive Normal Form (CNF) with 3 literals per clause. Conjunctive
Normal Form means that the formula is the logical AND of a set of clauses, where each clause is the
logical OR of a set of literals. A literal is either a variable or the negation of a variable.

Question: Is φ satisfiable? (i.e., is there some assignment of values to variables that makes the value of
φ true?)
Figure 7.5: Sample graph with two cliques, one of size 3 and one of size 4.

Example: φ = (a ∨ b ∨ c) ∧ (a ∨ b ∨ d) ∧ ….

Just like the SAT problem, the 3-SAT problem is also NP-complete. Due to the convenient form of the
3-SAT problem, this fact can be used to prove that other problems are NP-complete.

7.3.3 The Clique Problem

Definition 64 A clique of size k in graph G is a completely connected subgraph of G with k vertices.

Figure 7.5 shows an example.

Definition 65 The clique decision problem is:

Input: Graph G = (V, E) and integer k.

Question: Does G contain a clique of size k?

Note that asking if a graph has a clique of size k is equivalent to asking if a graph has a clique of size at
least k, since we could also take a subset of k vertices from a larger clique.

Theorem 66 Clique is NP-complete.

Proof: We first prove that Clique ∈ NP. The algorithm A(G, y) from the definition of NP verifies that y
is a clique of size k in the graph described by G = (V, E). The verification can be done in polynomial time
by checking that y ⊆ V, that ∀u, v ∈ y, (u, v) ∈ E, and that |y| = k.
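As a sketch, the verifier A(G, y) can be written out directly; the adjacency representation (edges as frozensets) is our own choice for illustration.

```python
from itertools import combinations

def verify_clique(V, E, k, y):
    """The verification algorithm A(G, y): check y is a subset of V,
    |y| = k, and that every pair of vertices in y is joined by an edge.
    E stores edges as frozensets so direction does not matter."""
    y = set(y)
    return (y <= set(V) and len(y) == k and
            all(frozenset((u, v)) in E for u, v in combinations(y, 2)))

V = [1, 2, 3, 4]
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
print(verify_clique(V, E, 3, [1, 2, 3]))  # True: {1, 2, 3} is a triangle
print(verify_clique(V, E, 3, [1, 2, 4]))  # False: edges (1,4), (2,4) missing
```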
We next prove that 3-SAT ≤_p Clique. To do so we describe a polynomial-time reduction that converts an
input instance x of 3-SAT to a corresponding input instance of the clique problem.

CNF formula φ → [polynomial time transformation: function f] → (G_φ, k_φ)

The transformation must have the following properties:

1. It must be polynomial time.

2. If φ is satisfiable then G_φ must have a clique of size k_φ.

3. If φ is not satisfiable then G_φ does not have a clique of size k_φ. (This is equivalent to showing that if G_φ
has a clique of size k_φ, then φ must be satisfiable.)

Figure 7.6: Graph G_φ derived from the example 3-CNF formula φ, with one node per literal. Edges are shown
only for the first clause of φ.

We use the following transformation:

Let k_φ = the number of clauses in φ.

Let G_φ = (V, E) be a graph such that there is one node per literal in φ. In particular, node n_ij ∈ V
corresponds to the jth literal in the ith clause of the formula. The edge (n_ij, n_hk) ∈ E if and only if
i ≠ h (i.e., the literals belong to different clauses) and l_ij ≠ ¬l_hk (i.e., both literals can consistently
be set to true at the same time).

Example: In Figure 7.6, we see the graph G_φ that is produced by the transformation on the formula

φ = (a ∨ b ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c) ∧ (¬a ∨ b ∨ d)

Note that the transformation also produces k_φ = 3.
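The reduction is easy to mechanize. In this illustrative sketch, a literal is encoded as (variable, is_positive), an encoding we choose for convenience rather than one fixed by the notes.

```python
from itertools import combinations

def sat_to_clique(clauses):
    """3-SAT -> Clique reduction from the notes. A node is
    (clause index, literal); edges join literals from different
    clauses that are not complements of each other."""
    nodes = [(i, lit) for i, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for (i, (v1, s1)), (h, (v2, s2)) in combinations(nodes, 2):
        if i != h and not (v1 == v2 and s1 != s2):
            edges.add(frozenset([(i, (v1, s1)), (h, (v2, s2))]))
    return nodes, edges, len(clauses)

# phi = (a or b or c) and (not a or not b or not c)
phi = [[('a', True), ('b', True), ('c', True)],
       [('a', False), ('b', False), ('c', False)]]
nodes, edges, k = sat_to_clique(phi)
# Consistent literals in different clauses are adjacent:
print(frozenset([(0, ('a', True)), (1, ('b', False))]) in edges)  # True
# Complementary literals are never adjacent:
print(frozenset([(0, ('a', True)), (1, ('a', False))]) in edges)  # False
```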


We need to show that the properties of the transformation are satisfied.

1. To compute k_φ, we only need to count the number of clauses. The graph G_φ can also be computed
from φ in polynomial time. To create the nodes of the graph, assign one node for each literal. To create
the edges, consider each possible node pair and check whether there exists an edge between the two
nodes.

2. Claim 67 If φ has a satisfying assignment, then G_φ has a clique of size k_φ.

At least one literal has to be true in each clause under the satisfying assignment. The literals are
consistent, as they cannot be true and false at the same time, i.e., x = true and ¬x = true is not possible.
Pick one true literal per clause and form the corresponding set of nodes for those literals in G_φ. As
no node is a complement of any other node, and all nodes are in different clauses, we have edges from
each node to all other nodes in the set. This gives us a clique of size k_φ.

3. Claim 68 If G_φ has a clique of size k_φ, then φ is satisfiable.

No edges in G_φ connect vertices in the same clause, and so each vertex in the clique corresponds to
exactly one literal per clause. Also, since no edges are present between a node for a literal and a node
for the complement of the literal, we can safely set all the literals that correspond to nodes in the clique
to true. Each clause is thus true, and so φ is satisfied.
Figure 7.7: A graph and two possible vertex covers.

Thus, we have reduced the 3-SAT problem in polynomial time to the clique problem, and the transformation
used satisfies all the required properties. This concludes the proof of the theorem that Clique is NP-complete.

7.3.4 The Vertex Cover Problem

Definition 69 A vertex cover of the undirected graph G = (V, E) is a subset V′ ⊆ V of the vertices of G
such that the vertices in V′ touch every edge in E.

In other words, ∀(u, v) ∈ E, either u ∈ V′ or v ∈ V′. See Figure 7.7 for an example.

Definition 70 The vertex cover problem.

Input: A graph G = (V, E), and an integer k.

Question: Does G contain a vertex cover V′ of size k?

Note that asking if G has a vertex cover of size k is equivalent to asking if there is a vertex cover of size ≤ k,
since we can always add vertices to a smaller cover.

Theorem 71 The vertex cover problem is NP-complete.

Proof:

1. The vertex cover problem ∈ NP. To see this, we interpret the witness Y as a subset of the vertices
that forms a vertex cover. We can verify that Y is a valid vertex cover of G in polynomial time by
inspecting each edge e ∈ E to verify that it is touched by a vertex in Y. It is also easy to verify that
|Y| = k.

2. We show that 3-SAT ≤_p VERTEX-COVER. In particular, given any 3-CNF formula φ, our
transformation produces (G_φ, k_φ) such that G_φ has a vertex cover of size k_φ if and only if φ is
satisfiable. This transformation proceeds as follows:

Let k_φ = V + 2m, where V is the number of variables and m is the number of clauses.

Build G_φ as follows. For each variable X, add a subgraph as depicted in Figure 7.8. For each clause
(l1 ∨ l2 ∨ l3), add a subgraph as depicted in Figure 7.9. In addition, connect each literal in the
clause subgraph to the appropriate side of the variable subgraph.
Figure 7.8: The subgraph for a variable X: two nodes, labeled T and F, joined by an edge.

Figure 7.9: The subgraph for a clause: a triangle on nodes l1, l2, l3, with each literal node also connected to
the matching node of its variable's subgraph (e.g., l1 = ¬y connects to the F-node of variable y, and l3 = x
connects to the T-node of variable x).

Consider the example where φ = (a ∨ b ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c). This results in k_φ = V + 2m = 3 + 2 · 2 = 7, and

the graph G_φ depicted in Figure 7.10.
The transformation of the formula φ into (G_φ, k_φ) can be performed in polynomial time, since to construct G_φ
only V variables, m clauses and 3m literals must be considered to create and connect the corresponding
nodes of G_φ.

Claim 72 If φ is satisfiable, then G_φ has a vertex cover of size V + 2m.

Proof: If φ is satisfiable, there must exist an assignment A = {x1 = True, x2 = False, . . .} of the variables
that makes φ(A) true. The subset W of the vertices of G_φ is now chosen as follows:

For each variable x in φ, we have two vertices x_T, x_F in G_φ. If x = true in the assignment A, include x_T in
W, otherwise x_F.

After having done this for each variable, W contains V vertices. These vertices cover the edges of G_φ that
correspond to variables in φ, and all the edges that connect nodes in the variable and clause components
of G_φ that correspond to true literals under the assignment A.

For each clause (l1 ∨ l2 ∨ l3) in φ, we have three vertices v_l1, v_l2, and v_l3. Since the assignment A makes φ(A)
true, there must be at least one literal l_i in each clause which is true. The corresponding vertex v_li does
not need to be included in the vertex cover W, because the edge from v_li to the corresponding variable
vertex x_T (or x_F, if l_i = ¬x) is already covered by x_T (or x_F), which is in W. So we include the remaining
vertices of the clause, v_lj and v_lk, in W. As a result, all the edges in the clause part of G_φ are covered, and
also the yet untouched edges connecting the variable and clause components of the graph. Thus W is
a vertex cover which contains V + 2m vertices, and the above claim is proven.

Figure 7.10: The graph corresponding to (a ∨ b ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c).

Claim 73 If G_φ has a vertex cover of size V + 2m, then φ is satisfiable.

Proof: Suppose that a vertex cover W of size V + 2m on G_φ is given. Looking at some variable x in φ
and its corresponding vertices x_T and x_F in G_φ, we can see that one of these vertices x_T, x_F must be in W,
otherwise the edge connecting the two vertices would not be touched by any vertex in W. That means W
has to contain at least V variable vertices, and each variable contributes at least one variable vertex to
W. Looking at some clause (l1 ∨ l2 ∨ l3) in φ, and its corresponding vertices v_l1, v_l2, and v_l3 in G_φ, we can see
that at least two of these vertices must be in W, otherwise one edge of this clause segment of the graph
would not be touched by any vertex in W. So W has to contain at least 2m clause vertices, and each clause
contributes at least two clause vertices to W.

Since W contains only V + 2m vertices, it follows that each variable contributes exactly one variable vertex
and each clause contributes exactly two clause vertices to W. Thus, we can choose an assignment A where
we set each variable based on which of its variable vertices is in the set W. In particular, if x_T ∈ W, then
x = true, and otherwise x_F must be in W and x is set to false.

We now see that with this assignment of variables, φ(A) must evaluate to true. To see this, consider any
clause c, and recall that there are only two clause vertices of c, say v_l1 and v_l2, in W. So, for W to be a
valid vertex cover, the edge e coming out of v_l3 must be covered by the variable vertex at the other end of e.
Thus, from the way we constructed the assignment A, we see that there must be at least one literal in each
clause that evaluates to true. This implies that the entire formula evaluates to true.

The last two proven claims conclude the proof that VERTEX-COVER is NP-complete.
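The gadget construction above can be sketched in code, using the same illustrative literal encoding (variable, is_positive) as before; the node labels are our own choice.

```python
def sat_to_vertex_cover(clauses):
    """3-SAT -> VERTEX-COVER reduction from the notes. Variable gadget:
    an edge (x,'T')--(x,'F'). Clause gadget i: a triangle on
    (i,0), (i,1), (i,2), each corner joined to the matching side of its
    variable's gadget. Variable names are assumed to be strings, so
    gadget labels cannot collide with the (int, int) clause nodes."""
    variables = sorted({v for clause in clauses for (v, _) in clause})
    edges = set()
    for x in variables:
        edges.add(frozenset([(x, 'T'), (x, 'F')]))        # variable edge
    for i, clause in enumerate(clauses):
        for j in range(3):                                 # clause triangle
            edges.add(frozenset([(i, j), (i, (j + 1) % 3)]))
        for j, (v, positive) in enumerate(clause):         # connecting edges
            edges.add(frozenset([(i, j), (v, 'T' if positive else 'F')]))
    k = len(variables) + 2 * len(clauses)
    return edges, k

phi = [[('a', True), ('b', True), ('c', True)],
       [('a', False), ('b', False), ('c', False)]]
edges, k = sat_to_vertex_cover(phi)
print(k)           # 7 = 3 variables + 2 * 2 clauses, as in the text
print(len(edges))  # 15 = 3 variable edges + 6 triangle edges + 6 connectors
```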

7.3.5 The Subset-Sum Problem

The subset-sum problem is different from the other problems we have looked at so far in the context of
NP-Completeness. First, it deals with integers, rather than graphs. Also, it is an NP-Complete problem
that has an algorithm that runs in time that is polynomial in the magnitude of the integers appearing in the input.
Note that this is not the same as running in polynomial time.

Definition 74 The subset-sum problem.

Input: A set S of n integers, {s1, s2, . . . , sn}, and a target integer t.

Question: Is there a subset S′ ⊆ S such that t = Σ_{s∈S′} s?

Theorem 75 The subset-sum problem is NP-Complete.

Proof:

1. We first show that the subset-sum problem is in NP. To see this, note that verification that a witness
Y ⊆ S sums up to t can be done in polynomial time, since we can add a set of integers in time
proportional to the description of those integers.
Figure 7.11: The two parts of an integer s_i, written in decimal: the first part has one digit per variable, and
the second part has one digit per clause.

Set S    First Part    Second Part
s1       1             001
s′1      1             010
s2       10            110
s′2      10            001
s3       100           001
s′3      100           100
s4       1000          100
s′4      1000          010
h1                     1
h2                     1
h3                     10
h4                     10
h5                     100
h6                     100
t        1111          333

Table 7.1: Example reduction from a 3-SAT input to a subset-sum problem.

2. We next show that 3-SAT ≤_p subset-sum. Given φ, construct a set of integers S and a target integer
t such that there is a valid sum if and only if φ is satisfiable. The integers (represented in the decimal
system) have two parts, as depicted in Figure 7.11. For each variable X_i two integers are constructed:
s_i to represent X_i, and s′_i to represent ¬X_i. For both of these integers, s_i and s′_i, the first part (the most
significant digits) is equal to 10^{i−1}. This means that the ith digit from the right is a 1 and the rest
are 0s.

The second part (the least significant digits) consists of m digits, where m is the number of clauses in
φ. It is constructed as follows:

If X_i is in clause j, then the jth digit from the right in s_i is a 1; otherwise the jth digit is 0.

If ¬X_i is in clause j, then the jth digit from the right in s′_i is a 1; otherwise the jth digit is 0.

Additionally, we add helper integers h_i to S. For each clause j, we add two copies of the integer 10^{j−1}.

Finally, the target integer t has a 1 in each digit of the first part (one digit per variable) and a 3 in each
digit of the second part (one digit per clause): t = 11...1 33...3.

Example: The set of integers S in Table 7.1 are the result of using these rules for the formula φ =
(x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x2 ∨ ¬x4) ∧ (x2 ∨ ¬x3 ∨ x4). Note that for this input to 3-SAT, t = 1111333, since we have
4 variables and 3 clauses.
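The construction is easy to mechanize. In this illustrative sketch, a literal is encoded as (i, is_positive) for variable x_i, an encoding we choose for convenience.

```python
def sat_to_subset_sum(clauses, num_vars):
    """3-SAT -> subset-sum reduction from the notes. Digit j of the
    second part (from the right, 0-based) marks membership in clause
    j+1; the first part marks which variable an integer represents."""
    m = len(clauses)
    S = []
    for i in range(1, num_vars + 1):
        for positive in (True, False):       # s_i, then s'_i
            first = 10 ** (m + i - 1)
            second = sum(10 ** j for j, clause in enumerate(clauses)
                         if (i, positive) in clause)
            S.append(first + second)
    for j in range(m):                       # two helper integers per clause
        S += [10 ** j, 10 ** j]
    t = sum(10 ** (m + i - 1) for i in range(1, num_vars + 1))
    t += sum(3 * 10 ** j for j in range(m))
    return S, t

# phi = (x1 or ~x2 or x3) and (~x1 or x2 or ~x4) and (x2 or ~x3 or x4)
phi = [[(1, True), (2, False), (3, True)],
       [(1, False), (2, True), (4, False)],
       [(2, True), (3, False), (4, True)]]
S, t = sat_to_subset_sum(phi, 4)
print(t)           # 1111333, matching Table 7.1
print(1001 in S)   # True: s_1 has first part 1 and second part 001
```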
Now we need to show that given , this construction yields an appropriate set S of integers and target
integer t.
Claim 76 φ is satisfiable if and only if there is a valid subset S′. In other words, if we started with a
"yes" instance, this process of transforming the formula into a set of integers leads to a "yes" answer in the
subset-sum problem, and otherwise it leads to a "no" answer.

Proof: (⇒): Assume φ has a satisfying assignment. We use this satisfying assignment to construct the
set of integers that sum to the correct value. For each pair, include either s_i or s′_i. The one we include
corresponds to the true literal in the satisfying assignment. The sum of the integers in the subset is of the
following form:

1 1 ... 1 y1 y2 y3 ... ym

where each 1 in the first part corresponds to either s_i or s′_i, and each y_i in the second part corresponds to
the number of true literals in a clause. Also, ∀i, y_i ≥ 1, because for φ to be satisfiable, there must be at
least one true literal per clause. We can add zero, one, or two helper integers to each y_i so that it sums to
3, and so enough of these helper integers are added such that S′ sums to t. Thus, we can construct a valid
solution subset S′ for the subset-sum problem (S, t).

(⇐): Assume we have a valid sum. From this sum we construct a set of variables that satisfies the original
formula. First, we note that there are no carries when we add up each column, because for any digit of
the integers the sum cannot be more than 5 (there can be a maximum of three true literals and two helper
integers). This implies that if we have a valid sum that adds to t, then we must have exactly one of s_i or s′_i,
since we must achieve a sum of 1 for each corresponding column.

We can set the variables according to which of these two is included in the sum. This is guaranteed to satisfy
the original formula. To see that this is the case, note that if it does not satisfy the original formula, then
there must be some column where no chosen s_i or s′_i has a 1 in that column. However, there are at most 2 helper
integers with a 1 in that column, and thus without a 1 in one of the s_i's or s′_i's, there is no way for that
column to sum to 3. Therefore the original formula is satisfied.
This concludes the proof that subset-sum is NP-Complete.
The key thing to note from this problem is the size of the integers we obtain from the reduction. Given
a 3-SAT input, we can produce the integers in polynomial time. This is because the number of digits in
the integers is linear in the size of the 3-SAT formula. However, note that the actual value of the integers
is exponential in this size. This is why the fact that we can solve the subset-sum problem in time that is
polynomial in the values of the integers appearing in the input does not imply that P = NP.
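To make the digits-versus-value distinction concrete, here is a tiny sketch (the counts of 30 variables and 50 clauses are arbitrary illustrative numbers): each integer built by the reduction has one decimal digit per variable plus one per clause, so its written length is polynomial in the formula size, while its value is around 10^(n+m).

```python
# Integers produced by the 3-SAT -> subset-sum reduction have n + m
# decimal digits for a formula with n variables and m clauses: polynomial
# length, but exponentially large value.

def reduction_integer_stats(n_vars, n_clauses):
    digits = n_vars + n_clauses       # length of each integer, in digits
    max_value = 10 ** digits - 1      # largest value that length can hold
    return digits, max_value

digits, max_value = reduction_integer_stats(30, 50)
print(digits)                  # 80 digits -- easy to write down
print(max_value > 2 ** 200)    # True -- but the value is astronomical
```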

7.3.6 The Max-Cut Problem

We next prove that the max-cut problem is NP-Complete. To do this, we first prove that a problem called
not-all-equal 3-SAT (NA≠ 3-SAT) is NP-Complete. We'll then reduce NA≠ 3-SAT to max-cut.

Not All Equal 3-SAT (NA≠ 3-SAT)

Definition 77 The NA≠ 3-SAT problem.

Input: A 3-CNF formula φ.

Question: Is there a satisfying assignment to φ that has at least one false literal per clause?
Note that this implies that each clause has at least one true literal, and at least one false literal.

Theorem 78 NA≠ 3-SAT is NP-Complete.

Proof:

1. The NA≠ 3-SAT problem is in NP. We can verify that an assignment is NA≠-satisfying in polynomial time. So,
NA≠ 3-SAT is in NP.

2. 3-SAT ≤p NA≠ 3-SAT.


We transform an input φ to the 3-SAT problem into an input φ' to the NA≠ 3-SAT problem. For each clause
(l_i1 ∨ l_i2 ∨ l_i3) of φ, we add two variables x_i and y_i and replace the clause with three new clauses in φ'
as follows, where the new variable α is shared between clauses:

φ = ... ∧ (l_i1 ∨ l_i2 ∨ l_i3) ∧ ...

φ' = ... ∧ (l_i1 ∨ l_i2 ∨ x_i) ∧ (¬x_i ∨ l_i3 ∨ y_i) ∧ (x_i ∨ y_i ∨ α) ∧ ...

Note that the transformation from φ to φ' can be done in polynomial time because there is a constant amount
of work being done per clause.

Claim 79 φ is satisfiable if and only if φ' is NA≠-satisfiable.

Proof: (⇒): Given a satisfying assignment to φ, we construct a satisfying assignment to φ'. Set α = true.
Consider some clause i. In the satisfying assignment to φ, there are 7 possible settings of l_i1, l_i2, and l_i3,
the three literals of clause i. These settings are all the possible settings except the case where l_i1, l_i2, and
l_i3 are all false. For each of these settings, there is a way to set x_i and y_i such that φ' is NA≠-satisfied.
To see this, we can go through all 7 possibilities. For example, if l_i1, l_i2, and l_i3 are all true, then we can
set x_i = false and y_i = false. This results in a valid assignment. Similarly, we can do this for the other 6
possible settings of l_i1, l_i2, and l_i3.
(⇐): We first show that we can assume α = true. To show this, we claim that if φ' is NA≠-satisfiable
then there is an assignment of variables such that the formula is NA≠-satisfied when α = true. In any valid
assignment, φ' has at least 1 true and 1 false literal per clause. If α is false in the valid assignment, we
can flip all the variables in the formula. This gives us an assignment with α equal to true while still having
at least one true and one false literal per clause.

Claim 80 At least one of l_i1, l_i2, or l_i3 must be true.

Proof: In φ', if α is true then either x_i or y_i is false. If x_i is false then l_i1 or l_i2 must be true. If x_i is true
and y_i is false then l_i3 must be true. Hence, at least one of the three literals must be true.
Thus, we can take the assignment from φ' and map it back to φ and have at least one true literal per clause.
Therefore, φ is satisfiable.
This concludes the proof that NA≠ 3-SAT is NP-Complete.
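The clause gadget in this reduction can be checked exhaustively. Below is a minimal sketch (assuming the gadget clauses are (l1 ∨ l2 ∨ x), (¬x ∨ l3 ∨ y), and (x ∨ y ∨ α), the form consistent with the arguments in the proofs of Claims 79 and 80; the printed formula is partially garbled) verifying Claim 79 at the level of a single clause:

```python
from itertools import product

# Brute-force check of the clause gadget: (l1 v l2 v l3) is replaced by
# (l1 v l2 v x), (not-x v l3 v y), (x v y v alpha).  We verify that the
# original clause is satisfiable exactly when the gadget is
# NA!=-satisfiable with the shared variable alpha set to true.

def nae(*lits):
    # NA!=-satisfied: at least one true literal and at least one false one.
    return any(lits) and not all(lits)

def gadget_ok(l1, l2, l3):
    # Can x and y be chosen so all three gadget clauses are NA!=-satisfied?
    alpha = True
    return any(
        nae(l1, l2, x) and nae(not x, l3, y) and nae(x, y, alpha)
        for x, y in product([False, True], repeat=2)
    )

for l1, l2, l3 in product([False, True], repeat=3):
    assert (l1 or l2 or l3) == gadget_ok(l1, l2, l3)
print("gadget agrees with (l1 v l2 v l3) on all 8 assignments")
```

Because flipping every variable preserves NA≠-satisfaction, fixing α = true here loses no generality, exactly as argued in the (⇐) direction of Claim 79.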
Figure 7.12: Reduction of NA≠ 3-SAT to max-cut, shown on the vertices a, ¬a, b, ¬b, c, ¬c. The dashed line defines a maximum cut of the graph.

Proof that Max-Cut is NP-Complete

Theorem 81 NA≠ 3-SAT ≤p Max-Cut.

Proof: Let φ be an input to NA≠ 3-SAT. First, to simplify the reduction, we convert φ into a restricted
but equivalent form. Convert φ into a formula φ' such that:

1. No clause has the form (X ∨ ¬X ∨ ...), where X is any literal. This is easy to ensure because such a clause
will always be NA≠-satisfied, so we can simply remove it from φ'.
2. No two clauses share the same pair of literals. That is, no two clauses have the form (a ∨ b ∨ c) and
(a ∨ b ∨ d), for literals a, b, c, and d. If φ contains any clauses of this form, we can introduce two variables
x and y and substitute the equivalent expression (a ∨ b ∨ c) ∧ (a ∨ x ∨ d) ∧ (b ∨ ¬x ∨ y) ∧ (b ∨ ¬x ∨ ¬y) for
(a ∨ b ∨ c) ∧ (a ∨ b ∨ d).
Note that the resulting formula is NA≠-satisfiable if and only if the original formula is NA≠-satisfiable,
since in any such assignment, it must be the case that x and b are assigned the same value.
Now, we can use φ' to construct an input to the max-cut problem. Let k = v + 2m, where k is the target cut size
for the max-cut problem, v is the number of variables in φ' and m is the number of clauses in φ'.
Define a graph G as follows:

1. Add two vertices X_i and ¬X_i for every variable X_i in φ'.
2. Add an edge between two vertices if and only if either (a) the corresponding literals occur in the same
clause, or (b) the vertices correspond to a variable and its complement.

Here is an example of this reduction. Suppose φ' = (a ∨ b ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c). Then G is given in Figure 7.12
and k = v + 2m = 7.

Claim 82 φ' is NA≠-satisfiable if and only if G has a cut of size at least k.
Figure 7.13: Representation of a clause.

Figure 7.14: Possible structures of NP. (A) P ≠ NP, with problems in NP that are neither in P nor NP-Complete; (B) P ≠ NP, with every problem in NP either in P or NP-Complete; (C) P = NP.

Proof: (⇒): Let us show that if φ' is NA≠-satisfiable, then G has a cut of size at least k. Suppose that φ'
has some satisfying assignment. Construct a cut in G as follows. Put the vertices that correspond to true
literals on one side of the cut, and the vertices that correspond to the false literals on the other side.
The cut will cross one edge for each variable in φ'. Each clause can be represented as a triangle as shown
in Figure 7.13. Since each clause has at least one true literal, and at least one false literal, two literals will
be on one side of the cut and one literal will be on the other. Therefore, the cut will cross 2 edges for each
clause (triangle). None of these variable edges and clause edges are shared, since we specified that no clause
contains a variable and its complement. Also, none of these edges are shared between clauses because we
specified that no pair of literals appears in more than one clause. Thus, the size of the cut is v + 2m.
(⇐): We show that if G has a cut of size at least k, then φ' is NA≠-satisfiable. Each variable can contribute
at most one edge to the cut. Each clause can contribute at most two edges to the cut. So the maximum cut
possible has size v + 2m. If there exists a cut of this size, then it must cut every clause triangle and every variable
edge. This means that we can construct a NA≠-satisfying assignment for φ' by choosing either side of the cut to
represent the true literals. This must result in at least one true literal per clause and at least one false literal
per clause.
This concludes the proof that Max-Cut is NP-Complete.
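On small instances the whole reduction can be checked by brute force. The sketch below builds the graph of Theorem 81 for the formula (a ∨ b ∨ c) ∧ (¬a ∨ ¬b ∨ ¬c) (this example instance is an assumption, since the complements in the printed example are garbled) and confirms that the maximum cut reaches k = v + 2m exactly because the formula is NA≠-satisfiable:

```python
from itertools import product, combinations

# NA!=3-SAT -> max-cut, as in Theorem 81.  Literals are signed integers:
# +i stands for X_i, -i for its complement.

def build_graph(formula, n_vars):
    vertices = [s * i for i in range(1, n_vars + 1) for s in (1, -1)]
    edges = set()
    for i in range(1, n_vars + 1):
        edges.add((-i, i))                      # variable/complement edges
    for clause in formula:
        for u, v in combinations(clause, 2):    # clause-triangle edges
            edges.add(tuple(sorted((u, v))))
    return vertices, edges

def max_cut_size(vertices, edges):
    # Brute force over all bipartitions (fine for tiny instances only).
    best = 0
    for bits in product([0, 1], repeat=len(vertices)):
        side = dict(zip(vertices, bits))
        best = max(best, sum(side[u] != side[v] for u, v in edges))
    return best

def nae_satisfiable(formula, n_vars):
    for bits in product([False, True], repeat=n_vars):
        def val(lit):
            return bits[abs(lit) - 1] ^ (lit < 0)
        if all(any(val(l) for l in c) and not all(val(l) for l in c)
               for c in formula):
            return True
    return False

formula = [(1, 2, 3), (-1, -2, -3)]   # (a v b v c) and (not-a v not-b v not-c)
v, m = 3, 2
vertices, edges = build_graph(formula, v)
print(max_cut_size(vertices, edges) == v + 2 * m)   # True
print(nae_satisfiable(formula, v))                  # True
```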

7.4 More on NP-Completeness

A natural question to ask is whether or not there are problems that are in NP, but not in P, and also are
not NP-Complete. We cannot answer this question (yet), since to do so would imply that P ≠ NP. However,
we are able to say something about this question. There are three possible structures of NP, as depicted in
Figure 7.14.
We cannot eliminate A or C, but we do know that B is not possible. In particular, it has been shown that
if P ≠ NP, then there are infinitely many languages that are neither NP-Complete nor in P.
Nonetheless, most natural languages in NP have been shown either to be in P or NP-Complete. There are
a few important exceptions to this rule of thumb, including factoring (and related problems, such as discrete
log), as well as graph isomorphism.

7.4.1 More NP-Complete problems

It turns out that it is often difficult to tell the difference between an NP-Complete problem and one that
is solvable in polynomial time. In fact, there are many pairs of problems that seem quite closely related, but
one of them can be solved in polynomial time, and the other turns out to be NP-Complete.
Here is a list of examples of this phenomenon. On one side we put the problems in P, which are often referred
to as tractable problems; on the other side we put the NP-Complete problems, which are often referred to
as intractable problems. Here, tractable and intractable refer to whether the problems can realistically be
solved in practice.

P (tractable)              NP-Complete (intractable)

2-SAT                      3-SAT
Eulerian Path (edge)       Hamiltonian Path (vertex), TSP
Shortest Path              Longest Path
Min-Cut                    Max-Cut, Min-Bisection, Max-Bisection
Matchings                  3-D Matchings
MST                        Steiner Tree
Edge-Cover                 Vertex-Cover

2-SAT/3-SAT: We have seen that 3-SAT is NP-Complete. On the other hand, 2-SAT is solvable in
polynomial time, where 2-SAT is a version of the satisfiability problem that is in Conjunctive Normal
Form, but only has two literals per clause instead of three literals per clause.
Eulerian Path/Hamiltonian Path/TSP: In the Hamiltonian path problem, we are given a graph, and
we want to determine if there is a path through that graph that goes through every vertex exactly once.
Deciding whether or not a graph has this kind of path is NP-Complete. An Eulerian path goes through
every edge exactly once; determining if a graph has such a path is solvable in polynomial time. The
TSP (Traveling Salesman Problem) is similar to the Hamiltonian Path problem. Here, we are given a
list of cities, as well as a distance function between cities, and we want to find the shortest tour of all
of those cities, i.e., a way of visiting every city exactly once in such a way that we minimize the total
distance traveled. We shall revisit this problem when we discuss approximation algorithms.
Shortest Path/Longest Path: The Shortest Path between a pair of points can be found in polynomial
time (for example, by Dijkstra's algorithm). It turns out that the Longest Path problem is NP-Complete.
Min-Cut/Max-Cut/Min-Bisection/Max-Bisection: We saw that the Min-Cut problem is solvable in
polynomial time. On the other hand, Max-Cut is NP-Complete. A closely related problem is
the Max-Bisection problem. A bisection of a graph is a partition of the graph into two equally sized
sets of vertices; in the Max-Bisection problem, the objective is to find a bisection that maximizes the number
of edges that go across it. The Min-Bisection problem is also NP-Complete.
Figure 7.15: Tree of NP-Complete problems, rooted at 3-SAT and including Not-All-Equal 3-SAT, Clique, 3D Matching, Subset-Sum, Max-Cut, Vertex-Cover, Independent-Set, Steiner Tree, Knapsack, Hamiltonian Cycle, Min-Bisection, Max-Bisection, Longest Path, and Travelling Salesman.

Matchings/3D Matchings: We know how to find a maximum sized matching in polynomial time. There
is a 3D version of the matching problem that is NP-Complete.
MST/Steiner Tree: We can find the minimum spanning tree in polynomial time. There is a fairly
closely related problem, called the Steiner Tree problem, that is NP-Complete. In the Steiner Tree
problem, we still want to find a tree, but it does not have to be incident to all of the vertices in the
graph. In particular, there are optional nodes and required nodes in the graph. Our tree has to span
all of the required nodes, but we can either use or not use the optional nodes.
Edge-Cover/Vertex-Cover: In the Vertex-Cover problem, we want to find the minimum sized set of
vertices that touch every edge. This problem is NP-Complete. On the other hand, the Edge-Cover
problem is solvable in polynomial time. In this problem, we want to find a minimum sized set of edges
that touches every vertex.

To show that all of these problems are NP-Complete, we start with 3-SAT and reduce one problem to
another. The result is a tree of NP-Completeness results, as depicted in Figure 7.15.
The reason that we use 3-SAT as the root of the tree is partly because Cook's theorem (the first NP-
Completeness result) dealt with satisfiability, and partly because 3-SAT has a very specific structure, which
is very useful in performing reductions.
Chapter 8

Approximation Algorithms

As we have seen, there are many important problems that are NP-Complete. In fact, these problems are too
important to simply ignore: we need to develop methods to deal with these problems. Our main technique
for doing so will be to study approximation algorithms: techniques for finding a solution that is guaranteed
to be close to optimal. Before we do so, however, we shall briefly examine two other techniques: assuming
a probability distribution over the input, and restricting the allowed inputs to special cases.

8.1 Other possible approaches

8.1.1 Restrict the input

One thing we can do is to restrict the input. In other words, instead of solving the problem in its full
generality, we attempt to solve only certain classes of inputs. One such example is 2-SAT, which is solvable
in polynomial time. We can think of 2-SAT as a restriction of SAT, the more general satisfiability problem.
There are a number of different possible restrictions that can be made on a graph. For example, we might
restrict our attention to graphs that are acyclic, graphs that have bounded degree, or graphs that are planar
(i.e., can be drawn on the plane in such a way that none of the edges of the graph intersect). It turns out that
many NP-Complete problems can be solved fairly efficiently with these kinds of restrictions. For example,
the CLIQUE problem turns out to be fairly easy for all three of these
possible restrictions on the graph.
Another type of restriction that we might make is to assume that the integers that appear in the input are
polynomial in the input size. This is actually something we've done for the knapsack problem. For that
problem, we designed an algorithm that had a running time that was polynomial in the size of the input if we
assumed that all of the integers that appear in the input, e.g. the knapsack capacity, were polynomial in the
size of the input. So this does give us a polynomial-time algorithm of a restricted kind, but this restricted
kind can be quite useful in practice.
Unfortunately many of these restrictions are not very useful. For example, it turns out that in many cases,
the graph problems that we really want to solve do not adhere to the restrictions described above. However,
the restriction on the integers in the input actually does end up being useful often. Perhaps for this reason,
there is some terminology that goes along with this particular restriction, which we describe here briefly.
We have the following definitions:

Definition 83 An algorithm runs in pseudo-polynomial time if the running time is polynomial in the input
size and any integer in the input.

Example 1: For the knapsack problem we have a pseudo-polynomial time algorithm (using dynamic pro-
gramming) to solve it.

Example 2: The Subset-Sum problem is an NP-Complete problem. But it can also be solved in pseudo-
polynomial time by using a dynamic programming algorithm similar to the knapsack solution.
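The pseudo-polynomial dynamic program mentioned in Example 2 can be sketched as follows (a minimal version that only answers the decision question; the table records which sums are achievable):

```python
# Pseudo-polynomial dynamic program for subset-sum: reachable[s] records
# whether some subset of the items seen so far sums to s.  The running
# time is O(n * t) -- polynomial in the number of items and in the *value*
# of the target t, but not in its number of digits.

def subset_sum(items, t):
    reachable = [False] * (t + 1)
    reachable[0] = True                      # the empty subset sums to 0
    for x in items:
        # Scan downward so each item is used at most once.
        for s in range(t, x - 1, -1):
            if reachable[s - x]:
                reachable[s] = True
    return reachable[t]

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # True: 4 + 5 = 9
print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # False
```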

It is not the case that all problems with integers in the input can be solved in pseudo-polynomial time. In
particular, we have the following definition:

Definition 84 A problem is Strongly NP-Complete if it remains NP-Complete even when all integers in
an input of length n are polynomial in n.

For example, let's say that we reduce 3-SAT to a problem Π. If all the integers that appear in the instance
of Π are polynomial in the size of the original input to 3-SAT, then we have shown that Π is strongly NP-
Complete. Note that this is not the case for the reduction to the subset-sum problem that we saw earlier,
as the integers that we produce can have value exponential in the size of the original problem.
Exercise: Show that if a strongly NP-Complete problem can be solved in pseudo-polynomial time, then P = NP.
One example of a strongly NP-complete problem is called the Bin-Packing problem.
Definition: The Bin-Packing problem (strongly NP-Complete).
Input: A set of items of integer size, a bin size B, and an integer K.
Question: Can we partition the items into at most K sets (bins) such that no bin has total size larger
than B (i.e., the size of each set is ≤ B)?
Another term that is not closely related, but sounds similar to pseudo-polynomial time, is quasi-polynomial
time. Pseudo-polynomial time is a description of what the running time is polynomial in. Quasi-polynomial
time, on the other hand, is a description of how fast the running time is. In particular, quasi-polynomial
time lies between polynomial time and exponential time, as follows:

Polynomial: n^O(1) = 2^O(log n).

Quasi-polynomial: 2^((log n)^O(1)).

Exponential: 2^O(n).
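A quick numeric sanity check of this ordering (a sketch assuming base-2 logarithms; a different base only rescales the exponents by constants, and at n = 16 the quasi-polynomial and exponential values happen to coincide since (log 16)² = 16):

```python
import math

# Numeric comparison of the three growth classes, with one representative
# function from each: n^2, 2^((log n)^2), and 2^n.

def poly(n):
    return n ** 2                         # n^O(1)

def quasi(n):
    return 2 ** (math.log2(n) ** 2)       # 2^((log n)^O(1))

def expo(n):
    return 2 ** n                         # 2^O(n)

for n in (16, 256, 4096):
    assert poly(n) <= quasi(n) <= expo(n)
    print(n, poly(n), quasi(n), expo(n))
```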

8.1.2 Assume a probability distribution over the input

We also can assume that we have some probability distribution over the input. For example, for the clique
problem, we could assume a random graph and attempt to find the largest clique in this random graph.
The good news is that many NP-Complete problems turn out to be efficiently solvable in the average case.
For example, if we assume that every possible input graph is equally likely, then it is quite likely that we
can find the largest clique in a given graph in polynomial time.
There are two reasons why this is not usually an acceptable solution. First, it is not at all clear what
the right probability distribution to use should be. Certainly there are many instances where the uniform
probability distribution is not appropriate, but, even worse, there are many times when we are unable to
make any predictions as to what inputs are likely and what inputs are unlikely. Second, it turns out that
the worst case inputs can be quite important. In particular, people have observed over the years that the
problems that really do show up in practice, e.g., the clique problems that we want to solve, or the vertex
cover problems that we want to solve, actually are instances that are difficult to solve.

8.2 Introduction to approximation algorithms

If we have an NP-Complete problem, we cannot expect to find the optimal solution in polynomial time. However, we might
be able to find a solution that is only a little bit worse than the actual optimum. In fact, in some cases, we
can design algorithms that have a provable bound on how close the solution returned by the algorithm is to
the optimal value. When this is possible, it is often the best strategy for dealing with these problems. As a
result, this area of algorithms research has seen quite a bit of attention in recent years.

8.2.1 Greedy approximation algorithm for Vertex-Cover problem

As an example of this, we next consider the vertex-cover problem. In the vertex-cover problem, we want to
find a set of vertices that is incident to every edge in the graph. Note that up to now, we've only considered
the decision version of the vertex-cover problem. Since it does not make any sense to approximate a "yes"
or "no" answer, when we discuss an approximation algorithm, we focus on the optimization problem.
Example: Optimization version of the vertex-cover problem
Input: A graph G = (V, E).
Output: A set U ⊆ V of minimum size such that ∀e ∈ E, e has at least one endpoint in U.
We first give a natural approximation algorithm for this problem, but we then demonstrate that the resulting
algorithm is actually not the best approach. A very natural approach for this problem is to start with the
vertex having the highest degree. More generally, we could build up a solution, one vertex at a time, by
always adding the vertex that covers the most edges that have not been previously covered. This leads to
the following algorithm:
Greedy Algorithm:

1. U = ∅
2. while E ≠ ∅ do
3. let v be the maximum-degree vertex
4. U = U + v;
5. G = G − v;
6. return U

Note that when we add a vertex to U , we remove all the edges from E that are incident to that vertex.
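The greedy algorithm can be sketched as follows (the adjacency-set representation and the tie-breaking among equal-degree vertices are implementation choices not fixed by the pseudocode):

```python
# Greedy vertex-cover heuristic: repeatedly take a maximum-degree vertex
# and delete it together with its incident edges, until no edges remain.

def greedy_vertex_cover(adj):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    cover = []
    while any(adj.values()):                          # while E is nonempty
        v = max(adj, key=lambda u: len(adj[u]))       # maximum-degree vertex
        cover.append(v)
        for u in adj.pop(v):                          # remove v and its edges
            adj[u].discard(v)
    return cover

# A star: the center covers every edge, and greedy finds it immediately.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(greedy_vertex_cover(star))  # [0]
```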
The approximate solution of the problem instance in Figure 8.1(a) has size 4 (shown in Figure 8.1(b)).
Notice that this is not an optimal solution. The optimal solution has size 3, as shown in Figure 8.1(c).
It turns out that the performance of this greedy algorithm can actually be fairly poor. We next construct
an input where this greedy algorithm does quite poorly. The specific input is shown in Figure 8.2. Here,
the number of the vertices in each column is shown in the first row. The graph has m vertices in the first
column. The second column has ⌊m/2⌋ vertices, and, in general, the k-th column has ⌊m/k⌋ vertices, where
Figure 8.1: Comparison of the greedy approximate solution and the optimal solution: (a) the input instance, (b) the greedy solution, (c) the optimal solution.

k = 1, 2, ..., m. Thus there is a single vertex in the last column. Every vertex in the k-th column (k > 1)
has k edges to the vertices in the first column. There is at most one edge from the vertices in any given
column to any vertex in the first column. The single vertex in the last column has m edges, all of which go
to distinct vertices in the first column. Also note that every vertex in the first column has degree at most (m − 1).
In the above graph, all the edges go to the first column, so the vertices in the first column cover all the
edges. This is the minimum vertex cover, and thus the optimal solution has size m.
How well does the greedy algorithm do? Since the vertex in the last column has degree m, which is the
maximum degree, the greedy algorithm will start with this vertex. After this vertex is put into U, and all
the edges incident to that vertex are removed, the vertices in the first column now have degree at most
(m − 2), which is less than the degree of the vertices in the (m − 1)-st column. Thus, in the next step, the greedy
algorithm will pick the vertices in the (m − 1)-st column, and so on. The greedy algorithm ends up with a
solution consisting of all vertices of the graph, except those in the first column. Thus, the size of the vertex
cover returned by the greedy algorithm is:

1 + ⌊m/(m−1)⌋ + ... + ⌊m/3⌋ + ⌊m/2⌋ = (m + ⌊m/2⌋ + ⌊m/3⌋ + ... + ⌊m/(m−1)⌋ + ⌊m/m⌋) − m
                                     ≥ m(1 + 1/2 + 1/3 + ... + 1/m) − 2m
                                     = Θ(m log m)
The inequality above comes from the fact that each of the fractions is at most 1 larger than the value obtained
by rounding down.

8.2.2 Performance ratio for approximation algorithms

To measure whether an approximation algorithm performs well, we use the performance ratio of an algorithm.
Our definition depends on whether the problem is a maximization problem or a minimization problem.

Definition 85 The performance ratio of an algorithm is

max_{x, |x|=n} C_alg(x) / C_opt(x)   for a minimization problem,

and max_{x, |x|=n} C_opt(x) / C_alg(x)   for a maximization problem,

where C_alg(x) is the cost of the algorithm's solution on input x, and C_opt(x) is the cost of the optimal solution
on input x.
Figure 8.2: Specifically constructed vertex-cover instance on which the greedy algorithm does poorly; the columns contain m, ⌊m/2⌋, ⌊m/3⌋, ..., ⌊m/(m−1)⌋, 1 vertices.

The reason that we have different measures for the two types of problems is to allow us to compare mini-
mization problems and maximization problems on a similar scale. For both of these measures, large values
denote bad approximations and small values denote good approximations.
Example: According to the definition above, the performance ratio of the greedy approximation algorithm
for the Vertex-Cover problem is

Θ(m log m) / m = Ω(log m)    (8.1)

Why is this stated as Ω(log m) instead of Θ(log m)? The reason is that there may be some other input that
has the same size, but an even worse performance ratio, e.g., m^2.

Definition 86 An f(n)-approximation is an approximation with a performance ratio of f(n).

Example: It turns out that the bound in equation (8.1) is tight. Thus the greedy approximation algorithm
for the vertex-cover problem is a Θ(log n)-approximation.

8.2.3 A better approximation for the vertex-cover problem

Algorithm:
1. S = ∅;
2. while E ≠ ∅ do
3. pick any edge e = (u, v)
4. S = S + u + v;
5. G = G − u − v;
6. return S;
When we remove the two endpoints of the chosen edge, any edges that are incident to the endpoints are also
removed. Also, note that at step 3, any edge in the graph can be chosen: it does not matter what procedure
we use for this step.
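This algorithm can be sketched as follows (scanning the edges in arbitrary order and skipping edges that are already covered has the same effect as deleting u, v, and their incident edges from G):

```python
# 2-approximation for vertex cover: repeatedly pick an arbitrary remaining
# edge and add both endpoints.  The chosen edges form a maximal matching,
# which is what drives the factor-2 guarantee of Claim 87.

def vertex_cover_2approx(edges):
    cover = set()
    for u, v in edges:                 # any edge order works (step 3)
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover

# A perfect matching is the worst case: the algorithm takes all 2k
# endpoints, while an optimal cover takes one endpoint per edge.
matching = [(0, 1), (2, 3), (4, 5)]
print(sorted(vertex_cover_2approx(matching)))  # [0, 1, 2, 3, 4, 5]
```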

Claim 87 This algorithm is a 2-approximation.

Proof: To prove this claim, we need to argue that for any input, we have a solution that is at most a factor
of 2 worse than the optimal solution for that input.
Let E' be the set of edges chosen by the algorithm. The cost of the algorithm's solution is:

C_alg(G) = 2|E'|    (8.2)

Note that this result does not make any assumption about how we actually choose the edges. Whenever the
algorithm chooses an edge, it removes both of the endpoints as well as any other edges that are touching
those endpoints. So no pair of edges in E' shares a vertex. Thus E' is a matching. For the matching
E', each edge in E' has to be covered by some vertex. So for every edge in E', one of the two endpoints has
to be in the optimal solution, and there is no overlap of these endpoints. Therefore, the cost of the optimal
solution on any graph G has to be at least the size of E', i.e.,

C_opt(G) ≥ |E'|.    (8.3)

We see that (8.2) and (8.3) imply that for any G, C_alg(G) ≤ 2 C_opt(G).
Finally, we show that the performance ratio is at least 2, by exhibiting one particular input where the
performance ratio of the algorithm is no better than 2. Such an input is a matching: the algorithm will take
all of the vertices as its solution, and this is exactly twice as many nodes as the optimal solution of just
choosing one endpoint of every edge in the matching.
This simple algorithm is very close to the best known approximation algorithm for the vertex-cover problem.

8.2.4 Approximation for Independent-Set problem

We know every problem in NP can be reduced to a vertex-cover problem. So, in some sense, everything
in NP is no harder than the vertex-cover problem. Since we have a 2-approximation for the vertex-cover
problem, does this mean that we also have a 2-approximation for every problem in NP? Unfortunately not:
the vertex-cover approximation algorithm does not provide an approximation for every problem in NP. We
next see an example of this by examining the Independent-Set problem.
Independent-Set problem:
Input: An undirected graph G = (V, E).
Output: A set U ⊆ V of maximum size such that no two vertices in U are connected by an edge.
The Independent-Set problem is equivalent to the clique problem on the complement graph. Also, the
decision version of the independent-set problem is polynomial-time reducible to the decision version of the
vertex-cover problem.

Claim 88 Independent-Set ≤p Vertex-Cover.


Figure 8.3: Illustration that the 2-approximation for the Vertex-Cover problem does not provide a good
approximation for the Independent-Set problem. Panel (a) shows the input graph, panel (b) the approximate solutions for Vertex-Cover and Independent-Set, and panel (c) the optimal solutions.

Proof: If U ⊆ V is an independent set, from the definition of an independent set, we see that no edge in the
graph G has both endpoints in the set U. Thus, if we have an independent set U, then every edge in G has
at least one endpoint in V − U, which gives us that V − U is a vertex cover. Thus U is an independent set if
and only if V − U is a vertex cover. This implies that a graph G has an independent set of size k if and only
if it has a vertex cover of size |V| − k. Thus, we can do the following reduction, which can be completed in
polynomial time:

1. G → G;
2. k → |V| − k;

This is guaranteed to map "yes" instances of the Independent-Set problem to "yes" instances of the Vertex-
Cover problem, and "no" instances to "no" instances. Therefore, the Independent-Set problem is polynomial-
time reducible to the Vertex-Cover problem.
Since we have a 2-approximation for the vertex-cover problem, we next try to use this 2-approximation
algorithm in conjunction with this reduction to provide an approximation algorithm for the independent-set
problem.
Approximation algorithm for the Independent-Set problem:

1. Find S, the approximation solution of Vertex-Cover problem, using the 2-approximation algorithm for
the Vertex-Cover problem.
2. Return V S.

This algorithm certainly gives us a valid independent set. But how does this algorithm perform? Let's look
at the example in Figure 8.3(a). Here, the performance ratio of the algorithm for the vertex-cover problem
is roughly 2. The size of the vertex cover found by this algorithm is n − 1 (Figure 8.3(b)), i.e., |S| = n − 1
(n = |V|), since it puts every vertex but one into the cover. Then the algorithm will only return one vertex
as the approximate solution for the Independent-Set problem, i.e., C_alg(G) = 1. On the other hand, the cost
of the optimal independent set on this particular graph is (n + 1)/2 (Figure 8.3(c)), i.e., C_opt(G) = ⌈n/2⌉. Thus, we
cannot do better than an n/2-approximation here. This is really a terrible approximation, since a naive algorithm
would be to always return a single vertex, which is always an independent set of size 1, and thus is guaranteed
to be an n-approximation. The designed algorithm is only a factor of 2 better than this naive algorithm.
It turns out that we are not able to do much better than this naive algorithm, and thus there is a very big
difference between these two different NP-Complete problems as to how well they can be approximated.
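The figure is not reproduced here, but a path with an odd number of vertices shows the same gap (a sketch assuming the 2-approximation happens to scan the path's edges from left to right; a different edge order would behave better on this input):

```python
# On a path with an odd number n of vertices, scanning edges left to
# right makes the 2-approximation cover all but the last vertex, so the
# complementary independent set is a single vertex, while the optimal
# independent set (every other vertex) has size (n + 1) / 2.

def vertex_cover_2approx(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

n = 9                                        # odd number of vertices
path_edges = [(i, i + 1) for i in range(n - 1)]
S = vertex_cover_2approx(path_edges)         # covers vertices 0 .. n-2
independent_set = set(range(n)) - S          # just the last vertex
optimal = set(range(0, n, 2))                # alternate vertices
print(len(independent_set), len(optimal))    # 1 5
```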

8.3 Approximation for the max-cut Problem

Given an input graph G = (V, E), the solution to the max-cut problem is a bipartition of the vertices in the
graph G such that the number of edges between the two cells of the partition is maximized. Though the max-
cut problem is NP-Complete, it is possible to construct a 2-approximation: a solution that is guaranteed to
be within a factor of 2 of the optimal solution. The algorithm starts with two sets such that one set contains
all the vertices in the graph and the other set is empty. Vertices are subsequently moved from one set
to the other as long as the size of the cut increases. The 2-approximation for the max-cut problem is given as
follows:

1. S = ∅ /* set of vertices on one side */
2. while ∃v such that switching v to the other side improves the cut:
3. switch v.
4. return the resulting cut, S.

Note that a vertex v can be switched multiple times: it is possible that switching v once improves the cut,
but then after other vertices have also been switched, switching v back now improves the cut again.
We need to show that the running time of this algorithm is polynomial in the input size and that it always
returns a cut whose size is within a factor of 2 of the optimal solution. For each switch-v operation in the
above algorithm, the size of the cut increases by at least 1. As the maximum size of any cut is |E|, the
number of iterations can be at most |E|. The algorithm therefore runs in polynomial time.
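A sketch of this local search in Python (the scan order over the vertices is an implementation choice; any vertex whose switch improves the cut may be taken):

```python
from collections import defaultdict

# Local-search 2-approximation for max-cut: S holds one side of the cut;
# any vertex with more same-side than cross-side edges is switched, and
# the loop repeats until the cut is locally optimal.

def local_search_max_cut(vertices, edges):
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    S = set()                          # one side of the cut (starts empty)
    improved = True
    while improved:
        improved = False
        for v in vertices:
            cross = sum((v in S) != (u in S) for u in nbrs[v])
            same = len(nbrs[v]) - cross
            if same > cross:           # switching v strictly improves the cut
                S ^= {v}
                improved = True
    return S

def cut_size(S, edges):
    return sum((u in S) != (v in S) for u, v in edges)

# A 4-cycle: here the local optimum is also the true maximum cut.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
S = local_search_max_cut(range(4), square)
print(cut_size(S, square) >= len(square) / 2)  # True, by Theorem 89
```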

Theorem 89 The max-cut algorithm is a 2-approximation.

Proof: The size of the cut found by the max-cut algorithm needs to be shown to be within a factor of 2 of
the optimal solution. Let G = (V, E) be the input graph. Let a(v) be the set of edges from v ∈ V that cross
the cut. Let b(v) be the set of edges from v ∈ V that do not cross the cut. Note that |a(v)| ≥ |b(v)| for every v,
since otherwise v would have been moved by the algorithm. Summing over all vertices v ∈ V, we get

Σ_{v∈V} |a(v)| ≥ Σ_{v∈V} |b(v)|.    (8.4)

The LHS (left-hand side) of Equation 8.4 counts each edge that crosses the cut twice: once for each endpoint.
Similarly, the RHS counts each edge that does not cross the cut twice. Thus, Equation 8.4 implies that at least
as many edges cross the cut as do not, or in other words that C_alg ≥ |E|/2. Furthermore, the optimal
solution can at best have all of the edges crossing the cut: C_opt ≤ |E|. Therefore the above described max-cut
algorithm is a 2-approximation.
It should be noted here, however, that there is a more complicated approximation algorithm that achieves a
1.1383 performance ratio.
8.4 Approximation for the Metric Traveling Salesman Problem
(MTSP)

The Metric Traveling Salesman Problem (MTSP) is:


Input: n cities and a distance function, d, such that the distance function is restricted to conform to the
triangle inequality: ∀i, j, k: d_ij ≤ d_ik + d_kj.
Output: A tour of minimum length that passes through every city exactly once.
Some NP-Complete problems turn out to be in P after we make a restriction such as the triangle inequality
on the input. Thus, even though it is known that the Traveling Salesman Problem is NP-Complete, the
MTSP problem may be in P, or it may be NP-Complete. However, it turns out that we can show that it
is NP-Complete by reducing the hamiltonian-cycle problem to the MTSP problem. The decision version of
the hamiltonian-cycle problem can be described as:
Input: An undirected graph G = (V, E).
Problem: Is there a cycle that goes, exactly once, through each vertex in the graph?

Theorem 90 MTSP is NP-complete.

Proof: Given a subset of edges in the graph and an integer k, it can be verified in polynomial time whether the
edges pass through every vertex in the graph exactly once and whether the cost of the tour is at most k. Therefore
MTSP ∈ NP.
We next show that Hamiltonian-cycle ≤p MTSP. Let G = (V, E) be the input to the Hamiltonian-cycle
problem. To transform the input G to an input of MTSP, let the cities be the vertices in V. Let

dij = 1 if (i, j) ∈ E, and dij = 2 otherwise.

The maximum value of dij is 2, while any two-edge path dik + dkj has length at least 2. Therefore the
triangle inequality holds.
To see that we do indeed have a valid reduction, note that if there is a Hamiltonian cycle in G, the corre-
sponding MTSP tour exists with a length of |V | since the distance of each edge is 1. Similarly, if a MTSP
tour of length |V | exists, the tour uses only edges with a distance of 1. Therefore a Hamiltonian cycle exists.

8.4.1 First Algorithm for MTSP

We start with a simple approximation algorithm for the MTSP problem, called MTSP1. This algorithm
is a 2-approximation, but we shall see shortly how to improve this to a 3/2-approximation, using a slightly
more complicated technique. MTSP1 works as follows: we think of the cities as the vertices of a complete
graph G = (V, E), and we assign to each edge (i, j) the TSP distance dij between the vertices it connects.

1. Compute a Minimum Spanning Tree (MST) on G.


2. Construct a pseudo-tour of the MST by starting at an arbitrary vertex v and by walking around the
perimeter of the MST with the MST edges always to the right. The tour uses each edge twice as shown
by the directed-loop in Figure 8.4 around the MST. Since the MST contains all the vertices in the
graph G and there are no cycles in a MST, the tour is guaranteed to visit all the vertices.
Figure 8.4: MTSP1: Pseudo-tour

Figure 8.5: MTSP1: Extracted Tour

3. Extract a tour by short-cutting the repeated vertices in the pseudo-tour. In particular, a vertex in
the pseudo-tour is skipped if it has already been visited. In the resulting tour, each vertex will be
used exactly once. Figure 8.5 shows the extracted tour in bold solid and dashed lines. The solid
lines represent the edges common to both the pseudo-tour and the extracted tour. The dashed lines
represent the short-cutting used to construct the extracted tour.
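The three steps of MTSP1 can be sketched as follows (a minimal sketch for cities given as points in the plane, so the distances automatically satisfy the triangle inequality; a preorder walk of the MST plays the role of the pseudo-tour with repeated vertices short-cut):

```python
import math

def mtsp1(points):
    """2-approximation for metric TSP on points in the plane:
    MST (Prim's algorithm) + preorder walk of the tree, where the
    preorder visit order is exactly the short-cut pseudo-tour."""
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    # Prim's algorithm on the complete graph of cities.
    in_tree = [False] * n
    parent = [0] * n
    best = [math.inf] * n
    best[0] = 0.0
    children = [[] for _ in range(n)]
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if u != 0:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and d(u, v) < best[v]:
                best[v], parent[v] = d(u, v), u
    # Preorder DFS walk: each vertex appears exactly once.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour
```

By Claim 91 below, the closed tour through the returned order has length at most twice the MST cost.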

Claim 91 The MTSP1 algorithm is a 2-approximation.

Proof: As the pseudo-tour visits each edge of the MST twice, the cost of the pseudo-tour is twice the cost
of the MST. The extracted tour has cost at most the cost of the pseudo-tour, as the triangle inequality means
we can take a short cut without increasing the cost. Additionally, the cost of an optimal tour on a given graph
G cannot be less than the cost of the MST of the graph, since we can always construct a spanning tree by taking
a tour and removing any edge of that tour. Thus, we get the following:

length of extracted tour ≤ length of pseudo-tour = 2 · cost(MST)

Since

cost(MST) ≤ length of optimal tour,

length of extracted tour ≤ 2 · length of optimal tour.

Thus, algorithm MTSP1 is a 2-approximation.

8.4.2 Second Algorithm for MTSP (MTSP2: a 3/2-Approximation)

A better algorithm for MTSP can identify a tour of length at most 3/2 times the length of the optimal tour.
Such an algorithm uses a Eulerian tour. A Eulerian tour (Euler 1736) can be defined as follows.
Figure 8.6: MTSP2: MST and Matchings

Figure 8.7: MTSP2: Eulerian Tour

Definition 92 A Eulerian tour is a path that traverses every edge of the graph exactly once and returns
back to the initial vertex.

A Eulerian tour can thus visit any vertex in G several times. A graph G contains a Eulerian tour if and only
if G is a connected graph and every vertex in the graph G has even degree. Furthermore, if a graph contains
a Eulerian tour, we can find one in polynomial time. The following algorithm is due to Christofides [NCF76].
We again start with a completely connected graph G = (V, E), and a distance function d as defined by the
TSP distance.
M T SP 2 Algorithm

1. Compute a Minimum Spanning Tree(MST) on G.


2. A Eulerian tour cannot be constructed on a MST as there are odd degree vertices in any tree, since a
leaf node has odd degree. However, the MST can be augmented in the following way such that every
vertex in the resulting graph has even degree.

a. The set D of odd-degree vertices is identified from the MST. The arrows in Figure 8.6 show the
odd-degree vertices. Note that |D| is even, since the sum of all the degrees in D is even. This
is the case since the sum of the degrees of all vertices in a graph is equal to twice the number of
edges and is therefore an even number; removing the even-degree vertices from this sum still leaves
an even number.
b. A minimum weight matching on D is computed and is added to the MST. Such a matching can
be found in polynomial time. An example matching is shown in Figure 8.6 as bold lines on the
MST. It should be noted that it is permissible to use a matching that contains an edge already
present in the original MST. In such cases, the newly found edge must be considered as a distinct
edge on the graph.

3. The resulting graph is connected, as it contains a MST, and each vertex in the graph now has even
degree. Thus, a Eulerian tour can be identified on the resulting graph. Such a tour is guaranteed to
visit all vertices in the graph, as the resulting graph contains a MST. Figure 8.7 shows a Eulerian tour
on the resulting graph.

Figure 8.8: MTSP2: Extracted Tour

4. A tour is extracted from the Eulerian tour by short-cutting the repeated vertices in the Eulerian tour
such that each vertex is used exactly once. Starting at an arbitrary vertex, we follow the sequence
of vertices visited by the Eulerian tour, except that a vertex in the Eulerian tour is neglected if it
has already been visited. Figure 8.8 shows the Eulerian tour and the extracted tour. The solid lines
represent the edges common to both the Eulerian tour and the extracted tour. As before, the dashed
lines represent the short-cutting used to construct the extracted tour.
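Step 3's Eulerian tour can be found with Hierholzer's classical algorithm (a sketch; the input is assumed to be a connected multigraph, given as an edge list, in which every vertex has even degree):

```python
from collections import defaultdict

def eulerian_tour(edges):
    """Hierholzer's algorithm: returns a closed walk using every edge
    exactly once, assuming the multigraph is connected with all degrees
    even.  Edges are tracked by index so parallel edges are allowed."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    used = [False] * len(edges)
    stack, tour = [edges[0][0]], []
    while stack:
        u = stack[-1]
        # Discard edges already traversed from the other endpoint.
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()
        if adj[u]:
            v, i = adj[u].pop()
            used[i] = True
            stack.append(v)
        else:
            tour.append(stack.pop())
    return tour[::-1]
```

The walk runs in time linear in the number of edges, consistent with the polynomial-time claim above.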

Claim 93 The MTSP2 algorithm is a 3/2-approximation.

Proof: The cost of the Eulerian tour is equal to the sum of the cost of the matching edges and the cost of
the MST. Furthermore, the triangle inequality allows us to short-cut repeated vertices without increasing
cost, and thus the cost of the extracted tour is at most the cost of the Eulerian tour. In other words,

cost of the extracted tour ≤ cost of the Eulerian tour = cost(MST) + cost of matching    (8.5)

We already saw that

cost(MST) ≤ cost of optimal tour    (8.6)

The cost of the matching on D can be bounded by considering the cost of an optimal tour on the
vertices in D. By the triangle inequality, the cost of an optimal tour on the vertices in D is less than or
equal to the cost of the optimal tour on all the vertices in G. Since D consists of an even number of vertices,
any tour on D can be partitioned into two distinct matchings. One of these matchings must have at most
half of the weight of the tour. From this, we see that

cost of D matching ≤ (1/2) · cost of optimal tour    (8.7)
From Equations 8.5, 8.6 and 8.7,

cost of extracted tour ≤ (3/2) · cost of optimal tour

The MTSP2 algorithm is therefore a 3/2-approximation.


This algorithm leads us to two natural questions: (1) does this approximation algorithm extend to more
general versions of the TSP problem, and (2) can we do better for more specific versions of the TSP?
To address question (1), we prove the following:
Claim: For the general version of TSP, there is no k-approximation for any constant k, unless P = NP.

Proof: The proof is by a reduction from the Hamiltonian-cycle problem. The Hamiltonian-cycle
input is a graph G = (V, E). The corresponding TSP input is obtained by making the cities the
vertices V of G. The distance function d is defined as dij = 1 if (i, j) ∈ E, and dij = (k + 1)·|V| otherwise.

If there is a Hamiltonian cycle, then there is a TSP tour of length |V|. However, if G doesn't have a
Hamiltonian cycle, the tour is of length > (k + 1)·|V|. Thus, if we have a k-approximation, we can tell
the difference between the two possible cases. This would give us a polynomial time algorithm that decides
whether or not a Hamiltonian cycle exists.
Note that this reduction assumes that we know the value of k. Think about this as follows: assume that
someone describes to you a k-approximation. This k-approximation has some specific value of k associated
it with it. You can then use this value of k to construct the reduction. Thus, for any k-approximation,
there is a corresponding reduction that uses that value of k. In fact, the combination of the approximation
algorithm and the reduction can be used to solve the Hamiltonian-cycle problem in polynomial time.
To address question (2) above, we note that if a stronger restriction on the distance function d is used, better
approximation algorithms can be obtained. Consider the Euclidean TSP, where cities are points in a plane.
Such a problem is still NP-Complete. However, ∀ε > 0, a (1 + ε)-approximation can be obtained. The
algorithm runs in n^{O(1+1/ε)} time.

8.5 Approximation for the set-cover problem

We next consider an approximation algorithm for the following set-cover problem.


Input: a finite set U and a collection C = {S1, S2, . . . , Sm} of subsets of U.
Output: a minimum-sized cover C′ ⊆ C such that every element of U is contained
in at least one element of C′.

Here is a small example:

U = {e1, e2, e3, e4}
C = {{e1, e2}, {e1, e3}, {e1, e4}, {e1, e2, e4}}
C′ = {{e1, e2}, {e1, e3}, {e1, e4}}
C″ = {{e1, e3}, {e1, e2, e4}}

Both C′ and C″ are solutions, but C″ has the minimum possible size. Another example is when the set U
is defined to be the edges of a graph, and C contains one set Si for each vertex i, such that Si contains
all of the edges that are incident to the vertex i. Note that this second example is actually the vertex-cover
problem we considered earlier. Since vertex cover is thus a special case, the set-cover problem is NP-Complete.
We here consider approximation algorithms for the set-cover problem. In fact, we shall consider the more
general case of this problem, where each set Si has a weight wi , and our objective is to minimize the total
weight of the set-cover C 0 . Although we saw for the vertex-cover problem that a greedy approach is not
always best, we shall introduce a greedy algorithm for set-cover. It turns out that for set-cover, we do not
expect to do better than this simple approach.
For a greedy algorithm, a natural objective is to choose the set that covers the largest number of previously
uncovered elements. For example, a natural first step would be to choose the largest set to be in the cover.
However, we also want to take the weight of the sets into account. Thus, another natural first step would
be to choose the set of minimum weight. We shall combine these two objectives by choosing the set that
minimizes the quantity wi /|Si |. The idea is that we want to minimize the cost (weight) per element covered.
This leads us to the following algorithm:
Approximation Algorithm for Set-cover:
Set R = U, C = ∅
While R ≠ ∅:
    Let Si be the set minimizing wi / |Si ∩ R|.
    R = R − Si.
    C = C ∪ {Si}.
Return C.
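The algorithm above can be written out as follows (a minimal sketch; the sets are assumed to be given as Python sets with a parallel list of weights):

```python
def greedy_set_cover(universe, sets, weights):
    """Greedy weighted set cover: repeatedly pick the set minimizing
    weight per newly covered element, w_i / |S_i ∩ R|."""
    R = set(universe)
    cover = []
    while R:
        # Only sets that still cover something uncovered are candidates.
        i = min((i for i in range(len(sets)) if sets[i] & R),
                key=lambda i: weights[i] / len(sets[i] & R))
        cover.append(i)
        R -= sets[i]
    return cover
```

On the small example above with unit weights, it first picks {e1, e2, e4} (cost 1/3 per element) and then {e1, e3}, which is the optimal cover C″.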

Theorem 94 Let d* = maxi |Si|. This algorithm is an H(d*)-approximation, where H(d*) = ∑_{i=1}^{d*} 1/i.

Proof: We shall formalize the intuition we described for the greedy algorithm. In particular, ∀e ∈ U, let
ce = wi / |Si ∩ R|, where Si is the first set chosen that covers the element e, and R is the set of elements
still uncovered at the time that Si is chosen (right before the elements of Si are removed
from R). We can think of the cost ce as the cost paid for covering element e.

Claim 95

∑_{Si∈C} wi = ∑_{e∈U} ce.

This claim follows from the fact that the weight of each selected set is distributed in cost over the elements
that are newly covered by that set. Thus, the weight of each set is completely accounted for. Note that the
left-hand side of Claim 95 is the cost of the solution returned by the algorithm. Thus, this claim says we
can argue about the cost of the algorithm in terms of the right-hand side as well.

Lemma 96

∀Sk, ∑_{e∈Sk} ce ≤ H(|Sk|) · wk.

Note that this Lemma refers to all sets Sk: both those chosen by the greedy algorithm, as well as those sets
not chosen by the greedy algorithm. Roughly, this Lemma says the following: the elements that are in the
set Sk could have been covered with total cost wk. However, instead, the algorithm uses total cost ∑_{e∈Sk} ce.
The Lemma says that this cost cannot exceed the cost of using the set Sk (with weight wk) by more than a
factor of H(|Sk|). We shall make this more formal below, after we prove this Lemma.
Proof: Let d be the number of elements in the set Sk. We can assume through renumbering that Sk =
{e1, e2, . . . , ed}. Furthermore, we can also assume that the elements of Sk are numbered in the order that
they are covered by the algorithm, where ties (i.e., two elements being covered during the same iteration of
the algorithm) are broken arbitrarily.

Note first that for all j, 1 ≤ j ≤ d, in the iteration where ej is covered, it must be the case that wk/|Sk ∩ R| ≤
wk/(d − j + 1). This is because the elements are numbered in the order they are covered, and thus none of the
elements ej, . . . , ed can be covered yet, so |Sk ∩ R| ≥ d − j + 1. Note that it is possible that wk/|Sk ∩ R|
< wk/(d − j + 1), since some of the elements before ej may be covered at the same time that ej is covered (for
example, consider what happens when the first set chosen is Sk itself).

Let Si be the set chosen to cover the element ej. Since ej was not covered prior to Si being chosen, the set
Sk has not been chosen yet. The set Sk may be chosen at this iteration, or a set with a smaller cost per
newly covered element may be chosen. Thus, wi/|Si ∩ R| ≤ wk/|Sk ∩ R|. This implies that cej ≤ wk/(d − j + 1).
This gives us that

∑_{j=1}^{d} cej ≤ ∑_{j=1}^{d} wk/(d − j + 1) = wk/d + wk/(d−1) + . . . + wk/1 = H(d) · wk.

We are now ready to prove the theorem. Let Copt be the sets in the optimal set cover (the cover that minimizes
the total weight). Let wopt be the cost of this optimal solution, so that wopt = ∑_{Si∈Copt} wi. From Lemma 96,
we see that ∀Si ∈ Copt, wi ≥ (1/H(d*)) ∑_{e∈Si} ce. This gives us that

wopt = ∑_{Si∈Copt} wi ≥ ∑_{Si∈Copt} (1/H(d*)) ∑_{e∈Si} ce = (1/H(d*)) ∑_{Si∈Copt} ∑_{e∈Si} ce ≥ (1/H(d*)) ∑_{e∈U} ce = (1/H(d*)) ∑_{Si∈C} wi.

Here, the second step (equality) follows from a rearrangement of the terms of the summation. The third step
(≥) follows from the fact that the optimal solution must be a set cover, and thus each element of the set U
must appear in at least one set of Copt, and hence at least once in the double summation. The final step
(equality) follows from Claim 95.
We point out that we have not shown here that the greedy algorithm does not do better than an H(d*)-
approximation. However, our very first attempt at an approximation algorithm was a greedy algorithm for
the vertex-cover problem. The greedy algorithm considered there was actually exactly the greedy algorithm
we considered here for the set-cover problem, for the special case of set-cover described above that corresponds
to the vertex-cover problem. As a result, the lower bound of Ω(ln n) we described for the approximation
ratio of that algorithm also applies to the greedy algorithm we consider here. (Recall that H(d*) ≈ ln d*.)
Similarly, the analysis of the greedy set-cover algorithm we provide here demonstrates that the greedy
algorithm for vertex-cover does not do much worse than it does for the example we considered there.

We point out that for the vertex-cover problem, we saw a rather simple way to improve the approximation
ratio from ln n to 2. However, this does not seem to be possible for the more general set-cover problem (even
without weights). As we shall see below, the performance of this rather simple greedy algorithm is basically
the best that we can hope for.

8.6 Polynomial Time Approximation Schemes and the knapsack problem

Definition 97 A problem has a polynomial time approximation scheme (PTAS) if and only if ∀ε > 0 it has
a polynomial time (1 + ε)-approximation.

An example of this is the Euclidean TSP problem, where ∀ε > 0 there is a (1 + ε)-approximation with a
running time of n^{O(1+1/ε)}. The running time for this PTAS is exponentially dependent on 1/ε. Thus, for large
values of ε (which do not give a very good approximation) the running time is reasonable, but if we let ε get
arbitrarily small, the running time for this PTAS increases considerably. A specific type of PTAS with better
running times than the Euclidean TSP PTAS is a fully polynomial time approximation scheme (FPTAS),
where the running time remains polynomial in 1/ε.
Definition 98 A problem has a fully polynomial time approximation scheme (FPTAS) if and only if ∀ε > 0
it has a (1 + ε)-approximation, such that the runtime of the (1 + ε)-approximation is polynomial in 1/ε and also
polynomial in the size of the input.

An example of this is the FPTAS algorithm for the knapsack problem we shall examine next. This algorithm
has a running time of O(n³/ε).

8.6.1 FPTAS for the Knapsack Problem

Many problems which are not strongly NP-Complete do have an FPTAS. An example of this is the knapsack
problem. Recall that the knapsack problem is defined as follows:

The knapsack problem:


Input: A set of items numbered 1, 2, . . . , n;
a weight wi for each item i;
a value vi for each item i;
and a knapsack capacity, C.
Output: A subset B of the items with the maximum total value such that ∑_{i∈B} wi ≤ C.

Previous Knapsack Algorithm

We have already seen a dynamic programming algorithm to solve the knapsack problem. It was described
for the simplified version of the Knapsack problem where the weight of each item is equal to the value of the
item, but the idea behind that algorithm also works for the more general version of the problem considered
here. To do so, define knap(i, w ) to be the maximum value obtained using items 1 to i and capacity at most
w. We see that
Knap(i + 1, w) = max(Knap(i, w), Knap(i, w − wi+1) + vi+1).

We can compute knap(i, w) row by row in a table of size n × C. We obtain the final solution by looking at
the entry knap(n, C). The runtime of this algorithm is O(n·C), and it will always give an exact solution.
(Notice that C can be as large as exponential in the input size.)
It turns out that this algorithm does not lead directly to FPTAS - we shall see shortly why. However, a
similar approach can be modified to give us an FPTAS.
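For reference, the weight-indexed DP just described can be sketched with a one-dimensional table over capacities (a minimal sketch assuming integer weights):

```python
def knapsack_by_weight(weights, values, C):
    """Exact 0/1 knapsack, DP over capacities: knap[w] is the best value
    achievable with capacity w.  O(n*C) time (pseudo-polynomial in C)."""
    knap = [0] * (C + 1)
    for wi, vi in zip(weights, values):
        for w in range(C, wi - 1, -1):  # descending: each item used once
            knap[w] = max(knap[w], knap[w - wi] + vi)
    return knap[C]
```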

New Knapsack Algorithm

We now look at an alternative algorithm, which we then convert to an approximation scheme. In this
alternative algorithm, instead of taking knap(i, w) to be the maximum value of a subset of items 1
through i with a given capacity, we define vknap(i, v) to be the minimum weight required to achieve a value
of at least v using items 1 to i. These values can be computed row by row using the following formula:

vknap(i + 1, v) = min{ vknap(i, v), vknap(i, v − vi+1) + wi+1 }

where vknap(i, v − vi+1) is taken to be 0 if v < vi+1.


Figure 8.9: The table vknap(i, v), with n rows (items 1 through n) and columns for values up to nV.

To see that this is the case, note that to achieve value v with items 1 through i + 1, either we do so using
only items 1 through i, or we achieve value v vi+1 with items 1 through i, and the remaining value with
item i + 1. For the first row of the table, we use the following:

vknap(1, v) = w1 for v ≤ v1, and vknap(1, v) = ∞ for v > v1.

In the old version of the algorithm, the number of columns was the total capacity. In the new version we
define V = maxi(vi). The maximum possible value we can achieve is n·V. Thus, we only need to consider
values of v up to n·V, and so we now have nV columns. To find the optimal solution (the maximum value
that does not exceed the knapsack capacity), we examine the last row of the table, scanning from the left
until we find an entry that exceeds the knapsack capacity; the optimal solution is the column before it. In
other words, we find the largest column, vmax, such that vknap(n, vmax) ≤ C. (Notice that vknap(n, v) is
non-decreasing as v increases.)
Running time of the algorithm. The size of the table is n²·V (n rows by n·V columns). Every entry in
the table can be computed in constant time, and the scan over the last row takes time O(nV). Thus, the
runtime of this algorithm is O(n²·V).
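The value-indexed DP can be sketched as follows (a minimal sketch using a one-dimensional table over values, with math.inf playing the role of ∞ for unreachable values):

```python
import math

def vknap_solve(weights, values, C):
    """DP over values: vk[v] is the minimum weight needed to reach total
    value at least v.  The answer is the largest v with vk[v] <= C."""
    total = sum(values)            # no subset can exceed the value sum
    vk = [math.inf] * (total + 1)
    vk[0] = 0
    for wi, vi in zip(weights, values):
        for v in range(total, 0, -1):      # descending: 0/1 semantics
            prev = vk[max(v - vi, 0)]      # deficit below 0 costs nothing
            vk[v] = min(vk[v], prev + wi)
    return max(v for v in range(total + 1) if vk[v] <= C)
```

Since sum(values) ≤ n·V, the table here is no larger than the nV columns described above.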

Knapsack Approximation

The idea of the approximation algorithm for Knapsack is to reduce the precision with which we compute
the value, and thus reduce the number of columns. In particular, we set the k least significant bits to 0 in
every value that appears in the input. This does not affect the optimal value that can be achieved too much
because the lower order bits do not add up to very much compared to the higher order bits, but reducing
the number of bits reduces the amount of time required by the algorithm. By truncating the last k bits of all
the values we reduce the number of columns (and hence the amount of work done) in the table by a factor
2k .
The tradeoff here is that when k is small, our approximation is good (ε is small), while when k is large, the
runtime is reduced. The solution we obtain will still be valid because we have not changed the weights: the
same solution is guaranteed to still fit into the knapsack. While the value achieved will not necessarily be
optimal, it will not be too far off from the optimal value.
Since this is a maximization problem, we need to find k such that Copt/Calg ≤ (1 + ε), for a given ε.

Let v′i = vi with the k lowest-order bits set to zero.
Let Calg be the value of the solution returned by this algorithm.
Let C′alg be the optimal value of the modified problem.
Let Copt be the optimal value of the original problem.
Let B′ be the optimal subset of items in the modified problem.
Let B be the optimal subset of items in the original problem.

First, note that since the value of each item can only be decreased by truncating bits, it must be the case
that Calg ≥ C′alg. Furthermore,

C′alg = ∑_{i∈B′} v′i.

Since the set B′ represents the optimal solution to the modified problem, the solution formed by the set
B must have value less than or equal to it (in the modified problem). Therefore:

Calg ≥ C′alg = ∑_{i∈B′} v′i ≥ ∑_{i∈B} v′i

Since each v′i has the k lowest-order bits set to 0, v′i > vi − 2^k for all vi, and so

∑_{i∈B} v′i > ∑_{i∈B} (vi − 2^k) ≥ Copt − n·2^k

Therefore:

Calg ≥ Copt − n·2^k

Copt/Calg ≤ Copt / (Copt − n·2^k) = (Copt − n·2^k + n·2^k) / (Copt − n·2^k) = 1 + n·2^k / (Copt − n·2^k) ≤ 1 + n·2^k / (V − n·2^k)

where the last step uses Copt ≥ V (assuming every individual item fits in the knapsack).

Since we are looking for a (1 + ε)-approximation, we want a value of k such that 1 + n·2^k / (V − n·2^k) will be no
larger than (1 + ε).

Claim 99 If k ≤ log(εV/(2n)) and ε ≤ 1, then 1 + n·2^k / (V − n·2^k) is at most (1 + ε).

Proof:

Copt/Calg ≤ 1 + n·2^k / (V − n·2^k)

As k ≤ log(εV/(2n)), we have n·2^k ≤ εV/2, so

Copt/Calg ≤ 1 + (εV/2) / (V − εV/2)

As ε ≤ 1,

V − εV/2 ≥ V/2

Therefore,

Copt/Calg ≤ 1 + (εV/2) / (V/2) = 1 + ε.

To recap, the FPTAS for the Knapsack problem is as follows. Given a problem instance and an approximation
ratio (1 + ε):

1. Set the value of k, i.e., let k = ⌊log(εV/(2n))⌋;

2. Truncate the last k bits from all the values (not the weights);

3. Solve the smaller problem (using the vknap version), and use the items in the resulting solution.
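The recipe above can be sketched as follows (a minimal sketch for integer values; the value-indexed DP here also records the chosen item sets, and k is computed as in Claim 99, clamped at 0 when the formula would go negative):

```python
import math

def knapsack_fptas(weights, values, C, eps):
    """FPTAS sketch: shift k low-order bits off every value, then solve
    the truncated instance exactly with a value-indexed DP that records
    which items achieve each value.  Returns the chosen item indices."""
    n, V = len(values), max(values)
    k = max(0, math.floor(math.log2(eps * V / (2 * n))))
    scaled = [v >> k for v in values]
    total = sum(scaled)
    # vk[v] = (min weight reaching scaled value >= v, items achieving it)
    vk = [(0, frozenset())] + [(math.inf, frozenset())] * total
    for i, (wi, vi) in enumerate(zip(weights, scaled)):
        for v in range(total, 0, -1):       # descending: 0/1 semantics
            pw, ps = vk[max(v - vi, 0)]
            if pw + wi < vk[v][0]:
                vk[v] = (pw + wi, ps | {i})
    best = max(v for v in range(total + 1) if vk[v][0] <= C)
    return sorted(vk[best][1])
```

The returned set always fits the capacity, since the weights are never modified.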

Running time of the algorithm: The running time is still the size of the table, with each table entry taking
constant time. In the algorithm, we shall actually remove the last k bits of the values, rather than simply
setting them to 0. Each value is then shifted right by k bits, and the new maximum value is V/2^k. We
still have n rows, but now we only have nV/2^k columns, so the table has size n²V/2^k. Substituting in our value
for k, the run time is O(n²V/2^k) = O(n²V/(εV/(2n))) = O(n³/ε), which is polynomially dependent on 1/ε.

We now briefly mention why we defined a modified version of the dynamic programming algorithm, instead
of using the original version. In the original version of the algorithm, the running time is proportional to
the number of different weights we need to consider, and thus, to design an FPTAS, we would need to round
the weights of the items. We can either round up or down. If we round down, then we may end up with a
subset of items whose (unrounded) weights exceed the knapsack's capacity. If we round up the weights, we
get a valid solution, but the effect of the rounding is that we might not include an item with a small weight
but very large value. This means that it is not possible to bound the approximation ratio if the weights are
rounded up.

8.7 Hardness results for approximability

NP-Complete problems seem to fall into very distinct categories in terms of how well they can be approxi-
mated. In order of decreasing quality of approximation, these include, but are not limited to:

1. FPTAS - This is the best we can hope for, since the running time is polynomial in 1/ε. The example of
this we have seen is the knapsack problem.

2. PTAS - This is still very good, but the running time is not polynomial in 1/ε. The example of this we
mentioned was the Euclidean TSP.

3. Constant - This is worse, but still quite respectable. We have seen several examples of this: Max-
Cut, Vertex-Cover, and Metric TSP. We have seen 2-approximations for the first two, and a 3/2-
approximation for the third. For all three problems, it is known that there is a constant such that the
problem cannot be approximated better than that constant (unless P = NP).

4. log n - Here n is the size of the input. The example of this we have seen is the set-cover problem; again,
it is quite likely that this approximation algorithm is the best possible.

5. Worse than log n - A number of problems are such that the best known approximation is considerably
worse than log n, and often as bad as n^ε, for some constant ε. This type of approximation is not very
useful. However, for some problems, such as the clique problem, this is the best that we can do (unless
P = NP).

8.7.1 Problems without an FPTAS

We first demonstrate that if P ≠ NP, then not every problem has an FPTAS. In fact, relatively few
problems have an FPTAS. The following theorem demonstrates (roughly) that if there exists an FPTAS for
any strongly NP-Complete problem, then P = NP. Specifically:

Theorem 100 Consider a problem Π such that:

1. Π is strongly NP-Complete.

2. All the values in the input and the output are integers.

3. ∀ inputs I, Copt(I) is polynomial in |I| and the largest number in I, where Copt(I) is the cost of the
optimal solution for the input I.

If there is an FPTAS for Π, then P = NP.

For example, a problem with these properties is the clique problem with integer weights (where the objective
is to find the largest total weight clique). This problem is strongly NP-Complete (since it is NP-Complete
even with unit weights). An upper bound on Copt is the product of the largest weight and the number of
vertices, which is polynomial in the size of the input graph and the largest number in the input. In fact, all
of the strongly NP-Complete problems we have seen are covered by this theorem.
Proof: We here assume a maximization problem; the case of a minimization problem is similar. Since Π is
strongly NP-Complete, to show that P = NP, we need to show that we can design an algorithm for Π that
runs in polynomial time for all inputs I where all integers in I are polynomial in the input size.

An algorithm for the problem Π on input I is:

1. Let ε = 1/(U(I) + 1), where U(I) is the upper bound on the optimal solution on input I. Note that we can
assume that U(I) is polynomial in |I|.

2. Run FPTAS(ε, I), and use the solution returned by the approximation algorithm.

Let Calg(I) be the cost of the solution that the algorithm gives for input I. Let Copt(I) be the cost of the
optimal solution for the input I. Since (1 + ε)·Calg ≥ Copt, we get Copt − Calg ≤ ε·Calg. As ε·Calg < 1
(because Calg ≤ U(I) and ε = 1/(U(I) + 1)), we have Copt − Calg < 1. Thus, since all values are integers, the
solution given by the algorithm is the optimal solution. Since U(I) is polynomial in |I|, and we have an
FPTAS, the running time is polynomial in the input size. Thus if there is an FPTAS for a strongly
NP-Complete problem of the type described in the theorem statement, then we can solve the problem exactly
in polynomial time.
8.7.2 Problems without a PTAS

Unfortunately, even problems with a PTAS do not seem to occur as frequently as we would like: there are
many problems where it is known that the best possible approximation algorithm (assuming P ≠ NP) is a
constant. In the last few years, there has been considerable work and progress in determining exactly what
is the best approximation ratio possible for a number of problems. Most of these results (as well as the even
stronger lower bounds described below) are based on Probabilistically Checkable Proofs.
In some ways, these techniques lead to a similar high level strategy as we saw in proving problems to be
NP-Complete. There, we showed that 3-SAT was NP-Complete, and then used that to show that many
other problems were also NP-Complete. Here, Probabilistically Checkable Proofs are used to show that a
small number of problems are NP-Complete, and then those problems are reduced to other problems, using
a type of reduction suitable for showing results on approximability.
To show lower bounds of a constant on approximability, we use as the starting problem a variant of 3-SAT.
While it does not make any sense to approximate 3-SAT (as it has a solution of the form Yes/No), the
variation of 3-SAT that can be approximated is Max-3-SAT. The Max-3-SAT problem is defined as:

Input: A boolean formula in 3-CNF form.
Output: The maximum number of clauses that can be simultaneously satisfied.

Note that if we use a random assignment to the variables, then every clause in a 3-CNF formula has a 7/8
probability of being satisfied. This gives us an 8/7-approximation for Max-3-SAT: simply use a random
assignment of the variables. The expected number of clauses that will be satisfied by this algorithm is 7/8 of
all the clauses. Perhaps surprisingly, we can show that this is the best that we can do. Specifically, using
Probabilistically Checkable Proofs, we can show that ∀ε > 0, if P ≠ NP, there is no (8/7 − ε)-approximation
for the Max-3-SAT problem.
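The 7/8 expectation can be checked directly on small formulas by averaging over all assignments (a sketch; clauses are triples of nonzero integers, with −i denoting the negation of variable xi):

```python
from itertools import product

def expected_satisfied(clauses, n):
    """Average number of satisfied clauses over all 2^n assignments to
    n variables.  A literal +i means x_i and -i means NOT x_i."""
    total = 0
    for bits in product([False, True], repeat=n):
        sat = lambda lit: bits[abs(lit) - 1] == (lit > 0)
        total += sum(any(sat(l) for l in clause) for clause in clauses)
    return total / 2 ** n
```

For m clauses over three distinct variables each, the average is exactly (7/8)·m, since each clause fails on exactly one of the eight settings of its variables.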
We next give a simple example of using this result to prove the hardness of approximation for other problems.
In particular, we prove the following result for the clique problem. We point out that there are actually
much stronger inapproximability results known for this problem, as described below.

Claim 101 If P ≠ NP, the clique problem cannot be approximated better than 8/7.

Proof: To show the clique problem was NP-Complete, we reduced the 3-SAT problem to the Clique problem.
We showed that all the m clauses of a boolean formula are satisfied if and only if a clique of size m exists in
the corresponding graph. However, from the same reduction, it also follows that ∀k, k clauses of a Boolean
formula can be simultaneously satisfied if and only if there is a clique of size k in the corresponding graph.
Thus, if we can approximate the clique problem better than 8/7, we can also approximate Max-3-SAT better
than 8/7. So, if P ≠ NP, clique cannot be approximated better than 8/7.

8.7.3 Problems with no constant factor approximation

For the set-cover problem, we saw a ln|U|-approximation, but we did not see a constant factor approximation.
In fact, the simple greedy algorithm we saw is likely to be the best possible for this problem. In particular,
we know that if P ≠ NP, then there is no polynomial time c·ln|U|-approximation algorithm for this problem,
for some c > 0. Furthermore, we know that if NP ⊄ DTIME(n^{log log n}), then there is no polynomial time
(1 − ε)·ln|U|-approximation to the set-cover problem, for any ε > 0. DTIME(n^{log log n}) is the class of
problems that can be solved in deterministic time O(n^{log log n}); it is almost as unlikely that NP is contained
in this class as that it is contained in P, and thus it is quite likely that the greedy approximation is optimal.
Figure 8.10: A graph whose chromatic number is 3 (vertices colored Red, Red, Blue, and Green)

The set-cover problem serves the same purpose for problems where Θ(log n) is the best approximation ratio
possible as Max-3-SAT did for problems where a constant approximation ratio was the best possible. In
particular, the set-cover problem can be used to show lower bounds on approximability for a number of
other problems.

8.7.4 Problems with no log n-approximation

Examples in this class include Clique, Independent Set, and the Graph-Coloring problem. Problems in this
class have no good approximations (unless P=NP).

Clique

Claim 102 If P ≠ NP, then there is no |V|^(1/2 − ε)-approximation algorithm for clique, for any ε > 0.

Claim 103 If NP ≠ ZPP, then there is no |V|^(1 − ε)-approximation algorithm for clique, for any ε > 0.

Definition 104 ZPP is the class of problems that can be solved by Las Vegas randomized algorithms
in polynomial time.

The best known approximation for Clique is an O(|V| / log² |V|)-approximation.

Max Independent Set

In the independent-set problem, we want to find the largest set of vertices without any edges between
any pair of vertices in the set. This problem is really just a disguised version of the clique problem: to
convert from one to the other, just use the complement graph of G. Thus, this is just as hard
to approximate as clique.
Graph-Coloring

The graph-coloring problem is also in this class. For this problem, the input is an undirected graph
G = (V, E), and we wish to find the chromatic number of G, which is the minimum number of colors
required to color G. A coloring is a mapping

f : V → colors

such that if v1 is adjacent to v2 then f(v1) ≠ f(v2). That is, adjacent vertices must have different
colors. The graph in Figure 8.10 is an example of a graph whose chromatic number is 3.
Graph-coloring is clearly related to clique, since the chromatic number of a graph is at least as large as
its maximum clique.

Claim 105 If P ≠ NP, then there is no |V|^((1/7) − ε)-approximation algorithm for Graph-Coloring, for
any ε > 0.

Claim 106 If NP ≠ ZPP, then there is no |V|^(1 − ε)-approximation for any ε > 0.

An interesting special case of this problem is planar graphs. By the Four Color Theorem, such graphs
are always 4-colorable. We can determine 1-colorability and 2-colorability in polynomial time. However,
determining whether the chromatic number of a planar graph is 3 or 4 is NP-Complete.
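The clique/independent-set correspondence used above can be checked directly on a small example. The following is an illustrative sketch (the particular graph and the helper names are our own, not from the notes): a set of vertices is a clique in G exactly when it is an independent set in the complement of G.

```python
from itertools import combinations

def complement(n, edges):
    """Complement of an n-vertex undirected graph given as a set of edges."""
    edges = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(range(n), 2)} - edges

def is_clique(S, edges):
    """Every pair of vertices in S is joined by an edge."""
    edges = {frozenset(e) for e in edges}
    return all(frozenset(p) in edges for p in combinations(sorted(S), 2))

def is_independent(S, edges):
    """No pair of vertices in S is joined by an edge."""
    edges = {frozenset(e) for e in edges}
    return all(frozenset(p) not in edges for p in combinations(sorted(S), 2))

# A 4-vertex example: a triangle {0, 1, 2} plus a pendant vertex 3.
G = {(0, 1), (0, 2), (1, 2), (2, 3)}
Gbar = complement(4, G)

# {0, 1, 2} is a clique in G, hence an independent set in the complement.
print(is_clique({0, 1, 2}, G), is_independent({0, 1, 2}, Gbar))  # True True
```

Any approximation algorithm for one problem can therefore be run on the complement graph to approximate the other with the same ratio.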

8.7.5 Useful resources

The website http://www.nada.kth.se/~viggo/problemlist/compendium.html maintains a very useful
list of many important problems and specifically what is currently known about how well these problems
can be approximated. It includes information on general problems and special cases.
Chapter 9

Linear Programming

Linear Programming is a widely-used and very powerful technique. For example, some of the problems we
have developed algorithms for in this course (such as the maximum flow problem) can be solved using Linear
Programming.

9.1 Introduction

We will start with a specific example of Linear Programming and then use this to motivate the general form
of the problem. This example is the Diet Problem: design a diet such that a daily nutritional requirement
will be satisfied with minimum cost.
input: an integer n, the number of different foods;
an integer m, the number of required nutrients;
an m × n matrix whose entries are aij, the amount of nutrient i in a single unit of food j;
an m-vector whose entries are bi, the minimum daily requirement of nutrient i; and
an n-vector whose entries are cj, the cost of a unit of food j.
goal: Find the vector x = (x1, x2, ..., xn) ∈ R^n that minimizes Σ_{j=1}^{n} cj·xj, where xj is the quantity
of food j consumed per day. Note that this is equivalent to minimizing the cost spent per
day on food.
requirements: For each i ∈ {1, ..., m}, Σ_{j=1}^{n} aij·xj ≥ bi (which means that the amount of nutrient i
consumed in a day is at least the minimum required), and
for each j ∈ {1, ..., n}, xj ≥ 0 (which means that the intake of each food item is non-negative).
We see that this formulation expresses our goal of spending as little money as possible, but still meeting the
minimum daily nutritional requirements.
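As a concrete illustration, here is a minimal sketch of this formulation in Python. The numbers (two nutrients, three foods) are entirely made up, and the code only checks the feasibility and cost of a candidate diet; it does not solve for the optimum.

```python
# A tiny hypothetical instance of the diet problem: 2 nutrients, 3 foods.
# a[i][j] = amount of nutrient i in one unit of food j (invented numbers).
a = [[2.0, 1.0, 0.0],   # nutrient 0
     [0.0, 1.0, 3.0]]   # nutrient 1
b = [4.0, 3.0]          # minimum daily requirement of each nutrient
c = [1.5, 1.0, 2.0]     # cost per unit of each food

def cost(x):
    """Total daily cost of diet x: sum over j of c_j * x_j."""
    return sum(cj * xj for cj, xj in zip(c, x))

def feasible(x):
    """Check x >= 0 and, for every nutrient i, sum over j of a_ij * x_j >= b_i."""
    if any(xj < 0 for xj in x):
        return False
    return all(sum(aij * xj for aij, xj in zip(row, x)) >= bi
               for row, bi in zip(a, b))

x = [2.0, 0.0, 1.0]     # candidate diet: 2 units of food 0, 1 unit of food 2
print(feasible(x), cost(x))   # this diet meets both requirements at cost 5.0
```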

9.1.1 General Linear Programming problems

General Linear programming problems have the following form:


Let c, x, and the ai be vectors. Minimize c·x, subject to the following constraints:

ai·x ≥ bi, or ai·x = bi, or ai·x ≤ bi, and

xj ≥ 0, or xj unconstrained, or xj ≤ 0.

Figure 9.1: A simple graph

An important requirement here is that each component of the vector x can be any real number. Note that
if x can only take integer values, then the problem becomes NP-Complete. On the other hand, with real
values, we can solve the problem in polynomial time.

9.1.2 Another example: Maximum flow

The familiar Max-Flow problem can be solved by stating it as a linear programming problem:
Input: A directed graph G = (V, E),
a matrix Cu,v holding the edge capacities (0 if there is no edge from u to v), and
a source node s and a sink node t.
We convert this to the following Linear Programming problem:
Variables: fuv: the flow from u to v, for every pair of vertices (u, v). We shall use f to denote
the vector of all the fuv's.
F: the value of the flow.
Objective function: Maximize the variable F.
Subject to constraints: capacity: f ≤ C,
skew-symmetry: fuv + fvu = 0, and
conservation of flow: B·f + F·d = 0.
We next explain the conservation of flow constraints. The rows of the matrix B represent vertices of G
and the columns represent ordered pairs of vertices (u, v) such that u ≠ v. For every column (u, v) of B, if
(u, v) ∈ E then row u is +1, row v is −1, and all other rows are 0. If (u, v) ∉ E then that column contains
only zeroes.
For example, for the graph depicted in Figure 9.1, the matrix B would be (all skipped columns are 0's):

B | (1,2) (1,3) (1,4) (2,1) ... (3,2) ... (4,2) ...
1 |  +1    +1    +1     0   ...   0   ...   0   ...
2 |  -1     0     0     0   ...  -1   ...  -1   ...
3 |   0    -1     0     0   ...  +1   ...   0   ...
4 |   0     0    -1     0   ...   0   ...  +1   ...

and d is a |V|-vector which has +1 at vertex t, −1 at vertex s, and 0 everywhere else.
The columns of the matrix B and the entries of the vector f are ordered in such a way that the column
of B corresponding to the pair (u, v) lines up with the entry of f corresponding to the flow from u to v.
As a result, the constraint corresponding to each row i of the matrix B represents the conservation of flow
constraint for vertex i. This is because the +1 entries cause the total flow out of a node to be a positive
value, and the -1 entries cause the total flow into a node to be a negative value. For their sum to be 0 these
quantities have to exactly cancel.
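The construction of B and d can be sketched in code. We assume the graph of Figure 9.1 has edge set {(1,2), (1,3), (1,4), (3,2), (4,2)} with source s = 1 and sink t = 2, which matches the nonzero columns of the matrix shown above.

```python
# Build the conservation-of-flow matrix B and vector d for an assumed graph
# consistent with the table above: V = {1,2,3,4}, s = 1, t = 2.
V = [1, 2, 3, 4]
E = {(1, 2), (1, 3), (1, 4), (3, 2), (4, 2)}
s, t = 1, 2

pairs = [(u, v) for u in V for v in V if u != v]   # ordered pairs = columns of B

B = {}
for (u, v) in pairs:
    col = {w: 0 for w in V}
    if (u, v) in E:
        # Column for an edge: +1 at its tail u, -1 at its head v.
        col[u], col[v] = +1, -1
    B[(u, v)] = col        # non-edges keep an all-zero column

d = {w: 0 for w in V}
d[t], d[s] = +1, -1        # +1 at the sink, -1 at the source

print(B[(1, 2)][1], B[(1, 2)][2], B[(3, 2)][3], B[(3, 2)][2])  # 1 -1 1 -1
```

The printed entries match the (1,2) and (3,2) columns of the table above.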
9.1.3 Standard Form of a Linear Program

In order to solve linear programs, we consider a specific form for expressing the problem.

Definition 107 The Standard Form for Linear Programming is as follows:

Minimize: c·x
Subject to: A·x = b, and x ≥ 0,

where x is an n-vector, c is an n-vector, b is an m-vector, and A is an m × n matrix. We assume that the rows of
A are linearly independent (since otherwise we can remove some constraints). Also, we assume that m < n
(if m = n we can simply invert the matrix A), and that a feasible solution exists.
Every Linear Programming problem can be converted to this form. To do so, we use the following conversion
process:

If xj ≤ 0 is one of the constraints, then replace it with xj ≥ 0, and multiply all appearances of xj by
−1. In other words, multiply the jth column of A by −1 and the jth entry of c by −1.

If xj is unconstrained, then replace all occurrences of it with x′j − x′′j, and add the constraints x′j ≥ 0 and
x′′j ≥ 0, where x′j and x′′j are new variables.

If ai·x ≤ bi is one of the constraints, then replace it with ai·x + y = bi and y ≥ 0, where y is a new
variable.

If ai·x ≥ bi is one of the constraints, then use the same method: replace it with ai·x − y = bi and y ≥ 0.
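The slack-variable rule can be sketched as follows. The function name and the zero cost vector are our own choices for illustration; the example constraints are the canonical-form inequalities from the example later in this chapter, and the conversion recovers the 4 × 7 matrix of equation (9.8).

```python
def to_standard_form(A, b, c):
    """Convert  min c.x  s.t.  A.x <= b, x >= 0  into standard form
    min c'.x'  s.t.  A'.x' = b, x' >= 0, by adding one slack variable
    per constraint (the third conversion rule above)."""
    m = len(A)
    # Append an m x m identity block: one slack column per constraint.
    A_std = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
             for i, row in enumerate(A)]
    c_std = c[:] + [0.0] * m       # slack variables cost nothing
    return A_std, b[:], c_std

# Example: x1 + x2 + x3 <= 4, x1 <= 2, x3 <= 3, 3*x2 + x3 <= 6.
A = [[1, 1, 1], [1, 0, 0], [0, 0, 1], [0, 3, 1]]
b = [4, 2, 3, 6]
c = [0, 0, 0]                      # any cost vector; zeros here for illustration
A_std, b_std, c_std = to_standard_form(A, b, c)
print(len(A_std[0]), A_std[0])     # 7 columns; first row [1, 1, 1, 1.0, 0.0, 0.0, 0.0]
```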

9.2 Feasible Solutions

Definition 108 The set of feasible solutions of an LP problem is the set F = {x ∈ R^n : A·x = b
and xi ≥ 0 for 1 ≤ i ≤ n}.

In order to solve the LP problem, we have to find the elements of the set F that minimize the objective
function c·x; for the time being, we will ignore optimization and simply consider F. The following definitions
are depicted in Figures 9.2, 9.3, and 9.4.

Definition 109 A half-space in R^k is a set of the form

{x ∈ R^k : Σ_{j=1}^{k} aij·xj ≤ bi}    (9.1)

Definition 110 A polyhedron in <k is the intersection of half-spaces.

Definition 111 A polytope is a bounded polyhedron.


Figure 9.2: A half-space in R²

Figure 9.3: A polyhedron in R²

Figure 9.4: A polytope in R²


9.2.1 F as a polyhedron in R^(n−m)

Claim 112 Any linear program in standard form with n variables and m equations has a feasible set that
can be described as a polyhedron in R^(n−m).

Although the problem is given in standard form, to demonstrate this claim we consider linear programs
given in canonical form:
Given an m × n′ matrix A, an m-vector b, and an n′-vector c, find an n′-vector x such that c·x is minimized
and

A·x ≤ b    (9.2)
x ≥ 0    (9.3)

Note that A·x ≤ b defines m half-spaces, and so the set of feasible solutions to a canonical-form LP problem
consists of the intersection of m half-spaces in R^(n′). Thus, to prove the claim we only need to show that
any LP problem given in standard form with n variables and m equations is equivalent to an LP problem in
canonical form with n − m variables and m equations.
Proof: Consider an LP problem in standard form. Assume that the m rows of A are linearly independent;
otherwise there are some redundant constraints and we can reduce the number of rows of A. There must
therefore be m linearly independent columns of A; assume that they are the first m (we can ensure this
through reordering of the variables). Then we can write A as

A = [B | D]    (9.4)

where B is an invertible m × m matrix and D denotes the remaining n − m columns.


After left-multiplying both sides of A x = b by B1 , we have B1 A x = B1 b or A0 x = b0 , and the
first m columns of A0 form the identity matrix Im :
0
x1 b1


.. = ..

Im (9.5)
. .


xn b0m
Looking at a single row i in this equation, we have

xi + Σ_{j=m+1}^{n} a′ij·xj = b′i    (9.6)

for all i in 1...m. In order to convert the LP from standard form to canonical form, the = sign has to be
converted into an inequality. We have assumed that xi ≥ 0 for all i in 1...n; so for each i we can drop xi from the
above equation, resulting in

Σ_{j=m+1}^{n} a′ij·xj ≤ b′i    (9.7)

for all i in 1...m, which is a linear programming problem in canonical form with the n − m variables
xm+1, ..., xn and m constraints. Thus, the feasible set can be represented as the intersection of half-spaces in
R^(n−m). This completes the proof.
9.2.2 An Example

Consider the set of constraints

A = [ 1 1 1 1 0 0 0
      1 0 0 0 1 0 0
      0 0 1 0 0 1 0
      0 3 1 0 0 0 1 ]    (9.8)

b = (4, 2, 3, 6)^T    (9.9)
giving a set of equations

x1 + x2 + x3 + x4 = 4
x1 + x5 = 2
x3 + x6 = 3    (9.10)
3x2 + x3 + x7 = 6

It is easy to convert these constraints into canonical form; since the 4 rightmost columns of A already form
an identity matrix, we only need to remove x4 from the first equation, x5 from the second, and so on, to get
a set of inequalities in 3 variables:

x1 + x2 + x3 ≤ 4
x1 ≤ 2
x3 ≤ 3    (9.11)
3x2 + x3 ≤ 6

The solutions to these inequalities form a polyhedron in R³. Note that since n = 7 and m = 4, this is R^(n−m).
Each constraint defines a half-space in R^(n−m). Figure 9.5 shows the polyhedron defined by the intersection
of these half-spaces, which in this case is a polytope. The feasible set F is the set of points in the polytope.
Consider any point of this polytope. An important aspect of this representation to keep in mind is that
any such point corresponds to a specific value for all n of the original variables. In this example, it is clear
that the point defines x1, x2, and x3. Furthermore, once those three variables are defined, the remaining four
variables are also defined. For example, x4 = 4 − x1 − x2 − x3.
Thus, we have two different ways to view the linear programming constraints: algebraically, i.e., in terms
of the equations as represented by (9.10) and (9.11), or geometrically, i.e., in terms of the polyhedron as
represented by Figure 9.5. These two interpretations are equivalent; understanding both as well as the
correspondence between the two will be central to our understanding of linear programming.
To further develop this relationship, note that in the polyhedron there is a plane corresponding to each
of the n variables being zero, so n planes form the boundary of the polytope. Clearly the origin planes
represent one of x1, ..., x3 being zero. If we look at the original set of equations, we can see that the other
planes represent the values of x1, ..., x3 required to satisfy a row of A·x = b when one of x4, ..., x7 is set
to 0. So the plane x3 = 3 represents the solutions to the equation x3 + x6 = 3 when x6 = 0, the plane 3x2 + x3 = 6
represents the solutions to the equation 3x2 + x3 + x7 = 6 when x7 = 0, and so on.
From this, we see that a vertex in a feasible polyhedron can be described both geometrically and algebraically
as follows:
Geometrical description: A vertex is an intersection of at least (n − m) hyperplanes. A vertex is called
degenerate if it is an intersection of more than (n − m) hyperplanes. For example, in three-dimensional space
(n − m = 3), a pyramid-shaped polyhedron has a degenerate vertex at its apex, which is an intersection
of 4 faces of the pyramid (see Figures 9.6 and 9.7). In what follows we will assume, for simplicity, that all
vertices are non-degenerate.

Figure 9.5: Example Polyhedron (the polytope of the example above, with vertices (0,0,0), (2,0,0), (0,2,0),
(2,2,0), (0,0,3), (0,1,3), (1,0,3), and (2,0,2), bounded by the planes x1 + x2 + x3 = 4 (x4 = 0), x1 = 2
(x5 = 0), x3 = 3 (x6 = 0), and 3x2 + x3 = 6 (x7 = 0))

Figure 9.6: Non-Degenerate vertex

Figure 9.7: Degenerate vertex
Algebraic description: Since a hyperplane corresponds to a variable being set to 0, a vertex corresponds to
a feasible solution in which at least (n − m) variables are set to zero. A non-degenerate vertex has exactly
n − m variables set to zero.
Now that we have some understanding of the feasible set F , we return to our original optimization problem.

9.3 Finding the Optimal Value

Lemma 113 If F is a polytope, c·x is minimized at a vertex of the polytope; if F is not a polytope, either
c·x is minimized at a vertex, or c·x can be made arbitrarily small.

When c·x can be made arbitrarily small, the problem is called an unbounded problem.
Instead of a formal proof, we give only an intuitive argument as to why this is the case. Let's consider
an LP problem in two dimensions whose objective function is c·x. Any equality of the form c·x = d, where
d is a constant, is a line in 2D space, and varying d produces parallel lines. As we decrease the value of d, the
line moves in a preferred direction, as shown in Figure 9.8. As we move the line in the direction in which
c·x is minimized, the last point or set of points of the polyhedron that we see is either a vertex or an edge
that is parallel to the line c·x = d. If there is such a parallel edge (as in Figure 9.9), the set of points still
contains at least one vertex. Therefore, the minimum is still achieved at a vertex.
Figure 9.8: Optimization on a 2D Feasible Set

Figure 9.9: Optimization of a 2D Feasible Set - Special Case

We can imagine an analogous situation in three dimensions, in which we move a plane through three-dimensional
space; we might have a line or a plane parallel to the moving plane, but we will have at least one
vertex in the set of points that minimizes c·x. The same is true for higher dimensions as well.

Figure 9.10: Optimization of a non-Polytope Feasible Set
If F is not a polytope (as in Figure 9.10) then, in some cases, it is possible to move the line indefinitely in
the direction of optimization, if the polyhedron is unbounded in this direction. If the polyhedron is bounded
in the direction of optimization, then c x is minimized at a vertex. Therefore, for the LP problem to be
unbounded it is necessary, but not sufficient, for F to be unbounded.

9.3.1 A simple algorithm

Lemma 113 suggests the following simple algorithm for finding the optimal solution to an LP problem:

1. For each vertex x of F:
2. Compute c·x.
3. Return the minimum value found.

This algorithm will give us a correct answer, but has exponential running time. The number of vertices to
be considered can be exponential in the dimension n − m. To illustrate this, let's consider the example of a
d-dimensional cube (Figure 9.11). For d = 1, we have a single face (in this case, an edge) and two vertices.
For d = 2, we have a square with 4 faces and 4 vertices. In a three-dimensional cube, we have 6 faces and
8 vertices. Thus, we notice that with the addition of each dimension, the number of vertices increases by a
factor of 2, and is therefore exponential in d.
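The simple algorithm can be sketched by anticipating the algebraic description of vertices developed later in this chapter: enumerate every choice of m columns of A, solve for the corresponding basic variables, and keep only the feasible (non-negative) solutions. This is a minimal illustration on a small made-up standard-form instance, not an efficient implementation.

```python
from itertools import combinations

def solve(M, rhs):
    """Solve the square system M.z = rhs by Gaussian elimination;
    return None if M is singular."""
    n = len(M)
    M = [row[:] + [r] for row, r in zip(M, rhs)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-9:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]    # normalize pivot row
        for r in range(n):
            if r != col and M[r][col] != 0:
                M[r] = [a - M[r][col] * p for a, p in zip(M[r], M[col])]
    return [row[-1] for row in M]

def brute_force_lp(A, b, c):
    """Try every choice of m basic columns; keep the feasible basic solutions
    (the vertices) and return the minimum objective value found."""
    m, n = len(A), len(A[0])
    best = None
    for cols in combinations(range(n), m):
        Bmat = [[A[i][j] for j in cols] for i in range(m)]
        xb = solve(Bmat, b)
        if xb is None or any(v < -1e-9 for v in xb):  # singular or infeasible
            continue
        x = [0.0] * n
        for j, v in zip(cols, xb):
            x[j] = v
        val = sum(cj * xj for cj, xj in zip(c, x))
        best = val if best is None or val < best else best
    return best

# Hypothetical instance in standard form (m = 2 equations, n = 4 variables):
A = [[1.0, 1.0, 1.0, 0.0],
     [1.0, 2.0, 0.0, 1.0]]
b = [2.0, 3.0]
c = [-1.0, -2.0, 0.0, 0.0]
print(brute_force_lp(A, b, c))   # -3.0
```

Even on this tiny instance the loop inspects all C(4, 2) = 6 column subsets; the count grows exponentially, exactly the problem described above.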

Figure 9.11: d-dimensional hypercubes (d = 1, 2, 3)

9.3.2 Simplex algorithm

1. Start at some vertex x.
2. Repeat:
3. Find a neighboring vertex y (i.e., a vertex reachable from x via an edge of the polyhedron) such
that c·y < c·x.
4. x = y.
5. Until no such y is found.
6. Return x (or "unbounded").

This algorithm returns the correct answer because the polyhedron F is convex (since it is the intersection of
half-spaces). This means that if a vertex is locally optimal, it is globally optimal as well. Furthermore, the
algorithm is guaranteed to examine each vertex at most once, since the cost function is strictly decreasing
during the course of the algorithm. Its running time will depend on the number of vertices that are visited.
In practice, this algorithm works very well (often as fast as linear time) for many pivot rules (rules according
to which the vertex y is chosen when there is more than one vertex matching the condition in step 3).
However, for each of the standard pivot rules, it is possible to construct a pathological example for which
this algorithm runs in exponential time.

Representing Vertices

In order to actually implement the Simplex algorithm, we take a closer look at what vertices really represent
in terms of the algebraic interpretation of the linear programming problem. Recall that a vertex is the
intersection of at least n − m hyperplanes. For simplicity, we here assume that our vertices are the intersection
of exactly n − m hyperplanes, and will not deal with the degenerate case of an intersection of more than n − m
hyperplanes. Recall that a hyperplane is a set of points in the feasible region where one of the variables
of x has been set to zero. Thus, a vertex has the property that n − m variables of x are set to zero, or
equivalently, that there are m non-zero variables, which we refer to as xB(1), xB(2), ..., xB(m).
Note that if we have some subset of our variables in x set to zero, then we only need concern ourselves with
the columns of A corresponding to the non-zero variables: AB(1), AB(2), ..., AB(m). Also note that there are
(n choose m) ways to pick m of our n variables to be non-zero, but not all of the (n choose m) possibilities will correspond
to actual vertices of our polyhedron. Stating the same thing in terms of the geometric interpretation of the
problem, it is not the case that n − m hyperplanes always define a vertex of the polyhedron. There are two
reasons why this might occur:
Figure 9.12: Three planes in R³ that do not intersect at a point (their pairwise intersections are the lines
l1, l2, and l3)

Figure 9.13: An intersection of hyperplanes (lines x1 = 0, x3 = 0, and lines 1, 2, 3) that is not a vertex

1. The hyperplanes either intersect in an infinite number of points or never intersect at all (e.g., parallel
lines in R²).
Example. In three dimensions, consider the projection onto the page of the three planes depicted in Figure
9.12. Each point of intersection corresponds to a shared line: l1, l2, and l3. In R³, these lines are
perpendicular to the page.
The algebraic interpretation of this is that the columns of A corresponding to the non-zero variables
are not linearly independent.

Definition 114 A basic solution is a set of m non-zero variables such that the corresponding columns
are linearly independent.

2. The corresponding hyperplanes may intersect, but the intersection may occur outside of the feasible
region. In this case we call the intersection an infeasible solution.
Example. Figure 9.13 shows a polyhedron in R². If x1 and x3 are set to zero, we get the point that
is the intersection of lines 1 and 3. However, this point is outside the feasible set.
The algebraic interpretation of this is that at the point where the hyperplanes intersect (i.e., when we
solve for x), some of the coordinates of x are negative.
Figure 9.14: Walking from vertex to vertex (a polygon with vertices v1, ..., v5 lying on the lines x1 = 0, ..., x5 = 0)

Definition 115 A basic feasible solution (BFS) is a basic solution that lies in the feasible region.

A basic feasible solution corresponds to a vertex in our polyhedron. Furthermore, we see the following:

Claim 116 A BFS is uniquely specified by a set of m non-zero variables.

Proof: Let B = [AB(1) AB(2) ... AB(m)]. B is called the basis for the BFS. Let x̄ = (xB(1), xB(2), ..., xB(m)).
Since all the xi not in x̄ are zero, we have A·x = B·x̄ = b. Since we have a BFS, the columns of B are
linearly independent. Thus B must have an inverse, and we can solve for x̄:

x̄ = B^(-1)·b    (9.12)

Thus, from Equation 9.12, we can determine the values of the m non-zero variables. Since the remaining
variables are all set to zero, the values for all n variables of a BFS can be obtained from the set of non-zero
variables.
This gives us another interpretation of linear programming. Out of all (n choose m) possible choices of m non-zero
variables, we restrict ourselves to those where (a) the corresponding set of columns of A is linearly
independent, and (b) when we use these columns to solve for x, the values of x that we obtain are all
non-negative. Out of these choices, we want to find the one that minimizes the objective function.
To start the Simplex algorithm, we must find a set of m variables that gives us a BFS. Finding linearly
independent columns is not difficult: we can do this in a greedy fashion. However, ensuring that the solution
is feasible (i.e., that the values of x are non-negative) is harder, and we don't want to look at all (n choose m)
possibilities, since this would be exponentially many. However, once we have a starting vertex, it is much
easier to move from that vertex to find a neighboring vertex. We shall discuss that problem first, and then
return to the problem of finding an initial vertex.

Finding a Neighboring Vertex

Assuming we can find a BFS, we want a method for moving from that vertex to a neighboring vertex in a
high-dimensional space.
Example. Consider the polyhedron in R² depicted in Figure 9.14, where n = 7 and m = 5. We wish to
move from v4 to v5. v4 corresponds to the constraints x4 = 0 and x5 = 0. The idea is to relax the constraint
x4 = 0. To do so, we gradually increase x4 until some other variable (in this case x1) gets set to 0. This
corresponds to starting at v4 and walking along the edge (v4, v5) of our polyhedron until we reach v5.
We now generalize to a higher-dimensional space. We start with a set of m non-zero variables. We want to
exchange a new variable that was initially set to zero with one of our non-zero variables. As in the example
above, we increase our new variable until one of our original non-zero variables becomes zero. Because we
require that all of our xi are non-negative, we can be sure that by increasing a variable we will move in
the correct direction.
Let xB(1), xB(2), ..., xB(m) be our nonzero variables and let AB(1), AB(2), ..., AB(m) be their corresponding
columns. Let xj be the new variable we want to exchange in, and let Aj be its corresponding column. Then,
given our constraint A·x = b, we know

Σ_{i=1}^{m} xB(i)·AB(i) = b.    (9.13)

Since the AB(i)'s are m linearly independent columns with m entries each, they form a basis for R^m. Thus
we can express Aj as a linear combination of these columns:

Aj = Σ_{i=1}^{m} tij·AB(i),  with tij ∈ R.    (9.14)

Adding and subtracting θ·Aj in (9.13), we get:

Σ_{i=1}^{m} xB(i)·AB(i) + θ·Aj − θ·Σ_{i=1}^{m} tij·AB(i) = b.    (9.15)

Rearranging terms in (9.15), we get:

θ·Aj + Σ_{i=1}^{m} (xB(i) − θ·tij)·AB(i) = b    (9.16)

Increasing the parameter θ corresponds to moving along an edge of the polyhedron. As we increase θ, we add
in more of the variable xj, and as a result we also change the other variables to ensure that the constraint
A·x = b is still satisfied. As we increase θ, if there is some tij that is positive, then for some i, xB(i) − θ·tij
will eventually be 0. This corresponds to reaching another vertex along the edge of travel. Define

θ0 = min_{i : tij > 0} ( xB(i) / tij ).    (9.17)

θ0 will be the first value of θ at which some xB(i) − θ·tij becomes zero. Note that if there is a vertex in the direction
of movement, then the set {i | tij > 0} is nonempty, and if all tij ≤ 0, then the problem is unbounded in
that direction.
Let l be the value of i that achieves the minimum in (9.17). Then xB(l) is the variable that gets set to zero first.
In our new solution we will replace xB(l) with xj, which corresponds to setting B(l) = j. We can summarize
this process with the following equation:

x′B(i) = θ0,                 if i = l
x′B(i) = xB(i) − θ0·tij,     if i ≠ l    (9.18)
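Equations (9.17) and (9.18) amount to a simple ratio test, sketched below with hypothetical numbers for the basic values xB(i) and the coefficients tij (the function and variable names are our own).

```python
def pivot_step(x_B, t):
    """One ratio test: x_B[i] holds the current basic variable values and t[i]
    the coefficients t_ij of the entering column. Returns (theta0, l, new_x_B)
    per equations (9.17) and (9.18), or None if no t_ij > 0, in which case the
    problem is unbounded in this direction."""
    candidates = [(x_B[i] / t[i], i) for i in range(len(t)) if t[i] > 0]
    if not candidates:
        return None                      # unbounded direction
    theta0, l = min(candidates)          # tightest ratio wins
    new_x_B = [theta0 if i == l else x_B[i] - theta0 * t[i]
               for i in range(len(x_B))]
    return theta0, l, new_x_B

# The entering variable grows until the tightest ratio is hit (here at i = 1):
print(pivot_step([4.0, 2.0, 6.0], [1.0, 1.0, -2.0]))
# (2.0, 1, [2.0, 2.0, 10.0])
```

Note that position l now holds θ0, the value of the entering variable xj, which matches setting B(l) = j; the basic variable with a negative tij actually grows as we move along the edge.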

Finding an initial vertex

Here we go back to the problem of finding the initial vertex so that we can proceed with the main iteration.
We want to find some point x that satisfies Ax = b which has m non-zero variables. For this step, we do not
care about minimizing the optimization criteria. Consider the following approach. First, recall the original
problem in standard form:

Objective Function: min c·x
Subject to: A·x = b,
x ≥ 0

Now, let's consider an alternative linear programming problem, defined as follows:

Objective Function: min y1 + y2 + ... + ym
Subject to: A·x + y = b,
y ≥ 0 and x ≥ 0

where y is a vector of length m. In this new problem, finding an initial BFS is trivial: x = 0, y = b
(assuming b ≥ 0, which we can ensure by multiplying constraints through by −1 where necessary). Thus,
we can easily start the Simplex algorithm for this new problem, and find an optimal solution to it.
The important thing to note here is that if there is an optimal solution to the alternative problem that
achieves cost 0, it must be the case that Σ_{i=1}^{m} yi = 0. Since that point must have m non-zero variables,
these m non-zero variables must be variables of x. Thus, the point that serves as the optimal solution to
the new problem provides a BFS for the original problem. On the other hand, if there is no solution
to the alternative problem that has cost 0, then there is no BFS for the original problem.
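The construction of this auxiliary problem (constraint matrix [A | I], an objective that counts only the y's, and the trivial starting BFS) can be sketched on a small made-up instance:

```python
def phase_one_problem(A, b):
    """Construct the auxiliary LP  min y1+...+ym  s.t.  A.x + y = b, x, y >= 0.
    The constraint matrix is [A | I] and the initial BFS is x = 0, y = b
    (assuming b >= 0; negate rows first if necessary)."""
    m, n = len(A), len(A[0])
    A_aux = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
             for i, row in enumerate(A)]
    c_aux = [0.0] * n + [1.0] * m      # objective counts only the y's
    x0 = [0.0] * n + list(b)           # trivial starting BFS: x = 0, y = b
    return A_aux, c_aux, x0

A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]   # a small hypothetical instance
b = [3.0, 2.0]
A_aux, c_aux, x0 = phase_one_problem(A, b)

# The starting point satisfies A_aux . x0 = b, so Simplex can begin here.
residual = [sum(aij * xj for aij, xj in zip(row, x0)) - bi
            for row, bi in zip(A_aux, b)]
print(residual)   # [0.0, 0.0]
```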

9.3.3 Running Time Analysis

Despite intense study, the running time of the Simplex algorithm is not completely understood. We
do know that a move between two vertices can be performed in O(mn) time (linear in the input size). This
is accomplished with the use of a data structure called a tableau [K91]. Thus, the total running time of the
algorithm is O(mn) times the number of traversed vertices.
The number of traversed vertices depends on the pivot rule. The pivot rule is the method used to decide
which edge to traverse, given the choice between multiple possible edges. There are various alternatives for
pivot rules that one can consider. Here are some of them:

We could choose the edge that gives us the steepest gradient in the cost function.
We could choose the edge that gives us the largest total change.
We could choose randomly from possible candidates.

It turns out that these alternatives are not the best thing to do in some special cases. Klee and Minty [KM72]
showed that in the case of a perturbed hypercube (also known as the Klee-Minty polytope), Simplex with
the first pivot rule visits all vertices even though the cost function is monotonically decreasing throughout
the course of the Simplex algorithm. For most natural pivot rules, there are example inputs that are known
to require exponential time.
An open question remains: Is there an efficiently computable pivot rule such that Simplex is guaranteed to
run in polynomial time?
Despite all the debate about the running time, in practice the Simplex algorithm is very efficient, and it is
widely used in many mathematical and data-analysis packages. Very recent work that tries to explain the
discrepancy between the worst-case running time and the efficiency of Simplex in practice can be found in
[ST2001].
We have not shown that linear programming is in P, but this is the case. The first polynomial-time algorithm,
which uses the ellipsoid method, was introduced by Khachian [K79]. Despite its theoretical importance,
this algorithm has a high-degree polynomial running time, and it is generally considered impractical. A better
alternative, which uses the interior-point method, was later introduced by Karmarkar [K84]. For further
information on Simplex and these related algorithms, please see [PS82], [S88], or [K91].

9.4 Duality

We next cover the concept of duality. The dual of a problem is a closely related problem whose solutions
are tightly coupled to those of the original. An example we saw earlier in the course is the
Max-Flow Min-Cut Theorem: the maximum flow in a graph between a pair of vertices
is equal to the minimum cut between those two vertices. Duality is an important concept in the design of
algorithms, and it has a number of uses. In particular, it can be used for:

1. Designing polynomial time algorithms for a problem.

2. Proving the optimality of a solution (as we did for the Max-Flow problem).

3. Finding the solution to a linear program efficiently.

Recall that any linear program with m constraints over n variables can be expressed in canonical form as
follows:

Minimize C·x
such that A·x ≥ b    (9.19)
x ≥ 0

This is known as the original, or primal, representation of the problem. For each such LP, there is also a
related dual linear program, which has the following form:

Maximize y·b
such that y·A ≤ C    (9.20)
y ≥ 0

Observe that this is just a transformation of the original problem (9.19), with the following modifications:

1. We maximize instead of minimize.
2. We exchange the roles of C and b.
3. The constraints are ≤ instead of ≥.
4. The vector y that is being solved for has m components and n constraints, instead of the vector x that
is used for the original problem, which has n components and m constraints.
5. y·A is computed in place of A·x.

It turns out that the dual of the dual is the original primal. We leave it as an exercise to show this.
We show shortly that the optimal answer (if one exists) for both problems is the same. This characteristic is
exploited by primal-dual algorithms, which run both the primal and dual versions of a problem simultaneously
and use the solution from whichever of the two completes first. This can avoid some instances where the input
would be slow to converge to the optimal solution in one of the two problems.

9.4.1 Vitamin Sellers Problem

Recall the original diet problem from lecture 23:

xi is the amount of food i to consume.
C·x is the cost of the diet, which should be minimized.
A·x ≥ b represents the nutritional constraints of the diet.
x ≥ 0 specifies that the diet must consist of non-negative food quantities.

The objective is to find a diet satisfying the specified nutritional goal while minimizing the cost incurred by
that diet. This problem has a dual, which we will call the Vitamin Seller's Problem. This dual describes
the nutrient problem from the perspective of a seller of vitamins who wishes to maximize the revenue from the
vitamins being sold, while staying competitive with the cost of buying the nutrients in food form. This new
problem is defined as follows:

yi is the selling price of nutrient i.
y·b is the total cost of the daily required nutrients.
y·A ≤ C represents the cost constraints.
y ≥ 0 specifies that the prices must be non-negative.

The objective function is the cost of the required nutrients, and we wish to maximize this cost. The cost
constraints represent the fact that the price of each vitamin must be set to be competitive with the cost of
real food. In particular, the total price of the vitamins contained in a given food item must be no more
expensive than actually buying that food item.
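To make the cost constraints concrete, here is a small sketch with made-up nutrient data (not from the notes) checking whether a candidate price vector y is competitive in the sense of yA ≤ C:

```python
# Hypothetical data: A[i][j] = amount of nutrient i in food j, C[j] = cost of
# food j, y[i] = candidate selling price of nutrient i.
# The dual constraint yA <= C says: the total price of the nutrients contained
# in any food must not exceed the cost of that food.

A = [[2, 1],    # nutrient 0 content of foods 0 and 1
     [1, 3]]    # nutrient 1 content of foods 0 and 1
C = [3, 5]      # food costs

def is_competitive(y, A, C):
    """Check yA <= C componentwise: each food's vitamin value <= its cost."""
    m, n = len(A), len(A[0])
    for j in range(n):
        vitamin_value = sum(y[i] * A[i][j] for i in range(m))
        if vitamin_value > C[j]:
            return False
    return True

print(is_competitive([1.0, 1.0], A, C))  # -> True:  yA = (3, 4) <= (3, 5)
print(is_competitive([2.0, 1.0], A, C))  # -> False: yA = (5, 5), and 5 > 3
```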

9.4.2 Weak and Strong Duality

Note that in the optimal solution to the vitamin sellers problem, the cost of buying those vitamins can be no
more expensive than satisfying the equivalent vitamin requirements by consuming real food. This is actually
part of a general property of duality, expressed in the following theorem:

Theorem 117 (Weak Duality) If x0 and y0 are feasible solutions for the primal and dual problems
respectively, then Cx0 ≥ y0 b.
Proof: First, we can use the fact that y0 ≥ 0:

Ax0 ≥ b  implies  y0 Ax0 ≥ y0 b

Similarly, using x0 ≥ 0,

y0 A ≤ C  implies  y0 Ax0 ≤ Cx0

And therefore,

Cx0 ≥ y0 Ax0 ≥ y0 b
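The chain of inequalities in this proof can be checked numerically on a small made-up instance (the data below is illustrative, not from the notes):

```python
# Check the weak duality chain Cx0 >= y0 A x0 >= y0 b for one feasible primal
# point x0 and one feasible dual point y0 (hypothetical data).

A = [[2, 1], [1, 3]]
b = [4, 6]
C = [3, 5]
x0 = [2, 2]   # primal-feasible: Ax0 = (6, 8) >= (4, 6), x0 >= 0
y0 = [1, 1]   # dual-feasible:  y0 A = (3, 4) <= (3, 5), y0 >= 0

m, n = len(A), len(A[0])
Cx0 = sum(C[j] * x0[j] for j in range(n))
y0Ax0 = sum(y0[i] * A[i][j] * x0[j] for i in range(m) for j in range(n))
y0b = sum(y0[i] * b[i] for i in range(m))

print(Cx0, y0Ax0, y0b)       # -> 16 14 10
print(Cx0 >= y0Ax0 >= y0b)   # -> True
```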

Furthermore, it is possible to make a stronger claim about duality.

Theorem 118 (Strong Duality) Exactly one of the following is true:

1. The primal and dual have finite optima, which are equal.

2. The primal is unbounded, and the dual is infeasible.

3. The primal is infeasible, and the dual is unbounded.

4. Both the primal and the dual are infeasible.

The proof of the Strong Duality theorem will not be covered in this course due to lack of time, but may
be found in [PS82]. Note that this theorem supports the primal-dual algorithm strategy described in the
previous section, because a solution from either the primal or dual can be used to infer the solution to the
other. Next, we will present the dual of a familiar problem, Max-Flow.

9.4.3 Max-Flow Linear Programming Dual

Consider the following linear programming problem for the network flow problem:

Variables: ui for each vertex i
           wij for each edge (i, j)
Objective function: minimize Σij cij wij, where the cij are the edge capacities of the input graph
Subject to constraints: ∀(i, j): wij ≥ uj − ui
                        ut − us ≥ 1
                        wij ≥ 0

Exercise: show that this is the dual of the linear program described as equivalent to the Max-Flow problem.
Based on our previous experience with this problem, our intuition is that the dual of the Max-Flow problem
should correspond to the Min-Cut problem. This is in fact the case.

Claim 119 The optimal solution of the dual of the Max-Flow problem corresponds to the Min-Cut problem.

Although not shown here, the basic idea is that the optimal assignment of our variables will correspond to
a minimum cut. In particular, there is an optimal solution of the following form:

ut = 1
us = 0
∀i, ui ∈ {0, 1}

Therefore, an S-T cut can be defined as

S = {i | ui = 0}
T = {i | ui = 1}

Since all of the capacities cij are non-negative, we know that wij = max(0, uj − ui), and so wij will be 1 for
edges which cross the cut and 0 for those that do not. Thus Σij cij wij is the total capacity across this cut.
Minimizing this value corresponds to finding the minimum cut.
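This correspondence can be checked on a small made-up graph (illustrative only): the sketch below builds the 0/1 assignment described above from a cut and confirms that the LP objective equals the cut's capacity.

```python
# Hypothetical 4-vertex flow network: cap[(i, j)] is the capacity of edge (i, j).
cap = {(0, 1): 3, (0, 2): 2, (1, 3): 2, (2, 3): 3, (1, 2): 1}
s, t = 0, 3
S = {0}  # a candidate min cut: S = {s}, T = everything else

# Build the dual-LP assignment from the cut: ui = 0 on S, 1 on T,
# and wij = max(0, uj - ui), which is 1 exactly on edges crossing S -> T.
u = {v: (0 if v in S else 1) for v in range(4)}
w = {e: max(0, u[e[1]] - u[e[0]]) for e in cap}

# Feasibility: wij >= uj - ui holds by construction; check ut - us >= 1.
assert u[t] - u[s] >= 1

objective = sum(cap[e] * w[e] for e in cap)
cut_capacity = sum(cap[(i, j)] for (i, j) in cap if i in S and j not in S)
print(objective, cut_capacity)  # -> 5 5
```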
