
BLOCK INTRODUCTION

Searching and sorting techniques are two central issues in data structures. The block is organized around
these topics. The first unit takes up two searching techniques: sequential search and binary search. The last
two units describe internal and external sorting techniques. External sorting methods are employed to sort
records of files which are too large to fit in the main memory of the computer. These methods involve as
much external processing as processing by the CPU. External sorting methods depend to a large extent on system
considerations like the type of device and the number of such devices that can be used at a time.

UNIT 1 SEARCHING TECHNIQUES


Structure
1.0 Introduction
1.1 Objectives
1.2 Sequential Search
1.3 Binary Search
1.4 Summary
1.5 Exercises
Suggested Readings

1.0 INTRODUCTION
Information retrieval in the required format is the central activity in all computer applications. It
involves searching, sorting and merging. This block deals with all three; in this unit we will be
concerned with searching techniques.

Searching methods are designed to take advantage of the file organisation and optimize the search for a
particular record or to establish its absence. The file organisation and searching method chosen can make a
substantial difference to an application's performance.

We will now discuss two searching methods and analyze their performance. These two methods are:

- The sequential search

- The binary search

1.1 OBJECTIVES
After going through this unit you will be able to:

- list searching methods

- discuss the algorithms of sequential search and binary search

- analyse the performance of searching methods

1.2 SEQUENTIAL SEARCH


This is the most natural searching method. Simply put, it means going through a list or a file till the required
record is found. It makes no demands on the ordering of records. The algorithm for a sequential search
procedure is now presented.

ALGORITHM: SEQUENTIAL SEARCH

This represents the algorithm to search a list of values to find the required one.

INPUT: List of size N. Target value T

OUTPUT: Position of T in the list - I

BEGIN
1.  Set FOUND to false
    Set I to 0

2.  While (I < N) and (FOUND is false)
        If LIST[I] = T
            FOUND = true
        Else
            I = I + 1

3.  If FOUND is false
        T is not present in LIST
END

This algorithm can easily be extended for searching for a record with a matching key value.
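The following is a minimal C rendering of the above algorithm; the function name and the use of an integer array are illustrative choices, and a return value of -1 signals absence.

    #include <stdio.h>

    /* Sequential search: returns the position of target in list[0..n-1],
       or -1 if the target is not present. */
    int seq_search(const int list[], int n, int target)
    {
        for (int i = 0; i < n; i++)
            if (list[i] == target)
                return i;               /* found: report the position */
        return -1;                      /* list exhausted: target absent */
    }

    int main(void)
    {
        int a[] = {7, 3, 9, 1, 5};
        printf("%d\n", seq_search(a, 5, 9));   /* prints 2 */
        printf("%d\n", seq_search(a, 5, 4));   /* prints -1 */
        return 0;
    }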

Analysis of Sequential Search

Whether the sequential search is carried out on lists implemented as arrays or linked lists or on files, the
critical part of the performance is the comparison loop (step 2). Obviously, the fewer the number of
comparisons, the sooner the algorithm will terminate.

The fewest possible number of comparisons is 1, when the required item is the first item in the list. The maximum
is N, when the required item is the last item in the list. Thus, if the required item is in position I
in the list, I comparisons are required.

Hence the average number of comparisons done by sequential search is

    (1 + 2 + 3 + ... + I + ... + N) / N = N(N + 1) / (2N) = (N + 1)/2

Sequential search is easy to write and efficient for short lists. It does not require sorted data. However, it is
disastrous for long lists. There is no way of quickly establishing that the required item is not in the list, or of
finding all occurrences of a required item in one place.
We can overcome these deficiencies with the next searching method namely the Binary search.

1.3 BINARY SEARCH


The drawbacks of sequential search can be eliminated if it becomes possible to eliminate large portions of
the list from consideration in subsequent iterations. The binary search method does just that: it halves the size of
the list to search in each iteration.

Binary search can be explained simply by the analogy of searching for a page in a book. Suppose you were
searching for page 90 in a book of 150 pages. You would first open it at random towards the latter half of the
book. If the page is less than 90, you would open at a page to the right; if it is greater than 90, you would
open at a page to the left, repeating the process till page 90 was found. As you can see, by the first
instinctive probe, you dramatically reduced the number of pages to search.

Binary search requires sorted data to operate on since the data may not be contiguous like the pages of a
book. We cannot guess which quarter of the data the required item may be in. So we divide the list in the
centre each time.
We will first illustrate binary search with an example before going on to formulate the algorithm and
analysing it.

Example: Use the binary search method to find 'Scorpio' in the following list of 11 zodiac signs, arranged in alphabetical order:

Aquarius, Aries, Cancer, Capricorn, Gemini, Leo, Libra, Pisces, Sagittarius, Scorpio, Taurus

This is a sorted list of size 11. The first comparison is with the middle element, number 6, i.e. Leo. This
eliminates the first 5 elements. The second comparison is with the middle element from 7 to 11, i.e. 9,
Sagittarius. This eliminates elements 7 to 9. The third comparison is with the middle element from 10 to 11, i.e. 10,
Scorpio. Thus we have found the target in 3 comparisons. Sequential search would have taken 10
comparisons. We will now formulate the algorithm for binary search.

ALGORITHM BINARY SEARCH

This represents the binary search method to find a required item in a list sorted in increasing order.

INPUT: Sorted LIST of size N, Target value T

OUTPUT: Position of T in the LIST = I

BEGIN
{
    MIN = 0;
    MAX = N - 1;
    FOUND = 0;
    WHILE ((FOUND == 0) && (MAX >= MIN))
    {
        MID = (INT)((MAX + MIN) / 2);

        IF (T == LIST[MID])
        {
            I = MID;
            FOUND = 1;
        }
        ELSE
        {
            IF (T < LIST[MID])
                MAX = MID - 1;
            ELSE
                MIN = MID + 1;
        }
    }
}
END

It is recommended that the student apply this algorithm to some examples.
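As an aid to such experiments, here is a sketch of the algorithm in C; the function name is an illustrative choice, the list is assumed to be sorted in increasing order, and -1 signals absence.

    #include <stdio.h>

    /* Binary search: returns the position of target in list[0..n-1]
       (sorted in increasing order), or -1 if the target is absent. */
    int bin_search(const int list[], int n, int target)
    {
        int min = 0, max = n - 1;
        while (min <= max) {
            int mid = (min + max) / 2;     /* middle of the current sub-list */
            if (list[mid] == target)
                return mid;
            else if (target < list[mid])
                max = mid - 1;             /* discard the upper half */
            else
                min = mid + 1;             /* discard the lower half */
        }
        return -1;                         /* interval empty: not found */
    }

    int main(void)
    {
        int a[] = {2, 3, 5, 8, 13, 21, 34};
        printf("%d\n", bin_search(a, 7, 13));  /* prints 4 */
        printf("%d\n", bin_search(a, 7, 6));   /* prints -1 */
        return 0;
    }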

Analysis of Binary search

In general, the binary search method needs no more than ⌊log2 N⌋ + 1 comparisons. This implies that for an
array of a million entries, only about twenty comparisons will be needed. Contrast this with the case of
sequential search, which on the average needs (N + 1)/2 comparisons.
The condition (MAX >= MIN) is necessary to ensure that the loop terminates even in the case that the
required element is not present. Consider the example of zodiac signs, and suppose the 10th item were Solar (an
imaginary zodiac sign), so that Scorpio is now absent. After the comparisons with Leo and Sagittarius we would have

MIN = 10
MAX = 11
MID = (10 + 11) DIV 2 = 10

Since Scorpio < Solar, we get

MAX = MID - 1 = 9

Now MAX < MIN, so the loop terminates. Since FOUND is false, we conclude that the target was not found.

In the binary search method just described, it is always the key in the middle of the list currently
being examined that is used for comparison. The splitting of the list can be illustrated through a binary
decision tree in which the value of a node is the index of the key being tested. Suppose there are 31 records;
then the first key compared is at location 16 of the list, since (1 + 31)/2 = 16. If the key is less than the key
at location 16, then location 8 is tested, since (1 + 15)/2 = 8; if the key is greater than the key at location 16, then
location 24 is tested, since (17 + 31)/2 = 24. The binary tree describing this process is shown below (Figure 1).

Figure 1. Searching Process In Binary Search

1.4 SUMMARY
This unit concentrated on searching techniques used for information retrieval. The sequential search method
was seen to be easy to implement and relatively efficient for small lists, but very time consuming for
long unsorted lists. The binary search method is an improvement, in that it eliminates half the list from
consideration at each iteration. It has checks incorporated to ensure speedy termination under all possible
conditions. It requires only about twenty comparisons for a million records and is hence very efficient. The
prerequisite for it is that the list should be sorted in increasing order.

1.5 EXERCISES
1. Implement the sequential search algorithm to search a linked list (in C language).

2. Modify the above program to find all occurrences of a required item in the list.

3. Implement the binary search algorithm to search a list implemented as an array (in C).

4. Modify the above to find all occurrences of a given item.

5. Modify the binary search algorithm to search a list sorted in descending order.

SUGGESTED READINGS
How to Solve it by Computer: R.G. Dromey, PHI

Data Structures & Program Design: Robert L. Kruse, PHI

UNIT 2 SORTING TECHNIQUES - 1


Structure
2.0 Introduction
2.1 Objectives
2.2 Internal Sort
2.2.1 Insertion Sort
2.2.2 Bubble Sort
2.2.3 Quick Sort
2.2.4 2-way Merge Sort
2.2.5 Heap Sort
2.3 Sorting on Several Keys
2.4 Summary
2.5 Review Questions
2.6 Programming Exercises
2.7 Suggested Readings

2.0 INTRODUCTION
Retrieval of information is made easier when it is stored in some predefined order. Sorting is, therefore, a
very important computer application activity. Many sorting algorithms are available. Differing
environments require differing sorting methods. Sorting algorithms can be characterized in the following
two ways:

1. Simple algorithms which require on the order of n² (written as O(n²)) comparisons to sort n items.

2. Sophisticated algorithms that require O(n log2 n) comparisons to sort n items.

The difference lies in the fact that the first method moves data only over small distances in the process of
sorting, whereas the second method moves data over large distances, so that items settle into the proper
order sooner, thus resulting in fewer comparisons. Performance of a sorting algorithm can also depend on
the degree of order already present in the data.

There are two basic categories of sorting methods: internal sorting and external sorting. Internal
sorting is applied when the entire collection of data to be sorted is small enough that the sorting can take
place within main memory. The time required to read or write is not considered to be significant in
evaluating the performance of internal sorting methods. External sorting methods are applied to larger
collections of data which reside on secondary devices; here, read and write access times are a major concern in
determining sort performance.

In this unit we will study some methods of internal sorting. The next unit will discuss methods of external
sorting.

2.1 OBJECTIVES
After going through this unit you will be able to :

List internal sorting methods

Discuss and analyze the performance of several sorting methods

Describe sorting methods on several keys

2.2 INTERNAL SORTING


In internal sorting, all the data to be sorted is available in the high-speed main memory of the computer. We
will study the following methods of internal sorting:
1. Insertion sort

2. Bubble sort

3. Quick sort

4. 2-way Merge sort

5. Heap sort

2.2.1 INSERTION SORT


This is a naturally occurring sorting method exemplified by a card player arranging the cards dealt to him.
He picks up the cards as they are dealt and inserts them into the required position. Thus at every step, we
insert an item into its proper place in an already ordered list.

We will illustrate insertion sort with an example (Figure 1) before presenting the formal algorithm.

Example 1: Sort the following list using the insertion sort method:

Figure 1: Insertion sort

Thus, to find the correct position, search the list till an item just greater than the target is found. Shift all the
items from this point one position down the list. Insert the target in the vacated slot.

We now present the algorithm for insertion sort.

ALGORITHM: INSERT SORT

INPUT: LIST[ ] of N items in random order.

OUTPUT: LIST[ ] of N items in sorted order.


1.  BEGIN

2.  FOR I = 2 TO N DO

3.      BEGIN
4.          IF LIST[I] < LIST[I-1] THEN

5.          BEGIN

6.              J = I; FOUND = FALSE

7.              T = LIST[I]                 /* STORE LIST[I] */

8.              REPEAT                      /* MOVE LARGER ITEMS DOWN THE LIST */

9.                  J = J - 1

10.                 LIST[J+1] = LIST[J]

11.                 IF (J = 1) OR (LIST[J-1] <= T) THEN

12.                     FOUND = TRUE

13.             UNTIL (FOUND = TRUE)

14.             LIST[J] = T                 /* INSERT IN THE VACATED SLOT */

15.         END

16.     END

17. END
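A compact C version of the same method is sketched below; the function name is an illustrative choice, and the inner loop combines the comparison and the shifting of larger items.

    /* Insertion sort: at step i, list[0..i-1] is already sorted; list[i]
       is lifted out and inserted into its proper place. */
    void insertion_sort(int list[], int n)
    {
        for (int i = 1; i < n; i++) {
            int t = list[i];                   /* item to insert */
            int j = i;
            while (j > 0 && list[j - 1] > t) {
                list[j] = list[j - 1];         /* shift larger items down */
                j--;
            }
            list[j] = t;                       /* vacated slot */
        }
    }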
2.2.2 BUBBLE SORT

In this sorting algorithm, multiple swappings take place in one pass. Smaller elements move, or 'bubble', up
to the top of the list, hence the name given to the algorithm.

In this method, adjacent members of the list to be sorted are compared. If the item on top is greater than the
item immediately below it, they are swapped. This process is carried on till the list is sorted.

The detailed algorithm follows:

ALGORITHM BUBBLE SORT

INPUT: LIST [ ] of N items in random order.

OUTPUT: LIST [ ] of N items sorted in ascending order.

1. SWAP = TRUE; PASS = 0

2. WHILE SWAP = TRUE DO

   BEGIN

   2.1 SWAP = FALSE

   2.2 FOR I = 0 TO (N - PASS - 2) DO

       BEGIN

       2.2.1 IF A[I] > A[I+1] THEN
             BEGIN

                 TMP = A[I]

                 A[I] = A[I+1]

                 A[I+1] = TMP

                 SWAP = TRUE
             END

       END

   2.3 PASS = PASS + 1

   END

The total number of comparisons in bubble sort is

= (N-1) + (N-2) + ... + 2 + 1

= (N-1)N / 2 = O(N²)

This inefficiency is due to the fact that an item moves only to the next position in each pass.
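For comparison, a C sketch of bubble sort follows; the names are illustrative. Resetting the flag once per pass (rather than in an else branch) is what lets the sort stop early on an already-sorted list.

    /* Bubble sort: each pass swaps adjacent out-of-order items; after
       pass p the largest p items are in their final positions. */
    void bubble_sort(int a[], int n)
    {
        int swapped = 1;
        for (int pass = 0; swapped && pass < n - 1; pass++) {
            swapped = 0;                        /* reset once per pass */
            for (int i = 0; i < n - pass - 1; i++) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i];             /* swap the pair */
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                    swapped = 1;
                }
            }
        }
    }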

2.2.3 QUICK SORT


This is the most widely used internal sorting algorithm. In its basic form, it was invented by C.A.R. Hoare
in 1960. Its popularity lies in the ease of implementation, moderate use of resources and acceptable
behaviour for a variety of sorting cases. The basis of quick sort is the 'divide and conquer' strategy, i.e.
divide the problem [the list to be sorted] into sub-problems [sub-lists] until solved sub-problems [sorted sub-
lists] are found. This is implemented as:
Choose one item A[I] from the list A[ ].

Rearrange the list so that this item is in the proper position i.e. all preceding items have a lesser value and
all succeeding items have a greater value than this item.

1. A[0], A[1] .. A[I-1] in sub list 1

2. A[I]

3. A[I + 1], A[I + 2] ... A[N] in sublist 2

Repeat steps 1 & 2 for sublist1 & sublist2 till A[ ] is a sorted list.

As can be seen, this algorithm has a recursive structure.

Step 2 or the 'divide' procedure is of utmost importance in this algorithm. This is usually implemented as
follows:

1. Choose A[I] as the dividing element.

2. From the left end of the list (A[0] onwards), scan till an item A[L] is found whose value is greater
than A[I].

3. From the right end of the list (A[N] backwards), scan till an item A[R] is found whose value is less
than A[I].

4. Swap A[L] & A[R].

5. Continue steps 2, 3 & 4 till the scan pointers cross. Stop at this stage.

6. At this point sublist1 & sublist2 are ready.

7. Now do the same for each of sublist1 & sublist2.

We will now give the implementation of quicksort and illustrate it by an example.

Quicksort (int A[], int X, int I)
{
    int L, R, V;
1.  If (I > X)
    {
2.      V = A[I]; L = X - 1; R = I;

3.      For (;;)
        {
4.          While (A[++L] < V);          /* scan from the left */

5.          While (A[--R] > V);          /* scan from the right */

6.          If (L >= R)                  /* left & right pointers have crossed */

7.              break;

8.          Swap (A, L, R);              /* swap A[L] & A[R] */
        }
9.      Swap (A, L, I);                  /* put the dividing element in place */

10.     Quicksort (A, X, L - 1);

11.     Quicksort (A, L + 1, I);
    }
}


Quicksort is called with (A, 0, N-1) to sort the whole file.

Example: Consider the following list to be sorted in ascending order. 'ADD YOUR MAN'. (Ignore blanks)

N = 10

0 1 2 3 4 5 6 7 8 9
A[ ] = A D D Y O U R M A N

Quicksort ( A, 0, 9)

1. 9 > 0
2. V = A[9] = 'N'
   L = 0 - 1 = -1
   R = I = 9

4. A[3] = 'Y' > V; therefore, L = 3

5. A[8] = 'A' < V; therefore, R = 8
6. L < R
8. SWAP (A, 3, 8) to get

0 1 2 3 4 5 6 7 8 9
A[ ] = A D D A O U R M Y N

4. A[4] = 'O' > V; therefore, L = 4

5. A[7] = 'M' < V; therefore, R = 7
6. L < R
8. SWAP (A, 4, 7) to get

0 1 2 3 4 5 6 7 8 9
A[]= A D D A M U R O Y N

4. A[5] = 'U' > V; therefore, L = 5

5. A[4] = 'M' < V; therefore, R = 4
6. L > R; therefore (step 7) break out of the loop
9. SWAP (A, 5, 9) to get

0 1 2 3 4 5 6 7 8 9
A[ ] = A D D A M N R O Y U

at this point 'N' is in its correct place.

A[5], A[0] to A[4] constitutes sub list 1.

A[6] to A[9] constitutes sublist2. Now

10. Quick sort (A, 0, 4)


11. Quick sort (A, 5, 9)

The quicksort algorithm uses O(N log2 N) comparisons on average. The performance can be improved
by keeping in mind the following points:

1. Switch to a faster sorting scheme like insertion sort when the sublist size becomes comparatively
small.

2. Use a better dividing element in the implementation. We have always used A[N] as the dividing
element. A useful method for the selection of a dividing element is the median-of-three method:

Select any 3 elements from the list. Use the median of these as the dividing element.
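A possible C sketch of the median-of-three selection is given below. It assumes the three elements sampled are the first, middle and last items of the sub-list (a common choice, though any 3 elements will do); the function returns the index to use as the dividing element.

    /* Return the index of the median of a[lo], a[mid] and a[hi],
       where mid is the middle of the sub-list a[lo..hi]. */
    int median_of_three(const int a[], int lo, int hi)
    {
        int mid = (lo + hi) / 2;
        if (a[lo] < a[mid]) {
            if (a[mid] < a[hi]) return mid;        /* a[lo] < a[mid] < a[hi] */
            return (a[lo] < a[hi]) ? hi : lo;      /* median of lo and hi */
        } else {
            if (a[lo] < a[hi]) return lo;          /* a[mid] <= a[lo] < a[hi] */
            return (a[mid] < a[hi]) ? hi : mid;    /* median of mid and hi */
        }
    }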

2.2.4 2-WAY MERGE SORT


Merge sort is also one of the 'divide and conquer' class of algorithms. The basic idea is to divide
the list into a number of sub-lists, sort each of these sub-lists and merge them to get a single sorted list. The
recursive implementation of 2-way merge sort divides the list into 2, sorts the sub-lists and then merges
them to get the sorted list. The iterative implementation of 2-way merge sort sees the input initially as n
lists of size 1. These are merged pairwise to get n/2 lists of size 2. These n/2 lists are merged pairwise, and so on till
a single list is obtained. This can be better understood by the following example. This is also called
CONCATENATE SORT.

Figure 2 : 2-way.merge sort

We give here the recursive implementation of 2-way merge sort.

Mergesort (int LIST[ ], int low, int high)
{
    int mid;
1.  If (low >= high) return;             /* a list of one item is already sorted */
2.  mid = (low + high)/2;
3.  Mergesort (LIST, low, mid);
4.  Mergesort (LIST, mid + 1, high);
5.  Merge (low, mid, high, LIST, FINAL);
6.  Copy FINAL[low..high] back into LIST[low..high];
}

Merge (int low, int mid, int high, int LIST[ ], int FINAL[ ])
{
    int a, b, c, d;

    a = low; b = low; c = mid + 1;

    While (a <= mid and c <= high) do
    {
        If (LIST[a] <= LIST[c]) then
        {
            FINAL[b] = LIST[a];
            ++a;
        }
        else
        {
            FINAL[b] = LIST[c];
            ++c;
        }
        ++b;
    }

    If (a > mid) then                    /* left run exhausted: copy the rest of the right run */
        For (d = c; d <= high; ++d)
        {
            FINAL[b] = LIST[d];
            ++b;
        }
    Else                                 /* right run exhausted: copy the rest of the left run */
        For (d = a; d <= mid; ++d)
        {
            FINAL[b] = LIST[d];
            ++b;
        }
}

To sort the entire list, Mergesort should be called with (LIST, 0, N-1).

Mergesort is the best method for sorting linked lists in random order. The total computing time is
O(n log2 n).

The disadvantage of using mergesort is that it requires two arrays of the same size and type for the merge
phase. That is, to sort a list of size n, it needs space for 2n elements.
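The iterative view described earlier (n lists of size 1, merged to n/2 lists of size 2, and so on) can be coded directly. The sketch below is one way of doing so in C, assuming, as the text requires, a second array final[] of the same size; all names are illustrative.

    #include <string.h>

    /* Merge the two sorted runs list[lo..mid] and list[mid+1..hi]
       into final[lo..hi]. */
    void merge_runs(const int list[], int final[], int lo, int mid, int hi)
    {
        int a = lo, c = mid + 1, b = lo;
        while (a <= mid && c <= hi)
            final[b++] = (list[a] <= list[c]) ? list[a++] : list[c++];
        while (a <= mid) final[b++] = list[a++];   /* copy any leftovers */
        while (c <= hi)  final[b++] = list[c++];
    }

    /* Bottom-up 2-way merge sort: runs of width 1, 2, 4, ... are
       merged pairwise until a single run of width n remains. */
    void mergesort_bottom_up(int list[], int final[], int n)
    {
        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = lo + width - 1;
                int hi = lo + 2 * width - 1;
                if (mid >= n) mid = n - 1;          /* lone run at the end */
                if (hi >= n) hi = n - 1;
                merge_runs(list, final, lo, mid, hi);
            }
            memcpy(list, final, n * sizeof(int));   /* runs of 2*width ready */
        }
    }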

2.2.5 HEAP SORT


We will begin by defining a new structure, the heap. We have studied binary trees in Block 5, Unit 1. A
binary tree is illustrated below.

Figure 3(a): Heap 1
A complete binary tree is said to satisfy the 'heap condition' if the key of each node is greater than or equal
to the key in its children. Thus the root node will have the largest key value.

Trees can be represented as arrays, by first numbering the nodes (starting from the root) from left to right.
The key values of the nodes are then assigned to array positions whose index is given by the number of the
node. For the example tree, the corresponding array would be

The relationships of a node can also be determined from this array representation. If a node is at position j,
its children will be at positions 2j and 2j + 1. Its parent will be at position ⌊j/2⌋.

Consider the node M. It is at position 5. Its parent node is, therefore, at position ⌊5/2⌋ = 2, i.e. the parent
is R. Its children are at positions 2 × 5 and (2 × 5) + 1, i.e. 10 and 11 respectively, i.e. E and I are its children. We see
from the pictorial representation that these relationships are correct.
A heap is a complete binary tree, in which each node satisfies the heap condition, represented as an array.
We will now study the operations possible on a heap and see how these can be combined to generate a
sorting algorithm.

The operations on a heap work in 2 steps:

1. The required node is inserted/deleted/replaced.

2. Step 1 may cause violation of the heap condition, so the heap is traversed and modified to rectify any
such violations.
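As a concrete illustration of these two steps, here is a C sketch of insertion into an array-based max-heap, using the 1-based indexing of the text (children of node j at 2j and 2j + 1, parent at ⌊j/2⌋); the names are illustrative, and char keys match the letter examples.

    /* Insert key into a max-heap of *n elements stored in heap[1..*n]. */
    void heap_insert(char heap[], int *n, char key)
    {
        int j = ++(*n);
        heap[j] = key;                  /* step 1: add at the next free leaf */
        /* step 2: walk up, swapping with the parent while the heap
           condition (parent >= child) is violated */
        while (j > 1 && heap[j / 2] < heap[j]) {
            char t = heap[j / 2];
            heap[j / 2] = heap[j];
            heap[j] = t;
            j = j / 2;
        }
    }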

Examples

Insertion

Consider the insertion of a node R into heap 1.

1. Initially R is added as the right child of J and given the number 13.

2. But R > J, so the heap condition is violated.

3. Move R up to position 6 and move J down to position 13.

4. R > P; therefore, the heap condition is still violated.

5. Swap R and P.

6. The heap condition is now satisfied by all nodes, to get:
3(b) : Heap 2
Deletion

Consider the deletion of M from heap 2.

1. The larger of M's children is promoted to position 5.

Figure 3(c): Heap 3


An efficient sorting method can be based on heap construction and removal of nodes from the heap in order. This
algorithm is guaranteed to sort n elements in O(n log n) steps.

We will first see 2 methods of heap construction and then removal in order from the heap to sort the list.

1. Top down heap construction

- Insert items into an initially empty heap, keeping the heap condition inviolate at all steps.

2. Bottom up heap construction

- Build a heap with the items in the order presented.

- From the rightmost node, modify the tree to satisfy the heap condition.

We will illustrate both methods with an example.

Example: Build a heap of the following using both methods of construction.

PROFESSIONAL

Top down construction


Figure 4: Heap Sort (Top down Construction)

Figure 5: Heap Sort by bottom-up approach

We will now see how sorting takes place using the heap built by the top-down approach. The sorted
elements will be placed in X[ ], an array of size 12.

1. Remove S and store in X[12]

(b)

2. Remove S and store in X[11]

(c)

9. Similarly, the remaining nodes are removed and the heap modified, to get the sorted list:

A E F I L N O O P R S S

Figure 6 : Sorting process through Heap

2.3 SORTING ON SEVERAL KEYS


So far we have been considering sorting based on single keys. But in real-life applications we may want to
sort data on several keys. The simplest example is that of sorting a deck of cards. The first key for
sorting is the suit: clubs, spades, diamonds and hearts. Then, within each suit, the cards are sorted in ascending
order from ace and two up to king. This is thus a case of sorting on 2 keys.

Now this can be done in 2 ways.

1. (a) Sort the 52 cards into 4 piles according to suit.
   (b) Sort each of the 4 piles according to the face value of the cards.

2. (a) Sort the 52 cards into 13 piles according to face value.
   (b) Stack these piles in order and then sort into 4 piles based on suit.

The first method is called the MSD (Most Significant Digit) sort and the second method is called the LSD
(Least Significant Digit) sort. Digit here can be said to stand for key. Though they are called sorting
methods, MSD and LSD sort only decide the 'order' of sorting. The actual sorting could be done by any of
the sorting methods discussed in this unit.
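The following C sketch shows the LSD order on the card example; all names are illustrative. The essential point is that each pass must be stable, so a counting sort is used (both keys have small known ranges); the standard library's qsort is not suitable here, since it is not guaranteed to be stable.

    typedef struct { int suit; int face; } Card;   /* suit 0..3, face 0..12 */

    /* Stable counting sort of a[0..n-1] into out[0..n-1] on one key. */
    static void counting_sort(const Card a[], Card out[], int n,
                              int range, int by_suit)
    {
        int count[13] = {0};                  /* big enough for either key */
        for (int i = 0; i < n; i++)
            count[by_suit ? a[i].suit : a[i].face]++;
        for (int k = 1; k < range; k++)
            count[k] += count[k - 1];         /* prefix sums = end positions */
        for (int i = n - 1; i >= 0; i--) {    /* backward scan keeps it stable */
            int k = by_suit ? a[i].suit : a[i].face;
            out[--count[k]] = a[i];
        }
    }

    /* LSD order: least significant key (face value) first, then suit. */
    void sort_deck(Card deck[], Card tmp[], int n)
    {
        counting_sort(deck, tmp, n, 13, 0);   /* pass 1: by face value */
        counting_sort(tmp, deck, n, 4, 1);    /* pass 2: by suit, stably */
    }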

2.4 SUMMARY
Sorting is an important application activity. Many sorting algorithms are available, each the most efficient
for a particular situation or a particular kind of data. The choice of a sorting algorithm is crucial to the
performance of the application.

In this unit we have studied many sorting algorithms used in internal sorting. This is not a conclusive list
and the student is advised to read the suggested volumes for exposure to additional sorting methods and for
detailed discussions of the methods introduced here.

The three important efficiency criteria are

- use of storage space

- use of computer time

- programming effort

In the next unit we will discuss external sorting.

2.5 REVIEW QUESTIONS


1. Formulate an algorithm to perform insertion sort on a linked list.

2. What initial order of data will produce the maximum number of comparisons in insertion sort?
3. Modify the insertion sort algorithm to use the binary search method to determine the position of
insertion.

4. Implement bubble sort in Pascal.

5. Define 'divide & conquer' with relation to sorting.

6. Modify the quicksort algorithm to use the Median of three dividing method.

7. Modify the mergesort algorithm to deal with linked lists.

8. Describe a heap.

9. Formulate algorithm for the heap operations insert and delete.

2.6 PROGRAMMING EXERCISES


1. Implement quicksort using contiguous lists and linked lists and compare the performance.

2. A sorting procedure is said to be 'stable' if any two items which have the same value appear in
the same order in the sorted list as in the unsorted one. Determine which of the methods
discussed here are stable.

3. Write a program in C which will allow a user to

a. enter the data to be sorted.

b. specify the storage method - contiguous list or linked list.

c. specify the sorting method (with any additional inputs required for a particular method).

Implement the above.

2.7 SUGGESTED READING


1. FUNDAMENTALS OF COMPUTER ALGORITHMS

HOROWITZ & SAHNI

2. FUNDAMENTALS OF DATA STRUCTURES IN PASCAL

HOROWITZ & SAHNI

3. DATA STRUCTURES & PROGRAM DESIGN

ROBERT L. KRUSE

4. THE ART OF COMPUTER PROGRAMMING, VOLUME 3: SORTING & SEARCHING

DONALD E. KNUTH
UNIT 3 SORTING TECHNIQUES - II
Structure
3.0 Introduction
3.1 Objectives
3.2 Data Storage
3.3 Sorting with Disk
3.4 Buffering
3.5 Sorting with Tapes
3.6 Summary
3.7 Review Exercises
3.8 Suggested Readings

3.0 INTRODUCTION
In the previous unit, we were introduced to the importance of sorting and discussed many internal sorting
methods. We will now talk about external sorting. These are methods employed to sort records of files
which are too large to fit in the main memory of the computer. These methods involve as much external
processing as processing in the CPU.

To study external sorting, we need to study the various external devices used for storage in addition to
sorting algorithms. The involvement of external devices makes sorting algorithms complex because of the
following reasons:

- The cost of accessing an item is much higher than any computational cost.

- Depending upon the external device, the method of access has different restrictions.

- The variety of external storage device types changes depending upon the latest available technology.

Therefore, external sorting methods are dependent on external factors also. External sorting methods should
place equal emphasis on the systems aspect as well as on the algorithms aspect.

In this unit, we will just be introduced to some data storage devices and then study sorting algorithms for
data stored on different devices.

3.1 OBJECTIVES
After going through this unit you will be able to:

- differentiate between internal sorting and external sorting

- discuss sorting with tapes

- discuss sorting with disks

3.2 DATA STORAGE


External storage devices can be categorised into two types based on the access method: sequential
access devices (e.g. magnetic tapes) and random access devices (e.g. disks). We know the meaning of
sequential access and random access from the unit on files (Block 5). We will first look at the characteristics
of magnetic tapes and then of disks.

3.2.1 Magnetic Tapes


Magnetic tape devices for computer input/output are similar in principle to audio tape recorders. Magnetic
tape is wound on a spool. Tracks run along the length of the tape; usually there are 7 to 9 tracks across the
tape width. Data is recorded on the tape as a sequence of bits. The number of bits that can be written per inch of
track is called the tape density, measured in bits per inch (bpi).

Information on tapes is usually grouped into blocks, which may be of fixed or variable size. Blocks are
separated by an inter-block gap. Because requests to read or write blocks do not arrive at a tape drive at a
constant rate, there must be a gap between each pair of blocks, forming a space to be passed over as the tape
accelerates to read/write speed. The medium is not strong enough to withstand the stress that it would
sustain with instantaneous starts and stops. Because the tape is not moving at a constant speed over a gap, a gap
cannot contain user data. Data is usually read/written from tapes in terms of blocks. This is shown in figure 1.

Figure 1: Interblock gaps

In order to read or write data to a tape, the block length and the address in memory to/from which the data is
to be transferred must be specified. These areas in memory from/to which data is transferred are called
buffers. Usually the block size will correspond to the buffer size.

Block size is a crucial factor in tape access. A large block size is preferred, for the following
reasons:

1) Consider a tape with a tape density of 600 bpi and an inter-block gap of 3/4". This gap is enough to
write 450 characters. With a small block size, the number of blocks per tape length will increase.
This means a larger number of inter-block gaps, i.e. stretches of tape which cannot be utilised for data
storage, and thus decreased tape utilisation. Thus the larger the block size, the fewer the number of
blocks, the fewer the number of inter-block gaps and the better the tape utilisation.

2) A larger block size reduces the input/output time. The delay time in tape access is the time needed
to cross the inter-block gap. This delay time is larger when a tape starts from rest than when the
tape is already moving. With a small block size, the number of halts in a read is considerable,
causing the delay time to be incurred each time.

While a large block size is desirable from the view of efficient tape usage as well as reduced access time,
the amount of main memory available for use as I/O buffers is a limiting factor.

3.2.2 Disks
Disks are an example of direct access storage devices. In contrast to the way information is recorded on a
gramophone record, data are recorded on a disk platter in concentric tracks. A disk has two surfaces on which
data can be recorded. Disk packs have several such disks, or platters, rigidly mounted on a common spindle.
Data is read/written to the disk by a read/write head. A disk pack would have one such head per surface.

Each disk surface has a number of concentric circles called tracks. In a disk pack, the set of tracks at the
same radius on all surfaces is called a cylinder. Tracks are further divided into sectors. A sector is the smallest
addressable segment of a track.

Data is stored along the tracks in blocks. Therefore, to access a disk, the track or cylinder number and the
sector number of the starting block must be specified. For disk packs, the surface must also be specified.
The read/write head moves horizontally to position itself over the correct track for accessing disk data.

This introduces three time components into disk access.

1. Seek time:- The time taken to position the read/write head over the correct cylinder.

2. Latency time:- The time taken for the correct sector to rotate under the read/write head.

3. Transfer time:- The time taken to actually transfer the block between Main memory and the disk.

Having seen the structure of data storage on disks and tapes and the methods of accessing them, we now
turn to specific cases of external sorting: sorting data on disks and sorting data on tapes. The general
method for external sorting is the merge sort.

In this method, segments of the file are sorted using a good internal sort method. These sorted segments, called runs,
are written out onto the device. Then all the generated runs are merged into one run.

3.3 SORTING WITH DISKS


We will first illustrate merge sort using disks and then analyse it as an external sorting method.

Example

The file F containing 6000 records is to be sorted. The main memory is capable of sorting 1000 records at
a time. The input file F is stored on one disk and we have in addition another scratch disk. The block length
of the input file is 500 records.
We see that the file can be treated as 6 sets of 1000 records each. Each set is sorted and stored on the
scratch disk as a 'run'. These 6 runs will then be merged as follows:

Allocate 3 blocks of memory each capable of holding 500 records. Two of these buffers B1 and B2 will be
treated as input buffers and the third B3 as the output buffer. We have now the following.

1. 6 runs R1, R2, R3, R4, R5, R6 on the scratch disk.

2. 3 buffers B1, B2 and B3.

- Read 500 records from R1 into B1.

- Read 500 records from R2 into B2.

- Merge B1 and B2 and write into B3.

- When B3 is full - write it out to the disk as run R11.

- Similarly merge R3 and R4 to get run R12.


- Merge R5 and R6 to get run R13.

Thus, from 6 runs of size 1000 each, we have now 3 runs of size 2000 each.

The steps are repeated for runs R11 and R12 to get a run of size 4000.

This run is merged with R13 to get a single sorted run of size 6000.

Pictorially, this can be represented as:

Input file F with 6000 records

Figure 2: Merge Sort

The divisions in each run indicate the number of blocks.

Analysis

T1 = seek time
T2 = latency time
T3 = transmission time for 1 block of 500 records
T = T1 + T2 + T3
T4 = time to internally sort 1000 records
nTM = time to merge n records from the input buffers to the output buffer

In stage 1 we read 6000/500 = 12 blocks,
internally sort 6000/1000 = 6 sets of 1000 records, and
write 6000/500 = 12 blocks.
Therefore, time taken in stage 1 = 24T + 6T4

In stage 2 we read 12 blocks,
write 12 blocks, and
merge 3 x 2000 = 6000 records.
Time taken in stage 2 = 24T + 6000TM

In stage 3 we merge only 2 runs.

Therefore we read 8 blocks,

write 8 blocks, and
merge 2 x 2000 = 4000 records.

Time taken in stage 3 = 16T + 4000TM

In stage 4 we read 12 blocks,

write 12 blocks, and

merge 4000 + 2000 = 6000 records.

Therefore, time taken in stage 4 = 24T + 6000TM

Total time taken = (24T + 6T4) + (24T + 6000TM) + (16T + 4000TM) + (24T + 6000TM)
                 = 88T + 6T4 + 16000TM

It is seen that the largest influencing factor is TM, which depends on the number of passes made over the
data, or the number of times runs must be combined.

We have assumed a uniform seek and latency time for all blocks for simplicity of analysis. This may not be
true in real life situations.

Time could also be reduced by exploiting the parallel features available, i.e. input/output and CPU
processing carried out at the same time.

We will now focus on methods to optimise the effects of the above factors; that is to say, we will be chiefly
concerned with buffering and block size, assuming that the internal sorting algorithm and the seek and latency
time factors are the best possible.

1. K-way merging

In the above example, we used 2-way merging, i.e. combining two runs at a time. The number of
passes over the data can be reduced by combining more runs at a time, hence the K-way merge, where K >= 2.
In the same example, suppose we had used a 3-way merge. Then

- at stage 2 we would have had 2 runs of size 3000 each

- at stage 3 we would have had a single sorted run of size 6000

This would have affected our analysis as follows:

Stage 1 = 24T + 6T4

Stage 2 = 24T + 6000TM
Stage 3 = 24T + 6000TM
Total = 72T + 6T4 + 12000TM

There is obviously a considerable drop in the contribution of TM to the total time.


This advantage is accompanied by certain side effects. The time to merge K runs would obviously be more
than the time to merge 2 runs, since the smallest of K items must be determined at each step; hence the value of TM
itself goes up. An efficient method of selecting the smallest out of K items in each step of the merge is
needed. One such method is the use of a selection tree.

A selection tree is a binary tree in which each node represents the smaller of its children. It therefore
follows that the root will be the smallest node. The way it works is simple. Initially, the selection tree is
built from the 1st item of each of the K runs. The one which gets to the root is selected as the smallest. Then
the next item in the run from which the root was selected enters the tree and the tree is restructured to get
a new root, and so on till the K runs are merged.

Example

Consider a selection tree for the following 4 runs:


Figure 3: K-way Merging

The new root is 2. This came from R1. In the next step, the next item from R1 will enter the tree, and so on.
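A compact C sketch of such a tree (often called a winner tree) is given below, assuming for simplicity that K is a power of two and that exhausted runs are represented by the sentinel INT_MAX; all names are illustrative. Each internal node holds the index of the run winning the comparison between its two subtrees, so winner[1] always names the run with the smallest current head, and replacing a head costs only about log2 K comparisons.

    #include <limits.h>

    #define K 4                 /* merge order; a power of two in this sketch */

    static int head[K];         /* current first item of each run */
    static int winner[K];       /* winner[j], 1 <= j < K: run index winning at
                                   internal node j; leaves are implicit */

    static int run_of(int node) /* run represented by tree node 'node' */
    {
        return node >= K ? node - K : winner[node];
    }

    static void replay(int r)   /* rebuild the path from run r's leaf to the root */
    {
        for (int j = (K + r) / 2; j >= 1; j /= 2) {
            int a = run_of(2 * j), b = run_of(2 * j + 1);
            winner[j] = head[a] <= head[b] ? a : b;
        }
    }

    /* Merge K sorted runs run[0..K-1], of lengths len[], into out[0..total-1]. */
    void kway_merge(int *run[], const int len[], int out[], int total)
    {
        int pos[K];
        for (int j = 1; j < K; j++) winner[j] = 0;  /* provisional winners */
        for (int r = 0; r < K; r++) {
            pos[r] = 0;
            head[r] = len[r] > 0 ? run[r][0] : INT_MAX;
        }
        for (int r = 0; r < K; r++) replay(r);      /* build the initial tree */
        for (int i = 0; i < total; i++) {
            int r = winner[1];                      /* run with smallest head */
            out[i] = head[r];
            pos[r]++;
            head[r] = pos[r] < len[r] ? run[r][pos[r]] : INT_MAX;
            replay(r);                              /* restore the tree */
        }
    }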
In going in for a higher-order merge, we save on input/output time. But with the increase in the order of the
merge, the number of buffers required also increases; at the minimum, to merge K runs we need K + 1
buffers. Now internal memory is a fixed entity; therefore, if the number of buffers increases, their size must
decrease. This implies a smaller block size on the disk and hence a larger number of inputs/outputs. So
beyond an optimal value of K, we find that the advantage of a reduced number of passes is offset by the
disadvantage of increased input/output.

3.4 BUFFERING
We saw that the second factor to exploit in external sorting is parallelism. In the K-way merge discussion, we
stated that K + 1 buffers are enough to merge K runs, using K buffers for input and 1 for output. But this
number of buffers is not adequate to exploit parallel operation in a computer.

When the full output buffer is being written onto the disk, no internal merging activity takes place, because
there is no place to write the results of the merge to.

This problem can be overcome by having 2 output buffers, so that one can be filled while the other one is
being written.

Now consider the input buffers. We have assigned one buffer per run. If the buffer corresponding to
one run is emptied, then again merging activity ceases till the input/output to fetch the next block from that
run is complete. We will show by example that simply assigning 2 input buffers per run will not solve this
problem.

Example: Consider 2 runs with the following data.

R1 2,4,6,8,9,10
R2 3,5,7,16,21,26

Input buffers I1 and I3 are allotted to R1. Input buffers I2 and I4 are allotted to R2. The output
buffers are O1 and O2. We assume a timing situation whereby it is possible to simultaneously write an
output buffer, merge 2 runs and read an input buffer. The merging scenario can be represented as follows
(figure 4):
Figure 4

We have arrived at a situation where I1 and I3 have been exhausted but O1 is not yet full. Therefore, the merge
will be halted till step 37.

A better way of using these 2K buffers is now described. We continue to have 1 dedicated buffer for
each run. The remaining K buffers are allocated to runs on a priority basis, i.e. the run for which the merge
will first run out of records is the next one whose buffer is filled.

3.5 SORTING WITH TAPES


Sorting with tapes is essentially similar to the merge sort used for sorting with disks. The differences arise
due to the sequential access restriction of tapes. This makes the selection time prior to data transmission an
important factor, unlike the seek time and latency time of disks. Thus in sorting with tapes we will be more concerned
with the arrangement of blocks and runs on the tape so as to reduce the selection or access time.

As in section 3.3, we will use an example to determine the factors involved.

Example: A file of 6000 records is to be sorted. It is stored on a tape and the block length is 500. The main
memory can sort up to 1000 records at a time. We have, in addition, 4 scratch tapes T1-T4. The steps in
merging can be summarized as follows (figure 5):
Figure 5: Sorting with Tapes

Analysis

tis = time taken to internally sort 1000 records

trw = time to read or write one block of 500 records onto tape,
starting from the present position of the tape read/write head
trew = time to rewind the tape over a length corresponding to one block
ntm = time to merge n records from input buffers to output buffers
using a 2-way merge
A = delay caused by having to wait for a tape to be mounted, in case we are
ready to use it before it is mounted

Total time = 6tis + 132trw + 12000tm + 51trew + A

The above computing time analysis assumes that no operations are carried out in parallel. The analysis
could be carried further, as in the case of disks, to show the dependence of the sort time on the number of passes
made over the data.

Balanced Merge Sorts:


We see that the computing time in the above example depends, as in the case of disk sorting, essentially on
the number of passes being made over the data. Use of a higher-order merge results in a decrease in the
number of passes being made without significantly changing the internal merge time. Hence we would like
to use as high an order of merge as possible. In disk sorting, the order of merge was limited essentially by the
amount of main memory available for input/output buffers. A k-way merge required the use of 2k + 2
buffers. Another, more severe, restriction on the merge order in the case of tapes is the number of tapes
available. In order to avoid excessive seek time, it is necessary that the runs being merged be on different tapes.
Thus a k-way merge requires at least k tapes for use as input tapes during the merge.

In addition, another tape is required for the output generated during the merge. Hence at least k + 1 tapes
must be available for a k-way tape merge. Using k + 1 tapes for a k-way merge requires an additional pass
over the output tape to redistribute the runs onto k tapes for the next level of merge. This redistribution pass
can be avoided by using 2k tapes. During the k-way merge, k of these tapes are used as input tapes and the
remaining k as output tapes. At the next merge level, the roles of input and output tapes are interchanged. These
two approaches are now examined. Algorithms x1 and x2 perform a k-way merge with the k + 1 tapes
strategy, while algorithm x3 performs a k-way merge using the 2k tapes strategy.

Procedure x1;

[sort a file of records from a given input tape using a k-way
merge, given that tapes t1, ..., tk+1 are available for the sort]
label 200
begin
    Create runs from the input tape, distributing them evenly over tapes t1, ..., tk;
    Rewind t1, ..., tk and also the input tape;
    If there is only one run then goto 200; [the sorted file is on t1]
    replace the input tape by tk+1;
    while true do
    [repeatedly merge onto tk+1 and redistribute back onto t1, ..., tk]
    begin
        merge runs from t1, ..., tk onto tk+1;
        rewind t1, ..., tk+1;
        If the number of runs on tk+1 = 1 then goto 200;
        [output on tk+1]
        evenly distribute the runs from tk+1 onto t1, ..., tk;
        rewind t1, ..., tk+1;
    end;
200 end; [of x1]

Analysis: To simplify the analysis we assume that the number of runs generated, m, is a power of k. One
pass over the entire file includes both reading and writing. The while loop makes log_k m merge passes and
log_k m - 1 redistribution passes; together with the initial run-creation pass, the total number of passes is
2 log_k m. If the time to rewind the entire input tape is trew, then the non-overlapping rewind time is
approximately (2 log_k m) trew.

A cleverer way to tackle the problem is to rotate the output tape, i.e. the tapes are used as output tapes in the
cyclic order k+1, 1, 2, ..., k. This makes the redistribution from the output tape make less than a full pass
over the file. Algorithm x2 describes the process.
Procedure x2;

[same assumptions as for x1]

label 200

begin

    Create runs from the input file, distributing them evenly over tapes t1, ..., tk;

    Rewind t1, ..., tk and also the input tape;

    If there is only one run then goto 200;
    [sorted file on t1]
    replace the input tape by tk+1;
    i := k + 1; [i is the index of the output tape]
    while true do

    begin
        merge runs from the k tapes tj,
        1 <= j <= k+1 and j <> i, onto ti;
        rewind t1, ..., tk+1;
        If the number of runs on ti = 1 then
            goto 200; [output on ti]
        evenly distribute (k-1)/k of the runs on ti onto the
        tapes tj, 1 <= j <= k+1, j <> i and j <> i mod (k+1) + 1;
        rewind the tapes tj, 1 <= j <= k+1 and j <> i;
        i := i mod (k+1) + 1;
    end;

200 end; [of x2]


Analysis: The only difference between algorithms x1 and x2 is in the redistribution time. m is the number
of runs generated initially. Redistribution is done log_k m - 1 times, but each such pass reads and writes only
(k-1)/k of the file. Hence the effective number of passes made over the data is (2 - 1/k) log_k m + 1/k. For two-
way merging on 3 tapes, algorithm x2 will make (3/2) log_2 m + 1/2 passes while x1 will make 2 log_2 m passes.
If trew is the rewind time, then the non-overlapping rewind time for x2 is at most (1 + 1/k)(log_k m) trew +
(1 - 1/k) trew. Instead of distributing runs as shown, we could write the first m/k runs on one tape, begin
rewinding it, write the next m/k runs on the second tape, begin rewinding it, and so on. In this case we can begin filling input
buffers for the next merge level while some of the tapes are still rewinding.

In case a k-way merge uses 2k tapes, no redistribution is needed and so the number of passes made is
only log_k m + 1. This implies that if 2k + 1 tapes are available, then a 2k-way merge will make (2 - 1/(2k)) log_2k m
+ 1/(2k) passes while a k-way merge utilizing 2k tapes will make only log_k m + 1 passes. The table below compares
the number of passes made by the two methods for some values of k.

k    2k-way merge on 2k+1 tapes    k-way merge on 2k tapes
1    (3/2) log2 m + 1/2            -
2    (7/8) log2 m + 1/4            log2 m + 1
3    1.124 log3 m + 1/6            log3 m + 1
4    1.25 log4 m + 1/8             log4 m + 1

As is evident from the table, for k > 2 a k-way merge using 2k tapes is better than a 2k-way merge using 2k +
1 tapes.

Algorithm x3 performs a k-way merge sort using 2k tapes.
Procedure x3;

[sort a file of records from a given input tape using a k-way
merge on 2k tapes]

begin

    Create runs from the input file, distributing them evenly over tapes t1, ..., tk;
    rewind t1, ..., tk; rewind the input tape;
    replace the input tape by tape t2k; i := 0;
    while the total number of runs on tik+1, ..., tik+k > 1 do
    begin
        j := 1 - i;
        perform a k-way merge from tik+1, ..., tik+k, evenly
        distributing the output runs onto tjk+1, ..., tjk+k;
        rewind t1, ..., t2k;
        i := j; [switch input and output tapes]
    end;
    [the sorted file is on tik+1]

end; [of x3]

Analysis: To simplify the analysis, assume that m is a power of k. In addition to the initial run-creation
pass, the algorithm makes log_k m merge passes. Let trew be the time to rewind the entire input file. If m is a
power of k, then the rewinding of tapes t1, ..., t2k in the while loop takes trew/k for each but the last loop
iteration. The last rewind takes time trew. The total rewind time is therefore bounded by (2 + (log_k m - 1)/k) trew.

It should be noted that all the above algorithms use the buffering strategy developed in the k-way merge.
The proper choice of buffer lengths and the merge order (restricted by the number of tapes available) would
result in an almost complete overlap of internal processing with input/output time. At the end of each level
of merge, processing will have to wait till the tapes rewind. This wait can be minimized using the run
distribution strategy developed in algorithm x2.
Polyphase merging: The problem with balanced multiway merging is that it requires either an excessive
number of tape units or excessive copying. Polyphase merging is a method to eliminate virtually all the
copying by changing the way in which the small sorted blocks are merged together. The basic idea is to
distribute the sorted blocks produced by replacement selection somewhat unevenly among the available
tape units (leaving one empty) and then to apply a 'merge-until-empty' strategy, at which point one of the
input tapes and the output tape switch roles.

For example, suppose that we have just 3 tapes, and we start with the initial configuration of sorted blocks
on the tapes as shown at the top of the figure. Tape 3 is initially empty; it is the output tape for the first merges.

Tape 1:  B P S T U . J O . B H O . E F N S . H J O
Tape 2:  F H Y . B N Q . F M
Tape 3:  (empty)

Tape 1:  E F N S . H J O
Tape 2:  (empty)
Tape 3:  B F H P S T U Y . B J N O Q . B F H M O

Tape 1:  (empty)
Tape 2:  B E F F H N P S S T U Y . B H J J N O O Q
Tape 3:  B F H M O

Now, after three two-way merges from tapes 1 and 2 onto tape 3, the second tape becomes empty. Then, after
two two-way merges from tapes 1 and 3 onto tape 2, the first tape becomes empty. The sort is completed in
two more steps. First, a two-way merge from tapes 2 and 3 onto tape 1 leaves one run on tape 1 and one run on
tape 2. Then a two-way merge from tapes 1 and 2 leaves the entire sorted file on tape 3.
This merge-until-empty strategy can be extended to work for an arbitrary number of tapes. The figure
indicates how 6 tapes might be used to sort 497 initial runs. If we start, as shown, with tape 2 as the output
tape and tape 1 having 61 initial runs etc., then after running a five-way merge-until-empty, we have tape 1
empty, tape 2 with 61 runs etc., as shown in the second column of the figure. At this point we can rewind
tape 1 and make it the output tape, and rewind tape 2 and make it an input tape. Continuing in this way, we
arrive at the entire file sorted on tape 1, as shown by the last column. The merge is broken up into many
phases which don't involve all the data, but no direct copying is involved.

The main difficulty in implementing polyphase merge is to determine how to distribute the initial runs. The
table can be built by working backwards: take the largest number in each column, make it zero and add it
to each of the other numbers to get the previous column. This technique works for any number of tapes (at
least 3); the numbers which arise are "generalized Fibonacci numbers". Of course, the number of initial runs
may not be known in advance, and it probably won't be exactly a generalized Fibonacci number. Thus a
number of "dummy" runs must be added to make the number of initial runs exactly what is needed for the
table.

Tape 1    61    0   31   15    7    3    1    0    1
Tape 2     0   61   30   14    6    2    0    1    0
Tape 3   120   59   28   12    4    0    2    1    0
Tape 4   116   55   24    8    0    4    2    1    0
Tape 5   108   47   16    0    8    4    2    1    0
Tape 6    92   31    0   16    8    4    2    1    0

Run distribution for six-tape polyphase merge.
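The backward construction just described is mechanical enough to code directly. The short C program below regenerates the table above under the stated rule (one run on tape 1 in the final column; at each step the largest count is zeroed and added to the others); the tape and column counts are the ones used in the text, and all names are illustrative.

    #include <stdio.h>

    #define TAPES 6
    #define COLS  9

    int main(void)
    {
        int t[COLS][TAPES] = {0};
        t[0][0] = 1;                        /* final state: sorted file on tape 1 */
        for (int c = 1; c < COLS; c++) {    /* build earlier columns backwards */
            int big = 0;                    /* position of the largest count */
            for (int i = 1; i < TAPES; i++)
                if (t[c - 1][i] > t[c - 1][big]) big = i;
            for (int i = 0; i < TAPES; i++)
                t[c][i] = (i == big) ? 0 : t[c - 1][i] + t[c - 1][big];
        }
        for (int i = 0; i < TAPES; i++) {   /* print with the initial column first */
            printf("Tape %d:", i + 1);
            for (int c = COLS - 1; c >= 0; c--)
                printf(" %4d", t[c][i]);
            printf("\n");
        }
        return 0;
    }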

Remarks: A factor we have not considered is the time taken to rewind the tape. Before the merge for the
next phase can begin, the tape must be rewound, and the computer is essentially idle. It is possible to modify
the above method so that virtually all rewind time is overlapped with internal processing and reads/writes on
other tapes. However, the savings over the multiway balanced merge are quite limited. Even polyphase
merging is better than balanced merging only for a small number of tapes P, and then not substantially. For P >= 8, balanced
merging is likely to run faster than polyphase, and for smaller P the effect of polyphase merging is essentially to save
two tapes.

3.6 SUMMARY
External sorting is an important activity, especially in large businesses. It is thus crucial that it be performed
efficiently. External sorting depends to a large extent on system considerations like the type of device and
the number of such devices that can be used at a time. Thus the choice of an external sorting algorithm is
dependent upon external system considerations.

The list of algorithms we have studied is not conclusive and the student is advised to consult the references
for detailed discussions of additional algorithms. Analysis and distribution strategies for multiway merging
and polyphase merging are to be found in Knuth, Volume 3.

3.7 REVIEW EXERCISES


1. Write an algorithm to construct a tree of losers for records Ri, 1 <= i <= k, with key values Ki. Let
the tree nodes be Ti, 0 <= i <= k-1.

2. Write an algorithm, using a tree of losers, to carry out a k-way merge of k runs, k >= 2. Show that if
there are n records in the k runs together, then the computing time is O(n log2 k).
3. a) Modify algorithm x3 using the run distribution strategy described in the analysis of
algorithm x2.

b) Let trw be the time to read/write a block and trew be the time to rewind over one block length. If
the initial run-creation pass generates m runs, for m a power of k, what is the time for a k-way merge
using your algorithm? Compare this with the corresponding time for algorithm x2.

4. How would you sort the contents of a disk if only one tape (and main memory) were available for
use?

5. How would you sort the contents of a disk if no other storage (except main memory) were
available for use.

6. Compare the 4-tape and 6-tape multiway balanced merge to polyphase merge with the same
number of tapes, for 31 initial runs.

7. How many phases does 5-tape polyphase merge use when started up with 4 tapes containing 26,
15, 22 and 28 runs initially?

8. Obtain a table corresponding to the one in the text for the case of a 5-way polyphase merge on 6
tapes. Use this to obtain the correct initial distribution for 497 runs so that the sorted file is on tape
T1. How many passes are made over the data in achieving the sort? How many passes would have
been made by a 5-way balanced merge sort on six tapes (algorithm x2)? How many passes would
have been made by a 3-way balanced merge sort on 6 tapes (algorithm x3)?

3.8 SUGGESTED READINGS


1. Fundamentals of Computer algorithms

Horowitz & Sahni

2. Fundamentals of Data Structures in Pascal

Horowitz & Sahni

3. The Art of Computer Programming, Volume 3: Sorting and Searching

Donald E. Knuth
