
Fault-Tolerant Data Structures∗

Yonatan Aumann
Department of Computer Science
Bar-Ilan University
Ramat-Gan 52900, Israel
aumann@cs.biu.ac.il

Michael A. Bender
Department of Computer Science
State University of New York
Stony Brook, NY 11794-4400, USA
bender@cs.sunysb.edu

Abstract
We study the tolerance of data structures to memory faults. We observe that many pointer-
based data structures (e.g., linked lists, trees, etc.) are highly nonresilient to faults. A single
fault in a pointer in a linked list or a tree may result in the loss of an unproportionately large
amount of data. In this paper we present a formal framework for studying the fault-tolerance
properties of pointer-based data structures, and provide fault-tolerant versions of the stack, the
linked list, and the dictionary tree.

1 Introduction
Motivation. Many commonly used pointer-based data structures are highly nonresilient to mem-
ory failures (e.g., disk sector failures, main memory erasures, accidental overwrites, etc.). Consider,
for example, a linked list. Losing even a single pointer makes the entire tail of the list unreachable.
Thus, one fault may result in the loss of an unbounded amount of data. Trees, stacks, and other
common pointer-based data structures exhibit similar fragility.
The objective is to make data structures more fault tolerant. Clearly, it is best to avoid faults
in the first place. However faults do occur, and their destructive effect should be minimized.

An Example. Consider the linked list. How can we make the linked list more fault tolerant, so
that a single fault does not cause so much havoc? A naïve solution is replication. If each data item
is replicated d + 1 times, the resulting data structure becomes resilient to d faults. This solution,
however, entails a high price both in space and in time: the data structure occupies a factor of d
more memory space and requires a factor of d more work for insertions and deletions.
A more efficient solution is the following. At each node, we maintain two pointers pointing out
of the node. One pointer points to the successor node in the list, as usual. The second pointer

This work appeared in preliminary form in the Proceedings of the 37th Annual Symposium on the Foundations
of Computer Science (FOCS), pages 580–589, October 1996 [2].

points to the node that is d + 1 positions along the list. For this new structure, it can be proved
that with f faults at most O(f^2) nodes are lost, as long as f ≤ d. Specifically, the amount of
lost data is bounded and is a function of the number of faults. Thus, by adding only one extra
pointer to each node of the linked list, the data structure becomes resilient for up to d faults (in the
sense that the bulk of the data remains accessible). Unlike replication, the space overhead in this
structure is constant. Insertions and deletions, however, still take O(d) operations. In Section 4 we
provide a more efficient version of the linked list, where f faults result in only O(f log f log d) lost
nodes, and both insertions and deletions take constant time.
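To see the effect concretely, here is a small simulation sketch (ours, not the paper's; the names Node, build, and reachable are hypothetical) of the two-pointer list just described: each node keeps its usual successor pointer plus a pointer to the node d + 1 positions ahead, so a single fault no longer severs the tail.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # successor in the list
        self.skip = None   # node d + 1 positions ahead

def build(values, d):
    """Build the two-pointer list over the given values."""
    nodes = [Node(v) for v in values]
    for i, node in enumerate(nodes):
        node.next = nodes[i + 1] if i + 1 < len(nodes) else None
        node.skip = nodes[i + d + 1] if i + d + 1 < len(nodes) else None
    return nodes

def reachable(head, faulty):
    """Count nodes still reachable from head when the nodes in `faulty`
    are destroyed (both pointers of a faulty node are lost)."""
    count, frontier, seen = 0, [head], set()
    while frontier:
        node = frontier.pop()
        if node is None or id(node) in seen or node in faulty:
            continue
        seen.add(id(node))
        count += 1
        frontier.extend([node.next, node.skip])
    return count

nodes = build(range(100), d=4)
assert reachable(nodes[0], set()) == 100        # no faults: everything reachable
assert reachable(nodes[0], {nodes[10]}) == 99   # one fault loses only that node
```

A plain singly linked list would lose the entire tail here; the skip pointer routes the traversal around the faulty node.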

Our Results. In this paper we address the problem of fault-tolerance of data structures, providing
two main contributions:

1. We present a formal framework for studying the fault-tolerance properties of data structures.
We define a parameterized notion of fault tolerance, which measures the amount of lost data
as a function of the number of faults.

2. We introduce efficient fault-tolerant versions of three common data structures: the stack, the
linked list, and the binary search tree.

A full description of our results appears in Section 2 after we lay down the formal framework.

Applications. One of the major goals in file-system design is quick recovery from an inconsistent
state in the system’s metadata (the data structures of the file system). The metadata may reach an
inconsistent state, for example, as a result of power failures, undetected disk errors, or internal bugs
in the file system. Inconsistent metadata often causes the system to crash. In modern systems, the
computer may be unusable for well over an hour during reconstruction (e.g., accomplished using
the function fsck in UNIX).
We suggest that faster recovery may ultimately be achieved by replacing the current data
structures with more fault-tolerant ones, in the spirit of this paper. Note that because we are
concerned with fast recovery, the lost data does not have to be inaccessible forever, but only until
a more thorough recovery is completed. Meanwhile, the system is operational.
In many applications, continuous functionality of the system is a prime concern (e.g., airline
reservation systems). Such applications have hardware solutions that provide full fault tolerance
(e.g., lots of redundant hardware). Unfortunately, these solutions are often expensive and restricted
to mission-critical applications. Software-based fault tolerance in the spirit of this paper provides an
alternative cost-effective solution for the less demanding applications. It allows for quick recovery
of the system as a whole, while the limited amount of lost data can be recovered in the background,
using more lengthy procedures.
The prominence of the Web during the last decade has generated new applications, in which
memory failures occur regularly. For example, search engines, such as Google [15], sift through
large quantities of data gleaned from the Web. In order to manipulate this data cost-effectively,
many search engines, including Google, use inexpensive hardware, in which memory faults occur

regularly. Small numbers of memory failures will not cripple the application, but their damage
should be limited.

Related Work. Computing in the presence of memory faults is studied in many contexts. There
is a large body of literature on error-correcting codes, useful for memory or transmission fault
tolerance [6, 29]. For example, certain Reed-Solomon codes are currently used in compact discs.
Rabin [33] introduces the Information Dispersal Algorithm, which has applications to efficient fault-
tolerant disk storage and fault-tolerant routing. This algorithm breaks a large file into redundant
small pieces in an asymptotically space-efficient manner. Computing systems use redundant arrays
of inexpensive disks (RAIDs) to protect against failures in storage [31]. Extra “check” disks store
redundant information so that when a disk fails, its data can be reconstructed. Rabin [32] presents
a fingerprinting scheme via random polynomials for recognizing errors in memory.
Some data structures are already built robustly. Munro and Poblete [30] discuss a method of
representing search trees in an environment where pointers may be lost or maliciously altered. Their
representation permits any two field changes to be detected and any one to be corrected. Lock-free
data structures are concurrently accessed by multiple processes; they perform correctly even though
the data structure might be changed by one process while another process is accessing it [4, 20, 38].
There is also work on checking or certifying the performance of data structures [7, 1, 37].
For the data structures of file systems, replication is a central tool to obtain fault tolerance.
For example, in the Cedar Filesystem, described by Hagmann in [17], all metadata appears twice
on the disk. The filesystem is constructed under the assumption that at most one disk sector or
two contiguous sectors fail simultaneously.
Once a crash occurs in the file system, it is important to recover as quickly as possible. To aid the
recovery, some updates are stored sequentially in a buffer called a log. The idea of logging, originally
from database systems [16], is currently used in some file systems for fast recovery [17, 23, 35, 36].
Fault tolerance with respect to processor failures is widely studied, but is out of the scope of
this paper (see [22, 24, 5, 13, 8, 14]). The fault tolerance of networks, including many specific
architectures (e.g., the mesh, hypercube), has been studied both with respect to routing and to
parallel computing (see [18, 19, 21, 34, 11, 12]). A network architecture especially designed for fault tolerance
is described in [28].
Kutten and Peleg [26, 25] consider the problem of correcting an illegal state of a distributed
network, and define the notion of fault-local mending. A distributed correction algorithm performs
fault-local mending if its time complexity depends only on the number of failed nodes, not on the size of
the entire network. In the discussion in [25] the authors raise the possibility of extending their notion
to sequential data structures, and introduce the concept of fault local mendable data structures.
This notion is related, but distinct from the one introduced in this paper.

Outline. The rest of this paper is organized as follows. In Section 2 we introduce the definitions
and the formal framework for studying fault tolerance of data structures. In Sections 3-5 we provide
fault tolerant versions of the stack, the linked list, and the binary tree. Section 6 shows how the
results for all three data structures can be improved using expanders. Finally, we conclude with a

discussion in Section 7.

2 Terminology: Faults, Reconstruction, and Emulation


2.1 Definitions
In this section we present the formal framework for studying the fault-tolerance properties of data
structures. To this end, we provide the following elements:

• A formal definition of data structures and faults.

• A formal definition of the notion of a reconstruction of a data structure after faults have been
detected.

• A quantitative measure of the fault tolerance of the data structure. The measure is based on
the amount of lost data as a function of the number of faults.

Data Structures and Data-Structure Schemes. A data structure is characterized by the
operations it supports and by their implementations. We focus on pointer-based data structures.
For these data structures, we view instances as directed graphs, with the data residing in the nodes
and the edges representing pointers. Accordingly, we define a data structure scheme, S, to be a
pair S = (H, P), where

• H is a family of graphs that are valid instances of the data structure scheme,

• P is a set of procedures {P1^S, . . . , Pn^S} that create and manipulate graphs of H.

An instance of the data structure scheme is any valid graph of the family H. For an instance graph
H ∈ H, we distinguish between two types of nodes:

• Information nodes: nodes that contain information inserted into the data structure by the
user.

• Auxiliary nodes: nodes that contain auxiliary or structural data, for internal use by the
data structure and procedures (e.g., internal nodes in a tree where all data resides in the
leaves).

This distinction is important when some of the data is lost and the data structure needs to be
reconstructed. In this case, we seek to restore as many information nodes as possible, whereas
auxiliary nodes need to be restored only insomuch as they are necessary to maintain the correctness
of the structure.

Faults and Reconstruction. We assume that when a node becomes faulty all the data contained
in the node and all the outgoing pointers are lost. We also assume that faults are detectable upon
access, i.e., trying to access a faulty node results in an error message.
We must reconstruct the data structure after faults are detected. However, full reconstruction
may be impossible. For example, the graph may become disconnected and some nodes inacces-
sible. Thus, we want to reconstruct the data structure on the subset of the remaining accessible
nodes. Since some information is necessarily lost, we must define what we mean by reconstruction.
Intuitively, we require that:

1. The reconstruction includes as many (information) nodes of the original data structure as
possible, and

2. The reconstruction maintains the essential topological order among the reconstructed infor-
mation nodes. For example, in the linked list we require that if node v appeared before node
w in the original list then node v appears before node w in the reconstruction. Thus, we
maintain the “before-after” relation. In other structures, we may maintain other relations,
e.g., “above-under” in the tree.

Note that we cannot expect to maintain all topological relations of the original graph. For example,
in the linked list we cannot maintain the relation “two nodes after”, as the intermediate nodes may
be lost. Accordingly, we define reconstruction with respect to a given set of relations R, induced
by the topology of the graph.

Definition 1 Let S = (H, P) be a data structure scheme, and let R = {R1, . . . , Rt} be a set of
relations on information nodes induced by the graphs of H (instance graphs of S). Let S = (V, E) ∈
H be an instance of S. We say that the graph S′ = (V′, E′) is a reconstruction of S with respect
to R if the following conditions hold:
• The graph S′ is a valid instance of S, i.e., S′ ∈ H.
• The information nodes in S′ form a subset of the information nodes of S.
• For each Ri ∈ R, if Ri is a k-relation, then each k-tuple v of information nodes in (V′)^k
satisfies the relation Ri in S′ iff it satisfies the relation Ri in S.

We call information nodes that appear in S but not in S′ lost nodes. Note that a node may be lost
even if it is not faulty. Specifically, a node may be inaccessible, in case all paths to the node are
blocked by faults.
We seek data structures for which all instances can be efficiently reconstructed with a minimal
loss of nonfaulty nodes.

Definition 2 Let S = (H, P) be a data structure scheme, and let R = {R1 , . . . , Rt } be a set of
relations on information nodes induced by the graphs of H. Let d be a constant and g : N → N be
a function. We say that S is (d, g)-fault tolerant with respect to R if there exists a reconstruction
algorithm A satisfying the following. For any instance S of S, if there are f ≤ d faults in S, then
algorithm A, on input the faulty S, outputs a reconstruction S′ of S with respect to R, such

that the number of lost information nodes is bounded by g(f ). The running time of reconstruction
algorithm A must be polynomial in f and d.

We seek data structures for which the function g(f ) and the running time of A are slowly
growing functions and independent of the size of the data structure S.

The Handle. In pointer-based data structures the nodes are accessed via pointers. Most of these
pointers are themselves located in other nodes of the graph. However, the data structure must also
allow access to the structure from the “outside world.” In the linked list, for example, the pointer
to the head of the list is in a fixed location and the entire structure is accessed through this pointer.
The queue has two such pointers, one pointer to the front of the queue and one to the end. These
pointers are in fixed locations, and there is a fixed number of such pointers. If all these pointers
are lost the entire data structure becomes unreachable.
We call the set of fixed pointers to a structure the handle of the structure. Formally, the handle
H(S) of data structure S is a set of labelled nodes {h1 , . . . , hk }. Each handle node hi stores a
pointer, which can point to any node of the data structure. In addition, the handle node may
store auxiliary information regarding the pointer or the data structure. For any data structure, the
number of nodes in the handle is an upper bound on the number of faults that the data structure
can sustain. This explains why Definition 2 has an upper bound, d, on the number of faults that
the data structure is required to withstand, even in the worst case.

Emulation. Most common data structures do not lend themselves to efficient reconstructions.
Thus, we introduce fault-tolerant versions of the data structures. The new versions emulate the
behavior of the original ones while supplying a higher degree of fault tolerance. The following
definition formally defines what it means for the fault tolerant version to emulate the regular one.

Definition 3 Let S = (H, P) be a data structure scheme, with P = {P1, . . . , Pk} (the Pi's are the
procedures), and let S̄ = (H̄, P̄) be another data structure scheme, with P̄ = {P̄1, . . . , P̄k}. We say
that S̄ is an emulation of S if the following conditions are satisfied:
• For each i, procedure P̄i of S̄ has the same interface as procedure Pi of S (i.e., it expects the
same input pattern and outputs the same output pattern).
• For any sequence of invocations of procedures from S, invoking the corresponding procedures
from S̄ results in the same output to the user.

Two measures characterize the quality of the emulation: time and space, as described in the
following definition:

Definition 4 Let S̄ be an emulation of S. We say that it is an (α, β)-emulation if the following
criteria are satisfied:
• Time: for each sequence of invocations of procedures of S and corresponding invocations of
procedures of S̄, the (amortized) execution time of the S̄ procedures is at most α times the
(amortized) execution time of the S procedures.

• Space: for any instance S of data structure scheme S, the corresponding instance S̄ of S̄
occupies at most a factor β more space.
We say that S̄ is a constant emulation if it is an (O(1), O(1))-emulation.

2.2 Our Results


We are now ready to present a formal description of the results presented in this paper. We provide
the following fault-tolerant data structures:
• A family of fault-tolerant stacks, such that for each d, there is a (d, O(f log f))-fault-tolerant stack.
The fault-tolerant stack is a constant emulation of the regular stack.
• A family of fault-tolerant linked lists, such that for each d, there is a (d, O(f log f log d))-fault-
tolerant linked list. The fault-tolerant linked list is a constant emulation of the regular linked
list.
• A family of fault-tolerant binary trees, such that for any d, there is a (d, O(f log f log d))-fault-
tolerant tree. The fault-tolerant binary tree is a constant emulation of the regular binary
tree.
• Expander-based versions of fault tolerant stacks, lists, and trees. The number of lost nodes
is reduced by an O(log f ) factor. This provides O(f ) lost nodes for the stack, and O(f log d)
lost nodes for linked list and binary search tree.
The reconstruction time is a slowly growing function of f and d.

3 Fault-Tolerant Stacks
The stack data structure supports two operations: Push(x) and Pop. An instance graph takes
the form of a directed path, with the handle Top pointing to the first node in path and the last
node pointing to Null. Each node x has one data field x.value and one outgoing pointer x.next.
Conceptually, we view pointers as oriented down; thus we push and pop nodes from the top of the
stack. The essential topological order of the stack is the “above-under” relation among nodes. The
stack data structure is highly non fault tolerant because one memory fault can generate O(n) lost
nodes.
We describe a new family of stack-like data structures, the d-FTstack, for d ∈ {2^i : i ≥ 0} (for other
d's simply round up to the nearest power of 2). For any such d, the d-FTstack is (d, O(f log f))-fault
tolerant with respect to the “above-under” relation, and is a constant emulation of the stack.
The graph structure of (instances of) the d-FTstack is composed of a sequence of layers. Each
layer Li (except possibly the top layer) consists of 2d nodes, Li = {xi,0, . . . , xi,2d−1}. We index the
layers L⌈n/2d⌉−1, . . . , L0, so that layer L0 is at the bottom of the stack and L⌈n/2d⌉−1 is on the
top.
Every log d + 1 layers constitute a butterfly structure, as follows (see [27] for a more complete
description of the butterfly structure). A sample graph appears in Figure 1. Each node xi,j in the
butterfly has two outgoing edges, a straight edge and a diagonal edge. The straight edge points

Figure 1: A d-FTstack, where d = 4 and n = 45.

from node xi,j to node xi−1,j (if i ≠ 0, and to Null otherwise). The diagonal edge is defined as
follows. Let j^(i) be the integer that shares the same binary representation as j, except for the
(i mod log 2d)-th bit, which is flipped (for example, for d = 2, 3^(2) = 1). The diagonal edge points
from xi,j to xi−1,j^(i) (if i ≠ 0, and to Null otherwise).
The lexicographic order of nodes corresponds to the order in the stack. The top layer may be
incomplete. The handle to the d-FTstack consists of 2d + 1 nodes. One node, Top, stores a pointer
to the current top of the stack. The additional 2d handle nodes hold pointers to the top 2d nodes
of the stack. The FTstack procedures Push() and Pop add and remove the nodes in lexicographic
order and maintain the pointer structure.
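The butterfly wiring above can be sketched as follows (our own illustration, not the authors' code; we flip the (i mod log 2d)-th lowest-order bit, and this bit-numbering convention is an assumption -- the paper's convention may differ):

```python
import math

def edges(i, j, d):
    """Outgoing edges of d-FTstack node (i, j): (straight, diagonal).
    Layer i has 2d nodes, so indices use log(2d) bits."""
    if i == 0:
        return None, None                  # bottom layer points to Null
    bits = int(math.log2(2 * d))           # number of index bits per layer
    flip = 1 << (i % bits)                 # bit flipped by the diagonal edge
    return (i - 1, j), (i - 1, j ^ flip)

# Each non-bottom node reaches two distinct nodes one layer down.
straight, diagonal = edges(3, 5, d=4)
assert straight == (2, 5) and diagonal == (2, 4)
```

The straight edges give every node a direct path down the same column, while the diagonal edges spread connectivity across the layer; together they provide the many node-disjoint paths used in the proof of Claim 2.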

Claim 1 For any d, the d-FTstack is a constant emulation of the stack.

Proof: The d-FTstack and the stack have the same user interface and provide the user with the
same behavior. Thus, the FTstack is an emulation of the stack.
The performance ratios between the stack and the FTstack are as follows:
• Time — Each of the FTstack operations takes a constant number of steps. Thus, the ratio is
also constant.
• Space — The nodes of the FTstack are in one-to-one correspondence with those of the stack.
Each FTstack node requires a constant amount of space. In addition, the FTstack has 2d + 1
handle nodes. Thus, in total, as n grows, the ratio between the space requirements of the
stack and the FTstack is O(1).

3.1 Faults and Reconstruction


When a fault is detected, reconstruction begins. The reconstruction procedure operates in two
phases, described in the following paragraphs.
• Pop phase — In the Pop phase, we remove nodes from the FTstack and place them in auxiliary
storage, such as another FTstack. We try to reach each node via both of its incoming pointers.
If both pointers are unavailable or if the node is faulty, we discard the node. The Pop phase

ends when 2d consecutive reachable nodes are encountered or when the bottom of the FTstack
is reached. At this point the remaining FTstack is functional and has no apparent faults.
(Additional faults may still exist further down the FTstack. In this case we will run the
reconstruction procedure whenever these nodes are encountered.)
• Reinsert phase — In the Reinsert phase, we reinsert the nodes using the FTstack Push()
procedure.
Because we reinsert nodes in the reverse order from which they were popped, the reconstruction
maintains the order of the nodes of the original FTstack.
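The two phases can be outlined schematically (a sketch only; the interface pop_accessible_top/push/empty is an assumed name set, not taken from the paper):

```python
def reconstruct(ftstack, d):
    """Two-phase d-FTstack reconstruction: pop until 2d consecutive
    accessible nodes are seen (or the stack empties), then reinsert."""
    salvaged = []
    streak = 0                                 # consecutive reachable nodes
    while streak < 2 * d and not ftstack.empty():
        node = ftstack.pop_accessible_top()    # tries both incoming pointers
        if node is None:                       # faulty/unreachable: discard
            streak = 0
        else:
            salvaged.append(node)
            streak += 1
    # Reinsert phase: push back in reverse pop order, so the
    # "above-under" relation of the surviving nodes is preserved.
    for node in reversed(salvaged):
        ftstack.push(node)
```

Because every salvaged node is reinserted in reverse pop order, the relative order of surviving nodes matches the original FTstack, as the text notes.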
We now prove a bound on the number of lost nodes as a function of the number of faults f :

Claim 2 If there are f faults then at most O(f log f ) nodes are lost.

Proof: Let F be the set of faulty nodes. Without loss of generality, assume that the number of
faults f = |F | is a power of 2. Recall that a node is inaccessible if it is lost but not faulty.
We begin our analysis by considering a single inaccessible node x0, which resides in a level Li.
There are two cases. First, suppose that level Li+log f+1 (the layer log f + 1 above x0) exists. Let
A(x0) denote the set of nodes belonging to Li+log f+1 that are ancestors of x0.
One characteristic of the butterfly (since the number of faults f < 2d) is that if we trace the
edges back log f + 1 levels from node x0 in level Li to level Li+log f+1, we obtain a binary tree with
2f leaves. Thus, we have |A(x0)| = 2f.
In order for x0 to be inaccessible, one of the following must hold for all nodes y ∈ A(x0): (1)
either y is lost (faulty or inaccessible), or (2) y is accessible but all paths from y to x0 are blocked
by faulty nodes. At most f nodes of A(x0) are lost. This is because the straight edges of the
butterfly structure provide node-disjoint paths from the handle to each of the 2f nodes of A(x0).
Since there are only f faults, these can block only f of these paths.
Let B(x0) denote the set of accessible nodes of A(x0). From the previous paragraph we have
|B(x0)| ≥ f. Let T(x0) be the binary tree of nodes linking B(x0) and x0 (including faulty nodes).
For node z ∈ T(x0), let d(z, x0) be the distance from z to x0. Consider a faulty node z in T(x0).
We count the number of nodes y ∈ B(x0) for which z can block the path from y to x0. The distance
from z to B(x0) is log f + 1 − d(z, x0). Thus, the number of nodes of B(x0) under z is at most
2^(log f + 1 − d(z,x0)) = f · 2^(−d(z,x0)+1). This is the number of y's the faulty z can block.
For nodes z and x0, let

    w(z, x0) = 2^(−d(z,x0)+1) if z is in T(x0), and w(z, x0) = 0 otherwise.

By the above analysis, a faulty node z can block at most f · w(z, x0) paths from nodes of B(x0)
to x0. Since at least f paths from B(x0) to x0 must be blocked for x0 to be inaccessible, it must
be that Σ_{z∈F} f · w(z, x0) ≥ f. Thus, we obtain

    Σ_{z∈F} w(z, x0) ≥ 1.    (1)

For the case where Li+log f+1 does not exist (that is, node x0 is close to the top), a similar argument
also yields Equation 1.
So far we have considered a single inaccessible node x0. Now we consider the set I of all
inaccessible nodes. Consider a faulty node z. Node z is at distance 1 from two descendant nodes,
distance 2 from four descendant nodes, and so on. Thus, summing over all trees T(x) that contain
z, we obtain

    Σ_{x∈I} w(z, x) ≤ Σ_{i=1}^{log f} 2^i · 2^(−i+1) = 2 log f.    (2)

Thus, from Equation 1 we obtain

    |I| = Σ_{x∈I} 1 ≤ Σ_{x∈I} ( Σ_{z∈F} w(z, x) ).    (3)

Exchanging the order of summation and applying Equation 2, we obtain

    |I| ≤ Σ_{z∈F} ( Σ_{x∈I} w(z, x) ) ≤ 2f log f.    (4)

Thus, the total number of lost nodes is |I| + |F| ≤ 2f log f + f.


Next we bound the complexity of the reconstruction procedure.

Claim 3 For any f and d, the d-FTstack reconstruction procedure completes in O(df log f ) steps.

Proof: Popping and reinserting a node takes a constant number of steps. The Pop phase completes
once 2d consecutive accessible nodes are encountered. With f faults, at most f log f consecutive
layers have inaccessible nodes. Since each layer consists of 2d nodes, the total work is O(df log f ).

From Claims 1-3 we obtain the following performance bounds for a d-FTstack:

Theorem 1 For any d, the d-FTstack is (d, O(f log f))-fault tolerant and is a constant emulation
of the stack.

4 Fault-Tolerant Linked Lists


4.1 The Structure
The linked list supports the following operations:
• Insert(v, p): inserts a node with value v before node p.
• Delete(p): removes node p.
• Value(p): returns value stored in node p.
• Next(p): returns the node following node p.
• Head: returns the first node in the list.

Figure 2: A d-FTlist, where d = 4 and thus M = 32.

The essential relation between nodes is the 2-relation "before-after". The linked list is highly
non-fault-tolerant, as explained in the Introduction. We introduce a new family of list-like data
structures that emulate the linked list, the d-FTlist, for d a power of 2. For each such d, the
d-FTlist is (d, O(f log f log d))-fault tolerant with respect to the "before-after" relation, and is a
constant emulation of the linked list.
Instance graphs of the linked list are directed paths. Thus, the stack and the linked list have
the same graph structure. In the previous section we used a layered graph structure to obtain a
fault-tolerant emulation of the stack; we use a similar structure for the fault-tolerant linked list.
However, the essential difference between the stack and the linked list is that in the linked list
nodes may be inserted and deleted anywhere in the graph, whereas nodes in a stack are pushed and
popped only at the top. Thus, the main difficulty in constructing a fault-tolerant list is maintaining
the structure throughout the dynamic changes. We now describe the structure of the FTList in
detail. A sample graph of an FTlist appears in Figure 2.

The Skeleton. The graph of the d-FTlist is composed of a sequence of blocks B^0, . . . , B^i_head.
Each block B^i consists of 2d(log(2d) + 1) vertices arranged in a butterfly structure. In addition,
B^i has one special header node, header(B^i). Vertices of the last level of block B^i point to those in
the first level of B^i−1. The handle of the d-FTlist consists of 2d + 1 nodes. One handle node, head,
points to the head of the list. The remaining handle nodes, h[k], k = 0, . . . , 2d − 1, point to the
nodes in the first level of the first block. We call this block structure the skeleton of the graph.

Folding. The original linked list is folded onto the skeleton as follows. Each node x of the linked
list is mapped to a vertex s of the skeleton (we use the term node for the nodes of the linked list
and vertex for those of the skeleton). Vertex s stores the entire information of x, including the data
field and the next pointer. At most one list node is mapped to any skeleton vertex. The mapping
maintains the order of nodes across blocks. Specifically, if x and y are nodes of the original linked
list and x is before y, then either x and y are in the same block, or x is in a block before that of
y. Within each block, the nodes are mapped arbitrarily. The empty vertices of each block B^i are
chained in a list free_i. A pointer to the head of free_i is stored in header(B^i). In addition, header(B^i)
maintains the variable Load_i, which records the total number of nodes mapped onto vertices of B^i.
Let M = 2d log(4d) be the number of vertices in a block. We maintain the following invariant.

Invariant 1 At all times at least M/4 nodes are mapped to each block B^i (i > 0), and at most M
are mapped to each block B^i (i ≥ 0). (Only the last block B^0 may contain fewer than M/4 nodes.)

4.2 Operations
Head, Next(), and Value(). A full copy of the linked list is embedded in the d-FTlist. Thus,
implementing the procedures Head, Next(), and Value() is easy, with the following small addition.
In the regular linked list the user holds a pointer p, which points to the current location in the
list. In the FTlist one pointer is never enough, because any single memory location may become
faulty. Thus, in addition to the pointer p, the user holds a set of 2d pointers pointing to the nodes
of the first level of the current block. The pointers are initialized to the handle (Head and the first
level of the first block), and updated as the current location changes from one block to the next.
The amortized cost per execution of Next() is O(1).

Insert(v, x). Insert involves the following steps:


1. Create a new node y. Let it store the value v. Insert y before x in the embedded copy of the
linked list.
2. Suppose x is in block B^i. Map the new node y onto a free vertex s in B^i.
3. Update the list free_i, and advance Load_i by 1.
4. If Load_i = M then split B^i into two blocks.
To split a block, first create two skeleton blocks. Then, insert the first half of the nodes in one
block and the second half in the next. Splitting completes in O(M ) operations.
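Steps 1-4 can be sketched in a simplified model in which each block is just the list of nodes mapped to it (insert and block_capacity are our own illustrative names, not the paper's):

```python
import math

def block_capacity(d):
    # M = 2d log(4d): number of vertices per block (see Invariant 1)
    return 2 * d * int(math.log2(4 * d))

def insert(blocks, i, value, d):
    """Map a new node into block i; when the load reaches M, split the
    full block into two half-full blocks (step 4 above)."""
    blocks[i].append(value)
    M = block_capacity(d)
    if len(blocks[i]) == M:
        half = M // 2
        blocks[i:i + 1] = [blocks[i][:half], blocks[i][half:]]

blocks = [[]]
for v in range(block_capacity(2)):      # M = 12 when d = 2
    insert(blocks, 0, v, d=2)
assert len(blocks) == 2                 # the full block was split in two
```

Splitting leaves each new block with M/2 nodes, comfortably between the M/4 and M bounds of Invariant 1, so Θ(M) further updates must occur before the block triggers another split or join; this is what makes the O(M) splitting cost O(1) amortized.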

Delete(x). Delete is the reverse procedure to Insert:


1. Remove x from the embedded copy of the linked list.
2. Suppose x is mapped to vertex s ∈ B^i. Delete x from s, and add s to free_i. Decrease Load_i
by 1.
3. If Load_i < M/4 (and B^i−1 exists) then execute one of the following. If Load_i−1 ≤ 3M/4 then
join B^i with B^i−1. Otherwise, divide the nodes evenly between B^i−1 and B^i.
Joining/dividing is completed in O(M ) operations.
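The rebalancing rule of step 3 can be sketched in the same simplified list-of-blocks model (delete is our own illustrative name; here blocks[0] is nearest the head and the neighbour is blocks[i + 1], whereas the paper numbers blocks in the opposite direction, so the orientation is an assumption of the sketch):

```python
def delete(blocks, i, value, M):
    """Remove a node from block i, then rebalance with the neighbouring
    block when the load drops below M/4 (step 3 above)."""
    blocks[i].remove(value)
    j = i + 1                                     # neighbour toward the tail
    if len(blocks[i]) < M // 4 and j < len(blocks):
        if len(blocks[j]) <= 3 * M // 4:
            blocks[i] += blocks.pop(j)            # join the two blocks
        else:
            combined = blocks[i] + blocks[j]      # divide the nodes evenly
            half = len(combined) // 2
            blocks[i], blocks[j] = combined[:half], combined[half:]

M = 12                                            # M = 2d log(4d) with d = 2
blocks = [list(range(3)), list(range(3, 12))]     # loads 3 and 9
delete(blocks, 0, 0, M)                           # load falls below M/4 = 3
assert blocks == [list(range(1, 12))]             # the two blocks were joined
```

The 3M/4 threshold ensures a join never produces an overfull block, and an even division leaves both blocks well inside the [M/4, M] window, again giving O(1) amortized cost for the O(M) rebalancing work.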

4.3 Faults and Reconstruction


When a fault is detected, reconstruction begins. There are three phases to the reconstruction:

1. Salvage remaining nodes. In this phase the objective is to find as many of the remaining
accessible nodes as possible.

2. Determine the correct order between the salvaged nodes.

3. Reconstruct the data structure.

4.3.1 Salvaging Remaining Nodes

In order to find the remaining nodes we use the underlying butterfly structure of the skeleton.
Recall that at all times the user holds 2d pointers to the first layer of the current block. Thus,
in order to find the remaining nodes, start from the first layer of the block, advance one level at a
time, and gather list nodes en route. Continue until a block with no faults is encountered.
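The level-by-level salvage can be sketched as follows. This is an illustrative sketch under stated assumptions: a block is a list of levels of 2d payload cells, `FAULTY` marks a faulty vertex, and column c of level j connects to columns c and c XOR 2^j of level j + 1 (a standard butterfly wiring; none of these names are the paper's).

```python
FAULTY = object()   # sentinel marking a faulty vertex


def salvage(levels, d):
    """Advance one level at a time from the 2d entry pointers, gathering
    every payload still reachable through non-faulty vertices."""
    width = 2 * d
    reachable = [c for c in range(width) if levels[0][c] is not FAULTY]
    found = [levels[0][c] for c in reachable]
    for j in range(len(levels) - 1):
        step = set()
        for c in reachable:                  # butterfly edges out of level j
            for t in (c, c ^ (1 << j)):
                if levels[j + 1][t] is not FAULTY:
                    step.add(t)
        reachable = sorted(step)
        found.extend(levels[j + 1][c] for c in reachable)
    return found
```

With no faults every vertex is gathered; isolated faults cost little, in the spirit of the O(f log f ) bound of Claim 4.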
By analogy to Claim 2, we have:

Claim 4 With f faulty nodes, at most O(f log f ) nodes are inaccessible.

4.3.2 Tags: Determining Order Between Nodes

Once the remaining nodes have been salvaged, it is necessary to determine the correct order between
these nodes. The problem is that the nodes of the list can be mapped onto the skeleton in an
arbitrary order. Thus, if some of the nodes are lost, we may lose the information on the correct
order between the surviving nodes. To overcome this problem we provide an additional facility that
allows us to determine the correct order between nodes. We do so by using tags. Specifically, for each
node x, we store in the node an integer tag(x), such that if x and y are nodes mapped to the same
block, then, tag(x) < tag(y) iff x is in front of y in the list. Given such tags, we can reconstruct
the order between nodes of a given block by comparing their tags.
In order to maintain the tags we use the Dietz and Sleator [10] construction. The [10] construc-
tion allows one to maintain such tags in a dynamically changing linked list, with tags of size O(log M )
and O(log M ) amortized reassignment cost per insert or delete, for a list of size M . Accordingly,
we maintain a separate instance of the [10] data structure for each block. With this construction
we obtain an O(log M ) = O(log d) cost per insert or delete.
In order to decrease the insertion cost to O(1) amortized time, we use indirection, as described
in [10]. Roughly, the indirection construction works as follows. We divide the O(d log d) nodes
of the block’s linked list into Θ(d) groups, each containing Θ(log d) contiguous elements. We split
the tag of each node into two: a high-order tag, which holds the high-order bits of the tag, and a
low-order tag, which holds the low-order bits of the tag. Nodes of the same group share a high-order
tag. Hence, this tag is stored only once, in a representative node of the group. Low-order tags
are stored in each node. Dietz and Sleator [10] show that using this construction, tags can be
maintained with O(1) amortized cost per insert/delete. See [10] for details.
In order to compare two nodes, we compare their high-order tags, stored at their respective
representative nodes, and their low-order tag, stored at the nodes themselves. This indirection
means that if the representative node is lost, then we lose the order information for all nodes
of the group. Hence, each lost node can result in the loss of an additional O(log d) nodes, and
the total number of lost nodes is O(f log f log d).
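The two-part comparison can be sketched as follows (a hedged sketch: `low`, `rep`, and `high` are illustrative dictionaries standing in for the per-node low-order tags, the node-to-representative map, and the representative-held high-order tags):

```python
def precedes(x, y, low, rep, high):
    """True iff node x is in front of node y in its block: compare the
    group's high-order tag (stored at the representative) first, then
    the node's own low-order tag. A KeyError on `high` models a lost
    representative, which loses the order for its whole group."""
    return (high[rep[x]], low[x]) < (high[rep[y]], low[y])
```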

4.4 The Theorem
Putting it all together, we obtain:

Theorem 2 For any d, the d-FTlist is a constant emulation of the linked list, and is (d, O(f log f log d))-
fault tolerant with respect to the “before-after” relation.

Proof: At least a quarter of the vertices are not empty. Thus, the d-FTlist takes an O(1) factor
more space than the linked list. Functions Value(), Next(), and Head run in a constant number of
steps. Regular insertions and deletions require O(1) operations. After every Ω(d log d) insertions or
deletions in a block (which require O(1) amortized time), the block must be split or joined. These
tasks consist of O(d log d) operations on the skeleton followed by O(d log d) insertions. Thus, the
amortized work per insertion and deletion is O(1).
By Claim 4 with f < d faults, at most O(f log f ) vertices of the skeleton are lost. Each
vertex stores at most one list node. Thus, at most O(f log f ) list nodes are lost. Because of
indirection, O(f log f ) lost nodes may result in losing the order information for O(f log f log d)
nodes, effectively making them unusable in the reconstruction.
At most f + 1 consecutive blocks have inaccessible nodes. Thus, the reconstruction completes
in O(f d log2 d) operations.

5 Fault-Tolerant Trees
5.1 General
We consider a binary search tree that supports the following procedures:
• Insert(v, x): insert a node with value v as a child of node x. A node is inserted as a new leaf
or between a parent and child.
• Delete(x): remove node x from the tree. Only nodes having one or zero children can be
deleted.
• Find(v): search for key v starting from the root.
We construct a family of data structures d-FTtree, for d a power of 2, which are a constant emulation
of the binary search tree. For each d, the d-FTtree is (d, O(f log f log d))-fault tolerant with respect
to the “above-under” and “left-right” relations. We ensure that the depth of a leaf in the recon-
structed structure is no more than in the original tree. Minor modifications of this presentation
allow making fault-tolerant search trees of any bounded degree.

5.2 The Block Tree


Consider a tree T . As in the FTlist, we use a block structure as a skeleton. Each block of the
d-FTtree consists of M = 2d(log(2d) + 1) vertices interconnected in a butterfly structure. Each
block has a special header vertex, storing the free list and the load of the block.

Figure 3: A d-FTtree, with d = 4, M = 32. Left: mapping of nodes to blocks. Right: the
block tree (logical links in dashed, wide links in grey). Blocks B1 , B3 , and B5 have children.
They are uni-component and contain more than M/6 nodes. Blocks B4 , B6 , and B7 are
multi-component. They have no children. Block B6 contains fewer than M/6 nodes, but has
an immediate sibling, B5 , which is uni-component and contains more than M/6 nodes.

The nodes of T are mapped to vertices of the blocks, so that each vertex holds at most one
node. We use the term vertex to refer to the vertices of the blocks, and the term node to refer to
the nodes of the original tree T . We say that the block contains the nodes mapped to it.
The blocks are logically arranged in a tree structure, which we call the block tree and denote
by BT . A sample FTtree is depicted in Figure 3. The mapping of the nodes of T to the blocks
maintains the following conditions:
• Let Child(B) be a child of B in BT . Let T1 , . . . , Tk be the forest of subtrees of T contained in
Child(B). (That is, each Ti is a maximal subtree of T that is entirely contained in Child(B).)
Then, for each i, the root of Ti is a direct child of some node in B.
• If R-Sib(B) is the block to the right of B in BT , then when viewed within T , the nodes
contained in R-Sib(B) are to the right of those contained in B.
For a block B, if all the nodes contained in B are in one connected component (in T ), we say that
B is a uni-component block. Otherwise B is a multi-component block.
Blocks are interconnected using wide links. To construct a wide link between blocks B and B 0 ,
we maintain pointers between the corresponding vertices in B and B 0 . (Recall that all blocks have
the same skeleton structure - a butterfly.) Wide links are maintained between the following blocks:
• Between B and its leftmost child (in BT ).
• Between B and its immediate sibling to the right (in BT ).
Thus, all block children of a given block are connected in a wide-link linked list, rooted at the parent.
The handle to the FTtree consists of 2d pointers to the first layer of the root block.

We will ensure that the FTtree maintains the following invariant:

Invariant 2 At all times the FTtree has the following structure:


• At most M nodes are mapped to any block.
• Only uni-component blocks have (block) children.
• If a block B has a child then B contains at least M/6 nodes.
• If block B contains fewer than M/6 nodes, then either B is the only block in the tree, or its
immediate sibling (to the right or the left) is uni-component and contains at least M/6 nodes.
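A hedged sketch of a checker for this invariant (block records are illustrative dicts, not the paper's representation: `uni` flags a uni-component block, `children` lists block children, and `siblings` lists immediate siblings):

```python
def check_invariant(blocks, M):
    """Assert Invariant 2 over a collection of block records."""
    for b in blocks:
        assert b['load'] <= M                       # never more than M nodes
        if b['children']:                           # children imply uni-component
            assert b['uni'] and b['load'] >= M / 6  # ... and at least M/6 nodes
        if b['load'] < M / 6 and len(blocks) > 1:   # underfull non-singleton block
            assert any(s['uni'] and s['load'] >= M / 6 for s in b['siblings'])
```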

From Invariant 2 we obtain the following claims, which provide bounds on the size of the FTtree,
and the out-degree of blocks in the block tree.

Claim 5 The size of the FTtree is linear in the size of T .

Proof: The proof is by accounting. We assign each block B containing fewer than M/6 nodes to
a block containing at least M/6 nodes as follows. If B is an only child, we assign it to the parent,
which by the invariant contains at least M/6 nodes. Otherwise, we assign it to its immediate
sibling, which by the invariant contains at least M/6 nodes. Thus, at most three extra blocks are
assigned to each block with at least M/6 nodes.

Claim 6 Each block has at most 2M child blocks.

Proof: A block contains at most M nodes. Each node has at most two children in T . In the
worst case, each child is in a separate block.

5.3 Operations
Insertions. When a new node is inserted, it is first mapped to the block containing the node’s
parent. If this block then contains more than M nodes, the block is split into two blocks. There are
two types of splits: horizontal and vertical. A vertical split yields a parent and a child; a horizontal
split yields two siblings. A uni-component block only undergoes a vertical split, and a multi-
component block only undergoes a horizontal split. The split procedure takes O(M ) operations.
The main concern is to amortize the cost over O(M ) inserts/deletes. Below, we show how this is
obtained.

Deletions. If the number of nodes in a block falls below M/6 and it does not have a uni-
component immediate sibling with more than M/6 nodes, it is merged with a sibling, a parent, or
a child. After merging, the resulting block may be too big. In this case, it immediately undergoes
a split. The merge procedure takes O(M ) operations. Again, the main concern is to amortize the
cost over O(M ) inserts/deletes.

Splits. A block is split as a result of an insertion or merge. Therefore, before splitting, the block
may contain between M and 7M/6 nodes.
A vertical split is performed on uni-component blocks. It is accomplished by breaking the single
tree of the overcrowded block into two separate trees. Note that it is always possible to split an
n-node binary tree into two subtrees (in linear time) so that each resulting tree has between n/3
and 2n/3 nodes. Thus, since we are splitting a tree with between M and 7M/6 nodes, the
resulting blocks have size between M/3 and 7M/9. Hence, in all cases Invariant 2 is maintained.
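The existence of such a balanced split follows from the standard descend-to-the-heavier-child argument, which can be sketched as follows (illustrative Python; `children` maps each node to its (left, right) pair, and the bounds hold up to rounding for n ≥ 2):

```python
def split_subtree(children, root):
    """Find a node v (v != root) such that cutting the edge above v
    splits an n-node binary tree into parts of size roughly between
    n/3 and 2n/3: descend into the heavier child while the current
    subtree still holds more than 2n/3 nodes."""
    size = {}

    def fill(v):
        if v is None:
            return 0
        left, right = children[v]
        size[v] = 1 + fill(left) + fill(right)
        return size[v]

    n = fill(root)
    v = root
    while size[v] > 2 * n / 3:
        left, right = children[v]
        s_left = size[left] if left is not None else 0
        s_right = size[right] if right is not None else 0
        v = left if s_left >= s_right else right
    return v, size[v], n - size[v]   # the two part sizes after the cut
```

The key observation is that when a node with more than 2n/3 descendants hands control to its heavier child, that child still has roughly at least n/3 descendants, so the loop stops inside the target window.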
A horizontal split divides a multi-component block into two or more new sibling blocks. Consider
the nodes mapped to B. Let T1 , . . . , T` be the forest of trees on these nodes (as determined by the
original tree T ). There are two cases.
• Each Ti has at most M/3 nodes. In this case we split B into two blocks, each of which has
between M/3 and 2M/3 + M/6 = 5M/6 nodes, as follows. Let v1 , v2 , . . . , vℓ (M ≤ ℓ ≤ 7M/6)
be the nodes mapped to B, enumerated from left to right. Let T1 , . . . , Tk be the trees contained
in B, enumerated from left to right. Let Ti denote the subtree containing node vM/3 . We put
the nodes of the subtrees T1 , . . . , Ti in the first block, and those of subtrees Ti+1 , . . . , Tk in the
second.
• Otherwise, some subtrees have more than M/3 nodes. There are at most 2 such trees. Denote
these large subtrees by L1 and L2 (if it exists). We put each of L1 and L2 in a separate block.
The remaining nodes can be split into at most three groups: those to the left of L1 ; those
between L1 and L2 (if it exists); and those to the right of both L1 and L2 . We map each of
these groups of nodes to a separate block. Note that some of these groups may contain very
few nodes, but they have uni-component neighbors with at least M/3 nodes, thus satisfying
Invariant 2.
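The two cases of the horizontal split can be sketched on component sizes alone (a hedged sketch: `trees` is the left-to-right list of component sizes of the multi-component block; the per-node bookkeeping is omitted):

```python
def horizontal_split(trees, M):
    """Partition the components of a multi-component block into new
    sibling blocks, returning the groups of component sizes."""
    big = [i for i, t in enumerate(trees) if t > M / 3]
    if not big:
        # case 1: cut right after the component containing the (M/3)-th node
        acc, cut = 0, 0
        while acc < M / 3:
            acc += trees[cut]
            cut += 1
        return [trees[:cut], trees[cut:]]
    # case 2: each large component gets its own block; the remaining
    # components form up to three groups around the large ones
    groups, prev = [], 0
    for i in big:
        if trees[prev:i]:
            groups.append(trees[prev:i])
        groups.append(trees[i:i + 1])
        prev = i + 1
    if trees[prev:]:
        groups.append(trees[prev:])
    return groups
```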

Claim 7 The amortized cost of splitting and merging is O(1).

Proof: Splits and merges both cost Θ(M ). We have engineered the splits so that after a horizontal
split of a block B
• multi-component blocks have at most 7M/9 nodes;
• blocks without an M/3-full neighbor have at least M/3 nodes;
• blocks with an M/3-full uni-component neighbor may have an arbitrarily small number of
nodes.
After a vertical split of a block B both resulting blocks have between M/3 and 5M/6 nodes.
Thus, after a split, all resulting blocks that contain between M/3 and 5M/6 nodes will support
Ω(M ) insertions or deletions before they are merged or split. All blocks with fewer than
M/3 nodes have uni-component neighbors. Thus, they need not be merged regardless of the number
of deletions. As for splits, such blocks can accommodate Ω(M ) insertions before they need to be
split.
After a merge, we may need to perform a split (immediately or sometime later). If so, the merge
will “pay for” the cost of the split. Regardless of whether a split ensues, all resulting blocks have at

least M/3 nodes. Therefore, after a merge into a block B, there will be at least M/3 − M/6 = M/6
additional deletes before the block needs to be merged again.

5.4 Faults and Reconstruction


When a fault is detected, reconstruction begins. Reconstruction follows the tree structure of BT
(the block tree). To reconstruct a block, first reconstruct all of its children recursively. The wide-
links, which link the list of child blocks, guarantee that all child blocks are reachable. This is
because the wide links provide 2d separate, node-disjoint paths from the block to all the child
blocks, and at most d of these can contain a fault. After all child blocks have been reconstructed,
the block itself is reconstructed. For each block, reconstruction is performed in three phases:

1. Salvage remaining accessible nodes.

2. Determine the correct topological order between the remaining nodes of the block, and be-
tween the nodes of block and those of the children blocks (if they exist).

3. Reconstruct the block.

Phase (1) is performed using the butterfly structure of the skeleton of the block. By analogy to
Claim 2, if t nodes of the block are faulty, then at most O(t log t) nodes of the block are lost.
The next step is to determine the correct order among the salvaged nodes. The difficulty is
that since some nodes are lost, we may lose information on the relative order among the nodes.
In order to recreate the topological order among the remaining nodes, we again use tags.
Note that when nodes are lost, the reconstructed nodes may not form a binary tree. For
example, the root of the tree may be lost; in this case the remaining nodes form a forest rather
than a tree. Another example is when a node v having two children is lost, and its parent w also
has two children. In this case, attaching the children of v as children of w would maintain the
above-under and left-right relations, but the tree would no longer be a binary tree. In such cases
we introduce dummy nodes in the reconstruction process. Dummy nodes contain no data. They are
used in order to maintain the structure of the tree as a binary tree. The tags enable us to identify
when dummy nodes are necessary. We now proceed to describe the tagging system in detail.
We first show a solution with an O(log n) overhead per insert/delete. Then we present an
improvement which results in an O(log d) overhead per insert/delete. Finally, we use indirection
to reduce the overhead to O(1), but at the cost of an O(log d) factor increase in the number of lost
nodes.

An O(log n) Tag Solution. We maintain the following tagging system. For each node v we
maintain two tags, tagpre (v) and tagrev (v), as follows:

• The tags tagpre preserve the pre-order traversal of the tree; i.e., tagpre (v) < tagpre (w)
iff v is before w in the pre-order traversal of T . (In the pre-order traversal, first the root is
visited, then the left sub-tree recursively, and then the right sub-tree recursively.)

• The tags tagrev preserve the right-to-left pre-order traversal of the tree; i.e., tagrev (v) <
tagrev (w) iff v is before w in the right-to-left pre-order traversal of the tree. (In the right-to-
left pre-order traversal, first the root is visited, then the right sub-tree recursively, and then
the left sub-tree recursively.)

The following claim is easy to validate:

Claim 8 Let v and w be nodes of the tree. Then:

• tagpre (v) < tagpre (w) and tagrev (v) < tagrev (w) iff v is an ancestor of w.

• tagpre (v) < tagpre (w) and tagrev (v) > tagrev (w) iff v is to the left of w.

Thus, together, the two tags tagpre (v) and tagrev (v) allow us to fully identify the topological order
among the remaining nodes, and to reconstruct the original tree. We call such a system of tags a
topological tagging system.
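A sketch of such a topological tagging system, together with the two predicates of Claim 8 (illustrative Python; the tags here are dense static ranks rather than the dynamically maintainable Dietz–Sleator tags the paper uses):

```python
def assign_tags(children, root):
    """Assign tag_pre (left-to-right pre-order rank) and tag_rev
    (right-to-left pre-order rank) to every node of a binary tree.
    `children` maps each node to its (left, right) pair."""
    tag_pre, tag_rev = {}, {}

    def walk(v, tags, flip):
        if v is None:
            return
        tags[v] = len(tags)                       # visit the root first
        left, right = children[v]
        first, second = (right, left) if flip else (left, right)
        walk(first, tags, flip)
        walk(second, tags, flip)

    walk(root, tag_pre, False)
    walk(root, tag_rev, True)
    return tag_pre, tag_rev


def is_ancestor(v, w, tp, tr):
    """Claim 8, first bullet: both tags of v precede those of w."""
    return tp[v] < tp[w] and tr[v] < tr[w]


def is_left_of(v, w, tp, tr):
    """Claim 8, second bullet: the pre-order tags disagree in direction."""
    return tp[v] < tp[w] and tr[v] > tr[w]
```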
An ordered forest is a forest for which there is a full left-right order on the trees of the forest.

Claim 9 Let T be a binary tree reinforced with a topological tagging system. Let V ′ be a subset
of the nodes of T . Then there is a unique ordered forest F ′ (not necessarily binary) on V ′ that
maintains all the left-right and above-under relations of T . This ordered forest is fully determined
by the tagging system. Specifically, for any two nodes v, w ∈ V ′ :

• v is an ancestor of w in F ′ iff v is an ancestor of w in T ,

• v is to the left of w in F ′ iff v is to the left of w in T .

Proof: By Claim 8, for any two nodes v and w one can determine if one node is an ancestor of
the other and their respective order. This fully determines the ordered forest F ′ .
We now describe the reconstruction procedure. As mentioned above, reconstruction is applied
recursively, based on the structure of the block tree. For each block B, the following is performed:

1. Using the skeleton structure of the block, salvage as many as possible of the nodes that are
contained in B. Denote these salvaged nodes by N (B).

2. Based on the tags, reconstruct the ordered forest on N (B) as in Claim 9. Denote this forest
by F (B).

3. If block B is a leaf in the block tree, but not the only block in the tree, then delete block B
from the block tree. Keep the nodes of N (B) in an auxiliary structure. These nodes shall be
reinserted at B’s parent.

4. Otherwise (B is not a leaf in the block tree, or B is the only block in the tree):

(a) If F (B) is not a tree then add a dummy node as the root to F (B). Denote the resulting
tree by T (B). (Recall that if B is not a leaf in the block tree, or is the root block, then
it must be uni-component, i.e., the nodes of B must form a tree.)

(b) For each child block B ′ of B do the following:
i. Choose any node v ∈ N (B ′ ).
ii. For all nodes w ∈ N (B) check if w is an ancestor of v (using the tags). Let w ′ be
the closest ancestor of v in N (B). If no such ancestor is found, then w ′ is set to be
the root of T (B).
iii. Connect the root of T (B ′ ) (the tree mapped to B ′ ) as a direct child of w ′ .
(c) If T (B) is not a binary tree then convert it into a binary tree by adding dummy nodes.
(d) If the number of nodes in the tree is less than M/6, add dummy nodes (in a separate
subtree rooted at the root while maintaining the binary tree structure).
(e) Create a new skeleton structure for B and map the nodes of the resulting tree to the
skeleton, arbitrarily. Insert the new block in the proper location in the block tree.
(f) If B has child blocks that have been deleted in the reconstruction (Step 3, above), then
reinsert the nodes of these blocks, using the tags to identify the proper locations.
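Step 2 above, building the unique ordered forest of Claim 9 on the salvaged nodes, can be sketched with a pre-order stack walk (a hedged sketch: `tp` and `tr` stand for the tagpre and tagrev values, assumed intact for every survivor):

```python
def rebuild_forest(survivors, tp, tr):
    """Rebuild the ordered forest of Claim 9: the parent of each
    surviving node is its nearest surviving ancestor in the original
    tree (None marks the root of a tree of the forest)."""
    order = sorted(survivors, key=lambda v: tp[v])   # pre-order among survivors
    parent, stack = {}, []
    for v in order:
        # pop stack entries that are not ancestors of v (Claim 8 test)
        while stack and not (tp[stack[-1]] < tp[v] and tr[stack[-1]] < tr[v]):
            stack.pop()
        parent[v] = stack[-1] if stack else None
        stack.append(v)
    return parent
```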

To maintain the tags we use the Dietz and Sleator [10] construction for each of the two tags,
tagpre (v) and tagrev (v). In this solution tags are of size O(log n) and each update (insert or delete)
requires O(log n) work (n is the number of nodes in the entire tree).

An O(log d) Tag Solution. Note that the reconstruction procedure uses only the order within
a single block, and between nodes of a parent block and the immediate child blocks. Accordingly,
in order to reduce the insert/delete overhead from O(log n) to O(log d), we replace the global
tagging system described above, which provides a topological tagging system on the entire tree,
with many local systems, each of which provides the order only among nodes of a single block
and between neighboring blocks. Specifically, for each block B we maintain a topological tagging
system covering the nodes of B and its immediate children. Thus, each node v ∈ B takes part in
(at most) two topological tagging systems:

1. The system covering B and its children.

2. The system covering B’s immediate parent and its children, i.e., B, B’s parent and B’s
siblings.

By Claim 6 the number of child blocks of any given block is O(M ) = O(d log d). Each block has
O(d log d) nodes. Thus, the total number of nodes in each topological tagging system is O(d2 log2 d).
Thus, the Dietz and Sleator [10] construction provides insert and delete in O(log d) steps, and
tags of size O(log d).

An O(1) Tag Solution. In order to convert the O(log d) solution to an O(1) solution, we use
indirection, as described in Section 4.3. As with the linked list, indirection may result in a situation
where a node is salvaged but its location in the tree is lost since its high-order tag is stored in
another node, which has been lost. Each node stores the high-order tags for at most O(log d) other
nodes. Thus, the number of lost nodes increases by a factor of at most O(log d).

Thus we obtain:

Claim 10 Using the tags with indirection, insert and delete take amortized O(1) steps, and with
f faults, at most O(f log f log d) nodes are lost.

We now justify the time complexity of the reconstruction algorithms.

Claim 11 Reconstruction takes O(poly(f d)) steps.

Proof: The recursive reconstruction procedure stops when reaching a block with no inaccessible
nodes. There are at most O(f log f ) inaccessible nodes. Hence, at most O(f log f ) blocks undergo
reconstruction. The work of reconstructing a block is polynomial in the number of nodes in the
block, which is O(d log d).
We obtain:

Theorem 3 For any d, the d-FTtree is a constant emulation of the binary dictionary tree, and is
(d, O(f log f log d)) fault tolerant with respect to the “above-under” and the “left-to-right” relations.

6 Fault Tolerance with Expanders


In the fault-tolerant data structures presented so far, f faults result in O(f log f ) inaccessible nodes.
Using expanders, the number of inaccessible nodes can be reduced to O(f ).
We first describe the EFTstack (Expander Fault Tolerant Stack ). As in the FTstack, the
nodes of the EFTstack are grouped in layers. Here however, the layers are interconnected using
a bounded-degree expander (instead of a butterfly structure). Specifically, let Gd = (A, B, E),
|A| = |B| = 2d, be a fixed, bounded-degree bipartite expander graph, with expansion rate α > 1.
The graph interconnecting every two consecutive layers of the d-EFTstack is isomorphic to Gd . In
addition, each node points to the corresponding node in the next layer. Note that d is fixed for any
given d-EFTstack, and thus Gd can be hardwired into the code. Implementations of Pop(), Push(),
and reconstruction are similar in the EFTstack and the FTstack. The details are omitted. We
obtain:

Theorem 4 For any d, the d-EFTstack is a constant emulation of the stack and is (d, O(f ))-fault
tolerant with respect to the “above-under” relation.

Proof: We prove that with f faults, there are at most O(f ) inaccessible nodes. Let U i and Fi
be the set of unreachable and faulty nodes in level i, respectively. By the expansion property, if
|Ui | ≤ d then |Ui | ≤ |Ui+1 |/α + |Fi |. Since α > 1 and |Fi | ≤ d, by induction |Ui | ≤ d, for all i.
Thus, the total number of inaccessible nodes is

Σi |Ui | ≤ Σj≥0 (1/αj ) · Σi |Fi | = O(f ).

Similarly, the d-FTlist and d-FTtree are converted into the d-EFTlist and d-EFTtree. The
blocks of the d-EFTlist and the d-EFTtree contain 2d vertices each (instead of 2d(log(2d) + 1), since
there is no need for levels within blocks). Two blocks are interconnected by a fixed expander
graph. Mapping within each block is in an arbitrary order. The tagging scheme is unchanged.
We obtain the following performance guarantees for the d-EFTlist and the d-EFTtree:

Theorem 5 For any d, the d-EFTlist is a constant emulation of the linked list and is (d, O(f log d))-
fault tolerant with respect to the “before-after” relation.

Theorem 6 For any d, the d-EFTtree is a constant emulation of the binary search tree. It is
(d, O(f log d))-fault tolerant with respect to the “above-under” and “left-to-right” relations.

7 Discussion
In this paper we presented a framework for studying the fault tolerance of pointer-based data
structures, and provided fault-tolerant versions of several common data structures. Throughout,
we considered a worst-case fault model. Other fault models should also be studied; for example, a
probabilistic fault model that takes locality of faults into account may lead to practical fault-tolerant
data structures. Fault-tolerant data structures should also be considered in a hierarchical memory
setting, in which there is locality among memory faults, but where data locality is important for
efficiency.

Acknowledgments.

We are grateful to Martín Farach-Colton and Pino Italiano for several important discussions.

References
[1] N. M. Amato and M. C. Loui. Checking linked data structures. In FTCS-24: 24th International
Symposium on Fault Tolerant Computing, pages 164–175, Austin, Texas, 1994.
[2] Y. Aumann and M. A. Bender. Fault tolerant data structures. In 37th Annual Symposium on Founda-
tions of Computer Science (FOCS), pages 580–589, October 1996.
[3] Y. Aumann, M. A. Bender, and L. Zhang. Efficient execution of nondeterministic parallel programs on
asynchronous systems. Information and Computation, 139(1):1–16, 25 Nov. 1997. An earlier version
of this paper appeared in the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA), June 1996.
[4] G. Barnes. A method for implementing lock-free data structures. In Proceedings of the Fifth ACM
Symposium on Parallel Algorithms and Architectures, pages 261–270, 1993.
[5] M. Ben-Or, S. Goldwasser, and A. Wigderson. Completeness theorems for non-cryptographic fault-
tolerant distributed computation. In Proceedings of the 20th ACM Symposium on Theory of Computing,
pages 1–10, 1988.
[6] E. R. Berlekamp. Algebraic Coding Theory. McGraw-Hill, New York, 1968.

[7] J. D. Bright, G. F. Sullivan, and G. M. Masson. Checking the integrity of trees. In FTCS-25: 25th
International Symposium on Fault Tolerant Computing Digest of Papers, pages 402–413, Pasadena,
California, 1995.
[8] B. Chor, M. Merritt, and D. Shmoys. Simple constant time consensus protocols in realistic failure models.
In Proceedings of the 4th Annual ACM Symposium on the Principles of Distributed Computing, pages
152–162, 1985.
[9] P. Dietz, J. I. Seiferas, and J. Zhang. A tight lower bound for on-line monotonic list labeling. In Al-
gorithm Theory—SWAT ’94: 4th Scandinavian Workshop on Algorithm Theory, volume 824 of Lecture
Notes in Computer Science, pages 131–142. Springer-Verlag, 6–8 July 1994.
[10] P. Dietz and D. Sleator. Two algorithms for maintaining order in a list. In Proceedings of the 19th
ACM Symposium on Theory of Computing, pages 365–372, 1987.
[11] D. Dolev, J. Halpern, B. Simons, and H. Strong. A new look at fault-tolerant network routing. Infor-
mation and Computation, 72(3):180–196, March 1987.
[12] C. Dwork, D. Peleg, N. Pippenger, and E. Upfal. Fault tolerance in networks of bounded degree.
SIAM Journal on Computing, 1989.
[13] P. Feldman and S. Micali. Optimal algorithms for byzantine agreement. In Proceedings of the 20th
ACM Symposium on Theory of Computing, pages 148–161, 1988.
[14] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process.
Journal of the ACM, 32(2):374–382, April 1985.
[15] Google. http://www.google.com/.
[16] J. Gray. Notes on Data Base Operating Systems, pages 393–481. Springer-Verlag, Berlin, 1979.
[17] R. Hagmann. Reimplementing the Cedar File System using logging and group commit. In 11th SOSP,
pages 155–162, December 1987.
[18] J. Hastad, T. Leighton, and M. Newman. Reconfiguring a hypercube in the presence of faults. In
Proceedings of the 28th Annual Symposium on the Foundations of Computer Science, pages 274–284.
IEEE, 1987.
[19] J. Hastad, T. Leighton, and M. Newman. Fast computation using faulty hypercubes. In Proceedings of
the 30th Annual Symposium on the Foundations of Computer Science, pages 251–263. IEEE, 1989.
[20] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data struc-
tures. In Proceedings of the Twentieth Annual International Symposium on Computer Architecture,
1993.
[21] C. Kaklamanis, A. Karlin, F. Leighton, V. Milenkovic, P. Raghavan, S. Rao, C. Thomborson, and
A. Tsantilas. Asymptotically tight bounds for computing with faulty arrays of processors. In Proceedings
of the 31st Annual Symposium on the Foundations of Computer Science, pages 285–296, 1990.
[22] P. Kanellakis and A. Shvartsman. Efficient parallel algorithms can be made robust. In Proceedings of
the 8th Annual ACM Symposium on the Principles of Distributed Computing, pages 211–221, 1989.
[23] M. L. Kazar, B. L. Leverett, O. T. Anderson, V. Apostolides, B. A. Bottos, S. Chutani, C. F. Everhart,
W. A. Mason, S. T. Tu, and E. R. Zayas. Decorum file system architectural overview. In USENIX,
pages 151–164, Summer 1990.
[24] Z. Kedem, K. Palem, A. Raghunathan, and P. Spirakis. Combining tentative and definite executions
for very fast dependable parallel computing. In Proceedings of the 23rd Annual ACM Symposium on
Theory of Computing, pages 381–390, May 1991.
[25] S. Kutten and D. Peleg. Fault-local mending. In Proceedings of the 14th Annual ACM Symposium on
the Principles of Distributed Computing, pages 20–27, 1995.
[26] S. Kutten and D. Peleg. Tight fault-locality. In Proceedings of the 36th Annual IEEE Symposium on
Foundations of Computer Science, 1995.

[27] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays · Trees · Hypercubes.
Morgan Kaufmann Publishers, San Mateo, California, 1992.
[28] T. Leighton and B. Maggs. Expanders might be practical: Fast algorithms for routing around faults
in the multibutterflies. In Proceedings of the 30th Annual Symposium on the Foundations of Computer
Science, pages 384–389. IEEE, October 1989.
[29] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. Elsevier Science
Publishers, Amsterdam, The Netherlands, 1977.
[30] J. I. Munro and P. V. Poblete. Fault tolerance and storage reduction in binary search trees. Information
and Control, 62(2-3):210–218, August 1984.
[31] D. Patterson, G. Gibson, and R. Katz. A case for redundant arrays of inexpensive disks (RAID). In
11th SOSP, pages 386–393, 1988.
[32] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR–15–81, Center for Research
in Computing Technology, Harvard University, 1981.
[33] M. O. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal
of the Association for Computing Machinery, 36(2):335–348, April 1989.
[34] P. Raghavan. Robust algorithms for packet routing in the mesh. In Proceedings of the 1st ACM
Symposium on Parallel Algorithms and Architectures, June 1989.
[35] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM
Transactions on Computer Systems, 10(1):26–52, February 1992.
[36] M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin. An implementation of a log-structured file
system for UNIX. In USENIX, Winter 1993.
[37] G. F. Sullivan and G. M. Masson. Certification trails for data structures. In 21st International Sympo-
sium on Fault-Tolerant Computing (FTCS-21), pages 240–247, 1991.
[38] J. D. Valois. Implementing lock-free queues. In Proceedings of the Seventh International Conference
on Parallel and Distributed Computing Systems, pages 64–69, Las Vegas, NV, 1994.

