
1

CSE 513: Distributed Systems


(Fault-tolerance)
Guohong Cao
Department of Computer Science & Engineering
The Pennsylvania State University
http://www.cse.psu.edu/~gcao
2
Basic Concepts
Fault-tolerance is related to dependability, which covers:
Availability is the probability that the system is operating
correctly at any given moment.
Reliability refers to the property that a system can run
continuously without failure. It is defined in terms of a time
interval.
A system that goes down for 1 ms every hour has an availability of
over 99.9999%, but it is not reliable.
Safety refers to the situation that when a system
temporarily fails to operate correctly, nothing catastrophic
happens; e.g., in nuclear power plants.
Maintainability refers to how easily a failed system can be
repaired. High maintainability often leads to high availability.
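The availability figure above can be checked with a short calculation. This is a minimal sketch that assumes availability is measured as the fraction of time the system is up; the function name `availability` is illustrative and not part of the course material.

```python
def availability(downtime_s: float, period_s: float) -> float:
    """Fraction of a period during which the system is up."""
    return 1.0 - downtime_s / period_s

# 1 ms of downtime in every hour (3600 s)
a = availability(0.001, 3600.0)
print(f"{a:.7%}")
```

The result is just over 99.9999%, yet the system fails once an hour, which is why it is not considered reliable.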
3
Basic Concepts
A system is said to fail when it cannot meet its
specifications.
An error is part of a system's state that may lead
to a failure.
The cause of an error is called a fault.
A bad transmission medium may cause a transmission
error.
A transient fault occurs once and then disappears.
An intermittent fault occurs, vanishes, then reappears.
A permanent fault continues to exist until it is repaired.
Fault tolerance (dependable system): the system
can provide its service even in the presence of
faults.
4
Failure Models
Type of failure                 Description
Crash (fail-stop) failure       A server halts, but is working correctly until it halts
Omission failure                A server fails to respond to incoming requests
  Receive omission              A server fails to receive incoming messages
  Send omission                 A server fails to send messages
Timing failure                  A server's response lies outside the specified time interval
Response failure                The server's response is incorrect
  Value failure                 The value of the response is wrong
  State transition failure      The server deviates from the correct flow of control
Arbitrary (Byzantine) failure   A server may produce arbitrary responses at arbitrary times
Other failure model concepts:
Process failure: generates incorrect results; e.g., deadlock, protection
fault, divide by zero.
Software or hardware fault
Network partition
5
Fault-Tolerance with Redundancy
Information redundancy
Extra bits are added to allow recovery from bit errors, e.g.,
Hamming code
Time redundancy
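An action is performed and, if needed, performed again; e.g., rollback
recovery aborts a transaction and restarts it
Physical redundancy
Extra equipment or processes are added, e.g., triple modular redundancy (TMR)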
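Triple modular redundancy can be illustrated with a majority voter over replica outputs. This is a minimal sketch; the function name `majority_vote` and the use of plain Python values as replica outputs are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence

def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value

# Three replicas compute the same result; one of them is faulty.
print(majority_vote([42, 42, 7]))
```

With three replicas the voter masks one faulty output; if all three disagree, no value can be chosen and the sketch raises an error.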
6
Process Resilience
The key approach to tolerating a faulty process is to
organize several identical processes into groups.
Group management issues such as join, leave.
Group can be flat or hierarchical
A flat group does not have a single point of failure, but decision making is
complicated.
7
Failure Masking and Replication
Process groups are used to replace a single process to
achieve fault tolerance. Two approaches are used:
primary-based: the primary can be replaced in case of failure.
replicated-write: in the form of active replication, or quorum-based
protocols.
In a fail-stop model, k+1 processes are enough for k-fault
tolerance, which means that the group can survive k faults.
In a Byzantine failure model, 2k+1 are needed, so that the k+1 correct
replies outvote the k faulty ones.
In practice, it is difficult to say that at most k processes will fail, so
statistical analysis is needed.
Atomic multicast is needed to make sure all requests
arrive at all servers in the same order.
8
The Two-Army Problem
Without failure, agreement is trivial, but agreement in
faulty systems is much harder.
Agreement is needed in electing a coordinator, deciding whether to
commit or not, synchronization.
There are two generals of the same army who have
encamped a short distance apart.
Their objective is to capture a hill, which is possible only if
they attack simultaneously.
If only one general attacks, he will be defeated.
The two generals can only communicate by sending
messengers, who may be captured, so communication is not reliable.
Is it possible for them to attack simultaneously?
9
Impossibility Proof
Assume there is a protocol that solves the problem by sending
messengers a fixed number of times.
Let P be the shortest such protocol. Suppose the last messenger
in P does not reach the destination. Then either the
message carried by the messenger is useless or one of the
generals does not get a needed message.
If the message were useless, it could be dropped, giving a protocol
shorter than P. Since P is the shortest protocol by our
assumption, the lost message was needed, and hence one of the
generals will not attack. A contradiction.
10
Byzantine Generals Problem
In this problem, the enemy is still on the hill, but n
generals are around the enemy. Reliable communication
can be achieved pairwise. m of the generals are traitors
(faulty) and are trying to prevent loyal generals from
reaching agreement by giving wrong information.
Can the loyal generals reach agreement?
To reach an agreement in a system with m faulty
processes, at least 3m+1 processes are needed.
11 12
Reliable Client-Server Communication
A communication channel may exhibit crash,
omission, timing, and arbitrary failures.
Point-to-point communication:
TCP can mask omission failures, but not crash failures,
where the connection is broken.
One way to deal with a broken connection is to raise an exception, or
to set up a new connection automatically
RPC semantics in the presence of failure
The client is unable to locate the server
The request msg from the client to the server is lost
The server crashes after receiving a request
The reply message from the server is lost
The client crashes after sending a request
13
RPC Semantics
Client cannot locate the server
Solution: raise an exception. This violates transparency, since the
programmer needs to differentiate an RPC from a local procedure call.
Lost request msg
When the timer expires before an ack arrives, resend.
Server crashes lead to different semantics: at least once, at most
once, no guarantee, and exactly once.
14
Exactly-once semantics is impossible to achieve
The server has two strategies: send a completion message either just before it
tells the printer to do its work, or after the text has been printed.
The client has four options: always resend the request, never resend, resend
only when ACKed, resend only when not ACKed.
M: send the completion message, C: crash, P: print the text.
15
RPC Semantics
Lost reply messages
The client cannot tell whether the server is
very slow or the request (or reply) is lost. Sending another
request may cause problems, e.g., when transferring money.
Use sequence numbers, but then the server needs to keep track of all
clients.
Some operations are safe to repeat, e.g., read the first 1024 bytes
of a file. A request that can be executed multiple
times without any side effect is said to be idempotent.
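The sequence-number idea can be sketched as a server that remembers the reply for each (client, sequence number) pair and returns the cached reply for a retransmission instead of executing the request again. The class `DedupServer`, its `withdraw` method, and the starting balance are hypothetical names and values chosen for this sketch.

```python
class DedupServer:
    """Hypothetical server that filters retransmitted requests."""

    def __init__(self):
        self.balance = 100
        # (client_id, seq) -> cached reply; this is the per-client state
        # that the server has to keep.
        self.replies = {}

    def withdraw(self, client_id: str, seq: int, amount: int) -> int:
        key = (client_id, seq)
        if key in self.replies:        # retransmission: do not execute again
            return self.replies[key]
        self.balance -= amount         # non-idempotent operation
        self.replies[key] = self.balance
        return self.balance

server = DedupServer()
first = server.withdraw("c1", seq=1, amount=30)
retry = server.withdraw("c1", seq=1, amount=30)  # reply was lost, client resends
print(first, retry, server.balance)
```

The withdrawal is applied once even though the request arrives twice. The cost is the reply table, which grows with the number of clients unless old entries are discarded.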
16
Client Crashes
After the client crashes, the server may still process the
request, so the computation is active but no parent is
waiting for the result. Such a computation is called an
orphan. Orphans waste CPU cycles. If the client reboots,
the reply from the orphan may come back. Solutions:
Extermination: make a log entry before sending an RPC. After a
reboot, check the log and kill the orphans. Disadvantages:
disk writes, grandorphans.
Reincarnation: divide time into numbered epochs. When a client
reboots, it broadcasts the start of a new epoch. Remote
computations of that client are killed.
Expiration: define a time T for each RPC. If it is not enough,
the server asks for more. The client waits T before rebooting so that no orphan exists
17
Basic Reliable Multicasting
18
Scalability in Reliable Multicasting
If there are N receivers, the sender gets N ACKs for each message,
which results in ACK implosion.
Solution: use NACK.
Problem?
19
Feedback Suppression
Feedback suppression has good scalability, but it
also has some problems:
Accurate scheduling of the NACKs at the receivers
Interrupts to processes which received the messages
correctly. Solutions:
Let receivers that missed m join a new group. This requires setting up a
new group.
Let receivers that tend to miss the same messages team up and
share the same multicast channel for feedback and
retransmission.
Use hierarchical feedback control
20
In hierarchical reliable multicasting, each local coordinator forwards
the message to its children and later handles retransmission requests.
If the coordinator missed a message, it asks the coordinator of the
parent subgroup to retransmit it.
Problem is how to build the tree dynamically.
21
Atomic Multicast
In atomic multicast, a message is delivered to either all
correct processes or to none at all. Also, all messages are
delivered in the same order to all correct processes.
Useful in making all replicas have the same view
Failed process can synchronize again after recovery
Message receipt is different from message delivery.
22
Virtual Synchrony
As the group membership changes, a view change takes
place by multicasting a vc which indicates the member
change.
A reliable multicast is virtually synchronous if a message
multicast to group view G is delivered to each nonfaulty
process in G. If the sender of the message crashes during
the multicast, the msg may either be delivered to all or
none.
It must deal with sender failure during the sending process.
All multicasts that are in transit while a view change takes
place are completed before the view change comes into
effect.
23
Virtual Synchronous Multicast
24
Message Ordering
Six different versions of virtually synchronous reliable multicasting
Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
25
Implementing Virtual Synchrony
All messages sent to view G are delivered to all nonfaulty
processes in G before the next group membership change.
If the sender does not fail, easy. When the sender crashes,
processes should get m from somewhere else.
Every process in G keeps m until it knows that everyone
receives it. At this time, m is said to be stable, and can be
delivered.
A flush message is used to announce a view change. Channels
are assumed to be FIFO and reliable. This scheme is used in ISIS.
26
27
Distributed Commit
Atomic multicasting is one example of
distributed commit. Other examples include
transaction commit.
In one-phase commit protocol, the coordinator
tells all participants to perform an operation.
If one participant cannot perform the operation, the result is inconsistent.
The solution is to use a two-phase commit protocol.
On failure, cleanup must enforce all-or-nothing
semantics.
If multiple sites are involved, all should agree on the
same outcome.
28
Phase 1, at the coordinator:
1. The coordinator sends a commit_request to every cohort requesting the cohorts
to commit.
2. The coordinator waits for replies from all the cohorts.
At cohorts:
1. On receiving the commit_request, if the transaction executed successfully, the cohort
writes UNDO and REDO logs and sends an agree to the coordinator; otherwise,
it sends an abort.
Phase 2, at the coordinator:
1. If all cohorts agree, the coordinator writes a commit into the log. Then, it sends
a commit to all cohorts. Otherwise, it sends an abort.
2. The coordinator waits for ack from each cohort.
3. If an ack is not received from some cohort within a timeout period, it resends the
commit/abort to that cohort.
4. If all the acks are received, it writes a complete to the log.
At cohorts:
1. On receiving a commit, it releases locks for executing the transaction, and sends
an ack.
2. On receiving an abort, it undoes the transaction using UNDO, releases the
locks and sends an ack.
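The coordinator's side of the two phases can be sketched as follows. This is a simplified, single-process illustration: cohorts are modeled as callables that return True for agree, the log is a Python list, and acks are assumed to always arrive. The name `two_phase_commit` and these modeling choices are assumptions, not part of the protocol description above.

```python
from typing import Callable, Dict, List

def two_phase_commit(cohorts: Dict[str, Callable[[], bool]], log: List[str]) -> str:
    """Run one 2PC round; each cohort callable returns True for 'agree'."""
    # Phase 1: send commit_request to every cohort and collect the replies.
    votes = {name: vote() for name, vote in cohorts.items()}
    # Phase 2: commit only if every cohort agreed.
    decision = "commit" if all(votes.values()) else "abort"
    if decision == "commit":
        log.append("commit")   # the commit record is written before commit is sent
    # Acks are assumed to arrive here; a real coordinator resends on timeout.
    log.append("complete")
    return decision

log_ok: List[str] = []
print(two_phase_commit({"a": lambda: True, "b": lambda: True}, log_ok), log_ok)
log_bad: List[str] = []
print(two_phase_commit({"a": lambda: True, "b": lambda: False}, log_bad), log_bad)
```

A single abort vote is enough to abort the whole transaction, which gives the all-or-nothing outcome.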
29
2PC With Failures
In case of message loss, resend.
Coordinator crashes before writing commit
On recovery, it broadcasts an abort, cohorts who had
agreed to commit undo the transaction and abort. Other
cohorts simply undo the transaction. Note that cohorts
are blocked until they receive an abort.
Coordinator crashes after writing commit but
before writing the complete
On recovery, it broadcasts a commit and waits for acks.
Cohorts are blocked until they receive commit.
Coordinator crashes after writing the complete.
Nothing to be done on recovery.
30
2PC With Failures
If a cohort crashes in Phase 1, the coordinator will
abort the transaction.
Suppose a cohort crashes in Phase II, i.e., after
writing UNDO and REDO.
On recovery, the cohort will check with the coordinator
whether to abort/commit.
Committing may require a redo operation because the
cohort may have failed before updating the database.
31
Actions taken by a participant P when residing in state READY
and having contacted another participant Q. If all in READY,
still wait for coordinator.
32
Three-Phase Commit
2PC is blocking. To be non-blocking:
There is no single state from which it is possible to make a
transition directly to either a commit or an abort.
There is no state in which it is not possible to make a final
decision, and from which a transition to a commit can be made.
33
3PC
A participant P may block in ready or precommit. On
timeout, P can conclude that the coordinator has failed.
If P contacts any other participant that is in commit (or
abort), P should move to that state. If all are in precommit,
commit. If any is in init, abort.
If all participants that P can contact are in ready (and they
form a majority), abort.
If a crashed process recovers to init, it should abort.
If a crashed process recovers to precommit, it can still abort.
This differs from 2PC, where a crashed participant may recover to
commit; in 3PC, it can only recover to init, abort, or precommit.
If all processes that P can reach are in precommit (and they
form a majority), commit. Note that no process is in init at
this time.
34
Replication
A common technique to provide fault tolerance is to
replicate data at many sites.
Commit protocols can be used to update multiple copies of
data, but they cannot tolerate failures.
It is desirable that sites can continue to operate even when
other sites have crashed.
With voting, each replica is assigned some number of
votes, and a majority of votes must be collected before
accessing the replica.
Static voting
Dynamic voting
35
Vote Assignment
A site must acquire at least w (write quorum)
votes before a write, and r (read quorum) votes
before a read. With v the total number of votes:
r + w > v
w > v/2
The first condition guarantees that every read quorum and
write quorum intersect
A read and a write will not be performed concurrently.
A read operation can get the latest copy.
The second condition ensures that two write
quorums intersect
If there is a network partition, only one group allows
writes.
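The two conditions are easy to check mechanically. A minimal sketch, where `valid_quorums` is an illustrative name and v is the total number of votes:

```python
def valid_quorums(v: int, r: int, w: int) -> bool:
    """Check r + w > v (read/write overlap) and w > v/2 (write/write overlap)."""
    return r + w > v and 2 * w > v

print(valid_quorums(6, 3, 4))  # six votes, r=3, w=4
print(valid_quorums(6, 3, 3))  # two disjoint write quorums would be possible
```

Writing the second condition as `2 * w > v` avoids fractional arithmetic when v is odd.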
36
The Voting Algorithm
Each replica of the data has version number associated
with it, which is initialized to 0.
When site i performs a read or a write, it first broadcasts a
request for votes.
A site receiving the request replies with the version
number and the vote (lock may be required).
After acquiring enough votes for the read/write quorum
within a timeout period, site i can perform the operation.
For a read, the site checks the version numbers of all the
collected replicas and uses the one with the highest version
number.
For a write, the site finds the replica with the highest version
number and does the write. After the write operation, it
updates all copies in the write quorum.
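The read and write steps can be sketched with replicas modeled as (version number, value) pairs. Vote collection, locking, and timeouts are left out, and the sketch assumes that a write stores the highest version number in the quorum plus one. The names `quorum_read` and `quorum_write` are illustrative.

```python
from typing import Dict, List, Tuple

Replica = Tuple[int, str]  # (version number, value)

def quorum_read(replies: List[Replica]) -> Replica:
    """Return the reply with the highest version number."""
    return max(replies, key=lambda rep: rep[0])

def quorum_write(store: Dict[str, Replica], quorum: List[str], value: str) -> int:
    """Write value to every site in the quorum under a new version number."""
    latest = max(store[s][0] for s in quorum)
    for s in quorum:
        store[s] = (latest + 1, value)   # assumption: version is incremented
    return latest + 1

store = {"A": (0, "x"), "B": (0, "x"), "C": (0, "x")}
quorum_write(store, ["A", "B"], "y")           # C misses the update
print(quorum_read([store["B"], store["C"]]))   # read quorum {B, C}
```

Because the read quorum {B, C} overlaps the write quorum {A, B}, the read sees the new value even though C is stale.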
37
Performance
The values of r and w may affect the system performance.
For example, N1=N2=N3=1 vote, N4 =2 votes.
With r=1, w=5, any site failure will make write
unavailable.
With r=3 and w=3, the write operation can tolerate some
site failures, but the read operation may be less efficient. For
example, a read operation may need to get more remote
votes, which increases the access delay.
Assigning more votes to a highly reliable site makes the
system more reliable.
38
Handle Failures
Under site failures only:
Read and write operations can be performed if the live
sites hold at least w votes.
If the number of available votes is less than w, but at least r,
then only reads can be performed; otherwise, even reads
cannot be done.
Under network partition. Three scenarios:
One group has a read and write quorum, and all others
have neither read nor write quorum.
Some groups have read quorum, but no group has a
write quorum.
No group has even a read quorum.
39
An Example
Consider a system with six sites A, B, C, D, E, and F, each
with one vote. Let w=4, r=3.
If only site B fails, operations can still be performed.
If the network is divided into two groups {ABCD} and
{EF}, both read and write can be performed in the first
group, but no operation in the second group.
If the network is divided into {ABC} and {DEF}, only
reads can be performed.
If the network is divided into {AB}, {CD}, and {EF}, no
operation can be performed.
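The cases in this example can be reproduced by counting the votes held by each connected group. A minimal sketch; `allowed_ops` is an illustrative helper, and groups are written as strings of site letters.

```python
def allowed_ops(group: str, votes: dict, r: int, w: int) -> set:
    """Operations that a connected group of sites can perform."""
    total = sum(votes[s] for s in group)
    ops = set()
    if total >= r:
        ops.add("read")
    if total >= w:
        ops.add("write")
    return ops

votes = {s: 1 for s in "ABCDEF"}
for group in ["ACDEF", "ABCD", "EF", "ABC", "DEF", "AB"]:
    print(group, sorted(allowed_ops(group, votes, r=3, w=4)))
```

The group ACDEF corresponds to the case where only site B has failed.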
40
Dynamic Voting Protocols
Network partition or site failures may make read
and write operations unavailable.
Dynamic voting solves this problem by adapting
the number of votes or the set of sites that can
form a quorum.
Majority based approach: the set of sites that can form a
majority to allow access to replicated data changes
dynamically.
Dynamic vote reassignment: the number of votes
assigned to a site changes dynamically.
41
An Example
[Figure: partition graph with nodes ABCDE, ABD, CE, AB, D, A, B, and ACE]
With static voting, where each site has one vote, only ABCDE, ABD, and ACE allow updates.
With a dynamic voting protocol, ABCDE, ABD, AB, A, and ACE allow
updates.
42
Definitions
Version number (VN_i): counts the number of
updates to site i. Initialized to 0.
Number of replicas updated (RU_i): the number of
replicas participating in the most recent update.
Initialized to the number of replicas.
Distinguished site list (DS_i): stores ids of one or
more sites.
When RU_i is even, DS_i identifies the replica with the
largest id number.
When RU_i is odd, DS_i is empty except when RU_i = 3, in
which case DS_i lists the three replicas.
43
         A     B     C     D     E
VN       3     3     3     3     3
RU       5     5     5     5     5
DS       -     -     -     -     -

         A     B     C     D     E
VN       4     4     4     3     3
RU       3     3     3     5     5
DS       abc   abc   abc   -     -

         A     B     C     D     E
VN       4     5     5     3     3
RU       3     3     3     5     5
DS       abc   abc   abc   -     -

         A     B     C     D     E
VN       4     6     6     6     6
RU       3     4     4     4     4
DS       abc   b     b     b     b

         A     B     C     D     E
VN       4     7     7     6     6
RU       3     2     2     4     4
DS       abc   b     b     b     b
44
The Protocol
1. Site i issues a lock_request to its local lock manager. If the lock is granted,
i sends a vote_request to all the sites.
2. When a site j receives the vote_request, it obtains locks and sends the
values of VN_j, RU_j, and DS_j to site i.
3. From the responses, site i decides whether it belongs to the distinguished
partition, described later.
4. If i does not belong to the distinguished partition, it releases its locks and
sends abort to the sites that responded, which also release their locks.
5. If i belongs to the distinguished partition, it gets the current data copy and
performs the update. Site i also executes the update protocol. It then sends a commit to
all participating sites and asks them to update the data, VN_j, RU_j, and DS_j.
It then releases the locks.
6. When a site j receives the commit message, it updates the data, VN_j,
RU_j, and DS_j, and then releases the locks.
45
Distinguished Partition
Notation:
P: the set of responding sites
M: the most recent version in the partition
Q: the set of sites containing the version M
N: the number of sites that participated in the latest
update indicated by version number M.
We have
$M = \max\{VN_j : j \in P\}$
$Q = \{j \in P : VN_j = M\}$
$N = RU_j$, where $j \in Q$
46
Distinguished Partition
If Cardinality(Q) > N/2, site i is a member of the
distinguished partition.
Else if Cardinality(Q) = N/2, break the tie by selecting a
site $j \in Q$; if $DS_j \in Q$, then i belongs to the distinguished
partition.
Otherwise, if N = 3 and P contains two or all three sites
indicated by the DS variable of the site in Q, then i belongs
to the distinguished partition.
Otherwise, i does not belong to the distinguished partition.
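The three tests can be sketched as a single function over the states returned by the responding sites. Each state is modeled as a (VN, RU, DS) tuple with DS as a frozenset of site ids; this representation and the name `in_distinguished_partition` are assumptions for illustration. The sample states are taken from the example tables of this section.

```python
from typing import Dict, FrozenSet, Tuple

State = Tuple[int, int, FrozenSet[str]]  # (VN, RU, DS)

def in_distinguished_partition(responses: Dict[str, State]) -> bool:
    """Apply the three tests to the states returned by the sites in P."""
    P = set(responses)
    M = max(vn for vn, _, _ in responses.values())
    Q = {j for j in P if responses[j][0] == M}
    j = next(iter(Q))              # any site in Q holds the latest RU and DS
    _, N, DS = responses[j]
    if 2 * len(Q) > N:             # strict majority of the last update
        return True
    if 2 * len(Q) == N:            # tie: the distinguished site decides
        return bool(DS) and DS <= Q
    if N == 3:                     # static voting over the three listed sites
        return len(DS & P) >= 2
    return False

b = frozenset("B")
abc = frozenset("ABC")
print(in_distinguished_partition({"B": (6, 4, b), "C": (6, 4, b)}))      # tie, B present
print(in_distinguished_partition({"D": (6, 4, b), "E": (6, 4, b)}))      # tie, B absent
print(in_distinguished_partition({"A": (4, 3, abc), "B": (5, 3, abc)}))  # N = 3 case
```

In the tie case only the half that contains the distinguished site may proceed, so the two halves can never both accept updates.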
47
Update Protocol
Update is invoked when a site is ready to
commit. For site i:
$VN_i = M + 1$
$RU_i = \text{Cardinality}(P)$
$DS_i$ is updated as follows when $N \neq 3$, since static voting
is used when N = 3:
$DS_i = K$ if $RU_i$ is even, where K is the site with the
highest order
$DS_i = P$ if $RU_i = 3$
Note that the protocol may have deadlocks
because of using locks.
48
Dynamic Vote Reassignment
Change the votes of the site in the majority
partition such that the loss of sites is properly
compensated and further partitions can be
handled.
Overthrow technique
After a partition or a failure, for each site x outside the
majority group, there will be one site (e.g., a) that takes
over the vote of x. For example, v(a) is increased by
2v(x).
Side effect: some sites are more powerful.
Alliance technique
The vote is evenly distributed among all nodes.
49
An Example
Suppose v(a) =6, v(b)=v(c)=v(d) = 5. The total votes are
21, and the majority is 11.
Assume that site a is disconnected, and {b,c,d} becomes
the majority group with 15 votes.
Using the overthrow technique, assume b takes over the votes of a.
v(a) = 6, v(b) = 17, v(c) = v(d) = 5.
The total number of votes is now 33, and the majority is 17.
With the alliance technique, the final votes are:
v(a) = 6, v(b) = v(c) = v(d) = 11.
The total number of votes is 39, and the majority is 20.
Suppose site c then disconnects. Without vote reassignment,
group {b,d} is not a majority. With reassignment, it is.
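The overthrow numbers in this example can be verified with a few lines. This sketch covers only the overthrow rule (the taker gains twice the votes of the lost site) and takes the majority to be the smallest strict majority of the total; the helper names `majority` and `overthrow` are illustrative.

```python
def majority(votes: dict) -> int:
    """Smallest vote count that is a strict majority of all votes."""
    return sum(votes.values()) // 2 + 1

def overthrow(votes: dict, lost: str, taker: str) -> dict:
    """Return new votes where `taker` gains twice the votes of `lost`."""
    new = dict(votes)
    new[taker] += 2 * votes[lost]
    return new

votes = {"a": 6, "b": 5, "c": 5, "d": 5}
after = overthrow(votes, lost="a", taker="b")
print(after, sum(after.values()), majority(after))
# Site c disconnects: compare group {b, d} before and after reassignment.
print(votes["b"] + votes["d"] >= majority(votes))
print(after["b"] + after["d"] >= majority(after))
```

Adding 2v(x) rather than v(x) makes the taker outweigh the lost site even when that site's votes are still counted in the total.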
50
Recovery
Two basic approaches are used to recover to a correct state
from an error state
Backward recovery: periodically store error-free states in stable
storage. On detecting an error, restore the system to the most
recent recovery point.
Forward recovery: on detection of an error, take steps to move the
system into a correct state.
Disadvantages of backward recovery:
Overhead of checkpointing
An error may occur while recovering
There may be no safe recovery point
Advantages of backward recovery:
No need to anticipate all fault types
Widely implemented in most systems
51
Operation-based Approach
Keep a log or audit trail for each transaction
Update-in-place:
Every update to an object results in a log to be recorded in a stable
storage which has enough information to completely undo and
redo the operation.
A DO operation updates the data and logs the state of the object
before and after each operation on stable storage.
If a transaction commits, nothing more to do
If the transaction aborts, restore the old value (UNDO)
If failure, UNDO and REDO.
Problem: the system crashes before the log is flushed to
stable storage.
52
Write-Ahead Logging
The write ahead log protocol follows two rules:
Update an object only after the UNDO log is recorded.
Before committing the updates, REDO and UNDO logs
are recorded.
On recovery, UNDO the effects of uncommitted
transactions. On restart, REDO the necessary
operations.
If failures are rare, the logging overhead is very
expensive in terms of storage requirements and
CPU time.
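The two rules can be illustrated with a toy key-value store whose log is a Python list standing in for stable storage. The class `WalStore` and its record layout are assumptions made for this sketch; it also uses None to mean that a key did not exist, so None cannot be stored as a value.

```python
class WalStore:
    """Toy key-value store; self.log stands in for the log on stable storage."""

    def __init__(self):
        self.data = {}
        self.log = []  # ("update", txn, key, old, new) or ("commit", txn)

    def update(self, txn, key, new):
        # Rule 1: record the undo (and redo) information before the update.
        self.log.append(("update", txn, key, self.data.get(key), new))
        self.data[key] = new

    def commit(self, txn):
        # Rule 2: all update records are in the log before the commit record.
        self.log.append(("commit", txn))

    def recover(self):
        """Undo uncommitted updates, then redo committed ones."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in reversed(self.log):            # undo pass, newest first
            if rec[0] == "update" and rec[1] not in committed:
                _, _, key, old, _ = rec
                if old is None:
                    self.data.pop(key, None)
                else:
                    self.data[key] = old
        for rec in self.log:                      # redo pass, oldest first
            if rec[0] == "update" and rec[1] in committed:
                self.data[rec[2]] = rec[4]

s = WalStore()
s.update("t1", "x", 1)
s.commit("t1")
s.update("t2", "x", 5)   # t2 never commits
s.update("t2", "y", 9)
s.recover()              # as if the system crashed and restarted here
print(s.data)
```

After recovery only the effects of the committed transaction t1 remain.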
53
State-Based Approach
In state-based approach, the complete state of a process is
saved when a recovery point is established.
The process of saving states on the stable storage is called
checkpointing or taking a checkpoint.
The recovery point at which the checkpointing occurs is called a
checkpoint.
The process of restoring a process to a prior-state is called rolling back.
Shadow page: whenever a process wants to modify an object,
the page containing the object is duplicated and is maintained
on the stable storage.
The process only updates one copy. The unmodified copy is called the
shadow page.
If the process fails, the modified copy is discarded and the shadow page
is used.
If the process commits, the shadow page is discarded.
54
Orphan Messages and the Domino Effect
An orphan message is a message whose receiving event is
recorded in the checkpoint, but its sending event is not.
Domino effect: one rollback causes others to roll back; e.g.,
when Y goes back to checkpoint y_2.
[Figure: processes X, Y, and Z with checkpoints x_1, x_2, x_3, y_1, y_2, z_1, z_2, z_3 and message m]
55
Lost Messages
A message whose sending event is recorded, but
its receiving event is not recorded.
Y fails and rolls back to y_1, and then X is in a state
where it sent m but Y will never receive it. This
situation can also happen if the communication channel
is not reliable.
[Figure: processes X and Y with checkpoints x_1 and y_1, message m, and a failure]
56
Livelock
Livelock is a situation in which a single failure can cause
an infinite number of rollbacks, preventing the system
from making progress.
[Figure: two timelines of processes X and Y with checkpoints x_1 and y_1, messages m1, m2, n1, n2, and a failure]
57
Consistent Checkpoints
A process saves its local state on the stable storage, which is
called a local checkpoint.
The process of saving local states is called local checkpointing.
All the local checkpoints, one from each site, collectively form a
global checkpoint.
A global checkpoint is a strongly consistent set of checkpoints if
there is no orphan message and no lost message.
A global checkpoint is a consistent set of checkpoints if there is
no orphan message.
If every process takes a checkpoint after sending every message,
the set of the most recent checkpoints is always consistent.
However, it has a high overhead. How about taking a checkpoint
after every k (k>1) messages sent?
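Whether a set of local checkpoints is consistent can be checked by looking for orphan messages. In this sketch each process numbers its local events, a checkpoint is the index of the last event it records, and a message carries the event indices of its send and its receive. The names `Msg`, `orphans`, and `is_consistent` are illustrative.

```python
from typing import Dict, List, NamedTuple

class Msg(NamedTuple):
    sender: str
    send_event: int   # local event index at the sender
    receiver: str
    recv_event: int   # local event index at the receiver

def orphans(cut: Dict[str, int], msgs: List[Msg]) -> List[Msg]:
    """Messages whose receive is recorded in the cut but whose send is not."""
    return [m for m in msgs
            if m.recv_event <= cut[m.receiver] and m.send_event > cut[m.sender]]

def is_consistent(cut: Dict[str, int], msgs: List[Msg]) -> bool:
    """A global checkpoint is consistent if it contains no orphan message."""
    return not orphans(cut, msgs)

msgs = [Msg("X", 3, "Y", 2)]
print(is_consistent({"X": 2, "Y": 2}, msgs))  # receive recorded, send not
print(is_consistent({"X": 3, "Y": 2}, msgs))  # both recorded
print(is_consistent({"X": 3, "Y": 1}, msgs))  # send recorded only: a lost message
```

The last cut is consistent but not strongly consistent, because the message is lost.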
58
The System Model
Processes communicate by message passing.
Channels are FIFO. End-to-end protocols are
assumed to cope with message loss due to rollback
recovery and communication failure
Another way to handle message loss is to have
processes log messages before each send.
Communication failures do not partition the
network.
59
Coordinated Checkpointing
Two kinds of checkpoints: tentative and
permanent.
First phase
The initiator requests all processes to take tentative
checkpoints.
Each process informs the initiator whether it would like
to take a tentative checkpoint. A process will not say
yes if there is a failure or other reasons.
Second phase
The initiator requests them to make tentative
checkpoints permanent if it receives yes from all of
them; otherwise asks them to discard the tentative
checkpoints.
60
An Optimization
Sometimes, it is not necessary to ask all processes to take
checkpoints for each checkpointing initiation.
[Figure: processes X, Y, and Z with checkpoints x_1, x_2, y_1, y_2, z_1, z_2 and message m]
61
Each message is labeled with a monotonically increasing
number, m.l. The symbol ⊥ denotes a label smaller than any message label.
Let m be the last message that X received from Y after X
took its last checkpoint. Then
last_rcvd_X[Y] = m.l if m exists; otherwise, it is ⊥.
Let m be the first message that X sent to Y after X took its
last checkpoint. Then
first_sent_X[Y] = m.l if m exists; otherwise, it is ⊥.
Whenever X requests Y to take a tentative checkpoint, X
sends last_rcvd_X[Y] along with its request; Y takes a
checkpoint only if
last_rcvd_X[Y] ≥ first_sent_Y[X] > ⊥
cohort_X = {Y | last_rcvd_X[Y] > ⊥}
Initial state at all processes p:
for all processes q do first_sent_p[q] = ⊥;
OK_cp_p = yes if p is willing to take a checkpoint, no otherwise.
62
At initiator process P_i:
    send take_tentative(P_i, last_rcvd_Pi[p]) to each process p ∈ cohort_Pi
    if all processes replied yes then
        send make_permanent to each process p ∈ cohort_Pi
    else send abort to each process p ∈ cohort_Pi
At each process p:
    upon receiving take_tentative(q, last_rcvd_q[p]) from q do
        if OK_cp_p = yes and last_rcvd_q[p] ≥ first_sent_p[q] > ⊥ then
            take a tentative checkpoint
            send take_tentative(p, last_rcvd_p[r]) to each process r ∈ cohort_p
            if all processes r ∈ cohort_p replied yes then
                OK_cp_p = yes
            else OK_cp_p = no
        send (p, OK_cp_p) to q
63
Algorithm (continued)
Upon receiving make_permanent:
    make the tentative checkpoint permanent
    send make_permanent to all processes r ∈ cohort_p
Upon receiving abort:
    undo the tentative checkpoint
    send abort to all processes r ∈ cohort_p
64
Recovery
First phase
The initiator requests all processes to restart from their
previous checkpoints.
Each process informs the initiator whether it would like
to restart. A process will not say yes if it is already
participating in a checkpointing or a recovering process
initiated by some other process.
If the initiator learns that all processes are willing to
restart, it asks them to roll back; otherwise, processes
should continue their normal activities.
Second phase
The initiator propagates its decision to all processes,
which acts accordingly.
65
An Optimization
Sometimes, it is not necessary to ask every process to
rollback.
[Figure: processes X, Y, and Z with checkpoints x_1, x_2, y_1, y_2, z_1, z_2, message m, and a failure]
66
Data Structure
Let m be the last message that X sent to Y before X took
its last checkpoint. Then
last_sent_X[Y] = m.l if m exists, otherwise, it is
When X requests Y to restart from its previous
checkpoint, it sends last_sent_X[Y] along with its request; Y
restarts from its previous checkpoint only if
last_rcvd_Y[X] > last_sent_X[Y]
cohort_X = {Y | X can send messages to Y}
Initial state at all processes p:
resume_execution_p = true
for each process q do last_rcvd_p[q] = ⊤;
OK_roll_p = yes if p is willing to roll back, no otherwise.
67
At initiator process P_i:
    send prepare_roll(P_i, last_sent_Pi[p]) to each process p ∈ cohort_Pi
    if all processes replied yes then
        send rollback to each process p ∈ cohort_Pi
    else send abort to each process p ∈ cohort_Pi
At each process p:
    upon receiving prepare_roll(q, last_sent_q[p]) from q do
        if OK_roll_p = yes and
           last_rcvd_p[q] > last_sent_q[p] and resume_execution_p then
            resume_execution_p = false
            send prepare_roll(p, last_sent_p[r]) to each
            process r ∈ cohort_p
            if all processes r ∈ cohort_p replied yes then
                OK_roll_p = yes
            else OK_roll_p = no
        send (p, OK_roll_p) to q
68
Algorithm (continued)
Upon receiving rollback, if resume_execution_p = false:
    restart from p's last checkpoint
    send rollback to all processes r ∈ cohort_p
Upon receiving abort:
    resume execution
    send abort to all processes r ∈ cohort_p
69
Asynchronous Checkpointing
Under asynchronous approach, checkpoints at
each process are taken independently without
synchronization.
Remove the synchronization overhead of the
coordinated approach.
May result in domino effects.
One solution is to log incoming messages.
Pessimistic message logging: an incoming message is
logged before it is processed. Too much overhead, and
it slows down computation
Optimistic: processes continue to perform the
computation, and store messages in memory, which
will be logged at certain intervals. More rollbacks
during failure, may still have domino effects.
70
Computation Model
The communication channels are reliable, FIFO, and have
infinite buffers.
The message communication delay is arbitrary, but finite.
The underlying computation is assumed to be event-driven,
where a process waits until a message is received,
processes the message, changes its state, and sends
messages to other processes.
71
Notations
rcvd_ij(cp_i) represents the number of messages received
by process i from process j, as stored in the checkpoint cp_i.
sent_ij(cp_i) represents the number of messages sent by
process i to process j, as stored in the checkpoint cp_i.
[Figure: processes X, Y, and Z with events e_x0 to e_x2, e_y0 to e_y3, and e_z0 to e_z3]
72
The Algorithm
At process i:
(a) if i is a process that is recovering after a failure then
        cp_i = latest event logged in the stable storage
    else cp_i = latest event that took place in i
(b) for k = 1 to N do
        send rollback(i, sent_ij(cp_i)) to every other process j
        wait for rollback messages from every other process
        for every rollback(j, c) message received from
        another process j, i does the following:
            if rcvd_ij(cp_i) > c then
                find the latest event e such that rcvd_ij(e) = c
                cp_i = e
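The rounds of this algorithm can be simulated in a single program. In this sketch every process has a list of events, each holding the cumulative number of messages sent to and received from every peer, and event 0 has all counters at zero. The rollback messages of a round are modeled as a table computed at the start of the round. The names `recover`, `ev`, and the event layout are assumptions for illustration.

```python
from typing import Dict, List

Event = Dict[str, Dict[str, int]]  # {"sent": {peer: n}, "rcvd": {peer: n}}

def ev(sent=None, rcvd=None) -> Event:
    return {"sent": sent or {}, "rcvd": rcvd or {}}

def recover(history: Dict[str, List[Event]], start: Dict[str, int]) -> Dict[str, int]:
    """Return, for each process, the index of the event it rolls back to."""
    cp = dict(start)
    procs = list(history)
    for _ in range(len(procs)):                       # N rounds
        # rollback(i, sent_ij(cp_i)) messages of this round
        announced = {(i, j): history[i][cp[i]]["sent"].get(j, 0)
                     for i in procs for j in procs if i != j}
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                c = announced[(j, i)]
                # step back to the latest event where i has received c from j
                while history[i][cp[i]]["rcvd"].get(j, 0) > c:
                    cp[i] -= 1
    return cp

# X sends to Y, then Y sends to Z; X fails with only its event 0 logged.
history = {
    "X": [ev(), ev(sent={"Y": 1})],
    "Y": [ev(), ev(rcvd={"X": 1}), ev(sent={"Z": 1}, rcvd={"X": 1})],
    "Z": [ev(), ev(rcvd={"Y": 1})],
}
result = recover(history, {"X": 0, "Y": 2, "Z": 1})
print(result)
```

Y rolls back in the first round and Z only in the second, after Y has announced its reduced send count, which is why the loop is repeated N times.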
73
An Example
[Figure: processes X, Y, and Z with events e_x0 to e_x3, e_y0 to e_y3, and e_z0 to e_z2, checkpoints x_1, y_1, z_1, and a failure]
