This document discusses concepts related to fault tolerance in distributed systems. It defines key terms like availability, reliability, safety, and maintainability. It describes different types of failures like crashes, omissions, and Byzantine failures. It discusses approaches to achieving fault tolerance through redundancy, process replication, and atomic multicast. It also covers issues like the two generals problem, RPC semantics, and reliable multicasting.
(Fault-tolerance) Guohong Cao, Department of Computer Science & Engineering, The Pennsylvania State University, http://www.cse.psu.edu/~gcao

2 Basic Concepts
Fault tolerance is related to dependability, which covers:
Availability: the probability that the system is operating correctly at any given moment.
Reliability: the property that a system can run continuously without failure; it is defined over a time interval. A system that goes down 1 ms every hour has an availability of about 99.9999%, but it is not reliable.
Safety: when a system temporarily fails to operate correctly, nothing catastrophic happens; e.g., in nuclear power plants.
Maintainability: how easily a failed system can be repaired. High maintainability implies high availability.

3 Basic Concepts
A system is said to fail when it cannot meet its specifications. An error is the part of a system's state that may lead to a failure. The cause of an error is called a fault; for example, a bad transmission medium may cause transmission errors.
A transient fault occurs once and then disappears. An intermittent fault occurs, vanishes, then reappears. A permanent fault continues to exist.
Fault tolerance (a dependable system): the system can provide its service even in the presence of faults.

4 Failure Models
Type of failure and description:
Crash (fail-stop) failure: a server halts, but was working correctly until it halted.
Omission failure: a server fails to respond to incoming requests. Receive omission: a server fails to receive incoming messages. Send omission: a server fails to send messages.
Timing failure: a server's response lies outside the specified time interval.
Response failure: the server's response is incorrect. Value failure: the value of the response is wrong. State transition failure: the server deviates from the correct flow of control.
Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.
Other failure model concepts: process failure, which generates incorrect results (e.g., deadlock, protection fault, divide by zero); software or hardware fault; network partition.

5 Fault-Tolerance with Redundancy
Information redundancy: extra bits are added to allow recovery from bit errors, e.g., Hamming code.
Time redundancy: rollback recovery; abort a transaction and restart it.
Physical redundancy: triple modular redundancy (TMR).

6 Process Resilience
The key approach to tolerating a faulty process is to organize several identical processes into a group.
Group management issues: join, leave. A group can be flat or hierarchical. A flat group doesn't have a single point of failure, but its decision making is complicated.

7 Failure Masking and Replication
Process groups are used to replace a single process to achieve fault tolerance. Two approaches are used:
Primary-based: the primary can be replaced in case of failure.
Replicated-write: in the form of active replication or quorum-based protocols.
In a fail-stop model, k+1 processes are enough to be k-fault-tolerant, i.e., the group can survive k faults. In a Byzantine failure model, 2k+1 are needed. In practice, it is difficult to guarantee that at most k processes will fail, so statistical analysis is needed.
Atomic multicast is needed to make sure all requests arrive at all servers in the same order.
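To make the k-fault-tolerance counts on slide 7 concrete, here is a small illustrative sketch (not from the slides; the function name and error handling are mine) of masking faults by majority voting over a replica group's replies:

```python
from collections import Counter

def masked_result(replies, k):
    """Return a result that is safe despite up to k faulty replicas.

    With fail-stop replicas, any single reply is correct, so k+1 replicas
    suffice (at least one survives). With Byzantine replicas, we need a
    value reported by at least k+1 replicas, which is guaranteed to exist
    only if the group has 2k+1 members.
    """
    value, count = Counter(replies).most_common(1)[0]
    if count >= k + 1:        # more supporters than there can be faulty replicas
        return value
    raise RuntimeError("no value backed by k+1 replicas; cannot mask faults")

# Example: 2k+1 = 5 replicas, of which k = 2 Byzantine replicas lie.
print(masked_result([42, 42, 42, 7, 99], k=2))   # -> 42
```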
8 The Two-Army Problem
Without failures, agreement is trivial, but agreement in faulty systems is much harder. Agreement is needed in electing a coordinator, deciding whether to commit or not, and synchronization.
There are two generals of the same army who have encamped a short distance apart. Their objective is to capture a hill, which is possible only if they attack simultaneously. If only one general attacks, he will be defeated. The two generals can only communicate by sending messengers, which is not reliable. Is it possible for them to attack simultaneously?

9 Impossibility Proof
Assume there is a protocol that sends messengers a fixed number of times to solve the problem, and let P be the shortest such protocol. Suppose the last messenger in P does not reach its destination. Then either the message carried by that messenger is useless, or one of the generals does not get a needed message. Since P is the minimum-length protocol by assumption, the lost message was not useless, and hence one of the generals will not attack. A contradiction.

10 Byzantine Generals Problem
In this problem, the enemy is still on the hill, but n generals are around the enemy. Reliable communication can be achieved pairwise. m of the generals are traitors (faulty) and are trying to prevent the loyal generals from reaching agreement by giving wrong information. Can the loyal generals reach agreement?
To reach agreement in a system with m faulty processes, at least 3m+1 processes are needed.

12 Reliable Client-Server Communication
A communication channel may exhibit crash, omission, timing, and arbitrary failures.
Point-to-point communication: TCP can mask omission failures, but not crash failures, where the connection is broken. One way to deal with this is to raise an exception, or to automatically set up a new connection.
RPC semantics in the presence of failures:
The client is unable to locate the server.
The request message from the client to the server is lost.
The server crashes after receiving a request.
The reply message from the server is lost.
The client crashes after sending a request.

13 RPC Semantics
Client cannot locate the server. Solution: raise an exception. This violates transparency, since the programmer needs to differentiate an RPC from a local procedure call.
Lost request message: when the timer expires before an ack arrives, resend.
Server crashes lead to different semantics: at least once, at most once, guarantee nothing, and exactly once.

14 Exactly-once semantics is impossible to achieve. Consider a print server: it has two strategies, either send the completion message just before it actually tells the printer to do its work, or after the text has been printed. The client has four options: always resend the request, never resend, resend only when ACKed, resend only when not ACKed. (M: send the completion message, C: crash, P: print the text.) No combination of client and server strategies works correctly in all crash scenarios.

15 RPC Semantics
Lost reply messages: the client cannot tell whether the server is very slow or the request (or reply) is lost. Sending another request may cause problems, e.g., when transferring money. One can use sequence numbers, but then the server needs to track all clients. Some operations are safe, e.g., reading the first 1024 bytes of a file. A request that can be executed multiple times without any side effect is said to be idempotent.

16 Client Crashes
After the client crashes, the server may still process the request and the computation remains active, but no parent is waiting for the result. Such a computation is called an orphan. Orphans waste CPU cycles, and if the client reboots, the reply from the orphan may come back. Solutions:
Extermination: make a log before sending the RPC. After a reboot, check the log and kill the orphan. Disadvantages: disk writes, grandorphans.
Reincarnation: divide time into numbered epochs. When the client reboots, it broadcasts the start of a new epoch, and remote computations on behalf of that client are killed.
Expiration: define a time quantum T for each RPC; if the RPC cannot finish within T, it must ask for another quantum. The client waits T after a reboot so that no orphans exist.
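The duplicate-request and orphan issues above are usually attacked with duplicate filtering on the server. The following is a minimal illustrative sketch (not the slides' protocol; class and method names are assumptions) of at-most-once handling: the client may retry after a timeout, and the server replays a cached reply instead of re-executing a non-idempotent operation.

```python
# Minimal sketch, not from the slides: at-most-once request handling.
# The client retries on timeout (at-least-once sending); the server filters
# duplicates by (client_id, request_id), so a non-idempotent operation such
# as "transfer money" runs at most once. All names here are illustrative.

class Server:
    def __init__(self):
        self.replies = {}                      # (client, req_id) -> cached reply
        self.balance = 100

    def transfer(self, client, req_id, amount):
        key = (client, req_id)
        if key in self.replies:                # duplicate caused by a client retry
            return self.replies[key]           # replay cached reply, no re-execution
        self.balance -= amount                 # the side-effecting operation itself
        reply = f"ok, new balance {self.balance}"
        self.replies[key] = reply
        return reply

srv = Server()
print(srv.transfer("client-1", 7, 10))   # executes the transfer
print(srv.transfer("client-1", 7, 10))   # retry of the same request: replayed, not re-executed
```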
17 Basic Reliable Multicasting

18 Scalability in Reliable Multicasting
If there are N receivers, the sender gets N ACKs, which results in ACK implosion. Solution: use NACKs. Problem?

19 Feedback Suppression
Feedback suppression has good scalability, but it also has some problems: accurate scheduling of the NACKs at the receivers, and interrupts to processes that received the messages correctly.
Solutions: let receivers that missed message m join a new group (but this requires setting up a new group); let receivers that tend to miss the same messages team up and share the same multicast channel for feedback and retransmission; use hierarchical feedback control.

20 In hierarchical reliable multicasting, each local coordinator forwards the message to its children and later handles retransmission requests. If a coordinator missed a message, it asks the coordinator of the parent subgroup to retransmit it. The problem is how to build the tree dynamically.

21 Atomic Multicast
In atomic multicast, a message is delivered to either all correct processes or to none at all. Also, all messages are delivered in the same order to all correct processes. This is useful in making all replicas have the same view, and a failed process can synchronize again after recovery. Note that message receipt is different from message delivery.

22 Virtual Synchrony
As the group membership changes, a view change takes place by multicasting a view-change message vc that announces the membership change. A reliable multicast is virtually synchronous if a message multicast to group view G is delivered to each nonfaulty process in G. If the sender of the message crashes during the multicast, the message is either delivered to all or to none; the protocol must deal with sender failure during the sending process. All multicasts that are in transit while a view change takes place are completed before the view change comes into effect.

23 Virtual Synchronous Multicast

24 Message Ordering
Six different versions of virtually synchronous reliable multicasting (multicast: basic message ordering / total-ordered delivery?):
Reliable multicast: no ordering / no.
FIFO multicast: FIFO-ordered delivery / no.
Causal multicast: causal-ordered delivery / no.
Atomic multicast: no ordering / yes.
FIFO atomic multicast: FIFO-ordered delivery / yes.
Causal atomic multicast: causal-ordered delivery / yes.

25 Implementing Virtual Synchrony
All messages sent to view G must be delivered to all nonfaulty processes in G before the next group membership change. If the sender does not fail, this is easy. When the sender crashes, processes should get m from somewhere else: every process in G keeps m until it knows that every member has received it. At that point m is said to be stable, and it can be delivered. A flush message is used to announce a view change. Channels are assumed to be FIFO and reliable; this approach is used in ISIS.
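To make the FIFO-ordered rows of the slide-24 table concrete, here is a small receiver-side sketch (an assumption, not the slides' protocol): each sender numbers its multicasts, and a receiver holds back out-of-order messages until the gap from that sender is filled.

```python
# Minimal sketch (assumed, not from the slides) of FIFO-ordered delivery:
# each sender stamps its multicasts with a per-sender sequence number, and a
# receiver delivers messages from that sender strictly in sequence, holding
# back anything that arrives early.
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)        # next expected seq number per sender
        self.holdback = defaultdict(dict)       # sender -> {seq: message}

    def receive(self, sender, seq, msg):
        """Called on message receipt; returns the messages delivered, in order."""
        self.holdback[sender][seq] = msg
        delivered = []
        # Deliver as long as the next expected message from this sender is available.
        while self.next_seq[sender] in self.holdback[sender]:
            delivered.append(self.holdback[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1
        return delivered

r = FifoReceiver()
print(r.receive("p1", 1, "b"))   # [] : held back, waiting for seq 0
print(r.receive("p1", 0, "a"))   # ['a', 'b'] : gap filled, both delivered in FIFO order
```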
27 Distributed Commit
Atomic multicasting is one example of distributed commit; another example is transaction commit. In a one-phase commit protocol, the coordinator simply tells all participants to perform an operation; if one participant cannot perform it, the result is inconsistent. The solution is to use a two-phase commit (2PC) protocol. On failure, cleanup must enforce all-or-nothing semantics, and if multiple sites are involved, all should agree on the same outcome.

28 Two-Phase Commit
Phase 1, at the coordinator:
1. The coordinator sends a commit_request to every cohort, requesting the cohorts to commit.
2. The coordinator waits for replies from all the cohorts.
Phase 1, at the cohorts:
1. On receiving the commit_request, if the transaction executed successfully, the cohort writes UNDO and REDO logs and sends an agree to the coordinator; otherwise, it sends an abort.
Phase 2, at the coordinator:
1. If all cohorts agree, the coordinator writes a commit record into its log and then sends a commit to all cohorts; otherwise, it sends an abort.
2. The coordinator waits for an ack from each cohort.
3. If an ack is not received from some cohort within a timeout period, it resends the commit/abort to that cohort.
4. If all the acks are received, it writes a complete record to the log.
Phase 2, at the cohorts:
1. On receiving a commit, a cohort releases the locks held for executing the transaction and sends an ack.
2. On receiving an abort, it undoes the transaction using the UNDO log, releases the locks, and sends an ack.

29 2PC With Failures
In case of message loss, resend.
Coordinator crashes before writing commit: on recovery, it broadcasts an abort; cohorts that had agreed to commit undo the transaction and abort, and the other cohorts simply undo the transaction. Note that cohorts are blocked until they receive the abort.
Coordinator crashes after writing commit but before writing complete: on recovery, it broadcasts a commit and waits for acks. Cohorts are blocked until they receive the commit.
Coordinator crashes after writing complete: nothing needs to be done on recovery.

30 2PC With Failures
If a cohort crashes in Phase 1, the coordinator will abort the transaction.
Suppose a cohort crashes in Phase 2, i.e., after writing the UNDO and REDO logs. On recovery, the cohort checks with the coordinator whether to abort or commit. Committing may require a redo operation, because the cohort may have failed before updating the database.

31 Actions taken by a participant P when residing in state READY and having contacted another participant Q: if all reachable participants are in READY, P must still wait for the coordinator.

32 Three-Phase Commit
2PC is blocking. To be non-blocking, a protocol must satisfy:
There is no single state from which it is possible to make a transition directly to either a commit or an abort state.
There is no state in which it is not possible to make a final decision, and from which a transition to a commit state can be made.

33 3PC
A participant P may block in ready or precommit. On timeout, P can conclude that the coordinator has failed.
If P contacts any other participant that is in commit (or abort), P should move to that state.
If all reachable participants are in precommit, commit. If any is in init, abort.
If all participants that P can contact are in ready (and they form a majority), abort. A process that recovers to init should abort, and a process that recovers to precommit can still abort. This differs from 2PC, where a crashed participant may recover directly to commit; in 3PC a crashed participant can only recover to init, abort, or precommit.
If all processes that P can reach are in precommit (and they form a majority), commit. Note that no process can be in init at this time.
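The following is a minimal single-process simulation of the 2PC message flow from slides 28-29 (class and method names are illustrative; real networking, stable-storage logging, timeouts, and crash recovery are omitted):

```python
# Minimal sketch of the 2PC flow: phase 1 collects votes, phase 2 broadcasts
# the decision and waits for acks. Names and data layout are illustrative.
class Cohort:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.log = []

    def on_commit_request(self):
        if self.can_commit:
            self.log.append("UNDO/REDO")        # write UNDO and REDO records
            return "agree"
        return "abort"

    def on_decision(self, decision):
        if decision == "abort" and "UNDO/REDO" in self.log:
            self.log.append("undone")           # undo using the UNDO log
        self.log.append(decision)               # record decision, release locks
        return "ack"

def coordinator(cohorts):
    votes = [c.on_commit_request() for c in cohorts]          # phase 1
    decision = "commit" if all(v == "agree" for v in votes) else "abort"
    acks = [c.on_decision(decision) for c in cohorts]         # phase 2
    assert all(a == "ack" for a in acks)
    return decision                                           # then log "complete"

print(coordinator([Cohort("A"), Cohort("B"), Cohort("C", can_commit=False)]))  # -> abort
```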
34 Replication
A common technique to provide fault tolerance is to replicate data at many sites. Commit protocols can be used to update multiple copies of the data, but they cannot tolerate failures; it is desirable that sites can continue to operate even when other sites have crashed. With voting, each replica is assigned some number of votes, and a quorum of votes must be collected before accessing the replicated data. Two variants: static voting and dynamic voting.

35 Vote Assignment
A site must acquire at least w (write quorum) votes before a write, and r (read quorum) votes before a read, where v is the total number of votes:
r + w > v
w > v/2
The first condition guarantees that every read quorum intersects every write quorum, so a read and a write cannot be performed concurrently and a read operation always gets the latest copy. The second condition ensures that any two write quorums intersect; if there is a network partition, only one group can allow writes.

36 The Voting Algorithm
Each replica of the data has a version number associated with it, initialized to 0. When site i performs a read or a write, it first broadcasts a request for votes. A site receiving the request replies with its version number and its votes (a lock may be required). After acquiring enough votes for the read/write quorum within a timeout period, site i can perform the operation.
For a read, the site checks the version numbers of all the collected replicas and uses the one with the highest version number. For a write, the site finds the replica with the highest version number and performs the write; after the write operation, it updates all copies in the write quorum.

37 Performance
The values of r and w affect system performance. For example, suppose sites N1, N2, and N3 have 1 vote each and N4 has 2 votes. With r=1 and w=5, any site failure makes writes unavailable. With r=3 and w=3, a write operation can tolerate some site failures, but a read operation may not be efficient; for example, a read may need to collect more remote votes, which increases the access delay. Assigning more votes to a highly reliable site makes the system more reliable.

38 Handling Failures
Under site failures only: read and write operations can be performed if the live sites hold at least w votes. If the number of live votes is less than w but at least r, only reads can be performed; otherwise, even reads cannot be done.
Under a network partition, there are three scenarios: one group has both a read and a write quorum, and all others have neither; some groups have a read quorum, but no group has a write quorum; no group has even a read quorum.

39 An Example
Consider a system with six sites A, B, C, D, E, and F, each with one vote. Let w=4 and r=3. If only site B fails, operations can still be performed. If the network is divided into two groups {ABCD} and {EF}, both reads and writes can be performed in the first group, but no operation in the second group. If the network is divided into {ABC} and {DEF}, only reads can be performed. If the network is divided into {AB}, {CD}, and {EF}, no operation can be performed.
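The static voting scheme of slides 35-36 can be sketched as follows (an illustrative simplification with instantaneous replies and no locking; the names are mine, not the slides'):

```python
# Minimal sketch of quorum-based reads and writes with version numbers.
class Replica:
    def __init__(self, votes=1):
        self.votes, self.version, self.value = votes, 0, None

def quorum_ok(r, w, replicas):
    v = sum(rep.votes for rep in replicas)
    return r + w > v and w > v / 2           # read/write and write/write quorums intersect

def collect_votes(replicas, needed):
    group, total = [], 0
    for rep in replicas:                      # in reality, only live sites reply
        group.append(rep)
        total += rep.votes
        if total >= needed:
            return group
    raise RuntimeError("quorum not available")

def read(replicas, r):
    group = collect_votes(replicas, r)                       # any subset holding >= r votes
    return max(group, key=lambda rep: rep.version).value     # use the highest version number

def write(replicas, w, value):
    group = collect_votes(replicas, w)                       # any subset holding >= w votes
    new_version = max(rep.version for rep in group) + 1
    for rep in group:                                        # update every copy in the write quorum
        rep.version, rep.value = new_version, value

reps = [Replica() for _ in range(5)]
assert quorum_ok(r=3, w=3, replicas=reps)
write(reps, w=3, value="x1")
print(read(reps, r=3))                       # -> x1
```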
40 Dynamic Voting Protocols
Network partitions or site failures may make read and write operations unavailable. Dynamic voting addresses this by adapting the number of votes or the set of sites that can form a quorum.
Majority-based approach: the set of sites that can form a majority to allow access to the replicated data changes dynamically.
Dynamic vote reassignment: the number of votes assigned to a site changes dynamically.

41 An Example
[Figure: a partition history. ABCDE splits into ABD and CE; ABD splits into AB and D; AB splits into A and B; A later merges with C and E to form ACE.]
Assume each site has one vote. With static majority voting, only ABCDE, ABD, and ACE allow updates. With the dynamic voting protocol, ABCDE, ABD, AB, A, and ACE all allow updates.

42 Definitions
Version number (VN_i): counts the number of updates to the replica at site i; initialized to 0.
Number of replicas updated (RU_i): the number of replicas that participated in the most recent update; initialized to the total number of replicas.
Distinguished site list (DS_i): stores the ids of one or more sites. When RU_i is even, DS_i identifies the replica with the largest id number. When RU_i is odd, DS_i is empty, except when RU_i = 3, in which case DS_i lists the three replicas.

43 An Example
Initially (all five replicas up to date):
       A    B    C    D    E
VN     3    3    3    3    3
RU     5    5    5    5    5
DS     -    -    -    -    -
After an update performed in partition {A,B,C}:
       A    B    C    D    E
VN     4    4    4    3    3
RU     3    3    3    5    5
DS    abc  abc  abc   -    -
After an update performed in partition {B,C}:
       A    B    C    D    E
VN     4    5    5    3    3
RU     3    3    3    5    5
DS    abc  abc  abc   -    -
After an update performed in partition {B,C,D,E}:
       A    B    C    D    E
VN     4    6    6    6    6
RU     3    4    4    4    4
DS    abc   b    b    b    b
After another update performed in partition {B,C}:
       A    B    C    D    E
VN     4    7    7    6    6
RU     3    2    2    4    4
DS    abc   b    b    b    b

44 The Protocol
1. Site i issues a lock_request to its local lock manager. If the lock is granted, i sends a vote_request to all the sites.
2. When a site j receives the vote_request, it obtains its locks and sends the values of VN_j, RU_j, and DS_j to site i.
3. From the responses, site i decides whether it belongs to the distinguished partition (described below).
4. If i does not belong to the distinguished partition, it releases its locks and sends abort to the sites that responded, which also release their locks.
5. If i belongs to the distinguished partition, it obtains the current copy of the data and updates it. Site i also executes the update protocol, then sends a commit to all participating sites, asking them to update the data and their VN_j, RU_j, and DS_j. It then releases its locks.
6. When a site j receives the commit message, it updates the data and VN_j, RU_j, and DS_j, then releases its locks.

45 Distinguished Partition
Notation:
P: the set of responding sites.
M: the most recent version number in the partition.
Q: the set of sites holding version M.
N: the number of sites that participated in the latest update, as indicated by version number M.
We have M = max{VN_j : j in P}, Q = {j in P : VN_j = M}, and N = RU_j for any j in Q.

46 Distinguished Partition
If Cardinality(Q) > N/2, site i is a member of the distinguished partition.
Else if Cardinality(Q) = N/2, break the tie by selecting a site j in Q: if the site listed in DS_j is in Q, then i belongs to the distinguished partition.
Otherwise, if N = 3 and P contains two or all three of the sites listed in the DS variable of a site in Q, then i belongs to the distinguished partition.
Otherwise, i does not belong to the distinguished partition.

47 Update Protocol
The update protocol is invoked when a site is ready to commit. For site i:
VN_i = M + 1
RU_i = Cardinality(P)
DS_i is updated as follows when N is not 3 (static voting is used when N = 3): DS_i = K if RU_i is even, where K is the site with the highest id; DS_i = P if RU_i = 3 (otherwise DS_i is empty, per the definition on slide 42).
Note that the protocol may deadlock because it uses locks.

48 Dynamic Vote Reassignment
Change the votes of the sites in the majority partition so that the loss of sites is properly compensated and further partitions can be handled.
Overthrow technique: after a partition or a failure, for each site x outside the majority group, one site inside the group (say a) takes over the votes of x; for example, v(a) is increased by 2v(x). Side effect: some sites become more powerful.
Alliance technique: the votes are distributed evenly among all nodes.

49 An Example
Suppose v(a) = 6 and v(b) = v(c) = v(d) = 5. The total is 21 votes, so the majority is 11. Assume that site a is disconnected, and {b,c,d} becomes the majority group with 15 votes.
Using the overthrow technique, assume b gets the extra votes: v(a) = 6, v(b) = 17, v(c) = v(d) = 5. The total is now 33 votes, so the majority is 17.
With the alliance technique, the final votes are: v(a) = 6, v(b) = v(c) = v(d) = 11. The total is 39 votes, so the majority is 20.
Now suppose site c also disconnects. Without vote reassignment, group {b,d} (10 of 21 votes) is not a majority; with reassignment, it is.
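The distinguished-partition test of slides 45-46 can be sketched as follows (an assumed encoding of the rules, not the authors' code; sites are single letters and DS is stored as a string):

```python
# Minimal sketch of the slide-45/46 distinguished-partition test.
def is_distinguished(responses):
    """responses: dict site -> (VN, RU, DS) from all sites that replied (the set P)."""
    P = set(responses)
    M = max(vn for vn, _, _ in responses.values())                   # latest version in the partition
    Q = {j for j, (vn, _, _) in responses.items() if vn == M}        # sites holding version M
    N = responses[next(iter(Q))][1]                                   # RU_j for any j in Q
    ds = responses[next(iter(Q))][2]                                  # DS_j for a site in Q
    if len(Q) > N / 2:
        return True
    if len(Q) == N / 2:                                               # tie: check the distinguished site
        return set(ds) <= Q if ds else False
    if N == 3:                                                        # static voting among the 3 sites in DS
        return len(set(ds) & P) >= 2
    return False

# Example: after the update by {A,B,C} (slide 43), partition {A,B} is distinguished
# (2 > 3/2), while partition {D,E} (still at the old version, RU = 5) is not (2 <= 5/2).
print(is_distinguished({"A": (4, 3, "ABC"), "B": (4, 3, "ABC")}))   # True
print(is_distinguished({"D": (3, 5, ""), "E": (3, 5, "")}))          # False
```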
50 Recovery
Two basic approaches are used to recover from an erroneous state to a correct state.
Backward recovery: periodically store error-free states (recovery points) in stable storage; on detecting an error, restore the system to the most recent recovery point.
Forward recovery: on detecting an error, take steps to move the system into a new correct state.
Disadvantages of backward recovery: the overhead of checkpointing; errors may occur again while recovering; there may be no safe recovery point. Advantages: it does not require anticipating all fault types, and it is widely implemented in most systems.

51 Operation-Based Approach
Keep a log or audit trail for each transaction.
Update-in-place: every update to an object results in a log record written to stable storage with enough information to completely undo and redo the operation. A DO operation updates the data and logs the state of the object before and after the operation on stable storage. If a transaction commits, nothing more needs to be done; if the transaction aborts, the old value is restored (UNDO); on a failure, UNDO and REDO are applied as needed.
Problem: the system may crash before the log is flushed to stable storage.

52 Write-Ahead Logging
The write-ahead log protocol follows two rules: update an object only after its UNDO log record has been recorded, and record the REDO and UNDO logs before committing the updates. On recovery, UNDO the effects of uncommitted transactions; on restart, REDO the necessary operations. If failures are rare, the logging overhead is very expensive in terms of storage requirements and CPU time.

53 State-Based Approach
In the state-based approach, the complete state of a process is saved when a recovery point is established. The process of saving states on stable storage is called checkpointing or taking a checkpoint; the recovery point at which the checkpointing occurs is called a checkpoint; the process of restoring a process to a prior state is called rolling back.
Shadow paging: whenever a process wants to modify an object, the page containing the object is duplicated and maintained on stable storage. The process updates only one copy; the unmodified copy is called the shadow page. If the process fails, the modified copy is discarded and the shadow page is used; if the process commits, the shadow page is discarded.

54 Orphan Messages and the Domino Effect
An orphan message is a message whose receiving event is recorded in a checkpoint, but whose sending event is not. Domino effect: one rollback causes others to roll back, e.g., when Y goes back to checkpoint y2.
[Figure: space-time diagram of processes X, Y, Z with checkpoints x1-x3, y1-y2, z1-z3 and a message m, illustrating cascading rollbacks.]

55 Lost Messages
A lost message is a message whose sending event is recorded, but whose receiving event is not. If Y fails and rolls back to y1, then X is in a state where it has sent m, but Y will never receive it. This situation can also happen if the communication channel is not reliable.
[Figure: X sends m to Y after checkpoint x1; Y fails after checkpoint y1.]

56 Livelock
Livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress.
[Figure: two rounds of rollback between processes X and Y, with checkpoints x1 and y1 and messages m1, n1, m2, n2.]

57 Consistent Checkpoints
A process saves its local state on stable storage; this is called a local checkpoint, and the act of saving it is called local checkpointing. The local checkpoints, one from each site, collectively form a global checkpoint. A global checkpoint is a strongly consistent set of checkpoints if there is no orphan message and no lost message; it is a consistent set of checkpoints if there is no orphan message.
If every process takes a checkpoint after sending every message, the set of the most recent checkpoints is always consistent, but the overhead is high. How about taking a checkpoint after every k (k > 1) messages sent?
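The consistency condition of slide 57 (no orphan messages) can be checked mechanically if each checkpoint records per-peer send and receive counts; a minimal sketch under that assumed representation:

```python
# Minimal sketch (assumed data layout): a global checkpoint is consistent if no
# process has recorded receiving more messages from a peer than that peer has
# recorded sending to it (i.e., there is no orphan message).
def consistent(sent, rcvd):
    """sent[i][j]: number of messages i recorded as sent to j in its checkpoint.
    rcvd[i][j]: number of messages i recorded as received from j in its checkpoint."""
    procs = sent.keys()
    return all(rcvd[i][j] <= sent[j][i] for i in procs for j in procs if i != j)

sent = {"X": {"X": 0, "Y": 1}, "Y": {"X": 0, "Y": 0}}
rcvd = {"X": {"X": 0, "Y": 0}, "Y": {"X": 2, "Y": 0}}   # Y recorded 2 receives, X recorded only 1 send
print(consistent(sent, rcvd))    # False: the second message from X is an orphan
```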
58 The System Model
Processes communicate by message passing, and channels are FIFO. End-to-end protocols are assumed to cope with message loss due to rollback recovery and communication failures; another way to handle message loss is to have processes log messages before each send. Communication failures do not partition the network.

59 Coordinated Checkpointing
There are two kinds of checkpoints: tentative and permanent.
First phase: the initiator requests all processes to take tentative checkpoints. Each process informs the initiator whether it is willing to take a tentative checkpoint; a process will not say yes if there is a failure or some other reason.
Second phase: if the initiator receives yes from all processes, it requests them to make their tentative checkpoints permanent; otherwise it asks them to discard the tentative checkpoints.

60 An Optimization
Sometimes it is not necessary to ask all processes to take checkpoints in each checkpointing initiation.
[Figure: space-time diagram of processes X, Y, Z with checkpoints x1-x2, y1-y2, z1-z2 and a message m.]

61 Each message is tagged with a monotonically increasing sequence number.
Let m be the last message that X received from Y after X took its last checkpoint; then last_rcvd_x[y] = m.l (the sequence number of m) if such an m exists, and ⊥ otherwise.
Let m be the first message that X sent to Y after X took its last checkpoint; then first_sent_x[y] = m.l if such an m exists, and ⊥ otherwise.
Whenever X requests Y to take a tentative checkpoint, X sends last_rcvd_x[y] along with its request; Y takes a checkpoint only if last_rcvd_x[y] >= first_sent_y[x] > ⊥.
cohort_x = {y | last_rcvd_x[y] > ⊥}.
Initial state at every process p: for all processes q, first_sent_p[q] = ⊥; OK_cp_p = yes if p is willing to take a checkpoint, no otherwise.

62 At the initiator process P_i:
  send take_tentative(P_i, last_rcvd_Pi[p]) to each process p in cohort_Pi;
  if all processes replied yes then send make_permanent to each process p in cohort_Pi
  else send abort to all processes p in cohort_Pi.
At every process p, upon receiving take_tentative(q, last_rcvd_q[p]) from q:
  if OK_cp_p = yes and last_rcvd_q[p] >= first_sent_p[q] > ⊥ then
    take a tentative checkpoint;
    send take_tentative(p, last_rcvd_p[r]) to each process r in cohort_p;
    if all processes r in cohort_p replied yes then OK_cp_p = yes else OK_cp_p = no;
  send (p, OK_cp_p) to q.

63 Algorithm (continued)
Upon receiving make_permanent: make the tentative checkpoint permanent; send make_permanent to all processes r in cohort_p.
Upon receiving abort: undo the tentative checkpoint; send abort to all processes r in cohort_p.

64 Recovery
First phase: the initiator requests all processes to restart from their previous checkpoints. Each process informs the initiator whether it is willing to restart; a process will not say yes if it is already participating in a checkpointing or recovery process initiated by some other process.
Second phase: if the initiator learns that all processes are willing to restart, it asks them to roll back; otherwise, processes continue their normal activities. The initiator propagates its decision to all processes, which act accordingly.
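The slide-61 condition that decides whether a cohort must take a tentative checkpoint reduces to a single comparison; a tiny sketch (with an integer standing in for the ⊥ value, which is an assumption of this sketch):

```python
# Minimal sketch of the slide-61 test: when X asks Y to take a tentative checkpoint,
# Y needs one only if Y has sent X a message since Y's last checkpoint and X has
# received it since X's last checkpoint.
BOTTOM = -1   # stands in for the "no such message" value written ⊥ on the slides

def y_must_checkpoint(last_rcvd_x_from_y, first_sent_y_to_x):
    return last_rcvd_x_from_y >= first_sent_y_to_x > BOTTOM

print(y_must_checkpoint(last_rcvd_x_from_y=5, first_sent_y_to_x=3))       # True
print(y_must_checkpoint(last_rcvd_x_from_y=BOTTOM, first_sent_y_to_x=3))  # False: X received nothing new
print(y_must_checkpoint(last_rcvd_x_from_y=5, first_sent_y_to_x=BOTTOM))  # False: Y sent nothing new
```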
65 An Optimization
Sometimes it is not necessary to ask every process to roll back.
[Figure: space-time diagram of processes X, Y, Z with checkpoints x1-x2, y1-y2, z1-z2, a message m, and a failure at Y.]

66 Data Structure
Let m be the last message that X sent to Y before X took its last checkpoint; then last_sent_x[y] = m.l if such an m exists, and ⊥ otherwise.
When X requests Y to restart from its previous checkpoint, it sends last_sent_x[y] along with the request; Y restarts from its previous checkpoint only if last_rcvd_y[x] > last_sent_x[y].
cohort_x = {y | x can send messages to y}.
Initial state at every process p: resume_execution_p = true; for each process q, last_rcvd_p[q] = ⊤; OK_roll_p = yes if p is willing to roll back, no otherwise.

67 At the initiator process P_i:
  send prepare_roll(P_i, last_sent_Pi[p]) to each process p in cohort_Pi;
  if all processes replied yes then send rollback to each process p in cohort_Pi
  else send abort to all processes p in cohort_Pi.
At every process p, upon receiving prepare_roll(q, last_sent_q[p]) from q:
  if OK_roll_p = yes and last_rcvd_p[q] > last_sent_q[p] and resume_execution_p then
    resume_execution_p = false;
    send prepare_roll(p, last_sent_p[r]) to each process r in cohort_p;
    if all processes r in cohort_p replied yes then OK_roll_p = yes else OK_roll_p = no;
  send (p, OK_roll_p) to q.

68 Algorithm (continued)
Upon receiving rollback: if resume_execution_p == false, restart from p's last checkpoint; send rollback to all processes r in cohort_p.
Upon receiving abort: resume execution; send abort to all processes r in cohort_p.

69 Asynchronous Checkpointing
Under the asynchronous approach, checkpoints at each process are taken independently, without synchronization. This removes the synchronization overhead of the coordinated approach, but may result in domino effects. One solution is to log incoming messages.
Pessimistic message logging: an incoming message is logged before it is processed. This has too much overhead and slows down the computation.
Optimistic message logging: processes continue the computation and store incoming messages in memory, to be logged at certain intervals. This causes more rollbacks after a failure and may still have domino effects.

70 Computation Model
The communication channels are reliable, FIFO, and have infinite buffers. The message communication delay is arbitrary but finite. The underlying computation is assumed to be event-driven: a process waits until a message is received, processes the message, changes its state, and sends messages to other processes.

71 Notations
rcvd_ij(cp_i) is the number of messages received by process i from process j, as recorded in checkpoint cp_i. sent_ij(cp_i) is the number of messages sent by process i to process j, as recorded in checkpoint cp_i.
[Figure: event diagram of processes X, Y, Z with events ex0-ex2, ey0-ey3, ez0-ez3.]

72 The Algorithm
At process i:
(a) If i is a process that is recovering after a failure, then cp_i = the latest event logged in stable storage; otherwise cp_i = the latest event that took place at i.
(b) For k = 1 to N do:
  send rollback(i, sent_ij(cp_i)) to every other process j;
  wait for rollback messages from every other process;
  for every rollback(j, c) message received from another process j, i does the following:
    if rcvd_ij(cp_i) > c then find the latest event e such that rcvd_ij(e) = c, and set cp_i = e.

73 An Example
[Figure: event diagram of processes X, Y, Z with events ex0-ex3, ey0-ey3, ez0-ez2, a failure at X after ex3, and points x1, y1, z1.]
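Finally, the slide-72 asynchronous recovery rule can be simulated in a single process; the following sketch assumes each candidate recovery point records per-peer send and receive counts (the data layout and names are mine, not the slides'):

```python
# Minimal sketch of the slide-72 rule: a process rolls its recovery point back
# until it has not received more messages from any peer than that peer remembers
# sending in its own recovery point.
def recover(procs, events):
    """events[i]: candidate recovery points for process i, oldest first; each point
    is a dict {"sent": {j: count}, "rcvd": {j: count}}. Returns the chosen index per process."""
    cp = {i: len(events[i]) - 1 for i in procs}          # start from the latest event/checkpoint
    for _ in range(len(procs)):                          # N iterations, as in step (b)
        sent = {i: events[i][cp[i]]["sent"] for i in procs}   # broadcast rollback(i, sent_ij(cp_i))
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                c = sent[j].get(i, 0)
                # while rcvd_ij(cp_i) > c: move cp_i to the latest event e with rcvd_ij(e) <= c
                while events[i][cp[i]]["rcvd"].get(j, 0) > c:
                    cp[i] -= 1
    return cp

# Two processes: Y has recorded 2 receives from X, but X's surviving state shows only 1 send.
events = {
    "X": [{"sent": {"Y": 0}, "rcvd": {}}, {"sent": {"Y": 1}, "rcvd": {}}],
    "Y": [{"sent": {}, "rcvd": {"X": 0}}, {"sent": {}, "rcvd": {"X": 1}}, {"sent": {}, "rcvd": {"X": 2}}],
}
print(recover(["X", "Y"], events))   # {'X': 1, 'Y': 1}: Y rolls back past the orphan receive
```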