
Fault Tolerance (I)

Topics
 Basic concepts
 Physical Redundancy
 Information Redundancy
 Timing Redundancy
 RAID
Readings
 Tanenbaum: 7.1, 7.2
Introduction
 A characteristic feature of distributed
systems that distinguishes them from
single-machine systems is the notion of
partial failure
 A partial failure may happen when one
component in a distributed system fails.
 This failure may affect the proper
operation of other components, while at
the same time leaving yet other
components unaffected.
Introduction
 An important goal in design is to construct
the system in such a way that it can
automatically recover from partial failures
without seriously affecting the overall
performance.
 The distributed system should continue to
operate in an acceptable way while repairs
are being made.
By the way….
 Computing systems are not very reliable
 OS crashes frequently (Windows), buggy
software, unreliable hardware,
software/hardware incompatibilities
 Until recently: computer users were “tech
savvy”
• Could depend on users to reboot, troubleshoot
problems
By the way….
 Computing systems are not very reliable
(cont)
 Growing popularity of Internet/World Wide
Web
• “Novice” users
• Need to build more reliable/dependable systems
 Example: what if your TV (or car) broke down
every day?
• Users don’t want to “restart” TV or fix it (by opening
it up)
 Need to make computing systems more
reliable
Characterizing Dependable
Systems
 Dependable systems are characterized by
 Availability
• This refers to the percentage of time the system
is immediately available for use.
 Reliability
• Measured by the mean time to failure (MTTF),
i.e., the average time the system runs before
failing.
 Safety
• How serious is the impact of a failure
 Maintainability
• How long does it take to repair the system
 Security
Characterizing Dependable
Systems
 Availability and reliability are not the same
thing.
 If a system goes down for a millisecond
every hour, it has an availability of over
99.9999 percent, but it is still highly
unreliable.
 A system that never crashes but is shut
down for two weeks every August has high
reliability but only 96 percent availability.
Definitions
 A system fails when it does not perform
according to its specification.
 An error is part of a system state that
may lead to a failure.
 A fault is the cause of an error.
Definitions
 Types of Faults
 Transient
• Occur once and then disappear.
• If the operation is repeated, the fault goes away.
• Example: Bird flying through the beam of a microwave
transmitter may cause lost bits on some network (not
to mention a roasted bird).
Definitions
 Types of Faults (continued)
 Intermittent
• Occurs and then vanishes of its own accord, then
reappears, etc;
• A loose connector will often cause an intermittent
fault.
 Permanent
• Continues to exist until the faulty component is
repaired.
• Burnt-out chips, software bugs, and disk head
crashes.
 A fault tolerant system does not fail in
the presence of faults.
Server Failure Models

Type of failure            Description
-------------------------  -----------------------------------------------------------
Crash failure              A server halts, but is working correctly until it halts
Omission failure           A server fails to respond to incoming requests
  Receive omission         A server fails to receive incoming messages
  Send omission            A server fails to send messages
Timing failure             A server's response lies outside the specified time interval
Response failure           The server's response is incorrect
  Value failure            The value of the response is wrong
  State transition failure The server deviates from the correct flow of control
Arbitrary failure          A server may produce arbitrary responses at arbitrary times

Server Failure Models
 Crash Failure (fail-stop)
 A server halts, but is working correctly until it
halts.
 Example: An OS that comes to a grinding halt
and for which there is only one solution: reboot
Server Failure Models
 Omission Failure
 This occurs when a server fails to respond to
incoming requests or fails to receive incoming
messages or fails to send messages.
 There are many reasons for an omission failure
including:
• The connection between a client and a server has
been correctly established, but there was no thread
to listen to incoming requests.
• A send buffer overflows; the server must be
prepared for the client to reissue its previous
request.
• An infinite loop where each iteration causes a forked
process.
Server Failure Models
 Timing Failures
 A server’s response lies outside the specified
time interval.
 An e-commerce site may state that the
response to a user should be no more than 5
seconds (actually this is too long).
 In a video-on-demand application, the client is
to receive frames at 25 frames per second give
or take 2 frames.
 Timing failures are very difficult to deal with.
Server Failure Models
 Response Failure
 A server’s response is incorrect: either a wrong
reply to a request is returned, or the server
reacts unexpectedly to an incoming request.
 Example: A search engine that systematically
returns web pages not related to any of the
used search terms.
 Example: A server receives a message that it
cannot recognize.
Server Failure Models
 Arbitrary (Byzantine) Failures
 Arbitrary failures occur when a server produces
output it should never have produced, but which
cannot be detected as being incorrect.
 A faulty server may even be maliciously working
together with other servers to produce
intentionally wrong answers.
Server Failure Models
 Ideally, we want fail-stop processes.
 A fail-stop process will simply stop producing
output in such a way that its halting can be
detected by other processes.
 The server may be so friendly as to announce it
is about to crash.
 The reality is that processes are not that
friendly.
 We rely on other processes to detect the
failure.
Server Failure Models
 Problem: How to tell the difference
between a process that has halted and a
process that is just slow.
 Timeouts are great, but theoretically you cannot
place an exact bound on when to expect a
response.
 If the timeout interval is set too high, you are
delaying the system from reacting to the failure.
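
To make this concrete, here is a minimal Python sketch of a timeout-based failure detector; the class name, the heartbeat scheme, and the 5-second default are illustrative assumptions, not a prescribed design.

```python
import time

class HeartbeatDetector:
    """Suspect a process when no heartbeat has arrived within `timeout`
    seconds. A suspicion can be wrong: the process may merely be slow."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, pid):
        # Called whenever a heartbeat message from process `pid` arrives.
        self.last_seen[pid] = time.monotonic()

    def suspected(self, pid):
        # True if `pid` has never reported, or has been silent too long.
        last = self.last_seen.get(pid)
        return last is None or time.monotonic() - last > self.timeout
```

Note that `suspected` can only ever return a suspicion: the same silence is produced by a crashed process and by a slow one.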
Failure Masking by Redundancy
 If a system is to be fault tolerant, the
best it can do is to try to hide the
occurrence of failures from other
processes.
 Key technique: Use redundancy
 Types of redundancy
 Information redundancy
 Physical redundancy
 Time redundancy
Physical Redundancy
 Extra equipment or processes are added to
make it possible for the system as a whole
to tolerate the loss or malfunctioning of
some components.
 Physical redundancy can thus be done in
either hardware or in software.
 Examples in hardware:
 Aircraft: 747’s have 4 engines but fly on 3.
 Space shuttle: Has 5 computers
 Electronic circuits
Physical Redundancy
 Triple modular redundancy.
Physical Redundancy
 For electronic circuits, each device is replicated
three times.
 Following each stage in the circuit is a triplicated
voter.
 Each voter is a circuit that has three inputs and
one output.
 If two or three of the inputs are the same, the
output is equal to that input.
 If all three inputs are different, the output is
undefined.
 This kind of design is known as TMR (Triple
Modular Redundancy).
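
The voter's behavior is simple enough to state in a few lines. A Python model of it (real voters are circuits; the names `vote` and `tmr_stage` are illustrative):

```python
def vote(a, b, c):
    """Majority voter: return the value at least two inputs agree on;
    if all three inputs differ, the output is undefined (None here)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None

def tmr_stage(replicas, x):
    """One TMR stage: three replicas of the same unit feed one voter."""
    r1, r2, r3 = replicas
    return vote(r1(x), r2(x), r3(x))

# A faulty replica is outvoted as long as the other two agree:
ok = lambda x: x + 1
bad = lambda x: 0                      # stuck-at-zero failure
assert tmr_stage((ok, bad, ok), 41) == 42
```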
Physical Redundancy
 TMR can be applied to any hardware unit.
 The TMR can completely mask the failure
of one hardware unit.
 No explicit actions need to be performed
for error detection, recovery, etc.
 Particularly suitable for transient failures
if we assume the basic TMR scheme (one
voter, three replicas).
Physical Redundancy
 This scheme can’t handle the failure of two
units.
 Once a unit fails, it is essential that the
remaining two units continue to work correctly.
 The TMR scheme depends critically on the
voting element. The voting element is
typically a simple circuit and highly reliable
circuits of this complexity can be built.
 The failure of a single voter cannot be
tolerated.
Physical Redundancy
 The TMR approach can be generalized to
replicating N units. This is called the NMR
approach.
 The larger N is, the more faults can be
completely masked.
Physical Redundancy
 The basic TMR/NMR scheme is often
complemented with sparing.
 Sparing is often referred to as stand-by
redundancy since the redundant or spare
units usually are not operating online.
 The restoring organ for sparing is a switch.
 An error detector is also required to
determine when the on-line unit has failed.
 Failed units may be replaced by a spare.
Physical Redundancy
 Some reliability results:
 Overall reliability decreases when the degree
of redundancy is increased above a certain
amount.
 TMR provides the least potential for reliability
improvement.
 NMR systems with spares provide the highest
reliability.
Information Redundancy
 Coding is often used in information redundancy.
 Coding has been extensively used for improving
the reliability of communication.
 The basic idea is to add check bits to the
information bits such that errors in some bits can
be detected, and if possible corrected.
 The process of adding check bits to information
bits is called encoding.
 The reverse process of extracting information
from the encoded data is called decoding.
Information Redundancy
Detectability/Correctability of a Code
 A code defines a set of words that are possible
for that code.
 The Hamming distance of a code is the minimum
number of bit positions in which any two words in
the code differ.
 If d is the Hamming distance, D is the number
of bit errors that it can detect and C is the
number of bit errors it can correct, then the
following relation is always true:
 d = C + D + 1, with D ≥ C
Information Redundancy
Detectability/Correctability of a Code
 Let’s say that you have a code that looks like this:
000
001
010
011
100
101
110
111
 Hamming distance is one. You can’t detect an
error.
 Why? Let’s say that a fault transforms 001 to 011.
How do you know this is a fault vs the possibility
that 011 is correct?
Information Redundancy
Detectability/Correctability of a Code
 On the other hand, let’s say that you have
the following code of 3 codewords:
0000 0011 1100
If a fault changes one bit in a correct word, it will
result in a word that is not in the above list.
This is not true if two bits are changed. Hence, the
above code can only detect a single-bit error.
You can’t correct: say that 0000 changes to 0010.
You know there is an error, but how do you know
whether it should go back to 0000 rather than
0011?
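
Both examples can be checked mechanically. A small Python sketch (illustrative, not from the text) that computes the Hamming distance of a code as the minimum pairwise distance:

```python
from itertools import combinations

def distance(w1, w2):
    """Number of bit positions in which two equal-length words differ."""
    return sum(b1 != b2 for b1, b2 in zip(w1, w2))

def code_distance(code):
    """Hamming distance of a code: minimum distance over all word pairs."""
    return min(distance(a, b) for a, b in combinations(code, 2))

print(code_distance(["000", "001", "010", "011",
                     "100", "101", "110", "111"]))  # 1: detects nothing
print(code_distance(["0000", "0011", "1100"]))      # 2: D = 1, C = 0
```

The second result matches the relation above: with d = 2, the best we can do is D = 1, C = 0 (2 = 0 + 1 + 1).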
Information Redundancy
Simple Parity Bits
 Simple parity bits have been in common use
in computer systems for many years.
 The parity bit is selected so that the total
number of 1’s in the codeword is odd (even)
for an odd-parity (even-parity) code.
 This means that the Hamming distance is 2.
 The parity bit can only detect single bit
errors.
Information Redundancy
Simple Parity Bits
 Example (assume odd-parity):
 Codeword is 000; the parity bit is 1
 Codeword is 001; the parity bit is 0
 Codeword is 010; the parity bit is 0
 The data 000 is thus transferred as 0001: the
parity bit 1 makes the total number of ones odd
(remember that with odd parity, every valid
codeword has an odd number of ones).
Information Redundancy
Simple Parity Bits
 All errors involving an odd number of bits
can be detected because such errors will
produce an incorrect parity.
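
A minimal Python sketch of odd parity, using bit strings for readability (function names are illustrative):

```python
def add_odd_parity(data):
    """Append a parity bit so the codeword has an odd number of 1s."""
    parity = "1" if data.count("1") % 2 == 0 else "0"
    return data + parity

def odd_parity_ok(codeword):
    """A received codeword is valid iff its number of 1s is odd."""
    return codeword.count("1") % 2 == 1

assert add_odd_parity("000") == "0001"   # as in the slide
assert odd_parity_ok("0001")
assert not odd_parity_ok("0011")         # one flipped bit is detected
assert odd_parity_ok("0111")             # two flipped bits slip through
```

The last line illustrates the limitation: an even number of bit errors leaves the parity correct and goes undetected.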
Information Redundancy
Hamming Codes
 Multiple parity bits are added such that
each parity bit is a parity of a subset of
information bits. The code can detect and
also correct errors.
 Widely used in semiconductor memory and
in disk arrays.
Information Redundancy
Hamming Codes
 Parity bits occupy bit positions 1, 2, 4, …
(the powers of 2) in the encoding. The
remaining positions hold the data bits.
 Let k be the number of parity bits.
 Let m be the number of data bits.
 The word length of the encoded word is
m+k.
Information Redundancy
Hamming Codes – Example
 Let k = 3 and m = 4
 Bits in positions 1,2,4 are the parity bits. Label
these as c1,c2 and c3.
 Bits in positions 3,5,6,7 are the data bits. Label
these as d1,d2,d3 and d4.
 The values of the parity bits are defined by the
following relations (⊕ denotes exclusive-or):
c1 = d1 ⊕ d2 ⊕ d4
c2 = d1 ⊕ d3 ⊕ d4
c3 = d2 ⊕ d3 ⊕ d4
 Bit positions (binary) and their contents:
1 (001) c1    5 (101) d2
2 (010) c2    6 (110) d3
3 (011) d1    7 (111) d4
4 (100) c3
Information Redundancy
Hamming Codes – Example
 Let the word to be transmitted be 1011.
Position: 001  010  011  100  101  110  111
Bit:       c1   c2   d1   c3   d2   d3   d4
Value:      0    1    1    0    0    1    1

c1 = d1 ⊕ d2 ⊕ d4 = 0
c2 = d1 ⊕ d3 ⊕ d4 = 1
c3 = d2 ⊕ d3 ⊕ d4 = 0
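
The same encoding in a short Python sketch (the function name and the bit-list representation are illustrative):

```python
def hamming_7_4_encode(d1, d2, d3, d4):
    """Encode 4 data bits into 7; check bits sit at positions 1, 2, 4."""
    c1 = d1 ^ d2 ^ d4   # parity over positions 3, 5, 7 (low bit of position set)
    c2 = d1 ^ d3 ^ d4   # parity over positions 3, 6, 7 (middle bit set)
    c3 = d2 ^ d3 ^ d4   # parity over positions 5, 6, 7 (high bit set)
    return [c1, c2, d1, c3, d2, d3, d4]   # positions 1..7

print(hamming_7_4_encode(1, 0, 1, 1))     # [0, 1, 1, 0, 0, 1, 1], as above
```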
Information Redundancy
Hamming Codes – Example
 How do we come up with these relations?
 A Hamming code generator computes the
check bits according to the following
scheme.
• The binary representation of the position
number j is j_{k-1} … j_1 j_0.
• The value of a check bit is chosen to give odd
(or even) parity over all bit positions j such that
j_i = 1.
• Thus each bit of the data word participates in
several different check bits.
Information Redundancy
Hamming Codes – Example
 Assume the word transferred is 1111 (d2 was
transmitted improperly; it was originally a zero).

Position: 001  010  011  100  101  110  111
Bit:       c1   c2   d1   c3   d2   d3   d4
Value:      0    1    1    0    1    1    1
Information Redundancy
Hamming Codes – Example
 Location of the bit in error
 The check bits obtained from the relations given
above are XORed with the actual check bits taken
from the received code. If every error bit is 0,
there is no error; otherwise the error bits
specify the location of the bit in error:
c1’ = d1 ⊕ d2 ⊕ d4 = 1
c2’ = d1 ⊕ d3 ⊕ d4 = 1
c3’ = d2 ⊕ d3 ⊕ d4 = 1
e1 = c1 ⊕ c1’ = 0 ⊕ 1 = 1
e2 = c2 ⊕ c2’ = 1 ⊕ 1 = 0
e3 = c3 ⊕ c3’ = 0 ⊕ 1 = 1
 The error location e3e2e1 = 101 = 5, i.e., d2
(d2 is common to c1’ and c3’, as well as to c1
and c3).
 Correction is done by simply complementing the
bit.
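
The whole locate-and-correct step fits in a few lines of Python (a sketch; the word is a list of bits at positions 1..7, as in the encoder above):

```python
def hamming_7_4_correct(word):
    """Recompute the check bits, XOR them with the received ones, and read
    the syndrome e3 e2 e1 as the 1-based position of the bit in error."""
    c1, c2, d1, c3, d2, d3, d4 = word
    e1 = c1 ^ (d1 ^ d2 ^ d4)
    e2 = c2 ^ (d1 ^ d3 ^ d4)
    e3 = c3 ^ (d2 ^ d3 ^ d4)
    position = 4 * e3 + 2 * e2 + e1      # 0 means no error
    if position:
        word[position - 1] ^= 1          # correct by complementing the bit
    return word

print(hamming_7_4_correct([0, 1, 1, 0, 1, 1, 1]))  # [0, 1, 1, 0, 0, 1, 1]
```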
Information Redundancy
Hamming Codes – Example
 The use of Hamming codes becomes more
efficient, in terms of the number of check bits
needed relative to the number of data bits,
as the word size increases.
 If the data word length is 8 bits, the
number of check bits will be 4. This
overhead is 50%.
 If the word length is 84 bits, the number
of check bits will be 7 giving an overhead
of 9 percent.
Information Redundancy
Cyclic Redundancy Code (CRC)
 These codes are applied to a block of data,
rather than independent words.
 CRCs are commonly used in detecting
errors in data communication.
 A sequence of bits is represented as a
polynomial; encoding uses a fixed generator
polynomial.
Information Redundancy
Cyclic Redundancy Code (CRC)
 If the kth bit is 1, then the polynomial contains x^k.
 Example: 1100101101 →
x^9 + x^8 + x^5 + x^3 + x^2 + 1
 Encoding
 To the data bit sequence, add (k+1) bits at the end.
 The extended data sequence is divided (modulo 2) by
the generator polynomial.
 The final remainder replaces the added bits to form
the encoded data.
Information Redundancy
Cyclic Redundancy Code (CRC)
 Decoding
 The extra (k+1) bits are just discarded to
obtain the original data bits.
 Error checking: The data bits are again divided
by the generator polynomial and the final
remainder is checked with last (k+1) bits of the
received data.
 If there is a difference, an error has occurred.
Information Redundancy
Cyclic Redundancy Code (CRC)
 Through proper selection of the generating
polynomial CRC codes will:
 Detect all single bit errors in the data stream
 Detect all double bit errors in the data stream
 Detect any odd number of errors in the data
stream
 Detect any burst error for which the length of the
burst is less than the length of the generating
polynomial
 Detect most larger burst errors
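
A Python sketch of the modulo-2 division underlying CRC encoding and checking. The data word and the generator 10011 (x^4 + x + 1) are arbitrary example choices, and this sketch appends one check bit per degree of the generator (one less than the length of its bit pattern):

```python
def mod2_remainder(bits, gen):
    """Remainder of modulo-2 (XOR) polynomial division of bits by gen."""
    bits = list(bits)
    for i in range(len(bits) - len(gen) + 1):
        if bits[i] == "1":
            for j in range(len(gen)):
                bits[i + j] = "1" if bits[i + j] != gen[j] else "0"
    return "".join(bits[-(len(gen) - 1):])

def crc_encode(data, gen):
    """Append check bits: the remainder of (data + zeros) divided by gen."""
    return data + mod2_remainder(data + "0" * (len(gen) - 1), gen)

def crc_ok(received, gen):
    """A valid block leaves an all-zero remainder when divided by gen."""
    return set(mod2_remainder(received, gen)) <= {"0"}

block = crc_encode("1101011011", "10011")   # check bits appended: 1110
assert crc_ok(block)
flipped = block[:5] + ("0" if block[5] == "1" else "1") + block[6:]
assert not crc_ok(flipped)                  # a single-bit error is caught
```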
Time Redundancy
 An action is performed and if the need
arises, it is performed again.
 Example: If a transaction aborts, it can be
redone with no harm.
 This is especially useful when the faults
are transient or intermittent.
Case Study
 Let’s look at RAID (Redundant Array of
Inexpensive Disks).
 Motivation
 Improve disk access time by using arrays of
disks
 Disks are getting inexpensive.
 Lower cost disks:
• Less capacity.
• But cheaper, smaller, and lower power.
Disk Organization 1
 Interleaving disks.
 Supercomputing applications.
 Transfer of large blocks of data at high rates.
 Grouped read: a single read is spread over
multiple disks.


Disk Organization 1
 What is interleaving?
 Assume you have 4 disks.
 Byte interleaving means that byte N is on disk
(N mod 4).
 Block interleaving means that block N is on disk
(N mod 4).
 All reads and writes involve all disks, which is
great for large transfers
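
As a tiny illustration of the layout (assuming 4 disks, as above):

```python
NUM_DISKS = 4

def disk_for(n):
    """Byte or block interleaving: item n is stored on disk n mod 4."""
    return n % NUM_DISKS

# A sequential read of blocks 0..7 hits disks 0,1,2,3,0,1,2,3,
# so the transfer is spread evenly over the whole array:
print([disk_for(n) for n in range(8)])
```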
Disk Organization 2

 Independent disks.
 Transaction processing applications.
 Database partitioned across disks.
 Concurrent access to independent items
(independent reads and writes proceed on
separate disks).
Problem: Reliability
 Disk unreliability causes frequent backups.
 Fault tolerance is needed, otherwise disk
arrays are too unreliable to be useful.
 RAID: Use of extra disks containing
redundant information.
 Similar to redundant transmission of data.
RAID Levels
 Different levels provide different
reliability, cost, and performance.
 The mean time to failure (MTTF) is a
function of total number of disks, number
of data disks in a group (G), number of
check disks per group (C), and number of
groups.
 The number C is determined by RAID level.
First RAID Level
 Mirrors
 Most expensive approach.
 All disks duplicated (G=1 and C=1).
 Every write to data disk results in write to
check disk.
 Reads can be from either disk.
 Double cost and half capacity.
Second RAID Level

 Data is split at the bit level and spread over
data and redundancy (check) disks.
 Redundant bits are computed using a Hamming
code and placed on the redundancy disks.
 Interleave data across disks in a group.
 Add enough check disks to detect/correct
errors.
 A single parity disk detects a single error.
 Makes sense for large data transfers.
 Small transfers mean all disks must be
accessed (to check if data is correct).
Third and Fourth RAID Level
 The third RAID level is similar to the
second RAID level except that splitting of
data is at the byte level. There is one
parity disk.
 The fourth RAID level is similar to the
third RAID level except that splitting of
data is at the block level. There is one
parity disk.
 The fifth RAID level is similar to the
fourth RAID level except that check bits
are distributed across multiple disks.
 There are 8 RAID levels.
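
The check information in RAID levels 3 through 5 is bytewise XOR parity, which is what makes reconstruction of a failed disk possible. A Python sketch with made-up block contents:

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (the RAID parity computation)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

d0, d1, d2 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
parity = xor_blocks([d0, d1, d2])

# If the disk holding d1 fails, XORing the survivors with the parity
# reconstructs the lost block:
assert xor_blocks([d0, d2, parity]) == d1
```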
Process Resilience
 The key approach to tolerating a faulty
process is to organize several identical
processes in a group.
 Design issues include the following:
 When a message is sent to the group itself, all
members of the group should receive it.
 How to manage process groups (e.g., members
joining and leaving).
Problems of Agreement
 A set of processes need to agree on a value
(decision), after one or more processes have
proposed what that value (decision) should be
 Examples:
 mutual exclusion, election, transactions
 Processes may be correct, crashed, or they
may exhibit arbitrary (Byzantine) failures
 Messages are exchanged on a one-to-one
basis, and they are not signed.
Problems of Agreement
 The general goal of distributed agreement
algorithms is to have all the nonfaulty
processes reach consensus on some issue
and to establish that consensus within a
finite number of steps.
 What if processes exhibit Byzantine
failures?
 This is often compared to armies in the
Byzantine Empire, in which conspiracies,
intrigue, and untruthfulness were alleged to
be common in ruling circles.
The Two-Army Problem
 How can two perfect processes reach agreement
about 1 bit of information over an unreliable
communication channel?
 Red army: 5000 troops
 Blue armies #1 and #2: 3000 troops each
 How can the blue armies reach agreement on when
to attack?
 Their only means of communication is sending
messengers…
 …that may be captured by the enemy!
 No solution!
The Two-Army Problem
 Proof by contradiction: Assume there is a
solution with a minimum number of
messages
 Suppose the commander of blue army 1 is General
Alexander and the commander of blue army 2 is
General Bonaparte.
 General Alexander sends a message to General
Bonaparte reading “I have a plan; let’s attack at
dawn tomorrow”.
 The messenger gets through and Bonaparte
sends him back a message with a note saying
“Splendid idea, Alex. See you at dawn
tomorrow.”
 The messenger gets back.
The Two-Army Problem
 Proof by contradiction (cont)
 Alexander wants to make sure that Bonaparte
does know that the messenger got back safely
so that Bonaparte is confident that Alexander
will attack.
 Alexander tells the messenger to go tell
Bonaparte that his message arrived and the
battle is set.
 The messenger gets through, but now
Bonaparte worries that Alexander does not
know if the acknowledgement got through.
 Bonaparte acknowledges the acknowledgement,
and so on: since the last message sent can
always be lost, no finite exchange of messages
suffices, contradicting the assumed solution.
History Lesson: The Byzantine
Empire
 Time: 330-1453 AD.
 Place: Balkans and Modern Turkey.
 Endless conspiracies, intrigue, and
untruthfulness were alleged to be common
practice in the ruling circles of the day.
 That is: it was typical for intentionally wrong
and malicious activity to occur among the ruling
group. A similar occurrence can surface in a DS,
and is known as ‘Byzantine failure’.
 Question: how do we deal with such malicious
group members within a distributed system?
Byzantine Generals Problem
 Now assume that the communication is perfect but
the processes are not.
 This problem also occurs in military settings and is
called the Byzantine Generals Problem.
 We still have the red army, but n blue generals.
 Communication is done pairwise by phone; it is
instantaneous and perfect.
 m of the generals are traitors (faulty) and are
actively trying to prevent the loyal generals from
reaching agreement by feeding them incorrect and
contradictory information.
 Is agreement still possible?
Byzantine Generals Problem
 We will illustrate by example where there
are 4 generals, where one is a traitor
(analogous to a faulty process).
 Step 1:
 Every general sends a (reliable) message to
every other general announcing his troop
strength.
 Loyal generals tell the truth.
 Traitors tell every other general a different
lie.
 Example: general 1 reports 1K troops, general 2
reports 2K troops, general 3 lies to everyone
(giving x, y, z respectively) and general 4
reports 4K troops.
Byzantine Generals Problem
 Step 2:
 The results of the announcements of step 1 are
collected together in the form of vectors.
Byzantine Generals Problem
 Step 3
 Consists of every general passing his vector
from the previous step to every other general.
 Each general gets three vectors from each
other general.
 General 3 hasn’t stopped lying. He invents 12
new values: a through l.
Byzantine Generals Problem
 Step 4
 Each general examines the ith element of each
of the newly received vectors.
 If any value has a majority, that value is put
into the result vector.
 If no value has a majority, the corresponding
element of the result vector is marked
UNKNOWN.
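
Step 4 is a per-element majority vote. A Python sketch of it, with made-up values in the spirit of the four-general example (general 3's inventions shown as letters):

```python
from collections import Counter

def majority_or_unknown(values):
    """Return the value reported by a strict majority, else 'UNKNOWN'."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else "UNKNOWN"

# Vectors received by general 1 in step 3 (from generals 2, 3, and 4):
vectors = [
    (1, 2, "y", 4),            # relayed by general 2 (loyal)
    ("a", "b", "c", "d"),      # invented by general 3 (the traitor)
    (1, 2, "z", 4),            # relayed by general 4 (loyal)
]
result = [majority_or_unknown(col) for col in zip(*vectors)]
print(result)                  # [1, 2, 'UNKNOWN', 4]
```

The loyal generals' troop strengths are recovered; only the traitor's own entry ends up UNKNOWN.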
Byzantine Generals Problem

 The same as in the previous example, except now
with 2 loyal generals and one traitor.
Byzantine Generals Problem

 With m faulty processes, agreement is possible
only if 2m+1 processes function correctly.
 The total is 3m+1.
 If messages cannot be guaranteed to be delivered
within a known, finite time, no agreement is
possible if even one process is faulty.
 Why? Slow processes are indistinguishable from
crashed ones.
Byzantine Generals Problem
 Let f be the number of faults to be
tolerated.
 The algorithm needs f+1 rounds.
 In each round, a process sends to all the
other processes the values that it received
in the previous round. The number of
messages sent is on the order of
O(N^(f+1)), where N is the number of generals.
 If you do not assume Byzantine faults then
you need a lot less infrastructure.
