Topics
Basic concepts
Physical Redundancy
Information Redundancy
Timing Redundancy
RAID
Readings
Tanenbaum: 7.1, 7.2
Introduction
A characteristic feature of distributed
systems that distinguishes them from
single-machine systems is the notion of
partial failure
A partial failure may happen when one
component in a distributed system fails.
This failure may affect the proper
operation of some components, while
leaving others unaffected.
Introduction
An important goal in design is to construct
the system in such a way that it can
automatically recover from partial failures
without seriously affecting the overall
performance.
The distributed system should continue to
operate in an acceptable way while repairs
are being made.
By the way….
Computing systems are not very reliable
OS crashes frequently (Windows), buggy
software, unreliable hardware,
software/hardware incompatibilities
Until recently: computer users were “tech
savvy”
• Could depend on users to reboot, troubleshoot
problems
By the way….
Computing systems are not very reliable
(cont)
Growing popularity of Internet/World Wide
Web
• “Novice” users
• Need to build more reliable/dependable systems
Example: what if your TV (or car) broke down
every day?
• Users don’t want to “restart” TV or fix it (by opening
it up)
Need to make computing systems more
reliable
Characterizing Dependable Systems
Dependable systems are characterized by
Availability
• This refers to the percentage of time the system
is immediately available for use
Reliability
• Mean time to failure (MTTF), i.e., the average
time the system runs before it fails.
Safety
• How serious is the impact of a failure
Maintainability
• How long does it take to repair the system
Security
Characterizing Dependable Systems
Availability and reliability are not the same
thing.
If a system goes down for a millisecond
every hour, it has an availability of over
99.9999 percent, but it is still highly
unreliable.
A system that never crashes but is shut
down for two weeks every August has high
reliability but only 96 percent availability.
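The arithmetic behind both examples is straightforward; a quick sketch (assuming a 365-day year for the second case):

```python
# System 1: down 1 ms in every hour.
down = 0.001                      # seconds of downtime per hour
avail1 = 1 - down / 3600
print(f"{avail1 * 100:.5f}%")     # 99.99997% -- available but unreliable

# System 2: shut down two weeks every August.
year = 365 * 24 * 3600            # seconds in a year
down = 14 * 24 * 3600             # two weeks of downtime
avail2 = 1 - down / year
print(f"{avail2 * 100:.1f}%")     # 96.2% -- reliable but less available
```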
Definitions
A system fails when it does not perform
according to its specification.
An error is part of a system state that
may lead to a failure.
A fault is the cause of an error.
Definitions
Types of Faults
Transient
• Occur once and then disappear.
• If the operation is repeated, the fault goes away.
• Example: a bird flying through the beam of a microwave
transmitter may cause lost bits on some network (not
to mention a roasted bird).
Definitions
Types of Faults (continued)
Intermittent
• Occurs, then vanishes of its own accord, then
reappears, and so on.
• A loose connector will often cause an intermittent
fault.
Permanent
• Continues to exist until the faulty component is
repaired.
• Examples: burnt-out chips, software bugs, and disk
head crashes.
A fault-tolerant system does not fail in
the presence of faults.
Server Failure Models
Timing failure: a server's response lies outside the specified time interval
0 1 1 0 0 1 1
c1 = d1 ⊕ d2 ⊕ d4
c2 = d1 ⊕ d3 ⊕ d4
c3 = d2 ⊕ d3 ⊕ d4
Information Redundancy
Hamming Codes – Example
How do we come up with these relations?
A Hamming code generator computes the
check bits according to the following
scheme.
• The binary representation of the position
number j is j_(k-1) ... j_1 j_0.
• The value of check bit i is chosen to give odd
(or even) parity over all bit positions j such that
j_i = 1.
• Thus each bit of the data word participates in
several different check bits.
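The scheme above can be sketched in a few lines. The sketch below assumes even parity and the standard layout where check bit i sits at position 2^i (the lecture's figures may use a different parity convention):

```python
def hamming_encode(data_bits):
    """Place data bits at non-power-of-two positions (1-indexed) and
    compute check bit i as even parity over all positions j whose
    binary representation has bit i set (i.e., j_i = 1)."""
    r = 0                                   # number of check bits
    while (1 << r) < len(data_bits) + r + 1:
        r += 1
    n = len(data_bits) + r                  # codeword length
    code = [0] * (n + 1)                    # index 0 unused
    data = iter(data_bits)
    for j in range(1, n + 1):
        if j & (j - 1):                     # not a power of two: data slot
            code[j] = next(data)
    for i in range(r):
        p = 1 << i                          # check bit lives at position 2^i
        code[p] = sum(code[j] for j in range(1, n + 1) if j & p) % 2
    return code[1:]

print(hamming_encode([1, 0, 1, 1]))  # [0, 1, 1, 0, 0, 1, 1]
```

For the data word 1011 this yields the 7-bit codeword 0110011, consistent with the relations c1 = d1 ⊕ d2 ⊕ d4, c2 = d1 ⊕ d3 ⊕ d4, c3 = d2 ⊕ d3 ⊕ d4.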
Information Redundancy
Hamming Codes – Example
Assume the word transferred is 1111.
0 1 1 0 1 1 1
...
Independent disks.
Transaction processing applications.
Database partitioned across disks.
Concurrent access to independent items.
...
Problem: Reliability
Disk unreliability causes frequent backups.
Fault tolerance is needed, otherwise disk
arrays are too unreliable to be useful.
RAID: Use of extra disks containing
redundant information.
Similar to redundant transmission of data.
RAID Levels
Different levels provide different
reliability, cost, and performance.
The mean time to failure (MTTF) is a
function of total number of disks, number
of data disks in a group (G), number of
check disks per group (C), and number of
groups.
The number C is determined by RAID level.
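The slides do not give the formula, but the classic approximation from Patterson, Gibson, and Katz's RAID paper captures the dependence: without redundancy, MTTF shrinks linearly with the number of disks, while a group with check disks is lost only if a second disk fails before the first is repaired. A sketch under those assumptions (example numbers are illustrative, not from the lecture):

```python
def mttf_array(mttf_disk, n_disks):
    """With independent, exponentially distributed lifetimes,
    failure rates add: the array's MTTF is the disk MTTF over n."""
    return mttf_disk / n_disks

def mttf_group(mttf_disk, g, c, mttr):
    """Patterson/Gibson/Katz approximation for a group of g data +
    c check disks tolerating one failure: the group is lost only if
    a second disk dies within the repair window of the first."""
    n = g + c
    return mttf_disk ** 2 / (n * (n - 1) * mttr)

# Illustrative numbers: 30,000-hour disks, 1-hour repair time.
print(mttf_array(30_000, 100))        # 300 hours: 100 disks, no redundancy
print(mttf_group(30_000, 10, 1, 1))   # ~8.2 million hours with 1 check disk
```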
First RAID Level
Mirrors
Most expensive approach.
All disks duplicated (G=1 and C=1).
Every write to data disk results in write to
check disk.
Reads can be from either disk.
Double cost and half capacity.
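The mirroring rule can be shown as a toy Python model (a sketch, not a real block device): every write is duplicated on both disks, and a read can be served by either surviving copy.

```python
class MirroredDisk:
    """RAID level 1 sketch: G = 1 data disk, C = 1 check (mirror) disk."""

    def __init__(self, nblocks):
        self.disks = [[None] * nblocks, [None] * nblocks]

    def write(self, block, data):
        for disk in self.disks:        # every write hits both disks
            disk[block] = data

    def read(self, block, failed=None):
        for i, disk in enumerate(self.disks):
            if i != failed:            # any surviving disk serves the read
                return disk[block]

m = MirroredDisk(8)
m.write(3, "payload")
print(m.read(3, failed=0))  # still "payload" after losing disk 0
```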
Second RAID Level
No solution!
The Two-Army Problem
Proof by contradiction: Assume there is a
solution with a minimum number of
messages
Suppose the commander of blue army 1 is General
Alexander and the commander of blue army 2
is General Bonaparte.
General Alexander sends a message to General
Bonaparte reading “I have a plan; let’s attack at
dawn tomorrow”.
The messenger gets through and Bonaparte
sends him back a message with a note saying
“Splendid idea, Alex. See you at dawn
tomorrow.”
The messenger gets back.
The Two-Army Problem
Proof by contradiction (cont)
Alexander wants to make sure that Bonaparte
does know that the messenger got back safely
so that Bonaparte is confident that Alexander
will attack.
Alexander tells the messenger to go tell
Bonaparte that his message arrived and the
battle is set.
The messenger gets through, but now
Bonaparte worries that Alexander does not
know if the acknowledgement got through.
Bonaparte acknowledges the acknowledgement.
Etc., etc., etc.
In any finite protocol, some last message goes
unacknowledged, so its sender can never rely on it;
dropping it yields a shorter protocol, contradicting
the assumed minimality. Hence no solution exists.
History Lesson: The Byzantine
Empire
Time: 330-1453 AD.
Place: Balkans and Modern Turkey.
Endless conspiracies, intrigue, and
untruthfulness were alleged to be common
practice in the ruling circles of the day.
That is: it was typical for intentionally wrong
and malicious activity to occur among the ruling
group. A similar occurrence can surface in a
distributed system, and is known as 'Byzantine failure'.
Question: how do we deal with such malicious
group members within a distributed system?
Byzantine Generals Problem
Now assume that the communication is perfect but
the processes are not.
This problem also occurs in military settings and is
called the Byzantine Generals Problem.
We still have the red army, but n blue generals.
Communication is done pairwise by phone; it is
instantaneous and perfect.
m of the generals are traitors (faulty) and are
actively trying to prevent the loyal generals from
reaching agreement by feeding them incorrect and
contradictory information.
Is agreement still possible?
Byzantine Generals Problem
We will illustrate by example where there
are 4 generals, where one is a traitor
(analogous to a faulty process).
Step 1:
Every general sends a (reliable) message to
every other general announcing his troop
strength.
Loyal generals tell the truth.
Traitors tell every other general a different
lie.
Example: general 1 reports 1K troops, general 2
reports 2K troops, general 3 lies to everyone
(giving x, y, z respectively) and general 4
reports 4K troops.
Byzantine Generals Problem
Step 2:
The results of the announcements of step 1 are
collected together in the form of vectors.
Byzantine Generals Problem
Step 3
Consists of every general passing his vector
from the previous step to every other general.
Each general thus receives one vector from each
of the other three generals.
General 3 hasn’t stopped lying. He invents 12
new values: a through l.
Byzantine Generals Problem
Step 4
Each general examines the ith element of each
of the newly received vectors.
If any value has a majority, that value is put
into the result vector.
If no value has a majority, the corresponding
element of the result vector is marked
UNKNOWN.
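Step 4 is an element-wise majority vote and can be sketched as below. The three vectors are hypothetical stand-ins for what general 1 might receive in step 3; the traitor's column has no majority:

```python
from collections import Counter

UNKNOWN = None

def majority_vector(vectors):
    """For each position, keep the value reported by a strict majority
    of the received vectors; otherwise mark that position UNKNOWN."""
    result = []
    for elems in zip(*vectors):
        value, count = Counter(elems).most_common(1)[0]
        result.append(value if count > len(elems) / 2 else UNKNOWN)
    return result

# What general 1 might receive in step 3: relayed vectors from
# generals 2 and 4 agree, while the traitor (general 3) invents values.
received = [
    (1, 2, "y", 4),        # relayed by general 2
    ("a", "b", "c", "d"),  # invented by the traitor
    (1, 2, "z", 4),        # relayed by general 4
]
print(majority_vector(received))  # [1, 2, None, 4] -- None marks UNKNOWN
```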
Byzantine Generals Problem