
Contents:

1. Introduction.
2. Fault Tolerance by Duplication.
3. Types:
   Hardware Fault Tolerance Systems
   Software Fault Tolerance Systems
4. General Fault Tolerance Procedure.
5. Fault Tolerance in Distributed Systems.
6. Conclusion.
7. References.
Introduction
• Fault tolerance is the property that enables a
system to continue operating properly when some
of its parts fail.
• A fault-tolerant system is designed from the
ground up for reliability by building
multiples of all critical components, such as
CPUs, memories, disks.
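• Three terms describe how a problem propagates
through a system:
Fault: a defect in a component or in the design.
Error: an invalid system state caused by a fault.
Failure: a deviation of the delivered service from
its specification, caused by an error.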
• Fault-tolerance is not just a property of
individual machines; it may also characterise
the rules by which they interact.
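• Recovery from errors in fault-tolerant systems
can be characterised as either roll-forward or
roll-back. Roll-back returns the system to an
earlier correct state, such as a checkpoint, and
continues from there; roll-forward corrects the
current state so that the system can move ahead.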
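The two recovery styles can be sketched as follows. This is a
minimal illustration, not a production design: the "balance"
state, the validity rule, and every function name are
assumptions made up for the example.

```python
import copy

def is_valid(state):
    # Hypothetical validity rule: a balance must never be negative.
    return state["balance"] >= 0

def apply_roll_back(state, checkpoint, update):
    """Apply update; if the result is invalid, revert to the checkpoint."""
    candidate = copy.deepcopy(state)
    update(candidate)
    if is_valid(candidate):
        return candidate
    return copy.deepcopy(checkpoint)

def apply_roll_forward(state, update, repair):
    """Apply update; if the result is invalid, repair it and carry on."""
    candidate = copy.deepcopy(state)
    update(candidate)
    if not is_valid(candidate):
        repair(candidate)
    return candidate

def overdraw(state):
    state["balance"] -= 150   # drives the state into an error

def clamp(state):
    state["balance"] = 0      # one possible correction of the error

checkpoint = {"balance": 100}
rolled_back = apply_roll_back({"balance": 100}, checkpoint, overdraw)
rolled_forward = apply_roll_forward({"balance": 100}, overdraw, clamp)
```

Here roll-back discards the faulty update entirely, while
roll-forward keeps going from a corrected version of the
current state.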
Fault-tolerance by duplication:
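Duplication can give fault-tolerance in three ways:
Replication: several identical instances run the
same task in parallel and the correct result is
chosen by majority.
Redundancy: several identical instances exist and
a spare takes over when one fails.
Diversity: several different implementations of
the same specification guard against a fault in
any one implementation.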
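As an illustration of replication, where identical instances
run the same task and the result is chosen by majority, a
voter might look like the sketch below. The function names
and the toy replicas are assumptions for the example only.

```python
from collections import Counter

def majority_vote(results):
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica results")
    return value

def run_replicated(replicas, x):
    # Every replica gets the same input; the voter picks the answer.
    return majority_vote([replica(x) for replica in replicas])

def good(x):
    return x * x

def faulty(x):
    return x * x + 1   # simulated faulty replica

result = run_replicated([good, good, faulty], 4)
```

With three replicas the voter masks one faulty instance; if
no value reaches a strict majority it raises instead of
guessing.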
A redundant array of independent disks (RAID)
is an example of a fault-tolerant storage device that
uses redundancy.
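One way such storage redundancy can work is XOR parity, as
used by parity-based RAID levels (mirroring levels simply
keep full copies instead). The sketch below assumes three
equal-sized data blocks plus one parity block; the block
contents and the helper name are illustrative.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"ABCD", b"EFGH", b"IJKL"]   # blocks on three data disks
parity = xor_blocks(data)            # block on the parity disk

# Simulate losing the second disk and rebuild it from the rest.
recovered = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing block
equals the XOR of all the surviving blocks.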
Tandem and Stratus were among the first manufacturers
dedicated to building fault-tolerant computer
systems for the online transaction processing
(OLTP) market.
Types:
Fault-tolerant systems are of two types:

1. Hardware fault tolerance systems.
2. Software fault tolerance systems.
Hardware fault tolerance systems:

• Hardware fault tolerance is a well-understood
problem. It consists of including two or more
identical copies of every working part.
• Such a system implemented with a single
backup is known as single-point tolerant,
and represents the vast majority of
fault-tolerant systems.
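A single-backup arrangement can be sketched as a pair of
identical units with failover. The classes below are
hypothetical stand-ins for hardware components, not a real
driver interface.

```python
class ComponentFailure(Exception):
    """Raised when a unit can no longer serve requests."""

class Unit:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self):
        if not self.healthy:
            raise ComponentFailure(self.name)
        return f"data from {self.name}"

class RedundantPair:
    """One active unit plus a single backup (single-point tolerant)."""

    def __init__(self, primary, backup):
        self.active, self.standby = primary, backup

    def read(self):
        try:
            return self.active.read()
        except ComponentFailure:
            if self.standby is None:
                raise   # no spare left: the failure becomes visible
            # Fail over: the backup becomes the active unit.
            self.active, self.standby = self.standby, None
            return self.active.read()

pair = RedundantPair(Unit("primary", healthy=False), Unit("backup"))
answer = pair.read()
```

The pair survives one failure; a second failure propagates to
the caller, which is what "single-point tolerant" implies.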
Software fault tolerance systems:
• Software fault tolerance is the ability of
software to detect and recover from a fault that
is occurring or has already occurred.
• Software faults are all design faults. Software
manufacturing, the reproduction of software, is
considered to be perfect.
• Without software fault tolerance, it is
generally not possible to make a truly fault
tolerant system.
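One well-known way to detect and recover from a design fault
in software is the recovery block pattern, which these slides
do not name; it is shown here only as a possible
illustration. An acceptance test checks each result, and an
alternate implementation is tried when the test fails. All
names below are assumptions for the example.

```python
def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in turn until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue   # a crash counts as a detected fault
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def buggy_sqrt(x):
    return x / 2   # simulated design fault

def newton_sqrt(x):
    guess = x or 1.0
    for _ in range(50):
        guess = (guess + x / guess) / 2
    return guess

def close_enough(x, result):
    return abs(result * result - x) < 1e-9

root = recovery_block([buggy_sqrt, newton_sqrt], close_enough, 9.0)
```

The faulty primary is rejected by the acceptance test and the
independently written alternate supplies the result.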
General Fault Tolerance Procedure:

• Error detection is the process of identifying that the
system is in an invalid state. This means that some
component in the system has failed.
• In the error recovery phase, the error and, more
importantly, its effects are removed by restoring the
system to a valid state.
• Finally, in fault treatment, we go after the fault
that caused the error so that it can be isolated.
In other words, we first treat the symptoms and
then go after the underlying cause.
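The three phases above can be sketched on a toy example. The
sensor readings, the valid range of 0 to 100, and the
function names are all assumptions chosen for illustration.

```python
def detect_error(readings):
    """Error detection: indices of readings outside the valid range."""
    return [i for i, r in enumerate(readings) if not 0 <= r <= 100]

def recover(readings, bad, last_good):
    """Error recovery: restore invalid entries from the last valid state."""
    fixed = list(readings)
    for i in bad:
        fixed[i] = last_good[i]
    return fixed

def treat_fault(bad, isolated):
    """Fault treatment: isolate the components that caused the error."""
    isolated.update(bad)

last_good = [20, 21, 19]
readings = [22, 999, 18]   # sensor 1 reports an impossible value
isolated = set()

bad = detect_error(readings)
if bad:
    readings = recover(readings, bad, last_good)   # treat the symptom
    treat_fault(bad, isolated)                     # then the cause
```

After the run the system state is valid again and sensor 1
is marked as isolated so it can be repaired or replaced.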
Fault Tolerance in Distributed Systems:
Distributed System:
We define a distributed software system as a
system with two or more independent processing sites
that communicate with each other over a medium whose
transmission delays may exceed the time between
successive state changes.
Distributed systems are themselves subject to
several kinds of failure:
Processing site failures
Communication media failures
Transmission delays
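A common way to notice a failed processing site is a
heartbeat timeout, sketched below. The class, the site names,
and the 3-second timeout are assumptions for the example, and
time is passed in explicitly to keep the sketch deterministic.

```python
class HeartbeatDetector:
    """Suspect a site if no heartbeat arrived within the timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, site, now):
        self.last_seen[site] = now

    def suspected(self, now):
        return sorted(
            site for site, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("site-a", now=0.0)
detector.heartbeat("site-b", now=0.0)
detector.heartbeat("site-a", now=4.0)   # site-b stays silent
suspects = detector.suspected(now=5.0)
```

Note that a timeout can only suspect a site: because of
transmission delays, a slow link or slow site looks the same
as a failed one.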
Conclusion:
It is good practice to maintain redundant
subsystems for every important working
part in the system. Software fault tolerance
is still relatively immature compared with
hardware fault tolerance. It is drawing more
and more research attention, as the majority
of system defects are shown to be software
defects.
References:

• Fischer, M., N. Lynch, and M. Paterson,
"Impossibility of Distributed Consensus with One
Faulty Process," Journal of the ACM, Vol. 32,
No. 2, April 1985, pp. 374-382.

• Halpern, J. and Y. Moses, "Knowledge and
Common Knowledge in a Distributed
Environment," Proc. of the 3rd ACM Symposium
on Principles of Distributed Computing, 1984, pp.
