
BIRLA INSTITUTE OF TECHNOLOGY, MESRA

Assignment Topic : FAILURES AND FAULT TOLERANCE IN DISTRIBUTED SYSTEMS

Name : PLABAN ROY


Roll number : MCA/45015/12
Subject : MCA 7101 DISTRIBUTED DATABASES

Acknowledgements: I am indebted to Madam Ajanta Das for giving us a
good understanding of Distributed Database Management Systems.

Fault tolerance is the property that enables a system to continue
operating properly when one or more of its components fail or develop
faults. If its operating quality decreases at all, the decrease is
proportional to the severity of the failure, whereas in a naively
designed system even a small failure can cause total breakdown. Fault
tolerance is particularly sought after in high-availability or
life-critical systems.

A fault-tolerant design enables a system to continue its intended
operation, possibly at a reduced level, rather than failing completely
when some part of the system fails.[1] The term is most commonly used
to describe computer systems designed to remain more or less fully
operational, with perhaps a reduction in throughput or an increase in
response time, in the event of some partial failure. That is, the
system as a whole is not stopped by problems in either the hardware or
the software. Examples in other fields are a motor vehicle designed so
that it remains drivable if one of the tires is punctured, and a
structure that retains its integrity in the presence of damage from
causes such as fatigue, corrosion, manufacturing flaws, or impact.
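
The idea of continuing at a reduced level instead of stopping can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `ComponentFailure`, `read_with_degradation`, `failed_primary` and `slow_backup` are hypothetical and do not come from any particular system.

```python
class ComponentFailure(Exception):
    """Raised by a component that cannot serve a request (hypothetical)."""


def read_with_degradation(key, components):
    """Try each component in order and return (value, degraded).

    `degraded` is True when the answer came from a backup component,
    i.e. the system kept working but at a reduced level of service.
    """
    for position, component in enumerate(components):
        try:
            value = component(key)
        except ComponentFailure:
            continue  # tolerate this partial failure and try the next one
        return value, position > 0
    # Only the failure of every component stops the system as a whole.
    raise ComponentFailure("all components failed")


def failed_primary(key):
    # Hypothetical primary component that is currently down.
    raise ComponentFailure("primary is down")


def slow_backup(key):
    # Hypothetical backup: still correct, but assumed to be slower.
    return "value-of-" + key


print(read_with_degradation("x", [failed_primary, slow_backup]))
```

The caller still receives an answer when the primary fails; the flag only records that service quality was reduced.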

Reasons for Failure


Soft failures make up more than 90% of system failures.
The Tandem data suggest the following breakdown of hardware failures:
    about 49% are disk failures
    23% are due to communication
    17% are due to processor failure
    9% are due to poor wiring
Software failures are typically caused by bugs in the code.
Estimates of the number of bugs in software vary, from 0.25 bugs per
1000 instructions to 10 bugs per 1000 instructions.
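
To get a feel for these figures, the following minimal Python sketch applies them to a program of a given size and to a given number of hardware failures. The function and constant names are hypothetical, and the 100,000-instruction program and the 200 failures are assumed inputs chosen only for illustration.

```python
# Tandem breakdown of hardware failures quoted above, in percent.
TANDEM_HARDWARE_SHARES = {
    "disk": 49,
    "communication": 23,
    "processor": 17,
    "wiring": 9,
}

# Bug-density estimates quoted above, in bugs per 1000 instructions.
BUGS_PER_1000_LOW = 0.25
BUGS_PER_1000_HIGH = 10


def estimated_bug_range(instructions):
    """Return the (low, high) estimate of bugs for a program size."""
    thousands = instructions / 1000
    return thousands * BUGS_PER_1000_LOW, thousands * BUGS_PER_1000_HIGH


def expected_failures(total_hardware_failures):
    """Split a number of hardware failures according to the Tandem shares."""
    return {
        cause: total_hardware_failures * share / 100
        for cause, share in TANDEM_HARDWARE_SHARES.items()
    }


low, high = estimated_bug_range(100_000)  # assumed program size
print(f"between {low:g} and {high:g} bugs")
print(expected_failures(200))  # assumed number of hardware failures
```

The two estimates differ by a factor of 40, so the range is only a rough order-of-magnitude guide.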

Basic Fault Tolerance Approaches and Techniques


Fault tolerance
    Refers to a system design approach which recognizes that faults
    will occur.
Fault prevention (fault intolerance)
    Aims at ensuring that the implemented system will not contain any
    faults. It has two aspects:
    Fault avoidance
        Refers to the techniques used to make sure that faults are not
        introduced into the system. These involve detailed design
        methodologies such as design walkthroughs and design
        inspections.
    Fault removal
        Refers to the techniques employed to detect any faults that
        might have remained in the system despite the application of
        fault avoidance, and to remove these faults.
Fault detection
    Issues a warning when a failure occurs but does not provide any
    means of tolerating the failure.
Latent failure
    A failure that is detected some time after its occurrence.
Mean time to detect
    The average error latency time over a number of identical systems.
Fail-stop modules
    A fail-stop module constantly monitors itself and, when it detects
    a fault, shuts itself down automatically.
Fail-fast
    Implemented in software by defensive programming, where each
    software module checks its own state during state transitions.
Fault avoidance
    Techniques that aim to prevent faults from entering the system
    during the design stage.
Fault removal
    Methods that attempt to find faults within a system before it
    enters service.
Fault detection
    Techniques used during service to detect faults within the
    operational system.
Fault tolerance
    Techniques designed to tolerate faults, i.e. to allow the system to
    operate correctly in the presence of faults.
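
The fail-fast and fail-stop behaviour described above, and the mean time to detect, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `FailFastAccount`, its invariant (a balance that must never be negative), the exception `ModuleStopped` and the function `mean_time_to_detect` are all hypothetical names introduced here.

```python
class ModuleStopped(Exception):
    """Raised when the module has shut itself down (hypothetical)."""


class FailFastAccount:
    """Hypothetical module whose invariant is balance >= 0."""

    def __init__(self, balance=0):
        self.balance = balance
        self.stopped = False
        self._check_state()

    def _check_state(self):
        # Defensive programming: the module checks its own state.
        if self.balance < 0:
            self.stopped = True  # fail-stop: shut down on detecting a fault
            raise ModuleStopped("invariant violated: negative balance")

    def withdraw(self, amount):
        if self.stopped:
            raise ModuleStopped("module is shut down")
        self.balance -= amount  # state transition
        self._check_state()     # fail-fast: check right after the transition


def mean_time_to_detect(latencies):
    """Average error latency over a number of identical systems."""
    return sum(latencies) / len(latencies)


account = FailFastAccount(10)
account.withdraw(4)
print(account.balance)
# Assumed latencies, in hours, observed on three identical systems.
print(mean_time_to_detect([2.0, 4.0, 6.0]))
```

Because the check runs immediately after every state transition, a fault is reported at the point where it arises instead of becoming a latent failure, and once stopped the module refuses all further requests.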

Occurrence of Events over Time
