Assignment Topic: FAILURES AND FAULT TOLERANCE IN DISTRIBUTED SYSTEMS
Name: PLABAN ROY
Roll number: MCA/45015/12
Subject: MCA 7101 DISTRIBUTED DATABASES
Acknowledgements: I am indebted to Madam Ajanta Das for giving us a good understanding of Distributed Database Management Systems.
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, whereas in a naively designed system even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems.
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails.[1] The term is most commonly used to describe computer systems designed to remain more or less fully operational, with perhaps a reduction in throughput or an increase in response time, in the event of some partial failure. That is, the system as a whole is not stopped by problems in either the hardware or the software. An example in another field is a motor vehicle designed so that it remains drivable if one of the tires is punctured. Likewise, a fault-tolerant structure is able to retain its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.
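As a minimal sketch of this idea, assume a hypothetical set of replicas that each hold a copy of the same data. The Python fragment below keeps answering reads as long as one replica is alive, and the count of failed attempts stands in for the increase in response time. The names Replica, ReplicaDown and read_with_failover are illustrative and do not come from any particular system.

```python
class ReplicaDown(Exception):
    """Raised by a replica that has failed (hypothetical)."""


class Replica:
    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = data
        self.alive = alive

    def read(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in turn; return (value, number_of_failed_attempts)."""
    failed = 0
    for replica in replicas:
        try:
            return replica.read(key), failed
        except ReplicaDown:
            failed += 1  # tolerate the fault and try the next copy
    raise RuntimeError("all replicas failed")


replicas = [
    Replica("r1", {"x": 42}, alive=False),  # this copy has failed
    Replica("r2", {"x": 42}),
]
value, failed = read_with_failover(replicas, "x")
print(value, failed)  # 42 1
```

The read still succeeds, only more slowly, which is the reduced level of operation described above. Only when every replica has failed does the system as a whole stop.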
Reasons for Failure
Soft failures make up more than 90% of system failures. The Tandem data suggests that, of hardware failures, about 49% are disk failures, 23% are due to communication, 17% are due to processor failure, and 9% are due to poor wiring.
Software failures are typically caused by bugs in the code. Estimates of the number of bugs in software vary, from 0.25 bugs per 1000 instructions to 10 bugs per 1000 instructions.
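To make this range concrete, the short Python calculation below applies the two quoted rates to a program. The program size of 100,000 instructions is an assumption chosen only for illustration.

```python
def estimated_bugs(instructions, bugs_per_thousand):
    """Expected number of bugs for a given bug density per 1000 instructions."""
    return instructions * bugs_per_thousand / 1000


program_size = 100_000  # assumed program size, for illustration only
low = estimated_bugs(program_size, 0.25)
high = estimated_bugs(program_size, 10)
print(low, high)  # 25.0 1000.0
```

Under these estimates the same program could contain anywhere from 25 to 1000 bugs, a factor of 40 between the two ends of the range.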
Basic Fault Tolerance Approaches and Techniques
Fault tolerance: a system design approach which recognizes that faults will occur.
Fault prevention (fault intolerance): aims at ensuring that the implemented system will not contain any faults. It has two aspects:
    Fault avoidance: the techniques used to make sure that faults are not introduced into the system. These involve detailed design methodologies such as design walkthroughs and design inspections.
    Fault removal: the techniques employed to detect any faults that might have remained in the system despite the application of fault avoidance, and to remove these faults.
Fault detection: issues a warning when a failure occurs but does not provide any means of tolerating the failure.
Latent failure: a failure that is detected some time after its occurrence.
Mean time to detect: the average error latency time over a number of identical systems.
Fail-stop module: a module that constantly monitors itself and, when it detects a fault, shuts itself down automatically.
Fail-fast: implemented in software by defensive programming, where each software module checks its own state during state transitions.
In summary:
Fault avoidance: techniques that aim to prevent faults from entering the system during the design stage.
Fault removal: methods that attempt to find faults within a system before it enters service.
Fault detection: techniques used during service to detect faults within the operational system.
Fault tolerance: techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults.
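The fail-fast and fail-stop ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the Account module, its invariant (the balance never goes negative) and the ModuleStopped exception are all hypothetical. The module checks its own state around each state transition and, once a check fails, shuts itself down and refuses further work.

```python
class ModuleStopped(Exception):
    """Raised when the module detects a fault or has already shut down."""


class Account:
    """Hypothetical module whose invariant is balance >= 0."""

    def __init__(self, balance):
        self.balance = balance
        self.stopped = False
        self._check_state()

    def _check_state(self):
        # Fail-fast: the module verifies its own state.
        if self.balance < 0:
            self.stopped = True  # fail-stop: shut down on a detected fault
            raise ModuleStopped(f"invariant violated: balance={self.balance}")

    def withdraw(self, amount):
        if self.stopped:
            raise ModuleStopped("module has shut itself down")
        self._check_state()      # check before the state transition
        self.balance -= amount
        self._check_state()      # check after the state transition
        return self.balance


acct = Account(100)
print(acct.withdraw(30))  # 70
try:
    acct.withdraw(500)    # drives the balance negative, a fault
except ModuleStopped as err:
    print("stopped:", err)  # stopped: invariant violated: balance=-430
print(acct.stopped)       # True
```

Note that the module only detects the fault and stops; it does not tolerate it. Keeping the system running after such a stop is the job of the fault-tolerant techniques listed above.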