
Analyzing Software Fault-Tolerance in Real-Time Systems Using Voting Technique


Pradeep Kumar, Amit Prakash Singh
University School of Information Technology, GGSIPU, Kashmere Gate, New Delhi – 110403.
pradeepkumarmca@yahoo.com, apsingh_cse@yahoo.com, aps.ipu@gmail.com

Abstract- Designing real-time systems is a challenging task. Most of the challenge comes from the fact that real-time systems have to interact with real-world entities. Fault-tolerance is the capability of a system to keep executing its intended function even when a number of errors are present. These errors may lie in the design of the supporting hardware or in the algorithm. Nearly all computing systems are real-time in some form and require some fault tolerance, but real-time systems proper are mission-critical or safety-critical applications whose success depends not only on logically correct output but also on producing that output within the stipulated time. Redundancy increases the chances of a system surviving faults but adds complexity too. The subject of this paper is the voting process, in which variants of the application run on multiple hardware modules in a connected environment and together support the successful execution of the real-time application.

I. INTRODUCTION

Real-time systems [1,2,3] are expected to work flawlessly within the constraints of their environment, reacting to the events that trigger them. Their success is assessed not only by logically correct output but also by the time in which the results are produced. This time limit is known as the deadline of the real-time system. A developer cannot guarantee that a system is completely free from errors. Errors may reside in the hardware or the software; they may be transient, intermittent or permanent design errors, or they may stem from erroneous operation by the user, externally induced upsets or physical damage. Software engineering practices that detect errors early in the software life cycle are crucial, because errors revealed late may end in catastrophic failure leading to casualties. Extensive research has been going on in this area to make real-time systems fault-tolerant. A fault-tolerant system must produce the correct output, and it must do so within the stipulated time frame. High availability is achieved through redundancy. This may be time redundancy, where a system in trouble re-executes the disturbed portion of its functionality or, if needed, re-executes from scratch, all of which must still be accomplished within the deadline. Hardware redundancy keeps another hardware instance as a standby; when the active hardware breaks down, the standby instance takes control of the application almost immediately, with little degradation in functionality or reliability. There is also software redundancy, where a number of application variants run on parallel computers or nodes in a distributed environment, and feed-forward recovery remains available to the system unless the application fails to execute correctly on all the nodes.

II. PARAMETERS OF A REAL-TIME PROCESS

A real-time process [9] is characterized by the parameters illustrated in Fig. 1.

[Figure: timeline of process i marking $A_i$, $S_i$, $F_i$ and $D_i$, with the intervals $W_i$, $C_i$ and $-L_i$.]
Fig. 1. Parameters of a real-time process

where the parameters are defined as,
$A_i$ - arrival time of a process ready to execute
$S_i$ - time at which the process starts executing
$F_i$ - time at which the process completes the task
$C_i$ - the amount of time taken to complete the task
$D_i$ - the deadline, i.e. the time by which the specified task must be successfully completed
$L_i$ - lateness, $L_i = F_i - D_i$
$W_i$ - process waiting time, $W_i = S_i - A_i$
$SL_i$ - slack time, defined as $SL_i = (D_i - F_i) + W_i$
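The parameter definitions of Section II can be turned into a small calculation. The following is a minimal sketch, not part of the original paper: the class name `ProcessTiming` and its field names are hypothetical, and the `computation` property assumes the process runs without preemption once started, so that the time taken equals finish time minus start time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessTiming:
    """Timing parameters of one real-time process (names are illustrative)."""
    arrival: float   # A_i: arrival time of the process
    start: float     # S_i: time at which the process starts executing
    finish: float    # F_i: time at which the process completes the task
    deadline: float  # D_i: deadline for completing the task

    @property
    def waiting(self) -> float:
        # W_i = S_i - A_i
        return self.start - self.arrival

    @property
    def computation(self) -> float:
        # C_i, ASSUMING no preemption between start and finish
        return self.finish - self.start

    @property
    def lateness(self) -> float:
        # L_i = F_i - D_i (negative when the task finishes early)
        return self.finish - self.deadline

    @property
    def slack(self) -> float:
        # SL_i = (D_i - F_i) + W_i, as defined in the text
        return (self.deadline - self.finish) + self.waiting

    @property
    def meets_deadline(self) -> bool:
        return self.lateness <= 0


if __name__ == "__main__":
    p = ProcessTiming(arrival=0.0, start=2.0, finish=7.0, deadline=10.0)
    print(p.waiting, p.computation, p.lateness, p.slack, p.meets_deadline)
```

A negative lateness means the process finished before its deadline; a fault-tolerance mechanism such as re-execution can only be applied while that margin lasts.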
Processes are divided into periodic and aperiodic processes. A periodic process regularly performs identical activities within its time slice, whereas an aperiodic process performs activities that do not recur at regular intervals.

During its life cycle a process passes through various states while executing its assigned activities. A process entering the system first goes to the ready state, where it waits for the processor to become available, and changes to the running state while the processor executes its task. If an interrupt occurs in the meantime, the process enters the waiting state and afterwards continues with its remaining activities. The process finally accomplishes the task and changes to the terminate state. Through all these state changes, the process is in exactly one state at any time. Fig. 2 shows the states in the life cycle of a process, which are similar for a real-time process.

[Figure: process state diagram with the states start, Ready, Running, Waiting and Terminate; recoverable transition labels are CPU, interrupt, waiting for I/O tasks, and waiting due to preemption.]
Fig. 2: Life cycle of a process

III. EMBEDDING FAULT-TOLERANCE INTO REAL-TIME SYSTEMS

Avizienis in 1976 gave two complementary approaches [4, 5] for the construction of reliable systems. The first, the fault prevention approach, tries to ensure that the implemented system does not and will not contain any faults. It has two aspects:

1. Fault avoidance techniques, which employ system design methodologies, quality control, testing etc. to avoid introducing faults into the system.

2. Fault detection and removal techniques, which employ testing, validation etc. to find the faults that were inadvertently introduced into the system and then remove them.

Shrivastava [4,5] stated that a reliable computing system must be capable of providing normal services in the presence of a finite number of component failures. Faults within a system cause its failure. These faults could be present either in the components (hardware) of the system or in its design (algorithm). Fault-tolerance in a real-time system is possible through the following:

A. Supporting through redundancy

Redundancy means duplicating a resource and running the duplicate to provide standby support. There are three ways to provide redundancy in a real-time system:

1. Time Redundancy, re-executing a portion of the application or the whole application in case of failure.

2. Hardware Redundancy, keeping standby instance(s) of the hardware items that change to the active instance and take over control from the failed part.

3. Software Redundancy, achieved by creating variants of the application and executing them concurrently on different nodes in the system.

Checkpointing is required to implement time redundancy in a real-time system, as shown in Fig. 3. A backup of the running system is taken periodically and saved to secondary storage. When a failure occurs, the system is restored from the last saved checkpoint. Transient faults can be tolerated with rollback techniques using checkpointing.

[Figure: timeline from $A_i$ to $D_i$ (the deadline period) showing checkpoints saved during normal execution, a failure, a return to the last checkpoint, recovery from it, and continued execution.]
Fig. 3: Recovering from the checkpoint

B. Fault-tolerance through Hardware

Real-time systems achieve hardware-based fault-tolerance through hardware redundancy, which enables high availability even under hardware fault conditions. This, however, adds a large cost to the overall system budget. Hardware redundancy is provided in one of the following ways:

1. One for One Redundancy, a case of parallel redundancy in which one copy of the active hardware module is kept available as a standby and is regularly
synchronized to take over as the active one if the former fails to operate normally.

Fig. 4 gives the state transition diagram for fault handling with One for One redundancy.

[Figure: state transitions among "Instance 0: Active, Instance 1: Standby", "Instance 0: Standby, Instance 1: Active", "Instance 0: Error, Instance 1: Active" and "Instance 0: Failed, Instance 1: Active"; recoverable labels are Switchover, fault in instance 0, checking instance 0, instance 0 passes checking, and repairing instance 0.]
Fig. 4. Fault handling state transition diagram

3. X out of N Redundancy, a special case of parallel redundancy requiring at least X hardware modules to support the N parallel running units. The higher-level module X regularly watches the system and takes control of a failed unit among the N, thus adding support for multiple failures at once. The binomial distribution gives the reliability of the system, $R_S(X, N, R)$, where R is the individual component's reliability.

4. Load Sharing, which shares the load among the parallel running hardware modules. If one of the units fails, the load is immediately redistributed among the remaining modules, which also degrades the overall system's performance.

Measuring hardware fault-tolerance usually involves the following quantities: maximizing the Mean Time Between Failures (MTBF), which is the average time between failures of the hardware as estimated by the manufacturer; minimizing the total number of hardware failures in a billion hours, expressed as FITS; minimizing the Mean Time To Repair (MTTR), which involves hardware repair or, nowadays, simply replacing the hardware module; Availability of the hardware module, which is the percentage of time for which the system is operational, expressed in terms of nines ('9'), e.g. 99.999%; and Downtime per year, which is a more intuitive way to understand availability. The availability of a hardware module is calculated as,

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \tag{1}$$

C. Fault-tolerance through Software

Most real-time systems overlook software fault-tolerance while focusing on hardware fault-tolerance. Yet most faults are found in the software portion of a system; they stem from human error and cannot all be captured even by extensive software testing. Software fault-tolerance is the ability of the real-time system to work correctly in accordance with its specification in spite of the presence of a finite number of errors. The methods employed for this are almost the same as those for hardware fault-tolerance.

Software failures can be characterized by keeping track of the software defect density in the system, which depends on factors such as the complexity of the software, the software process used in development, the extent of the testing done by the testers, the skill level of the development team, etc.

Software Defect Density is the number of defects per thousand lines of code (KLOC),

$$\text{Defect Density} = \frac{\text{No. of defects}}{\text{KLOC}} \tag{2}$$

Software fault-tolerance is achieved in the following ways:

1. Timeouts. Most real-time systems use timers to keep track of feature execution. A timeout generally signals that some entity involved in the feature has misbehaved and that corrective action is required. The corrective action could be to retry the message interaction or to abort the feature. The choice between retrying and aborting on a timeout depends on several factors of the system.

2. Audits. Data distributed across multiple processors may become inconsistent because of lack of synchronization. A simple strategy to overcome this problem is to implement an audit program that regularly checks the consistency of the data structures across the processors by performing predefined checks.

3. Exception handling, using exception handling code to process messages so that the system never enters an infinite loop.

4. Task Rollback. While the system is running smoothly, checkpoints are saved regularly to stable storage; in case of failure they are used to restore the system from the last checkpoint or, if needed, from scratch.

5. Incremental reboot. To minimize downtime, the software processor reboot is done in incremental phases: processor reboot alone in the first phase; processor reboot with the configuration data in the second phase; and processor reboot with configuration data followed by application reload in the third phase.

6. Voting, the feed-forward recovery scheme in which at least three variants of the application are executed on
different nodes running in a distributed environment. Any fault in one of the variants is voted out by the successful completion of the other executing variants. The voting scheme is the main topic of this paper.

Measuring software fault-tolerance again involves the parameters used for hardware fault-tolerance. MTBF for software can be estimated from the product of the defect rate and the KLOCs executed per second; strictly, this product is a failure rate (failures per second), and the MTBF is its reciprocal. MTTR for a software module can be computed as the time taken to reboot after a software fault is detected. MTTR for software depends on factors such as the software fault-tolerance techniques used, the operating system, and the application image downloading techniques. FITS, Availability and Downtime per year for software are the counterparts of the corresponding hardware fault-tolerance parameters.

IV. SYSTEM AVAILABILITY IN PARALLEL

The combined availability of n components in parallel is 1 minus the probability that none of the components is available. This gives the following equations.

System availability of 2 components in parallel,

$$A = 1 - (1 - \text{Availability of component } X)^2 \tag{3}$$

System availability of 3 components in parallel,

$$A = 1 - (1 - \text{Availability of component } X)^3 \tag{4}$$

Similarly, system availability of n components in parallel,

$$A = 1 - (1 - \text{Availability of component } X)^n \tag{5}$$

Now consider the truth table of a three-input OR gate, shown in Table I. The availability column follows from equations (3)-(5) applied to the components whose input is 1.

TABLE I
AVAILABILITY OF SYSTEM IN PARALLEL

Input X1   Input X2   Input X3   ORed Output     System Availability
(0/1)      (0/1)      (0/1)      (X1 + X2 + X3)  (taking Availability of X as 99%)
0          0          0          0               0
0          0          1          1               99%
0          1          0          1               99%
0          1          1          1               99.99%
1          0          0          1               99%
1          0          1          1               99.99%
1          1          0          1               99.99%
1          1          1          1               99.9999%

The implication of the equations used in Table I is that the combined availability of two or more components in parallel is always higher than the availability of the individual components. Parallel operation thus provides a very powerful mechanism for building a highly reliable system from components of low reliability. For this reason, all mission-critical systems are designed with redundant components.

V. FAULT HANDLING TECHNIQUE

Redundancy increases the chances of a system surviving faults, but the complexity also increases many times over. This section describes a technique used in fault handling software design via software redundancy.

Malik and Rehman [7, 8] proposed the fault tolerance model shown in Fig. 5 for real-time systems, which incorporates the concept of time stamped fault tolerance. The proposed model is based on the voting technique of software fault-tolerance described above. The scheme is based on a feed-forward ANN structure in which nodes are interconnected with each other, with adaptive (changing) weights on each interconnection. In this scheme, three nodes (computers), which are hardware copies with similar specifications, are interconnected with each other. Each node has only one processor. Three variants of the application run, one on each of the three nodes. These design variant algorithms have approximately equal computation time. The scheme provides a forward recovery mechanism through three parallel executing combinations of hardware and software.

[Figure: a common input feeds the variants X1 on Node1, X2 on Node2 and X3 on Node3; each node's U and O values pass to a decision mechanism (DM) followed by an acceptance test (AT) with pass and fail outcomes; the arrangement is labelled forward recovery.]
Fig. 5. Proposed model [7,8]
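The forward recovery by voting described in this section can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not the time stamped, weighted scheme of [7, 8]: it uses a plain majority vote over the outputs that arrived within the deadline, followed by an acceptance test. The names `NodeReport`, `run_on_node`, `decide`, `is_sorted`, `bubble_sort` and `faulty_sort` are all hypothetical, the three "nodes" are simulated by sequential calls on one machine, and sorting is used as the application only as an example.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class NodeReport:
    output: Optional[List[int]]  # None if the variant crashed on this node
    elapsed: float               # seconds the variant took on this node


def run_on_node(variant: Callable[[List[int]], List[int]],
                data: Sequence[int]) -> NodeReport:
    """Run one variant and time it; a crash is reported, not propagated."""
    start = time.perf_counter()
    try:
        output = variant(list(data))
    except Exception:
        output = None
    return NodeReport(output, time.perf_counter() - start)


def decide(reports: Sequence[NodeReport], deadline: float,
           acceptance_test: Callable[[List[int]], bool]) -> Optional[List[int]]:
    """Decision mechanism: drop missing or late outputs, then try the
    remaining outputs in order of votes until one passes the acceptance test."""
    on_time = [tuple(r.output) for r in reports
               if r.output is not None and r.elapsed <= deadline]
    for candidate, _votes in Counter(on_time).most_common():
        result = list(candidate)
        if acceptance_test(result):
            return result
    return None  # every node failed, overran, or produced a rejected output


def is_sorted(values: List[int]) -> bool:
    """Acceptance test for the sorting example."""
    return all(a <= b for a, b in zip(values, values[1:]))


def bubble_sort(items: List[int]) -> List[int]:
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items


def faulty_sort(items: List[int]) -> List[int]:
    return sorted(items, reverse=True)  # deliberately injected fault


if __name__ == "__main__":
    data = [5, 3, 8, 1]
    reports = [run_on_node(v, data) for v in (sorted, bubble_sort, faulty_sort)]
    print(decide(reports, deadline=1.0, acceptance_test=is_sorted))
```

Because the faulty variant is outvoted and late or missing outputs are simply ignored, the sketch keeps producing a result as long as at least one node delivers an acceptable output on time.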
[7, 8] take three variants of the application X, running algorithm X1 on node Node1, algorithm X2 on Node2 and algorithm X3 on the last node, Node3. A variable U, which depends on the timing of each algorithm's output on its node, and a function O on each node select the output from the algorithm running on that node. This function value, along with the total weighted time from each node, is passed to a decision mechanism for output selection, and the selected output is finally verified by an acceptance test mechanism.

This scheme provides automatic forward recovery employing the voting technique for software fault-tolerance. If a node fails to produce output, or produces output after a time overrun, the system does not fail. It continues to operate with the remaining nodes. The system produces output until all the nodes fail.

VI. FUTURE WORK

To assess the various fault-tolerance parameters of such systems, it is proposed to model a real-time application built around a sorting algorithm. Three variants of the sorting algorithm will be executed in a distributed environment with three connected nodes, and one of these algorithms will be injected with a fault. The fault-tolerance of the real-time system will then be analyzed in MATLAB's real-time simulation environment.

REFERENCES

[1] Phillip A. Laplante, Real-Time Systems Design and Analysis: An Engineer's Handbook, IEEE Press, PHI, 2001.
[2] C. M. Krishna, Real-Time Systems, 1st ed., McGraw-Hill Higher Education, 1996.
[3] Jane W. S. Liu, Real-Time Systems, Pearson Education, Inc., 2007.
[4] N. Viswanadham, Reliability and Fault-Tolerance Issues in Real-Time Systems, Indian Academy of Sciences, Bangalore, 1987.
[5] S. K. Shrivastava, "A Tutorial on the Principles of Fault Tolerance", Sadhana, Vol. II, Parts 1&2, October 1987, pp. 7-22.
[6] Alan Burns, Neil Audsley, Andy Wellings, "Real-Time Distributed Computing", 5th IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS), 1995, pp. 34.
[7] Malik, S., Rehman, M. J., "Time Stamped Fault Tolerance in Real Time Systems", 9th International Multitopic Conference (IEEE INMIC 2005), Center for Software Dependability, Mohammad Ali Jinnah Univ., Islamabad, pp. 1-5.
[8] Malik, S., Rehman, M. J., "A framework for fault tolerance in distributed real time systems", Proceedings of the IEEE Symposium on Emerging Technologies, 17-18 Sept. 2005, pp. 505-510.
[9] Jakovljevic, G., Rakamaric, Z., Babic, D., "Java simulator of real-time scheduling algorithms", Proceedings of the 24th International Conference on Information Technology Interfaces (ITI 2002), 24-27 June 2002, pp. 411-416, vol. 1.
[10] Martinovic, G., Budin, L., Hocenski, Z., "Undergraduate teaching of real-time scheduling algorithms by developed software tool", IEEE Transactions on Education, Feb. 2003, Vol. 46, Issue 1, pp. 185-196.
[11] R. A. Orr, M. T. Norris, R. Tinker, C. D. V. Rouch, "Tools for real time system design", Proceedings of the 10th International Conference on Software Engineering, Singapore, 1988, pp. 130-137.
[12] Stephane De Vroey, Joel Goossens, Christian Hernalsteen, "A Generic Simulator of Real-Time Scheduling Algorithms", 29th Annual Simulation Symposium (SS '96), New Orleans, LA, April 08-11, 1996.
[13] Furht, B., "The design of a real-time system for simulators and trainers", Proceedings of the IEEE Workshop on Real-Time Applications, 13-14 May 1993, New York, NY, pp. 18-22.

AUTHOR INFORMATION

Pradeep Kumar, MTech (IT-Weekend) candidate, University School of Information Technology, GGSIPU, pradeepkumarmca@yahoo.com

Amit Prakash Singh, Assistant Professor, University School of Information Technology, GGSIPU, apsingh_cse@yahoo.com, aps.ipu@gmail.com
