Fault Tolerance in Distributed Systems

Yevgeniy Gershteyn
Department of Computer Science
Rochester Institute of Technology
Rochester, NY, USA
yxg7752@cs.rit.edu
February 5, 2003
Abstract

This paper reviews fault-tolerance aspects of distributed systems. Because partial failure is always possible, a system should routinely recover from it without disturbing overall performance. There is a large body of research on how to make distributed systems fault tolerant. In this paper I analyze some ideas on different aspects of fault tolerance: new research on cooperative leases and scalable consistency maintenance in content distribution networks; improving fault tolerance by replicating agents; and a problem-specific fault-tolerance mechanism for a branch-and-bound algorithm in asynchronous distributed systems.

1 Introduction

“An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance” [5(p361)]. Partial failure is the key problem of a distributed system and is what differentiates it from a single-processor system. Distributed systems are designed to work from different locations, or in other words on different processors, and each processor may go down unexpectedly or be disconnected from the network. The designer of a distributed system should therefore build a system that controls failures or, even better, tries to prevent them. Fault-tolerant systems are closely related to dependable systems, which are characterized by the following requirements [4]: availability, reliability, safety, and maintainability. Availability requires that any component be ready to use at any time; failure to meet this constraint might cause failure of the entire distributed system. Another constraint of a dependable system is reliability. It differs from availability in that it means “that a system can run continuously without failure” [5] for a particular amount of time. Safety is very important for control systems, where any failure can cause disaster for the surrounding environment. Finally, maintainability indicates how quickly a system can be repaired and brought back to work.

There exist several techniques for designing fault-tolerant systems, such as leasing and replication. In particular, a lease is a kind of agreement between client and server that holds for a limited time.

2 Cooperative leases: scalable consistency maintenance in content distribution networks.

This article reports the results of research carried out at the University of Massachusetts in affiliation with IBM, Intel, EMC, and Sprint. It introduces the notion of “cooperative consistency along with a mechanism, called cooperative leases” [1]. “By supporting ∆-consistency semantics and by using a single lease for multiple proxies, cooperative leases allows the notion of leases to be applied in a flexible, scalable manner to CDNs” [1] (Content Distribution Networks). The major point is to divide the information stream into different levels of consistency, depending on how current the information must be. Depending on the user's requirements, more important information can then be obtained faster than if it were kept in the same cache as all other objects. For example, for a financial company, financial information is more important than sports scores. This partitioning
may bring efficiency and flexibility to any type of information.

The authors propose ∆-consistency semantics, which undertake to keep all cached data as up to date as the server's copy. By adding an extra parameter to the original lease approach, the authors assume that if the user specifies an update rate for an object, then the server will notify the client at that interval. This lets the server respond differently to different clients, so that each receives updated information at the rate it requested. Another modification is to “allow a server to grant a single lease collectively to a group of proxies, instead of issuing a separate lease to each individual proxy. For each cached object, the proxy group designates an invalidation proxy, referred to as the leader” [1]. The leader interacts with the server about the lease on behalf of all proxies in the group. This solution has some downsides. First, there is a bottleneck, which appears when one proxy is in use; second, the leader is responsible for broadcasting this information to the group. However, the solution also has advantages: “(i) it reduces the amount of state maintained at a server (by using a single lease to represent a proxy group instead of an individual proxy); and (ii) it reduces the number of notifications that need to be sent by the server (by offloading some of notification burden to leader proxies)” [1]. Moreover, the authors did not mention that the leader becomes a single point of failure: either the leader needs some kind of replication, or every proxy should be able to act as a leader. This involves additional overhead for each client, since each has to be able to multicast.

I may sound skeptical, but to support this theory the authors used a simulated environment that they implemented themselves. There is no guarantee that the simulator behaves as real networks would. Nevertheless, let us look at the experimental results. The “results show a factor of 2.5 reductions in server message overhead and a 20% reduction in server state space overhead when compared to original leases albeit at an increased inter-proxy communication overhead” [1]. The proposed configuration has a lease server that is responsible for lease processing, and each client has a lease handler that lets it act either as a leader or as an ordinary client. As a result, the authors claim that their contribution to fault tolerance, “cooperative leases meets the goals of flexibility and scalability by (i) employing ∆-consistency semantics, (ii) using a single lease to represent multiple proxies and (iii) using application-level multicast to propagate server notifications” [1].

3 Improving fault-tolerance by replicating agents.

Replicating services increases the complexity of the system. This article examines “the use of transparent agent replication, a technique in which the replicates of agents appear and act as one entity thus avoiding an increase in system complexity and minimizing additional system loads.” Another point is “inter-agent communication, read/write consistency, resource locking, resource synthesis and state synchronization. An implementation of the transparent agent replication for the FIPA-OS framework is presented and the results of testing it within a real-world multi-agent system are shown” [2].

Multi-agent systems (MASs) are an appropriate solution for developing complex distributed systems, but they remain vulnerable to every kind of fault that any distributed system may have. “The modular nature of a MAS gives it a certain level of inherent fault tolerance, however the non-deterministic nature of the agents, the dynamic environment, the inter connectedness of the agents and the lack of any central control point make it impossible to foresee possible fault states and make fault handling behavior unpredictable” [2]. As a consequence, the entire system may crash because a single agent had a small error. The authors base their assumptions on an open environment, where agents are not reliable and might not be fault tolerant. “In an open system, agents may be malicious, poorly designed or poorly implemented, hosts may get overloaded or fail, network connections may be slow or down” [2].

“There are four essential characteristics of a multi-agent system: a MAS is composed of autonomous software agents, a MAS has no single point of control, a MAS interacts with a dynamic environment, and the agents within a MAS are social (agents communicate and interact with each other and may form relationships)” [2]. Moreover, various types of fault can bring about a system-wide crash. They can come from program defects, sudden changes, processor and communication faults, and emergent behavior that was not predicted. In addition, there are many techniques for replicating agents, and they are either agent-centric or system-centric. The major difference between them is that the system-centric approach has a more complex structure but can catch system-wide faults. The mixture of these two approaches
is classified as a Dependable Mobile Agent System. However, both techniques use agent replication, and “the redundant replicates and any proxies are external to original agent, using a system centric approach, but individual agents must have the capability to utilize replication” [2].

“Agent replication is the act of creating one or more duplicates of one or more agents in a multi-agent system” [2]. Each of these duplicates may perform the same task as the original agent. The duplicate agents form a replicate group, and each member is called a replicate.

As soon as a replicate group is formed and is visible to all members of the MAS, “there are several ways that agents can interact with it:
• An agent can send requests to each replicate in turn until it receives an appropriate reply.
• An agent can send requests to all replicates and select one, or synthesize the replies that it receives.
• An agent can pick one of the replicates based on a particular criteria (speed, reliability, etc.) and interact only with that agent” [2].

Replication can be heterogeneous or homogeneous. The major difference is that heterogeneous replicates have an implementation separate from that of the primary agent, whereas homogeneous replicates are exact copies of the original agent. Homogeneous replication may suffer the same faults as the original agent, because it shares the same implementation: defects left in that implementation may show up in all replicas. Heterogeneous replication, on the other hand, may continue operating when the original agent fails. Another useful technique is to keep replicas running different versions of the application: if something that worked in the previous version goes wrong in the new one, the previous-version replica can take over from the original agent. Once a group is constructed and operating within the MAS, some issues may arise, such as “inter-agent communication and result synthesis, read/write consistency, and state synchronization” [2]. In this case there is a possibility that the system may fail at any point. The authors introduce transparent replication, which uses a proxy to communicate between replicas and clients. Thus “proxies provide two important functions: they make the replicate group appear to be a single entity and they control execution and state management of a replicate group” [2]. In detail, the proxy provides “communication between replicates and other agents in the MAS” [2]. In this case the proxy becomes a single point of failure; it requires backups of itself, so that if one proxy goes down, another can take over the main proxy's functions. This adds further overhead to the solution, but it pays off by improving the reliability of the system.

Another proxy role is to exchange data between replicas; this proxy makes sure that all members receive the same data. The downside is that it, too, becomes a single point of failure. It may also introduce a delay between the time data is received and the time it is copied to the replicates, so that they become outdated.

“The other role that proxies can fill is management of the replicate group. Through various replicate group management policies, some of the replication issues can be dealt with. Three replication management policies are identified: hot-standby, cold-standby, and active.” [2]. Depending on the policy, the system decides which proxy will start after the currently working proxy goes down. Proxies may also “improve performance management of the replicate group” [2].

“This paper introduced the topic of agent replication and examined the issues associated with using agent replication in a multi-agent system. Transparent proxies were introduced as a method of dealing with the main issues of agent communication, read/write consistency and state synchronization” [2]. These ideas were implemented at the University of Saskatchewan, Saskatoon, Canada, and the I-HELP MAS is a working system “for the replication server. The FIPA-OS [6] agent toolkit was chosen as a platform for the implementation” [2].

4 A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems.

This article presents a new fault-tolerance mechanism for distributed systems, based on research on different kinds of networks, such as LANs and WANs. Because these networks provide various services, communication between users over them cannot be trusted. The authors therefore designed an algorithm that provides reliable service and scales with the available resources. In fact, scalability is achieved by “a fully decentralized algorithm, in which the dynamically available resources are managed
through a membership protocol” [3]. Fault tolerance, on the other hand, is assured in the sense “that the loss of up to all but one resource will not affect the quality of the solution” [3].

A “branch-and-bound search algorithm” was used to solve this problem. This algorithm “is intelligent search method often used for optimization problems” [3], such as NP-hard problems. The original problem is decomposed into smaller problems, which are then solved. In this algorithm the order of the operations is important. First, in the decompose stage, the problem is split “into a set of new subproblems” [3]; this is the branching step. The next step is bounding, which computes “a bound value l(v) on the optimal solution of subproblem v. This bound value will be used by Select and Eliminate operations” [3]. During the select operation, each subproblem is checked to see whether its bound value is better than the best previously known one; if it is, the subproblem is stored “into the pool of the active problems” [3]. Otherwise it is eliminated. “The best-known solution is updated when better feasible solution is found” [3]. The best performance is obtained when most operations are carried out simultaneously, and parallel designs differ in being synchronous or asynchronous and in their work-sharing and information-sharing mechanisms. In a synchronous design all processes wait for one another to complete a step; in an asynchronous design they do not. Work sharing monitors the processes and assigns work to any that are available, so that they can work in parallel. Information sharing stores what is known about the solutions found so far, so that every process can use the best-known solution for the problem.

Internet-connected computers were chosen as the target architecture for the branch-and-bound search algorithm. This architecture exhibits all the characteristics of a distributed system [3]: scalability, dynamic availability, unreliability, communication characteristics, heterogeneity, and lack of centralized control. The proof relies on some assumptions, but they do not completely undermine the practical value of the algorithm. In my opinion, however, any assumptions make a proof or an algorithm somewhat artificial.

Let us look at the algorithm in more detail. A “fully decentralized, asynchronous, fault tolerant parallel B&B algorithm” [3] is considered to meet the system requirements. “Each process maintains its local pool of problems to be solved. When the local pool is empty, the process sends work requests to other processes. A process that receives a work request and has enough problems in its pool removes some of those problems and sends them to the requester.” [3] This describes the dynamic design chosen by the algorithm's designers. As usual, dynamic methods provide better performance and consistency, and here all processes update their data with better solutions for future reference and faster decision making. However, to make this solution more reliable, some extensions were added, such as “a group membership protocol to allow dynamic variation in the number and components of resources and a fault tolerance mechanism” [3].

The group membership protocol is there for “collecting and updating information about which resources participate in the computation at any given time” [3]. It works very simply. When a new process or computer arrives, it sends a request to join the group to a known server, which controls the group and knows whether any member is present. If none is, the group is initialized; if the group is already active, the server simply adds the new member to it. Conversely, when a process leaves the group for any reason, such as self-termination or failure, the membership server becomes aware of it. As a result, the system always knows which components are present and what their status is. Moreover, the server is able to log the activity of all members and to check that no process has timed out after a period of inactivity. The fault-tolerance mechanism does not capture “failures of computers and restore their data, but rather focuses on detecting missing results” [3]. This is achieved by placing each problem at a particular position as a node in the tree and computing a code for each position in the tree. As a result, each node can be identified by a unique code. “Furthermore, given a set of nodes of the tree, we can easily find its complement, that is, the list of nodes of the tree that are not in the given set” [3].

The algorithm was evaluated using a simulation of real systems. This simulation, as the authors say, “provides great flexibility in testing a wide range of B&B strategies in a variety of Internet-like environments” [3]. Even though the experiments were made using “small problems”, the results showed that the overall performance increased. During these experiments the simulator reported different measurements, such as “execution time, communication costs, and storage space” [3].
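The idea of identifying every node of the search tree by a unique code, and of finding the complement of a set of completed nodes, can be illustrated with a short sketch. The Python below is not the implementation from [3]. It assumes a binary search tree and encodes each node as the string of 0/1 branch choices on the path from the root, with the empty string standing for the root. The function names sibling, compress, and complement are hypothetical and chosen only for this illustration.

```python
def sibling(code):
    """Code of the other child of the same parent (binary tree assumed)."""
    return code[:-1] + ("1" if code[-1] == "0" else "0")


def compress(done):
    """Merge completed sibling subtrees into their parent, repeatedly."""
    done = set(done)
    # Drop codes that lie inside an already completed subtree.
    done = {c for c in done if not any(c[:i] in done for i in range(len(c)))}
    merged = True
    while merged:
        merged = False
        # Deepest codes first, so merges can propagate toward the root.
        for code in sorted(done, key=len, reverse=True):
            if code and code in done and sibling(code) in done:
                done -= {code, sibling(code)}
                done.add(code[:-1])
                merged = True
    return done


def complement(done):
    """Roots of the subtrees for which no result has been reported."""
    done = compress(done)
    if not done:
        return {""}      # nothing reported yet: the whole tree is missing
    if "" in done:
        return set()     # the root is complete: nothing is missing
    missing = set()
    for code in done:
        # Walk down the path to this node and look at each branch not taken.
        for depth in range(1, len(code) + 1):
            other = sibling(code[:depth])
            # The branch is missing if no completed node lies at or below it.
            if not any(d.startswith(other) for d in done):
                missing.add(other)
    return missing


if __name__ == "__main__":
    # Subtrees "00" and "1" were reported; only "01" is still unaccounted for.
    print(sorted(complement({"00", "1"})))  # ['01']
```

In this sketch a surviving process could take the codes returned by complement as the work that still has to be redone after a failure. An empty result means that every subproblem is accounted for, which is also a simple way to detect termination under the stated assumptions.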
As a result of the experiments, the authors establish “a failure-recovery mechanism suited for a tree-like problem space. This mechanism and a low cost group membership protocol are the ingredients that transform a rather conventional parallel branch-and-bound algorithm into a scalable, reliable, more powerful algorithm, able to exploit the computational power of hundreds of Internet-connected resources. Scalability is achieved through a fully distributed design. The algorithm is fault tolerant under our assumptions and can execute and terminate correctly even if only a single resource remains available” [3]. Moreover, they found a solution for “difficult problems of fault tolerance and termination detection in distributed environments by exploiting problem-specific features, specifically the tree structure of the problem space” [3].

5 Conclusion

In this paper I researched some fault-tolerance aspects of distributed systems. Since I like practical solutions more than theoretical research, I tried to analyze the papers from a practical point of view. In some cases the theoretical ideas are really good, but the assumptions that were made greatly simplify the problem. In any case, the research was carried out, and some results showed that there is a big opportunity to improve distributed systems from the fault-tolerance perspective. The major point of all of them is to keep the system functioning even if any part of it goes down. This would make a system, especially a distributed one, available, reliable, safe, and easily maintainable, since taking out one element of the system will not stop it from functioning, and during the downtime that element can be improved to make the system more secure, or updated. Of these three papers, I prefer the second, on improving fault tolerance by replicating agents. It makes the system more reliable, even if a higher price is paid to achieve that. The first paper, for example, tries to improve distributed systems by inventing a new type of lease and does not mention the fault-tolerance issues of the piece of hardware added to the system: the leader proxy becomes another single point of failure. The third paper takes a more theoretical approach to distributed systems, but it still provides a powerful foundation for future development. I think that my research on this topic opened my eyes and showed that there are still holes, and that there are ways to improve existing systems or even to design new ones based on them.

References

[1] A. Ninan, P. Kulkarni, P. Shenoy, K. Ramamritham, R. Tewari. Cooperative leases: scalable consistency maintenance in content distribution networks. Proceedings of the eleventh international conference on World Wide Web, Honolulu, Hawaii, USA, pages 1 – 12, 2002.

[2] A. Fedoruk, R. Deters. Improving fault-tolerance by replicating agents. Proceedings of the first international joint conference on Autonomous agents and multi agent systems: part 2, Bologna, Italy, pages 737 – 744, 2002.

[3] A. Iamnitchi, I. Foster. A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems. Proceedings of the 2000 International Conference on Parallel Processing, Toronto, Canada, pages 4 – 13, August 2000.

[4] H. Kopetz, P. Verissimo. Real Time and Dependability Concepts. In S. Mullender. Distributed Systems, pp 441-446. Wokingham: Addison-Wesley, 2nd ed., 1993.

[5] A. Tanenbaum, M. van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.

[6] FIPA-OS. http://fipa-os.sourceforge.net, retrieved February 05, 2003.
