
Description

FIELD OF THE INVENTION

This invention relates generally to file systems, and more particularly to file systems distributed over multiple computer systems.

BACKGROUND OF THE INVENTION

In modern computer systems, large collections of data are usually organized on disk storage as files. If the number of files is large, then the files may be distributed over multiple computer systems. Users' programs access the files by requesting file services from one or more file systems. The file systems also perform administrative actions such as controlling coherent access by the clients, communicating with physical storage components, maintaining redundant copies, and recovering from failure.

In most file systems, the files comprise user data and metadata. The metadata are all the information required to manage the user data, such as names, locations, dates, file sizes, access protection, and so forth. The organization of the user data is usually managed by the client programs.

It is laborious to administer a large distributed file system that serves a large and growing user community. For instance, to store more files and to serve more users, one must add more disks and more server computers. Each of these components requires human attention. To simplify the distribution of files, groups of files or "volumes" are often manually assigned to particular disks. The files can then be manually moved or replicated when components fill up, fail, or become throughput bound. Joining many thousands of files distributed over many disks into a redundant array of independent disks (RAID) is only a partial solution; administration problems still arise when the system grows so large as to require multiple RAIDs and multiple server processors.

In the prior art, there have been numerous attempts to construct distributed file systems that are scalable. Scalable in this context means that the file system can be adjusted to any desired size without changing the underlying architecture of the system. Some of these prior art file systems are now described to illustrate the need for a better scalable file system.

The Cambridge File Server (CFS), described by Birrell et al. in "A universal file server," IEEE Transactions on Software Engineering, SE-6(5):450-453, September 1980, takes a two-layered approach to building a distributed file system. There, the layers provide the users with two abstractions: files and indexes. File systems built on the two layers can use these abstractions to implement a distributed file system. As a characteristic, the CFS manages the entire distributed file system from a single server computer. Controlling data flow from a single server is simple, but in situations where a single server cannot handle the task, the CFS falls short. Also, a single-server-based system is vulnerable to failure.

The Network File System (NFS), as described by Sandberg et al. in "Design and implementation of the Sun network file system," Proceedings of the Summer USENIX Conference, pages 119-130, June 1985, is not a file system in itself, but rather a remote file access protocol. The NFS protocol provides a weak notion of cache coherence, and its stateless design requires client users to make many unnecessary and frequent accesses to the servers to maintain a marginal level of coherence in the data.

The Andrew File System (AFS), described by Howard et al. in "Scale and performance in a distributed file system," ACM Transactions on Computer Systems, 6(1):51-81, February 1988, and its offshoot DCE/DFS, as described by Kazar et al. in "DEcorum file system architectural overview," Proceedings of the Summer USENIX Conference, pp. 151-164, June 1990, provide better cache performance and data coherence than NFS. AFS is designed for a different kind of scalability than will be described herein. The AFS has a global name space and security architecture that allows client computers to connect to many separate file servers using a wide area network.

The Echo file system, described by Mann et al. in "A coherent distributed file cache with directory write-behind," ACM Transactions on Computer Systems, 12(2):123-164, May 1994, is log-based. The Echo file system replicates data for reliability, and access paths are allowed to span multiple disks for availability. In addition, the Echo file system provides coherent caching. However, the Echo file system cannot easily be scaled. There, each volume can only be managed by a single server computer. Failover, in the case of a hardware failure, can only be to a predetermined backup server. A volume can only span as many disks as can be connected to a single server. Although there is an internal layering of file services on top of a disk service, the Echo file system requires both layers to execute in the same address space on the same machine.

The VMS Cluster file system, described by Strecker et al. in "VAXclusters: A closely-coupled distributed system," ACM Transactions on Computer Systems, 4(2):130-146, May 1986, offloads file system processing to individual servers that are members of a cluster, i.e., a plurality of closely-coupled computers. Each server in the cluster executes its own instance of the file system program in conjunction with a shared physical disk. Synchronization is provided by a distributed lock service. The shared physical disk is accessed either through a special-purpose cluster interconnect (CI) to which a disk controller can be directly connected, or through an ordinary local area network (LAN) such as Ethernet, and a processor acting as a disk server.

The Spiralog file system, described by Johnson et al. in "Overview of the Spiralog file system," Digital Technical Journal, 8(2):5-14, 1996, also off-loads processing of its file system to individual members of a cluster of interconnected servers that run above a shared storage system layer. The interface between layers in the Spiralog file system differs from the VMS Cluster file system because the lower layer is neither file-like nor simply disk-like. Instead, Spiralog provides an array of stably-stored bytes, and permits atomic actions to update arbitrarily scattered sets of bytes within the array. Spiralog's split between layers simplifies the file system, but complicates the storage system considerably.

Spiralog does not scale easily, nor does Spiralog tolerate hardware faults readily. A Spiralog volume can only span the disks connected to a single server, and the volume becomes unavailable when the server suffers a failure.

Though designed as a cluster file system, Calypso, described by Devarakonda et al. in "Recovery in the Calypso file system," ACM Transactions on Computer Systems, 14(3):287-310, August 1996, is more similar to Echo than to the VMS Cluster file system. Like Echo, Calypso stores its files on multi-ported disks, i.e., disks that can be accessed by multiple servers. One of the servers directly connected to each disk acts as a file server for data stored on that disk; when the server fails, another server takes over. Other servers in a Calypso cluster access the current server as file system clients. Like Echo, the client computers can maintain coherent caches using a multiple-reader/single-writer locking protocol.

Shillner et al., in "Simplifying distributed file systems using a shared logical disk," Technical Report TR-524-96, Dept. of Computer Science, Princeton University, 1996, describe a distributed file system on top of a shared logical disk. There, a lower layer uses multiple servers cooperating to implement a single logical disk. In an upper layer, multiple independent servers execute the same file system code on top of the logical disk to provide access to shared files. However, the logical disk layer does not provide redundancy. The system can recover from a failure in a local server, but dynamic reconfiguration of other failed servers is not possible. Their file system uses careful ordering of operations that write file metadata, but the writes are not logged. Their technique avoids the need for a full metadata scan to restore consistency after a server failure. Unfortunately, the shared logical disk can lose track of free blocks after a server failure. This necessitates a time-consuming garbage collection process to locate the free blocks.

The xFS file system, described by Anderson et al. in "Serverless network file systems," ACM Transactions on Computer Systems, 14(1):41-79, February 1996, distributes management responsibility for files over multiple servers and provides good availability and performance. However, xFS has a predesignated manager for each file, and its log-structured storage servers work independently of one another. File system recovery and reconfiguration are not addressed.

An ideal distributed file system would provide all of its users with shared access to the same set of files. Access would be controlled in a coherent and transparent manner so that any user's view of any file at any one time is consistent with any other user's view. In addition, the distributed file system needs to be scalable to any arbitrary size to provide more storage space and higher performance as the need for data by an ever increasing number of users increases. The users would also like to have uninterrupted access to the data of the files, so high availability is a necessity, despite the fact that hardware components can unpredictably fail at any time. In order to keep maintenance costs down, the distributed file system should require a minimal amount of human administration, and the complexity of the administration should not increase as more hardware components or users are added.

SUMMARY OF THE INVENTION

Provided is a file system distributed over a plurality of computers connected by a network. The plurality of computers execute user programs, and the user programs access files stored on a plurality of physical disks connected to the plurality of computers. According to the invention, the file system includes a plurality of file servers executing on the plurality of computers as a single distributed file server layer, a plurality of disk servers executing on the plurality of computers as a single distributed disk server layer, and a plurality of lock servers executing on the plurality of computers as a single distributed lock server to coordinate the operation of the distributed file and disk server layers so that the user programs can coherently access the files on the plurality of physical disks.

In one aspect of the invention, each of the plurality of file servers executes independently on a different one of the plurality of computers, and the plurality of file servers communicate only with the plurality of disk servers and the plurality of lock servers, and not with each other. Furthermore, each of the plurality of file, disk, and lock servers can execute on a different one of the plurality of computers. Some of the computers executing user programs and file servers can be diskless workstations.

In another aspect of the invention, the disk server layer organizes the plurality of physical disks as a single virtual disk having a single address space.

As an advantage of the invention, the number of computers, user programs, physical disks, files, file servers, disk servers, and lock servers can dynamically change while the user programs, file servers, disk servers, and lock servers execute to provide a scalable file system. Also, the arrangement of the computers, user programs, physical disks, files, file servers, disk servers, and lock servers over the plurality of computers and physical disks can dynamically change while the user programs, file servers, disk servers, and lock servers execute to provide fault tolerance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a top level block diagram of a file system according to the invention;
FIG. 2 is a block diagram of a plurality of computer systems connected by a network over which the file system of FIG. 1 is distributed;
FIG. 3 is a block diagram of a client/server configuration of the file system of FIG. 1;
FIG. 4 is a block diagram of a sparse address space of a virtual disk used by the file system;
FIG. 5 is a flow diagram of a process for acquiring locks to perform a file access; and
FIG. 6 is a flow diagram of locking transactions.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

System Overview

FIG. 1 will be used to give an overview of a scalable distributed file system 100 according to the invention. As shown in FIG. 1, client user programs (clients) 101 would like to persistently store data on physical disks 102. The clients 101 organize the data into files which can be written to, and read from, the disks 102.

The invention provides a two-layer file system to access the disks 102 and to manage the files. A first layer is designed as a distributed file server 110, and a second layer implements a distributed virtual disk server 120. Synchronization and coherence between a plurality of layers 110 is provided by a distributed lock server 130.

The distributed file system 100 can include multiple copies of the file, disk, and lock servers 110, 120, and 130. In this case, each copy of a server can execute on a different physical machine. As a design feature, the number of copies that can concurrently service read and write accesses to the files on the disks 102 by the users 101 can easily be scaled to suit any size user community. Initially, the servers 110, 120, and 130 can all execute on a single processor, and then later, as the number of users and files increases, additional copies of the servers can be started up to increase storage capacity and improve throughput. This can be done without changing the configuration of existing servers, or interrupting their operation. Each of the distributed servers 110, 120, and 130 can be scaled independently of the other servers. The copies of the servers can be viewed as "bricks" that can be stacked incrementally to build as large a file system as needed. A system administrator can add new users without concern about which machines will manage their files, and which disks are used to store the files.

The salient point is that even when there are several copies of the servers 110, 120, or 130, the several copies act in a coordinated manner as a single functional unit, and, consequently, the distributed file system 100 gives all user programs a consistent view of the same set of files on the distributed physical disks 102. The distributed file system 100 according to the invention makes the numerous physical disks 102 appear as a single virtual disk having a single address space. The number of physical components used to implement the file system 100, e.g., processors, disks, and networks, is not important. As the load changes, units can be added or removed without disturbing normal operation.

Coherency of data in the files is controlled by the distributed lock server 130. Although the file, disk, and lock servers are shown together, it should be noted that the copies can execute independently on separate physical machines. For example, the file server 110 is designed so that the individual copies run totally independently of each other; the copies only communicate with the disk and lock servers, and not with each other. Thus the individual copies can start, stop, and fail without disturbing the operation of the file system.

In one configuration of the present file system, some machines can be dedicated to running application programs and file servers 110, while other machines provide the disk and lock servers. In another configuration, any processor can perform any of the file system functions, because each copy can process access requests from any client user program. The file server 110 can use any known file access protocol supported by host operating systems, such as DCE/DFS, NFS, or SMB.

One distinguishing feature of the present file system is that it has a very simple internal structure: a set of cooperating servers use a common virtual disk and synchronize access to that disk with locks. This structure allows one to handle system recovery, reconfiguration, and load balancing with very little machinery. A system administrator can make a full and consistent backup of the entire file system without bringing it down. Backups can optionally be kept on-line, allowing users quick access to accidentally deleted files.

The file system 100 tolerates and recovers from machine, network, and disk failures without operator intervention. Multiple interchangeable servers provide access to the same set of files by being layered on a single shared virtual disk, and the actions of the servers are coordinated with locks to ensure coherence in the data. The file system 100 can be scaled up by adding servers and machines as needed. This structure achieves fault tolerance by recovering automatically from server failures and continuing to operate with the servers that survive. The structure allows the file system to be distributed over multiple machines to optimally balance the load depending on the dynamic operational needs of the user programs 101.

Example Server/Machine Arrangement

FIG. 2 shows an example arrangement 200 with an assignment of file system functions to various computer systems. The arrangement 200 includes computer systems 210 and 220 connected by a network 230. The systems 210 execute client user programs 101. The systems 210 also include a file system switch 211, a copy of the file server 110, and a virtual disk driver 212. The systems 210 can be workstations, or other similar computer systems, and can be diskless. Each system 210 can concurrently execute multiple user programs 101 on behalf of one or more users.

The systems 220 provide the distributed disk server 120 and the distributed lock server 130. The virtual disk can be distributed over physical disks 102 attached to the systems 220. The system administrator can control how the physical disks are distributed over the machines 220.

The functions do not have to be assigned to machines exactly as shown in FIG. 2. For example, the file server 110 and disk servers 120 do not have to execute on separate machines; in some installations it may make sense to use the same machine for both functions, particularly when the file server 110 is not heavily loaded. Similarly, the distributed lock server 130 is independent of the other functions. Instead of having each machine 220 host a copy of the lock server 130, the lock functions can be served from the machines 210, or any other available machine.

During operation of the arrangement 200, the client user programs 101 access the files on the disks 102 using operating system call interfaces.

User programs executing on different machines all see the same set of files, and their view of the files is coherent; that is, changes made to files on one machine are immediately visible to all user programs 101. Programs get essentially the same semantic guarantees as if the entire arrangement were implemented using, for example, a local Unix (TM) file system. Changes to the user data of a file are staged through a conventional local buffer pool, and are not guaranteed to reach non-volatile physical storage 102 until the next synchronizing system call. Changes to the metadata of the files are logged, and can optionally be guaranteed non-volatile by the time the system call completes and returns to the user programs 101. In order to avoid a metadata write for each user data read, the file system maintains an approximate last time that a file was accessed. For a complete description of the logging aspects of the file system, see U.S. patent application Ser. No. 08/859,670, "Multiple logs for distributed computer systems," filed by Thekkath et al. on May 20, 1997.

The copies of the servers on each machine execute within the machine's operating system kernel. In another embodiment of the invention, copies of the servers may run outside the kernel. When the copies are mounted, they register themselves with the kernel's file system switch 211 as one of the available file system implementations. The servers use the kernel's buffer pool to cache data from recently used files. The file server 110 reads and writes data of the virtual disk using the virtual disk driver 212.

Each copy of the file server 110 maintains its own "redo" log of pending file changes. The logs are maintained by the virtual disk server 120 so that when any file server 110 fails, the surviving servers read the log to recover from the failure. The various copies of the file server have no need to communicate with each other; they only communicate with the virtual disk server 120 and the lock server 130. Although the copies execute independently, they behave as a single functional unit. This makes server addition, removal, and recovery simple.

The virtual disk driver 212 hides the distributed nature of the virtual disk. To the higher levels of the operating system, it appears as if the files are stored on a local physical disk. The driver 212 is responsible for contacting the correct disk server 120, and for failing over to another server when necessary. The distributed file system servers execute cooperatively to provide the file system with a large, scalable, fault-tolerant virtual disk that is implemented on top of the physical disks 102 of the machines 220. The file server 110 tolerates multiple machine and network failures as long as the virtual disk and lock services are accessible.

The lock server 130 provides multiple-reader/single-writer locks to the client users 101. For fault tolerance and scalable performance, the lock server 130 can be distributed. The file system uses the lock server 130 to coordinate access to the virtual disk, and to keep local cache buffers coherent across multiple servers.
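For illustration only, the following minimal sketch (in Python, with a thread-level lock standing in for the distributed lock server 130) shows the multiple-reader/single-writer semantics described above; the class and method names are assumptions made for the example, not part of the described system.

    # Sketch: many readers may hold the lock at once; a writer is exclusive.
    import threading

    class ReadWriteLock:
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0          # number of active readers
            self._writer = False       # True while a writer holds the lock

        def acquire_read(self):
            with self._cond:
                while self._writer:            # readers wait only for a writer
                    self._cond.wait()
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()    # a waiting writer may proceed

        def acquire_write(self):
            with self._cond:
                while self._writer or self._readers > 0:
                    self._cond.wait()          # writers need exclusive access
                self._writer = True

        def release_write(self):
            with self._cond:
                self._writer = False
                self._cond.notify_all()        # wake both readers and writers

    lock = ReadWriteLock()
    lock.acquire_read(); lock.acquire_read()   # two concurrent readers are fine
    lock.release_read(); lock.release_read()
    lock.acquire_write()                       # now only this writer holds it
    lock.release_write()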

Security

In the configuration shown in FIG. 2, every machine 210 that hosts user programs also hosts a copy of the file server 110. This configuration has the potential for good load balancing and scaling, but poses security concerns. Any machine 210 can read or write any block of the shared virtual disk, so the file servers 110 must run on machines with trusted operating systems when secure operation is desired. It would not be sufficient for the machines 210 executing the file servers 110 to authenticate themselves as acting on behalf of the user programs 101 to the machines 220 executing the disk and lock servers, as is done with remote file access protocols like NFS. Full security also requires the disk and lock servers to execute on trusted operating systems, and all three types of servers to authenticate themselves to one another. Finally, to ensure file data is kept private, users should be prevented from eavesdropping on the network 230 interconnecting the machines 210 and 220.

In a simple solution, one could fully solve these problems by placing the machines in an environment that prevents users from booting modified operating system kernels on their machines, and then interconnecting the machines with a private network that excludes access by user processes. This does not necessarily mean that the machines must be locked in a room with a private physical network; known cryptographic techniques for secure booting, authentication, and encrypted links could be used instead. Also, in many applications, partial solutions may be acceptable; typical existing NFS installations are not secure against network eavesdropping, or even data modification by users who boot a modified kernel on their workstations. It is possible to reach the NFS level of security by having the disk server 120 only accept requests from file server machines with trusted network addresses. The network addresses can be Internet Protocol (IP) addresses.

Client/Server Configuration

The present file system can be exported to machines outside a trusted administrative domain using the configuration shown in FIG. 3. In this context, an untrusted client machine 310 is distinguished from a trusted server machine 320. Only the file server 110 executing on the trusted machine 320 communicates directly with the disk and lock servers 120 and 130. The trusted machine 320 can be located in a restricted environment and interconnected by a private network as discussed above. The remote untrusted machine 310 communicates with the trusted machine 320 through a separate network 330, and has no direct access to the disk and lock servers.

Using the file system switch 211, the client user programs 101 can use any file access protocol supported by the host operating system, such as DCE/DFS, NFS, or SMB, because the file server 110 appears just like a local file system on the machine running the server. Of course, a protocol that supports coherent access, such as DCE/DFS, is best, so that the file system's coherence across multiple servers is retained at the next level up. Ideally, the protocol should also support failover from one server to another. The protocols just mentioned do not support failover directly, but the technique of having a new machine take over the network address of a failed machine can be applied here.

Apart from security, there is a second reason for using the client/server configuration 300. Because the file server 110 executes in the kernel of the operating system, it is difficult to port the file system to different operating systems, or even different versions of a single operating system. The client/server configuration 300 allows client programs 101 to access the file system 100 from any remote, unsupported system 310 using the network 330. For example, the network 330 can be the Internet, and the machine 310 can be any remote client computer connected to the Internet. The system 100 can then be centralized as an Internet server to provide file services to any number of remote Internet clients.

Virtual Disk Address Space

In the preferred embodiment, the disks 102 can include many individual disk drives, for example, SCSI type disks, which can be configured as a single shared pool of storage using RAID technologies. The virtual disk layer can provide disk caching and supports efficient snapshots for consistent back-ups. The disks 102 effectively provide a sparse 2^64 byte address space which can be allocated on demand.

FIG. 4 shows how the sparse 2^64 byte address space 400 of the virtual disk can be partitioned. Because there is so much virtual addressing space, the addresses do not need to be carefully husbanded and dynamically reused. Addresses can statically be parceled out in generous quantities. Virtual addresses are committed to physical locations when data are written. A gross partitioning logically allocates addresses in terabyte (2^40 byte) ranges, e.g., 1T, 2T, etc., in FIG. 4. In order to keep the internal data structures of the file system small, physical addresses are also committed and decommitted in fairly large chunks, for example, 64K bytes.

A first address range is allocated to shared configuration parameters and file system housekeeping information (PARAMS) 410. The second range 420 stores process-specific recovery logs. There can be one private log for every possible file server process that can execute on the processing units 110, for example, 256 logs 421-422. Fewer or more logs are also possible. Logs are bounded in size. The physical space allocated to a log is managed as a circular buffer. When the log fills up, a check can be made to determine whether the updates described in the oldest portion of the log, for example, 25%, have been carried out. If not, further file updates can be blocked. Otherwise, the tail end of the log can be reallocated.

The rest of the address space, from 2T to 2^64, is allocated for data of the file system. This data includes file system metadata 401 and user data 402. The metadata 401 define the structure of the user data 402. The metadata 401 include bitmaps 430, information nodes (INODES) 440, and directory information (DIR) 450. The bitmaps 430 indicate which virtual addresses are used or available. The INODES 440 store pointers to the user data, sizes of files, data formats, dates, and the like.

The directory information 450 stores user file names and their equivalent system names or numbers. The user data 402 can be organized as sequential, relational, or object-oriented files, for example. In one implementation, the file system supports about 16 million files, although this limit is easily changed by changing some of the boundaries between the address ranges. For expediency's sake, the emphasis for data recovery in the preferred embodiment is placed on the metadata, because if the metadata are lost, then the entire file system is at risk. A reasonable recovery of user data can be achieved by periodic back-ups taken at check-points.
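For illustration only, the following Python sketch expresses the address-space partitioning of FIG. 4 as constants: roughly one terabyte for parameters, one terabyte holding 256 bounded logs, and the remainder for metadata and user data, as described above. The helper function and names are assumptions made for the example.

    # Sketch of the 2^64-byte virtual disk partitioning described in the text.
    T = 1 << 40                       # one terabyte
    ADDRESS_SPACE = 1 << 64           # sparse virtual address space

    PARAMS_START, PARAMS_END = 0 * T, 1 * T          # configuration / housekeeping
    LOGS_START, LOGS_END = 1 * T, 2 * T              # 256 per-server redo logs
    DATA_START, DATA_END = 2 * T, ADDRESS_SPACE      # metadata + user data

    LOG_COUNT = 256
    LOG_SIZE = (LOGS_END - LOGS_START) // LOG_COUNT  # each log is a bounded circular buffer

    def region_of(addr):
        """Return which logical region a virtual disk address falls in."""
        if PARAMS_START <= addr < PARAMS_END:
            return "params"
        if LOGS_START <= addr < LOGS_END:
            return "log-%d" % ((addr - LOGS_START) // LOG_SIZE)
        if DATA_START <= addr < DATA_END:
            return "file data"
        raise ValueError("address outside the 2^64-byte space")

    print(region_of(5 * T))              # -> 'file data'
    print(region_of(T + 3 * LOG_SIZE))   # -> 'log-3'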

Introduction to Distributed System Design


Audience and Pre-Requisites

This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
The Basics

What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is a "cascading" definition of a distributed system:
A program is the code you write.
A process is what you get when you run it.
A message is used to communicate between processes.
A packet is a fragment of a message that might travel on a wire.
A protocol is a formal description of message formats and the rules that two processes must follow in order to exchange those messages.
A network is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links.
A component can be a process or any piece of hardware required to run a process, support communications between processes, or store data.
A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single task or a small set of related tasks.

Why build a distributed system? There are lots of advantages, including the ability to connect remote users with remote resources in an open and scalable way. When we say open, we mean each component is continually open to interaction with other components. When we say scalable, we mean the system can easily be altered to accommodate changes in the number of users, resources and computing entities.

Thus, a distributed system can be much larger and more powerful, given the combined capabilities of the distributed components, than combinations of stand-alone systems. But it's not easy - for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components. To be truly reliable, a distributed system must have the following characteristics:

Fault-Tolerant: It can recover from component failures without performing incorrect actions.
Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
Predictable Performance: The ability to provide desired responsiveness in a timely manner.
Secure: The system authenticates access to data and services. [1]

These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system must be able to continue operating correctly even when components fail. This issue is discussed in the following excerpt of an interview with Ken Arnold. Ken is a research scientist at Sun, one of the original architects of Jini, and was a member of the architectural team that designed CORBA.

Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."

When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern. What does designing for failure mean? One classic problem is partial failure. If I send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got to you, and then the network broke, and I just didn't get the response. The other is the message never got to you because the network broke before it arrived.

So if I never receive a response, how do I know which of those two results happened? I cannot determine that without eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not a network failure but you died. How does this change how I design things? For one thing, it puts a multiplier on the value of simplicity. The more things I can do with you, the more things I have to think about recovering from. [2]

Handling failures is an important theme in distributed systems design. Failures fall into two obvious categories: hardware and software. Hardware failures were a dominant concern until the late 80's, but since then internal hardware reliability has improved enormously. Decreased heat production and power consumption of smaller circuits, reduction of off-chip connections and wiring, and high-quality manufacturing techniques have all played a positive role in improving hardware reliability. Today, problems are most often associated with connections and mechanical devices, i.e., network failures and drive failures. Software failures are a significant issue in distributed systems. Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%). Residual bugs in mature systems can be classified into two main categories [5].

Heisenbug: A bug that seems to disappear or alter its characteristics when it is observed or researched. A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode. The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).

Bohrbug: A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched. A Bohrbug typically manifests itself reliably under a well-defined set of conditions. [6]

Heisenbugs tend to be more prevalent in distributed systems than in local systems. One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes. Let's get a little more specific about the types of failures that can occur in a distributed system:

Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
Omission failures: Failure to send/receive messages primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
Network failures: A network link breaks.

Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
Timing failures: A temporal property of the system is violated. For example, clocks on different computers which are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
Byzantine failures: This captures several types of faulty behavior, including data corruption or loss, failures caused by malicious programs, etc. [1]
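Since halting failures can only be detected by timeout, a failure detector is essentially bookkeeping over heartbeat messages. Here is a minimal Python sketch; the threshold and component names are illustrative assumptions, and a "suspected" component may merely be slow or partitioned rather than actually crashed.

    # Sketch: suspect any component that has been silent longer than a threshold.
    import time

    class HeartbeatMonitor:
        def __init__(self, timeout_seconds=3.0):
            self.timeout = timeout_seconds
            self.last_seen = {}                 # component name -> last heartbeat time

        def heartbeat(self, component):
            self.last_seen[component] = time.monotonic()

        def suspected_failures(self):
            now = time.monotonic()
            return [c for c, t in self.last_seen.items()
                    if now - t > self.timeout]  # silence past the threshold => suspect

    monitor = HeartbeatMonitor(timeout_seconds=0.1)
    monitor.heartbeat("file-server-1")
    monitor.heartbeat("file-server-2")
    time.sleep(0.15)
    monitor.heartbeat("file-server-2")          # only server 2 keeps reporting
    print(monitor.suspected_failures())         # -> ['file-server-1']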

Our goal is to design a distributed system with the characteristics listed above (fault-tolerant, highly available, recoverable, etc.), which means we must design for failure. To design for failure, we must be careful to not make any assumptions about the reliability of the components of a system. Everyone, when they first build a distributed system, makes the following eight assumptions. These are so well-known in this field that they are commonly referred to as the "8 Fallacies".
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous. [3]

Latency: The time between initiating a request for data and the beginning of the actual data transfer.
Bandwidth: A measure of the capacity of a communications channel. The higher a channel's bandwidth, the more information it can carry.
Topology: The different configurations that can be adopted in building networks, such as a ring, bus, star or mesh.
Homogeneous network: A network running a single network protocol.

So How Is It Done?

Building a reliable system that runs over an unreliable communications network seems like an impossible goal. We are forced to deal with uncertainty. A process knows its own state, and it knows what state other processes were in recently. But the processes have no way of knowing each other's current state. They lack the equivalent of shared memory. They also lack accurate ways to detect failure, or to distinguish a local software/hardware failure from a communication failure. Distributed systems design is obviously a challenging endeavor. How do we do it when we are not allowed to assume anything, and there are so many complexities? We start by limiting the scope. We will focus on a particular type of distributed systems design, one that uses a client-server model with mostly standard protocols.

It turns out that these standard protocols provide considerable help with the low-level details of reliable network communications, which makes our job easier. Let's start by reviewing client-server technology and the protocols.
In client-server applications, the server provides some service, such as processing database queries or sending out current stock prices. The client uses the service provided by the server, either displaying database query results to the user or making stock purchase recommendations to an investor. The communication that occurs between the client and the server must be reliable. That is, no data can be dropped and it must arrive on the client side in the same order in which the server sent it.

There are many types of servers we encounter in a distributed system. For example, file servers manage disk storage units on which file systems reside. Database servers house databases and make them available to clients. Network name servers implement a mapping between a symbolic name or a service description and a value such as an IP address and port number for a process that provides the service. In distributed systems, there can be many servers of a particular type, e.g., multiple file servers or multiple network name servers. The term service is used to denote a set of servers of a particular type. We say that a binding occurs when a process that needs to access a service becomes associated with a particular server which provides the service. There are many binding policies that define how a particular server is chosen. For example, the policy could be based on locality (a Unix NIS client starts by looking first for a server on its own machine); or it could be based on load balance (a CICS client is bound in such a way that uniform responsiveness for all clients is attempted). A distributed service may employ data replication, where a service maintains multiple copies of data to permit local access at multiple locations, or to increase availability when a server process may have crashed. Caching is a related concept and very common in distributed systems. We say a process has cached data if it maintains a copy of the data locally, for quick access if it is needed again. A cache hit is when a request is satisfied from cached data, rather than from the primary service. For example, browsers use document caching to speed up access to frequently used documents.
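To make the cache-hit idea concrete, here is a minimal Python sketch of a client-side cache in front of a primary service; the time-to-live check and the fetch function are illustrative assumptions rather than a prescribed policy.

    # Sketch: serve repeated requests from a local copy, re-fetching stale entries.
    import time

    class Cache:
        def __init__(self, fetch, ttl_seconds=30.0):
            self.fetch = fetch                   # the "primary service"
            self.ttl = ttl_seconds
            self.entries = {}                    # key -> (value, time stored)

        def get(self, key):
            entry = self.entries.get(key)
            if entry is not None:
                value, stored_at = entry
                if time.monotonic() - stored_at < self.ttl:
                    return value                 # cache hit: no call to the primary
            value = self.fetch(key)              # miss or stale: go to the primary
            self.entries[key] = (value, time.monotonic())
            return value

    cache = Cache(fetch=lambda url: "<contents of %s>" % url)
    cache.get("http://example.com/doc")          # fetched from the primary
    cache.get("http://example.com/doc")          # served from the cache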

Caching is similar to replication, but cached data can become stale. Thus, there may need to be a policy for validating a cached data item before using it. If a cache is actively refreshed by the primary service, caching is identical to replication. [1]

As mentioned earlier, the communication between client and server needs to be reliable. You have probably heard of TCP/IP before. The Internet Protocol (IP) suite is the set of communication protocols that allow for communication on the Internet and most commercial networks. The Transmission Control Protocol (TCP) is one of the core protocols of this suite. Using TCP, clients and servers can create connections to one another, over which they can exchange data in packets. The protocol guarantees reliable and in-order delivery of data from sender to receiver. The IP suite can be viewed as a set of layers, each layer having the property that it only uses the functions of the layer below, and only exports functionality to the layer above. A system that implements protocol behavior consisting of layers is known as a protocol stack. Protocol stacks can be implemented either in hardware or software, or a mixture of both. Typically, only the lower layers are implemented in hardware, with the higher layers being implemented in software.

Resource: The history of TCP/IP mirrors the evolution of the Internet.

There are four layers in the IP suite:


1. Application Layer: The application layer is used by most programs that require network communication. Data is passed down from the program in an application-specific format to the next layer, then encapsulated into a transport layer protocol. Examples of application-layer protocols are HTTP, FTP and Telnet.

2. Transport Layer: The transport layer's responsibilities include end-to-end message transfer independent of the underlying network, along with error control, fragmentation and flow control. End-to-end message transmission at the transport layer can be categorized as either connection-oriented (TCP) or connectionless (UDP). TCP is the more sophisticated of the two protocols, providing reliable delivery. First, TCP ensures that the receiving computer is ready to accept data. It uses a three-packet handshake in which both the sender and receiver agree that they are ready to communicate. Second, TCP makes sure that data gets to its destination. If the receiver doesn't acknowledge a particular packet, TCP automatically retransmits the packet, typically three times. If necessary, TCP can also split large packets into smaller ones so that data can travel reliably between source and destination. TCP drops duplicate packets and rearranges packets that arrive out of sequence.

<="" p="">UDP is similar to TCP in that it is a protocol for sending and receiving packets across a network, but with two major differences. First, it is connectionless. This means that one program can send off a load of packets to another, but that's the end of their relationship. The second might send some back to the first and the first might send some more, but there's never a solid connection. UDP is also different from TCP in that it doesn't provide any sort of guarantee that the receiver will receive the packets that are sent in the right order. All that is guaranteed is the packet's contents. This means it's a lot faster, because there's no extra overhead for error-checking above the packet level. For this reason, games often use this protocol. In a game, if one packet for updating a screen position goes missing, the player will just jerk a little. The other packets will simply update the position, and the missing packet - although making the movement a little rougher - won't change anything.

<="" p="">Although TCP is more reliable than UDP, the protocol is still at risk of failing in many ways. TCP uses acknowledgements and retransmission to detect and repair loss. But it cannot overcome longer communication outages that disconnect the sender and receiver for long enough to defeat the retransmission strategy. The normal maximum disconnection time is between 30 and 90 seconds. TCP could signal a failure and give up when both end-points are fine. This is just one example of how TCP can fail, even though it does provide some mitigating strategies.

3. Network Layer: As originally defined, the Network Layer solves the problem of getting packets across a single network. With the advent of the concept of internetworking, additional functionality was added to this layer, namely getting data from a source network to a destination network. This generally involves routing the packet across a network of networks, e.g. the Internet. IP performs the basic task of getting packets of data from source to destination.

4. Link Layer: The link layer deals with the physical transmission of data, and usually involves placing frame headers and trailers on packets for travelling over the physical network and dealing with physical components along the way.
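The socket sketch promised above: a minimal Python example of the TCP behavior described under the Transport Layer. The host, port, and message are assumptions; the point is that the connection is established first and the bytes, sent as separate writes, arrive reliably and in order.

    # Sketch: a TCP echo-style exchange over the loopback interface.
    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 9500

    def server():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.bind((HOST, PORT))
            srv.listen(1)
            conn, _ = srv.accept()              # the handshake has completed here
            with conn:
                data = b""
                while True:
                    chunk = conn.recv(1024)     # bytes arrive reliably and in order
                    if not chunk:               # client closed its sending side
                        break
                    data += chunk
                conn.sendall(b"got: " + data)

    threading.Thread(target=server, daemon=True).start()
    time.sleep(0.2)                             # crude wait for the server to listen

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))               # connection-oriented: connect first
        for part in (b"hello ", b"in ", b"order"):
            cli.sendall(part)                   # three separate writes...
        cli.shutdown(socket.SHUT_WR)            # ...then signal end of message
        print(cli.recv(1024))                   # -> b'got: hello in order'

    # A UDP version would use SOCK_DGRAM with sendto()/recvfrom(): no connection,
    # and no guarantee of delivery or ordering.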

Resource : For more information on the IP Suite, refer to the Wikipedia article.

Remote Procedure Calls

Many distributed systems were built using TCP/IP as the foundation for the communication between components. Over time, an efficient method for clients to interact with servers evolved, called RPC (remote procedure call). It is a powerful technique based on extending the notion of local procedure calling, so that the called procedure need not exist in the same address space as the calling procedure. The two processes may be on the same system, or they may be on different systems with a network connecting them. An RPC is similar to a function call: when an RPC is made, the arguments are passed to the remote procedure and the caller waits for a response to be returned. In a typical interaction, the client makes a procedure call that sends a request to the server. The client process waits until either a reply is received, or it times out. When the request arrives at the server, the server calls a dispatch routine that performs the requested service, and sends the reply to the client. After the RPC call is completed, the client process continues.
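As a rough sketch of this request/reply flow, the example below uses Python's standard-library xmlrpc module as a stand-in for the stub-generator workflow described next; the service port and the lookup procedure are illustrative assumptions.

    # Sketch: one RPC round trip over XML-RPC.
    import threading
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy

    def lookup(name):
        """The remote procedure: a toy name-to-address mapping."""
        return {"fileserver-1": "10.0.0.5"}.get(name, "unknown")

    server = SimpleXMLRPCServer(("127.0.0.1", 9600), logRequests=False)
    server.register_function(lookup)             # the dispatch routine can now find it
    threading.Thread(target=server.serve_forever, daemon=True).start()

    client = ServerProxy("http://127.0.0.1:9600")
    print(client.lookup("fileserver-1"))         # the caller blocks until the reply arrives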

<="" p=""> Threads are common in RPC-based distributed systems. Each incoming request to a server typically spawns a new thread. A thread in the client typically issues an RPC and then blocks (waits). When the reply is received, the client thread resumes execution. A programmer writing RPC-based code does three things:
1. Specifies the protocol for client-server communication
2. Develops the client program
3. Develops the server program

The communication protocol is created by stubs generated by a protocol compiler. A stub is a routine that doesn't actually do much other than declare itself and the parameters it accepts. The stub contains just enough code to allow it to be compiled and linked.

The client and server programs must communicate via the procedures and data types specified in the protocol. The server side registers the procedures that may be called by the client and receives and returns data required for processing. The client side calls the remote procedure, passes any required data and receives the returned data. Thus, an RPC application uses classes generated by the stub generator to execute an RPC and wait for it to finish. The programmer needs to supply classes on the server side that provide the logic for handling an RPC request.

RPC introduces a set of error cases that are not present in local procedure programming. For example, a binding error can occur when a server is not running when the client is started. Version mismatches occur if a client was compiled against one version of a server, but the server has now been updated to a newer version. A timeout can result from a server crash, a network problem, or a problem on a client computer. Some RPC applications view these types of errors as unrecoverable. Fault-tolerant systems, however, have alternate sources for critical services and fail over from a primary server to a backup server. A challenging error-handling case occurs when a client needs to know the outcome of a request in order to take the next step after a server failure. This can sometimes result in incorrect actions and results. For example, suppose a client process requests a ticket-selling server to check for a seat in the orchestra section of Carnegie Hall. If it's available, the server records the request and the sale. But suppose the request fails by timing out. Was the seat available and the sale recorded? Even if there is a backup server to which the request can be re-issued, there is a risk that the client will be sold two tickets, which is an expensive mistake in Carnegie Hall [1].
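One common way to cope with the ticket-selling ambiguity is to label every request with a unique ID and reuse that ID on retries, so the server (or its backup) can recognize a duplicate instead of selling the seat twice. A minimal Python sketch follows; the send function is an assumed stand-in for whatever transport the application uses.

    # Sketch: retry on timeout, keeping the same request ID so duplicates are detectable.
    import uuid

    def call_with_retry(send, method, args, attempts=3, timeout=2.0):
        request_id = str(uuid.uuid4())          # same ID on every retry of this call
        last_error = None
        for _ in range(attempts):
            try:
                return send(method, args, request_id=request_id, timeout=timeout)
            except TimeoutError as exc:         # outcome unknown: maybe it succeeded
                last_error = exc                # retry with the SAME request_id
        raise last_error

    # Usage sketch: `send` is whatever transport the application uses.
    # call_with_retry(send, "buy_ticket", {"seat": "B12"})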

Here are some common error conditions that need to be handled:

Network data loss resulting in retransmit: Often, a system tries to achieve 'at most once' transmission semantics. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.

Server process crashes during RPC operation: If a server process crashes before it completes its task, the system usually recovers correctly because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.

Client process crashes before receiving response: The client is restarted. The server discards the response data.
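The server-side counterpart of retrying with a stable request ID is to remember completed requests so that a retry does not repeat its side effect, giving at-most-once execution. A minimal Python sketch, with an in-memory table and illustrative names (a real server would persist this state):

    # Sketch: deduplicate retried requests by request ID.
    class TicketServer:
        def __init__(self):
            self.completed = {}                  # request_id -> previous reply
            self.sold = set()                    # seats already sold

        def buy_ticket(self, seat, request_id):
            if request_id in self.completed:     # duplicate (a client retry)
                return self.completed[request_id]
            reply = "sold" if seat not in self.sold else "unavailable"
            if reply == "sold":
                self.sold.add(seat)
            self.completed[request_id] = reply   # record before replying
            return reply

    server = TicketServer()
    print(server.buy_ticket("B12", "req-1"))     # -> 'sold'
    print(server.buy_ticket("B12", "req-1"))     # retried request: still 'sold', no second sale
    print(server.buy_ticket("B12", "req-2"))     # a different request: 'unavailable'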

Some Distributed Design Principles

Given what we have covered so far, we can define some fundamental design principles which every distributed system designer and software engineer should know. Some of these may seem obvious, but it will be helpful as we proceed to have a good starting list.

As Ken Arnold says: "You have to design distributed systems with the expectation of failure." Avoid making assumptions that any component in the system is in a particular state. A classic error scenario is for a process to send data to a process running on a second machine. The process on the first machine receives some data back and processes it, and then sends the results back to the second machine assuming it is ready to receive. Any number of things could have failed in the interim and the sending process must anticipate these possible failures.

Explicitly define failure scenarios and identify how likely each one is to occur. Make sure your code is thoroughly covered for the most likely ones. Both clients and servers must be able to deal with unresponsive senders/receivers.

Think carefully about how much data you send over the network. Minimize traffic as much as possible.

Latency is the time between initiating a request for data and the beginning of the actual data transfer. Minimizing latency sometimes comes down to a question of whether you should make many little calls/data transfers or one big call/data transfer. The way to make this decision is to experiment. Do small tests to identify the best compromise.

Don't assume that data sent across a network (or even sent from disk to disk in a rack) is the same data when it arrives. If you must be sure, do checksums or validity checks on data to verify that the data has not changed.

Caches and replication strategies are methods for dealing with state across components. We try to minimize stateful components in distributed systems, but it's challenging. State is something held in one place on behalf of a process that is in another place, something that cannot be reconstructed by any other component. If it can be reconstructed, it's a cache. Caches can be helpful in mitigating the risks of maintaining state across components. But cached data can become stale, so there may need to be a policy for validating a cached data item before using it.

If a process stores information that can't be reconstructed, then problems arise. One possible question is, "Are you now a single point of failure?" I have to talk to you now - I can't talk to anyone else. So what happens if you go down? To deal with this issue, you could be replicated. Replication strategies are also useful in mitigating the risks of maintaining state. But there are challenges here too: What if I talk to one replicant and modify some data, then I talk to another? Is that modification guaranteed to have already arrived at the other? What happens if the network gets partitioned and the replicants can't talk to each other? Can anybody proceed? There are a set of tradeoffs in deciding how and where to maintain state, and when to use caches and replication. It's more difficult to run small tests in these scenarios because of the overhead in setting up the different mechanisms.

Be sensitive to speed and performance. Take time to determine which parts of your system can have a significant impact on performance: Where are the bottlenecks and why? Devise small tests you can do to evaluate alternatives. Profile and measure to learn more. Talk to your colleagues about these alternatives and your results, and decide on the best solution.

Acks are expensive and tend to be avoided in distributed systems wherever possible. Retransmission is costly. It's important to experiment so you can tune the delay that prompts a retransmission to be optimal.
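As a sketch of the retransmission-tuning point above, the following Python example retries an unacknowledged send with exponential backoff; the send function and the delay values are illustrative assumptions to be tuned by measurement, not recommended settings.

    # Sketch: back off exponentially instead of retransmitting at a fixed rate.
    import time

    def send_with_backoff(send, message, attempts=5, initial_delay=0.05):
        delay = initial_delay
        for attempt in range(attempts):
            try:
                return send(message)             # returns when the ack arrives
            except TimeoutError:
                if attempt == attempts - 1:
                    raise                        # give up and surface the failure
                time.sleep(delay)                # wait a little longer each time
                delay *= 2                       # exponential backoff

    # Usage sketch, with a transport that raises TimeoutError when no ack arrives:
    # send_with_backoff(transport.send, b"update block 42")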

Advantages and disadvantages


A DDBMS has many advantages. Data is located near the site of greatest demand, access is faster, processing is faster because several sites spread out the workload, new sites can be added quickly and easily, communication is improved, operating costs are reduced, it is user friendly, there is less danger of single-point failure, and it offers process independence.

The reasons why businesses and organizations move to distributed databases include organizational and economic factors, reliable and flexible interconnection of existing databases, and future incremental growth. Companies believe that a decentralized, distributed database approach adapts more naturally to the structure of their organizations. A distributed database is a more suitable solution when several databases already exist in an organization; in addition, global applications can be performed easily with a distributed database. If an organization grows by adding new, relatively independent organizational units, then the distributed database approach supports smooth incremental growth. Data can physically reside nearest to where it is most often accessed, thus providing users with local control of the data they interact with. This results in local autonomy of the data, allowing users to enforce locally the policies regarding access to their data. One might also consider a parallel architecture to improve the reliability and availability of the data in a scalable system. In a distributed system, with some care, it is possible to access some, or possibly all, of the data in a failure mode if there is sufficient data replication.

A DDBMS also has a few disadvantages. Management and control are complex, and there is less security because data resides at so many different sites. Distributed databases provide more flexible access, which increases the chance of security violations since the database can be accessed from every site within the network. For many applications it is important to provide secure access to data, and present distributed database systems do not provide adequate mechanisms to meet this objective. Hence the solution requires a DDBMS capable of handling multilevel data. Such a system is also called a multilevel secure distributed database management system (MLS-DDBMS). An MLS-DDBMS provides a verification service for users who wish to share data in the database at different security levels; every data item in the database is associated with one of several classifications or sensitivity levels. The ability to ensure the integrity of the database in the presence of unpredictable failures of both hardware and software components is also an important feature of any distributed database management system. The integrity of a database is concerned with its consistency, correctness, validity, and accuracy. Integrity controls must be built into the structure of the software, the databases, and the involved personnel. If there are multiple copies of the same data, then this duplication introduces additional complexity in ensuring that all copies are updated for each update. The notions of concurrency control and recoverability consume much of the research effort in the area of distributed database theory. Increased reliability and performance is the goal, not the status quo.
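As a minimal sketch of the classification idea behind an MLS-DDBMS, each data item below carries a sensitivity label, and a read is served only if the user's clearance is at least the item's level. The level names, table contents, and helper functions are purely illustrative, not a real MLS implementation:

```python
# Ordered sensitivity levels: higher index = more sensitive (illustrative only).
LEVELS = ["unclassified", "confidential", "secret", "top_secret"]

def dominates(clearance: str, classification: str) -> bool:
    """A user may read an item only if their clearance is at least the item's level."""
    return LEVELS.index(clearance) >= LEVELS.index(classification)

def read_items(items, clearance):
    """Return only the items this user is cleared to see."""
    return {k: v for k, (v, level) in items.items() if dominates(clearance, level)}

if __name__ == "__main__":
    # Each data item is stored together with its classification.
    table = {
        "branch_address": ("12 High St", "unclassified"),
        "payroll_total":  (1_250_000, "confidential"),
        "merger_plan":    ("project X", "secret"),
    }
    print(read_items(table, "confidential"))   # sees only the first two items
```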

Advantages and disadvantages of distributed systems over centralized systems?


Advantages of distributed systems over centralized ones:
1. Incremental growth: computing power can be added in small increments.
2. Reliability: if one machine crashes, the system as a whole can still survive.
3. Speed: a distributed system may have more total computing power than a mainframe.
4. Open system: this is the most important and most characteristic point of a distributed system. Since it is an open system, it is always ready to communicate with other systems, and an open system that scales has an advantage over a perfectly closed and self-contained system.
5. Economics: microprocessors offer better price/performance than mainframes.

Disadvantages of distributed systems over centralized ones:
1. Security: as noted above, distributed systems have an inherent security issue.
2. Networking: if the network gets saturated, then problems with transmission will surface.
3. Software: there is currently very little software support for distributed systems.
4. Troubleshooting: troubleshooting and diagnosing problems in a distributed system can also become more difficult, because the analysis may require connecting to remote nodes or inspecting communication between nodes.

To design my own distributed system I would do the following (a sketch of the access-control and authentication points appears after this list):
1. Distributed users always pose a threat of exposure to hackers, therefore I would create groups of users and provide them constrained access to data.
2. Authentication would be required of all users who use the system.
3. Encryption would be used for message protection.
4. The whole domain would also be protected via firewalls, so I would install good firewalls.
5. I would install good anti-virus software on the machines that are part of the distributed system, as these systems are prone to a domino effect: if one system gets infected, the whole network becomes susceptible to infection.
6. Design: as far as possible I would keep the design simple and as local as possible. I would probably use a virtual private network, use machines that are compatible with each other, and enforce secure access for all users. I would probably also keep the most important component, the database, on a separate well-equipped machine dedicated solely to looking after the database.

Read more: http://wiki.answers.com/Q/Advantages_and_disadvantages_of_distributed_system_over_centralized_system#ixzz1JP436kkp
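As a minimal sketch of points 1 and 2 above, the example below keeps a tiny in-memory user/group table and accepts a request only if it carries a valid HMAC tag and the user's group permits the action. The user names, group permissions, and shared key are all hypothetical, and a real deployment would add proper encryption (e.g., TLS) for the message-protection point:

```python
import hashlib
import hmac

# Illustrative user/group table and per-group permissions (not from any real system).
USER_GROUPS = {"alice": "analysts", "bob": "operators"}
GROUP_ACCESS = {
    "analysts":  {"sales_db": "read"},
    "operators": {"sales_db": "read", "config_db": "write"},
}

SHARED_KEY = b"demo-key-change-me"   # placeholder; real systems use per-user credentials

def sign(user: str, request: str) -> str:
    """Tag a request with an HMAC so the server can authenticate the sender."""
    return hmac.new(SHARED_KEY, f"{user}:{request}".encode(), hashlib.sha256).hexdigest()

def authorize(user: str, request: str, tag: str, resource: str, action: str) -> bool:
    """Accept only authenticated requests from users whose group permits the action."""
    if not hmac.compare_digest(tag, sign(user, request)):
        return False                                  # failed authentication
    group = USER_GROUPS.get(user)
    return GROUP_ACCESS.get(group, {}).get(resource) == action

if __name__ == "__main__":
    req = "SELECT * FROM sales"
    print(authorize("alice", req, sign("alice", req), "sales_db", "read"))    # True
    print(authorize("alice", req, sign("alice", req), "config_db", "write"))  # False: not permitted
```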

Chapter 6

Distributed DBMS Architecture


This chapter introduces the architectures of different distributed systems such as client/server systems and peer-to-peer distributed systems. Due to the diversity of distributed systems, it is very difficult to derive a single equivalent architecture for a distributed DBMS. Different alternative architectures of distributed database systems and the advantages and disadvantages of such systems are discussed here in detail. This chapter also introduces the concept of the multi-database system (MDBS), which is used to manage the heterogeneity of different DBMSs in a heterogeneous distributed DBMS environment. The classification of multi-database systems and the architectures of such databases are thoroughly presented in this chapter. The outline of this chapter is as follows. Section 6.1.1 introduces different alternative architectures for client/server systems and the pros and cons of such systems. In Section 6.1.2, alternative architectures for peer-to-peer distributed systems are discussed. Section 6.1.3 focuses on the multi-database system (MDBS); the classifications of MDBSs and their corresponding architectures are illustrated in that section.

6.1 Introduction

The architecture of a system reflects the structure of the underlying system. It defines the different components of the system, the functions of these components, and the overall interactions and relationships among these components. This concept holds for general computer systems as well as for software systems. The software architecture of a program or computing system is the structure or structures of the system, comprising the software elements or modules, the externally visible properties of these elements, and the relationships among them. Software architecture can be thought of as a representation of an engineered software system and the process and discipline for effectively implementing the design(s) of such a system.

A distributed database system can be considered a large-scale software system; thus the architecture of a distributed system can be defined in a similar manner to that of software systems. This chapter introduces the different alternative reference architectures of distributed database systems, namely client/server, peer-to-peer and multi-database systems.

6.1.1 Client/Server System

In the late 1970s and early 1980s, smaller systems (minicomputers) were developed that required less power and air conditioning. The term client/server was first used in the 1980s and gained acceptance in reference to personal computers (PCs) on a network. In the late 1970s, Xerox developed the standard and technology that is familiar as Ethernet today. This provided a standard means of linking together computers from different manufacturers and formed the basis for modern local area networks (LANs) and wide area networks (WANs). Client/server systems have been developed to cope with a rapidly changing business environment. The general forces that drive the move to client/server systems are as follows:
- A strong business requirement for decentralized computing horsepower.
- Standard, powerful computers with user-friendly interfaces.
- Mature, shrink-wrapped user applications with widespread acceptance.
- Inexpensive, modular systems designed with enterprise-class features such as power and network redundancy and file archiving, and network protocols to link them together.
- Growing cost/performance advantages of PC-based platforms.

The client/server system is a versatile, message-based and modular infrastructure that is intended to improve usability, flexibility, interoperability and scalability as compared to centralized, mainframe, time-sharing computing. In the simplest sense, the client and the server can be defined as follows. A client is an individual user's computer or a user application that does a certain amount of processing on its own and sends and receives requests to and from one or more servers for other processing and/or data. A server consists of one or more computers or an application program that receives and processes requests from one or more client machines. A server is typically designed with some redundancy in power, network, computing, and file storage. Usually, a client is defined as a requester of services and a server is defined as a provider of services. A single machine can be both a client and a server depending on the software configuration. Sometimes the terms server and client refer to the software rather than the machines. Generally, server software runs on powerful computers dedicated exclusively to business applications, while client software runs on common PCs or workstations.

The properties of a server are:
- Passive (slave).
- Waits for requests.
- On request, serves clients and sends a reply.

The properties of a client are:
- Active (master).
- Sends requests.
- Waits until a reply arrives.

A server can be stateless or stateful. A stateless server does not keep any information between requests; a stateful server can remember information between requests.

6.1.1.1 Advantages and Disadvantages of Client/Server Systems

A client/server system provides a number of advantages over a powerful centralized mainframe system. The major advantage is that it improves usability, flexibility, interoperability and scalability as compared to centralized, mainframe, time-sharing computing. In addition, a client/server system has the following advantages:
- A client/server system has the ability to distribute the computing workload between client workstations and shared servers.
- A client/server system allows the end user to use a microcomputer's graphical user interface, thereby improving functionality and simplicity.
- It provides better performance at a reduced cost for hardware and software than alternative mini or mainframe solutions.

The client/server environment is, however, more difficult to maintain for a variety of reasons:
- The client/server architecture creates a more complex environment in which it is often difficult to manage different platforms (LANs, operating systems, DBMSs, etc.).
- In a client/server system, the operating system software is distributed over many machines rather than a single system, thereby increasing complexity.
- A client/server system may suffer from security problems as the number of users and processing sites increases.
- The workstations in a client/server system are geographically distributed, and each workstation is administered and controlled by an individual department, which adds extra complexity. Furthermore, a communication cost is incurred with each processing request.
- The maintenance cost of a client/server system is greater than that of alternative mini or mainframe solutions.

6.1.1.2 Architecture of Client/Server Distributed System

Client/server architecture is a prerequisite to the proper development of client/server systems. The client/server architecture is based on hardware and software components that interact to form a distributed system. In a client/server distributed database system, the entire data can be viewed as a single logical database, while at the physical level the data may be distributed. From the data-organizational view, the architecture of a client/server distributed database system is mainly concentrated on the software components of the system, and the system includes three main components: clients, servers and communication middleware.

(i) A client is an individual computer, process or user application that requests services from the server. A client is also known as a front-end application, since the end user usually interacts with the client process. The software components required in a client machine are the client operating system, the client DBMS and the client graphical user interface. The client process runs on an operating system that has at least some multitasking capabilities, and the end users interact with the client process via a graphical user interface. In addition, a client DBMS is required at the client side, which is responsible for managing the data that is cached at the client. In some client/server architectures, communication software is embedded in the client machine to interact efficiently with other machines in the network as a substitute for communication middleware.

(ii) A server consists of one or more computers, or is a computer process or application, that provides services to clients. A server is also known as a back-end application, since the server process provides background services for the client processes. A server provides most of the data management services, such as query processing and optimization, transaction management, recovery management, storage management and integrity services, to clients. In addition, communication software sometimes resides in the server machine to manage communications with clients and other servers in the network instead of communication middleware.

(iii) The communication middleware is any process (or processes) through which clients and servers communicate with each other. The communication middleware is usually associated with a network that controls data and information transmission between clients and servers. Communication middleware software consists of three main components: the application program interface (API), the database translator and the network translator. The application program interface (API) is exposed to client applications, which communicate with the communication middleware through it; the middleware API allows the client process to be database-server independent. The database translator translates SQL requests into the specific database server's syntax, thus enabling a DBMS from one vendor to communicate directly with a DBMS from another vendor without a gateway. The network translator manages the network communication protocols, thus allowing clients to be network-protocol independent. To accomplish the connection between the client and the server, the communication middleware software operates at two different levels: the physical level deals with communications between the client and server computers (computer to computer), whereas the logical level deals with communications between the client and server processes (interprocess).
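As a minimal sketch of the database-translator role described above, the example below rewrites a generic paging request into two hypothetical vendor dialects so the client stays database-server independent. The dialect names and rewrite rule are illustrative, not an actual middleware product's API:

```python
# Illustrative middleware-style database translator: rewrite a generic query
# into a vendor-specific dialect on behalf of the client.

GENERIC_QUERY = "SELECT name FROM employees ORDER BY name LIMIT 10"

def translate(query: str, dialect: str) -> str:
    """Translate the generic LIMIT clause into the target server's paging syntax."""
    if dialect == "vendor_a":               # a MySQL/PostgreSQL-style server
        return query                         # LIMIT is already understood
    if dialect == "vendor_b":               # an older SQL Server-style dialect using TOP
        n = query.rsplit("LIMIT", 1)[1].strip()
        body = query.rsplit("LIMIT", 1)[0].strip()
        return body.replace("SELECT", f"SELECT TOP {n}", 1)
    raise ValueError(f"unknown dialect: {dialect}")

if __name__ == "__main__":
    print(translate(GENERIC_QUERY, "vendor_a"))
    print(translate(GENERIC_QUERY, "vendor_b"))
```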
