
Scalable Cluster Computing with

MOSIX for LINUX


Amnon Barak Oren La'adan Amnon Shiloh
Institute of Computer Science
The Hebrew University of Jerusalem
Jerusalem 91904, Israel
{amnon,orenl,amnons}@cs.huji.ac.il


http://www.mosix.cs.huji.ac.il

ABSTRACT

Mosix is a software tool for supporting cluster computing. It consists of kernel-level, adaptive resource sharing algorithms that are geared for high performance, overhead-free scalability and ease-of-use of a scalable computing cluster. The core of the Mosix technology is the capability of multiple workstations and servers (nodes) to work cooperatively as if part of a single system.

The algorithms of Mosix are designed to respond to variations in the resource usage among the nodes by migrating processes from one node to another, preemptively and transparently, for load-balancing and to prevent memory depletion at any node. Mosix is scalable and it attempts to improve the overall performance by dynamic distribution and redistribution of the workload and the resources among the nodes of a computing-cluster of any size. Mosix conveniently supports a multi-user time-sharing environment for the execution of both sequential and parallel tasks.

So far, Mosix has been developed seven times, for different versions of Unix and BSD, and most recently for Linux. This paper describes the seventh version of Mosix, for Linux.

1 Introduction

This paper describes the Mosix technology for Cluster Computing (CC). Mosix [4, 5] is a set of adaptive resource sharing algorithms that are geared for performance scalability in a CC of any size, where the only shared component is the network. The core of the Mosix technology is the capability of multiple nodes (workstations and servers, including SMP's) to work cooperatively as if part of a single system.

In order to understand what Mosix does, let us compare a Shared Memory (SMP) multicomputer and a CC. In an SMP system, several processors share the memory. The main advantages are increased processing volume and fast communication between the processes (via the shared memory). SMP's can handle many simultaneously running processes, with efficient resource allocation and sharing. Any time a process is started, finished, or changes its computational profile, the system adapts instantaneously to the resulting execution environment. The user is not involved and in most cases does not even know about such activities.
Unlike SMP's, Computing Clusters (CC) are made of collections of share-nothing workstations and (even SMP) servers (nodes), with different speeds and memory sizes, possibly from different generations. Most often, CC's are geared for multi-user, time-sharing environments. In CC systems the user is responsible for allocating the processes to the nodes and for managing the cluster resources. In many CC systems, even though all the nodes run the same operating system, cooperation between the nodes is rather limited because most of the operating system's services are locally confined to each node. The main software packages for process allocation in CC's are PVM [8] and MPI [9]. LSF [7] and Extreme Linux [10] provide similar services. These packages provide an execution environment that requires an adaptation of the application and the user's awareness. They include tools for initial (fixed) assignment of processes to nodes, which sometimes use load considerations, while ignoring the availability of other resources, e.g. free memory and I/O overheads. These packages run at the user level, just like ordinary applications, and thus are incapable of responding to fluctuations of the load or other resources, or of redistributing the workload adaptively.

In practice, the resource allocation problem is much more complex because there are many (different) kinds of resources, e.g., CPU, memory, I/O, Inter Process Communication (IPC), etc., where each resource is used in a different manner and in most cases its usage is unpredictable. Further complexity results from the fact that different users do not coordinate their activities. Thus even if one knows how to optimize the allocation of resources to processes, the activities of other users are most likely to interfere with this optimization.

For the user, SMP systems guarantee efficient, balanced use of the resources among all the running processes, regardless of the resource requirements. SMP's are easy to use because they employ adaptive resource management that is completely transparent to the user. Current CC's lack such capabilities. They rely on user-controlled static allocation, which is inconvenient and may lead to significant performance penalties due to load imbalances.

Mosix is a set of algorithms that support adaptive resource sharing in a scalable CC by dynamic process migration. It can be viewed as a tool that takes CC platforms one step closer towards SMP environments. By being able to allocate resources globally, and to distribute the workload dynamically and efficiently, it simplifies the use of CC's by relieving the user from the burden of managing the cluster-wide resources. This is particularly evident in multi-user, time-sharing environments and in non-uniform CC's.

2 What is Mosix

Mosix [4, 5] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive resource sharing algorithms. It allows multiple Uni-processors (UP) and SMP's (nodes) running the same kernel to work in close cooperation. The resource sharing algorithms of Mosix are designed to respond on-line to variations in the resource usage among the nodes. This is achieved by migrating processes from one node to another, preemptively and transparently, for load-balancing and to prevent thrashing due to memory swapping. The goal is to improve the overall (cluster-wide) performance and to create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications. The standard run-time environment of Mosix is a CC, in which the cluster-wide resources are available to each node. By disabling the automatic process migration, the user can switch the configuration into a plain CC, or even an MPP (single-user) mode.

The current implementation of Mosix is designed to run on clusters of X86/Pentium based workstations, both UP's and SMP's, that are connected by standard LANs. Possible configurations may range from a small cluster of PC's that are connected by Ethernet, to a high performance system with a large number of high-end, Pentium based SMP servers that are connected by a Gigabit LAN, e.g. Myrinet [6].
2.1 The technology

The Mosix technology consists of two parts: a Preemptive Process Migration (PPM) mechanism and a set of algorithms for adaptive resource sharing. Both parts are implemented at the kernel level, using a loadable module, such that the kernel interface remains unmodified. Thus they are completely transparent to the application level.

The PPM can migrate any process, at any time, to any available node. Usually, migrations are based on information provided by one of the resource sharing algorithms, but users may override any automatic system decisions and migrate their processes manually. Such a manual migration can either be initiated by the process synchronously or by an explicit request from another process of the same user (or the super-user). Manual process migration can be useful to implement a particular policy or to test different scheduling algorithms. We note that the super-user has additional privileges regarding the PPM, such as defining general policies, as well as which nodes are available for migration.

Each process has a Unique Home-Node (UHN) where it was created. Normally this is the node to which the user has logged-in. In PVM this is the node where the task was spawned by the PVM daemon. The system image model of Mosix is a CC, in which every process seems to run at its UHN, and all the processes of a user's session share the execution environment of the UHN. Processes that migrate to other (remote) nodes use local (in the remote node) resources whenever possible, but interact with the user's environment through the UHN. For example, assume that a user launches several processes, some of which migrate away from the UHN. If the user executes "ps", it will report the status of all the processes, including processes that are executing on remote nodes. If one of the migrated processes reads the current time, i.e. invokes gettimeofday(), it will get the current time at the UHN.

The PPM is the main tool for the resource management algorithms. As long as the requirements for resources, such as the CPU or main memory, are below certain thresholds, the user's processes are confined to the UHN. When the requirements for resources exceed some threshold levels, then some processes may be migrated to other nodes, to take advantage of available remote resources. The overall goal is to maximize the performance by efficient utilization of the network-wide resources.

The granularity of the work distribution in Mosix is the process. Users can run parallel applications by initiating multiple processes in one node, then allow the system to assign these processes to the best available nodes at that time. If during the execution of the processes new resources become available, then the resource sharing algorithms are designed to utilize these new resources by possible reassignment of the processes among the nodes. The ability to assign and reassign processes is particularly important for "ease-of-use" and to provide an efficient multi-user, time-sharing execution environment.

Mosix has no central control or master-slave relationship between nodes: each node can operate as an autonomous system, and it makes all its control decisions independently. This design allows a dynamic configuration, where nodes may join or leave the network with minimal disruptions. Algorithms for scalability ensure that the system runs as well on large configurations as it does on small configurations. Scalability is achieved by incorporating randomness in the system control algorithms, where each node bases its decisions on partial knowledge about the state of the other nodes, and does not even attempt to determine the overall state of the cluster or any particular node. For example, in the probabilistic information dissemination algorithm [4], each node sends, at regular intervals, information about its available resources to a randomly chosen subset of other nodes. At the same time it maintains a small "window", with the most recently arrived information. This scheme supports scaling, even information dissemination and dynamic configurations.
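As an illustration of this dissemination scheme, the following self-contained, user-level C sketch lets each node periodically send its resource state to a few randomly chosen nodes while keeping only a small window of the most recent reports. The cluster size, window size and the fields of node_info are illustrative assumptions, not Mosix internals.

    /* User-level sketch of probabilistic information dissemination:
     * each node periodically sends its resource state to a few
     * randomly chosen nodes and keeps only a small "window" of the
     * most recently received entries.  All sizes and fields are
     * illustrative assumptions, not Mosix kernel structures. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NODES   16   /* cluster size (assumed) */
    #define WINDOW   4   /* entries kept per node (assumed) */
    #define TARGETS  2   /* random recipients per interval (assumed) */

    struct node_info { int sender; double load; long free_mem_kb; };

    static struct node_info window[NODES][WINDOW]; /* per-node window */
    static int win_pos[NODES];

    /* Deliver one status message, overwriting the oldest window slot. */
    static void receive(int to, struct node_info msg)
    {
        window[to][win_pos[to] % WINDOW] = msg;
        win_pos[to]++;
    }

    /* One dissemination interval for node `me`. */
    static void disseminate(int me, double load, long free_mem_kb)
    {
        struct node_info msg = { me, load, free_mem_kb };
        for (int i = 0; i < TARGETS; i++) {
            int to = rand() % NODES;      /* partial, randomized view */
            if (to != me)
                receive(to, msg);
        }
    }

    int main(void)
    {
        srand(42);
        for (int t = 0; t < 100; t++)      /* regular intervals */
            for (int n = 0; n < NODES; n++)
                disseminate(n, (double)(rand() % 100) / 100.0,
                            rand() % (1 << 20));

        /* Node 0 bases its decisions only on its small window. */
        for (int i = 0; i < WINDOW; i++)
            printf("node 0 knows node %d: load %.2f, %ld KB free\n",
                   window[0][i].sender, window[0][i].load,
                   window[0][i].free_mem_kb);
        return 0;
    }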
2.2 The resource sharing algorithms

The main resource sharing algorithms of Mosix are the load-balancing and the memory ushering. The dynamic load-balancing algorithm continuously attempts to reduce the load differences between pairs of nodes, by migrating processes from higher loaded to less loaded nodes. This scheme is decentralized – all the nodes execute the same algorithms, and the reduction of the load differences is performed independently by pairs of nodes. The number of processors at each node and their speed are important factors for the load-balancing algorithm. This algorithm responds to changes in the loads of the nodes or the runtime characteristics of the processes. It prevails as long as there is no extreme shortage of other resources, e.g., free memory or empty process slots.

The memory ushering (depletion prevention) algorithm is geared to place the maximal number of processes in the cluster-wide RAM, to avoid as much as possible thrashing or the swapping out of processes [2]. The algorithm is triggered when a node starts excessive paging due to a shortage of free memory. In this case the algorithm overrides the load-balancing algorithm and attempts to migrate a process to a node which has sufficient free memory, even if this migration would result in an uneven load distribution.
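The interplay of the two algorithms can be sketched as follows; the thresholds, field names and target-selection rule are simplified assumptions rather than the actual Mosix heuristics.

    /* Simplified sketch of the per-node migration decision: memory
     * ushering overrides load balancing when a node starts paging
     * heavily.  Thresholds and field names are assumptions. */
    #include <stdio.h>

    struct node_state {
        int    id;
        double load;          /* normalized by CPU count and speed */
        long   free_mem_kb;
        long   page_rate;     /* recent paging activity */
    };

    #define PAGING_THRESHOLD 500   /* assumed */
    #define MIN_LOAD_GAIN    0.25  /* assumed */

    /* Pick a migration target from the locally known window of nodes,
     * or return -1 to keep the process where it is. */
    static int pick_target(const struct node_state *self,
                           const struct node_state *known, int n,
                           long proc_mem_kb)
    {
        int best = -1;

        if (self->page_rate > PAGING_THRESHOLD) {
            /* Memory ushering: find a node with enough free memory,
             * even if this leaves the load uneven. */
            for (int i = 0; i < n; i++)
                if (known[i].free_mem_kb > proc_mem_kb &&
                    (best < 0 ||
                     known[i].free_mem_kb > known[best].free_mem_kb))
                    best = i;
            return best < 0 ? -1 : known[best].id;
        }

        /* Load balancing: reduce the pairwise load difference. */
        for (int i = 0; i < n; i++)
            if (self->load - known[i].load > MIN_LOAD_GAIN &&
                (best < 0 || known[i].load < known[best].load))
                best = i;
        return best < 0 ? -1 : known[best].id;
    }

    int main(void)
    {
        struct node_state self = { 0, 2.5, 4096, 800 };
        struct node_state known[] = {
            { 1, 1.0,  65536, 10 },
            { 2, 0.5, 131072,  5 },
        };
        printf("migrate to node %d\n",
               pick_target(&self, known, 2, 32768));
        return 0;
    }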
3 Process migration

Mosix supports preemptive (completely transparent) process migration (PPM). After a migration, a process continues to interact with its environment regardless of its location. To implement the PPM, the migrating process is divided into two contexts: the user context – which can be migrated, and the system context – which is UHN dependent and may not be migrated.

The user context, called the remote, contains the program code, stack, data, memory-maps and registers of the process. The remote encapsulates the process when it is running in the user level. The system context, called the deputy, contains a description of the resources which the process is attached to, and a kernel-stack for the execution of system code on behalf of the process. The deputy encapsulates the process when it is running in the kernel. It holds the site-dependent part of the system context of the process, hence it must remain in the UHN of the process. While the process can migrate many times between different nodes, the deputy is never migrated.

The interface between the user context and the system context is well defined. Therefore it is possible to intercept every interaction between these contexts, and forward this interaction across the network. This is implemented at the link layer, with a special communication channel for interaction. Figure 1 shows two processes that share a UHN. In the figure, the left process is a regular Linux process while the right process is split, with its remote part migrated to another node.

Figure 1: A local process and a migrated process
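For illustration, the split just described can be pictured with the following C structures; all type and field names here are our own assumptions for exposition, not the actual Mosix kernel data structures.

    /* Illustrative picture of the process split: a migratable user
     * context (remote) and a UHN-bound system context (deputy),
     * joined by a dedicated communication channel.  All names are
     * assumptions, not Mosix kernel structures. */
    struct remote_ctx {              /* migrates with the process */
        void *code, *data, *stack;   /* memory image */
        void *memory_maps;
        long  registers[32];         /* user registers */
    };

    struct deputy_ctx {              /* never leaves the UHN */
        int   open_files[256];       /* descriptions of attached resources */
        void *kernel_stack;          /* for system code run on its behalf */
        int   channel_fd;            /* link-layer connection to the remote */
    };

    struct mosix_process {
        struct remote_ctx remote;    /* runs user-level code, possibly away */
        struct deputy_ctx deputy;    /* handles site-dependent interactions */
    };

    int main(void) { struct mosix_process p; (void)p; return 0; }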

The migration time has a fixed component, for establishing a new process frame in the new (remote) site, and a linear component, proportional to the number of memory pages to be transferred. To minimize the migration overhead, only the page tables and the process' dirty pages are transferred.
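This cost model can be written as a back-of-the-envelope estimate; the constants below are placeholders, not measured Mosix figures.

    /* Back-of-the-envelope form of the migration cost: a fixed setup
     * cost plus a term linear in the pages actually sent (page tables
     * and dirty pages only).  The constants are placeholder
     * assumptions, not measured Mosix values. */
    #include <stdio.h>

    static double migration_ms(long dirty_pages, long page_table_pages,
                               double fixed_ms, double per_page_ms)
    {
        return fixed_ms + (dirty_pages + page_table_pages) * per_page_ms;
    }

    int main(void)
    {
        /* e.g. 2000 dirty 4 KB pages over a fast LAN (assumed costs) */
        printf("estimated migration time: %.1f ms\n",
               migration_ms(2000, 8, 5.0, 0.05));
        return 0;
    }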
In the execution of a process in Mosix, location transparency is achieved by forwarding site-dependent system calls to the deputy at the UHN. System calls are a synchronous form of interaction between the two process contexts. All system calls that are executed by the process are intercepted by the remote site's link layer. If the system call is site-independent, it is executed by the remote locally (at the remote site). Otherwise, the system call is forwarded to the deputy, which executes the system call on behalf of the process in the UHN. The deputy returns the result(s) back to the remote site, which then continues to execute the user's code.
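Schematically, the dispatch at the remote site's link layer can be sketched as follows; the call names and their classification are illustrative assumptions only.

    /* Sketch of system-call dispatch at the remote: site-independent
     * calls run locally, the rest are forwarded to the deputy at the
     * UHN.  The call names and classification are assumptions. */
    #include <stdio.h>
    #include <stdbool.h>

    enum sys_call { SYS_LOCAL_ONLY, SYS_OPEN, SYS_GETTIMEOFDAY };

    static bool site_independent(enum sys_call c)
    {
        switch (c) {
        case SYS_LOCAL_ONLY:
            return true;           /* touches no UHN resource (assumed) */
        case SYS_OPEN:             /* files live in the UHN environment */
        case SYS_GETTIMEOFDAY:     /* must return the UHN's time */
        default:
            return false;
        }
    }

    static long run_locally(enum sys_call c)
    { printf("executed at the remote site: %d\n", c); return 0; }

    static long forward_to_deputy(enum sys_call c)
    { printf("forwarded to the deputy: %d\n", c); return 0; }

    /* Every system call of a migrated process passes through here. */
    static long remote_syscall(enum sys_call c)
    {
        return site_independent(c) ? run_locally(c)
                                   : forward_to_deputy(c);
    }

    int main(void)
    {
        remote_syscall(SYS_LOCAL_ONLY);
        remote_syscall(SYS_GETTIMEOFDAY);  /* returns the time at the UHN */
        return 0;
    }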
Other forms of interaction between the two process contexts are signal delivery and process wakeup events, e.g. when network data arrives. These events require that the deputy asynchronously locate and interact with the remote. This location requirement is met by the communication channel between them. In a typical scenario, the kernel at the UHN informs the deputy of the event. The deputy checks whether any action needs to be taken, and if so, informs the remote. The remote monitors the communication channel for reports of asynchronous events, e.g., signals, just before resuming user-level execution. We note that this approach is robust, and is not affected even by major modifications of the kernel. It relies on almost no machine dependent features of the kernel, and thus does not hinder porting to different architectures.
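The remote side of this event path can be sketched as follows; the message format and helper functions are assumptions, not the actual Mosix channel protocol.

    /* Sketch of the remote side of the asynchronous-event path: just
     * before returning to user level, the remote polls the channel
     * for events (e.g. signals) reported by the deputy.  The message
     * format and helpers are assumptions. */
    #include <stdio.h>
    #include <stdbool.h>

    struct event { int type; int signo; };

    /* Pretend one pending signal report arrived from the deputy. */
    static bool channel_poll(struct event *ev)
    {
        static int pending = 1;
        if (!pending) return false;
        pending = 0;
        ev->type = 1; ev->signo = 10;
        return true;
    }

    static void deliver_signal(int signo)
    { printf("delivering signal %d\n", signo); }

    /* Called on every return path to user level. */
    static void resume_user_level(void)
    {
        struct event ev;
        while (channel_poll(&ev))      /* drain reports from the deputy */
            if (ev.type == 1)
                deliver_signal(ev.signo);
        printf("resuming user-level execution\n");
    }

    int main(void) { resume_user_level(); return 0; }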
One drawback of the deputy approach is the extra overhead in the execution of system calls. Additional overhead is incurred on file and network access operations. For example, all network links (sockets) are created in the UHN, thus imposing communication overhead if the processes migrate away from the UHN. To overcome this problem we are developing "migratable sockets", which will move with the process, and thus allow a direct link between migrated processes. Currently, this overhead can be significantly reduced by an initial distribution of communicating processes to different nodes, e.g. using PVM/MPI. Should the system become imbalanced, the Mosix algorithms will reassign the processes to improve the performance [3].
4 The implementation

The porting of Mosix to Linux started with a feasibility study. We also developed an interactive kernel debugger, a prerequisite for any project of this scope. The debugger is invoked either by a user request, or when the kernel crashes. It allows the developer to examine kernel memory, processes, stack contents, etc. It also allows the developer to trace system calls and processes from within the kernel, and even to insert break-points in the kernel code.

In the main part of the project, we implemented the code to support the transparent operation of split processes, with the user context running on a remote node, supported by the deputy, which runs in the UHN. At the same time, we wrote the communication layer that connects the two process contexts and designed their interaction protocol. The link between the two contexts was implemented on top of a simple, but exclusive, TCP/IP connection. After that, we implemented the process migration mechanism, including migration away from the UHN, back to the UHN and between two remote sites. Then, the information dissemination module was ported, enabling the exchange of status information among the nodes. Using this facility, the algorithms for process assessment and automatic migration were also ported. Finally, we designed and implemented the Mosix application programming interface (API) via the /proc file system.
4.1 Deputy / Remote mechanisms

The deputy is the representative of the remote process at the UHN. Since the entire user space memory resides at the remote node, the deputy does not hold a memory map of its own. Instead, it shares the main kernel map similarly to a kernel thread.

In many kernel activities, such as the execution of system calls, it is necessary to transfer data between the user space and the kernel. This is normally done by the copy_to_user() and copy_from_user() kernel primitives. In Mosix, any kernel memory operation that involves access to user space requires the deputy to communicate with its remote to transfer the necessary data.

The overhead of the communication due to remote copy operations, which may be repeated several times within a single system call, could be quite substantial, mainly due to the network latency. In order to eliminate excessive remote copies, which are very common, we implemented a special cache that reduces the number of required interactions by prefetching as much data as possible during the initial system call request, while buffering partial data at the deputy to be returned to the remote at the end of the system call.
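The idea behind this cache can be sketched, much simplified and at user level, as follows; the buffer sizes and function names are assumptions.

    /* User-level sketch of the remote-copy cache idea: instead of one
     * network round trip per user-space access, the deputy prefetches
     * the whole user buffer once and serves later accesses from the
     * cached copy.  Sizes and names are assumptions. */
    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE 4096

    static char cache[CACHE_SIZE];
    static size_t cache_len;
    static int round_trips;

    /* Stand-in for fetching user memory from the remote over the channel. */
    static void fetch_from_remote(const char *user_buf, size_t len)
    {
        round_trips++;
        if (len > CACHE_SIZE)
            len = CACHE_SIZE;
        memcpy(cache, user_buf, len);
        cache_len = len;
    }

    /* Deputy-side access to user data, served from the prefetched copy. */
    static void copy_from_user_cached(char *dst, size_t off, size_t len,
                                      const char *user_buf, size_t user_len)
    {
        if (cache_len == 0)
            fetch_from_remote(user_buf, user_len);   /* one prefetch */
        memcpy(dst, cache + off, len);
    }

    int main(void)
    {
        char user_buf[] = "argument data living at the remote node";
        char a[8], b[8];

        copy_from_user_cached(a, 0, 8, user_buf, sizeof(user_buf));
        copy_from_user_cached(b, 9, 8, user_buf, sizeof(user_buf));
        printf("served two accesses with %d round trip(s)\n", round_trips);
        return 0;
    }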
To prevent the deletion or overriding of memory-mapped files (for demand-paging) in the absence of a memory map, the deputy holds a special table of such files that are mapped to the remote memory. The user registers of migrated processes are normally under the responsibility of the remote context. However, each register, or combination of registers, may become temporarily owned for manipulation by the deputy.

Remote (guest) processes are not accessible to the other processes that run at the same node (locally or originated from other nodes), and vice versa. They do not belong to any particular user (on the remote node, where they run) nor can they be sent signals or otherwise manipulated by local processes. Their memory can not be accessed and they can only be forced, by the local system administrator, to migrate out.

A process may need to perform some Mosix functions while logically stopped or sleeping. Such processes would run Mosix functions "in their sleep", then resume sleeping, unless the event they were waiting for has meanwhile occurred. An example is process migration, possibly done while the process is sleeping. For this purpose, Mosix maintains a logical state, describing how other processes should see the process, as opposed to its immediate state.

4.2 Migration constraints

Certain functions of the Linux kernel are not compatible with process context division. Some obvious examples are direct manipulations of I/O devices, e.g., direct access to privileged bus-I/O instructions, or direct access to device memory. Other examples include writable shared memory and real-time scheduling. The last case is not allowed because its guarantees cannot be maintained across a migration, and because it would be unfair towards processes of other nodes.

A process that uses any of the above is automatically confined to its UHN. If the process has already been migrated, it is first migrated back to the UHN.
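A simplified sketch of this confinement rule follows; the flag names and helpers are assumptions, not Mosix kernel interfaces.

    /* Sketch of the confinement rule: a process that uses a feature
     * incompatible with context division is locked to its UHN and,
     * if necessary, first brought back home.  Flag names and helpers
     * are assumptions. */
    #include <stdio.h>
    #include <stdbool.h>

    struct proc {
        int  pid;
        bool uses_device_memory;    /* direct device / bus-I/O access */
        bool writable_shared_mem;
        bool realtime_sched;
        bool at_home;               /* currently at its UHN? */
        bool locked_to_uhn;
    };

    static void migrate_home(struct proc *p) { p->at_home = true; }

    static void check_migration_constraints(struct proc *p)
    {
        if (p->uses_device_memory || p->writable_shared_mem ||
            p->realtime_sched) {
            if (!p->at_home)
                migrate_home(p);      /* bring it back first */
            p->locked_to_uhn = true;  /* confine to the UHN from now on */
        }
    }

    int main(void)
    {
        struct proc p = { 42, false, true, false, false, false };
        check_migration_constraints(&p);
        printf("pid %d locked to UHN: %s\n", p.pid,
               p.locked_to_uhn ? "yes" : "no");
        return 0;
    }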

4.3 Information collection

Statistics about a process' behavior are collected regularly, such as at every system call and every time the process accesses user data. This information is used to assess whether the process should be migrated from the UHN. These statistics decay in time, to adjust for processes that change their execution profile. They are also cleared completely on the "execve()" system call, since the process is likely to change its nature.

Each process has some control over the collection and decay of its statistics. For instance, a process may complete a stage knowing that its characteristics are about to change, or it may cyclically alternate between a combination of computation and I/O.
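The ageing of these statistics can be sketched as a simple exponential decay; the decay factor and the sampled quantities below are illustrative assumptions.

    /* Sketch of statistics ageing: usage counters are updated on each
     * sample and decayed over time, so old behaviour gradually loses
     * weight; execve() clears them completely.  The decay factor and
     * the sampled quantities are assumptions. */
    #include <stdio.h>

    struct proc_stats { double cpu_use; double io_rate; double remote_comm; };

    #define DECAY 0.9   /* assumed per-interval decay factor */

    static void sample(struct proc_stats *s, double cpu, double io,
                       double comm)
    {
        s->cpu_use     = DECAY * s->cpu_use     + (1.0 - DECAY) * cpu;
        s->io_rate     = DECAY * s->io_rate     + (1.0 - DECAY) * io;
        s->remote_comm = DECAY * s->remote_comm + (1.0 - DECAY) * comm;
    }

    static void reset_on_execve(struct proc_stats *s)
    {
        /* new program, new execution profile */
        s->cpu_use = s->io_rate = s->remote_comm = 0.0;
    }

    int main(void)
    {
        struct proc_stats s = { 0 };
        for (int i = 0; i < 20; i++)
            sample(&s, 0.8, 0.1, 0.0);     /* CPU-bound phase */
        printf("cpu %.2f  io %.2f\n", s.cpu_use, s.io_rate);
        reset_on_execve(&s);
        printf("after execve: cpu %.2f\n", s.cpu_use);
        return 0;
    }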
4.4 The Mosix API

The Mosix API has traditionally been implemented via a set of reserved system calls, which were used to configure, query and operate Mosix. In line with the Linux convention, we modified the API to be interfaced via the /proc file system. This also prevents possible binary incompatibilities of user programs between different Linux versions.

The API was implemented by extending the Linux /proc file system tree with a new directory, /proc/mosix. The calls to Mosix via /proc include: synchronous and asynchronous migration requests; locking a process against automatic migrations; finding where the process currently runs; finding out about migration constraints; system setup and administration; controlling statistics collection and decay; information about available resources on all configured nodes; and information about remote processes.
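For illustration, access through /proc might look as follows; note that only the /proc/mosix directory itself is specified here, so the file names and formats used below are hypothetical placeholders, not the actual Mosix API.

    /* Hypothetical illustration of driving Mosix through /proc.  The
     * file names "where" and "lock" are invented placeholders that
     * only show the style of access, not the real Mosix interface. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f;
        int node = -1;

        /* Ask where this process currently runs (hypothetical file). */
        f = fopen("/proc/mosix/where", "r");
        if (f) {
            if (fscanf(f, "%d", &node) == 1)
                printf("currently running on node %d\n", node);
            fclose(f);
        }

        /* Lock the process against automatic migration (hypothetical). */
        f = fopen("/proc/mosix/lock", "w");
        if (f) {
            fprintf(f, "1\n");
            fclose(f);
        }
        return 0;
    }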
5 Conclusions

Mosix brings the new dimension of scaling to cluster computing with Linux. It allows the construction of a high-performance, scalable CC from commodity components, where scaling does not introduce any performance overhead. The main advantage of Mosix over other CC systems is its ability to respond at run-time to unpredictable and irregular resource requirements by many users.

The most noticeable properties of executing applications on Mosix are its adaptive resource distribution policy and the symmetry and flexibility of its configuration. The combined effect of these properties implies that users do not have to know the current state of the resource usage of the various nodes, or even their number. Parallel applications can be executed by allowing Mosix to assign and reassign the processes to the best possible nodes, almost like an SMP.

The Mosix R&D project is expanding in several directions. We have already completed the design of migratable sockets, which will reduce the inter-process communication overhead. A similar optimization is migratable temporary files, which will allow a remote process, e.g. a compiler, to create temporary files in the remote node. The general concept of these optimizations is to migrate more resources with the process, to reduce the remote access overhead.

In another project, we are developing new competitive algorithms for adaptive resource management that can handle different kinds of resources, e.g., CPU, memory, IPC and I/O [1]. We are also researching algorithms for network RAM, in which a large process can utilize available memory in several nodes. The idea is to spread the process's data among many nodes, and rather migrate the (usually small) process to the data than bring the data to the process.

In the future, we consider extending Mosix to other platforms, e.g., DEC's Alpha or SUN's Sparc. Details about the current state of Mosix are available at URL http://www.mosix.cs.huji.ac.il.

Acknowledgment

This research was supported in part by the government of Israel.

References

[1] Y. Amir, B. Averbuch, A. Barak, R.S. Borgstrom, and A. Keren. An Opportunity Cost Approach for Job Assignment and Reassignment in a Scalable Computing Cluster. In Proc. PDCS '98, Oct. 1998.

[2] A. Barak and A. Braverman. Memory Ushering in a Scalable Computing Cluster. Journal of Microprocessors and Microsystems, 22(3-4), Aug. 1998.

[3] A. Barak, A. Braverman, I. Gilderman, and O. Laden. Performance of PVM with the MOSIX Preemptive Process Migration. In Proc. Seventh Israeli Conf. on Computer Systems and Software Engineering, pages 38-45, June 1996.

[4] A. Barak, S. Guday, and R.G. Wheeler. The MOSIX Distributed Operating System, Load Balancing for UNIX. In Lecture Notes in Computer Science, Vol. 672. Springer-Verlag, 1993.

[5] A. Barak and O. La'adan. The MOSIX Multicomputer Operating System for High Performance Cluster Computing. Journal of Future Generation Computer Systems, 13(4-5):361-372, March 1998.

[6] N.J. Boden, D. Cohen, R.E. Felderman, A.K. Kulawik, C.L. Seitz, J.N. Seizovic, and W-K. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29-36, Feb. 1995.

[7] Platform Computing Corp. LSF Suite 3.2, 1998.

[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM - Parallel Virtual Machine. MIT Press, Cambridge, MA, 1994.

[9] W. Gropp, E. Lusk, and A. Skjellum. Using MPI. MIT Press, Cambridge, MA, 1994.

[10] Red Hat. Extreme Linux, 1998.
