

Amril Nazir

A dissertation submitted in partial fulfillment


of the requirements for the degree of
Doctor of Philosophy
of the
University of London.

Department of Computer Science


University College London

June 25, 2010


Chapter 1

Introduction

From networks of workstations through to the Internet, the high-performance computing community has long advocated composing individual computing resources to solve the most demanding computational and data-intensive problems. In recent years this progression has been driven by the vision of "Grid computing" [Foster et al., 2001], where the computational power, storage capacity, and specialist functionality of arbitrary networked devices are to be made available on demand to any other connected device that is permitted to access them.
The Grid computing vision promises to provide the platform required for users to run a new and more demanding range of high performance computing (HPC) applications. Using grid technologies, it is now possible to assemble virtual computers with processing capabilities that rival those of high-cost, dedicated supercomputers. The advent of high-speed networks has also enabled the integration of computational resources that are geographically distributed and administered in different domains. Such possibilities offer a way to tackle the most complex computation-intensive challenges, such as protein folding, molecular dynamics, financial analysis, fluid dynamics, structural analysis and many others.

1.1 Problem Statement


During the last seven years, computational scientists have started to adopt grid computing [Berman et al., 2003] in earnest. This has, in part, been enabled by the increased computational power and available capacity of commodity equipment, and by the emergence of inter-organisational, national and international grid computing infrastructures, such as the EGEE (Enabling Grids for E-Science) project [Berlich et al., 2006]. The primary challenge offered by these computational grids is to allow scientists to exploit resources on a much more ambitious scale than previously envisaged. Grid computing aims to introduce the standards and middleware necessary for computational resources to be used across sites, in order to support the computational requirements of high performance computing (HPC) applications.
Existing grid middleware and toolkits, such as the Globus toolkit and gLite, have started paving the way towards a standardized global-scale grid infrastructure by providing a collection of services which aim to standardize various aspects of remote access, such as job submission, job scheduling and job management. To support complex scientific and high performance computing (HPC) applications, a grid workflow management engine and a grid workflow scheduler are provided as mechanisms to schedule the execution of tasks on grid computing resources that span multiple administrative domains across the grid. The workflow management engine provides the quality-of-service (QoS) support required to guarantee the execution of each task based on advance reservation. When scheduling, the workflow management engine produces a workflow model that outlines where and when each task component is to be processed. Based on this plan, reservations are then made on a number of participating grid resources. For the reservations to be effective, detailed knowledge is required of the amount of time each task execution will take. Approaches such as advance reservation and backfilling rely on precise information about job execution times to schedule jobs efficiently.
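To make concrete why such approaches depend on runtime estimates, the following minimal sketch (hypothetical code, not drawn from any grid middleware) shows an advance-reservation placement: a task can only be planned into a future slot if its estimated runtime is known, and an underestimate invalidates the agreed plan.

```python
# Minimal sketch of an advance-reservation check, assuming every job
# carries a runtime estimate. Names and structure are illustrative only.

def find_reservation(reservations, estimate, now):
    """Return the earliest start time at which a job with the given
    runtime estimate fits between existing reservations."""
    start = now
    for res_start, res_end in sorted(reservations):
        if start + estimate <= res_start:   # fits in the gap before this one
            return start
        start = max(start, res_end)         # otherwise skip past it
    return start                            # or append after the last one

# With reservations at [10, 30) and [45, 120) minutes, a 15-minute job is
# planned into the 30-45 gap; a 60-minute job must wait until 120. A job
# that overruns its estimate collides with the next reservation and, as
# described above, would have to be terminated.
slots = [(10, 30), (45, 120)]
print(find_reservation(slots, estimate=15, now=0))   # -> 30
print(find_reservation(slots, estimate=60, now=0))   # -> 120
```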

This approach, however, poses a significant problem for an emerging class of adaptive, real-time, interactive high performance computing (HPC) applications1, because the processing requirements of these applications change and fluctuate throughout their execution. In most cases, it is often not possible to predict job execution times, because of user interaction or unpredictable transitions between states representing a very broad spectrum of resource loads. Workflows are not suitable for these applications because late completions are not allowed: they would interfere with the reservation plan already agreed by all parties. Workflows that do not adhere to their deadlines will consequently have to be terminated, with potentially significant losses in efficiency for such applications. This thesis therefore takes a drastic departure from previous work: we do not assume any knowledge about job execution times, thereby freeing users of the need to estimate them. This, however, mandates an alternative approach: to make scheduling decisions at the time of task execution, which avoids the need to reserve resources for each job in a future time slot. Such on-demand allocation is currently not feasible in grid environments, however, because it may produce a poor schedule: the grid is a dynamic environment where the utilization and availability of resources vary over time, and resources from participating sites can unpredictably join and leave at any time.

1 This type of application is also referred to in this thesis as an urgent high performance computing application.

This thesis addresses the above problems by introducing the possibility of managing distributed resources in smaller units. These units, whether nodes or capacity, can be rented from global grids as required. Once rented, the nodes are moved outside the management of the resource providers (e.g., grid management systems) and are managed by an entity known as a service provider. The creation of such a unit maintains the global concept while at the same time introducing a local one. This concept has many benefits. Firstly, individual users no longer need to be recognised globally. The organisation they belong to can hire equipment and create a local service to serve its own users and/or applications. As a result, it will be able to schedule with a minimum of hassle, as there are no competing scheduling authorities and the resource pool is limited. This also provides users with the isolated, customised execution environments they need, and it simplifies resource administration. Secondly, an immediate opportunity that this environment offers is for users to outsource work to a third-party provider. This outsourcing concept has many benefits: users avoid the hassle and expense of maintaining their own equipment and can provision resources specifically for peak loads. Finally, the system can use the nodes' processing capabilities more efficiently, because resources are managed in significantly smaller units than in global grids, while at the same time retaining full control.
Incorporating such a concept within current grid systems does not require any major changes. The only requirement is access to resources, i.e. nodes, machines and processors, for long periods of time (from a few hours to several days or even a week). In such a scenario, nodes are abstracted as a service and given to applications or users. When the nodes are rented out, the owner loses the use of these nodes and is therefore paid a rental fee. During the rental period, the service provider decides how it would like to utilise the nodes.
Since the system can be tailored accordingly, dynamic runtime negotiation for nodes is also promoted. This enables users to customise their applications with a set of distinct resource types and amounts to form ideal node configurations based on current load demand. Moreover, temporary and unexpected spikes in demand for nodes can be accommodated by flexible rental arrangements. The thesis defines the principles of these rental arrangements, or resource renting, from the perspective of the service provider. The nature of the problem is such that the demand for resources and the processing times of individual application requests are unknown, or estimates of them are unreliable. The service provider faces conflicting goals: renting sufficient computing nodes to provide an adequate level of application satisfaction while keeping the cost of renting to a minimum. Renting too few nodes results in long wait times and application dissatisfaction. There is thus a need to balance the cost of satisfying customer demand against the cost of renting computing resources. Therefore, this thesis presents a novel costing model that can be used to identify important costs and to provide a clear performance objective for the service provider. The cost model is useful for capacity planning and for the evaluation of resource scheduling in the presence of bursty and unpredictable demand.
Using the proposed cost model, the service provider can explicitly measure and manage the trade-off between the cost of renting computational nodes and the lost opportunity if customer demand is not met. To this end, the thesis proposes several rental heuristics and policies that make rental decisions based on the current and anticipated (future) workload. A rental policy provides the rules to decide what, when, and how many nodes to order. Such an approach allows a service provider to customise an ideal set of nodes to satisfy the current and future workload more effectively. For example, the service provider may rent a large number of nodes for the period covered when demand for nodes is very high. Alternatively, if demand is steady, the service provider may rent only a few nodes at any one time to reduce the rental cost. Overall, this provides greater flexibility, which allows such systems to operate in dynamic environments such as existing Grids and, potentially, commercial service Clouds.
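As a minimal illustration, the sketch below shows the shape of such a decision rule; the function name, inputs, and thresholds are our own assumptions rather than the thesis's actual heuristics, which are developed in later chapters.

```python
# Sketch of a rental policy deciding how many nodes to order. All names
# and thresholds are illustrative assumptions, not the thesis's heuristics.

def nodes_to_rent(pending_demand, rented, forecast, spike_factor=1.5):
    """pending_demand: nodes requested by queued jobs;
    rented: nodes currently under rental contracts;
    forecast: anticipated near-term demand in nodes."""
    shortfall = max(0, pending_demand - rented)
    if forecast > spike_factor * rented:
        # Anticipated spike: rent ahead of demand for the period covered.
        return max(shortfall, forecast - rented)
    # Steady demand: rent only a few nodes at a time to limit rental cost.
    return min(shortfall, 2)

print(nodes_to_rent(pending_demand=12, rented=8, forecast=20))  # -> 12
print(nodes_to_rent(pending_demand=12, rented=8, forecast=9))   # -> 2
```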

1.2 Research Contributions


The goal of this thesis is to propose effective mechanisms that are able to support the QoS requirements of
a large class of HPC applications, ranging from best-effort to adaptive, interactive parallel applications.

This thesis makes the following contributions:

1. We introduce a novel service provider approach, or rental-based system, for supporting urgent high-performance computing applications with minimal cost and infrastructure. The approach resolves the issues identified above by aggregating worldwide computing resources to form a localised rental-based management system. The framework decomposes rental-based systems into three distinct tiers which map well onto the structure of modern grid computing systems. The first tier offers the ability for applications to interact directly with the rental management system using simple API calls. The calls are handled by an application agent (AA), which resides between the application and the rental management system and enables dynamic resource allocation at the application level, so that the application's execution benefits from the system's dynamic, on-demand nature. The second tier is a service provider that makes use of the QoS information provided by the AA and appropriately schedules application jobs based on job descriptions/requirements, SLA agreements, and resource costs. A local scheduler uses information provided by the AA to estimate demand. The third tier is a negotiator that obtains resources from resource providers, which offer shared pools of ready-to-use compute resources. The three-tier approach essentially separates the roles of managing application processing requirements, scheduling application jobs, and managing distributed resources.

2. We present the fundamental requirements that a service provider, or rental-based system, must meet in order to support the QoS requirements of a variety of HPC applications. Based on these requirements, we further present an architectural framework for a rental management system that allows autonomic and dynamic renting of resources for high performance computing applications. We demonstrate how such systems are capable of providing rapid provisioning of computing resources and of supporting dynamic resource negotiation at run-time.

3. We describe HASEX, a prototype implementation of a rental-based system based on the proposed architectural framework. A key requirement is the ability for a service provider to dynamically negotiate for resources at runtime with external resource providers. We bring to the forefront the specific design decisions made to address the fundamental requirements of rental-based systems. We define the key features of HASEX and describe how they are implemented in practice. The implementation is used to demonstrate, through experiment, that its performance is competitive with that of a dedicated, static HPC cluster system.

4. We determine the cost-benefits of the rental management approach versus dedicated HPC systems. Based on an analysis of an existing grid infrastructure, the international EGEE Grid [Berlich et al., 2006], we identify the real cost of current grid infrastructure and propose alternative mechanisms by which jobs can be managed in decentralised environments, providing new avenues for agility, service improvement and cost control. We detail the specific costs of the EGEE Grid and of the Lawrence Livermore National Laboratory (LLNL) HPC system. The EGEE Grid is currently a world-leading production Grid spanning Europe and the world, while the LLNL system is a large Linux cluster installed at the Lawrence Livermore National Laboratory that is currently used to run a broad class of HPC applications, including best-effort, adaptive, and interactive applications. With these performance and monetary cost-benefits in mind, we demonstrate the performance of a rental management system compared to a dedicated HPC system.

5. We present a comprehensive costing model for the evaluation of a service provider. The proposed cost model provides the mechanism for the service provider to quantitatively evaluate its conflicting objectives of minimising operating and rental-related costs subject to an application satisfaction-level constraint. Effectively, the evaluation metric is profit: the trade-off between earning monetary value and paying the rental cost. Although service providers and resource providers act independently, changes in behaviour by a resource provider may influence how a service provider makes decisions, and vice versa. For example, a resource provider may choose to support long rental contracts, and therefore charge a lower unit rental fee for long contracts than for short ones. This will in turn impact how the service provider makes rental decisions. Both have their own expectations and strategies: users adopt the strategy of solving their problems at low cost within a required timeframe, and resource providers adopt the strategy of obtaining the best possible return on their investment. The responsibility of a service provider is to offer a competitive service access cost in order to attract users. It may have the option of choosing the providers that best meet users' requirements. The proposed cost model can be used to evaluate the cost and effectiveness of specific adopted strategies from the perspectives of both the service provider and the resource provider.

6. We propose and examine several cost-aware scheduling and rental policies that incorporate execution deadlines and monetary values when making scheduling and rental decisions. The provision of sophisticated rental policies is essential for the economic viability of a service provider. A rigid rental policy is examined whereby additional nodes are rented exactly according to estimates of demand. We further introduce more sophisticated aggressive and conservative heuristics that operate in a reactionary mode, where a service provider does not rent additional nodes until specific conditions have been reached, e.g., when there is a sudden increase in demand or when available nodes are running low. We explore how these policies can be improved further by taking into account job deadlines, monetary values, system revenue and system profitability, and examine how load, job mix, job values, job deadlines, node heterogeneity, lease duration, node lead time, job sizes, and rental price influence the service provider's profit. We also examine the impact of uncertainty in demand, uncertainty in resource availability, and uncertainty in the prices charged by resource providers. Our results provide insight into the benefits of possible optimisations and are a step towards understanding the balance between satisfying customer demand and the cost of renting computing resources. The investigated policies serve as a foundation for improving productivity and return on investment, satisfying demand without a heavy upfront investment and without the cost of maintaining idle resources.

7. We propose a service level agreement management (SLAM) framework that can be used to enhance application quality-of-service (QoS) while minimising rental costs. Rental policies that operate in a reactionary mode are not sufficient for highly variable or chaotic workloads. Simple per-job information does not capture the true intent of the user, leaving the service provider to do the best it can, but risking unhappy users, under-utilised nodes, or both. An additional control is proposed by means of service level agreements (SLAs), or long-term contracts, which specify the service to be provided, its quality and quantity levels (e.g., the load that the user can impose), price, and penalties for non-compliance. With these, a service provider can optimise its rental decisions because there will be few unexpected surprises, and it will be able to respond to unexpected demand more efficiently since an upper bound on workload can be established. We present and evaluate three new SLA-aware policies that can be incorporated in such a framework to improve scheduling and rental decisions.

The above contributions are very much complementary in nature. Used jointly, the resulting framework presents a unique set of characteristics that distinguish it from existing distributed, grid, and cluster systems: the result is an adaptive, self-centered approach to collaboration, allowing a service provider to construct a dynamic resource management environment and to operate that environment in the most cost-effective manner. Unlike centralized approaches or approaches based on fully autonomous behaviour, where independent participants operate mostly in isolation, our framework fosters collaboration without compromising site autonomy, through a separation of interests between the service provider and the resource provider. Furthermore, it is designed to ensure that significant benefit can be obtained even when the scope of deployment is limited, allowing it to be integrated into current distributed and grid environments incrementally, without major modifications to the environment itself and without the need for complete cooperation by all providers. This thesis will demonstrate that the use of these mechanisms can achieve significant gains for both the applications and the resource providers.

1.3 Scope and Assumptions


1.3.1 Quality of Service and Service Level Agreements
Users express utility as the budget, or amount of real money (as in the human world), they are willing to pay for a service [Buyya, 2002]. Real money is a well-defined currency [Lai, 2005] that promotes resource owner and user participation in distributed system environments. A user's budget, monetary or utility value, is limited by the amount of currency that he/she has, which may be distributed and administered through monetary authorities [Barmouta and Buyya, 2002]. In this thesis we focus mainly on resource allocation techniques to meet application QoS requirements and on their applicability in the context of a service provider that rents resources rather than owning them. On-demand provisioning of computing resources diminishes the need to purchase computing infrastructure, since renting arrangements provide the required computing capacity at low cost and low risk.
Resource providers may strategically behave in ways that keep their costs down and maximise their return on investment. Pricing2 policies consider the following question: "What should the provider charge for the resources?" However, pricing policies and market dynamics imposed by resource providers are beyond the scope of this thesis, although our rental heuristics are potentially compatible with their pricing schemes. This thesis does not venture further into other market concepts such as user bidding strategies [Wolski et al., 2001, Chun et al., 2005] and auction pricing mechanisms [Lai et al., 2004, Waldspurger et al., 1992, Das and Grosu, 2005].

2 We differentiate between pricing nodes and charging for services in this thesis.

1.3.2 Security Issues


Parallel computations that deal with geographically distributed computational resources need to establish security relationships not simply between a client and a server, but among potentially thousands of jobs or tasks distributed across different geographical locations or availability zones. These security-related issues have to be addressed before any proposed solution can be applied in practice. In this thesis, we assume the existence of a security infrastructure that authenticates and authorises users. Such an infrastructure should enable an authorised consumer or user to be granted access rights to computing resources on a remote site. We also do not concern ourselves with security policies, confidentiality, or data integrity issues. When discussing the prototype implementation of our architecture, we limit our discussion to the practical strategies that we employ to alleviate the problems of providing secure access to distributed resources behind firewalls.

1.3.3 Adaptive Applications


In this thesis, we are not interested in how application-level scheduling and application load-balancing mechanisms/policies are achieved; we assume that an application has reasonable knowledge of its inner workings and internal structure to be able to manage its own load-balancing techniques to maximise its performance. During runtime, the application itself continuously measures and detects load imbalances and tries to correct them by redistributing data or changing the granularity of the problem through load balancing.
In static scheduling policies (e.g., [4], [11]), the number of processors allocated to a job is determined at allocation time. Once allocated, applications hold the processors assigned to them until they terminate (i.e., for the lifetime of the application), which can lead to low utilisation. The adaptive execution approach eliminates most of the problems associated with static approaches. The basic idea behind adaptive application execution is to make the jobs in the system share the processors as equally as possible. This is achieved by varying the number of processors allocated to an application during its execution. This means that additional processors may be added to an executing application or job when processors become available.
In such settings, we assume that applications are flexible enough to reconfigure themselves and adapt automatically to their underlying execution environments. In particular, when resources are added or removed during execution, we assume the application is capable of dynamically performing load balancing across existing and newly configured resources to make the most efficient use of them. For example, after adding a resource, either process migration or data load balancing may take place to take advantage of the newly added resource.

1.3.4 Grid Computing


We assume that geographically distributed resources operate within the context of an existing resource grid. Therefore, we also assume the existence of the core components of a grid system, such as a meta-scheduler, information registration and discovery services, security services, and distributed resource managers (DRMs)3. In this thesis, we focus on middleware and tools that operate within the context of existing resource infrastructure middleware.

1.3.5 Computational Resources


In this thesis we focus our research specifically on the provision and usage of computational resources, i.e. processors and CPUs. Providing QoS support for networks, storage systems, and other resource types is not explicitly covered or investigated in this thesis. Without loss of generality, we assume one processor (CPU) per node; an incoming job specifies how many CPUs it needs and executes one task (process) on each of the allocated CPUs.

1.4 Organisation
We end this introductory chapter with an outline of the remainder of the dissertation. The thesis is organised as follows:
Chapter 2 introduces the research background and related work from a wide variety of areas concerning distributed systems for high performance computing applications.
Chapter 3 sets the stage for the thesis by discussing the motivation behind the adoption of rental-based mechanisms. We then present the fundamental requirements that a rental-based system must meet in order to deliver QoS satisfaction for HPC applications.
Chapter 4 proposes a framework for building such a rental-based system. We also describe the simulator framework we use to simulate the behaviour of rental-based systems, and discuss in detail our simulation environments and experimental design, including the application scenarios, workload and traffic models, and performance metrics.
Chapter 5 presents HASEX, a prototype implementation of a rental-based system; we describe HASEX's key features and discuss how the implementation is realised.
Chapter 6 evaluates the cost-benefit of a rental-based system versus dedicated HPC systems.
Chapter 7 presents resource management strategies that incorporate cost-aware scheduling and rental policies, considering both exact and heuristic approaches.
Chapter 8 introduces the service level agreement management (SLAM) framework, which can be used to enhance overall application satisfaction. We propose three new SLA-aware policies that make use of SLA information to improve scheduling and rental decisions.
Finally, Chapter 9 concludes the thesis work and outlines directions for future work.

3 A DRM is also referred to as a cluster scheduler or a local scheduler.


Chapter 2

Motivation and Requirements

Grid computing can be considered a consolidated field in high performance computing. However, it still presents serious limitations from the point of view of urgent HPC applications. Response time is a particular handicap in such environments. Each administrative domain in a grid infrastructure has its own entities that take care of information flow and scheduling. All these entities introduce considerable delay before jobs start, which is a clear disadvantage for dynamic and interactive HPC applications.
In the previous chapter, we touched upon how current grid infrastructure lacks the mechanisms necessary to harness the power of geographically distributed resources worldwide for the HPC applications that need to leverage them. In particular, we discussed the problems of workflows in providing the QoS support required by adaptive, interactive, and parallel HPC applications. For these applications, processing requirements are difficult to predict, since the problem size can grow or shrink over time, and some of their tasks need to complete within scheduled times.
In the remainder of this chapter, we identify several significant shortcomings of current Grid models and outline the approaches we use to resolve them. Based on these approaches, we present the fundamental requirements that a service provider must meet in order to address these limitations.

2.1 Efficiency and Scalability Issues


It is very difficult to schedule applications in grid environments, owing to the existence of local schedulers in different organisations and of local scheduling policies on resources. In such an environment, a global Grid would need to consider a multitude of priorities and scheduling policies. Current Grid systems in large distributed environments rely on a centralised controller component, known as a meta-scheduler. A meta-scheduler operates in a centralised fashion to manage the individual local schedulers (Distributed Resource Managers) across participating sites.
In the previous chapter, we discussed how a meta-scheduler requires tight synchronisation of information flow from each participating site. Faced with a potentially large number of sites, the central scheduling point can become a scalability bottleneck for applications needing fast response. Given that the resources cross different administrative domains, there is no direct control over them. Workflows with advance reservation provide a mechanism to guarantee job execution on resources, but the approach is infeasible because it requires prior knowledge of job runtime estimates.
Our approach is to reduce the complexity of global scheduling optimisation problems. This can be achieved by introducing the possibility of managing distributed resources in much smaller units than in the traditional grid approach. The management of these smaller units is controlled by a service provider. A service provider defines a resource service that controls and manages its own local scheduling policy. It provides a localised, exclusive environment for application or job execution using virtual resources created from rented hardware, supported by a pool of resources that it may either purchase or rent. The rental arrangement is possible because the service provider makes agreements with other pool owners (resource providers) to rent some of their resources in times of high utilization, in exchange for rental and usage fees. As such, a consumer is given secure and controlled access to individually managed resources. Such an environment does not differ much, if at all, from that of a dedicated HPC cluster. Importantly, it removes the interaction between applications and the Distributed Resource Managers, since the system should look like any multiprocessor solution to the applications. The notion described here is similar to the concept of elastic computing in Cloud computing.
However, it is also important to achieve the above objectives without requiring major changes to the overall structure of grid systems. Such systems must be able to operate within the context of an existing resource grid, because of the high availability and the variety of resource types and amounts that are widely available from global grids.

2.2 Limited QoS Support


Current grid infrastructure was not designed to support Quality of Service (QoS) requirements specified during job submission, such as a deadline for completing a job. However, such requirements are crucial for adaptive, interactive and parallel HPC applications. The standard method for sharing a cluster among High Performance Computing (HPC) users is batch scheduling via Distributed Resource Managers (DRMs). With batch scheduling, users wanting to run applications submit job requests. Each request is placed in a queue and waits to be granted an allocation, that is, a subset of the cluster's compute nodes (nodes for short). The job, i.e., the running application, has exclusive access to these nodes for a bounded duration.
A problem with batch scheduling is that it inherently limits overall resource utilization, because jobs are allocated fixed (static) numbers of nodes. If a job uses only a fraction of its requested nodes (e.g., half of the processor cores), then the remainder is wasted. It turns out that this is the case for many jobs in HPC workloads. For example, in a 2006 log of a large Linux cluster [D. G. Feitelson], more than 27% of the jobs effectively use less than 50% of the nodes' CPU resources (due to highly dynamic task execution and the time spent performing I/O, network communication, and synchronization). This observation has been made repeatedly in the literature [Chowdhury et al., 1997, Park and Humphrey, 2008, Bucur and Epema, 2001].
Firstly, we aim to adopt a flexible execution approach to application-level scheduling. This approach differs from traditional scheduling approaches, where the number of processors allocated to a job is statically determined at allocation time and, once allocated, applications hold the processors assigned to them until they terminate (i.e., for the lifetime of the application). With the flexible execution approach, applications can dynamically request additional processors or release some of their allocated processors during their execution. There are several motivations for this. Our primary motivation is to provide extensive QoS support for different types of HPC applications. In modern distributed computing, an application can be dynamic, and its execution behaviour cannot be predicted in advance, i.e. its resource requirements may change in the middle of execution. This is an important requirement for ensuring the adoption of the grid for everyday usage by users who wish to launch computations from their desktops. In previous work [Liu et al., 2009], we introduced a software framework to support adaptive applications in distributed and parallel computing, but we soon realised that current grid systems lack the mechanisms to support such features, owing to the difficulty of providing rapid responses to requests and exclusive access to resources. Hence, up until now, only batch-type jobs have been adopted in grid systems, due to the lack of mechanisms for such resource and QoS guarantees.
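A minimal sketch of what such a flexible execution interface might look like follows; the names request_processors and release_processors are hypothetical, introduced purely for illustration.

```python
# Hypothetical sketch of a flexible (malleable) execution interface: the
# application requests and releases processors mid-run. The call names
# are assumptions for illustration, not an actual middleware API.

class MalleableApplication:
    def __init__(self, processors):
        self.processors = processors        # processors currently held

    def request_processors(self, n):
        # In a real system this would trigger negotiation with the
        # service provider; here the request is simply granted.
        self.processors += n

    def release_processors(self, n):
        # Return processors no longer needed so they can be reused.
        self.processors = max(0, self.processors - n)

app = MalleableApplication(processors=16)
app.request_processors(8)     # problem size grows mid-execution -> 24
app.release_processors(12)    # load drops -> 12
print(app.processors)         # -> 12
```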

Secondly, such an adaptive HPC application requires an exclusive execution environment in order to support dynamic resource addition and release during its execution. To accommodate this requirement, it is desirable to have a mechanism that allows on-demand, rapid provisioning of additional resources to accommodate fluctuations in demand. A service provider can choose to rent resources that reflect the current load on the system. If it can predict the system load with reasonable accuracy over a short interval, it will be able to schedule with a minimum of hassle, because there are no competing scheduling authorities and the resource pool is limited. Furthermore, it will be able to keep idle resources to a minimum.

This, however, requires the capability for a service provider to expand and shrink in size depending on its customers' QoS requirements. It is envisaged that such a capability can be achieved by negotiation with third-party (external) resource providers. Negotiations consider when a service provider should rent resources and when it should release them, with the aim of meeting customers' QoS requirements. Similarly, the applications need the ability to reconfigure themselves with more and/or fewer processors at run-time to make full and efficient use of rented resources. Resource provision in our model is based on fine-grained, multiple requests, rather than a one-time allocation. Resources can be dynamically added to or removed from the service provider according to local needs, so as to meet consumers' QoS and/or service level agreements (SLAs). When there is high demand for resources, the service provider can negotiate and rent more resources during application execution to maintain its expected QoS performance objectives. Conversely, when demand is low, the service provider can release rented resources to remove the burden of unnecessary scheduling and management overhead. In this manner, the system can be tailored according to demand; newer and faster resources are allowed to replace slow resources if this is deemed more desirable in specific situations. For example, the system may choose to rent a larger number of less powerful nodes for best-effort jobs, because they have no strict QoS constraints. Alternatively, the system may choose to rent very fast nodes for interactive parallel jobs in order to meet their deadlines. Such flexibility is desirable to support the requirements of multiple consumers (applications) with different QoS (deadline) constraints.

2.3 Inadequate Performance Measures


Grid scheduling and resource allocation strategies are currently driven by system-centric parameters only. For example, they aim to increase processor throughput and utilization for the system, and to reduce the average waiting time and response time of jobs. They assume that all job requests are of equal importance to users and thus neglect the actual levels of service required by different users. As such, it is almost impossible to differentiate the level of QoS required by each application without a clearly defined performance objective.
Most importantly, service providers and resource providers do not operate in a vacuum: they react to each other. For example, resource providers may choose to support long rental contracts and therefore charge a lower rental fee than for shorter periods. This will in turn impact how the service provider builds its compute node configuration. Since there is more than one stakeholder involved (i.e., the consumers and the grid resource providers), we must be able to accommodate and protect the interests of both parties. This can be achieved with a comprehensive costing model that we can use to quantitatively evaluate the efficiency of both a service provider and a resource provider. Metrics for measuring the system and interaction protocol overheads incurred by both parties are also necessary to evaluate their efficiency.
It is envisaged that an economics-based approach can provide a solution: consumers (applications) express the value of their jobs as the price they will pay to have them executed, while the service provider incurs costs in executing the jobs. These costs derive from the need to purchase physical hardware resources and the cost of maintaining them (the operating cost). However, since the amount of work (jobs) that a service provider receives varies, it is less risky and more desirable to rent the resources we need from external providers rather than to own them. The service provider has two clear goals. It should try to minimise the rental cost, that is, the cost of obtaining and operating external resources; if this were the only objective, a service provider could simply rent the minimum number of nodes. However, it is equally important to satisfy the different levels of service required by a wide range of HPC applications, including both best-effort applications and more demanding interactive applications (which have specific QoS requirements). Hence, a penalty charge is necessary so that these two conflicting metrics can be balanced. Drawing from these observations, it is self-evident that the goal of the service provider is to maximise its profit. The profit is simply the difference between the price paid by the consumers (minus the penalty charges) and the cost of renting external resources. It is envisaged that the profit metric offers a clear, numerical measure of the net value added by a service.
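Stated as a formula (the symbols here are our own shorthand for the quantities just described):

\[
\text{profit} \;=\; \sum_{j \in J} v_j \;-\; \sum_{j \in J} \rho_j \;-\; C_{\text{rental}},
\]

where $J$ is the set of jobs served, $v_j$ is the price paid by the consumer for job $j$, $\rho_j$ is the penalty charge incurred on job $j$ (zero when its QoS is met), and $C_{\text{rental}}$ is the total cost of renting external resources.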

2.4 Requirements for a Service Provider


In the previous sections, we outlined our approaches to resolving these limitations. In this section, we present the fundamental requirements that a service provider needs to meet in order to achieve our goals.

2.4.1 Three-tier Approach


In conventional distributed and grid systems, a meta-scheduler is relied upon to control all resources worldwide and to make all decisions on resource allocation and scheduling of all received jobs. Faced with a potentially large number of sites, the central scheduling point represents a potential bottleneck for requests needing fast response. There is therefore a strong need to remove the centralised meta-scheduler, to avoid the creation of bottlenecks and to reduce the fine-grained management of requests and data transfers. In particular, the rental-based system (service provider) must scale to large numbers of computing resources. Because large numbers of resources are deployed in many applications, scaling to tens, hundreds, and even thousands of nodes is a relevant capability. Most importantly, systems must scale both upward and downward, performing well with reasonable effort at a variety of system scales.
This can only be achieved by deviating from the conventional centralised meta-scheduler paradigm that is often adopted in traditional grids. We aim to resolve the scalability issues by adopting a three-tier approach instead. The first tier offers the ability for applications to interact directly with the system using simple API calls. The calls are handled by the application agent (AA), which resides between the application and the system and enables dynamic resource allocation at the application level, so that the application's execution benefits from the system's adaptive, dynamic, on-demand nature. The second tier makes use of the quality of service (QoS) information provided by the AA and appropriately schedules application jobs based on job descriptions/requirements, SLA agreements, and resource costs. The third tier forms a shared pool of ready-to-use compute resources, rented from a worldwide pool of computing power. The three-tier approach essentially separates the roles of managing application processing requirements, scheduling application jobs, and managing distributed resources.
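The following sketch outlines how the three tiers might interact; all class and method names are hypothetical and are used only to make the separation of roles concrete.

```python
# Illustrative-only sketch of the three-tier separation of roles; none of
# these class or method names correspond to a real implementation.

class Negotiator:                      # third tier: obtains resources
    def rent(self, n):
        print(f"renting {n} nodes from a resource provider")
        return n

class ServiceProvider:                 # second tier: schedules jobs
    def __init__(self, negotiator):
        self.negotiator = negotiator
        self.free_nodes = 0

    def submit(self, job, qos):
        shortfall = job["nodes"] - self.free_nodes
        if shortfall > 0:
            self.free_nodes += self.negotiator.rent(shortfall)
        self.free_nodes -= job["nodes"]
        print(f"scheduled {job['name']} with deadline {qos['deadline']}s")

class ApplicationAgent:                # first tier: application-facing API
    def __init__(self, provider):
        self.provider = provider

    def run(self, name, nodes, deadline):
        self.provider.submit({"name": name, "nodes": nodes},
                             {"deadline": deadline})

agent = ApplicationAgent(ServiceProvider(Negotiator()))
agent.run("md-sim", nodes=32, deadline=600)
```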
In such a system, resource nodes are acquired by renting them based on their type and rental period. A node is supplied with a setup that enables the system to choose a specific operating system. Alternatively, the resource provider may support state-of-the-art virtual machine (VM) technology, allowing the system to install its own operating system on a physical hardware machine. By renting resources only when necessary, the overall system becomes more flexible and cost-effective in comparison to a conventional meta-scheduling approach. Because resources can be added and removed from the system at any time, resource levels can be anticipated to ensure that sufficient resources of the right types are rented with minimal administrative overhead and cost. The resulting product is a system that dynamically adds and removes resources based on local demand. Such flexibility could potentially enable computationally demanding distributed parallel applications to harness the combined power of geographically distributed resources.

2.4.2 Rental Policies


We envisage resource renting as a solution to the problems of efficiency and scalability. One possible approach is to acquire on-demand resources (nodes) at the peak of processing demand and release them when demand has fallen. In this manner, the system does not suffer the administrative overhead associated with managing large global systems. However, such systems can also lead to resources not being fully utilised if the environment is not properly managed. Unlike tangible goods, where ownership is completely transferred from seller to buyer, computational resource capacity simply represents the right to use a shared resource, where a lease agreement determines how resource sharing is initiated in time and space. Resource allocations vary mainly in the quantity of the allocation and the period of time over which that allocation is delivered. This applies both to CPU allocations and to specific application QoS constraints, e.g., ensuring that a job finishes within its deadline of one minute. Meeting such a QoS constraint requires sophisticated control of the level of sharing, including effective scheduling and rental decisions.

The system should effectively determine which types of nodes to rent, how many, and when. For example, if an application requires N nodes for a specified duration, the system can react immediately by renting exactly N nodes for the period covered. It is important to note, however, that renting resources is not necessarily a one-time activity that occurs at the start of application execution. The system may not always estimate the right amount of resources it needs, since individual application processing requirements may fluctuate at runtime. Therefore, the system must evaluate its rental decisions on a periodic or reactionary basis in response to sudden demand. For example, the system may rent a few nodes at a time until there is a sudden demand for additional nodes. In this way, the system may also optimise resource costs by allowing a single node to be used by multiple applications at different periods of their execution, which can significantly reduce node idle times and unnecessary overhead. To improve rental decisions further, the system should incorporate information on both execution deadlines and monetary values (see Section 2.4.2.1 and Section 2.4.2.2) when making scheduling and rental decisions. Nonetheless, careful rental decisions are important, owing to uncertainty in resource availability from resource providers. In particular, nodes may not be available at the time the system needs to rent them, and this uncertainty in the waiting times for obtaining nodes affects how a service provider makes a rental decision. Furthermore, pricing options can also influence rental decisions greatly. For example, if resource providers offer lower charges for long rental contracts than for short-term contracts, the service provider may find that a long-term core supplemented by short-term options is the best policy.
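As an illustration of the last point, the sketch below (with hypothetical prices and names) picks a mix of long-term and short-term contracts for an expected baseline and peak demand.

```python
# Sketch of a contract-mix decision: keep a long-term "core" of nodes for
# the persistent baseline load, and cover transient peaks with short-term
# rentals. Prices and names are assumed for illustration only.

def contract_mix(baseline, peak, long_rate=0.6, short_rate=1.0):
    """baseline: node count needed almost always;
    peak: worst-case node count; rates are cost per node-hour."""
    long_term = baseline                    # cheap rate, always paid
    short_term = max(0, peak - baseline)    # expensive rate, paid on demand
    steady_cost = long_term * long_rate
    peak_cost = steady_cost + short_term * short_rate
    return long_term, short_term, steady_cost, peak_cost

print(contract_mix(baseline=20, peak=50))
# -> (20, 30, 12.0, 42.0): a long-term core of 20 nodes plus up to 30
#    short-term nodes during peaks
```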

Therefore, a complete costing model is needed to capture the important overheads and metrics that must be considered by the service provider. The cost framework can be used for capacity planning and for evaluating resource scheduling in the presence of bursty and unpredictable demand. In particular, it can address the trade-off between the cost of rental and the lost opportunity if customer demand is not met, and should be incorporated into the rental decision-making process to identify and balance these two costs. In this manner, rental decisions can be made in a principled way, promoting a proactive, rather than ad-hoc, approach to performance measurement. The cost model relies on the following parameters, which should be provided directly by the application or by the application agent on its behalf:

2.4.2.1 Deadlines
The system needs to serve multiple running applications that compete for resources simultaneously. Moreover, these competing applications often have diverse characteristics in terms of their computation, communication and I/O requirements on the underlying system. They also impose diverse Quality-of-Service (QoS) demands (low response time for execution and interactivity, high throughput, and even bounded response times or guaranteed throughput), with different QoS parameters for each job, depending on how important the job is [Zhang et al., 2000]. A rental-based system should therefore be able to determine the correct priority of a job or task before it can decide how to prioritise the allocation of resources. Without this information, there is no meaningful way of inferring how valuable jobs are when deciding how to allocate resources.
There should be an absolute parameter that the system can use to prioritise competing application jobs with reasonable accuracy. One such parameter is the job deadline. A job deadline can be either soft or hard [Yeo and Buyya, 2006]. A hard deadline does not tolerate a missed deadline at all, whereas a soft deadline can tolerate misses as long as the total number of misses does not exceed the probability of misses specified by the application's QoS. In this thesis, we consider hard deadlines for the purpose of evaluation. We assume that deadlines are specified directly in application code, or determined by the application agent, which specifies the task deadline on the application's behalf.
Jobs that ask only for low response times or high throughput are referred to as best-effort (BE) jobs [Zhang et al., 2000], whereas jobs that require bounded response times or guaranteed throughput are referred to in this thesis as interactive jobs. Under this model, a best-effort job is likely to have a longer, soft deadline compared to an interactive job that needs its execution results within seconds or minutes.
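A minimal representation of this job model might look as follows (the field names are our own, introduced only to make the distinction concrete):

```python
# Sketch of the job model described above: hard deadlines are binding,
# and jobs are classed as best-effort (BE) or interactive. Field names
# are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    submit_time: float   # seconds
    deadline: float      # absolute time by which the job must finish
    interactive: bool    # True: bounded response time; False: best-effort

    def missed(self, finish_time: float) -> bool:
        # Under a hard deadline, any late completion counts as a miss
        # and the job would have to be terminated.
        return finish_time > self.deadline

sim = Job("md-sim", submit_time=0.0, deadline=60.0, interactive=True)
batch = Job("render", submit_time=0.0, deadline=3600.0, interactive=False)
print(sim.missed(75.0), batch.missed(75.0))   # -> True False
```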

2.4.2.2 Virtual Currency


Applications are relied upon to explicitly indicate the importance of their individual jobs by specifying the jobs' deadlines. However, by giving applications the option of specifying deadlines, the system is also open to abuse: applications could simply specify low (strict) deadlines for all their jobs. Current resource allocation strategies will not stop every user from claiming that his/her application needs the highest quality of service.
This problem can be resolved by using virtual currency [Rappa, 2004]. In order to obtain compute resources, users must obtain virtual currency. One possibility is to back virtual currency with real currency, which is referred to as "monetary value" in this thesis. The key feature of this approach is that it discourages free-riding and gaming by consumers. Consumers who claim a higher priority have to pay for it, so they have an incentive to reveal accurately how important priority is to them. In addition, variable charging allows consumers with a low budget and low time-sensitivity to run during low-demand periods; these consumers would otherwise not be able to run at all under a fixed-cost model. Conversely, at high-demand periods, users have a disincentive to run, but resources will nonetheless be available (for a high monetary value) to users who really need them. By introducing the virtual currency, or monetary value, parameter, applications have a way to indicate how important their tasks are. For example, a job with a high monetary value will be given higher execution priority than one with a low monetary value. Hence, all applications will be served according to their reflected values. This provides an incentive for applications to specify honest deadlines for their jobs.
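The resulting prioritisation can be as simple as ordering queued jobs by their offered monetary value, for example (a deliberately naive sketch; the thesis's actual policies are developed in later chapters):

```python
# Naive value-ordered queue: jobs offering more currency are scheduled
# first, with earlier deadlines breaking ties. Purely illustrative.

queue = [
    {"name": "render",  "value": 2.0, "deadline": 3600.0},
    {"name": "md-sim",  "value": 9.0, "deadline": 60.0},
    {"name": "analyse", "value": 9.0, "deadline": 300.0},
]

queue.sort(key=lambda j: (-j["value"], j["deadline"]))
print([j["name"] for j in queue])   # -> ['md-sim', 'analyse', 'render']
```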
There are also several potential issues with virtual currency. These include the stability of the system, which can be greatly affected if a few applications decide to save currency for long periods of time in order to amass a disproportionate amount of wealth. No amount of strategy-proof design can prevent such an application from dominating control of all resources, since a consumer can over-specify the actual value of each task [Chun et al., 2005]. These issues are beyond the scope of this thesis. In this thesis, we assume that the value assigned to each task represents the true monetary value of the deadline. The user, or the application agent acting on its behalf, is solely responsible for specifying a value for each job based on the task deadline. From this information, it is then the system's responsibility to ensure that computing resources are sufficiently available to serve consumers' processing and QoS requirements.

2.4.3 Service Level Agreements (SLA)


In current grid systems, a job description defines the requirements that must be satisfied by the scheduler in order to successfully execute the job. These requirements typically include parameters such as the job's estimated runtime, its requested number of nodes, and its deadline. Such a description is intended to be used by the schedulers or services that are responsible for allocating resources.
However, the description of an individual job alone is not sufficient for the system to capture the true intent of the application. We argue that to achieve the required balance between two conflicting objectives (satisfying QoS versus minimising cost), scheduling and rental decisions must be made with knowledge of the anticipated future workload, not by relying solely on individual job information. Since resource requirements can vary considerably, the system would otherwise have to second-guess decisions to rent additional nodes, and sometimes those decisions will prove inaccurate or will change over time. This puts the applications' QoS at risk of being dissatisfied, or it may lead to over-optimistic rental decisions.
An additional control is therefore needed so that the system has access to information about overall application behaviour. The required QoS information must come from the application itself. A mechanism is therefore needed that enables the application to express its overall processing characteristics and anticipated resource demand in a simplified form that the service provider can make use of. We anticipate that this control can be provided by means of service level agreements (SLAs), or long-term contracts. Unlike an individual job description, such SLA contracts include additional information such as the total load that the application may impose, the application's total offered monetary value, and its total running time. Such contracts may also include penalties for poor performance: agreement on how much the system should be penalised for not meeting specific QoS requirements, e.g., if the response time is too high for many jobs or if too many deadlines are missed. In this manner, the service provider can optimise rental decisions and plan ahead accordingly. This has the benefit of avoiding unexpected surprises in processing and QoS requirements, and enables the system to respond quickly to changes in demand.
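A contract of this kind could be represented along the following lines (a sketch under our own assumptions; the field names are not drawn from any standard SLA schema):

```python
# Sketch of a long-term SLA contract as described above. All fields and
# names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class SLAContract:
    consumer: str
    max_load: int                  # upper bound on concurrent nodes imposed
    total_value: float             # total monetary value offered
    duration_hours: float          # total running time covered
    deadline_miss_penalty: float   # charge per missed job deadline

    def penalty(self, missed_deadlines: int) -> float:
        # Penalty owed by the service provider for poor performance.
        return missed_deadlines * self.deadline_miss_penalty

sla = SLAContract("physics-group", max_load=64, total_value=500.0,
                  duration_hours=72.0, deadline_miss_penalty=2.5)
print(sla.penalty(missed_deadlines=4))   # -> 10.0
```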

2.5 Chapter Summary


In this chapter, we have set out the motivation and the high-level requirements that will be adhered to and investigated in the remainder of this thesis. There are, of course, many areas that have not been specifically addressed in this chapter. For example, one may wish to address issues of security, trust and communications. For the most part, however, it is assumed that these issues are addressed at the resource layer rather than at the scheduling layer, and that solutions for them will be inherited through the systems and tools in place. The primary requirements for a service provider can be summarised as:

• Three-tier Approach: A framework for a rental-based system is needed to support flexible runtime negotiation for reconfigurable applications and dynamic negotiation for additional resources from third-party (external) resource providers.

• Rental Policies: The provision of sophisticated rental heuristics that consider both execution deadlines and monetary values is essential to ensure the economic viability of rental-based systems. Economic viability should be measured using a comprehensive cost model, which enables the service provider to quantitatively evaluate its conflicting objectives of minimising operating and rental-related costs subject to application-level constraints.

• Service Level Agreements: A framework enabling consumers or users to negotiate long-term application/job execution contracts with the service provider is advocated. The provision of SLA-aware policies is also essential for a service provider to make efficient scheduling and rental decisions based on both per-contract and per-job information.

The above requirements are specifically motivated by practical experience with the deployment and use of current grid technologies for high performance computing projects over the last five years. This experience has allowed us to identify the difficulties of operating at grid scale and of reconciling the interests of participants from different administrative organisations. This has led to the cornerstone of our localised rental approach: optimising scheduling by compensating for the lack of direct control over resources.
Bibliography

Alexander Barmouta and Rajkumar Buyya. GridBank: A grid accounting services architecture (GASA) for
distributed systems sharing. In Proceedings of the 17th Annual International Parallel and Distributed
Processing Symposium (IPDPS 2003), pages 22–26. IEEE Computer Society Press, 2003.

Ruediger Berlich, Marcus Hardt, Marcel Kunze, Malcolm Atkinson, and David Fergusson. EGEE: Building
a pan-European grid training organisation. In Rajkumar Buyya and Tianchi Ma, editors, Fourth
Australasian Symposium on Grid Computing and e-Research (AusGrid 2006), volume 54 of CRPIT,
pages 105–111, Hobart, Australia, 2006. ACS.

Anca I. D. Bucur and Dick H. J. Epema. The influence of communication on the performance of co-
allocation. In JSSPP ’01: Revised Papers from the 7th International Workshop on Job Scheduling
Strategies for Parallel Processing, pages 66–86, London, UK, 2001. Springer-Verlag. ISBN 3-540-
42817-8.

Rajkumar Buyya. Economic-based distributed resource management and scheduling for grid computing.
CoRR, cs.DC/0204048, 2002.

Abdur Chowdhury, Lisa D. Nicklas, Sanjeev K. Setia, and Elizabeth L. White. Supporting dynamic
space-sharing on clusters of non-dedicated workstations. In Proceedings of the 17th International
Conference on Distributed Computing Systems, 1997.

B. N. Chun, P. Buonadonna, A. AuYoung, Chaki Ng, D. C. Parkes, J. Shneidman, A. C. Snoeren,
and A. Vahdat. Mirage: A microeconomic resource allocation system for sensornet testbeds. In
IEEE Workshop on Embedded Networked Sensors, pages 19–28, 2005. doi:
http://doi.ieeecomputersociety.org/10.1109/EMNETS.2005.1469095.

Anubhav Das and Daniel Grosu. Combinatorial auction-based protocols for resource allocation in grids.
In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Sym-
posium (IPDPS’05) - Workshop 13, page 251.1, Washington, DC, USA, 2005. IEEE Computer Soci-
ety. ISBN 0-7695-2312-9. doi: http://dx.doi.org/10.1109/IPDPS.2005.140.

D. G. Feitelson. Logs of real parallel workloads from production systems. URL
http://www.cs.huji.ac.il/labs/parallel/workload/logs.html.

Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the grid: Enabling scalable virtual
organizations. Int. J. High Perform. Comput. Appl., 15(3):200–222, 2001. ISSN 1094-3420. doi:
http://dx.doi.org/10.1177/109434200101500302.

Kevin Lai. Markets are dead, long live markets. SIGecom Exch., 5(4):1–10, 2005. doi:
http://doi.acm.org/10.1145/1120717.1120719.

Kevin Lai, Lars Rasmusson, Eytan Adar, Stephen Sorkin, Li Zhang, and Bernardo A. Huberman. Tycoon:
An Implementation of a Distributed Market-Based Resource Allocation System. Technical Report
arXiv:cs.DC/0412038, HP Labs, Palo Alto, CA, USA, December 2004.

Hao Liu, Amril Nazir, and Søren-Aksel Sørensen. A software framework to support adaptive applications
in distributed/parallel computing. In HPCC, pages 563–570, 2009.

Sang-Min Park and Marty Humphrey. Feedback-controlled resource sharing for predictable eScience. In
Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), pages 1–11, 2008. doi:
http://doi.ieeecomputersociety.org/10.1145/1413370.1413384.

M. A. Rappa. The utility business model and the future of computing services. IBM Syst. J., 43(1):
32–42, 2004. ISSN 0018-8670.

C.A. Waldspurger, T. Hogg, B.A. Huberman, J.O. Kephart, and W.S. Stornetta. Spawn: A distributed
computational economy. IEEE Transactions on Software Engineering, 18:103–117, 1992. ISSN
0098-5589. doi: http://doi.ieeecomputersociety.org/10.1109/32.121753.

Rich Wolski, James S. Plank, John Brevik, and Todd Bryan. Analyzing market-based resource allocation
strategies for the computational grid. Int. J. High Perform. Comput. Appl., 15(3):258–281, 2001. ISSN
1094-3420. doi: http://dx.doi.org/10.1177/109434200101500305.

Chee Shin Yeo and Rajkumar Buyya. A taxonomy of market-based resource management systems for
utility-driven cluster computing. Softw. Pract. Exper., 36(13):1381–1419, 2006. ISSN 0038-0644.
doi: http://dx.doi.org/10.1002/spe.v36:13.

Yanyong Zhang, Anand Sivasubramaniam, Jose Moreira, and Hubertus Franke. A simulation-based
study of scheduling mechanisms for a dynamic cluster environment. In ICS ’00: Proceedings of the
14th international conference on Supercomputing, pages 100–109, New York, NY, USA, 2000. ACM.
ISBN 1-58113-270-0. doi: http://doi.acm.org/10.1145/335231.335241.