
Availability Models in Practice

Archana Sathaye1, Srinivasan Ramani2, Kishor S. Trivedi2


1 Dept. of Mathematics and Computer Science
San Jose State University
San Jose, CA 95192-0103
sathaye@mathcs.sjsu.edu
Phone : (408) 924-5124
Fax : (408) 924-5080
2 Center for Advanced Computing and Communication


Dept. of Electrical and Computer Engineering
Duke University, Durham, NC 27708-0291
{kst, sramani}@ee.duke.edu
Phone : (919) 660-5269
Fax : (919) 660-5293

Abstract
As computer systems continue to be applied to mission-critical environments, techniques to evaluate
their dependability become more and more important. Of the dependability measures used to
characterize a system, availability is one of the most important. Techniques to evaluate a system's
availability can be broadly categorized as measurement-based and model-based. Measurement-based evaluation is expensive, as it requires building a real system, taking measurements, and
then analyzing the data statistically. Model-based evaluation, on the other hand, is inexpensive and
relatively easy to perform. In this paper, we first look at some availability modeling techniques
and take up a case study from an industrial setting to illustrate the application of the techniques
to a real problem. Although easier to perform, model-based availability analysis poses problems
such as the largeness and complexity of the models developed, which make the models difficult to solve.
This paper also illustrates several techniques to deal with largeness and complexity issues.

1 Introduction
Complex computer systems are widely used in applications ranging from flight control and
command and control systems to commercial systems such as information and financial services. These
applications demand high performance and high availability. Availability evaluation addresses the failure and recovery aspects of a system, while performance evaluation addresses the processing aspects
and assumes that the system components do not fail. For gracefully degrading systems, a measure that combines system performance and availability aspects is more meaningful than separate
measures of performance and availability. These composite measures are called performability measures. The two basic approaches to evaluating the availability/performance measures of a system are
measurement-based and model-based. In measurement-based evaluation, the required measures
are estimated from measured data using statistical inference techniques. The data is measured
from a real system or its prototype. In the case of availability evaluation, measurements are not always feasible, either because the system has not been built yet or because it is too expensive
to conduct experiments. In a high availability system, failures are rare, so one would need to measure data
from several systems to gather a good sample. Injecting faults into a system, on the other hand,
can be an expensive procedure. Model-based evaluation is the cost-effective solution, as it allows
system evaluation without having to build and measure a system. In this paper we discuss availability modeling techniques and their usage in practice. To emphasize the practicality of these
techniques, we discuss their pros and cons with respect to a case study, VAXcluster systems1 of
Digital Equipment Corporation (DEC2).
In this paper, we first discuss different availability modeling approaches in Section 2. In Section
3, we discuss the benefits of utilizing a composite availability and performance model in practice
instead of a pure availability model. Our discussion emphasizes this point using a model developed
for multiprocessors to determine the optimal number of processors in the system. In Section 4, we
present a case study to demonstrate the utility of availability modeling in a corporate environment.

2 Modeling Approaches
Model-based evaluation can be through discrete-event simulation, or analytic models, or hybrid
models combining simulation and analytic parts. A discrete-event simulation model can depict
detailed system behavior, as it is essentially a program whose execution simulates the dynamic
behavior of the system and evaluates the required measures. An analytic model consists of a set
of equations describing the system behavior. The evaluation measures are obtained by solving
these equations. In simple cases closed-form solutions are obtained, but more frequently numerical
solutions of the equations are necessary.
The main benefit of discrete-event simulation is the ability to depict detailed system behavior
in the models. Also, the flexibility of discrete-event simulation allows its use in performance, availability and performability modeling. The main drawback of discrete-event simulation is the long
execution time, particularly when tight confidence bounds are required in the solutions obtained.
Also, carrying out a "what if" analysis requires rerunning the model for different input parameters.
Advances in simulation speed-up such as regenerative simulation, importance sampling, importance
splitting, and parallel and distributed simulation also need to be considered.
Analytic models are more of an abstraction of the real system than a discrete-event simulation
model. In general, analytic models tend to be easier to develop and faster to solve than a simulation
model. The main drawback is the set of assumptions that are often necessary to make analytic
models tractable. Recent advances in model generation and solution techniques as well as computing
power make analytic models more attractive. In this paper, we discuss model-based evaluation using
analytic techniques and how one can achieve results that are useful in practice.
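The trade-off between the two approaches can be seen on the smallest possible example. The following is a minimal sketch, not taken from the paper: it simulates a single repairable component with exponentially distributed up and down times and compares the estimate with the closed-form steady-state availability MTTF / (MTTF + MTTR). The function names, the seed and the numerical values are illustrative assumptions.

```python
import random


def simulate_availability(mttf, mttr, cycles, seed=0):
    """Estimate steady-state availability of one repairable component by
    simulating alternating exponentially distributed up and down periods."""
    rng = random.Random(seed)
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(1.0 / mttf)    # time until the next failure
        down_time += rng.expovariate(1.0 / mttr)  # time until repair completes
    return up_time / (up_time + down_time)


def analytic_availability(mttf, mttr):
    """Closed-form steady-state availability of the same component."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical values: 100 hours between failures, 1 hour to repair.
    for cycles in (100, 1000, 20000):
        estimate = simulate_availability(100.0, 1.0, cycles, seed=1)
        print(cycles, estimate, analytic_availability(100.0, 1.0))
```

The analytic value is obtained in one step, whereas the simulated estimate only tightens as the number of simulated failure/repair cycles grows, which is the execution-time drawback noted above.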

2.1 Analytic Availability Modeling Approaches

A system modeler can choose either state space or non-state space analytical modeling techniques.
The choice of an appropriate modeling technique to represent the system behavior is dictated by
factors such as the measures of interest, the level of detailed system behavior to be represented and
the capability of the model to represent it, ease of construction, and availability of software tools
to specify and solve the model. In this section, we discuss several non-state space and state space
modeling techniques.

1 Now known as VMSclusters
2 Now known as COMPAQ

2.1.1 Non-State Space Models


Non-state space models can be solved without generating the underlying state space. Practically
speaking, these models can be easily used for solving systems with hundreds of components because
there are many relatively good algorithms available for solving such models [15]. The non-state
space models can be evaluated to compute measures like system availability, reliability and system
mean time to failure (MTTF). The two main assumptions used by the models are statistically
independent failures and independent repair units for components. The non-state space modeling
techniques used to evaluate system availability are reliability block diagrams and fault trees. In
gracefully degradable systems, a knowledge of the performance of the system is also essential. Non-state space modeling techniques used to evaluate system performance are product-form queuing
models and task-precedence graphs [9].

i. Reliability Block Diagram

In a reliability block diagram (RBD) each component of the system is represented as a block
[9, 12]. The blocks are then connected in series and/or parallel based on the operational dependency
between the components. If all the components need to be operational for the system to be up,
the blocks in an RBD are connected in series. On the other hand, if the system can survive with at least
one component, then the blocks are connected in parallel. An RBD can be used to model availability
if the repair times (and failure times) are all independent. Figure 1(a) shows a multiprocessor
availability model with n processors where at least one processor is required for the system to be
up. From this we conclude that the RBD represents a simple parallel system. Given a failure rate $\gamma$
and repair rate $\tau$, the availability of processor Proc_i is given by

\[ A_i = \frac{\tau}{\gamma + \tau}. \tag{1} \]

The availability of the parallel system is then given by

\[ A = 1 - \prod_{i=1}^{n} (1 - A_i) = 1 - \left( \frac{\gamma}{\gamma + \tau} \right)^{n}. \tag{2} \]

ii. Fault Trees

A fault tree [9], like a reliability block diagram, is useful for availability analysis. It is a pictorial
representation of the sequence of events/conditions to be satisfied for a failure to occur. A fault
tree uses and, or and k-of-n gates to represent this combination of events in a tree-like structure. To
represent situations where one failure event propagates failure along multiple paths in the fault tree,
fault trees can have repeated nodes. Several efficient algorithms for solving fault trees exist. These
include algorithms for series-parallel systems (for fault trees without repeated components) [17],
a multiple variable inversion (MVI) algorithm called the LT algorithm to obtain the sum of disjoint products
(SDP) from the mincut set [18], and the factoring/conditioning algorithm, which works by factoring a
fault tree with repeated nodes into a set of fault trees without repeated nodes [20]. Binary decision
diagram (BDD)-based algorithms can be used to solve very large fault trees [21, 22]. Figure 1(b)
shows the fault tree model for our multiprocessor system. $UA_i$ represents the unavailability of
processor i. The and gate indicates that the multiprocessor system fails when all n processors
become unavailable. The output of the top gate of the fault tree represents failure of the parallel
multiprocessor system.

Figure 1: A multiprocessor system. (a) Reliability block diagram, (b) Fault tree
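The parallel RBD of Figure 1(a) and the and-gate fault tree of Figure 1(b) describe the same system from opposite sides, so the availability from one and the unavailability from the other should sum to one. The sketch below illustrates this under the independence assumptions stated above; the function names and the numerical rates are hypothetical and chosen only for illustration.

```python
def component_availability(failure_rate, repair_rate):
    """Steady-state availability of one independently repaired component."""
    return repair_rate / (failure_rate + repair_rate)


def parallel_availability(availabilities):
    """RBD view: a parallel system is up unless every block is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def and_gate_unavailability(unavailabilities):
    """Fault tree view: the top event of an and gate occurs only if
    every input event (component failure) occurs."""
    top = 1.0
    for u in unavailabilities:
        top *= u
    return top


if __name__ == "__main__":
    # Hypothetical rates per hour, used only to exercise the formulas.
    a = component_availability(failure_rate=1.0 / 6000.0, repair_rate=1.0)
    for n in (1, 2, 3):
        rbd = parallel_availability([a] * n)
        ft = and_gate_unavailability([1.0 - a] * n)
        print(n, rbd, ft, rbd + ft)
```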

2.1.2 State Space Models


Reliability block diagrams and fault trees cannot easily handle more complex situations such as
failure/repair dependencies and shared repair facilities. In such cases, more detailed models such
as the state space models are required. Here we discuss some Markovian state space models.

i. Markovian Models
Markov Chains
In this section we will consider homogeneous Markov chains. A homogeneous continuous
time Markov chain (CTMC)[12] is a state space model, in which each state represents various
conditions of the system. In homogeneous CTMCs, transitions from one state to another
occur after a time that is exponentially distributed. The arcs representing a transition from
one state to another are labeled by the constant rate corresponding to the exponentially
distributed time of the transition. If a state in the CTMC has no transitions leaving it, then
that state is called an absorbing state, and a CTMC with one or more such states is said to be
an absorbing CTMC. For the multiprocessor example, we now illustrate how a Markov chain
can be developed to capture shared repair and multiple failure modes.
The parameters associated with the system availability model that we now develop for our
multiprocessor system are the failure rate $\gamma$ of each processor and the processor repair rate
$\tau$. A processor fault is covered with probability c and not covered with probability $1 - c$.
After a covered fault, the system is up in a degraded mode after a reconfiguration delay.
An uncovered fault, on the other hand, is followed by a longer delay imposed by a reboot
action. The reconfiguration and reboot delays are assumed to be exponentially distributed
with means $1/\delta$ and $1/\beta$ respectively. In practice the reconfiguration and reboot times are
extremely small compared to the times between failures and repairs, hence we assume that
failures and repairs do not occur during these actions. System availability can be modeled
using the Markov chain shown in Figure 2. For the system to be up, at least one processor
out of the n processors needs to be operational. The state i, $1 \le i \le n$, in the Markov
model represents that i processors are operational and $n - i$ processors are waiting for on-line
repair. The states $X_{n-i}$ and $Y_{n-i}$, for $i = 0, \ldots, n-2$, represent that the system is undergoing a
reconfiguration and is being rebooted respectively. We compute the steady state probability
$\pi_i$ for each state i [2]. Then the system unavailability, defined as a function of n, is given by
$UA(n) = 1 - \sum_{i=1}^{n} \pi_i$.

Figure 2: Multiprocessor Markov chain model
By appropriately choosing reward rates (or weights) for each state, the appropriate measure
can be obtained for the model at hand. For the multiprocessor example, if the
reward rates are defined as $r_i = 1$ for states $i = 1, \ldots, n$ and $r_i = 0$ otherwise, then the expected
steady state reward rate gives the steady state availability.
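The steady-state computation can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' tool: the transition structure is reconstructed from the prose description of Figure 2 (covered and uncovered faults leave state i for a reconfiguration state and a reboot state respectively, both return to state i - 1, and a single shared repair facility moves the chain from i to i + 1). The names gamma, tau, delta and beta stand for the failure, repair, reconfiguration and reboot rates, and the values of tau and delta are hypothetical.

```python
def multiprocessor_ctmc(n, gamma, tau, c, delta, beta):
    """Return (states, rates) for the assumed structure of the chain."""
    states = list(range(n + 1))
    rates = {}
    for i in range(2, n + 1):
        x, y = ("X", i), ("Y", i)
        states += [x, y]
        rates[(i, x)] = i * gamma * c          # covered fault
        rates[(i, y)] = i * gamma * (1.0 - c)  # uncovered fault
        rates[(x, i - 1)] = delta              # reconfiguration completes
        rates[(y, i - 1)] = beta               # reboot completes
    rates[(1, 0)] = gamma                      # last processor fails
    for i in range(n):
        rates[(i, i + 1)] = tau                # shared repair, one at a time
    return states, rates


def steady_state(states, rates):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    idx = {s: k for k, s in enumerate(states)}
    m = len(states)
    # Row k is the balance equation of state k; last column is the right side.
    a = [[0.0] * (m + 1) for _ in range(m)]
    for (src, dst), rate in rates.items():
        a[idx[dst]][idx[src]] += rate  # probability flow into dst
        a[idx[src]][idx[src]] -= rate  # probability flow out of src
    a[m - 1] = [1.0] * m + [1.0]       # replace one equation by normalization
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col and a[r][col] != 0.0:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return {s: a[idx[s]][m] / a[idx[s]][idx[s]] for s in states}


def unavailability(n, gamma, tau, c, delta, beta):
    """UA(n): one minus the expected reward with r_i = 1 for states 1..n."""
    states, rates = multiprocessor_ctmc(n, gamma, tau, c, delta, beta)
    pi = steady_state(states, rates)
    up = sum(p for s, p in pi.items() if isinstance(s, int) and s >= 1)
    return 1.0 - up


if __name__ == "__main__":
    for n in range(1, 6):
        ua = unavailability(n, gamma=1 / 6000, tau=1.0, c=0.9,
                            delta=360.0, beta=12.0)
        print(n, ua * 8760 * 60)  # downtime in minutes per year
```

Replacing the 0/1 reward rates in `unavailability` by other weights gives other measures from the same steady-state vector.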
Stochastic Petri Nets and Reward Nets [9]
A Petri net [9] is a more concise and intuitive way of representing a situation to be modeled.
It is also useful for automating the generation of large state spaces. A Petri net consists of
places, transitions, arcs and tokens. Tokens move from one place to another along arcs
through transitions. The number of tokens in each place defines the marking of a Petri
net. If the transition firing times are stochastically timed, the Petri net is called a stochastic
Petri net (SPN). If the transition firing times are exponentially distributed, the underlying
reachability graph, representing transitions from one marking to another, gives the underlying
homogeneous CTMC for the situation being modeled.
For the multiprocessor system, let us say we are interested in finding the probability that
an incoming task is turned away because all n processors are tied up by other tasks being
processed. The parameters associated with this pure performance model are the arrival rate of
tasks, the service rate of tasks, the number of buffers and a deadline on task response times. The
performance model assumes that arriving tasks form a Poisson process of rate $\lambda$ and that the
service requirements of tasks are independent and identically distributed with an exponential
distribution of mean $1/\mu$. A deadline d is associated with each task. Let us also take the
number of buffers available for storing incoming tasks as b. We could use an M/M/n/b queue,
represented by the generalized stochastic Petri net (GSPN, which also allows immediate transitions)
shown in Figure 3, for our performance model. Timed transitions are represented by
thick rectangles and immediate transitions by thin lines. Place proc contains the number
of processors available. Initially there are n tokens here, representing n processors. When
transition arr fires (this happens only if a buffer is available), a token is removed from proc
and put in place serving, representing one less free processor. Transition service has a firing
rate that depends on the number of tokens in place serving (indicated by the notation #).
Transition arr is disabled (indicated by the inhibitor arc from place buffer) when there are
$b - n$ tokens in place buffer, since there can only be b tasks in the system: n in place serving
and $b - n$ in place buffer. Therefore the probability that an incoming task is rejected is the
probability that transition arr is disabled.

Figure 3: GSPN model of M/M/n/b queue
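For this simple queue the rejection probability can also be obtained directly from the birth-death chain that underlies the GSPN, without generating the reachability graph. The following sketch assumes, as in the text, that at most b tasks are in the system (n in service and b - n waiting, so b >= n); the function names and parameter values are illustrative assumptions.

```python
def mmnb_probabilities(lam, mu, n, b):
    """Steady-state distribution of the number of tasks in an M/M/n/b queue.

    lam: Poisson arrival rate, mu: service rate per processor,
    n: number of processors, b: maximum number of tasks in the system.
    """
    weights = [1.0]
    for k in range(1, b + 1):
        # Birth rate lam, death rate min(k, n) * mu in state k.
        weights.append(weights[-1] * lam / (min(k, n) * mu))
    total = sum(weights)
    return [w / total for w in weights]


def loss_probability(lam, mu, n, b):
    """Probability that an arriving task finds the system full.

    Poisson arrivals see time averages, so this equals the steady-state
    probability of state b."""
    return mmnb_probabilities(lam, mu, n, b)[b]


if __name__ == "__main__":
    # Hypothetical workload: 2 tasks per time unit, unit service rate, b = 4.
    for n in (1, 2, 3, 4):
        print(n, loss_probability(lam=2.0, mu=1.0, n=n, b=4))
```

With the buffer size held fixed, adding processors lowers the loss probability, which is the trend the pure performance model reports.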

In practical system design, a pure availability model may not be enough for systems such as
gracefully degradable ones. In conjunction with availability, the performance of the system as
it degrades needs to be considered. This requires a "performability" model that includes both
performance and availability measures. In the next section, we present an example of a system for
which a performability measure is needed.

3 Composite Availability and Performance Model


Consider the multiprocessor system again, but with a varying number of processors, n, each with the
same capacity. A key question asked in practice concerns the number of processors needed.
As we discuss below, the optimal configuration in terms of the number of processors is a function
of the chosen measure of system effectiveness [2]. We begin our sizing based on measures from a
"pure" availability model. Next we consider sizing based on system performance measures. Last,
we consider a composite of performance and availability measures to capture the effect of system
degradation.

3.1 Multiprocessor Sizing Based on Availability

A CTMC model for the failure/repair characteristics of the multiprocessor system is shown in Figure
2. The details of the model were already discussed when introducing Markov chains in the previous
section. The downtime during an observation interval of duration T is given by $D(n) = UA(n) \times T$. The
results shown in Figure 4 assume T is 1 year, i.e., 8760 hours. In Figure 4(a) we plot the downtime
D(n) against n for varying values of the mean reconfiguration delay using $c = 0.9$, a failure rate $\gamma = 1/6000$ per
hour, and a reboot rate $\beta = 12$ per hour. In Figure 4(b), we plot D(n) against n for different coverage values with
a mean reconfiguration delay of 10 seconds. We conclude from these results that the availability
benefits of multiprocessing (i.e., an increase in availability with an increase in the number of
processors) are possible only if the coverage is near-perfect and the reconfiguration delay is very small,
or if most of the other processors are able to carry out useful work while a fault is being handled.
We further observe that for most practical parameter values, the optimal number of processors is 2
or 3. In the next subsection we consider a performance-based model for the multiprocessor sizing
problem.

Figure 4: Downtime vs. number of processors for (a) different mean reconfiguration delays, (b) different coverage values

3.2 Multiprocessor Sizing Based on Performance

Along the lines of the GSPN example discussed in the previous section, we used an M/M/n/b queuing
model with finite buffer capacity (see Figure 3) to compute the probability that a task is rejected
because the buffer is full. In Figure 5 we plot the loss probability as a function of n for different
values of the arrival rate. We observe that the loss probability decreases as the number of processors
increases. The conclusion from the performance model of the fault-free system is that the system
improves as the number of processors is increased. The details of this model and results are
presented in [2].
The above models point out the deficiency of simply considering a pure availability or performance measure. The pure availability measure ignores different levels of performance at various
system states, while the pure performance measure ignores the failure/repair behavior of the system.
The next section considers combined measures of performance and availability.

3.3 Multiprocessor Sizing Based on Performance and Availability

Different levels of performance can be taken into account by attaching a reward rate $r_i$, corresponding
to some measure of performance, to each state i of the failure/repair Markov model in Figure 2. The
resulting Markov reward model can then be analyzed for various combined measures of performance
and availability. The simplest reward rate assignment is to let $r_i = i$ for states with i operational
processors and $r_i = 0$ for down states. With the reward assignment shown in Table 1, we can
compute the capacity-oriented availability, COA(n), as the expected reward rate in the steady state. COA(n) is an upper bound on system performance that equates performance with system
capacity. When i processors are operational, we used an M/M/i/b queuing model (such as the
GSPN in Figure 3) to describe the performance of the multiprocessor system. We then assigned a
reward rate of $r_i = 0$ to each down state i and a reward rate of $r_i = T_b(i)$, which is the throughput
for a system with i processors and b buffers, to all other states (see Table 2). With this assignment,
the expected reward rate at steady state gives the throughput-oriented availability, TOA(n).
In Figure 6, we plot COA(n) and TOA(n) for different values of the arrival rate $\lambda$.

Figure 5: Loss probability vs. number of processors

State                                      Reward rate, r_i
0                                          0
i, 1 ≤ i ≤ n                               i
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       0

Table 1: Reward rates for COA

State                                      Reward rate, r_i
0                                          0
i, 1 ≤ i ≤ n                               T_b(i), throughput of system with i processors and b buffers
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       0

Table 2: Reward rates for TOA

Figure 6: COA(n) and TOA(n) for different arrival rates
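The two reward assignments can be sketched in a few lines. This is a simplified illustration, not the model of Figure 2: it assumes perfect coverage and instantaneous reconfiguration, so the failure/repair chain reduces to a birth-death process over the number of operational processors, and it takes the throughput $T_b(i)$ to be the accepted arrival rate of an M/M/i/b queue. All function names and numerical values are assumptions made for the example.

```python
from math import factorial


def failure_repair_probabilities(n, gamma, tau):
    """pi[i] = steady-state probability of i operational processors.

    Simplifying assumption: failures occur at rate i * gamma in state i,
    and a single shared repair facility works at rate tau."""
    weights = [(tau / gamma) ** i / factorial(i) for i in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def loss_probability(lam, mu, servers, capacity):
    """Rejection probability of an M/M/servers/capacity queue."""
    weights = [1.0]
    for k in range(1, capacity + 1):
        weights.append(weights[-1] * lam / (min(k, servers) * mu))
    return weights[-1] / sum(weights)


def throughput(lam, mu, servers, capacity):
    """Accepted arrival rate, used here as the reward rate T_b(i)."""
    return lam * (1.0 - loss_probability(lam, mu, servers, capacity))


def coa(n, gamma, tau):
    """Capacity-oriented availability: reward rate r_i = i."""
    pi = failure_repair_probabilities(n, gamma, tau)
    return sum(i * pi[i] for i in range(n + 1))


def toa(n, gamma, tau, lam, mu, b):
    """Throughput-oriented availability: reward rate r_i = T_b(i)."""
    pi = failure_repair_probabilities(n, gamma, tau)
    return sum(throughput(lam, mu, i, b) * pi[i] for i in range(1, n + 1))


if __name__ == "__main__":
    # Hypothetical parameters; b >= n is assumed throughout.
    for n in (1, 2, 3, 4):
        print(n, coa(n, 1 / 6000, 1.0), toa(n, 1 / 6000, 1.0, 2.0, 1.0, 10))
```

In this sketch TOA(n) saturates at the arrival rate while COA(n) keeps growing with n, which illustrates why COA is only an upper bound on delivered performance.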
These two measures show that in order to process a heavier workload, more than two processors
are needed. However, COA and TOA are not adequate measures of system effectiveness, as
they obliterate the effects of failure/repair and merely show the effects of system capacity and the
load. In [2], a measure of system effectiveness, total loss probability, is proposed that "equally"
reflects fault-free behavior and behavior in the presence of faults. The total loss probability is defined as
the sum of the rejection probability due to the system being down or full and the probability of a response
time deadline being violated. The total loss probability is computed by using the following reward
rate assignments (Table 3): $r_i = 1$ if i is a down state and $r_i = q_b(i) + (1 - q_b(i)) P(R_i(b) > d)$ if
i is an operational state, where $q_b(i)$ is the probability of task rejection when the buffer is full for
a system with i operational processors, $R_i(b)$ is the response time for a system with i operational
processors and b buffers, and d is the deadline on task response time. In Figure 7, we plot the total
loss probability as a function of n for different values of the task arrival rate. We observe that
the optimal number of processors increases with the task arrival rate, tighter deadlines and smaller
buffer spaces.

State                                      Reward rate, r_i
0                                          1
i, 1 ≤ i ≤ n                               q_b(i) + (1 - q_b(i)) P(R_i(b) > d)
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       1

Table 3: Reward rates for total loss probability

Figure 7: Total loss probability vs. number of processors for different task arrival rates

4 Digital Equipment Corporation Case Study


In this section we discuss a case study to demonstrate that, in practice, the choice of an appropriate
model type is dictated by the availability measures of interest, the level of detailed system behavior
to be represented, ease of model specification and solution, representation power of the model type
selected, and access to suitable tools or toolkits that can automate model specification and solution.
In particular, we describe the availability models for Digital Equipment Corporation's (DEC)
VAXcluster system. VAXclusters are used in different application environments, and hence several
availability, reliability and performability measures need to be computed. VAXclusters used as commercial computer systems in a general data processing environment require us to evaluate system
availability and performability measures. To consider VAXclusters in highly critical applications
like life support systems and financial data processing systems, we evaluate many system reliability
measures. These two kinds of measures were not adequate for some financial institution customers of VAXclusters. We therefore evaluated task completion measures to compute the probability of application
interruption during its execution period.

Figure 8: Architecture of the VAXcluster System.
VAXclusters are closely coupled systems that consist of two or more VAX computers, one or
more hierarchical storage controllers (HSCs), a set of disk volumes and a star coupler [5]. The
processor (VAX) subsystem and the storage (HSC and disk volume) subsystem are connected
through the star coupler by a CI3 bus. The star coupler is omitted from the availability models as
it is assumed to be a passive connector, and hence extremely reliable. Figure 8 shows the hardware
topology of a VAXcluster.
Our availability modeling approach considers a VAXcluster as two independent subsystems
namely, the processing subsystem and the storage subsystem. Therefore, the availability of the
VAXcluster is the product of the availability of each subsystem. In the following sections, we
develop a sequence of increasingly powerful availability models, where the level of modeling power
is directly proportional to the level of complex behavior and characteristics of VAXclusters included
in the model.
Our discussion of the availability models developed for the VAXcluster system is organized as
follows. In Section 4.1 and 4.2 we develop models using non-state space techniques like reliability
block diagrams and fault-trees, respectively. In Section 4.3.1, we develop a CTMC or rather a
Markov reward model. The utility of this model in practice is limited as the size of the model grows
exponentially with the number of processors in the VAXcluster. A model that avoids largeness is
discussed in Sections 4.3.2 and 4.4. The model in Section 4.3.2 uses a two-level decomposition for
the processor subsystem of the VAXcluster [16]. On the other hand, for the model in Section 4.4
an iterative scheme is developed [10]. The approximate nature of the results prompted a largeness
tolerance approach. In Section 4.5, a concise stochastic Petri net is developed for VAXclusters
consisting of uniprocessors [1]. The next approach, in Section 4.6, handles realistic heterogeneous
configurations, where the VAXclusters consist of uniprocessors and/or multiprocessors [3, 4], using
stochastic reward nets [14].

3 Computer Interconnect

4.1 Reliability Block Diagram Model (RBD)

The first model of VAXclusters uses a non-state space method, namely, the reliability block diagram. This approach was seen in the availability model of VAXclusters by Balkovich et al. [6]. We
use this approach to partition the VAXcluster along functional lines, which allows us to model
each component type separately. In Figure 9, the block diagram represents a VAXcluster configuration with n processors, n HSCs and n disks. We assume that the VAXcluster is down if all the
components of any of the three subsystems are down.

Figure 9: Reliability block diagram model for the VAXcluster System.

We assume that the times to failure of all
components are mutually independent, exponentially distributed random variables. We also
assume each component to have an independent repair facility. The repair time here is a 2-stage
hypoexponentially distributed random variable, with the first phase being the travel time for the
repairman to get to the field and the second phase being the actual repair time. On evaluating this
model as a pure series-parallel availability model, the expression for the VAXcluster availability is
given by:

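\[
A = \left[ 1 - \left( \frac{\lambda_P}{\lambda_P + \frac{1}{1/\mu_F + 1/\mu_P}} \right)^{n} \right]
    \left[ 1 - \left( \frac{\lambda_H}{\lambda_H + \frac{1}{1/\mu_F + 1/\mu_H}} \right)^{n} \right]
    \left[ 1 - \left( \frac{\lambda_D}{\lambda_D + \frac{1}{1/\mu_F + 1/\mu_D}} \right)^{n} \right]
\tag{3}
\]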

Here,
• $1/\lambda_P$ is the mean time between VAX processor failures.
• $1/\lambda_H$ is the mean time between HSC failures.
• $1/\lambda_D$ is the mean time between disk failures.
• $1/\mu_F$ is the mean field service travel time.
• $1/\mu_P$, $1/\mu_H$ and $1/\mu_D$ are the mean times to repair a VAX processor, HSC and disk respectively.
The assumption that a VAXcluster is down when all the components of any of the three subsystems are down is not in tune with reality. For a VAXcluster to be operational the system should
meet quorum, where quorum is the minimum number of VAXes required for the VAXcluster to
function.
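The series-parallel expression for the VAXcluster availability is straightforward to evaluate numerically. The sketch below is an illustration only: it treats each subsystem as n identical components in parallel and uses the fact that, for steady-state availability, only the mean of the two-stage repair time (travel plus repair) matters. The function names and all numerical values are hypothetical.

```python
def subsystem_availability(failure_rate, travel_rate, repair_rate, n):
    """Availability of n identical components in parallel.

    The repair time has two exponential stages (field service travel, then
    the actual repair); its mean is the sum of the two stage means."""
    mean_repair_time = 1.0 / travel_rate + 1.0 / repair_rate
    effective_repair_rate = 1.0 / mean_repair_time
    unavail = failure_rate / (failure_rate + effective_repair_rate)
    return 1.0 - unavail ** n


def vaxcluster_availability(n, lam_p, lam_h, lam_d, mu_f, mu_p, mu_h, mu_d):
    """Product of the processor, HSC and disk subsystem availabilities."""
    return (subsystem_availability(lam_p, mu_f, mu_p, n)
            * subsystem_availability(lam_h, mu_f, mu_h, n)
            * subsystem_availability(lam_d, mu_f, mu_d, n))


if __name__ == "__main__":
    # Hypothetical rates per hour, chosen only to exercise the formula.
    for n in (1, 2, 3):
        print(n, vaxcluster_availability(n, lam_p=1 / 5000, lam_h=1 / 10000,
                                         lam_d=1 / 20000, mu_f=0.5,
                                         mu_p=1.0, mu_h=1.0, mu_d=2.0))
```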
Figure 10: Fault tree model for the VAXcluster system

4.2 Fault Tree Model

In this section, we present a fault tree model for the VAXcluster configuration discussed in Section
4.1. Figure 10 is a model for the VAXcluster with n processors, n HSCs, and n disks. Observe
that in a block diagram model the structure tells us when the system is functioning, while in a
fault tree model the structure tells us when the system has failed. In addition, we have extended
the model to include a quorum required for operation. The cluster is operational as long as at least k out
of n processors, k out of n HSCs and k out of n disks are up. The negation of this operational information is depicted
in the fault tree as follows. The topmost node denotes "Cluster Failure" and the associated "OR"
gate specifies that a cluster fails if $(n - k + 1)$ processors, $(n - k + 1)$ HSCs, or $(n - k + 1)$ disks
are down. The steady state unavailability of the cluster, $U_{cluster}$, is given by

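\[ U_{cluster} = 1 - (1 - U_P)(1 - U_H)(1 - U_D) \tag{4} \]

where

\[ U_i = \sum_{|J| \ge n-k+1} \Bigl( \prod_{j \in J} U_{i_j} \Bigr) \Bigl( \prod_{j \notin J} (1 - U_{i_j}) \Bigr) \]

for i = P, H or D (processors, HSCs or disks), $U_{i_j}$ is the unavailability of the j-th component of type i,
and J is the set of indices of the failed components.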
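The k-of-n gate probabilities can be evaluated by direct enumeration when n is small. The sketch below is illustrative rather than one of the efficient algorithms cited earlier: it sums over every set of failed components of size at least n - k + 1, assuming independent components, and then combines the three gates through the top OR gate. The function names and the unavailability values are hypothetical.

```python
from itertools import combinations


def k_of_n_gate(unavail, k):
    """Probability that at least n - k + 1 of the n components are down,
    i.e. that fewer than k remain up. `unavail` lists the per-component
    unavailabilities, assumed independent."""
    n = len(unavail)
    total = 0.0
    for size in range(n - k + 1, n + 1):
        for failed in combinations(range(n), size):
            p = 1.0
            for j in range(n):
                p *= unavail[j] if j in failed else 1.0 - unavail[j]
            total += p
    return total


def cluster_unavailability(u_proc, u_hsc, u_disk, k):
    """Top OR gate: the cluster fails if any of the three gates fires."""
    all_ok = 1.0
    for group in (u_proc, u_hsc, u_disk):
        all_ok *= 1.0 - k_of_n_gate(group, k)
    return 1.0 - all_ok


if __name__ == "__main__":
    # Hypothetical component unavailabilities for n = 3 and quorum k = 2.
    print(cluster_unavailability([0.01] * 3, [0.005] * 3, [0.002] * 3, k=2))
```

Enumeration grows exponentially with n, so it is only suitable for small configurations.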

The RBD and fault tree VAXcluster availability models are very limited in their depiction of
the failure/recovery behavior of the VAXcluster. For example, they assume that each component
has its own repair facility, and that there is only one failure/recovery type. In fact, combinatorial
models like RBDs, fault trees and reliability graphs require system components to behave in a
stochastically independent manner. Dependencies of many different kinds exist in VAXclusters and
hence, combinatorial models are not entirely satisfactory for such systems. State space modeling
techniques like Markovian models can include different kinds of dependencies. In the following
sections, we develop state space models for the processing subsystem and the storage subsystem
separately, and use a hierarchical technique to combine the models of the two subsystems to obtain
an overall system availability model.
Figure 11: CTMC for the two-processor VAXcluster

4.3 Availability Model for the VAXcluster Processing Subsystem


4.3.1 Continuous Time Markov Chain
We now develop a more detailed continuous time Markov chain (CTMC) model for the VAXcluster,
showing two types of failure and a coverage factor for each failure type. By using a state space
model, we are also able to incorporate shared repair for the processors in a cluster. We assumed that
the times between failures, the repair times and other recovery times are exponentially distributed
and developed an availability model for an n-processor ($n \ge 2$) VAXcluster using a homogeneous
continuous-time Markov chain. A CTMC for a 2-processor VAXcluster developed in [1] is shown in
Figure 11. The following behavior is characterized in this CTMC. A processor is either up or down.
There are two types of failure: permanent and intermittent. A processor recovers from a permanent
failure by a physical repair and from an intermittent failure by a processor reboot. These failures
are further classified as covered or uncovered. A covered processor failure causes a brief (on the
order of seconds) cluster outage to reconfigure the failed processor out of the cluster and back
into the cluster after it is fixed. Thus if a quorum is still formed by the operational processors, a
covered failure causes a small loss in system time. An uncovered failure causes the entire cluster to
go down until it is rebooted, even if the remaining operational processors still form a quorum. The
mean times to permanent and intermittent failure are $1/\lambda_p$ and $1/\lambda_I$ hours respectively. The mean
repair, mean processor reboot, and mean cluster reboot times are given by $1/\mu_p$, $1/\mu_{PB}$ and $1/\mu_{CB}$
respectively. Let c and k denote the coverage factors for permanent and intermittent failures
respectively. Realistically, the mean reconfiguration time ($1/\mu_{IN}$) to map a processor into the
cluster and the time ($1/\mu_{OUT}$) to map it out of the cluster are different. In the CTMC of Figure
11, we have assumed that $\mu_{IN} = \mu_{OUT} = \mu_T$. The states of the Markov chain are represented by
(abc, d), where

a = number of processors down with permanent failure

b = number of processors down with intermittent failure

\[
c = \begin{cases}
0 & \text{if both processors are up} \\
p & \text{if one processor is being repaired} \\
b & \text{if one processor is being rebooted} \\
c & \text{if cluster is undergoing a reboot} \\
t & \text{if cluster is undergoing a reconfiguration} \\
r & \text{if two processors are being rebooted} \\
s & \text{if one is being rebooted and other repaired}
\end{cases}
\]

\[
d = \begin{cases}
0 & \text{cluster up state} \\
1 & \text{cluster down state}
\end{cases}
\]

The steady-state availability of the cluster is given by:

$$\text{Availability} = P_{000,0} + P_{10p,0} + P_{01b,0} \tag{5}$$

where $P_{abc,d}$ denotes the steady-state probability that the process is in state $(abc, d)$. We computed
the availability of the VAXcluster system by solving the above CTMC using SHARPE [9], which is
a software package for availability/performability analysis. The main problem with this approach
was that the size of the CTMC grew exponentially with the number of processors in the VAXcluster
system. The largeness posed two challenges: (1) the capability of the software to solve
models with thousands of states for VAXclusters with n > 5 processors, and (2) the problem of actually
generating the state space. In the next section we address these two drawbacks.
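
As a rough illustration of what solving a CTMC for its steady-state probabilities involves, the following Python sketch handles a deliberately simplified 3-state chain (both processors up, one down, both down) with a single shared repair facility. It is not the chain of Figure 11: coverage, reboot and reconfiguration states are left out, the rates `lam` and `mu` are hypothetical values, and a small Gaussian elimination stands in for a tool such as SHARPE.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Work with the transposed system Q^T pi^T = 0 ...
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # ... and replace the last (redundant) balance equation by normalization.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - tail) / A[r][r]
    return pi


lam = 1.0 / 5000.0  # hypothetical per-processor failure rate (per hour)
mu = 1.0 / 4.0      # hypothetical shared repair rate (per hour)

# State 0: both up, state 1: one down, state 2: both down.
Q = [
    [-2 * lam, 2 * lam, 0.0],
    [mu, -(lam + mu), lam],
    [0.0, mu, -mu],
]
pi = steady_state(Q)
# As in Equation 5, the cluster is counted as up while at least one processor is up.
availability = pi[0] + pi[1]
```

For a chain this small the result can be checked against the birth-death closed form; the point of the sketch is only that the same linear system grows with the state space, which is what makes large n expensive.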

4.3.2 Approximate Availability Model for the Processing Subsystem


In this section, we discuss a VAXcluster availability model that avoids the largeness associated with
a Markov model. To reduce the complexity of a large system, Ibe et al. [16] developed a two-level
hierarchical model. The bottom level is a homogeneous CTMC and the top level is a combinatorial
model. This top-level model was represented as a network of diodes (or three-state devices).
The approximate availability model developed for the analysis made the following assumptions
[16].
1. The behavior of each processor was modeled by a homogeneous CTMC, and it was assumed that
this processor did not break the quorum rule. This assumption is justified by the fact that
the probability of VAXcluster failure due to loss of quorum is relatively low.
2. Each processor has an independent repairman. This assumption is justified because the authors
observed that the MTBF was large compared to the MTTR.
These assumptions allowed the authors to decompose the n-processor VAXcluster into n independent subsystems. Further, the states of the CTMC for the individual processors were classified
into the following three superstates:

• X = the set of states in which the processor is up.
• Y = the set of states in which the cluster is down due to a processor failure.
• Z = the set of states in which the processor is down but the cluster is up.
The authors compared the superstates to the three states of a diode: X, Y and Z correspond to
the up state, the short-circuit state and the open-circuit state respectively. Then the availability, A, of the VAXcluster was defined as follows [16]:

$$A = P[\text{at least one processor in superstate } X \text{ and none in superstate } Y].$$

Let $n_X$, $n_Y$ and $n_Z$ denote the number of processors in superstates X, Y and Z, respectively.
Let $P_X$, $P_Y$ and $P_Z$ denote the probability that a processor is in superstate X, Y and Z. Then Ibe
et al. [16] defined the availability $A_n$ of an n-processor VAXcluster as follows:
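$$
\begin{aligned}
A_n &= \sum_{n_X=1}^{n} \binom{n}{n_X} P_X^{n_X} P_Y^{0} P_Z^{n-n_X} \\
    &= \sum_{n_X=0}^{n} \binom{n}{n_X} P_X^{n_X} P_Z^{n-n_X} - \frac{n!}{0!\,(n-0)!} P_X^{0} P_Z^{n} \\
    &= (P_X + P_Z)^n - P_Z^n
\end{aligned}
\tag{6}
$$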
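
The closed form in Equation 6 can be checked numerically against the enumeration over the number of processors in superstate X. The Python sketch below does this; the superstate probabilities are hypothetical values chosen only for illustration, not results from [16].

```python
from math import comb


def availability_closed_form(p_x, p_z, n):
    """Closed form of Equation 6: (P_X + P_Z)^n - P_Z^n."""
    return (p_x + p_z) ** n - p_z ** n


def availability_by_enumeration(p_x, p_z, n):
    """Sum over n_X = 1..n processors in superstate X, none in superstate Y."""
    return sum(
        comb(n, n_x) * p_x ** n_x * p_z ** (n - n_x)
        for n_x in range(1, n + 1)
    )


# Hypothetical per-processor superstate probabilities (P_X + P_Y + P_Z = 1).
p_x, p_y, p_z = 0.995, 0.001, 0.004

results = {
    n: (availability_closed_form(p_x, p_z, n),
        availability_by_enumeration(p_x, p_z, n))
    for n in range(1, 8)
}
```

Because $P_Y$ enters only through $P_X + P_Z = 1 - P_Y$, the first term shrinks as n grows: every added processor is another chance of a failure that takes the whole cluster down.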

The authors could analyze different VAXcluster configurations by simply varying the number
of processors n in the above equation. The main drawbacks of this approach are the approximate
nature of the solution versus an exact solution, and the need to make simplifying assumptions, one
of which is an independent repairman for each processor. In the next section, we
illustrate another approximation technique to deal with large subsystems.

4.4 Availability Model for the VAXcluster Storage Subsystem

In this section, we discuss a novel availability model for VAXclusters with large storage subsystems.
In [10], a fixed-point iteration scheme was used over a set of CTMC sub-models. The decomposition
of the model into sub-models controlled the state space explosion and the iteration modeled the repair priorities between the different storage components. The model considered configurations with
shadowed (commonly known as mirrored) disks and characterized the system along with application
level recovery.
In Figure 12, we show the block diagram of an example storage system configuration that will
be used to demonstrate the technique. The configuration shown consists of two HSCs and a set
of disks. The disks are further classified into two system disks and two application disks. The
operating system resides on the system disks, and the user accounts and other application software
on the application disks. Further, it is assumed that the disks are shadowed (commonly referred to as mirrored) and dual pathed
and ported between the two HSCs [10]. A disk dual pathed between two HSCs can be accessed
cluster-wide in a coordinated way through either HSC. If one of the HSCs fails, a dual ported
disk can be accessed through the other HSC after a brief failover period.

[Figure 12 blocks: HSC1 and HSC2; System Disk 1 and System Disk 2; Application Disk 1 and Application Disk 2]

Figure 12: Reliability block diagram for the storage system

Figure 13: CTMC models: Shared repair within a (a) HSC subsystem, (b) SDisk subsystem, (c) ADisk subsystem

We now discuss the sequence of approximation models developed to compute the availability
of the storage system in Figure 12. The first model assumed that each component in the block
diagram has its own repair facility. We assumed that the repair time is a 2-stage hypoexponentially
distributed random variable with the first phase being the travel time and the second phase being
the actual repair time. This model can be solved as a pure series-parallel availability model to
compute the availability of the storage system, similar to the solution of the RBD in Section 4.1.
In the second improved model, we removed the assumption of independent repair. Instead, it
is assumed that a repair facility is shared within a subsystem. The storage system is now modeled
as a two-level hierarchical model. The bottom level consists of three independent CTMC models,
namely HSC, SDisk and ADisk, representing the HSC, system disk and application disk subsystems
respectively. The top level consists of a reliability block diagram representing a series network of
the three subsystems. In Figure 13(a), (b), (c) we show the CTMC models of the three subsystems.
The reliability block diagram at the top level is shown in Figure 14.
The states of the CTMC models adopt the following convention.

• State $nX$ represents that $n$ components of the subsystem are operational, where $n$ can take the
values 0, 1 or 2.
• State $T_{nX}$ represents that the field service has arrived and that $(n - 1)$ components of the
subsystem are operational and the rest are under repair.

In the above notation, the value of X is H, S or A, where H is associated with the HSC subsystem
model, S with the system disk subsystem model and A with the application disk subsystem model.

Figure 14: Top level reliability block diagram for the storage subsystem (HSC, SDISK and ADISK in series)

The steady state availability of the storage subsystem in Figure 14 is given by,

$$A = A_H \times A_S \times A_A \tag{7}$$

$A_X$ is the availability of the X subsystem and is given by,

$$A_X = P_{2X} + P_{1X} + P_{T_{2X}} \tag{8}$$

where $P_{iX}$ and $P_{T_{iX}}$ are the steady-state probabilities that the Markov chain is in state $iX$ and state
$T_{iX}$ respectively.
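
The two-level structure of Equations 7 and 8 can be sketched in a few lines of Python. The state probabilities below are hypothetical placeholders; in the model they would come from solving the three CTMCs of Figure 13, and the dictionary keys are naming choices made for this sketch only.

```python
from math import prod


def subsystem_availability(probs, x):
    """Equation 8: A_X = P_2X + P_1X + P_T2X for subsystem x in {'H', 'S', 'A'}."""
    return probs["2" + x] + probs["1" + x] + probs["T2" + x]


def storage_availability(probs_by_subsystem):
    """Equation 7: series combination (product) of the subsystem availabilities."""
    return prod(
        subsystem_availability(probs, x)
        for x, probs in probs_by_subsystem.items()
    )


# Hypothetical steady-state probabilities for each bottom-level CTMC.
probs_by_subsystem = {
    "H": {"2H": 0.9990, "1H": 0.0006, "T2H": 0.0003, "T1H": 0.00005, "0H": 0.00005},
    "S": {"2S": 0.9980, "1S": 0.0012, "T2S": 0.0006, "T1S": 0.0001, "0S": 0.0001},
    "A": {"2A": 0.9980, "1A": 0.0012, "T2A": 0.0006, "T1A": 0.0001, "0A": 0.0001},
}

a_storage = storage_availability(probs_by_subsystem)
```

The product form is valid only because the three bottom-level models are treated as independent at this stage; the later approximations reintroduce their interaction through iteration rather than through a joint state space.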
In the third approximation we took into account disk reload and system recovery, which cover
the following activities. When a disk subsystem experiences a failure, data on the disk may
be corrupted or lost. After the disk is repaired the data is reloaded onto the disk from an external
source, such as a backup disk or tape. While the reload is a local activity of a disk subsystem,
recovery is a global system-wide activity. This behavior is incorporated in the Markov models of
Figure 15(a), (b), (c) as follows. The HSC Markov model is enhanced by including application
recovery states $R_{2H}$ and $R_{1H}$ after the loss of both the HSCs in the HSC subsystem. The system
disk Markov model is extended by incorporating reload states $L_{2S}$ and $L_{1S}$, and application recovery
states $R_{2S}$ and $R_{1S}$. The reload followed by application recovery starts immediately after the first
disk is repaired. We further assume that a component could suffer failures during a reload and/or
recovery. The application disk Markov model is extended similarly to the system disk model by
including reload states $L_{2A}$ and $L_{1A}$, and recovery states $R_{2A}$ and $R_{1A}$. The expression for the
steady-state availability of the storage subsystem is similar to the expression obtained in the second
approximation.
In the fourth approximation, the assumption of an independent repair facility for each subsystem is
eliminated. In this approximation, the repair facility is shared between subsystems, and when more
than one component is down, the following repair priority is assumed: (1) any subsystem with all
failed components is repaired first; (2) otherwise, an HSC is repaired first, a system disk second, and
an application disk third. This repair priority scheme does not change the Markov model for the HSC
subsystem, but changes the models for the system and application disk subsystems. The system
disk has the second highest priority and hence, the system disk repair rate $\mu_D$ is slowed down by
multiplying it by $P_1$, the probability that both HSCs are operational, given that field service is
present and the system is not in a recovery mode. Then $P_1$ is given by,

$$P_1 = \frac{P_{2H}}{P_{2H} + P_{T_{1H}} + P_{T_{2H}}}. \tag{9}$$

In [10] it is assumed that a component can be repaired during recovery. Then the system disk
repair rate $\mu_D$ from the recovery states is slowed down by multiplying it by $P_2$ where,

$$P_2 = \frac{P_{R_{2H}}}{P_{R_{1H}} + P_{R_{2H}}}. \tag{10}$$

Here, $P_{R_{nH}}$ ($n = 1, 2$) are the steady-state probabilities of the HSC recovery states.

Figure 15: CTMC models:
(a) System Recovery Included for HSC subsystem,
(b) Disk Reload and System Recovery Included for SDisk subsystem,
(c) Disk Reload and System Recovery Included for ADisk subsystem

The application disk has the lowest repair priority, which is enforced by probabilistically slowing
down the repair rate. The repair rate from the non-recovery states is slowed down by multiplying
$\mu_D$ by $P_3$ where,

$$P_3 = \frac{P_{2H}}{A} \times \frac{P_{2S}}{B}. \tag{11}$$

Here $A = P_{2H} + P_{T_{1H}} + P_{T_{2H}}$ and $B = P_{2S} + P_{T_{1S}} + P_{T_{2S}} + P_{L_{1S}} + P_{L_{2S}}$. Then $P_3$ expresses the
probability that both HSCs are operational given that the HSC subsystem is not in the recovery
states or in states with fewer than two HSCs operational, and that both system disks are operational
given that the system disk is in non-recovery states or states with more than one system disk up.
The steady-state availability is computed as in the first approximation.
In the above approximations we included the field service travel time for each subsystem. In the
real world, if a field service person is present and repairing a component in one subsystem, he would
also respond to a failure in another subsystem. Thus in this case we should not include the travel time
twice. Also, the field service would follow the repair priority described above. The Markov model
for each subsystem can be modified by iteratively checking for the presence of the field service person in
the other two Markov models. The field service person is assumed to wait on site until reload and
recovery are completed in the SDisk and ADisk subsystems, and until recovery is completed in the
HSC subsystem.
The HSC subsystem is extended as follows. The rate of transition due to a component failure
is probabilistically split using the variable $y_1$ (or $1 - y_1$). The probability that the field service is
present for repairing a component in either of the two disk subsystems is,

$$
\begin{aligned}
y_1 = {} & (1 - P_{2S} - P_{1S} - P_{0S}) + (1 - P_{2A} - P_{1A} - P_{0A}) \\
         & - \big((1 - P_{2S} - P_{1S} - P_{0S}) \times (1 - P_{2A} - P_{1A} - P_{0A})\big).
\end{aligned}
\tag{12}
$$

The initial value of $y_1$ is assumed to be 0 in the first iteration. Then the above value of $y_1$ is
used for the next iteration.
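
The iteration on $y_1$ can be sketched as a generic fixed-point loop. In the Python fragment below, `y1_from_probs` follows Equation 12, while `toy_update` is a purely hypothetical stand-in for re-solving the SDisk and ADisk CTMCs with the current value of $y_1$; its numbers and its dependence on `y` are invented for illustration, and the convergence tolerance is likewise an assumption.

```python
def y1_from_probs(p_s, p_a):
    """Equation 12: probability that field service is present in either disk subsystem."""
    a = 1.0 - p_s["2S"] - p_s["1S"] - p_s["0S"]
    b = 1.0 - p_a["2A"] - p_a["1A"] - p_a["0A"]
    return a + b - a * b


def fixed_point(update, y0=0.0, tol=1e-12, max_iter=100):
    """Iterate y <- update(y) from y0 until successive values differ by less than tol."""
    y = y0
    for _ in range(max_iter):
        y_next = update(y)
        if abs(y_next - y) < tol:
            return y_next
        y = y_next
    raise RuntimeError("fixed-point iteration did not converge")


def toy_update(y):
    """Hypothetical stand-in for solving the disk sub-models given the current y."""
    # Probability mass in states where field service is on site; in this toy it
    # shrinks slightly as y grows.
    t_s = 0.01 * (1.0 - 0.5 * y)
    t_a = 0.02 * (1.0 - 0.5 * y)
    p_s = {"2S": 1.0 - 0.001 - t_s, "1S": 0.001, "0S": 0.0}
    p_a = {"2A": 1.0 - 0.002 - t_a, "1A": 0.002, "0A": 0.0}
    return y1_from_probs(p_s, p_a)


y1 = fixed_point(toy_update)  # starts from y1 = 0, as in the first iteration
```

In the actual scheme each pass would solve all three sub-models and exchange several such quantities; the loop structure stays the same.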
The system (application) disk subsystem is extended as follows. The rate of every transition due
to a component failure that occurs in the absence of the repair person in the system (application)
disk subsystem is multiplied by $y_2$ (or $1 - y_2$). The expression for $y_2$ is similar to the expression
for $y_1$ except that S (A) is replaced by H. This takes into account that the field service is present in the
HSC and/or application (system) disk subsystem.
In a similar manner, the next approximation refined the model by taking into account the
global nature of system recovery. That is, if a recovery is ongoing in one subsystem the other
two subsystems are forced to go into recovery. The approximated effect of global recovery is
achieved with an iterative scheme that allows for interaction between the sub-models. The final
approximation only modified the HSC subsystem model to incorporate the effect of an HSC failover,
as shown in Figure 16. Failover is the procedure of switching to an alternate path or component after failure of a path or a component
[19]. During the HSC failover period all the disks are switched on to the operational HSC.

Figure 16: HSC submodel with failover included

In state 2H, instead of a single failure transition labeled $2\lambda_H$, we now have three failure transitions. If the primary HSC fails, the model transitions from state 2H to state PFAIL with rate
$\lambda_H$, and PFAIL transitions to state 1H after a failover to the secondary HSC with rate $\mu_{FD}$. In
state 2H, if the failure of the secondary HSC is detected, which happens with probability $P_{det}$, a transition to state
1H occurs with rate $P_{det}\lambda_H$; if it is not detected, a transition occurs with rate $(1 - P_{det})\lambda_H$ to
state SFAIL. The steady state availability of the HSC subsystem is then given by,

$$A_{HSC} = P_{2H} + P_{1H} + P_{T_{2H}} + P_{SFAIL}. \tag{13}$$

The steady state availability of the storage subsystem is given by Equation 7. In [10], after
various experiments it was observed that the storage downtime is more sensitive to detection of a
secondary HSC failure than to the average failover time.

4.5 SPN Availability Model

In this section we discuss a VAXcluster availability model that tolerates largeness and automates
the generation of large Markov models. Ibe, Sathaye et al. [1] use generalized stochastic Petri
nets to model VAXclusters. The authors used the software tool SPNP [8] to generate and solve
the Markov model underlying the SPN. In fact, the SPN model in [1] allows extensions to permit
specifications at the net level, hence the resulting model is a stochastic reward net.
Figure 17 shows a partial SPN model of the VAXcluster system. The details of the entire model in
[1] are beyond the scope of this paper. The place $P_{UP}$ with N tokens represents the initial condition
that all the N processors are up. The processors can suffer a permanent or intermittent failure,
represented by the timed transitions $t_{PERM}$ and $t_{INT}$ respectively. The firing rates of the transitions
$t_{PERM}$ and $t_{INT}$ are marking dependent. These rates are expressed as $\#(P_{UP}, i)\lambda_P$ and $\#(P_{UP}, i)\lambda_I$
respectively, where $\#(P_{UP}, i)$ represents the number of tokens in place $P_{UP}$ in any marking $i$.
The place $P_{PERM}$ ($P_{INT}$) represents that a permanent (intermittent) failure has occurred. When a
permanent (intermittent) failure occurs, it will be covered with probability $c$ ($k$) and uncovered
with probability $1 - c$ ($1 - k$). The covered permanent and intermittent failures are represented by
immediate transitions $t_{PC}$ and $t_{IC}$ respectively. The uncovered permanent and intermittent failures
are represented by immediate transitions $t_{PU}$ and $t_{IU}$ respectively. A failure is considered to be
covered only if the number of operational processors is at least $l$, the quorum. The input and
output arcs with multiplicity $l$ from and to the place $P_{UP}$ ensure quorum maintenance. In Figure
17, the block labeled "Cluster Reconfiguration Block" represents a group of reconfiguration places
in the complete model of [1].

Figure 17: Partial SPN model of the VAXcluster system

In addition, the following behavior is represented by the SPN:

• A covered permanent failure is not possible while the cluster is being rebooted after an
uncovered failure (token in either $P_{UIF}$ or $P_{UPF}$). This is represented by an inhibitor arc
from $P_{UIF}$ and $P_{UPF}$ to the immediate transition $t_{PC}$.
• It is assumed that a failure does not occur while the cluster is being reconfigured. This is
represented by the inhibitor arcs from the "Cluster Reconfiguration Block" to $t_{PERM}$ and
$t_{INT}$.
• A processor under reboot can suffer a permanent failure. This is represented by the fact that
when there is a token in $P_{REB}$ both the transitions $t_{REB}$ and $t_{IP}$ are enabled.
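The steady-state availability is given by:

$$A = \sum_{i \in \tau} r_i \pi_i \tag{14}$$

where $\tau$ is the set of tangible markings, $\pi_i$ is the steady-state probability of marking $i$, and

$$
r_i = \begin{cases}
1, & \text{if } (\#(P_{UP}, i) \ge l) \;\wedge\; [\#(P_{ClusterRebootPlaces}, i) < 1 \wedge \#(P_{UIF}, i) < 1] \\
   & \quad \wedge\; [\#(P_{ClusterReconfigurationBlock}, i) < 1] \\
0, & \text{otherwise.}
\end{cases}
$$
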

4.6 SPN Availability Model for Heterogeneous VAXclusters

In this section, we present an SPN model that considers realistic VAXcluster configurations which
include uniprocessors and multiprocessors [3]. The heterogeneity in the model allowed each multiprocessor to contain a varying number of processors, and each VAX in the VAXcluster to belong
to a different VAX family. Henceforth, we refer to this SPN model as the heterogeneous model.
Throughout this section we refer to a single multiprocessor system as a machine, which consists of
two components: one or more processors and a platform. The platform consists of the memory,
power, cooling, console interface module and I/O channel adapters [3]. As in the uniprocessor case
in the above sections, we depict covered and uncovered permanent and intermittent failures. In
addition, we depict the following failure/recovery behavior for a multiprocessor. A processor failure
in a machine requires a machine reboot to map the faulty processor offline. The entire machine is
not operational during the processor repair and the reboot following the repair. Before and after
every machine reboot a cluster reconfiguration maps the machine out of and into the cluster. The
platform components of a machine are single points of failure for that machine [3]. In addition, the
following features are included:

• The option of including the quorum disk. A quorum disk acts as a virtual node in the
VAXcluster, casting a "tie-breaker" vote.
• The option of including failure/recovery behavior like unsuccessful repair and unsuccessful
reboots. An unsuccessful repair is modeled by introducing a faulty repair. Faulty repair
means that diagnostics called out a wrong FRU (field replaceable unit) or that the right FRU
was requested but is DOA (dead on arrival). Unsuccessful reboot means a processor reboot
did not complete and has to be repeated.
• The option of including detailed cluster reconfiguration information. For example, if the quorum does not exist in reality, the time to form a cluster is longer than the usual reconfiguration
time.
The overall model structure is shown in Figure 18. The VAXcluster is modeled as a $1 \times N$
array, where each plane represents a subnet of a machine consisting of $M_i$ processors. The place
$P_{UP}$ in plane $i$ contains $M_i$ tokens, and represents the initial condition that the machine is up
with $M_i$ processors. The cluster reconfiguration subnet consists of a single place clust_reconfig,
which initially contains a single token. Whenever a machine $i$ initiates a reconfiguration the
subnet associated with it flushes the token out of clust_reconfig and returns the token after the
reconfiguration. This ensures that all machines in the cluster participate in a cluster state transition.
Similarly, the quorum disk is treated as a separate subnet which interacts with each of the N
subnets. The number of processors in a machine is varied by varying the number of tokens $M_i$
in each subnet. The model allows the machines in the cluster to belong to different VAX series
because every subnet handles the failure/recovery behavior of a machine separately. If $M_i = 1$ in
a subnet then the machine follows the failure/recovery behavior of a uniprocessor. The detailed
subnet associated with a machine is beyond the scope of this paper, and is presented in [3].

[Figure 18 blocks: Machine 1 to Machine N subnets, Quorum Disk, Field Service, Cluster Transition]

Figure 18: Model structure for the VAXcluster system.
The heterogeneous SPN model included various extensions like variable arcs, enabling functions
or guards, rate type functions, etc., and hence is a stochastic reward net [14]. For example, when
an intermittent covered failure transition fires, the rate type function for the rate $\lambda_{IC}$ is defined as:
    If ($Mach\_UP + (mark(P_{qdup}) \times NQV) \le QU$)
        $\lambda_{IC} = \lambda_{platint} + (mark(P_{UP}) \times \lambda_I)$
    else
        $\lambda_{IC} = k \times (\lambda_{platint} + (mark(P_{UP}) \times \lambda_I))$
where $Mach\_UP$ represents the number of machines up in the cluster, $\lambda_{platint}$ is the platform
intermittent failure rate, $\lambda_I$ is the processor intermittent failure rate and $k$ is the probability of a
covered intermittent failure. The marking $mark(P_{qdup}) > 1$ implies the quorum disk is up and
$NQV$ is the number of quorum votes assigned to the quorum disk. The number of votes needed
for the VAXcluster to be operational is given by $QU$.
On the other hand, the rate type function for an intermittent uncovered failure rate is given by,
    $\lambda_{IU} = (1 - k) \times (\lambda_{platint} + (mark(P_{UP}) \times \lambda_I))$
The heterogeneous VAXcluster model is available if:
1. $mark(clust\_reconfig) = 1$, that is, a cluster reconfiguration is not in progress,
2. $NU + (mark(P_{qdup}) \times NQV) \ge QU$
where $NU$ is obtained using the following algorithm.

    Initial: NU = 0
    For i = 1, ..., N
        If (mark(Platform_failure, i) = 0) AND (mark(P_UP, i) > 0) AND
           (repair and reboot transitions of machine i disabled) then
            NU = NU + 1

NU counts the machines that are up: a machine is up if no platform failure has occurred, at least one
processor is up, and no repair or reboot is in progress.
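
The NU algorithm and the two availability conditions can be written out as a short Python sketch. The `Machine` record and its field names are hypothetical stand-ins for the markings of each machine subnet, and the quorum disk is represented by a boolean rather than by the marking of its place; none of this is taken from the SPNP model in [3].

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical snapshot of one machine subnet's marking."""
    platform_failed: bool               # a token is present in the platform failure place
    processors_up: int                  # number of tokens in P_UP for this machine
    repair_or_reboot_in_progress: bool  # a repair or reboot transition is enabled


def count_machines_up(machines):
    """NU: machines with no platform failure, at least one processor up,
    and no repair or reboot in progress."""
    return sum(
        1
        for m in machines
        if not m.platform_failed
        and m.processors_up > 0
        and not m.repair_or_reboot_in_progress
    )


def cluster_available(machines, reconfig_tokens, quorum_disk_up, nqv, qu):
    """Condition 1: no reconfiguration in progress (token present in clust_reconfig).
    Condition 2: machine votes plus quorum disk votes reach the quorum QU."""
    votes = count_machines_up(machines) + (nqv if quorum_disk_up else 0)
    return reconfig_tokens == 1 and votes >= qu


cluster = [
    Machine(platform_failed=False, processors_up=2, repair_or_reboot_in_progress=False),
    Machine(platform_failed=False, processors_up=1, repair_or_reboot_in_progress=True),
    Machine(platform_failed=True, processors_up=4, repair_or_reboot_in_progress=False),
]
nu = count_machines_up(cluster)
```

With this example cluster only the first machine counts as up, so whether the cluster is available depends on the quorum disk vote and on the quorum QU chosen.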
This heterogeneous model was evaluated using the SPNP package [8], which solves the
SPN by analyzing the underlying CTMC. The state space cardinality of the CTMC isomorphic
with the heterogeneous model increased with the number of machines in the VAXcluster, as well
as with the number of processors in each machine. We resolved this largeness problem in [3] by using a technique that
involved the truncation of the state space [7]. This state space reduction technique is implemented
by specifying a truncation level K for processor failures in the model; the maximum value of K is
$M_1 + M_2 + \cdots + M_N$. The value K specifies that the reachability graph, and hence the corresponding
CTMC, be generated up to K processor failures. This is implemented in the model by means of an
enabling function associated with all the failure transitions. The enabling function disables all the
failure transitions if $\sum_{i=1}^{N} (M_i - mark(P_{UP}, i)) \ge K$. This technique is justified as follows:
• In real systems, most of the time the system has a majority of its components operational [7].
This means the probability mass is concentrated on a relatively small number of states in
comparison to the total number of states in the model.
• We observed the impact of varying the truncation level on the availability measures for an
example heterogeneous cluster, and concluded that the effect was minimal.
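
The truncation rule amounts to a guard on every failure transition. A minimal Python sketch of such a guard follows; the function name and the list-based representation of the marking are assumptions made for illustration and do not reproduce SPNP's enabling-function interface.

```python
def failure_transitions_enabled(processors_per_machine, processors_up, k):
    """Return False once the number of failed processors has reached the truncation level k.

    processors_per_machine: M_i for each machine i
    processors_up:          number of tokens in P_UP for each machine i
    """
    failed = sum(m_i - up_i for m_i, up_i in zip(processors_per_machine, processors_up))
    return failed < k


m = [2, 4, 1]          # hypothetical machine sizes M_1, M_2, M_3
max_k = sum(m)         # largest meaningful truncation level
enabled_at_start = failure_transitions_enabled(m, [2, 4, 1], k=2)
enabled_after_two = failure_transitions_enabled(m, [1, 3, 1], k=2)
```

Markings with K or more failed processors are never generated because no failure transition can fire out of a marking that already has K failures, which is what bounds the reachability graph.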
We used the heterogeneous model to evaluate not only measures associated with standard system
availability, but also measures of system reliability and task completion. In the system reliability class
of measures, we evaluated measures like frequency of failures and frequency of disruptive outages. A
disruptive outage is defined as any outage that exceeds the specified tolerance limit of the
user. The task completion measures evaluated the probability that the application is interrupted
during its application period.
In this paper we discuss an example measure from each of the three classes of measures [3]:
• Mean Cluster Downtime D in minutes per year: This is a system availability measure and
represents the average amount of time the cluster is not operating during a one year observation period. The expression for D in terms of the steady state cluster availability, A,
is given by,

$$D = (1 - A) \times 8760 \times 60. \tag{15}$$

• Frequency of Disruptive Reconfigurations (FDR): This is a system reliability measure and
represents the mean number of reconfigurations which exceed the specified tolerance duration
during the one year observation period. We evaluate FDR as,

$$FDR = \mu_{rg} \times e^{-\mu_{rg} \times thresh} \times P_{rg} \times 8760 \times 60 \tag{16}$$

where $\mu_{rg}$ is the cluster reconfiguration (in, out or formation) rate, $thresh$ is the
specified tolerance duration on the reconfiguration times, in time units, and $P_{rg}$ is the probability that
a reconfiguration is in progress.

• Probability of Task Interruption under Pessimistic Assumption (Prob_Psm): This is a task
completion measure. It measures the probability that a task which initially finds the system
available and which needs $x$ hours for execution is interrupted by any failure in the
cluster. This is a pessimistic assumption because the system does not tolerate any interruption,
including the brief reconfiguration delays. The expression for Prob_Psm is given by:

$$
Prob\_Psm = \frac{\sum_{j \in Upstate} \left(1.0 - e^{-\sum_{k=1}^{N} \left(c_{k,j}(\lambda_{p,k} + \lambda_{i,k}) + (\lambda_{plt,k} + \lambda_{plt\_int,k})\right) x}\right) P_j}{A}
\tag{17}
$$

where for machine $k$, $c_{k,j}$ is the number of operational processors, $P_j$ is the probability of
being in an operational state, $\lambda_{p,k}$ ($\lambda_{i,k}$) is the processor permanent (intermittent) failure rate,
$\lambda_{plt,k}$ ($\lambda_{plt\_int,k}$) is the platform permanent (intermittent) failure rate for machine $k$, and $A$
is the cluster availability.
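
The three measures can be evaluated with a few lines of Python once the steady-state quantities are known. The sketch below is illustrative only: the function and key names are invented here, all inputs are assumed to be supplied in mutually consistent time units, and the division by A in `prob_psm` reflects one reading of Equation 17 (the text lists the cluster availability A among its symbols).

```python
from math import exp

MINUTES_PER_YEAR = 8760 * 60


def mean_downtime(a):
    """Equation 15: mean cluster downtime in minutes per year."""
    return (1.0 - a) * MINUTES_PER_YEAR


def freq_disruptive_reconfig(mu_rg, thresh, p_rg):
    """Equation 16: reconfigurations per year that last longer than thresh."""
    return mu_rg * exp(-mu_rg * thresh) * p_rg * MINUTES_PER_YEAR


def prob_psm(up_states, machines, x, a):
    """Equation 17 under the stated reading.

    up_states: list of (P_j, [c_1j, ..., c_Nj]) pairs for the operational states
    machines:  list of dicts with keys lam_p, lam_i, lam_plt, lam_plt_int
    x:         task execution time; a: cluster availability
    """
    total = 0.0
    for p_j, c_j in up_states:
        rate = sum(
            c * (m["lam_p"] + m["lam_i"]) + m["lam_plt"] + m["lam_plt_int"]
            for c, m in zip(c_j, machines)
        )
        total += (1.0 - exp(-rate * x)) * p_j
    return total / a


# Hypothetical single-machine example with one operational state.
machines = [{"lam_p": 1e-4, "lam_i": 2e-4, "lam_plt": 5e-5, "lam_plt_int": 5e-5}]
up_states = [(1.0, [2])]
p_interrupt = prob_psm(up_states, machines, x=10.0, a=1.0)
```

For example, an availability of 0.99999 corresponds to roughly 5.3 minutes of downtime per year under Equation 15.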
In [3], we used these three measures for a particular configuration to study the impact of
truncation. In Table 4, we present the number of states and the number of transitions of the
underlying CTMC, together with the three measures, for each truncation level.

Trunc.   No. of    No. of    Mean Cluster           Freq. of Disruptive          Prob. of Task
Level    States    Arcs      Downtime (min./yr.)    Reconfig. (threshold=10s)    Interruption (t=1000s)
1        348       948       12.91732078            9.96432050                   0.00032767
2        2088      7110      13.00257283            9.96751584                   0.00032767
3        6394      26686     13.00258549            9.96751604                   0.00032767
4        13236     66596     13.00258549            9.96751604                   0.00032767
5        20728     122746    13.00258549            9.96751604                   0.00032767

Table 4: Effect of State Truncation on Output Measures.

From these results we conclude that the state space of the SPN
model for the heterogeneous cluster can be truncated without significantly impacting the results:
all three measures are unchanged to the precision shown from truncation level 3 onward.

5 Conclusion
We started the paper by briefly discussing various non-state space and state space availability and
performance modeling approaches. Using the problem of deciding the optimal number of processors
in an n-component parallel multiprocessor system, we showed the limitations of a pure availability or
performance model, and emphasized the need for a composite availability and performance model.
Finally, we took a case study from a corporate environment and demonstrated an application of the
techniques in a real situation. Several approximations and assumptions were made and validated
before use, in order to deal with the size and complexity of the models encountered.


References
[1] O. Ibe, A. Sathaye, R. Howe and K. S. Trivedi, "Stochastic Petri Net Modeling of VAXcluster
System Availability", Proc. Third International Workshop on Petri Nets and Performance
Models (PNPM89), pp. 112-121, Kyoto, Japan, 1989.
[2] K. S. Trivedi, A. Sathaye, O. Ibe, and R. Howe, "Should I Add a Processor?", Proc. 23rd
Annual Hawaii Conference on System Sciences, pp. 214-221, January 1990.
[3] A. Sathaye, K. S. Trivedi and R. Howe, "Availability Modeling of Heterogeneous VAXcluster
Systems: A Stochastic Petri Net Approach", Proc. of International Conference on Fault-Tolerant Systems, Varna, January 1990.
[4] J. Muppala, A. Sathaye, R. Howe and K. S. Trivedi, "Dependability Modeling of a Heterogeneous VAXcluster System Using Stochastic Reward Nets", Hardware and Software Fault
Tolerance in Parallel Computing Systems, D. Avresky (ed.), pp. 33-59, Ellis Horwood Ltd.,
1992.
[5] N. P. Kronenberg, H. M. Levy, W. D. Strecker, R. J. Merwood, "VAXclusters: A Closely
Coupled Distributed System", ACM Trans. Computer Systems, Vol. 4, pp. 130-146, May 1986.
[6] E. Balkovich, P. Bhabhalia, W. Dunnington, and T. Weyant, "VAXcluster Availability Modeling", Digital Technical Journal, No. 5, pp. 69-79, September 1987.
[7] R. Muntz, E. de Souza e Silva, and A. Goyal, "Bounding Availability of Repairable Computer
Systems", IEEE Trans. on Computers, Vol. 38, No. 12, pp. 1714-1723, December 1989.
[8] G. Ciardo, J. Muppala and K. S. Trivedi, "SPNP: Stochastic Petri Net Package", Proc. Third
Int. Workshop on Petri Nets and Performance Models (PNPM89), pp. 142-151, Kyoto, Japan,
1989.
[9] R. Sahner, A. Puliafito and K. S. Trivedi, Performance and Reliability Analysis of Computer
Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, Boston, 1995 (418 pages).
[10] A. Sathaye, K. Trivedi and D. Heimann, "Approximate Availability Models of the Storage
Subsystem", Technical Report, DEC, September 1988.
[11] D. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design, Digital Press,
1982.
[12] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982 (624 pages).
[13] L. Tomek and K. S. Trivedi, "Fixed Point Iteration in Availability Modeling", Informatik-Fachberichte, Vol. 283: Fehlertolerierende Rechensysteme, M. Dal Cin (ed.), pp. 229-240,
Springer-Verlag, Berlin, 1991.
[14] D. R. Avresky, Hardware and Software Fault Tolerance in Parallel Computing Systems, Ellis
Horwood Ltd., New York, 1992.
[15] H. Sun, X. Zang and K. S. Trivedi, "A BDD-based Algorithm for Reliability Analysis of
Phased-Mission Systems", IEEE Transactions on Reliability, Vol. 48, No. 1, pp. 50-60, March
1999.
[16] O. C. Ibe, R. C. Howe and K. S. Trivedi, "Approximate Availability Analysis of VAXcluster
Systems", IEEE Transactions on Reliability, Vol. 38, No. 1, pp. 146-152, April 1989.
[17] T. Luo and K. S. Trivedi, "An improved algorithm for coherent-system reliability", IEEE
Transactions on Reliability, Vol. 47, No. 1, pp. 73-78, 1998.
[18] J. Muppala and K. S. Trivedi, "Numerical transient solution of finite Markovian queueing
systems", Queueing and related models, U. N. Bhat and I. V. Basawa (eds.), pp. 262-284,
Oxford University Press, 1992.
[19] Introduction to VAXcluster Application Design, Digital Equipment Corporation, 1984.
[20] A. Satyanarayana and A. Prabhakar, "New topological formula and rapid algorithm for reliability analysis of complex networks", IEEE Transactions on Reliability, Vol. 27, pp. 82-100,
1978.
[21] S. A. Doyle and J. B. Dugan, "Dependability assessment using binary decision diagrams",
Proc. 25th Intl. Symposium on Fault Tolerant Computing, pp. 249-258, 1995.
[22] S. A. Doyle, J. B. Dugan and M. Boyd, "Combinatorial models and coverage: a binary decision
diagram (BDD) approach", Proc. Annual Reliability and Maintainability Symposium, pp. 82-89, 1995.