Article info
Article history: Received 16 February 2010. Received in revised form 11 March 2012. Accepted 14 March 2012. Available online 18 April 2012.
Abstract
Grid computing has become relevant due to its applications to large-scale resource sharing, wide-area
information transfer, and multi-institutional collaboration. In general, in grid computing a service
requests the use of a set of resources, available in a grid, to complete certain tasks. Although analysis
tools and techniques for these types of systems have been studied, grid reliability analysis is generally
computation-intensive due to the complexity of the system. Moreover, conventional
reliability models rely on common assumptions that cannot be applied to grid systems.
Therefore, new analytical methods are needed for effective and accurate assessment of grid reliability.
This study presents a new method for estimating grid service reliability which, unlike previous studies,
does not require prior knowledge about the grid system structure. Moreover, the proposed
method does not rely on any assumptions about the link and node failure rates. The approach uses
a data-mining algorithm, K2, to discover the grid system structure from raw historical system
data, which makes it possible to find minimum resource spanning trees (MRST) within the grid. It then uses Bayesian
networks (BN) to model the MRST and estimate grid service reliability.
© 2012 Elsevier Ltd. All rights reserved.
Keywords:
Grid systems
Bayesian networks
Reliability
Grid service
Minimum resource spanning tree
1. Introduction
Grid computing has become relevant due to its applications to
large-scale resource sharing, wide-area information transfer, and
multi-institutional collaboration. In general, in grid computing
services request a set of resources, available in a grid, to complete
certain tasks. Many experts believe that grid technologies will
offer a chance to extend the benefits of the Internet [1]. However,
it is difficult to analyze grid reliability due to its highly
heterogeneous and distributed characteristics. Because grid
systems involve cross-organizational sharing, they support existing distributed computing technologies. As an example, enterprise-level distributed computing systems can use grid
technologies to achieve resource sharing across their different
institutions. Although several development tools and techniques
for grid systems have been studied, estimating grid reliability
is not straightforward due to the size and complexity of the
grid [2]. Therefore, new analytical methods are needed to evaluate
grid reliability.
O. Doguc, J. Emmanuel Ramirez-Marquez / Reliability Engineering and System Safety 104 (2012) 96-105
Nomenclature
Gi
Si
Ri
u
T
Pi
always open to unintentional mistakes that could cause discrepancies in the results [17].
To address these issues, this paper introduces a methodology
for estimating grid system reliability by combining techniques
such as BN construction from raw component and system data,
association rule mining and evaluation of conditional probabilities. Based on an extensive literature review, this is the first
study that combines these methods for estimating grid system
reliability. With the increasing popularity of computer environments in systems engineering, grid systems have been widely
used in various system-related applications. Understanding the
grid system structure and the component relationships is essential for systems engineers who need to allocate resources optimally and
improve system reliability. This study provides a methodology for automated discovery of component relationships and
estimation of the reliability of grid services to help systems
engineers.
The methodology suggested in this paper automates the
process of spanning tree discovery and BN construction by using
the K2 algorithm (a commonly used association rule mining
algorithm) that identifies the associations among the grid system
components by using a predefined scoring function and a heuristic. According to the proposed method, once the BN is efficiently
and accurately constructed, the reliabilities of grid services are
estimated with the help of Bayes' rule. Unlike previous studies,
the methodology proposed in this paper does not rely on any
assumptions about the component failure rates in grid systems.
Moreover, the proposed method does not require prior knowledge
about the grid system structure.
2. Background information
This section provides background information about the grid
systems, BN and the K2 algorithm. Earlier studies on estimating
grid system reliability are also discussed in this section.
2.1. Grid systems
The term "grid" was first used in the 1990s to represent distributed
computing infrastructures for advanced science and engineering [3].
The grid concept was first developed to enable
resource sharing within geographically diverse scientific organizations. The main problem underlying the concept of grid
systems is coordinated resource sharing and problem solving in
dynamic, multi-institutional organizations [1]. Unlike
typical distributed systems, computational grid systems
require large-scale sharing of resources on different types of
components. A service request in a grid system involves a set of
nodes and links, through which the service can be provided. In a
grid system, the Resource Managers (RM) control and share
resources, while the Root Nodes (RN) request service from the RM
(an RN may also share resources). Also, Dai and Wang [7] showed
that the links and nodes in each grid service form a spanning tree.
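The spanning-tree property mentioned above can be checked mechanically. The following is a minimal sketch, not part of the original method: it tests whether a set of links connects a set of nodes without cycles. The function name `is_spanning_tree` and the example topology are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def is_spanning_tree(nodes, links):
    """Return True if `links` connect all `nodes` and contain no cycle.

    `nodes` is an iterable of node names; `links` is a list of (a, b) pairs
    treated as undirected edges. Hypothetical helper for illustration only.
    """
    nodes = set(nodes)
    # A tree over n nodes has exactly n - 1 edges.
    if len(links) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in links:
        if a not in nodes or b not in nodes:
            return False
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first search from an arbitrary node; with n - 1 edges,
    # reaching every node implies there is no cycle.
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


if __name__ == "__main__":
    # Hypothetical topology: a root node G1 reaching four other components.
    example_nodes = ["G1", "G2", "G3", "G7", "G9"]
    example_links = [("G1", "G2"), ("G1", "G3"), ("G3", "G7"), ("G7", "G9")]
    print(is_spanning_tree(example_nodes, example_links))
```

The edge-count test followed by a connectivity test is one common way to verify the tree property; other formulations (for example union-find cycle detection) would serve equally well.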
ti
f
m
Table 1
Example historical dataset.
Observation  G1  G2  G3  G4  G5  G6  G7  G8  G9
1            0   0   1   1   0   1   0   1   0
2            1   0   0   0   0   0   1   1   1
3            1   0   0   1   1   0   0   1   1
4            1   1   1   1   1   0   1   0   0
5            1   1   1   1   1   0   0   1   1
6            0   1   0   0   1   1   1   0   0
7            0   0   0   0   0   1   1   0   1
8            1   1   1   1   1   1   0   0   1
9            0   1   0   1   0   1   0   1   0
10           1   0   0   1   1   0   0   1   0
Table 2
Example historical dataset.
Observation  G1  G2  G3  G4  G5  MRST Behavior
1            0   0   1   1   0   0
2            1   0   0   0   0   0
3            1   0   0   1   1   0
4            1   1   1   1   1   1
5            1   1   1   1   1   1
6            0   1   0   0   1   0
7            0   0   0   0   0   0
8            1   1   1   1   1   1
9            0   1   0   1   0   0
10           1   0   0   1   1   0
Table 3
f scores for all possible candidate parent sets for G3.
Parent set   f score
∅            1/1320
{G1}         1/2800
{G2}         1/1800
{G1, G2}     1/640
Skipping the details, the f scores of the candidate parent sets for the
G3 component are given in Table 3. Because the K2 algorithm
iterates over the components according to their ordering in the dataset,
components G4 and G5 are not considered as candidate
parents for G3. In this step, the K2 algorithm selects the set {G1, G2}
as the parent set of G3, because it has the highest f score. The number
of computations grows with the position of the component in the ordering, and when
the K2 algorithm finishes processing the last column (MRST Behavior
in Table 2), it outputs the BN structure displayed in Fig. 3.
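To make the scoring step concrete, the sketch below computes an f score for a candidate parent set from the G1, G2 and G3 columns of Table 2. It assumes that the f score is the Cooper and Herskovits metric [27], f = prod_j [(r - 1)! / (N_j + r - 1)!] prod_k N_jk!, with binary variables (r = 2). The function name `k2_score` and the data layout are illustrative choices, not the authors' implementation. Under this assumption the sketch returns 1/2800 for {G1} and 1/1800 for {G2}, in agreement with Table 3, and ranks {G1, G2} highest; the remaining values depend on the exact table entries and are not asserted here.

```python
from fractions import Fraction
from itertools import product
from math import factorial

# Columns G1, G2 and G3 of Table 2 (observations 1 to 10).
DATA = {
    "G1": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "G2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}


def k2_score(child, parents, data, r=2):
    """Cooper-Herskovits score of `child` for a candidate parent set.

    Assumes every variable takes integer values 0..r-1 (binary by default).
    """
    n_obs = len(data[child])
    score = Fraction(1)
    # One factor per instantiation of the parent set; with no parents
    # there is a single (empty) instantiation covering all observations.
    for inst in product(range(r), repeat=len(parents)):
        rows = [
            i for i in range(n_obs)
            if all(data[p][i] == v for p, v in zip(parents, inst))
        ]
        # N_jk: how often the child takes each value under this instantiation.
        counts = [sum(1 for i in rows if data[child][i] == k) for k in range(r)]
        term = Fraction(factorial(r - 1), factorial(len(rows) + r - 1))
        for count in counts:
            term *= factorial(count)
        score *= term
    return score


if __name__ == "__main__":
    for parent_set in ([], ["G1"], ["G2"], ["G1", "G2"]):
        print(parent_set, k2_score("G3", parent_set, DATA))
```

Exact rational arithmetic (`Fraction`) is used here so that the scores can be compared with the fractional values in Table 3; a practical implementation would more likely work with log-scores to avoid underflow on larger datasets.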
The next step of the proposed method is estimating the grid
service reliability using the BN that was constructed by the K2
algorithm. In addition to the associations discovered in the
previous step, the inference rules described in Section 2.2 are
used to calculate the conditional probabilities. These
probabilities are stored in conditional probability tables (CPT), and each component with a non-empty parent set in the BN is associated with a
CPT. Components with no parents are independent of the others and are
associated with prior probabilities, as explained in Section 2.2.
The probability values in the CPT are calculated from the
raw data in Table 2; each entry is the probability of a component state given an
instantiation of its parent set. For example, the probability of
G3 being 0 given the parent instantiation G1 = 0 and G2 = 0 is
0.5: in two of the ten observations the parents are instantiated as 0
and 0, and in one of these two cases G3 is instantiated as 0. In the next
step, with the help of the CPT and the prior probabilities of G1 and
G2, the success probability of G3 can be calculated.
According to the BN structure (in Fig. 3), components G1 and G2 are
independent of the others; therefore their success probabilities can be
directly inferred from the observations in Table 2. From
Table 2, p(G1 = 1) = 0.6 and p(G2 = 1) = 0.5.
Evaluating the other components in the BN in the same way gives the success
probabilities for the rest of the components in the sample MRST,
such as p(G4 = 1) = 0.6 and p(G5 = 1) = 0.75.
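The CPT calculation described above can be sketched as follows. The code estimates priors and CPT entries as relative frequencies from the G1, G2 and G3 columns of Table 2 and then combines them into a success probability for G3, assuming (as stated for Fig. 3) that G1 and G2 are independent root nodes. The function names `prior`, `cpt` and `marginal_success` are hypothetical; the sketch covers only this one node and is not the authors' implementation.

```python
from itertools import product

# Columns G1, G2 and G3 of Table 2 (observations 1 to 10).
DATA = {
    "G1": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1],
    "G2": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "G3": [1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
}


def prior(var, data):
    """Relative frequency of var = 1 in the observations."""
    return sum(data[var]) / len(data[var])


def cpt(child, parents, data):
    """Map each observed parent instantiation to P(child = 1 | parents)."""
    table = {}
    n_obs = len(data[child])
    for inst in product((0, 1), repeat=len(parents)):
        rows = [
            i for i in range(n_obs)
            if all(data[p][i] == v for p, v in zip(parents, inst))
        ]
        if not rows:
            continue  # instantiation never observed; no estimate available
        table[inst] = sum(data[child][i] for i in rows) / len(rows)
    return table


def marginal_success(child, parents, data):
    """P(child = 1), summing over parent states; assumes independent parents."""
    total = 0.0
    for inst, p_one in cpt(child, parents, data).items():
        weight = 1.0
        for parent, value in zip(parents, inst):
            p_parent = prior(parent, data)
            weight *= p_parent if value == 1 else 1.0 - p_parent
        total += p_one * weight
    return total


if __name__ == "__main__":
    table = cpt("G3", ["G1", "G2"], DATA)
    # P(G3 = 0 | G1 = 0, G2 = 0), the entry worked out in the text.
    print(1.0 - table[(0, 0)])
    print(prior("G1", DATA), prior("G2", DATA))
    print(marginal_success("G3", ["G1", "G2"], DATA))
```

With only ten observations some parent instantiations may never occur; the sketch simply skips them, whereas a fuller implementation would need a smoothing or default rule.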
In the last step, the MRST reliability is calculated by using
these probability values and the CPT of the MRST Behavior node in
the BN structure given in Fig. 3. The success probability of the
MRST Behavior node is calculated as 0.35, or 35%, which is the
reliability of the MRST used in this section. Note
that this reliability value is based on only 10 observations of the sample system. With more observations available,
the K2 algorithm could provide more accurate results on the
degrees of association between the system components and
more precise values in the CPT of the nodes, which would
increase the accuracy of the calculated service reliability.
4. Experimental analysis
This section provides an experimental analysis of the proposed
method for grid service reliability estimation. For the experimental
analysis, the proposed method was implemented in Matlab 8 on
a computer equipped with an Intel Core 2 Duo 2.1 GHz CPU and 2 GB of
RAM, running the 32-bit Windows Vista Business
operating system.
Table 4
List of resources shared by each component.
Component  Shared resources
G1         R1, R3, R4,
G2         R2, R3, R5,
G3         R1, R3, R7,
G4         R4, R5, R6,
G5         R1, R3, R4,
G6         R4, R6, R7,
G7         R2, R4, R7,
G8         R4, R5, R7,
G9         R6, R7, R8,
G10        R3, R6, R8,
G11        R1, R6, R9,
G12        R2, R4, R7,
G13        R5, R6, R7,
Table 5
List of grid services and required resources.
Grid service  Requestor RN  Required resources
S1            G2            R2, R5, R7, R8,
S2            G2            R2, R3, R4, R7,
S3            G3            R1, R3, R6, R8,
S4            G3            R4, R7, R8, R9,
S5            G3            R2, R4, R6, R7,
S6            G6            R3, R4, R6, R7,
S7            G6            R1, R2, R5, R6,
S8            G9            R1, R2, R3, R4,
S9            G9            R1, R2, R4, R5,
Table 6
Reliabilities of components in the grid system.
Component  Reliability
G1         0.99
G2         0.91
G3         0.95
G4         0.93
G5         0.99
G6         0.94
G7         0.95
G8         0.98
G9         0.98
G10        0.99
G11        0.91
G12        0.94
G13        0.91
Fig. 10. Performance of the proposed MRST discovery method on the performance of the MRST discovery step.
Table 7
Statistical details of the experimental results for MRST discovery times.
Case #  Number of MRST  Avg. MRST size  Avg.    Min.    Max.    Std. Dev.  Dai and Wang
1       3               4.33            0.9     0.82    1.03    0.105987   1.12
2       3               5.33            2.46    2.32    2.71    0.197569   3.35
3       4               5               5.98    5.76    6.33    0.28746    6.54
4       5               5.4             10.35   10.25   10.54   0.147309   12.67
5       6               5.16            21.34   21.12   21.77   0.330606   24.02
6       7               5.14            44.65   44.25   45.06   0.40501    57.9
7       8               5.375           89.98   89.28   90.7    0.710023   112.83
8       9               5.55            172.23  171.57  173.02  0.725971   259.49
9       10              6               359.03  358.29  360.32  1.027343   582.65
Table 8
Experimental results for the performance of BN construction and reliability estimation.
Case #  Number of MRST  MRST size  Avg. BN construction time  Avg. reliability estimation time
1       3               4.33       4.24                       0.77
2       3               5.33       8.12                       1.44
3       4               5          14.51                      2.29
4       5               5.4        21.44                      3.76
5       6               5.16       29.54                      5.05
6       7               5.14       37.91                      7.11
7       8               5.375      44.04                      8.59
8       9               5.55       50.12                      9.91
9       10              6          58.47                      11.54
Table 9
Reliability estimation results for the case grid services.
Case #  Average  Minimum  Maximum  Std. Dev.
1       0.9757   0.9798   0.9832   0.1909
2       0.963    0.9588   0.979    0.1528
3       0.9557   0.9364   0.9754   0.1545
4       0.9575   0.9381   0.9905   0.1920
5       0.9786   0.9638   0.983    0.1761
6       0.9792   0.9719   0.9974   0.1203
7       0.9795   0.9672   0.9946   0.1750
8       0.9780   0.9683   0.9871   0.1787
9       0.9817   0.9662   0.9942   0.1721
Fig. 12. Example grid system used in Dai and Wang's study [7].
Table 10
Failure rates (per second) of the components.
Component  G1  G2  G3  G4  G5  G6  G7  G8  G9
λi
Table 11
Comparison of the experimental results with Dai and Wang's results.
Average  Minimum  Maximum  Std. Dev.
0.96034  0.9428   0.9696   0.005752
0.96108  0.9538   0.9714   0.01873
Table 12
Running times of MRST discovery, BN construction and reliability estimation.
                        Average  Minimum  Maximum  Std. Dev.
MRST discovery          23.77    19.45    25.78    0.405
BN construction         34.05    30.84    38.9     1.352
Reliability estimation  6.88     5.61     8.19     0.535
In their example, they assumed that the failure rates are statistically distributed within the λ intervals shown in Table 10.
Using this setup, Dai and Wang estimated the reliability of the
grid service S, where the resources R1, R2, R3 and R4 are requested by
the RN G1. The MRST for this service involves 5 components: G1, G2,
G3, G7 and G9. In order to estimate the reliability of the grid service S,
100 historical datasets, each containing 100 observations, were generated
using the failure rate assumptions in Table 10. The implementation
of the proposed method was run 100 times, each time with a different historical
dataset. The average, minimum and maximum reliability values of
the experimental results (out of 100 experiments) and their comparison
with Dai and Wang's results are provided in Table 11.
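The repeated-experiment setup described above can be sketched as a small Monte Carlo loop. Everything specific in the sketch is an assumption made for illustration: the per-observation success probabilities in `SUCCESS_PROB` are hypothetical placeholders (the failure rates of Table 10 are not reproduced here), and `estimate_reliability` is a simple stand-in that counts observations in which all five MRST components work, whereas the proposed method estimates reliability with a BN learned by K2.

```python
import random
import statistics

# Hypothetical success probabilities for the five MRST components;
# these are placeholders, not the failure rates of Table 10.
SUCCESS_PROB = {"G1": 0.99, "G2": 0.98, "G3": 0.99, "G7": 0.97, "G9": 0.98}


def generate_dataset(rng, n_obs=100):
    """Return n_obs observations, each mapping component name to a 0/1 state."""
    return [
        {name: int(rng.random() < p) for name, p in SUCCESS_PROB.items()}
        for _ in range(n_obs)
    ]


def estimate_reliability(dataset):
    """Stand-in estimator: fraction of observations with every component up.

    Assumes the service succeeds only when all MRST components work; the
    BN-based estimator of the proposed method is not reproduced here.
    """
    working = sum(1 for obs in dataset if all(obs.values()))
    return working / len(dataset)


def run_experiments(n_runs=100, n_obs=100, seed=0):
    """Run the estimator on n_runs independent datasets and summarize."""
    rng = random.Random(seed)
    results = [
        estimate_reliability(generate_dataset(rng, n_obs))
        for _ in range(n_runs)
    ]
    return {
        "average": statistics.mean(results),
        "minimum": min(results),
        "maximum": max(results),
        "std_dev": statistics.stdev(results),
    }


if __name__ == "__main__":
    for name, value in run_experiments().items():
        print(name, round(value, 4))
```

The four summary statistics mirror the columns reported for the 100 experiments; fixing the seed only makes the sketch reproducible and has no counterpart in the original study.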
As can be observed from Table 11, the experimental results
are very close (less than 0.1% difference) to Dai and Wang's
results. Moreover, the low standard deviation values show that
the results are statistically significant. Table 12 shows the MRST
discovery, BN construction and reliability estimation times
(in seconds) for the example grid system shown in Fig. 12.
5. Conclusions
Grid systems are newly developed concepts for large-scale
distributed systems. In a grid system, there can be various nodes
References
[1] Foster I, Kesselman C, Tuecke S. The anatomy of the grid: enabling scalable
virtual organizations. International Journal of Supercomputer Applications
2001;15(3).
[2] Frey J, Tannenbaum T, Foster I, Livny M, Tuecke S. Condor-G: a computation
management agent for multi-institutional grids. Cluster Computing
2002;5(3):237-246.
[3] Foster I, Kesselman C. In: Computational grids. VECPAR, 1998; Morgan
Kaufmann, 1998. p. 15-52.
[4] Buyya R, Date S, Mizuno-Matsumoto Y, Venugopal S, Abramson D. Economic
and on demand brain activity analysis on global grids. Computing Research
Repository 2003.
[5] Dai YS, Levitin G. Optimal resource allocation for maximizing performance
and reliability in tree-structured grid services. IEEE Transactions on Reliability 2007;56(3):444-453.
[6] Dai YS, Pan Y, Zou X. A hierarchical modeling and analysis for grid service
reliability. IEEE Transactions on Computers 2007;56:681-691.
[7] Dai YS, Wang X. Optimal resource allocation on grid systems for maximizing
service reliability using a genetic algorithm. Reliability Engineering & System
Safety 2006;91(9):1071-1082.
[8] Dai YS, Xie M, Poh KL. Reliability of grid service systems. Computers and
Industrial Engineering 2006;50:130-147.
[9] Amasaki S, Takagi Y, Mizuno O, Kikuno T. In: A Bayesian belief network for
assessing the likelihood of fault content. 14th international symposium on
software reliability engineering, 2003, p. 125.
[10] Boudali H, Dugan JB. A continuous-time Bayesian network reliability modeling,
and analysis framework. IEEE Transactions on Reliability 2006;55(1):86-97.
[11] Gran BA, Helminen A. A Bayesian belief network for reliability assessment.
Safecomp 2001 2187, 2001, p. 35-45.
[12] Doguc O, Ramirez-Marquez JE. A generic method for estimating system
reliability using Bayesian networks. Reliability Engineering and System
Safety 2009;94(2):542-550.
[13] Sigurdsson JH, Walls LA, Quigley JL. Bayesian belief nets for managing expert
judgment and modeling reliability. Quality and Reliability Engineering
International 2001;17:181-190.
[14] Hugin Expert. http://www.hugin.dk.
[15] Gran BA, Dahll G, Eisinger S, Lund EJ, Norstrøm JG, Strocka P, Ystanes BJ.
In: Estimating dependability of programmable systems using BBNs. Safecomp 2000, 2000; Springer 2000, p. 309-320.
[16] Langseth H, Portinale L. Bayesian networks in reliability. Reliability Engineering and System Safety 2007;92(1):92-108.
[17] Inamura T, Inaba M, Inoue H. In: User adaptation of human-robot interaction
model based on Bayesian network and introspection of interaction experience.
IEEE/RSJ international conference on intelligent robots and systems. 2000. p.
2139-2144.
[19] Levitin G, Dai YS. Performance and reliability of a star topology grid service
with data dependency and two types of failure. IIE Transactions 2007;39(8):
783.
[20] Dai YS, Levitin G. Optimal resource allocation for maximizing performance
and reliability in tree-structured grid services. IEEE Transactions on Reliability 2007.
[21] Barlow RE. Using influence diagrams. Accelerated life testing and experts'
opinions in reliability 1988:145-150.
[22] Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic networks
and expert systems. New York, NY: Springer-Verlag; 1999.
[23] Jensen FV. Bayesian networks and decision graphs. New York, NY: Springer-Verlag; 2001.
[24] Pearl J. Probabilistic reasoning in intelligent systems. San Francisco,
CA: Morgan Kaufmann; 1988.
[25] Fenton N, Krause P, Neil M. Software measurement: uncertainty and causal
modeling. IEEE Software 2002;10(4):116-122.
[26] Bobbio A, Portinale L, Minichino M, Ciancamerla E. Improving the analysis of
dependable systems by mapping fault trees into Bayesian networks. Reliability Engineering and System Safety 2001;71(3):249-260.
[27] Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic
networks from data. Machine Learning 1992;9(4):309-347.
[28] Doguc O, Ramirez-Marquez JE. In: A Bayesian approach for estimating grid
system reliability. International conference on grid computing and applications. Las Vegas, NV; July 14-17, 2008.
[29] Chen DJ, Chen RS, Huang TH. A heuristic approach to generating file spanning
trees for reliability analysis of distributed computing systems. Computers
and Mathematics with Applications 1997;34:115-131.
[30] Dai YS, Xie M, Poh KL, Liu GQ. A study of service reliability and availability for
distributed systems. Reliability Engineering and System Safety 2003;79:
103-112.
[31] Kumar A, Agrawal DP. A generalized algorithm for evaluating distributed-program reliability. IEEE Transactions on Reliability 1993;42:416-424.